In the recent past I have done both voice-overs and text-to-speech (TTS) generation for some training videos. After mentioning this to several people I was repeatedly asked - what do you think about AI TTS, is it any good? So here are my thoughts. But first you may want to have a listen. As a little demonstration I created a sound story using Edgar Allan Poe's "The Cask of Amontillado". I thought this would show the strengths and weaknesses of the current state of TTS. All of the voices in this story are AI generated.
AI Text to Speech. Is it any good?
So what do I think about AI-TTS? In short: AI-TTS has improved in leaps and bounds lately and is very usable in certain situations, but it still has some issues to overcome before it can be a full replacement for human voiceovers. Let me explain.
When would I use AI TTS and what are its advantages
There are some very distinct advantages to using AI-generated voiceovers:
Cost
Most professional AI speech generation sites or systems do charge, but they are still way cheaper than a professional voice-over. The comparison may be closer if you are doing the voice-over yourself, but if you cost out the equipment and the time for recording, editing and processing it yourself against the cost of the AI and the time taken (usually less), the AI is still probably cheaper.
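To make that comparison concrete, here is a rough sketch of the sums in Python. The function names and every figure in it are made-up placeholders, not real prices, so swap in your own numbers.

```python
def diy_voiceover_cost(hours, hourly_rate, equipment_cost, projects_sharing_equipment):
    """Cost of recording it yourself: your time plus a share of the equipment."""
    return hours * hourly_rate + equipment_cost / projects_sharing_equipment


def ai_tts_cost(hours, hourly_rate, service_fee):
    """Cost of generating the speech: your (usually shorter) time plus the service fee."""
    return hours * hourly_rate + service_fee


# Placeholder figures only (assumptions for illustration, not real prices).
diy = diy_voiceover_cost(hours=6, hourly_rate=40, equipment_cost=500,
                         projects_sharing_equipment=10)
ai = ai_tts_cost(hours=2, hourly_rate=40, service_fee=20)

print(f"Record it yourself: {diy:.2f}")
print(f"AI TTS:             {ai:.2f}")
```

The point is only that your own time is usually the biggest line item, so the option that takes fewer hours tends to win even after paying for the service.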
Time
It will usually be much quicker to generate speech than record it, especially if you write the script first anyway.
Editability
This is the big one. You can go back and re-edit and regenerate the speech at any time and it will sound fine. Yes, you can always re-record, but if you only want to redo part of it, it can be surprisingly difficult to get the same tone. The human ear can pick up amazingly small differences. You may well say that you can tell it's a fake voice. Yes, you can. Here's the question - does it matter? My take is that consistency is more important: if it's fake but consistent, we care less than if it's real but the tone jumps all over the place. We find changes distracting.
And this editability is not just for one project. You can come back to it weeks, months or even years later and re-edit, and it will still sound the same. That is difficult with a real recording. This means, for example, that if you have a new version of software you can go back and re-edit the original training material and create a new version of it, instead of having to go back, rewrite and re-record it all.
When would I use a voiceover
So if AI-TTS has all these advantages, why would I use a real voice-over artist or just record myself? This comes down to what I see as the greatest issue still with AI-TTS - context, or more importantly the lack of contextual awareness. As mentioned, I believe there have been great strides in making the AI-generated voice sound more natural, and it is getting harder to tell the difference between the AI voice and a real one, but AI voices still lack context. When a story gets exciting, sad, fast-paced or tragic, the voice stays the same. You can hear it at the end of my story: the character is being walled in and is about to be left there to die, but the voice sounds flat and unemotional.
There are of course controls within most AI systems to adjust the tone of the voice, but these usually need to be set manually and in my experience have limited effect. I find it's random: you may get the right tone if you regenerate the phrase enough times, but the AI doesn't pick up the context of the story for you.
Processing AI-TTS
One thing I discovered is that traditional audio editing and processing can help AI-generated voices sound a lot better. One issue with AI voices is the timing, which the AI doesn't always seem to get right. So a little bit of old-fashioned dialogue editing can go a long way to improving the overall flow and naturalness of the dialogue. The other issue is that the actual audio quality of the voice isn't always the greatest. Well-used audio techniques like EQ, compression and saturation really help the overall sound of the audio.
So even if you do use AI generated voices you may want to consider some basic audio processing to help it along the way.
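For the curious, here is a very rough sketch in plain Python of what two of those processes, compression and saturation, do to the audio samples. It is a toy illustration and not what I actually ran: the function names and the threshold, ratio and drive values are assumptions, the compressor works sample by sample with no attack or release, and EQ is left out. In practice you would use the plugins in your audio editor.

```python
import math


def compress(samples, threshold=0.5, ratio=4.0):
    """Turn down the part of each sample that goes over the threshold."""
    out = []
    for s in samples:
        level = abs(s)
        if level > threshold:
            # Only the portion above the threshold is reduced by the ratio.
            level = threshold + (level - threshold) / ratio
        out.append(math.copysign(level, s))
    return out


def saturate(samples, drive=2.0):
    """Soft-clip with tanh, scaled so a full-scale sample (1.0) stays at 1.0."""
    return [math.tanh(drive * s) / math.tanh(drive) for s in samples]


# Samples are floats between -1.0 and 1.0 (made-up values for illustration).
voice = [0.0, 0.25, 0.9, -0.9]
processed = saturate(compress(voice))
print(processed)
```

Compression evens out the loud and quiet parts, and saturation adds a little warmth by gently rounding off the peaks.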
Conclusion
Yes, use AI-TTS. It's cheaper and editable, but use it only for voiceovers where emotion and variation based on the context of the text are not required, like training videos, technical demonstrations or audio versions of technical or business documents (great to offer for the visually impaired on your website). But where true emotion and timing are required, as in stories, poems and advertisements, or anywhere you are trying to make a true emotional connection - stick with the voice-over artist.