A Guide to Synthetic Voices, AI and Human Voice for Your Brand

Fueled by the voice tech boom, everybody who’s anybody is building out a content audio strategy for their brand.

Voice search and voice assistant technology using synthetic, artificial intelligence and human voices are becoming increasingly popular, and the need for brands to have a voice representing them on these growing audio-based mediums is expanding. Brands are contemplating whether they should use a synthetic voice in their communication strategy.

The voice you select to represent your brand will have an impact on whether and how customers trust you. You’ve already spent a ton of time considering how your brand can be authentic and trustworthy, but now it has a literal voice that people will be interacting with on Amazon Echo, Google Home, the latest Apple HomePod and HomePod Mini, not to mention other voice-powered technology. We’re aiming to educate and inform you on this fast-changing voice technology landscape, so you, and your brand, don’t get left in the dust.

In this article, we will highlight what synthetic voices, artificial intelligence (AI) voices and human voices offer, and outline the pros and cons that the three voice options offer to your brand.

What is Synthetic Voice?

A synthetic voice is an artificially produced version of human speech.

Speech synthesis is just another form of information output where a computer reads words to you out loud in a real or simulated voice, played through the device’s speaker; this is often called text-to-speech (TTS).

How is a Synthetic Voice Produced?

Say you have a paragraph of written text that you want your computer to speak aloud. How does it turn those typed-out words into ones you hear?

A synthetic voice is created in three stages:

  1. Text to words: Pre-processing or normalization is done to reduce ambiguity as the computer narrows down how the piece of text is read.
  2. Words to phonemes: The speech synthesizer has to generate the speech sounds that make up those words. In the most straightforward explanation possible, the computer has a dictionary that maps words and certain groups of letters to the sounds used to pronounce them (phonemes), and it uses that dictionary to read the words.
  3. Phonemes to sound: The sequence of written words has now been converted into a sequence of sounds that need speaking. The computer can take a few different approaches. It can use recordings of humans saying the phonemes (concatenative), it can reference basic sound frequencies to generate the sounds itself (formant) or it can mimic the mechanisms of the human voice (articulatory).
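The three stages above can be sketched in a few lines of Python. This is only a toy illustration, not how any particular product works: the tiny pronunciation dictionary, the abbreviation and number tables, the function names and the `.wav` clip names are all hypothetical, and the last stage simply lists which recorded clips a concatenative system would stitch together rather than producing audio.

```python
import re

# Hypothetical, tiny pronunciation dictionary (ARPAbet-like symbols, illustrative only).
PRONUNCIATIONS = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "smith": ["S", "M", "IH", "TH"],
    "has": ["HH", "AE", "Z"],
    "two": ["T", "UW"],
    "cats": ["K", "AE", "T", "S"],
}

# Hypothetical normalization tables for stage 1.
ABBREVIATIONS = {"dr.": "doctor"}
NUMBERS = {"2": "two"}


def text_to_words(text):
    """Stage 1: normalize raw text into plain lowercase words."""
    words = []
    for token in text.lower().split():
        token = ABBREVIATIONS.get(token, token)   # expand known abbreviations
        token = re.sub(r"[^a-z0-9]", "", token)   # strip punctuation
        token = NUMBERS.get(token, token)         # spell out known numbers
        if token:
            words.append(token)
    return words


def words_to_phonemes(words):
    """Stage 2: look up the phonemes for each word (unknown words raise KeyError)."""
    phonemes = []
    for word in words:
        phonemes.extend(PRONUNCIATIONS[word])
    return phonemes


def phonemes_to_sound(phonemes):
    """Stage 3, concatenative style: name the recorded clip to play for each phoneme."""
    return [f"{p.lower()}.wav" for p in phonemes]


if __name__ == "__main__":
    words = text_to_words("Dr. Smith has 2 cats.")
    print(words)
    print(phonemes_to_sound(words_to_phonemes(words)))
```

A real synthesizer handles far more ambiguity at each stage (for example, deciding whether "Dr." means "doctor" or "drive"), which is why the pre-processing step matters so much.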

Once the synthetic voice is produced, it can be implemented in software or hardware products like Google Home, Amazon Echo, your tablet, smartphone, GPS, ebook reader, etc.

Synthetic Voice Pros:

  • Cheap. Speech synthesizers are a dime a dozen these days, so most are free. Just type ‘speech synthesizer’ into any search engine, and have your pick of whatever online text-to-speech tool you want to use.
  • Fast. You can literally plop in your script or text, hit enter and the computer will repeat your lines. Boom! There’s your robotic voice actor.

Synthetic Voice Cons:

  • Unrealistic. Every single one of these speech synthesizers sounds a lot like a robot. Sure, there are a few that sound less like a robot, but this voice is not typically a good fit when it comes to most companies’ brand voice.
  • Unoriginal. Chances are, thousands of other people are using one of these free or relatively inexpensive speech synthesizers. That means other people have heard this same, robotic voice speaking before.

What is AI Voice?

Artificial intelligence or AI voice is a type of synthetic voice, but it operates a little differently. Where it differs is that AI voice uses ‘deep learning,’ which is a type of artificial intelligence, to turn text into audible human-sounding speech.

While a lot of robotic-sounding text-to-speech synthesizers use task-based algorithms, deep learning allows AI voice companies to use machine learning methods, based on learning data representations, to create audio like this:

Those were three artificial voices made to sound like Barack Obama, Donald Trump and Hillary Clinton. Montreal-based tech company Lyrebird was able to create the imitation voices, which say phrases that none of the American politicians said, using just a few minutes of audio from speeches with background noise and reverb.

Lyrebird also claims it can recreate your voice and turn it into your digital voiceprint using a minute of sample audio that you can upload on their website.

And they did a pretty convincing job with Ashlee Vance’s voice in this Bloomberg piece.

Lyrebird does this by analyzing a recording of your voice, breaking it into pieces based on phonemes. You then type whatever you want in the website’s textbox. Their platform uses your uploaded voice model to build completely new words and phrases. Yes, that means ones that weren’t in the original recording.

Companies like Voysis are also pushing the limits. They directly process raw audio to create new and markedly more human voices in contrast to every other text-to-speech synthesizer out there.

The staggering part is that Voysis built its voice on an existing method called WaveNet, which was developed by researchers at Google’s DeepMind in 2016.

Give it a listen:

Thankfully, a company is emerging to ensure this cutting-edge voice tech is kept in check.

Pindrop is putting together the software that will protect all of these digital vocal identities created by AI voice platforms.  

The new voice ‘fingerprinting’ tech company analyzes 1,400 different acoustic attributes to validate vocal identities on voice-powered tech.

And, when it comes to editing, software has now been developed that makes editing voice as easy as editing words in a word processor. In 2020, Groupon founder Andrew Mason launched Descript. The company is applying AI to allow anyone to record, edit, mix, collaborate on, and master their audio and video.

What Brands Are Using Synthetic or AI Voice?

Burger King, Uber, Whirlpool and a few others are starting to use voice to interact with their customers. Tide, Campbell’s and Nestle are also following suit with Alexa’s Skills and Google’s Actions.

Note: If you’re wanting to build out an Alexa Skill for your brand, check out this comprehensive guide we’ve put together for you.

AI Voice Pros:

  • Control. Using a platform like Voysis or Voicery will allow you to have full control over a completely unique voice that was customized for your brand. You could have complete ownership over that voice and not worry about any other company in the world having that same voice.
  • Cost. Voicery charges $0.001 per character on their scripts. Lyrebird is currently free for users. Voysis doesn’t publicly list their pricing.
  • Instant production. As soon as you input the words, you can get an AI voice interpretation of the content at the click of a button.
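To put the per-character rate quoted above in perspective, here is a minimal sketch that estimates what a script would cost at $0.001 per character. The function name is hypothetical, and counting every character (including spaces and punctuation) is an assumption; a provider may bill differently.

```python
from decimal import Decimal

# The Voicery rate quoted above, in dollars per character.
RATE_PER_CHARACTER = Decimal("0.001")


def estimate_cost(script, rate=RATE_PER_CHARACTER):
    """Estimate the cost of a script in dollars.

    Assumption: every character, including spaces and punctuation, is billed.
    """
    return len(script) * rate


if __name__ == "__main__":
    script = "Welcome back! Your order has shipped and will arrive on Friday."
    print(f"{len(script)} characters -> ${estimate_cost(script)}")
```

At that rate, a 1,500-character script would come to roughly $1.50.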

AI Voice Cons:

  • Ethics. There are some serious ethical issues with robotic voices appearing to be humans communicating with humans. Should robots be allowed to sound like humans or should there always be a way to distinguish a robot from a human?
  • Still not life-like. While these latest AI Voice advancements are impressive, you can still detect a layer of robotic, non-human sounding tones and inflection. There’s a good chance that this may be detected by your customers.
  • Soul. Even when AI voice does catch up and can completely mimic the tone, pace, delivery, pitch and inflection of a human voice, it will still be missing the most important part of what separates us from the robots: a soul. Think about the brands that customers make that deep connection with. They all have a heartbeat behind the brand that people can sense and connect with. This will be felt by your customers when you don’t use a real human voice.
  • Authenticity. When a real person (be it a celebrity or well-known public figure) voices your brand, that person’s lifestyle and ethos are layered on to how people will view your brand. Think Matthew McConaughey for the Lincoln Motor Company. His smooth, relaxed and sophisticated voice and perceived public lifestyle ooze into the ads he does for Lincoln. This, in turn, makes the customer associate those traits with the brand. Your brand won’t be able to access that depth when you use AI voice.

What is Human Voice?

Long before synthetic and AI voices were following a three-stage sound creation process, our incredible bodies were creating unique sounds, songs and voices. In terms of communication, the human voice is unmatched in its ability to convey detailed information that extends beyond the words we’re using.

When two people talk and actually understand each other, this incredible brain-imaging study suggests that both human brains synchronize.

“It is as if they are dancing in parallel, the listener’s brain activity mirroring that of the speaker with a short delay,” says Emma Seppala, science director of Stanford University’s Center for Compassion and Altruism Research and Education.

This level of natural brain synchronizing will never be able to happen between a human and computer. It also perhaps unlocks the code to how humans convey that deeper level of emotion to each other.

A study by Michael Kraus of the Yale University School of Management showed that when we only listen to voices (compared to observing both facial responses and voice), our ability to detect subtleties (specifically emotion) in vocal tone increases.

We can isolate the way speakers are (or aren’t) expressing themselves.

This may be why it’s so hard for algorithms to capture the unique sounds of the human voice – and why so many Synthetic or AI voices sound robotic or just flat-out false, even when we can’t put our finger on why.

There are currently around seven billion unique voices in the world and growing. All of them have a different story and experience that is distinctly theirs.

Human Voice Over Pros:

  • Real. A human voice will always be a human voice. No amount of programming will allow a robot to communicate the way that a completely unique person, who has a world of specific experiences and moments in their life, can. Everyone’s journey shapes their voice and how they tell your story. That takes a lifetime, not the short time span required for programming and simulation.
  • Fewer legal headaches. Next to no laws have been put in place to police AI Voice companies on the ways robots can communicate like humans with humans. Yes, it’s the Wild West right now, but just like every other piece of culture-shifting tech, government policy will follow (slowly) behind. Using a human voice will allow your brand to avoid any potential legal headaches that may arise from adopting AI Voice early on.

Human Voice Over Cons:

  • Limited Career and Lifespan. It may seem morbid, but if you have a particular human voice representing your brand, that individual has a limited lifespan and at any point, could change careers or retire.
  • Variable Compensation. Humans will seek more compensation for their work. The scale of pay for a voice actor can depend on many factors, including their experience, level of fame and ultimately, their skill and fit for your brand. You will also have to consider how you pay them for the end product, which has its own pay scale depending on the time it takes to produce, as well as the duration of your license.
  • Reputation Can Be Unpredictable. Many brands have been burned by being associated with a celebrity whose behavior turns out to be misaligned with the brand’s values. This is unpredictable and can force you to pivot quickly.

When Should a Brand Use a Synthetic Voice?

Is it better to choose a synthetic voice or go with the default voice that comes with providers like Amazon and Google? And, when should I work with a human voice actor? What is the interplay between the two? These are common questions we hear when consulting with clients.

Source: Synthetic Voice Matrix, developed by David Ciccarelli, Voices CEO

Here’s a new framework that we’ve developed and shared with our customers for making that very decision. In fact, we call it the Human versus Synthetic Decision Making Matrix. 

Across the bottom we have a time dimension. What’s interesting is that people are comfortable listening to a synthetic or automated voice if those voice prompts are less than 1 minute long.

What we’re looking at here are short prompts that are navigational in nature, hence the term Navigate. This includes information that doesn’t take long to share, like turn-by-turn directions or updates to an itinerary. For example, you’ve probably heard announcements in the airport when flight information has changed. These messages are often delivered by a synthetic voice. The original recordings are likely those of a real person, but times, dates and numbers are stitched together on the fly. You’ll find this use of synthetic voice where content is dynamically changing, with messages being short and to the point. This is a perfectly appropriate use for a synthetic voice instead of a live human voice actor.

Another ideal use case is asking questions that yield quick answers, like “What is the weather like today?” or “How many ounces are in a pound?”

As we extend the listening timeline to more than a minute, we refer to this as Educate.

For a lot of brands, this content is usually internal and includes recordings meant to serve those working at an organization, corporate training material being the primary use case.

Despite the longer listening experience, we find that some organizations are still choosing to work with a synthetic voice, possibly because they are more intent on getting the message out versus the quality of the narration.

But as we move up the quality scale, voice actors add significantly more value, even when the content is short-form. We refer to this as Inform. This includes commercials, promotional content, and moments when a brand wants to introduce a new product or service.

Flash briefings fit here as well. Giving engaging information on a daily basis with a lot more emotion seems like the ideal use case.

Finally, we refer to high-quality content that runs longer than 1 minute as Entertain. It can be as long as 1 hour or 10 hours of content, and it might be episodic or choose-your-own-adventure. As soon as the content is story-driven or character-driven, working with a voice actor makes the most sense.

Looking at the matrix as a whole, the lower quadrant is dominated for the foreseeable future by the synthetic voice.

Working with voice actors makes the most sense where the content is static in nature and driven by stories or characters.

Historically, all voice recordings were performed by professional voice actors. Now, synthetic voices can meet the needs of specific edge cases where the content is constantly changing, such as airport or train station announcements, or where the content is so diverse that a synthetic voice might be more adaptable, such as turn-by-turn directions for a navigation system.

For navigational micro content (sound bites shorter than 5 seconds) where the content library requires ongoing maintenance, a synthetic voice makes sense. For content that informs, entertains, educates or inspires, nothing will replace the emotion that only the human voice can evoke.
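For readers who think in code, the matrix described in this section can be sketched as a simple lookup on its two dimensions: listening time and the quality the content demands. This is only an illustration of the framework as described here. The function name, the yes/no `high_quality` flag and the decision to treat exactly 60 seconds as "long" are assumptions, and the Educate recommendation reflects what some organizations choose rather than a firm rule.

```python
def classify_content(duration_seconds, high_quality):
    """Return (quadrant, suggested voice) for a piece of audio content.

    duration_seconds: expected listening time of the content.
    high_quality: True when emotion and narration quality matter
                  (commercials, stories, characters); an assumed yes/no flag.
    """
    is_short = duration_seconds < 60  # "less than 1 minute"; boundary is an assumption

    if is_short and not high_quality:
        return ("Navigate", "synthetic")   # directions, itinerary updates
    if not is_short and not high_quality:
        return ("Educate", "synthetic")    # some organizations still choose synthetic here
    if is_short and high_quality:
        return ("Inform", "human")         # commercials, promos, flash briefings
    return ("Entertain", "human")          # story-driven or character-driven content


if __name__ == "__main__":
    print(classify_content(4, high_quality=False))     # a turn-by-turn prompt
    print(classify_content(30, high_quality=True))     # a 30-second commercial
    print(classify_content(3600, high_quality=True))   # an hour-long story
```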

How Brand Voice Will Be Heard Now and Into the Future

Now that you have all of the options and the pros and cons of each, the next step is sorting out how you will apply voice to your brand now and into the future.

If you’re looking to get started, or build upon your audio content strategy, here are some of the biggest opportunities for modern brand marketers to extend their marketing into the audio medium:

How Will Your Brand Be Leveraging Audio?

Have you considered or used any of the above vocal options? What do you think is the best match for your brand and why?

Please share in the comments below – our community would love to learn from your experience!
