Synthetic and AI Voices: Market Size and How They’re Made
Since the advent of modern technology, the way people engage with tech has moved from typing, to tapping, to talking.
Smart speaker adoption is faster than any other technological advancement in history. In just under eight years since they launched (Amazon’s Echo launched in 2014), 94 million households in America have a voice assistant somewhere in their home. Some have two, or more.
The creation of the synthetic and AI voices used in this new tech era almost always requires that a human voice be supplied first to form the foundation of the app’s vocal database. This is an emerging line of work for voice actors and could signal great opportunity, as more and more brands and technologies seek to develop their own branded synthetic voices, as well as source voices for Alexa Skills and other voice assistant apps.
It also signals an opportunity for brands across a variety of industries to enter this market early and differentiate themselves by offering their own applications or devices that use synthetic voice, just as smart speakers do.
Today, the synthetic voice market is fragmented and still developing, offering ample room for new entrants, as alluded to above. In most cases, a standard set of synthetic voices is included in text-to-speech services, like those provided by Google Cloud Text-to-Speech and Amazon Polly. There are also pure-play synthetic voice companies, including VocalID, Voicery, LyreBird, and Acapela Group. Some of these companies give customers access to a library of voices or even exclusive voices, but the number of options is still limited.
The Synthetic Voice Market Size and Value
By The Numbers
Even though synthetic voice is fragmented and still at a developmental stage, the market, made up of digital assistants and voice applications, is valued at $14.8B and projected to grow to $36B by 2025.
The AI voice market, as it currently stands, can be thought of as three core pillars: computers, healthcare, and digital assistants. All three are widely adopted and touch consumers’ daily lives.
- Digital assistants ($10.9B) represent the largest space in the voice synthesis market and range from assistants on commercial devices to advances in automated customer service.
- The computer voice market ($1.4B) is primarily driven by voice-to-text applications, and soon will be a differentiating factor in the highly competitive hardware and software markets.
- Healthcare ($1.8B) requires complex medical transcription software and offers vast opportunities in the vocal disorder and treatment adherence markets.
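To put the pillar figures in proportion, here is a minimal Python sketch that works out each pillar's share of the quoted $14.8B market value. It uses only the numbers quoted above; the variable names are ours. Note that the three pillars add up to less than the headline figure, and the article does not say what accounts for the remainder.

```python
# Figures quoted in this article, in billions of US dollars.
pillars = {
    "digital assistants": 10.9,
    "healthcare": 1.8,
    "computers": 1.4,
}
headline_total = 14.8  # overall market value quoted in this article

# Sum of the three pillars, rounded to avoid floating-point noise.
pillar_sum = round(sum(pillars.values()), 1)

# Each pillar's share of the quoted market value, as a percentage.
shares = {
    name: round(100 * value / headline_total, 1)
    for name, value in pillars.items()
}

# Portion of the headline figure not attributed to any pillar in the article.
unattributed = round(headline_total - pillar_sum, 1)

for name, pct in shares.items():
    print(f"{name}: {pct}% of the quoted market value")
print(f"sum of pillars: ${pillar_sum}B, not attributed: ${unattributed}B")
```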
Read more about each pillar below.
Computers are ingrained in every industry and businesses are always looking for new and interesting ways to integrate themselves into the daily lives of their consumers through computer tech. As a result, customized voice-recognition, voice-to-text, and other voice user interface services have been identified as the next frontier of differentiation for software and hardware firms alike.
To differentiate, technology firms will need an edge over their competitors, whether through access to a wider range of voices that makes their software more effective or by offering a novel voice product in an individual’s native dialect or tongue.
Healthcare is an interesting market for synthetic and natural voices for a few reasons. Not only does it draw on the complex technical advances we are accustomed to with synthetic voice and AI, but it also relies on the human side of the industry to make its stories more relatable and engaging. The healthcare voice industry is characterised by its growth and its ability to make the lives of both physicians and patients just a little bit easier. For example, the same technology physicians use for speech-to-text documentation, or to narrate in an operating room, is key to giving a voice to individuals affected by ALS or other conditions that affect the voice.
While healthcare holds a fair share of the current AI voice market, the future of synthetic voice presents particularly interesting opportunities in healthcare. First, we highlight the ~7.5M individuals who suffer from some form of voice disorder and who could be given their own unique voice in the future using voice technology.
We can also look at the millions upon millions of individuals who take some form of medication or are on some kind of treatment regimen in North America and abroad. With more and more research and evidence being published on the effectiveness of voice interactive reminders and adherence strategies, unique or specialized medical assistants for patients represent a massive and growing market.
Finally, we come to the largest pillar within the organic and synthetic voice market: digital assistants. Smart devices, smart homes, and technology in general have given brands the opportunity to develop their own personalities as they look to blur the line between autonomy and automation. As a result, the market is expanding both in the range of assistants on offer and in the level of funding and R&D funnelled towards creating something that would make even Alan Turing pause in wonder. The opportunity for an entrant lies in partnering with tech firms at the ground level and offering them access to unique, differentiating voices from which to develop a synthetic one.
Users of Voice Technology
With an understanding of which industries are open to entry, we must then look at who the most likely purchasers are.
When discussing the B2B distribution channel (for instance, an AI voice seller supplying a healthcare professional), we are discussing the development of voice ecosystems and brands. Brands in all industries are looking at how they can interact and connect with their consumers on a deeper (and more profitable) level. The future of the B2B channel depends on firms’ ability to build in unique technological offerings, whether organically through their own development, through acquisition, or, in some cases, by outsourcing the management of such customer interface systems.
The largest distribution channel is customers and consumers themselves. Because the channel is so large, its needs and segments vary widely, but the most important point is that these individuals are looking for more streamlined and unique ways to interact with their technology. Part of this new interaction will be the rise of new voice interfaces and platforms, behind which there is an unprecedented need for large pools of voice talent for branding as well as research and development.
How Synthetic Voice is Currently Being Developed
The current speech synthesis process involves three key steps.
Step 1: Preprocessing & Normalization
Synthetic voice begins with the pre-processing and normalization of the text or data intended for synthesis and the generation of the sounds needed for a natural sounding AI voice.
In pre-processing, the main things to consider are the format of the text and the context it provides. Computers use neural networks to determine the probability of what the next best action should be. This means they will interpret numbers as dates, financial information, or values, depending on the context. They will also use context to help determine how homographs should be read (e.g. by looking for past, present or future tense).
For example, if a sentence contains the word “year”, the network would assume the next best action is to read “1998” out as a date. But if the context is the value of a product, it may presume the number to be a financial figure and read it as “$19.98”. As such, best practice in pre-processing and normalization is to re-write any numbers or special characters in the way they should be read (e.g. 1998 becomes nineteen ninety eight).
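To make the normalization step concrete, here is a minimal Python sketch. It is an illustration only: production systems use trained models to read context, whereas this sketch substitutes a simple keyword rule, and the hint lists and function names are our own assumptions.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical context keywords standing in for a trained model.
DATE_HINTS = {"year", "born", "since"}
PRICE_HINTS = {"price", "cost", "costs", "value"}


def two_digits(n: int) -> str:
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def read_as_year(digits: str) -> str:
    """Read four digits in pairs. Years such as 2000 or 1905 need extra rules."""
    return f"{two_digits(int(digits[:2]))} {two_digits(int(digits[2:]))}"


def read_as_price(digits: str) -> str:
    """Read four digits as dollars and cents, as in the $19.98 example."""
    dollars, cents = int(digits[:2]), int(digits[2:])
    return f"{two_digits(dollars)} dollars and {two_digits(cents)} cents"


def normalize(sentence: str) -> str:
    """Rewrite four-digit numbers the way they should be read aloud."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))

    def replace(match: re.Match) -> str:
        digits = match.group()
        if words & PRICE_HINTS:
            return read_as_price(digits)
        if words & DATE_HINTS:
            return read_as_year(digits)
        return digits  # no usable context: leave the number untouched

    return re.sub(r"\b\d{4}\b", replace, sentence)


print(normalize("In the year 1998 sales doubled."))
print(normalize("The value of the product is 1998."))
```

The same digits come out as a year in the first sentence and as a price in the second, purely because of the surrounding words.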
Step 2: Generation of Sounds
In the second step, the phonemes are generated. Phonemes are the distinct units of sound in a language that differentiate one word from another, like p, b, d, and t in the English words pad, pat, bad, and bat. This step involves creating the proper sound for each letter.
After the text has been normalized, the computer needs to create the sounds required for proper articulation. This means converting characters into their phonemes, or sounds, which involves understanding the context of the sentence to determine the proper tense.
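The following Python sketch illustrates the idea of turning words into phonemes and using context to choose between two readings of a homograph. The tiny lexicon, the ARPAbet-style symbols, and the tense hints are hypothetical examples chosen for illustration; real systems rely on large pronunciation dictionaries and trained models.

```python
import re

# A tiny hand-written pronunciation lexicon (ARPAbet-style symbols).
LEXICON = {
    "pad": ["P", "AE", "D"],
    "pat": ["P", "AE", "T"],
    "bad": ["B", "AE", "D"],
    "bat": ["B", "AE", "T"],
}

# "read" is a homograph: same spelling, different sound depending on tense.
HOMOGRAPHS = {
    "read": {"present": ["R", "IY", "D"], "past": ["R", "EH", "D"]},
}

# Hypothetical cue words suggesting the sentence is in the past tense.
PAST_HINTS = {"yesterday", "already", "had"}


def to_phonemes(sentence: str) -> list:
    """Return (word, phonemes) pairs; phonemes is None for unknown words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    tense = "past" if PAST_HINTS & set(words) else "present"
    result = []
    for word in words:
        if word in HOMOGRAPHS:
            result.append((word, HOMOGRAPHS[word][tense]))
        else:
            result.append((word, LEXICON.get(word)))
    return result


for sentence in ("I will read the pad", "I read the pad yesterday"):
    print(sentence, "->", to_phonemes(sentence))
```

Note how pad and pat differ only in their final phoneme, while the two readings of "read" differ only in the vowel.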
Step 3: Synthesis
Finally, the text can be synthesized into voice. Currently there are two main approaches.
Parametric Voice Synthesis
Parametric voice synthesis is a newer approach than concatenative voice synthesis, which is discussed next. In this methodology, the information required to generate data or speech is stored in the parameters of the sample, allowing the contents and characteristics of the speech to be controlled. The biggest benefit of the parametric approach is that it allows for flexibility in the content and characteristics going forward.
From an investment standpoint, parametric systems are computationally expensive, but they allow for a wider range of voices than concatenative voice synthesis methods.
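As a toy illustration of the parametric idea, the Python sketch below generates audio samples entirely from a handful of parameters, so changing a parameter changes the resulting sound without any new recording. This is a deliberately simplified stand-in (a plain sine tone), not how a production parametric system works; the parameter names and sample rate are our assumptions.

```python
import math
from dataclasses import dataclass

SAMPLE_RATE = 16_000  # samples per second (assumed for this sketch)


@dataclass
class VoiceParams:
    pitch_hz: float    # how high or low the tone sounds
    duration_s: float  # how long the sound lasts
    amplitude: float   # how loud the sound is, from 0.0 to 1.0


def synthesize(params: VoiceParams) -> list:
    """Generate a sine tone whose character is set only by the parameters."""
    n_samples = int(params.duration_s * SAMPLE_RATE)
    return [
        params.amplitude
        * math.sin(2 * math.pi * params.pitch_hz * i / SAMPLE_RATE)
        for i in range(n_samples)
    ]


# Two different "voices" from the same generator, differing only in pitch.
low_voice = synthesize(VoiceParams(pitch_hz=110.0, duration_s=0.5, amplitude=0.8))
high_voice = synthesize(VoiceParams(pitch_hz=220.0, duration_s=0.5, amplitude=0.8))
print(len(low_voice), len(high_voice))
```

The flexibility comes from the fact that nothing is pre-recorded: every sample is computed, which is also why this family of approaches tends to cost more computation.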
Concatenative Voice Synthesis
Concatenative voice synthesis is currently the most natural-sounding approach and involves stitching together recorded snippets of speech to create new sentences. This requires a voice actor to have recorded a significant amount of speech with the associated text. Currently, it is said that reading the book Alice in Wonderland is best practice for capturing all the words and inflections needed for voice synthesis. (Voice actors who read Alice in Wonderland for these synthetic voice projects are encouraged to consider the future use of their recording in the development of a synthetic voice and quote accordingly.)
This is the approach that Amazon took when developing and launching Alexa in Saudi Arabia and the UAE. By using audio samples in the concatenative approach, the smart speaker and virtual assistant could be developed not only for the Arabic language, but also for its various regional dialects and accents. While this is just one step forward, it is a major one for the right to accessible technology worldwide.
Synthetic voices are getting better because of parametric voice synthesis. But because the concatenative methodology of storing large amounts of voice snippets produces the most realistic-sounding AI voice, there is growing demand for parametric voice synthesis to evolve to sound more natural, too. This is where neural networks come in.
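The stitching idea behind the concatenative approach can be sketched in a few lines of Python. This is a simplified illustration: the "recordings" are made-up lists of numbers standing in for audio samples, the units are whole words, and the names are ours. Real systems work with much smaller units and smooth the joins between them.

```python
# Hypothetical database of recorded snippets: each word maps to the audio
# samples captured when the voice actor read it aloud.
RECORDED_UNITS = {
    "alice": [0.1, 0.3, 0.2],
    "sat": [0.4, -0.1],
    "down": [-0.2, -0.3, 0.0],
    "the": [0.05, 0.1],
    "rabbit": [0.2, 0.25, -0.15],
}


def concatenate(sentence: str) -> list:
    """Build new audio by joining recorded snippets in sentence order."""
    audio = []
    for word in sentence.lower().split():
        if word not in RECORDED_UNITS:
            # Nothing to stitch: this word was never recorded.
            raise KeyError(f"no recording available for {word!r}")
        audio.extend(RECORDED_UNITS[word])
    return audio


# A sentence the actor never read as a whole, built from recorded pieces.
print(concatenate("The rabbit sat down"))
```

The `KeyError` branch shows why a significant amount of recorded speech is needed: any word missing from the database simply cannot be spoken.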
Speech Synthesis uses Neural Networks to Generate a Natural Sounding Voice
The neural network is trained on real waveforms recorded by humans, paired with the written text. Once trained, the network can be sampled to generate synthetic utterances. At each step, the model calculates a probability distribution for the next audio sample, a value is drawn from that distribution, and this value is fed back into the model so that a new prediction can be made in the next step. Building the output up one sample at a time in this way is crucial for a realistic-sounding voice.
As mentioned above, concatenative approaches produce the most realistic-sounding voices, but teams at Google and DeepMind have been actively working on this problem to improve the parametric approach.
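The predict, sample, and feed-back loop can be illustrated with a short Python sketch. To keep it self-contained, a simple table of counts stands in for the neural network, and a short list of integers stands in for a recorded waveform; both are assumptions made for illustration, and the function names are ours.

```python
import random
from collections import Counter, defaultdict


def train(waveform: list) -> dict:
    """Count which sample value tends to follow each sample value."""
    counts = defaultdict(Counter)
    for previous, following in zip(waveform, waveform[1:]):
        counts[previous][following] += 1
    return counts


def generate(model: dict, start: int, length: int, seed: int = 0) -> list:
    """Build an utterance one sample at a time, feeding each result back in."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(length - 1):
        options = model.get(output[-1])
        if not options:
            break  # the model never saw anything follow this value
        values = list(options)
        weights = [options[v] for v in values]
        # Draw the next sample from the predicted distribution, then loop so
        # that the drawn value becomes the input for the next prediction.
        output.append(rng.choices(values, weights=weights)[0])
    return output


# A stand-in for a human recording: a simple repeating wave shape.
recorded = [0, 1, 2, 1, 0, -1, -2, -1] * 4
model = train(recorded)
utterance = generate(model, start=0, length=16)
print(utterance)
```

Every generated value depends on the one before it, which is what makes this style of generation slow but also what lets it follow the shape of the original recording.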
Dynamic and Static Voice Applications
Dynamic voice applications involve a bi-directional, or two-way, interaction with the end user. For instance, the user asks questions, and the voice assistant provides answers out loud. These interactive experiences are distinct from static experiences, which are one-way, such as a podcast or audiobook.
Opportunities for Voice Actors with Dynamic Voice Applications
One of the first challenges that a developer or publisher may face when creating dynamic voice applications is that it’s difficult to anticipate what the user will say, and therefore, how the device should answer. For this reason and more, the creation of synthetic and AI voices almost always requires that a human voice be supplied first to form the foundation of the app’s vocal database. This is an emerging line of work for voice actors and could signal great opportunity, as more and more brands and technologies seek to develop their own branded synthetic voices, as well as source voices for Alexa Skills and other voice assistant apps.
Opportunities for Voice Actors with Static Voice Applications
On the other end of the spectrum are applications for voice that require a pure custom voice over. Some of these are linked to the proliferation of audio content, which has been made possible thanks to the adoption of home devices. For example, consumption of podcasts continues to grow, as more and more people are enabled to listen to them.
With static voice applications, all of the vocal interactions are defined in a script. In that way, these kinds of jobs are not that different from any other voice over work, except that the channels for distribution may be more diverse and far-reaching.
Now is a Time of Incredible Growth and Opportunity for All
Voices is committed to bringing the best audio and voice products and services to market, and we’re excited to be exploring how the new world of synthetic and AI voices can complement our core business.
We are in the ‘next wave’ in the evolution of computing, which has taken us from typing, to tapping on touch screens, to talking directly to our devices. In some instances, we’re even having conversations.
This is a time in our history that will be well-remembered and well-documented, as businesses adapt to a new, audio-driven world. We’re so excited for you, our client and voice actor customers, to join us on this continued journey.
Article originally written June 2019 by David Ciccarelli