Text to Speech Technology: How Voice Computing is Building a More Accessible World
In a world where new technology emerges at exponential rates, and our daily lives are increasingly mediated by speakers and sound waves, text to speech technology is the latest force changing the way we communicate.
Text to speech technology refers to a field of computer science that enables the conversion of language text into audible speech. Also known as voice computing, text to speech (TTS) often involves building a database of recorded human speech to train a computer to produce sound waves that resemble the natural sound of a human speaking. This process is called speech synthesis.
The technology is trailblazing and major breakthroughs in the field occur regularly. Popular consumer devices that have introduced text to speech technology into our everyday lives include artificial intelligence-powered virtual assistants such as Amazon’s Alexa and Google Assistant.
Beyond simply converting language text into speech, these virtual assistants use speech recognition software to take in sound waves produced by a human talking, derive meaning from that audio data, and deliver a response in a synthetic voice. In its most advanced form, text to speech technology has enabled artificial intelligence to hold a conversation with a human being.
Text to speech technology has been employed for advertising purposes with the advent of interactive voice ads, which have been reported to drive brand recall more than comparable forms of advertising. Text to speech can also be an optimal tool for converting large volumes of text into playable audio data.
Get to know how speech technology works, the role the human voice plays in molding a synthetic voice, and how text to speech is being used to make speaking and listening more accessible for people everywhere.
How do voice computing and text to speech technology work?
On a fundamental level, the way text to speech technology functions can be broken down into the following processes:
First, a voice computing system takes in sound waves produced by a human voice and converts them into language data. This process is called automatic speech recognition (ASR). Before it can do anything with that data, however, it must derive meaning from those words. This process is called natural-language understanding (NLU). Composing a reply in words is a separate step, called natural-language generation (NLG).
Artificial intelligence has developed the ability to come up with original, creative responses to the audio data it intakes. As James Vlahos, author of Talk to Me: How Voice Computing Will Transform the Way We Live, Work, and Think, articulates, “neural networks are crafting original things for the computer to say. They’re not just grabbing prescripted words, they’re doing so after being trained on huge volumes of human speech—movie subtitles and Reddit threads and such. They’re learning the style of how people communicate and the types of things person B might say after person A.”
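The stages described so far (taking in speech, deriving meaning, and composing a reply) can be sketched as a small pipeline. This is only an illustration under heavy assumptions: the function names are hypothetical, the "audio" is just text encoded as bytes, and keyword matching and a fixed template stand in for the trained neural networks a real assistant would use.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    """What the system believes the speaker wants (hypothetical structure)."""
    name: str
    slots: dict = field(default_factory=dict)


def recognize_speech(audio: bytes) -> str:
    # Stand-in for speech recognition: pretend the audio is already UTF-8 text.
    return audio.decode("utf-8")


def interpret(transcript: str) -> Intent:
    # Stand-in for deriving meaning: simple keyword matching.
    text = transcript.lower().strip(" ?.!")
    if "weather" in text:
        # Treat whatever follows " in " as the location, if present.
        _, _, city = text.partition(" in ")
        return Intent("get_weather", {"city": city or None})
    return Intent("unknown")


def compose_reply(intent: Intent) -> str:
    # Stand-in for response generation: a fixed template per intent.
    if intent.name == "get_weather":
        city = intent.slots.get("city")
        place = city.title() if city else "your area"
        return f"Here is the forecast for {place}."
    return "Sorry, I did not catch that."


def respond(audio: bytes) -> str:
    """Run all three stages and return the text to be spoken."""
    return compose_reply(interpret(recognize_speech(audio)))


print(respond(b"What is the weather in Toronto?"))
```

The string returned by respond is what would then be handed to the speech synthesis stage.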
Once a text to speech engine has generated the text it intends to convert to speech, it needs to produce the sounds required for articulation. This stage of the process involves converting language characters into phonemes, or distinct sounds. To achieve this, the text to speech engine must understand the context of the sentence in order to determine the proper pronunciation. The word “read,” for example, is pronounced differently depending on its tense.
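One case where sentence context matters is a word like "read", which is spelled the same in the present and past tense but pronounced differently. The sketch below is a toy illustration, not how any particular engine works: the lexicon is a tiny hand-written table using ARPAbet-style phoneme symbols, and the tense check is a deliberately crude assumption (it only looks for a few cue words).

```python
# Tiny hand-written pronunciation table (ARPAbet-style symbols, illustrative only).
LEXICON = {
    "i": ["AY"],
    "will": ["W", "IH", "L"],
    "the": ["DH", "AH"],
    "book": ["B", "UH", "K"],
    "yesterday": ["Y", "EH", "S", "T", "ER", "D", "EY"],
}

# Words whose pronunciation depends on tense.
HOMOGRAPHS = {
    "read": {"present": ["R", "IY", "D"], "past": ["R", "EH", "D"]},
}

# Crude assumption: any of these words signals that the sentence is in the past.
PAST_CUES = {"yesterday", "had", "have", "has", "already"}


def to_phonemes(sentence: str) -> list:
    """Return one list of phoneme symbols per word in the sentence."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    tense = "past" if PAST_CUES & set(words) else "present"
    result = []
    for word in words:
        if word in HOMOGRAPHS:
            result.append(HOMOGRAPHS[word][tense])
        else:
            # Words missing from the toy lexicon are marked as unknown.
            result.append(LEXICON.get(word, ["<unk>"]))
    return result


print(to_phonemes("I will read the book"))
print(to_phonemes("I read the book yesterday."))
```

A production system would typically rely on a full pronunciation dictionary plus a trained model for words and contexts the dictionary cannot resolve.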
Using the Human Voice to Forge a Synthetic One
One of the foremost models for speech synthesis is called concatenative text to speech, which is “where a very large database of short speech fragments are recorded from a single speaker and then recombined to form complete utterances.”
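The idea of recombining stored fragments can be illustrated with a toy sketch. Everything here is an assumption made for illustration: short sine tones stand in for recorded speech fragments, each unit has exactly one stored fragment, and the names UNIT_DB and concatenate are hypothetical.

```python
import math

SAMPLE_RATE = 8000  # samples per second (arbitrary choice for this sketch)


def tone(freq_hz: float, duration_s: float = 0.05) -> list:
    """Generate a short sine tone as a stand-in for a recorded fragment."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical unit database: one stored fragment per phoneme label.
UNIT_DB = {
    "HH": tone(200),
    "AH": tone(300),
    "L": tone(400),
    "OW": tone(500),
}


def concatenate(units: list) -> list:
    """Look up each unit's fragment and join the fragments end to end."""
    samples = []
    for unit in units:
        samples.extend(UNIT_DB[unit])
    return samples


utterance = concatenate(["HH", "AH", "L", "OW"])
print(len(utterance), "samples")
```

A real concatenative system would hold many candidate fragments per unit, choose the ones that join most naturally, and smooth the boundaries between them.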
While famous reference points for voice computing include the sentient computer HAL in the film 2001: A Space Odyssey and the speech synthesizer used by Stephen Hawking, the synthetic voice of the future isn’t wholly robotic. The sound of authentic human speech will play a key role in the formation of original synthetic voices that sound increasingly like humans.
If you’re producing a synthetic voice for your brand, by inputting the voices of real actors, you have the opportunity to imbue your brand voice with its own personality or vocal identity. As text to speech technology grows more widespread, selecting the race, gender, and other vocal characteristics of the voices you input will allow you to create a unique synthetic voice that represents who you are.
How Text to Speech Technology is Contributing to Increased Accessibility
In a variety of capacities, text to speech is being wielded as an assistive technology to help make the world more accessible when it comes to the way we speak and listen. Here are a few of the central ways that text to speech technology is being used:
Text to speech as an aid for people with learning disabilities
When you’re publishing written content intended for as broad an audience as possible, employing text to speech technology is one tactic to help make it more accessible for those dealing with certain types of learning disabilities.
Over 750 million youth and adults around the world lack reading skills or deal with illiteracy, and between 15 and 20% of the global population has a language-based learning disability. Of these disabilities, dyslexia is the most common.
Even for members of your audience who can understand a portion of your content, reading everything comfortably may still be an issue. Granting your audience the option to hear any piece of your content read aloud makes it easier for people across a wide range of literacy levels to access.
Text to speech for learning a new language
Learning a new language is rarely easy. On top of that, the way a word appears spelled out in letters often doesn’t match up with the way it sounds phonetically.
Text to speech technology offers new learners the ability to listen along to the way words sound at the same time as they read. Used in this capacity, text to speech technology can serve as a useful aid for immigrants to new countries who are working on learning a new language, as well as pre-literate children learning to speak for the first time.
Text to speech for people with visual impairment
An estimated 285 million people worldwide have some form of visual impairment, 39 million of whom are blind. Text to speech technology allows those who are unable to read from a screen to access written content by listening to it.
Even if an individual doesn’t have any form of visual impairment, reading for extended periods of time can still cause considerable visual strain. In such cases, text to speech technology is a valuable tool that offers readers a reprieve from staring at a screen without putting a pause on their engagement with the textual material.
Text to speech enabling consumption on the go
Text to speech technology allows consumers to listen to any text while they’re on the go or multitasking. Studies have shown that we’re spending more time than ever plugged into audio sources: from listening to music and podcasts, to relying on smart speakers to deliver the news and instructive audio content, like recipe lists or weather reports, while we juggle tasks around the house.
Many people can’t find sufficient time in their day to read. Text to speech technology converts the words that a reader would otherwise need to focus on intently into sound that can accompany them wherever they go.
Text to speech technology is also advantageous because it doesn’t require a human to stand at a microphone and record lengthy streams of text, especially when audio content must be delivered with little or no warning. The technology is ideal for converting news briefings or regularly updated elearning courses, such as microlearning (training content delivered in short bursts), from text into automated speech.
In addition to being optimal for content that is frequently updated, text to speech technology is similarly ideal for long form content. This can include books, articles, training documents, or any piece of writing with a hefty word count. Text to speech can allow anyone to consume any content anywhere, even while engaged in another activity.
Text to speech for people with medical conditions that impact their voice
Text to speech technology can help provide a voice for those who have a speech impairment or have been faced with a medical condition that has impacted their ability to speak.
Roughly one in ten people in the United States deals with an acquired speech impairment, resulting from conditions such as ALS, Parkinson’s, strokes, and brain injuries. Acquired speech impairments can include the loss of one’s ability to speak altogether.
For a lot of people, their voice is like their identity, as distinct to them as their own fingerprints. In recent years, new forms of text to speech technology have been developed that can recreate the sound of an individual’s voice from before the time when they were diagnosed.
Groundbreaking initiatives such as Google’s Project Euphonia, working in collaboration with artificial intelligence company DeepMind, are making great strides to “synthesise a high-quality, natural sounding voice using minimal recorded speech data.”
After football player Tim Shaw was diagnosed with ALS, he lost his ability to speak. Yet, using NFL audio recordings of Shaw speaking, DeepMind and Google’s AI team were able to recreate the football player’s former voice. The results are captured in a short documentary.
In the documentary, Google AI Product Manager Julie Cattiau outlines that Project Euphonia’s two primary goals are “to improve speech recognition for people who have a variety of medical conditions,” as well as “to give people their voice back, which means actually recreating the way they used to sound before they were diagnosed.”
Virtual assistants providing support around the home
It has been reported that virtual assistants using text to speech technology can provide support to people who benefit from regular oral reminders or conversation. Because AI can generate responses and engage with individuals, the technology can be of benefit for people living alone.
“Elderly people have the issue often of being alone a lot, so they are the ones that might be more likely to chitchat with Alexa,” says James Vlahos. “There are also applications out there where voice AI is used almost as a babysitter, giving medication reminders or letting family members do remote check-ins.”
Voice Computing is Transforming the Way We Communicate
Text to speech technology and speech synthesis are among the most cutting-edge advances that artificial intelligence has made possible. Beyond merely allowing an individual to input text to be recited by a computer, voice computing is enabling entirely original synthetic voices to come into existence.
These voices are allowing individuals to regain a voice that they have lost, engage in increasingly realistic conversations with computers, and convert any amount of language text into natural-sounding speech.
In order to create a bespoke synthetic voice, you must begin with the human voice. When you are developing a new voice, whether for a brand or an individual, you are going to need access to a diversity of voices, including actors of different ages who speak different languages with different accents.
Voices allows you to hire a custom text to speech voice. When you need a cutting-edge synthetic voice for your next project, look no further.