
Voice Consumer Index 2022 and Synthetic Voices with James Poulter

Stephanie Ciccarelli

Where is synthetic voice going and how can you be part of it? James Poulter from Vixen Labs in the UK joins Stephanie Ciccarelli to discuss his findings from Voice Consumer Index 2022 and what they mean for voice talent. Are the robots coming? Is the sky falling? James answers many questions from a voice agency perspective around the adoption and acceptance of synthetic voices in everyday life, how AI technologies are capable of doing so much more than we had previously thought, the process of making a synthetic voice and how talent and brands can ensure good business dealings when creating and using synthetic voices.

Mentioned on the show:

Vixen Labs

Voice Consumer Index 2022

Continue the conversation on the Voices Community Forum

Thank you for listening to Vox Talk. We are so glad you did! Be sure to share this episode with anyone you think should hear it. Use the hashtag #voxtalk to continue the conversation online.

Stephanie Ciccarelli:

Hi there and welcome to Vox Talk, your weekly review from the world of voiceover. I'm your host, Stephanie Ciccarelli from Voices. Curious about consumer awareness, adoption and openness to AI voices? James Poulter, CEO and co-founder of Vixen Labs in London, UK, joins me to discuss the Voice Consumer Index 2022. Vixen Labs is Europe's leading full service voice agency. James and his team work with Fortune and FTSE 500 brands to develop voice and conversational strategies, products and services to drive business value and connect with audiences in the most intuitive way possible. Vixen Labs' full service offering covers strategy development, voice search optimization, voice app builds, audio content and marketing. Welcome to the show, James.

James Poulter:

Thanks so much for having me, Stephanie.

Stephanie Ciccarelli:

Well, thank you for being here. I was just thinking, James, it's so exciting to be talking to you again. The last time we saw each other was a while ago in New Jersey, at the Voice conference. That was really great. But for those people who are new to Vixen Labs and the Voice Consumer Index, what does the report cover and aim to achieve?

James Poulter:

Yes, so the Voice Consumer Index is a report that is now in its second fully fledged year. We did also have a kind of smaller version of it one year before that. And it's a study of what actual, usual people in their homes, in their cars and out on the street are doing with their voices when they're talking to their conversational assistants. And most of us will know those things by the letter A. I'm not going to set her off, but Alexa, Google Assistant and Siri and all of those others. And we took a study for the past couple of years looking at 2,000 consumers in each of the UK, the US and Germany, so 6,000 people in all, really trying to understand what are they doing with their voices, why are they doing it, what content are they looking for, what are they trying to achieve day to day and where are they using it as well, which is particularly important. And this year we've particularly looked into a couple of key areas. One is this whole theme of the metaverse and where's all that going? As you can imagine, many people are intrigued whether this will be the year of the metaverse or just the year that we talk about it. And then we're also looking at it by industry as well. So digging really deep into things like healthcare, into things like consumer packaged goods and retail, ecommerce obviously being a huge space for us, particularly with Amazon being such a big player in this space. And then we're also looking into more of those practical use cases when it comes to things like entertainment, managing your content, getting things done through IoT devices. And at Vixen Labs, as you said in the intro, we're a voice and conversational AI agency. We work with big brands around the world to really help them leverage this type of technology for their own purposes, whether that's building businesses, whether that's helping reduce friction from a conversational commerce perspective or customer services.
And yeah, we've put this study together for the past couple of years with our partners at the Open Voice Network, who stand very much for the ethical implementation of AI in the voice and conversational space, and that's why we put it together again. And this year, again, we've just learned so much, which I'm excited to share with you guys about where voice is going in 2022 and beyond.

Stephanie Ciccarelli:

Well, we're happy to be part of that and to hear it. Everyone listening has either heard an AI voice or they've been the AI voice. The audience is kind of looking at this as work that they could be doing, but also just trying to understand more about how it's interacting in our day to day lives. So in the opening letter of your report, you said that we've hit a tipping point. So what do you mean by that?

James Poulter:

Yeah, so I think with every major technology revolution that comes around, whether that was as far back as the desktop PC or the web, on to things like social media, mobile phones and apps, and then now, more recently, voice, we've gone past that point where less than half of people are using it every day. We're now at around about 60% of people in each of these major markets using voice on a daily basis, sometimes multiple times a day. And by using voice, because obviously when we're talking about our voices, that's different from using a voice technology, we're talking about people saying things to their smart speakers, to the assistant through their connected headphones, maybe through their car's head unit, or maybe in a retail environment, talking to a kiosk or screen, an ATM or something like that. And at some point during the day, multiple times a day, people are now doing that. And we've reached what we call that tipping point because we've gone into that majority. Most of us in these major markets, where these services and this technology have been around for, say, more than five years, are now at a point where we're doing this multiple times a day, and that means it's becoming as commonplace as picking up a mobile phone or logging on to a desktop browser. And we're used to having these things around us. And when that happens, when you reach that tipping point, what most crucially happens is that it begins to fade into the background, kind of mentally. And what I mean by that is we don't think about it so much anymore. It just becomes natural, habitual, and therefore it becomes ingrained in what we do. And that's when it takes on a life of its own, because we're no longer having to think actively about choosing to do it. We're just doing it naturally.
And that presents a much bigger opportunity for all of us working in this space, whether we're on the providing end, providing voices, or on the brand side of things, creating experiences around those voices, because the general public is now naturally choosing to do this day to day.

Stephanie Ciccarelli:

How interesting, because I remember even back in 2019 when we were all in New Jersey together, it was still very much, ‘how good can this stuff really get?’ ‘Who's talking to these devices?’ And a lot of people were still quite apprehensive. And I think around that same time, maybe the year before, Amazon was like, ‘oh, let's make it really, really easy to get Alexa into people's homes.’ It was like almost everyone had an Alexa at that point. It was becoming more adopted. So as we're looking at this, you're saying this is not so much novel anymore, this is normal.

James Poulter:

Yeah, absolutely. I love putting it that way, going from novel to normalcy, because that is kind of where we're at now. So I think one of the big things, obviously, that we've seen during the pandemic is that as many people spent more and more time at home, they were trying to do multiple things during the day, right? You're juggling working from home, you've got the kids who are perhaps being home schooled, you've still got to get the usual things done around the house in terms of household chores. And we've all become a lot more health conscious, and in particular not going to go swiping the screens in McDonald's or tapping away on the ATMs that people used before. Those habits have been built up during the two years of the pandemic. And at the same time, we also saw this rapid adoption of smart speakers, particularly at the lower price point, with the Amazons and Googles of the world in some cases basically giving away devices, and new classes of devices as well, where people are beginning to find more utility from them. Things like smart speakers with displays, like the Echo Shows or the Google Nest Hubs, which now form over 20% of the market in each case. And with those two things happening, we've cemented this behavior into our everyday lives. We started going, oh, actually, I can get some of the simple things that I need to get done every day by just using my voice, by just asking and taking action via conversation. And that doesn't mean that people want to show up just to have a conversation with these things, right? We're not spending all day chatting to these devices. I often use the example of the drive-thru window: you don't go there to have a conversation with the server behind the little microphone in the booth. You go to get a burger, but you get it done through the act of conversation.
And that's what we've seen begin to happen: many people will now choose to get those simple, repeatable, habitual everyday tasks done with their voice rather than trying to get them done on a mobile phone. Perhaps their hands are busy, their eyes are occupied, or they're just finding it simpler and easier, and that's why people are turning to it.

Stephanie Ciccarelli:

Right. And I think for our part at Voices, we've certainly seen five years of digital transformation within five months because of the pandemic when it first got going in 2020. And there's the whole idea of people becoming more comfortable with these devices, talking to them and giving them directives and asking them, ‘oh, tell me a joke, Siri,’ or whatever it might be. Just thinking of the big three, you cover them a lot in your report: Alexa, Siri and Google Assistant, still going strong. And as you've also said, people are more hesitant to touch screens of any kind that are not their own. Right? So it's no wonder that voice search and voice activated encounters, if you will, are actually picking up quite a bit. So the pandemic affected this, but just how much?

James Poulter:

Well, we've seen in some cases double digit growth from 2021 to 2022. And obviously we don't have the benefit of having done this study for many years prior to the pandemic, so we can only see that growth as we come out of that peak. Certainly in the UK and Germany, for example, the time we conducted the 2021 study was May of 2021. We were in the third lockdown, basically just emerging from that. So we do at least see quite a stark difference, with life being relatively back to normal here in Europe compared to last year. And so we have seen adoption, as in people buying new devices or using those devices, grow in some cases by 10% to 12%. But what's really interesting is not so much the total population of people that are using them, but where they're using them and what they're choosing to do with them. And in particular, what we've seen there is that the habits that were formed in the home, where many of us were using our mobile phones as we walked around the house, talking to them or turning to a smart speaker, have carried over into the outside world as the world has unlocked. We've begun to take our technology on the move again with us, and particularly that's showing up in some new types of devices that are being used more, things like connected headphones. One of the facts that gets me every time I sit down and talk about this is Apple AirPods. Stephanie, do you own any AirPods?

Stephanie Ciccarelli:

David does. Yeah, I definitely am familiar with them.

James Poulter:

Okay, so the little white earbuds we've all gotten used to seeing people wearing around. If Apple was to break just AirPods out from Apple as a business, it would be larger than Nvidia or Adobe on the stock market. Just the sale of those little white earbuds, right? And one of the biggest features of those is Siri being embedded into those headphones. And so what we're seeing is that people are getting so used to using their headphones and their mobile phone to get things done with their voice that they're beginning to carry that action out outside of the home when they're on the move. Maybe when you're walking or commuting, whether on public transport or in the car, where again you're still using your mobile phone, but it's usually connected to the head unit of your car. And so we're seeing people take that voice behavior they built up at home out into the real world. And what that really does is present all sorts of really interesting new opportunities, because if you're out in the world again, all the different marketing cues that we're susceptible to, whether that's out-of-home, events and cinema, billboards out in the street, etc., suddenly become a new advertising vehicle for brands and businesses to call us to action, not through a URL, not through a social media handle that we have to remember, but through natural language prompts that we can ask our voice assistants to get things done. That new behavior is beginning to open up new opportunities that come with it.

Stephanie Ciccarelli:

Wow, that's a lot of, I guess, trust that's been built up in these voices. Because, I don't know, I personally do not have Siri turned on on my phone. I've just chosen not to do that. I don't talk to Alexa, even though there might be one at home. I've been one of those people who's just like, ‘I don't know about this.’ And you know what? Certainly in 2019, I was definitely more like that than I am now. But it's interesting because it's out there. People are using it. People are trusting these devices and the voices in them. In your report, you also said that conversational, convenient and helpful voices seem to be, I guess, what is helping to build that trust with people.

James Poulter:

Yeah, absolutely. I think what we've seen with trust is that this year, for the first time since we started tracking any of this, the mistrust that people have that these technologies are listening to them, kind of what you're alluding to, Stephanie, has decreased. And that's really fascinating to us, because one of the big things, and why we call this moment that kind of tipping point moment, is what we've generally seen, and this is my kind of thesis on the internet for the past few decades: the history of the internet is giving over privacy for utility. That's all it's ever been, right? The more data we give up, the more we get out of it. But obviously we have that sacrifice of data and privacy as we go. What happens is that there comes a point where people get so much utility out of it that not only are they happy to give over that privacy, but they also just stop caring about it, and therefore they begin to trust these services. And we've seen the same thing time and again, whether that was giving credit card details to Amazon in the early days because they could get you a book in 24 hours, or connecting with my friends on Facebook: I'm willing to give you basically all of my life information to keep track of what people are up to. And now we've seen the same thing in voice. I'm willing to give over that recording of my voice, and even accept the potential that these devices listen to me on a regular basis, because it frees me to get so much more done, because I'm not having to have a screen in front of me, I'm not having to be tethered to a device anymore.
And so we see this time and time again, and that trust and empathy is a big part of it. And this is why it's so interesting for people working in the voice space, in the voice acting and voiceover space in particular: one of the big things that really builds trust in these devices is that empathy, that warmth, the right character and profile of voice for the right utility. What we're beginning to see, particularly on the technology side, is both the creation of synthetic voices that can be based upon real voice actor profiles, or the creation of entirely new AI generated voices, to try and fit those different use cases. But we're also seeing far greater utility being provided by those platforms to make different voices for different use cases. So we're no longer tethered to just that one voice that carries through all of our experiences when we interact with this technology, and that's what's helping build that trust.

Stephanie Ciccarelli:

Wow. Yeah, that is definitely interesting. It's these sorts of, I guess, questions around empathy. With the voice talent in the studio, can they infuse that empathy into the read, or is this something that is being created by a machine? What do you think?

James Poulter:

Well, what's really interesting is we're seeing a couple of major trends developing in this space, and I mentioned before this idea of synthetic voices. And that doesn't necessarily mean that those voices are entirely AI generated. In some cases they're going to be coming from voice actors like many of you who may be listening to this podcast, who are going to create samples and create voice models that will allow your voice to be carried forward into new and different use cases, and maybe the days of having to jump back into the studio to do pickups will become a thing of the past. But what we're also beginning to see is that some of these synthetically trained voices can still require real voice actors to bring them to life. And what I mean by that is, let's say, for example, you spent some time synthetically duplicating Morgan Freeman's voice, right? Now, you can obviously write text and have that spoken out in his lovely deep voice. I can't do the accent, but you can have a go at doing that. But what we can also do now is speech-to-speech synthetic translation, which is really fascinating: we can take these synthetic voice models and actually have them brought to life by voice actors who can embody those voices. Kind of think of it like going into a game and putting on a new skin or a new avatar. You're still the one playing, but you look different. And it's the same thing here: you play the way you would play, act the way you would act, speak the way you would speak, but you're brought to life with a different voice. And that's where a new generation of people may also begin to come into this arena.

Stephanie Ciccarelli:

That's wild. You know, the first thing I think of when you say that is, okay, I know what motion capture is. You're the one moving in there, but it shows potentially a different body. Let's say Lord of the Rings, and go back to Gollum: you've got Andy Serkis running around with little dots, but you don't see Andy, you see Gollum. So it's the same thing, but with your voice.

James Poulter:

Yeah, that's a great example and a very good analogy for it. It's very much that principle: we can begin to puppeteer the voices of other people that have been previously created as well. So it's not just about having your voice duplicated, but maybe being able to manipulate and manage the voice of somebody else.

Stephanie Ciccarelli:

That is just, like, out of this world, James. People listening, I know, my head is just going, ‘this is too much, too much.’ But this is where it's all going. And there's all kinds of interesting ways that voice is being used. And puppeteering with a voice, I don't know, it's like where you see Big Bird and someone's in the suit and they're doing…

James Poulter:

But that puppeteering thing is fascinating. Maybe just to give another example of why we're excited about this. I can imagine many people worrying about this, maybe asking, is this a threat to my job? Is this going to take over? But actually, what we're beginning to find is that many voice actors, and also professional celebrities and people like that, are able to reach new audiences entirely because of this technology. In particular, take a famous voice actor that you know of, let's say Hugh Jackman, for example, right? His voice is kind of very obviously Australian, but he does some interesting accents when he tries to do different parts. But he can't speak Spanish. He can't speak Swahili. And with this technology, we're now able to synthetically duplicate his voice, and not just allow it to sound like him speaking with a funny accent, but actually make him speak French. We can make him speak Spanish, and it sounds like him speaking French, and it sounds like him speaking Spanish, even though he doesn't know these languages. We've got examples of voice actors and podcasters and celebrities now having their voices synthetically duplicated. We're even working on a program right now for a major pastor in North America to have his voice synthetically duplicated so that his sermons on a Sunday can be translated into Spanish for his Spanish-speaking audiences, and have them listen like it's him speaking Spanish for the first time. So you can imagine what this might mean for everybody from individual influencers to CEOs to your customer services reps, all the way through, obviously, to dubbing and subbing out content into languages that maybe you as a voice actor can't speak. The opportunities there are huge.

Stephanie Ciccarelli:

Wow. Okay. Everything you just said is going to hit me like a big tidal wave here. So I'm just like, whoa. Because I know that sometimes there's nuance between languages and so on. And if you were to have a voice filter, say, Spanish for a Hugh Jackman, and it's not to say that film is less important, but let's just say you've got Wolverine doing whatever and he's talking in Spanish. It's not nearly as bad if there are a few slip ups in language or the wrong word is used for him in that situation. But if you were to take, say, that pastor you just talked about, and if something were to be out of context or not the right word, that could get pretty dicey. How do you ensure the integrity of the translation?

James Poulter:

Well, obviously, the integrity of the translation still comes down to traditional translation services and making sure that that content has been written in the right language, right? So when we translate it, we're not just sticking it into Google Translate and hoping for the best.

Stephanie Ciccarelli:

Oh, good!

James Poulter:

Right, so the translation element of it definitely needs to be managed. But once we have that script in the right language, we then have this ability to put it through that synthetic voice model and generate the audio that sounds like that person speaking that language correctly. We do a lot of work with our partners, Veritone, who I'm sure many listening may be familiar with, and using their Veritone Voice software we're beginning to do this for projects on behalf of some Vixen clients right now. And one of the things I love about that is that it has this integrity built into it: everything that comes out of the platform is digitally watermarked so we can track it. We know that it's been created safely by that voice model inside the platform, and has been verified by the person that created that voice model or gave us permission to use it. So that's one of the benefits we have of using that solution. But whether you're doing this yourself as an individual or doing this on behalf of a brand, you can imagine the opportunities this gives to scale different voices into jurisdictions that never would have had content before in their own language, spoken by the actor that they know and love, whatever voice or whatever language that person originally spoke. It opens up so many possibilities.

Stephanie Ciccarelli:

I just thought of a job that could be lost, actually, as you're saying this. It's the talent who are the voice of Brad Pitt or the voice of Julia Roberts in a given country. Is that going to affect them, the dubbing talent?

James Poulter:

For established actors and established acting talent, many of them, as you say, have the Brad Pitt of Mexico or wherever, and that voice is now so recognizable in that market that you wouldn't swap it out. You wouldn't go changing that at this point. But I do think it means something for new talent coming through, maybe where that's never been created, and particularly for languages that often get forgotten. I have many colleagues across Europe, friends and colleagues over in the Netherlands, for example, or in Sweden or Denmark, who say, ‘do you know what? Most of the time people just don't bother subtitling or dubbing into our languages. We just get the English version or something close.’ That opens up the opportunity for us to actually make content so much more accessible, and also the content that wouldn't usually get that service, right? Yes, TV and movies we take for granted; we know that often comes with the rights and publishing. But think about all of the blog posts that are being written. Think about the thousands and thousands of podcasts that are being created every single day, which in the majority are often only ever heard by people that speak the mother tongue of that author or actor. We now have the ability to take content that would usually be seen as too cheap or too simplistic, that would never get taken into voice at all, even in English, and not only get it into English, but, obviously speaking to you guys in Toronto, right? If you want to have the stuff in French and English, that's often a problem. Many content creators create for one or the other, and it's really expensive and time consuming to do both, particularly for content that's created so frequently, like news bulletins, podcasts, blog posts, even tweets.
We can now begin to have those interactions in language while maintaining the empathy and integrity of the voices that originally created them in their mother tongue. I think that's what makes it so exciting.

Stephanie Ciccarelli:

Wow, that is wild. And just thinking, yeah, I'm sure that technology will come along. We know that entertainment is kind of like the final frontier, if you'll excuse the phrase. But I just think about when we have a synthetic voice or an AI voice working in a field that is traditionally dominated by live actors, like entertainment, animation in particular, audiobooks. I know with audiobooks, a lot of people just dread the day. They don't even want to see the day where they're listening to an AI voice reading an audiobook, because to them it's just, ‘No. That's not storytelling.’ So are there some limitations here, James, that, you know, are inherent to this?

James Poulter:

I think what we're seeing is that for now, the majority of this technology is best deployed around short form pieces of audio content, usually in the sub-few-minutes range. Of the examples we have seen, particularly on the translation front, people are doing this for podcasts and kind of half hour show type things, but that's usually for that translation use case, rather than synthetically generating large chunks of audio that don't already exist in an English audio format, for example. Audiobooks is an interesting one, because it depends on the format of the audiobook. I agree, narrative fiction is quite difficult. Even much nonfiction is not necessarily a good fit either. But many textbooks, for example, or other forms of educational content that are much more short form or bite sized, actually, we can see this being used for, and the fidelity that we can get these voices to now is often good enough for those examples. Where I actually think we're going to see this really come into its own is both in the corporate sector and also on the kind of written web, where much content wasn't being turned into audio. We've seen more and more examples of people embedding these native audio players into their websites so that they generate a synthetically read audio version of an article, for example, or a blog post, an instructional video perhaps, those types of pieces of content. You can begin to see why that might be really useful. I think we're going to see massive use of this by influencers and YouTube creators who are creating short form video content, but often only in one language, or where they want to generate many versions. I don't know, say you make demo videos. I've got a pair of Bose headphones on my desk. No credit to Bose, but I'll always take an extra pair if you're sending them. How many different pairs of headphones are there? How many different car versions are there out there?
Perhaps we need to bring lots of these different instruction manuals or other pieces of content to life. So I think short form content in particular. But as the fidelity gets better, there is no reason not to believe that we will get to a state of play where longer form content could be generated, even in fiction. One of the things I mentioned at the top of the discussion was the metaverse. I think this is an area where we're really going to see this stuff come into its own. We're not going to type into the metaverse, we're not going to read the metaverse, we are going to talk to it and listen to it. And if we really believe in that future, whether that's an augmented reality one where we're wandering around with AR glasses on, or in VR headset experiences, these are going to be experiences where we are talking to and interacting with many types of individuals, be they synthetic or real. And there will be a lot of audio content that needs to be made available. And if you want major services, things like financial services, things like healthcare, things like governmental services, to be viable in that environment, you have to make them available to all of the people that need to interact with them. And therefore, the content that previously has lived on a web page, which we often leave to Google to auto-translate, is going to have to be brought to life vocally, in an audible fashion, in these experiences, because they are for all intents and purposes 3D, voice driven experiences, right? And whoever you are, whether you're McDonald's or the Quebec government, through to whoever it is that's managing your local takeaway, they all need to be able to interact with many different voices at scale.
And asynchronously, right? You're not going to have a bunch of people sat there staffing the virtual phone lines of the metaverse; these things are going to have to be driven by AI experiences, otherwise they will never scale beyond the time available to the individuals that can be employed to be there. So all of these experiences will rely upon us having these AI driven voice experiences. And that's where I think we'll particularly see the synthetic creation of not only voices, but avatars and everything else, really begin to take flight.

Stephanie Ciccarelli:

Yes. Just yesterday, when I was preparing for this interview, I was looking through YouTube videos of famous computer voices from before, like from Star Trek. We have Majel Barrett-Roddenberry as the voice of the computer in Star Trek, and in the earlier clips she definitely sounded more robotic. But by the time you've got her talking to Data, or DAH-ta, however you say it in this series, that's always the question, how do North Americans say it? Exactly. But these AI voices are getting better and better all the time. That's just a little example for the Trekkies out there who might be like, 'oh yeah, I remember the computer voice progressed in certain ways.' But there are some telltale signs, as you've said, that this technology is not exactly on par with the human voice. So what are those signs that people whose ears are keenly attuned can listen for, to know whether a voice is real or synthetic?

James Poulter:

Well, some of the things we find are that certain accents and inflections are a little bit hard to duplicate. If someone's got a particularly pokey version of an accent, or maybe they've spent half of their year in North America and the other half in Australia, you're going to find that harder to duplicate, because those sounds are just not going to come out well. Also, most of the voice models and language models that are trained with a lot of training data tend to be trained not on the full vocabulary of the English language, but usually on the 4,000 or so words that we use most of the time. And so when you train these models, it means they're not always able to pick up on things like synonyms and homonyms, but also on difficult or uncommon words. I mentioned this example of a project we're working on with a church right now. There are some pretty long lists of names in the Old Testament, which anybody with English as their mother tongue should have a go at trying to read out loud, let alone asking a computer to do it effectively the first time out. So there are going to be words in the language that are harder to get right, though many regular people can't get those right either, so it's worth watching out for. There are definitely areas where this will take time to improve. And usually you can tell the better-trained models. Look at how good your Alexa or your Google Assistant is at talking back to you: that's because they're being trained on millions of people's voice data, speaking to them every single day in a whole variety of languages. When you interact with a voice that's been created in a studio, maybe for a single use case, it's probably not going to be as good, because it simply hasn't been trained on the same level of data. So there's definitely a sliding scale on how good these things get.
But we're certainly seeing the cost of entry coming down and the use cases going up. And you're going to find it harder and harder, I think, to tell the difference between the two, to the point where even Amazon themselves are beginning to make this available in some form of consumer-accessible version, with us being able to duplicate our own voices. Take my colleague Rich Merrett, who goes by EchoDad on YouTube. He reviews all sorts of things you can do with your Echo. He's done versions where he's read bedtime stories for his kids, and when he's away, the kids try to guess whether or not it's really Dad's voice or the synthetic one.

Stephanie Ciccarelli:

Oh, wow. That would be a fun game.

James Poulter:

We'll see consumer applications of this as well, I think, in due course.

Stephanie Ciccarelli:

So what you're saying is that when it comes to reading genealogies and that sort of thing, the AI voices right now are no Max McLean or David Suchet. Is that fair?

James Poulter:

Yeah, David Suchet is kind of the high bar that we go for here in the UK; he reads the NIV translation in the YouVersion Bible App, so it's definitely worth a listen. I mean, we know that they can get close, and they are getting closer all the time, so we're looking at whether we can get to that stage. Particularly in that longer form audio it's harder to get your head around, but it's just a matter of time. These robots are pretty crafty.

Stephanie Ciccarelli:

Yeah. Oh, tell me about it. The robots are coming! Like Chicken Little running around crying that the sky is falling, we've been hearing 'the robots are coming' for so long. And I think there's just something that can't be replicated, frankly, by these robotic, synthetic voices: we as humans have a soul. There's something that these machines do not have that humans are clearly endowed with. So I think we're safe to say there's going to be a role for voice actors in what we're doing…

James Poulter:

Let me be clear, voice actors are not going anywhere. If anything, they become all the more important, both in terms of bringing to life those experiences that cannot be replicated, and in helping us create the best voices that we need out there. Because every voice is unique, and so every synthetic voice also needs to be unique, right? No brand out there wants to just pull down a stock voice. It's like when you see a brand logo that's been set in Arial or Times New Roman: you're like, 'Guys, you could have done better.' Right? So whether a voice is synthetically duplicated or voice-acted in the first instance, we're going to need these unique voices, which we all have, to generate the next slew of artificial voices. There's a scale that becomes available to us with the technology backing us, but we still need the heart, the empathy, the originality that comes from original voices to power that next generation of AI voices.

Stephanie Ciccarelli:

Yeah, absolutely. And I was just talking to Bev Standing the other day, a voice artist who has been involved in this kind of work, inadvertently in her case: she became the voice of TikTok without knowing it. That's a whole other story. But this idea of voice cloning is something that's very important right now for talent to understand. It's an opportunity, and a lot of them are getting into it. Even Bev says that she's still doing work in the AI space, just selectively. But what can you tell us about voice cloning and what goes into it?

James Poulter:

Well, voice cloning really is the process that we've been discussing. If you were to come to Vixen Labs and say, 'I want to clone my voice,' it's not the cheapest thing to do right now, but it is becoming increasingly available to many people. Typically we'll need, first of all, your signed consent to having your voice recreated. We will then need some training data: perhaps podcast audio that you already have, or show reels or other existing content you've created, or we can ask you to read lots of scripts to get you to the point where there's enough audio. Usually, with between two to three hours' worth of spoken audio, we can duplicate a voice to a very high level of fidelity, but we can do it with as little as two to three minutes of spoken audio to generate a low-fidelity model. Once that's been done, you would have access yourself to generate audio out of the platform based upon your own voice, and then perhaps begin to provide that content to people at the other end. So the process of voice cloning really is something that's entirely owned and driven by the original creator, or whoever that creator's paymasters are, I suppose, if you're already tied into a brand contract or something along those lines. But it's something that very much can be controlled by that individual, and it absolutely requires that individual's permission in the first place, because consent and control are such important elements. You've heard this term 'deepfakes' before.
We want to create ethically driven deepfakes: voice clones that we can all have confidence and trust in. Again, this is why we at Vixen, and Veritone, who we partner with, work with people like the Open Voice Network to make sure that these tenets of openness, trust, privacy and security remain paramount in all that we're doing in this space. Because this AI market is broadly unregulated, and we need to ensure that the community itself puts great standards in place around it.

Stephanie Ciccarelli:

Right. And for anyone who's listening and heard 'Veritone' and thought it sounded familiar, dear friends, that's because Veritone recently acquired VocaliD, which, if you know Dr. Rupal Patel, is her company. Rupal is wonderful, and I'm really happy they've taken this neat step in their progression. A lot of talent in the community know who Rupal is, so I just wanted to make sure we mentioned that. She's very much versed in this whole idea of replicating voices for various reasons. So we've talked about what goes into voice clones and why, and who's using them. I'm guessing these are mostly brands, but obviously, if you're a voice artist and you want to clone your own voice and sell it as an AI solution to any client who might want to buy it, that's an option as well. I guess you just have to go to the right place to get it done.

James Poulter:

Yeah, absolutely. And if anyone wants to get in touch and find out more about how to do so, we're more than happy to guide you through those steps and take you through what's required to generate that voice and then manage it on an ongoing basis. Because we are going to see people want to get into this themselves, to get ahead of the game, particularly if they're tied into pre-existing contracts. Just read the fine print carefully, though I suppose that's always good advice. Or work with talented folks like yourself to make sure they get selected in the first place. I'm sure this will become an ongoing part of everyone's selection process, thinking about what rights they're giving over, and I think we're going to see synthetic voice cloning rights begin to show up in contracts for voice actors, if they're not already doing so. I mean, Stephanie, you know better than I, but I think that's an inevitability. As the platforms that make this stuff happen, whether that's the folks at Veritone or other competitors in that space, become more consumer-available, people are going to start doing this themselves and managing their virtual presence themselves. And I believe they should, because our voices are unique identifiers. Your voice is something that is uniquely yours, and I believe we should hold the rights to and management over these things. But if you work with the right partners, you can maintain that right and that ownership while also leveraging the great asset you have as a voice actor when you take it into the synthetic space.

Stephanie Ciccarelli:

Right, that's so key. Everything you've said about having contracts in place and understanding: how is my voice going to be used, where is it going to be used, by whom is it going to be used? Because there's all sorts of things that could happen with your voice, just like what happened with Bev: all of a sudden, one day she's the voice of cats on TikTok videos and has no idea until people tell her, and she's like, wait a minute. So there are likely ways you're aware of for people to safeguard or protect themselves in this process. Is that embedded in your process, a way to make sure that the talent and the brands you're serving have a full understanding of what they are agreeing to and how things should work?

James Poulter:

Particularly if brands are coming to us and saying they want to create a synthetic voice, we'll always want to review the pre-existing contracts they have with voice actors, and if the provisions aren't there, recommend they put those provisions in place. Whereas if voice actors are coming to us themselves, that's great: those are their own rights to manage, which is fantastic. This area is still very much in its evolutionary stage. In some cases we're beginning to see people allow synthetic voice creation with caveats: you can create a synthetic voice, but you can only use it for this project or for these purposes, following very similarly the standard contract terms you would always expect for talent. But we are also beginning to see some people say, 'hang on a minute, what if my rights to this have been bought in perpetuity,' which I would never particularly recommend in the first place. But if they have, does that mean my synthetic voice is also available for all use cases? Of course the answer should be no. So, again, try to put all of the same standard checks, balances, caveats and considerations into contracting your synthetic voice as you would your own. Whether you can command the same fees and charges is a different matter, very much because there is obviously a time element being saved. But particularly if you've invested that time yourself up front to create that voice, then there's something to be considered there too. So it's always a balance. And when people come to us at Vixen, we'll manage that process between the brand, the voice actor, our platform partners at Veritone and that use case, and try to come to agreements that are fair and equitable for all.
Because, as I say, voices, whether they are synthetically created or originally actor-owned, are unique identifiers, and they carry weight and value that we should treat with respect, honoring the work of the great actors who come forward to do this.

Stephanie Ciccarelli:

Absolutely. So we should assume that licensing is always a part of this. It should never be in perpetuity, because your voice could end up somewhere it shouldn't. Say, for argument's sake, you're the voice of Coca-Cola, and then all of a sudden your synthetic voice is running around doing its own little thing and someone's got it voicing Pepsi. You're like, 'oh my goodness, this is the worst possible thing to happen,' to have two brands that are head to head with their products sharing the same voice, or the market being oversaturated with a voice that shouldn't be there. I guess what I'm getting from you, James, is that it's a bit of a Wild West, that the policing of this isn't quite figured out yet, beyond someone hearing your voice being used the wrong way.

James Poulter:

I think it's true that the policing of it hasn't been figured out entirely, but it's not quite the Wild West yet, because it just isn't happening at that scale right now. There's still quite a high barrier to entry, cost being the main one, and there's also just the question of which brands are technologically advanced enough to know this is even an option they should be considering. Right? Those things are helping manage that tide. I think it's really important that we see really great governance around this stuff as we begin to put these things in place. Again, it's why we work with our partners to make sure this is done in a safe and secure way, on trusted platforms that users have the option to opt out of. I think we may see a time when certain actors and voice talent, or even just individuals, begin to put their voice prints, their synthetic voices, into the public domain. I think there's a possibility that may happen in the future. And I also think we'll see, particularly with the evolution of NFT and blockchain technology, people beginning to capture their voice prints as what I've often termed NFVs: non-fungible voices. Voices that come with smart contracts that can be managed, that can be used as biomarkers and identifiers, but also for content creation in synthetic and digital spaces, where you can manage your own identity and market it, because at the moment there isn't a marketplace for it. So I think we'll see new technology solutions come along to help us govern that space as well.

Stephanie Ciccarelli:

Wow. NFVs! Okay. That's a term I'd not heard before today, James Poulter. So you've probably got that one coined, I don't know.

James Poulter:

I’ll take that.

Stephanie Ciccarelli:

That's awesome. You know what? There's so much left on the table to talk about and we're just going to have to have you come back.

James Poulter:

Well, I would absolutely love to do that. Yeah. And by all means, carry on the conversation. If people are interested in learning more, the best place to start is really with the Voice Consumer Index that we spoke about at the start. We've done this report for a few years now and made it publicly available for free for people to go and download and digest, because we truly believe this is, as I said, a really important tipping point for the industry as this becomes more commonplace. So, yeah, happy to dive into more of those findings with you another time, Stephanie, but people can do so themselves right now.

Stephanie Ciccarelli:

Absolutely. So, James, what's your website? How can people go and get that great document? We'll link to it from our show notes, but I'd love for you to give that information now.

James Poulter:

Yeah, sure. So if you're listening, just go to www.vixenlabs.co (not .com, not .co.uk; we try to be cool like that). Head there and you can download the report today. There's an executive summary document you can download, you can sign up for upcoming webinars to digest this information yourself, and you can also drop us an email at info (at) vixenlabs.co if you'd like to arrange an opportunity for us to talk it through for a customer, a client or a brand. So if you want to schedule a direct session, you can do that, too.

Stephanie Ciccarelli:

Outstanding. Thank you so much, James, for being on the show.

James Poulter:

Oh, such a pleasure. Thanks so much for having me. It's always fun to dig into this topic in much more detail, and I'm excited to see when the next time we chat, how far it's come.

Stephanie Ciccarelli:

And that's the way we saw the world through the lens of voiceover this week. Thank you for joining me here on Vox Talk. This was such a great, informative show. We had James Poulter. Thank you so much for coming on and sharing all of this amazing information. I don't even know the number of times I just stopped and had to process what James was saying, and I typically process right in the moment, but it's going to take me a while to come down after this talk. I hope you all enjoyed it and got a lot out of the show. We'll be sure to have James back on. For Voices, I'm Stephanie Ciccarelli. Vox Talk is produced by Geoff Bremner. You've been listening to Vox Talk, and we'll see you next week.

Stephanie Ciccarelli
Stephanie Ciccarelli is a Co-Founder of Voices. Classically trained in voice as well as a respected mentor and industry speaker, Stephanie graduated with a Bachelor of Musical Arts from the Don Wright Faculty of Music at the University of Western Ontario. For over 25 years, Stephanie has used her voice to communicate what is most important to her through the spoken and written word. Possessing a great love for imparting knowledge and empowering others, Stephanie has been a contributor to The Huffington Post, Backstage magazine, Stage 32 and the Voices.com blog. Stephanie has been named to the PROFIT Magazine W100 list, a ranking of Canada's top female entrepreneurs, three times (2013, 2015 and 2016), and is the author of Voice Acting for Dummies®.
