Meryl Streep is pitch perfect as the narrator of the Nora Ephron novel Heartburn. In the audiobook version, Streep’s classic delivery brings alive the emotional turmoil as well as the self-deprecating wit of Rachel Samstat, who has just found out about her husband’s affair. In the Harry Potter audiobooks, it’s singer-actor Jim Dale who creates the magic.
Now let’s say you are discomforted by American and British accents. You prefer to hear Heartburn and the Potter books in voices you can relate to. You want to switch the genders of the narrators. You want the Muggles speaking in the voice of your favourite Bollywood actor. You want to be the narrator.
Those are real options a Bengaluru startup expects to offer as it develops an artificial intelligence model for cloning voices. It reckons there is massive business opportunity in impersonating voices, and not just from the growing popularity of audiobooks. Think voiceovers for ads, narrations for education-technology platforms, real-time translations, automated responses, voice assistants, smart speakers.
The startup, Deepsync Technologies, is working on an AI platform that can clone a voice within hours for a fee. Based on the recordings of a consenting person, the platform creates a voice model that can reproduce any text fed to it. So, say, if Shah Rukh Khan can have his voice cloned by learning platform Byjus, for which he is the brand ambassador, the company can use that as the default voice for all its lessons.
“For now, we require two hours of audio data of a person to train the model, and then the voice cloning takes some time. But we are working on the next iteration of our system, which will be able to do the cloning in near real-time and with better accuracy,” says Ishan Sharma, CEO and cofounder of Deepsync.
We gave Deepsync’s AI model a shot. Not Meryl Streep and Jim Dale level of emoting, but then, the cloned voice sounded almost real, not robotic.
Building an AI voice
One seed for this idea came from the demand for audio content that Sharma witnessed while at his previous job at a SaaS-based video production company. “There was constant demand for audio content, especially given that consuming audio content does not require you to be glued to a screen,” he says. “You can be mobile or doing another activity physically while consuming this content.”
He later met his cofounder Rishikesh Kumar on GitHub, where they connected over their mutual interest in voice synthesis. While researching the topic, the duo realised there existed a whole lot of good-quality content that could be consumed as audio. But then, voice was a problem.
“Most traditional text-to-audio engines sounded like robots and we saw that there was a demand for realistic-sounding voices that could be synthesised by these systems,” says Sharma.
Ishan Sharma and Rishikesh Kumar of Deepsync
For the synthesised voice to sound realistic, Deepsync’s AI model is trained on a minimum of two hours of audio recording of the voice to be cloned. The data is checked for quality and fed into the system along with a textual transcript. The company can generate the textual transcript if it is not available. The AI system then maps the audio and textual data to match the pronunciation and enunciation of the narrator. And a voice model is created.
The more audio data that is available, the closer the cloned voice will sound to the real one. Also, adds Sharma, “the more data that is uploaded and the more the cloned voice is used, the quality increases and cloning time reduces.”
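As an illustration only, the data-preparation step described above might be sketched like this in Python. The `Clip` record, the `summarise_dataset` function and the field names are hypothetical and are not Deepsync's code; only the two-hour minimum and the pairing of audio with a transcript come from the article.

```python
from dataclasses import dataclass

# The article's stated minimum: two hours of audio per voice (in seconds).
MIN_TOTAL_SECONDS = 2 * 60 * 60


@dataclass
class Clip:
    """Hypothetical record for one recording and its transcript."""
    path: str
    duration_seconds: float
    transcript: str  # empty string if no transcript was supplied


def summarise_dataset(clips):
    """Report whether a set of clips meets the two-hour minimum and
    which clips would still need a transcript generated for them."""
    total_seconds = sum(c.duration_seconds for c in clips)
    return {
        "total_hours": total_seconds / 3600,
        "enough_audio": total_seconds >= MIN_TOTAL_SECONDS,
        "needs_transcript": [c.path for c in clips if not c.transcript.strip()],
    }


if __name__ == "__main__":
    clips = [
        Clip("chapter1.wav", 3600, "It was a bright cold day in April."),
        Clip("chapter2.wav", 3600, ""),
    ]
    print(summarise_dataset(clips))
```

A missing transcript is listed rather than rejected because, as noted above, the company says it can generate one when none is supplied.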
The business of voice synthesis
We are entering a voice-first world, even in India where voice interaction and audio content consumption are on the rise. Increasingly, many content creators are moving to audio podcasts as platforms such as Amazon’s audiobook service, Audible, and other services seek to grab a chunk of India’s growing audio streaming market.
The avenues where audio tech such as voice synthesis can be deployed are ever increasing, with the voice cloning technology market estimated to touch $1.74 billion by 2023. From generating audiobooks with your favourite voice in real-time, and actors being able to dub in different languages using audio tech, to customising your smart home assistant’s voice, the areas in which audio tech can be put to use are rapidly expanding.
In this growing ecosystem, Deepsync has, for now, decided to focus on enabling companies to quickly convert existing textual content into audio narrated by real voices, and without the need for any actual recording.
“We are currently piloting the solutions with a select group of individuals and companies, including Exploritage, which is a heritage travel company, and a handful of instructors on Udemy,” says Sharma, who cofounded Deepsync along with Kumar in 2018.
The company, which is being accelerated by Hong Kong-based Zeroth, will monetise by charging these creators to model voices. For now, the company plans to charge a flat fee to create a voice model but might explore other avenues to monetise as demand for the tech picks up.
Sounding proper
Audio cloning and synthesised voice technology, though now a hot and emerging space, is not new. Text-to-speech engines, found under accessibility or other options in many software products, are early forms of voice synthesis systems that come with a fixed set of voices. Many of these sound robotic but new systems from the likes of Google sound much more realistic.
Adobe, in 2016, demoed a technology, then titled VoCo, that could clone a person’s voice within just 20 minutes of training the software. There are a handful of companies internationally working on similar technology, including Montreal-based Lyrebird, which claims to be able to clone voices with just a minute of a person’s audio sample. Internet companies including Google and Baidu are also working on AI-based voice synthesis technologies.
One of the biggest challenges for the cloning technology market is reproducing the correct pronunciations and prosody.
Prosody, essentially the patterns of stress and intonation in a language, poses a huge hurdle for voice synthesis systems. Just take the case of spoken English in India, vastly influenced by the over 19,500 regional languages and dialects spoken across the country.
“Every voice has a complex prosody and with the amount of regional languages and dialects that influence the way English is spoken in India, it is a big challenge when it comes to building the AI model,” says Sharma.
Spectrogram of Olivia’s voice. The real human voice on the left and the cloned voice on the right while speaking the same line
C Mohan Ram, managing director of speech technology company Lattice Bridge Infotech, says tech like this will work better in geographies where the number of languages is limited. “We are a multilingual country and that poses a big challenge… (Also,) the same word in English might be pronounced differently as we move across the country, and this is a problem for speech synthesis technologies,” he says.
Another challenge, Ram says, lies in how we normally speak. “AI-based models are grammar- and rule-based but the way we speak is very different from those rules.”
But India’s vast linguistic diversity poses an opportunity as well, with Sharma eyeing the vernacular language market as the next frontier for Deepsync to crack.
In FactorDaily’s trials of Deepsync’s technology, the platform got most of the pronunciations and enunciations right, except in cases involving certain names and punctuation marks. The voice that we tried was modelled using an open speech dataset. Deepsync did show us samples from voice models of some other companies it is working with, but for security reasons, it did not give access to those models.
Tech misuse and consent
As with any other emerging tech, voice synthesis, too, is susceptible to misuse. Last year, an AI-based image reconstruction technology called Deepfakes, created by a Reddit user under the same name, went viral across the internet. It was used to morph faces to create fake celebrity porn videos and revenge porn clips. Some fake clips of Indian celebrities, too, did the rounds on the internet, as FactorDaily noted last year.
A desktop version of the tech called FakeApp soon surfaced, which American actor and comedian Jordan Peele’s production company used to create a fake video featuring former American president Barack Obama. In most such examples, the AI models were trained using publicly available video and audio material.
“(Audio) tech is also susceptible to misuse, especially with the way in which fake news has been spreading. Tech like this is a perfect tool for people creating fake news and other such malicious content,” says Ram.
Sharma says Deepsync is aware of this problem and is putting in place measures to prevent misuse of its platform once it is opened to the public, likely in February.
“Consent is of utmost importance to us and we need explicit consent from the user before their voice is cloned,” says Sharma. “The other thing is that we have a CAPTCHA-like method to ensure that the voice data that is uploaded and the person putting in the clone request are the same, or have explicit written consent to do so.”
On the question of consent, Sharma recollects one instance that could have come straight out of Black Mirror, the dystopian sci-fi television series. “I was at this tech meetup and one of the persons I was conversing with asked me if we could clone his deceased dad’s voice from existing media he had,” he says. Sharma declined the request. But he knows that at the rate at which voice synthesis tech is developing, it will not be the last such request.
Via Factor Daily