Microsoft researchers have announced a new application that uses artificial intelligence to ape a person’s voice with just seconds of training. The model of the voice can then be used for text-to-speech applications.
The application, called VALL-E, can synthesize high-quality personalized speech using only a three-second enrollment recording of a speaker as an acoustic prompt, the researchers wrote in a paper published online on arXiv, a free, open-access archive for scholarly articles.
Existing programs can already cut and paste speech into an audio stream, converting typed text into a speaker’s voice. However, such a program must first be trained to emulate the person’s voice, which can take an hour or more.
“One of the standout things about this model is it does that in a matter of seconds. That’s very impressive,” Ross Rubin, the principal analyst at Reticle Research, a consumer technology advisory firm in New York City, told TechNewsWorld.
According to the researchers, VALL-E significantly outperforms existing state-of-the-art text-to-speech (TTS) systems in both speech naturalness and speaker similarity.
Moreover, VALL-E can preserve a speaker’s emotions and acoustic environment. So if a speech sample were recorded over a phone, for example, speech synthesized from text in that voice would also sound as if it were coming through a phone.
