Recording an audiobook is no easy feat, even for seasoned voice actors. However, the growing demand for audiobooks, coupled with major streaming platforms like Spotify dedicating space to them, has led MIT and Microsoft researchers to embark on an innovative project: using artificial intelligence (AI) to transform online texts into audiobooks. In collaboration with Project Gutenberg, the world’s oldest and perhaps most extensive repository of open-license ebooks, they aim to generate 5,000 AI-narrated audiobooks. The collection includes literary classics such as “Pride and Prejudice,” “Madame Bovary,” “The Call of the Wild,” and “Alice’s Adventures in Wonderland.” The researchers recently published an arXiv preprint detailing the effort.

Mark Hamilton, a PhD student at MIT’s Computer Science & Artificial Intelligence Laboratory and a lead researcher on the project, explains their goal: “What we wanted to do was create a massive amount of free audiobooks and give them back to the community. Lately, there’s been a lot of advances in neural text-to-speech, which are these algorithms that can read text, and they sound quite human-like.”

The key ingredient enabling this venture is a neural text-to-speech algorithm that is trained on millions of examples of human speech and learns to reproduce it. The algorithm can produce diverse voices with various accents and in multiple languages, and can even clone a custom voice from just five seconds of audio. Hamilton elaborates, “They can read any text you give them and they can read them incredibly fast. You can give it eight hours of text, and it will be done in a few minutes.”

Crucially, the algorithm can discern nuances such as tone and the adjustments humans make when reading aloud. It knows how to render items like phone numbers or website addresses, how to group words together, and where to insert pauses. The technology builds on earlier work by some of the paper’s co-authors at Microsoft.
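To illustrate the kind of text normalization described above (this is a toy sketch, not the researchers’ actual pipeline), the snippet below expands phone numbers into digit-by-digit speech with pauses between groups and rewrites web addresses so a synthetic voice reads them naturally; the patterns and function name are illustrative assumptions:

```python
import re

def normalize_for_tts(text):
    """Expand items like phone numbers and URLs into their spoken forms."""
    digit_words = {
        "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
        "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
    }

    def spell_digits(match):
        # Read each digit separately, with a comma (a pause) between groups.
        groups = re.split(r"[-.\s]", match.group(0))
        spoken = [" ".join(digit_words[d] for d in g) for g in groups]
        return ", ".join(spoken)

    # Phone numbers such as 555-867-5309 become digit-by-digit speech.
    text = re.sub(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b", spell_digits, text)
    # Web addresses: say "dot" instead of ".", e.g. "gutenberg.org".
    text = re.sub(r"\b(\w+)\.(com|org|net)\b", r"\1 dot \2", text)
    return text

print(normalize_for_tts("Call 555-867-5309 or visit gutenberg.org"))
# Call five five five, eight six seven, five three zero nine or visit gutenberg dot org
```

Real systems handle far more cases (dates, abbreviations, currencies), but the idea is the same: rewrite text into the words a human narrator would actually say.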

Hamilton draws a parallel with large language models, noting that both rely on machine learning and neural networks. While large language models predict missing or upcoming words in text, neural text-to-speech algorithms take in text and transform it into sound, aiming to produce audio that faithfully corresponds to the input text.

“They’re trying to generate sounds that are faithful to the text that you put in. That also gives them a little bit of leeway,” Hamilton adds. “They can spit out the kind of sound they feel is necessary to solve the task well. They can change, group, or alter the pronunciation to make it sound more humanlike.”

To evaluate the model’s performance, the researchers use a loss function, a measure of how far the generated audio deviates from what it should sound like. Deploying AI in this way promises to accelerate projects like LibriVox, which currently relies on human volunteers to record audiobooks of public-domain works.
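One common form such a loss function takes in speech synthesis, offered here only as an illustrative sketch rather than the paper’s actual objective, is the mean absolute difference between spectrograms of the generated and reference audio. In the toy NumPy example below, a near-perfect synthesis scores a much lower loss than unrelated audio:

```python
import numpy as np

def spectrogram_l1_loss(predicted, target):
    """Mean absolute difference between two (frames x mel-bins) spectrograms.
    Lower values mean the generated audio more closely matches the reference."""
    return float(np.mean(np.abs(predicted - target)))

rng = np.random.default_rng(0)
target = rng.random((100, 80))                       # reference: 100 frames, 80 mel bins
close = target + rng.normal(0, 0.01, target.shape)   # near-perfect synthesis
far = rng.random((100, 80))                          # unrelated audio

print(spectrogram_l1_loss(close, target) < spectrogram_l1_loss(far, target))  # True
```

During training, minimizing a loss like this is what nudges the network toward audio that “sounds like” the text it was given.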

By Impact Lab