Meta, the technology conglomerate formerly known as Facebook, has introduced its latest innovation – SeamlessM4T, a multimodal AI model designed for speech and text translation. The neural network can process both audio and text, enabling it to perform a range of translation tasks – text-to-speech, speech-to-text, speech-to-speech, and text-to-text – across nearly 100 languages. The primary aim of SeamlessM4T is to facilitate more effective communication between people who speak different languages.
In a move consistent with its relatively open approach to AI, Meta is releasing SeamlessM4T under a research license (CC BY-NC 4.0), allowing developers to build upon its framework. Alongside this, Meta is also introducing SeamlessAlign, which the company proudly claims is “the largest open multimodal translation dataset to date, encompassing a staggering 270,000 hours of mined speech and text alignments.” This comprehensive dataset is poised to jumpstart the training of future translation AI models by researchers outside of Meta.
Prominently featured in Meta’s announcement blog, SeamlessM4T offers speech recognition, speech-to-text translation, speech-to-speech translation, text-to-text translation, and text-to-speech translation. The text translation functions support nearly 100 languages, while the speech output functions cover 36 output languages.
Drawing a playful reference to the Babel fish from Douglas Adams’ The Hitchhiker’s Guide to the Galaxy, Meta’s announcement likens SeamlessM4T’s instant translation capabilities to the fictional fish that could translate any spoken language when placed in one’s ear.
The training process began with Meta’s researchers creating SeamlessAlign, a multimodal corpus of automatically aligned speech translations totaling more than 470,000 hours. After filtering and the addition of human-labeled and pseudo-labeled data, this was reduced to a training subset of 406,000 hours.
While Meta remains somewhat vague about the sources of its training data, it is known that the text data was drawn from the same dataset used for NLLB (No Language Left Behind), Meta’s earlier text translation model, containing sentences from sources such as Wikipedia, news outlets, and scripted speeches. The speech data, on the other hand, originated from a pool of 4 million hours of raw audio sourced from a publicly available repository of crawled web data, of which Meta used 1 million hours of English audio for training.
Meta is not the first AI company to apply machine learning to translation – Google Translate has used it since 2006, and large language models such as GPT-4 can also translate text – but SeamlessM4T pushes into newer territory in audio processing, extending multimodal translation to a diverse array of languages. Moreover, Meta asserts that SeamlessM4T’s “single system approach” – a unified AI model rather than a chain of separate models – reduces errors and improves translation efficiency.
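To see why a chain of separate models can compound errors, consider a back-of-the-envelope illustration (the numbers and the independence assumption are purely hypothetical, not from Meta’s paper): if a cascaded speech-to-speech pipeline runs three stages in sequence and each stage is 95% accurate, the end-to-end accuracy drops well below any single stage’s.

```python
# Toy illustration (not Meta's method): error compounding in a cascaded
# speech-to-speech pipeline versus a single end-to-end model.

def cascade_accuracy(stage_accuracies):
    """End-to-end accuracy when independent stages are chained:
    an utterance comes through correctly only if every stage succeeds."""
    result = 1.0
    for acc in stage_accuracies:
        result *= acc
    return result

# Hypothetical three-stage cascade: speech recognition -> text translation
# -> speech synthesis, each assumed 95% accurate and independent.
cascaded = cascade_accuracy([0.95, 0.95, 0.95])
print(f"cascaded end-to-end accuracy: {cascaded:.3f}")  # ~0.857

# A unified single-system model has only one place to fail; if it matches
# the per-stage quality, it retains the full 0.95.
print(f"single-system accuracy:       {0.95:.3f}")
```

Under these assumed numbers the cascade lands around 86%, versus 95% for the unified model – a simplified view, but it captures the intuition behind Meta’s single-system claim.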
For those seeking more in-depth technical insight into how SeamlessM4T works, Meta has published details on its website. In addition, the code and the trained neural network weights can be downloaded from the Hugging Face platform. This latest stride in AI-powered language translation reaffirms Meta’s commitment to pushing the boundaries of the technology.
By Impact Lab

