Emerging speech neuroprostheses may offer a way to communicate for people who are unable to speak due to paralysis or disease, but fast, high-performance decoding has not yet been demonstrated. Now, transformative new work by researchers at UCSF and UC Berkeley shows that more natural speech decoding is possible using the latest advances in artificial intelligence.
Led by UCSF neurosurgeon Edward Chang, the researchers have developed an implantable AI-powered device that, for the first time, translates brain signals into modulated speech and facial expressions. As a result, a woman who lost the ability to speak due to a stroke was able to speak and convey emotion using a talking digital avatar. The researchers describe their work in a study published today (Wednesday, Aug. 23) in the journal Nature.
Study co-author Gopala Anumanchipalli, assistant professor, and Ph.D. student and co-lead author Kaylo Littlejohn, both from UC Berkeley’s Department of Electrical Engineering and Computer Sciences, discussed this breakthrough study with Berkeley Engineering. The following Q&A has been edited for length and clarity.
This study is groundbreaking in many ways. What was your role and what did you set out to do?
Gopala: There is a decade-long history behind this project. When I was a post-doc in Edward Chang’s lab, we were on this mission to both understand the brain function underlying fluent speech production and also translate some of these neuroscience findings into engineering solutions for those who are completely paralyzed and are communication disabled. We investigated ways to do speech synthesis from brain activity recordings while working with epilepsy patients. But these are otherwise abled speakers. This proof of principle work was published in Nature in 2019. So we had some kind of inkling that we could read out the brain. We then thought that we should try using this to help people who are paralyzed, which was the focus of the BRAVO [BCI Restoration of Arm and Voice] clinical trial.
That trial, which used a new device called a speech neuroprosthesis, was successful and showed that we could decode full words from brain activity. It was followed by another study in which we managed to decode more than 1,000 words to create a spelling interface. The participant could say any NATO code words — such as Alpha, Bravo, Charlie — and have that be transcribed. We improved the machine learning models used to decode speech, specifically by using decoders that had explicit phonetic and language models that went from these code words into fluent sentences, like how Siri would recognize your voice.
In this project, we set out to increase the vocabulary and accuracy, but most importantly, we aimed to go beyond decoding spelling. We wanted to go directly to spoken language because that is our mode of communication and is the most natural way we learn.
The motivation behind the avatar was to help the participant feel embodied, to see a likeness and then control that likeness. So, for that purpose, we wanted to give a multimodal communication experience.
How did you translate brain signals into speech and expression? What were some of the engineering challenges you encountered along the way?
Kaylo: Because people with paralysis can’t speak, we don’t have what they’re trying to say as a ground truth to map to. So we incorporated a machine-learning optimization technique called CTC loss, which allowed us to map brain signals to discrete units, without the need for “ground truth” audio. We then synthesized the predicted discrete units into speech. The discrete units of speech encode aspects like pitch and tone, which are then synthesized to create audio that comes closer to natural speech. It’s those inflections and cadence changes that convey a lot of meaning in speech beyond the actual words.
In the case of text, Sean Metzger [co-lead author and Ph.D. student in the joint Bioengineering Program at UC Berkeley and UCSF] broke down words into phonemes.
We also extended this further into more natural communication modes like speech and facial expressions, in which the discrete units are articulatory gestures, like specific movements of the mouth. We can predict the gestures from the brain activity, then transform them into how the mouth moves.
For the facial animation, we worked with Speech Graphics to animate the gestures and speech into a digital avatar.
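To make the idea of predicting "discrete units" from a stream of signal frames more concrete, here is a minimal sketch of how a CTC-trained model's per-frame outputs are commonly collapsed into a unit sequence (greedy decoding). It is an illustration only, not the study's decoder: the function name, the phoneme labels and the per-frame scores are all hypothetical.

```python
def ctc_greedy_decode(frame_scores, labels, blank=0):
    """Collapse per-frame scores into a sequence of discrete units.

    frame_scores[t][k] is the score of unit k at time frame t.
    Standard CTC greedy rule: take the best unit per frame,
    merge consecutive repeats, then drop the blank symbol.
    """
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    decoded = []
    previous = None
    for k in best:
        if k != previous and k != blank:
            decoded.append(labels[k])
        previous = k
    return decoded


# Hypothetical units: index 0 is the CTC blank, the others are phonemes.
labels = ["_", "HH", "AY"]

# Hypothetical per-frame scores (five frames, three units).
frames = [
    [0.1, 0.8, 0.1],    # HH
    [0.2, 0.7, 0.1],    # HH (repeat, merged)
    [0.9, 0.05, 0.05],  # blank (dropped)
    [0.1, 0.1, 0.8],    # AY
    [0.1, 0.1, 0.8],    # AY (repeat, merged)
]

print(ctc_greedy_decode(frames, labels))
```

Real systems usually replace the greedy rule with a beam search that also consults a language model, which is in the spirit of the phonetic and language models mentioned earlier in the interview.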
Gopala: To underscore Kaylo’s point, we used all the existing AI technology to simulate essentially what a valid output would be for a given sentence. And we do that by using the speech data that is available in the big speech models used by Siri, Google Assistant and Alexa. So we have an idea of what a valid sequence of representative units is for a spoken language. That could be what the brain signal corresponds to. For instance, the participant was reading sentences, and we then used simulated pairs of this data: the input is from her brain signals, and the output is the sequence of discrete codes predicted from these large spoken-language models.
We also were able to personalize the participant’s voice by using a video recording of her making a speech at her wedding from about 20 years ago. We kind of fine-tuned the discrete codes to her voice. Once we had this paired alignment that we had simulated, we used the sequence alignment method that Kaylo had mentioned, the CTC loss.
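The CTC loss mentioned above can be sketched in a few lines. The key property is that it sums over every way a short target sequence of discrete units could be stretched across a longer run of signal frames, so no frame-by-frame alignment is needed. This is a textbook forward-algorithm sketch under simplifying assumptions (plain Python, log-probabilities supplied directly, blank symbol at index 0), not the study's training code; all names are illustrative.

```python
import math

BLANK = 0  # index reserved for the CTC blank symbol (an assumption of this sketch)


def logsumexp(*values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def ctc_neg_log_likelihood(log_probs, target):
    """CTC loss for one example.

    log_probs[t][k]: log-probability of unit k at frame t.
    target: list of unit ids, without blanks.
    """
    # Interleave blanks around the target: [a, b] -> [_, a, _, b, _]
    ext = [BLANK]
    for unit in target:
        ext += [unit, BLANK]
    T, S = len(log_probs), len(ext)

    # alpha[s]: log-probability of all partial alignments ending at ext[s]
    alpha = [-math.inf] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new_alpha = [-math.inf] * S
        for s in range(S):
            candidates = [alpha[s]]               # stay on the same symbol
            if s >= 1:
                candidates.append(alpha[s - 1])   # advance by one position
            # Skip over a blank, allowed only between two different units
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            new_alpha[s] = logsumexp(*candidates) + log_probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end on the last unit or on the trailing blank
    total = logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total


# Two frames, two symbols (blank and unit 1), uniform probabilities.
uniform = [[math.log(0.5), math.log(0.5)] for _ in range(2)]
loss = ctc_neg_log_likelihood(uniform, [1])
```

In this toy case three alignments produce the target `[1]` (unit-unit, blank-unit, unit-blank), each with probability 0.25, so the loss equals `-log(0.75)`. In practice one would use a library implementation and backpropagate through it rather than hand-code the recursion.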
An important part of this multimodal speech prosthesis is the avatar. Were there any special considerations or challenges with using that type of visual component?
Kaylo: The main motivation for using this avatar is to provide a complementary output to the speech and text decoding. The avatar can be used to convey a lot of non-speech expressions. For example, in the paper, we showed that we could decode the participant’s attempts to smile, frown or make a surprised face – and at different intensities, from low to high. Also, we showed that we could decode non-speech articulatory gestures, such as opening the mouth, puckering of the lips and so forth.
The participant wants to someday work as a counselor and expressed that being able to convey emotions through facial expressions would be valuable to her.
That said, the challenge with using an avatar is that it needs to be high fidelity, so that it’s not too unrealistic-looking. When we started this project, we worked with a very crude avatar, which wasn’t very realistic and didn’t have a tongue model. As neuroengineers, we needed a high-quality avatar that would allow us to access its muscle and vocal-tract system. So scoping out a good platform to do that was critical.
You had mentioned decoding the signals that control facial expression. Could you talk a little bit more about how you did that?
Gopala: Here’s an analogy: A piece of music can be broken down into discrete notes, with each note capturing a very different kind of pitch. Think of the discrete codes that Kaylo is mentioning as these notes. And there is a correlate for the note in terms of what it sounds like, but there’s also a correlate for what needs to happen for that sound to be produced. So if the note is for the sound “pa,” it sounds like the “pa,” but it also embodies the action of the lips puckering together and releasing.
The mechanism is coded by these units that the avatar is handling, and the sound is where the synthesis is happening. Essentially, we’re breaking down the neuro-speech sequence into a discrete sequence of notes.
Kaylo: Imagine the sentence: “Hey, how’s it going?” There is a sequence of vocal tract movements that are associated with that sound. And we can train a model, which takes those muscle movements and converts them into that discrete code, similar to the notes for music. And then we can predict that discrete code from the brain, and from there, go back to the continuous vocal tract movement, and that’s what drives the avatar.
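The round trip described here, from continuous movement to a discrete code and back, can be illustrated with a simple nearest-neighbor codebook. This is only a stand-in for the trained model used in the study: the two features (lip opening, jaw height), the three-entry codebook and the function names are all hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(trajectory, codebook):
    """Map each continuous frame to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: squared_distance(frame, codebook[i]))
        for frame in trajectory
    ]


def decode(codes, codebook):
    """Map discrete codes back to (approximate) continuous frames."""
    return [codebook[c] for c in codes]


# Hypothetical codebook: each entry is (lip opening, jaw height).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]

# Hypothetical continuous trajectory of three frames.
trajectory = [(0.1, 0.0), (0.9, 0.1), (0.6, 0.8)]

codes = encode(trajectory, codebook)      # the discrete "notes"
reconstruction = decode(codes, codebook)  # what would drive the avatar
print(codes, reconstruction)
```

The reconstruction is coarser than the original trajectory, which is the usual trade-off of quantization: a small discrete vocabulary is easier to predict from sparse brain recordings, at the cost of some detail.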
How has AI played a role in the development of this new brain-computer interface and multimodal communication?
Gopala: All of the algorithms and things developed for having your Alexa work are really key to getting some of this to fruition. So we would not be able to do it without AI, broadly speaking. And by AI, I mean not just current AI like ChatGPT, but the core engineering that’s enabled decades of AI and machine learning.
More importantly, we’re still limited in terms of how much of the brain we can access with neural implants, so our view is very sparse. We are essentially peeping through a keyhole, so we’ll always have to use AI to fill in the missing details. It’s like you can give the AI a raw sketch, and it can fill in the details to make it more realistic.
Eventually, when we get to the point of a completely closed-form solution for a prosthesis, the goal is a communication partner. This could be an AI that works with whatever signal it senses from the person, but, like ChatGPT, also uses a whole lot of statistics on how best to respond, so that the response is more contextually appropriate.
Were there any surprising findings tied to your work?
Kaylo: One super-important thing is that we showed the vocal tract representations are preserved in the participant’s brain. We know from studies of healthy participants that when someone tries to speak, those mouth movements are encoded on their cortex. But it was unclear if that would be the case for someone who has severe paralysis. For example, do those regions atrophy over time, or are those representations still there that we can use to decode speech from?
We confirmed that, yes, articulatory or vocal tract representations are preserved in the participant’s cortex, and that’s what allows for all three of these modalities to work.
Gopala: Exactly! So the brain part still retains these codes in the right place. We kind of hit a jackpot there. Because if there was loss, the surgery would have been for naught. And the AI is helping there, filling in the details as well. But it also helps make the participant feel embodied and learn new ways of speaking, and that’s key to getting to the next stage.
That said, current AI is centered around computers, not humans. We need to rethink what AI should be when there is a human in the loop and have it be more human centered, rather than do its own thing. It needs to share its autonomy with the human, so the human can still be in the driver’s seat, while the AI is the cooperative agent.
What do you see as your next steps?
Kaylo: For real-world use, having a stable decoder that works long term is going to be really important. It would be ideal if we could develop something that the participant could take home with her and use on a day-to-day basis over several years, without needing another neurosurgery.
Gopala: I think that the immediate logical next step is to reduce the latency involved in the process. So instead of having a few seconds of delay between the participant thinking of what she wants to say and the words coming out of the avatar’s mouth, we would minimize the latency to the point that the process feels like real time for her.
We should also look into miniaturizing the prosthesis and making it a standalone device, much like a pacemaker. It should act on its own, be powered on its own, and always be with the participant, without the researchers driving the apparatus.
How did the partnership between UCSF and Berkeley Engineering factor into the success of this project?
Gopala: This study heavily uses tools that we developed here at Berkeley, which in turn are inspired by the neuroscientific insights from UCSF. This is why Kaylo is such a key liaison between the engineering and the science and the medicine: he’s involved both in developing these tools and in deploying them in a clinical setting. I could not see this happening anywhere else but somewhere that is the best in engineering and the best in medicine, on the bleeding edge of research.
Kaylo: I don’t think this project would have happened if we didn’t have all the resources provided by both Berkeley and UCSF. We leveraged a lot of recent advances in engineering, in AI, and our understanding of neural speech processing to make this project work well. This is a great example of two institutions coming together and creating a good piece of research.