PC Magazine is running an interview with two of the research leaders in IBM’s speech recognition group, Dr. David Nahamoo, manager of Human Language Technologies, and Dr. Roberto Sicconi, manager of Multimodal Conversational Solutions.

After spending the day at IBM headquarters viewing demos of the company’s latest research projects, reporter Robyn Peterson caught up with two of the research leaders in IBM’s speech recognition group, Dr. David Nahamoo, manager of Human Language Technologies, and Dr. Roberto Sicconi, manager of Multimodal Conversational Solutions. The following transcript has been edited for clarity.

PC Magazine: What’s wrong with speech recognition software that’s available on the market today?

Dr. David Nahamoo (DN): The way to look at speech technology is to look at the progression of human language.

How many years did it take from the starting point of the tribesman, so to speak, to get to this level? We have been essentially doing speech recognition for the past 30 or 40 years. So we are at the earliest stage in the evolution of the technology. So there are things that we don’t know.

One important aspect is that our machines have not seen enough variety of speech: different contexts, different applications, different tasks, and different vocabularies. We have not seen them enough to model them. When we grow up as children, we have a machine in our head, which has been programmed generation after generation until it has been made ready to accept speech from parents and from teachers and map [the speech to meaning] and learn how to speak, how to understand, how to transcribe.

So, there is nothing really wrong [with speech recognition today] except that we are just at the beginning of this technology. But the surprising thing is that, [in] the end, if you compare [a computer’s ability to understand speech] to human performance, it might not be too far apart. At IBM, we have this Superhuman Speech Recognition [program] that has a goal to get there, comparable to human performance, in the next five years.

Comparable to human performance in the next five years — in terms of what, exactly?

DN: Transcribing. Listening to a conversation and transcribing it, not understanding it. There’s a big difference between transcribing, which is turning [speech] into a sequence of words, versus [understanding] what was meant by what was said. That is a different discussion.

In one of the demo rooms today, we saw MASTOR (a program that performs dynamic translation), which does seem to have conceptual understanding of some English phrases or [Mandarin] Chinese phrases.

DN: Right. With speech recognition, when we talk in terms of "superhuman," that means it’s going to be domain independent. It’s going to be broad.

Domain refers to a context, like, for instance, in a car?

DN: Correct. In a car, in the finance or banking industry. With regard to understanding or translating speech, to get high quality performance, it’s usually domain specific. So, what you saw in MASTOR was domain specific for a human application, in health. … [When the domain is restricted,] there is a [finite] set of conversations [that can occur]. The vocabulary and the phraseology are not unlimited.

So, I have this picture of myself in Morocco with a laptop translating everything I’m saying to vendors, and possibly having conversations with other locals. When is that going to be a reality? Is it five years off or ten years off?

DN: It all depends. Where do you want to go? Morocco? (Laughs) It depends on what you want to do. There are a few things you want to take care of. You want to make sure that if you get sick, you get the proper care.

That’s domain-specific?

DN: Yes. If you want to find transportation and directions, if you want to go to a restaurant and have a conversation, those are not too far away. If you want to go and essentially trade, do business, and conduct complicated negotiations, then I don’t have a good answer for you.

Is it a matter of breadth, where you have to spend time tackling these specific domains one at a time, or is it a different level of conceptual understanding?

DN: It is a different level of conceptual understanding. When you go to a broader domain, negotiating on price, a lot of parameters come into the picture. And just modeling it is a lot of work.

Dr. Roberto Sicconi (RS): Even in your own language, when you talk to somebody and try to be sarcastic or make a reference to a movie by some [obscure] detail, you expect the other person to [understand] your logic. Machines don’t have that level of perception.

DN: For the [MASTOR] translator then, we are doing it by true understanding. And Roberto is correct in that understanding, by itself, is a very difficult problem. Forget about the language part of it. Think about interviewing a machine. Would that machine ever be able to have an intelligent conversation?

We’re starting to see adoption [of speech technology] in places like hospitals and auction houses. Is the next phase going to be additional types of businesses, say finance, or are we going to see more of a push into consumer markets, like the car?

RS: There has to be a good reason to use speech, maybe your hands are full [like in the case of driving a car]. … Speech has to be important enough to justify the adoption. I’d like to go back to one of your original questions. You were saying, "What’s wrong with speech recognition today?" One of the things I see missing is feedback. In most cases, conversations are one-way. When you talk to a device, it’s like talking to a one- or two-year-old child. He can’t tell you what’s wrong, and you just wait for the time when he can tell you what he wants or what he needs. Today’s devices have the same problem. You talk to the device and it doesn’t respond. You don’t know whether the microphone isn’t working or it got some words wrong [and it doesn’t understand you]. A person will tell you, "Sorry, I didn’t understand this part," or [say], "You’re breaking up," during a cell phone conversation. That kind of feedback is just not available today.

So that’s something we need to work on. The problem is, if you have a talking machine, you start associating that machine with a person. … If the machine misses too much it may appear to be stupid, or annoying. It’s almost unavoidable. … So if you want to sell a device like this, you have to be really careful about the personality that most people see when dealing with the device. That really brings the threshold very high.
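
To make Sicconi’s point about feedback concrete, here is a minimal, hypothetical sketch (not IBM’s design) of a recognizer front end that tells the user why it failed instead of staying silent; the result fields, thresholds, and prompts are all assumptions for illustration.

```python
# Hypothetical sketch of the "feedback" Sicconi describes: instead of silently
# failing, the system reports *why* it could not act on an utterance.
# All names here (RecognitionResult, handle_utterance, thresholds) are
# illustrative assumptions, not an IBM API.

from dataclasses import dataclass

@dataclass
class RecognitionResult:
    transcript: str      # best-guess word sequence from the recognizer
    confidence: float    # 0.0 - 1.0 overall decoding confidence
    audio_level: float   # 0.0 - 1.0 input signal level from the microphone

def handle_utterance(result: RecognitionResult) -> str:
    """Return either the accepted transcript or an explanatory prompt."""
    if result.audio_level < 0.05:
        # Nothing usable reached the recognizer: likely a microphone problem.
        return "I can't hear you - please check the microphone."
    if result.confidence < 0.4:
        # Audio arrived but decoding was unreliable: ask the user to repeat.
        return "Sorry, I didn't understand that - could you repeat it?"
    if result.confidence < 0.7:
        # Borderline: echo the hypothesis back instead of acting on it.
        return f'Did you say "{result.transcript}"?'
    return result.transcript  # confident enough to act on

print(handle_utterance(RecognitionResult("call home", 0.92, 0.6)))   # accepted
print(handle_utterance(RecognitionResult("call home", 0.55, 0.6)))   # confirm
print(handle_utterance(RecognitionResult("", 0.10, 0.01)))           # mic issue
```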

So expectations [by consumers] are too high?

RS: Expectations are pretty high. It’s not that people expect it to be perfect, but people look at Star Trek and expect that any machine listening to you can understand every single word.

A good part of communication is not only speech, but facial expressions, too. David, you mentioned "audio/visual" recognition in a speech today. Can you explain that more?

DN: Essentially, depending on where you want to do the task of speech recognition, the audio signal may not be enough, for instance, on a trading floor where everyone is shouting. … When the audio is not enough, you want to back it up with other information, [like] video information. Video information is very, very useful.

What kind of visual information?

DN: … Visual information comes from looking at you. First, your mouth is either moving or not. If it’s not, then the machine doesn’t have to pay attention to any audio information. Suppose I design a system that takes an action as soon as you speak. In that case, if you haven’t opened your mouth, and another noise comes, [that noise] would be rejected because the machine is focusing on you. … So that’s the first step. It can replace push-to-talk functionality. You don’t need to push to talk since the system can actually look at you and say, "you’re not talking."

The second thing is that now you’re talking and the radio is blasting. So the signal-to-noise ratio is very low. Some signal is coming, but the machine is having difficulty. Your visual cues can be a wonderful set of additional information that can be given to that [machine]. And when the two things are put together, [i.e., limited audio supplemented with video,] the signal-to-noise ratio is much better.
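
A rough sketch of the two roles Nahamoo describes for visual information: gating on lip motion instead of push-to-talk, and weighting audio against visual evidence according to the signal-to-noise ratio. The function names, score format, and weighting scheme are invented for illustration and are not IBM’s implementation.

```python
# Minimal sketch of audio-visual speech recognition support, under assumed
# inputs: a per-frame lip-motion score from a camera and per-word likelihoods
# from separate audio-only and visual-only models.

def is_speaking(lip_motion_score: float, threshold: float = 0.3) -> bool:
    """Visual 'push-to-talk': only attend to audio while the lips are moving."""
    return lip_motion_score > threshold

def fuse_scores(audio_scores: dict, visual_scores: dict, snr_db: float) -> dict:
    """Weighted combination of audio and visual evidence for each word.

    When the estimated signal-to-noise ratio is low (the radio is blasting),
    shift weight toward the visual stream; when the audio is clean, trust it.
    """
    audio_weight = max(0.0, min(1.0, snr_db / 30.0))  # crude mapping of SNR (dB) to [0, 1]
    fused = {}
    for word in audio_scores:
        fused[word] = (audio_weight * audio_scores[word]
                       + (1.0 - audio_weight) * visual_scores.get(word, 0.0))
    return fused

# Frame where the user's mouth is not moving: noise from other talkers is ignored.
print(is_speaking(lip_motion_score=0.05))   # False -> discard this audio frame

# Noisy frame (SNR ~5 dB): visual evidence dominates the fused score.
audio = {"stop": 0.40, "shop": 0.45}
video = {"stop": 0.80, "shop": 0.10}
print(fuse_scores(audio, video, snr_db=5.0))
```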

Where does this stand now in IBM Research? You’ve obviously researched the issue. Are you also working on devices? Is there something you can show soon?

RS: Yes, prototypes. So, in a sense, we can compensate for some acoustic noise by looking at features in the visual space. What we’re trying to do is understand the processing required to get enough information to compensate.

DN: So there are two aspects to what we do. One is that we continue to do the research. The second is that we always have an eye on where our research is going to show up in the company’s business. Some things might not necessarily come up. A consumer market in some form or fashion is not always the focus that we have. But there are some aspects of it that have tremendous focus. For instance, automotive. Visual information could be a very powerful technology inside the car to help deal with noise. The machine could be on all the time: people all around are talking, but the system ignores them because it’s focused on me and I’m not doing anything.

Will there be a time when a machine can simply read your lips?

RS: If you rely on the visual channel only, it’s difficult. If you can trust what deaf people have told me, … even people reading lips only have 30-40% accuracy. That’s the best you can get. … [Lip readers] try to compensate by guessing what you are saying and constantly trying to match what you are saying with what they are expecting.

With a lot of IBM Research’s demonstrated [speech technology] products, it seems as though open standards, and even open source, are a priority. Now, it’s no secret that IBM makes a lot of money on consulting services. How much of the drive for open standards and open source comes from a desire to bolster the consulting practice?

DN: For standards and open source, you really have to look at it as you implied: it’s a business decision. It’s a business decision because if you don’t have standards, and the application provider, the service provider, and the solution provider cannot agree on a common set of things, then nothing gets done. If things are not standardized … then you are essentially impacting the cost of the delivery.

Standards have a tremendous value in reducing costs in two forms. One, they allow reuse of what already exists. When you’re building a speech recognition application, it’s only the interaction layer that you have to program. The entire database and business logic, and all of those things, shouldn’t change. You should be able to use them from the existing application. That’s the beauty of VoiceXML; it made that possible. [Two,] open source is another aspect, as you allow a large number of people to play around and innovate on many aspects of it.

Voice user interface, for example, is a tricky thing. It happens by trial and error. Innovation has to happen here and there. A larger number of people using a technology, playing with it, actually accelerates the delivery of [applications]. [In the case of developing] services, at some point in time, lots of things need to be put together, and it just doesn’t happen by itself. Somebody who has expertise, somebody who already knows how to do it, somebody who has done the hard work of learning how to put these things together, can come and provide service, and we all love it and we say, please do it for me because I don’t want to do it myself.

So it’s all about acceleration, reducing the total cost of ownership, making the enterprise feel comfortable that the money that is spent now won’t have to be spent again redoing the system six months from now because technology changed, something changed.
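
As a loose illustration of the reuse Nahamoo attributes to VoiceXML, the sketch below separates a voice-specific interaction layer from untouched business logic; the functions and data are invented placeholders, not a real VoiceXML runtime or an IBM product.

```python
# Rough sketch of the reuse described above: with a standard such as VoiceXML,
# only the interaction layer is voice-specific; the business logic and database
# access behind it stay exactly as they are for the existing application.

def get_account_balance(account_id: str) -> float:
    """Existing business logic - unchanged whether the front end is web or voice."""
    fake_database = {"12345": 802.50}   # stand-in for the real data layer
    return fake_database.get(account_id, 0.0)

def voice_interaction_layer(recognized_digits: str) -> str:
    """The only new piece: map a recognized utterance onto the existing logic
    and phrase the answer for speech output, as a VoiceXML dialog would."""
    balance = get_account_balance(recognized_digits)
    return f"The balance for account {recognized_digits} is {balance:.2f} dollars."

print(voice_interaction_layer("12345"))
```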

RS: Open standards, to me, are key to creating fast moving, huge markets, which of course increases competition. So if you want to look at it as a big opportunity, and IBM is interested in this, you have to play by the rules and be competitive. If you are a competitor and you know how to innovate and compete with everybody, then open standards will provide you the best type of game.
