Speech recognition cuts driver distraction
Rick DeMeis
8/5/2011 10:57 AM EDT
Processing power: The enabler
While such functionality is done in software, there is a "CPU hit" to run it, notes Radloff. Early voice recognition systems ran on 100 MIPS processors. That number rose to the 500 MIPS range in the mid-2000s, and today's more capable processors range from 800 to 1,500 MIPS. Radloff sees this key enabling technology reaching 2,500 MIPS around mid-decade. These numbers are closer to the trailing edge than the leading edge of processor development because the devices must be made robust enough for the temperature extremes and EMI of the taxing automotive environment.
So what will automotive speech recognition technology be capable of with such power? Greater processing power will enable more natural language interfacing, says Radloff, using less-structured phrases like, "I want to hear Van Morrison." There will also be 3G connectivity to the car and the off-board migration of some voice recognition features to servers in the "cloud," he adds (see page 3). "You can run very sophisticated applications then such as [Nuance] Dragon dictation for SMS texting," which is currently being demonstrated and should be operational in about 18 months. Here, a voice message is sent to a server that sends out the text message.
By going off-board, voice recognition is no longer "grammar bound" to a fixed vocabulary, as it is under the memory and CPU limits of an embedded system in the car. "SMS is 'general' grammar [i.e., any combination of letters]. So if you have the connectivity, take advantage of it" to do the processing needed off-board, notes Radloff. Cloud-based service also keeps navigation system points-of-interest (POIs) and construction site data up to date.


Microphone manipulation
Scott Pennock, senior hands-free standard specialist, Hands-Free and Speech Technology, at QNX Software Systems, offers more insight into the nuances of microphone installation. QNX partners with Nuance and provides acoustic processing middleware for creating speech interfaces. One QNX focus is delivering better voice signals to the speech recognition system.
"Vehicle noise is diffuse, the same throughout the cabin," Pennock says. "The challenge with the far-field mics comes about because if you double the distance to the speaker (driver), you take a 6 dB hit in the signal-to-noise (S/N) ratio." Thus it is better to install a microphone in the headliner, about 12 inches from the driver's mouth, rather than on the rearview mirror, up to 24 inches away.
Adding another mic for beam forming on the driver also brings an S/N benefit, adds Pennock. But this is only a 3 dB improvement because the "noise floor" is raised by 3 dB (i.e., the second mic picks up not only speech but noise as well).
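The arithmetic behind both figures can be sketched in a few lines. This is an idealized illustration, not QNX's method: it assumes the driver's voice behaves as a point source (level falls with 20·log10 of the distance ratio), that cabin noise is equally loud at every mic position, and that noise at the two mics is uncorrelated while speech adds in phase. Real cabin noise is partly correlated between closely spaced mics, so measured gains may differ. The function names are hypothetical.

```python
import math
from typing import Tuple


def distance_snr_change_db(near: float, far: float) -> float:
    """S/N change (dB) when a mic moves from `near` to `far` from the talker.

    Assumes point-source speech and diffuse noise that does not vary
    with mic position. Units cancel, so inches or meters both work.
    """
    if near <= 0 or far <= 0:
        raise ValueError("distances must be positive")
    return -20.0 * math.log10(far / near)


def array_snr_gain_db(num_mics: int) -> Tuple[float, float, float]:
    """Return (speech gain, noise gain, net S/N gain) in dB when summing mics.

    Assumes speech is time-aligned across mics (amplitudes add) and
    noise is uncorrelated between mics (powers add).
    """
    if num_mics < 1:
        raise ValueError("need at least one microphone")
    speech_gain = 20.0 * math.log10(num_mics)
    noise_gain = 10.0 * math.log10(num_mics)
    return speech_gain, noise_gain, speech_gain - noise_gain


# Headliner (about 12 in) versus rearview mirror (up to 24 in)
print(f"{distance_snr_change_db(12, 24):.1f} dB")  # -6.0 dB

speech, noise, net = array_snr_gain_db(2)
print(f"speech +{speech:.1f} dB, noise +{noise:.1f} dB, net +{net:.1f} dB")
# speech +6.0 dB, noise +3.0 dB, net +3.0 dB
```

Under these assumptions, moving one mic from the mirror to the headliner buys about twice as many decibels as adding a second mic at the same distance.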
In developing speech recognition systems, Pennock cites another challenge that may not be all that obvious. A system is specified with a required accuracy rate, but determining if that rate has been achieved can be daunting. This task was easier when systems used set commands rather than natural language. Systems are now pitted against natural utterances as people speak normally.
Testing may be done with live subjects, but this is time-consuming, and the sample may still not be large enough for today's growing grammars to ensure all accents are adequately covered, Pennock notes. It is better to build a library of utterances that can be played back more efficiently. The utterances should be collected in a vehicle noise environment, where people tend to talk louder and at a higher pitch. Interestingly, a person speaking a string of familiar phone number digits in a natural cadence produces a higher recognition rate than one deliberately slowing down, Pennock says.
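One way to judge whether an utterance library is large enough to confirm a specified accuracy rate is to put a confidence interval around the measured rate. The sketch below uses the Wilson score interval, a standard statistical tool rather than anything Pennock describes; the 95 percent accuracy figure and the library sizes are made-up illustrations.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a recognition rate (z=1.96 gives ~95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Same measured accuracy, different library sizes (hypothetical counts)
for total in (100, 1000, 10000):
    correct = round(0.95 * total)
    low, high = wilson_interval(correct, total)
    print(f"{total:>6} utterances: true rate plausibly {low:.3f} to {high:.3f}")
```

The interval narrows as the library grows, which is one argument for recorded playback over small live-subject sessions.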
Voice systems need to be tested under different operating conditions. These can range from idle to 70 mph with climate control fans on high; during rain, where the noise is not steady but a dynamically varying signal; and riding over louder concrete or quieter asphalt.
Good speech-recognition user-interface design is more than just high recognition rates, however. How a system recovers from errors has to take into account both expert and novice users, notes Pennock. A re-prompt from the system when a phrase is not recognized may at first be, "Did you say xyz?" By detecting response pauses, the system can assume the user needs more verbal prompts, perhaps to learn the phrases, whereas an experienced user will just confirm or repeat a request. The system then transitions a user over time to a more expert level.
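The pause-driven prompting described above might be sketched as follows. Everything here is a hypothetical illustration: the class name, the 2-second threshold, the smoothing weight, and the prompt wording are assumptions, not values from QNX or Nuance.

```python
from dataclasses import dataclass


@dataclass
class PromptPolicy:
    """Picks re-prompt verbosity from how long the user pauses before replying."""

    pause_threshold_s: float = 2.0  # assumed cutoff between novice and expert
    smoothing: float = 0.3          # assumed weight given to the newest pause
    avg_pause_s: float = 0.0        # running average of response pauses

    def record_pause(self, pause_s: float) -> None:
        # Exponential moving average, so the user's level shifts gradually
        self.avg_pause_s += self.smoothing * (pause_s - self.avg_pause_s)

    def reprompt(self, heard: str) -> str:
        short = f'Did you say "{heard}"?'
        if self.avg_pause_s > self.pause_threshold_s:
            # Long pauses suggest a novice: spell out the valid replies
            return short + " You can say yes, no, or repeat your request."
        return short


policy = PromptPolicy()
for pause in (4.0, 5.0, 4.5):  # a hesitant user
    policy.record_pause(pause)
print(policy.reprompt("call home"))
```

Because the average decays as pauses shorten, the same user drifts back to the terse prompt once responses become quick.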
Pennock concludes that with the multimode user interfaces available today, speech seems most effective for entering more complex "information in a user's head" (such as requests for POIs, audio selections, or phone calls) through more natural language interaction, without resorting to distracting touch scrolling. By contrast, a single, simple action (climate mode selection or temperature increase) is done effectively with a quick touchscreen or switch stroke.
Similarly, Brigitte Richardson, Global Voice Control Technology/Speech Systems lead engineer at Ford, notes that some fans of voice recognition want to expand its use to such functions as seat adjustments and window control. She feels these applications are an overuse of the technology because such simple tasks are handled adequately now with familiar, basic switches.
But one trend is apparent: speech recognition is increasingly an enabler for interacting with new automotive features and connected user devices, offering ease of use in a less distracting, safer manner.
Using cloud-based services, no onboard navigation system is needed to deliver turn-by-turn navigation, via voice commands, to the driver.
Also in the offing, in the nearer term, is the installation of more than one microphone, which allows more sophisticated noise cancellation and beam forming. Processing steers the "listening beam" (for instance, by manipulating the delay of the same sound between mics) to "focus" on the driver, reducing the tendency to pick up passenger voices.
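A toy delay-and-sum beamformer shows how manipulating the delay between mics focuses on one talker. This is a minimal sketch with whole-sample delays and single-sample pulses standing in for voices; production systems use fractional delays, filtering, and adaptive steering, and the function name and sample positions are assumptions for illustration.

```python
def delay_and_sum(mic_a, mic_b, delay_samples):
    """Average two mic signals after delaying mic_a by delay_samples.

    mic_a is the mic the target sound reaches first. Delaying it lines
    the target up with mic_b so the two copies add in phase.
    """
    if delay_samples < 0:
        raise ValueError("delay_samples must be non-negative")
    delayed_a = [0.0] * delay_samples + list(mic_a)
    n = min(len(delayed_a), len(mic_b))
    return [(delayed_a[i] + mic_b[i]) / 2.0 for i in range(n)]


# Driver pulse: reaches mic A at sample 2, mic B at sample 4.
# Passenger pulse: reaches mic B at sample 8, mic A at sample 10.
mic_a = [0.0] * 16
mic_b = [0.0] * 16
mic_a[2] = 1.0
mic_b[4] = 1.0
mic_b[8] = 1.0
mic_a[10] = 1.0

out = delay_and_sum(mic_a, mic_b, delay_samples=2)  # steer toward the driver
print(out[4], out[8], out[12])  # 1.0 0.5 0.5
```

The driver's pulse comes through at full height, while the passenger's is split into two half-height copies because its arrival order does not match the steering delay.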
Two or more microphones using audio processing can form a sensitivity "beam" to pick up the driver's voice and reject sounds from the background-talking passengers.
agk
8/6/2011 7:04 AM EDT
This is a nice system. The driver can keep hands on the steering wheel and eyes on the road while operating all the gadgets by voice command. The driver can control the entertainment gadgets and navigation, get vehicle information, adjust the climate control, and hold telephone conversations by voice command.
cdhmanning
8/10/2011 1:39 AM EDT
It is not just keeping your hands on the wheel and eyes on the road that is important. You need to keep your brain engaged too.
If you are swearing at the machine because it is not hearing you properly then you are probably not in the right mood for driving!
It has been noted with Bluetooth headsets (which are legal in many areas) that these are as distracting as holding a cellphone to your ear during a voice call.
I must say I have not used voice recognition for a long time, but anyone that remembers the Microsoft VR fiasco will remember how bad it can be!
I once worked for a telecom company where we tested some VR gear that had been trained to understand British voices. None of the British guys in the office could make it work. Nor me (South African accent). The only guy who could make it work was a Fijian Indian guy!
rajuchaluva
8/25/2011 2:54 AM EDT
That's true!
Great work
prabhakar_deosthali
8/8/2011 2:59 AM EDT
Sure! Voice recognition UIs can definitely reduce distractions for the driver. But some of those voice conversations themselves may be too distracting, for example if the driver happens to get a nagging call from his wife! There will be situations when the driver's eyes are on the road and his hands on the steering wheel, but his mind is getting dragged into some other world, and the driver in everybody knows well what disasters can happen when your mind is not on the job at hand.
hm
8/8/2011 5:47 AM EDT
Very good progress, but it has a long way to go. Soon CMOS cameras and advanced image processing will add to the new UI.
Rick DeMeis
8/16/2011 1:58 PM EDT
I had the chance recently to interact with the SYNC system in a Ford Edge. It was easy to get used to, with responses to most of the intuitive commands I could think up to control the audio and the climate control. Unfortunately, this car did not have a navigation system, which is where I think speech recognition can prove most useful.
I hope to use some other Nuance-based voice recognition products in the near future, and will report on the results.
selinz
8/24/2011 2:47 PM EDT
I like the system and guess what, my phone has the same function. I just hope we can keep the legislators from limiting the usability.
VUI Guy
6/21/2012 4:47 AM EDT
The CPU hit factor is not something to be taken lightly, which is why companies turn to a vendor like Rubidium, which offers a small-footprint, low-resource, cost-effective solution as opposed to the larger Nuance product.