TECH TRENDS: Front-end voice activation improves auto safety
Todd Mozer, Sensory
10/20/2011 3:00 AM EDT
Cars are inherently dangerous: two-ton masses of metal shooting along at 30-plus miles an hour, controlled by distracted people trying to do far too many things at once. The real problem for drivers is distraction, and the desire for additional productivity or entertainment that creates it. Much debate has gone on about what causes driver distraction. It comes in three kinds: mental distractions take our minds off driving, visual distractions take our eyes off the road, and physical distractions take our hands off the wheel.
Mental distraction has caused the greatest controversy. For example, does voice dialing really help if we just end up talking on a speakerphone and being mentally distracted? If we really wanted our cars to be safe, we wouldn’t think and drive, but we’ll never get legislation that outlaws singing along with the radio or requires us to limit our thoughts while driving. Mental distractions, although quite serious, are somewhat unavoidable. Fortunately, visual and physical distractions can be reduced through voice user interfaces that let us control the car without looking over at knobs and buttons, and without having to reach out and touch them.
Unfortunately, most implementations of speech recognition in cars have required a button to activate the recognizer. Many design approaches to the automotive speech recognition user experience have centered on the question “where to put the button?” The automotive industry has been resigned to requiring a button press, and the winning solution has been to place the button on the steering wheel, where it is convenient. A new generation of speech technologies, including truly hands-free voice controls, is now emerging that avoids buttons entirely, allowing drivers to better keep their eyes on the road and hands on the wheel.
Creating voice user interfaces
To create a truly hands-free, eyes-free user experience, several technology stages need to be addressed.
Stage 1: Voice Activation. This stage essentially replaces the button press, letting the driver keep hands on the wheel and eyes on the road. The "recognizer" needs to be always on and ready to call Stage 2 into operation. It must also be able to activate in very high noise (the radio may be playing loudly), and the user shouldn’t be forced to lean over and touch the volume controls to make it work.
Another key criterion for this first stage is a very fast response. It must work in real time: to adequately replace a button, its response time must match that of a button, which is nearly instantaneous. Simple command-and-control functions can be handled in the car by the embedded Stage 1 recognition system, or by a more complex Stage 2 system, which could be embedded or cloud-based.
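The always-on trigger decision described above can be sketched as follows. This is a minimal illustration, not any vendor's actual method: it assumes a hypothetical detector that emits a confidence score per audio frame, and the names first_trigger, threshold, and min_consecutive are invented for the example. Requiring a few consecutive high-scoring frames is one simple way to trade a little latency for fewer false activations.

```python
from typing import Iterable, Optional


def first_trigger(
    scores: Iterable[float],
    threshold: float = 0.8,      # assumed confidence cutoff (hypothetical value)
    min_consecutive: int = 3,    # assumed number of frames that must agree
) -> Optional[int]:
    """Return the index of the frame at which the trigger fires, or None.

    `scores` stands in for per-frame trigger-phrase confidences produced by
    an always-on Stage 1 recognizer; how they are computed is out of scope.
    """
    run = 0
    for index, score in enumerate(scores):
        # Extend the run on a confident frame, otherwise start over.
        run = run + 1 if score >= threshold else 0
        if run >= min_consecutive:
            return index
    return None


if __name__ == "__main__":
    frames = [0.1, 0.9, 0.95, 0.2, 0.9, 0.9, 0.9]
    print(first_trigger(frames))
```

With a short frame period, the delay added by min_consecutive stays small, which matters if the trigger is to feel as immediate as a button.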
Stage 2: Speech Recognition and Transcription. The more powerful, and more power-hungry, Stage 2 recognizer transcribes what is spoken into text. If the purpose is text messaging or voice dialing, the process can stop here. If the user wants a question answered or needs access to data, the system moves on to Stage 3.
Recognizers are pretty good at transcribing, as long as the user speaks in a clear, relatively unaccented voice with a good signal-to-noise ratio (a loud voice close to the mic and/or minimal background noise). Recognizers also tend to be best at “grammars” where the query types and phrase structures are well understood and highly predictable. Because the Stage 1 recognizer can respond in high noise, it can lower the volume of the in-car radio to assist Stage 2 recognition.
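To make the signal-to-noise point concrete, here is a small worked example using the standard definition SNR(dB) = 10 * log10(P_signal / P_noise). The power values and the 20 dB attenuation are made-up numbers, and the sketch assumes the radio is the only noise source, which a real cabin would not satisfy.

```python
import math


def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels from two (linear) power values."""
    return 10.0 * math.log10(signal_power / noise_power)


def duck(noise_power: float, attenuation_db: float) -> float:
    """Noise power after lowering the radio by `attenuation_db` decibels.

    Assumes the radio is the dominant noise source (an idealization).
    """
    return noise_power / (10.0 ** (attenuation_db / 10.0))


if __name__ == "__main__":
    voice, radio = 1.0, 1.0                   # illustrative powers, equal loudness
    print(snr_db(voice, radio))               # speech and radio equally loud
    print(snr_db(voice, duck(radio, 20.0)))   # after the radio is lowered
```

Under these assumptions, every decibel taken off the radio is a decibel added to the SNR seen by the Stage 2 recognizer.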
Stage 3: Intent and Meaning. This is probably the biggest challenge in the process. Even when the text is accurately transcribed, what does it mean? For example, what is the desired query for an Internet search? Today’s “intelligence” might try to modify the transcription to better fit what it thinks the user wants, but computers are remarkably bad at figuring out intent. Possibly the best examples of this failure lie outside speech recognition, in simple text typing. If correcting typing is hard, then correcting speech is that much harder.
Stage 4: Data Search and Query. Searching through data and finding the correct results can be straightforward or complex, depending on the query. Mapping data and directions can be quite reliable, because the grammar is well understood and the goal, a map search, is clear. With Google and other search providers pouring lots of money and time into this, results will keep getting better, even for the more unusual requests.
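As a rough illustration of why a well-understood grammar makes map queries tractable, the sketch below matches a transcript against a tiny, hypothetical command grammar. The phrasings, action names, and the parse function are assumptions made for this example, not part of any real navigation system.

```python
import re
from typing import Optional, Tuple

# Hypothetical grammar: each entry maps an action name to an accepted phrasing.
GRAMMAR = [
    ("navigate", re.compile(r"^(?:navigate|directions|take me) to (?P<arg>.+)$")),
    ("find", re.compile(r"^(?:find|show) (?:the )?nearest (?P<arg>.+)$")),
]


def parse(transcript: str) -> Optional[Tuple[str, str]]:
    """Return (action, argument) if the transcript fits the grammar, else None."""
    text = transcript.strip().lower()
    for action, pattern in GRAMMAR:
        match = pattern.match(text)
        if match:
            return action, match.group("arg")
    return None


if __name__ == "__main__":
    print(parse("Take me to 12 Main Street"))
    print(parse("Find the nearest gas station"))
    print(parse("Sing me a song"))  # outside the grammar
```

Anything that falls outside the grammar returns None, which is where the harder, open-ended intent problem of Stage 3 begins.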
Stage 5: Voice Response. A voice response to in-car queries is a nice alternative to displays, which take eyes off the road. Today's state-of-the-art Text-To-Speech (TTS) systems are highly intelligible and even quite natural sounding. Stage 5 is arguably in the best position today, and improvements tend to focus on the “naturalness” of speech and on putting more personality and expression into the voices, so that they become essentially indistinguishable from a real person.
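The flow through the five stages can be summarized in a short sketch. Everything here is a placeholder: the stage labels, the purpose strings, and run_pipeline are hypothetical names, and the intent and search steps are stubbed out rather than implemented. The point is only the control flow, in which text messaging and voice dialing stop after Stage 2 while other requests continue through Stage 5.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trace:
    """Record of which stages ran and what was finally produced."""
    stages: List[str] = field(default_factory=list)
    reply: str = ""


# Purposes that need only a transcript, per the Stage 2 description.
STOP_AFTER_TRANSCRIPTION = {"text_message", "voice_dial"}


def run_pipeline(purpose: str, transcript: str) -> Trace:
    trace = Trace()
    trace.stages.append("1:voice_activation")  # trigger heard, no button press
    trace.stages.append("2:transcription")     # speech to text (given as input here)
    if purpose in STOP_AFTER_TRANSCRIPTION:
        trace.reply = transcript
        return trace
    trace.stages.append("3:intent")
    query = transcript.strip().lower()         # stub standing in for intent extraction
    trace.stages.append("4:search")
    result = f"results for '{query}'"          # stub standing in for a data search
    trace.stages.append("5:voice_response")
    trace.reply = result                       # would be handed to a TTS engine
    return trace


if __name__ == "__main__":
    print(run_pipeline("voice_dial", "Call home").stages)
    print(run_pipeline("question", "Nearest gas station").stages)
```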
Dr DSP
10/20/2011 5:24 PM EDT
Just what we need. Let's put lots of capabilities into the car to distract us and then more to take up the slack when we are distracted. Let the car do the driving and then we can just watch a movie on the windshield.
ndancer
10/21/2011 4:22 PM EDT
I have, on a 2010 Prius, a voice activated GPS / radio module. On a recent trip, I tried, "Zoom out", a legal command, by the way, and got everything from "Show restaurants" to "CD on". And we're going to trust these things to drive the car?
Guys, this is not a good thing.
It could be greatly improved if it asked, "I heard CD on. Is this correct?" That alone would decrease frustration level.
I no longer even try to use voice commands, although I was raised in Colorado, and, according to most people, enunciate clearly enough, but not, apparently, clearly enough for slockware, er, I'm sorry, software.
Bert22306
10/22/2011 6:35 PM EDT
Maybe eventually, but I'm still skeptical. I have an OnStar system that is voice activated, but it's not foolproof. Oddly enough, it has the most problems when I tell it to call "home." "Home" is one word it finds baffling. It needs repeating just about every time. So, just imagine being stressed out, in some critical situation, and barking out a command the system can't decipher, perhaps because your voice sounds different when stressed. Come now.
On the subject of hands-free calling, I've read more than one report that says it's just about as distracting as using a cell phone. While I don't dispute that, I'm curious why talking hands-free on a phone should be any different from having a conversation in the car, with a passenger, while driving.
All I can come up with is that while driving and conversing with a passenger, the driver feels freer to stop talking when the driving requires more attention? And on the phone, you can't just leave the other guy hanging? Or perhaps conversing over a low-fidelity audio link, like a voice telephone link, is simply more of a burden on the brain than an in-person conversation? I mean, it's not like the driver can actually look at the person he's speaking to while driving, so the missing visual cues can't be the difference in this phone-while-driving matter.
prabhakar_deosthali
10/24/2011 1:31 AM EDT
I also have serious doubts about the efficacy of such voice-activated systems in critical situations, when your voice may not sound the same as it did when you trained the system.
It would be better to have some kind of touch pad system on the dashboard, with a one-touch command facility.
ndancer
10/28/2011 12:16 PM EDT
How about having the car automatically detect, A: when traffic is too heavy to be safe, or, B: when you're driving like a jerk, and then, auto-magically, turn off the radio, TV, GPS, telephone, iPod, and everything else that could possibly be distracting?
Might cut down on the DWS syndrome (Driving While Stupid).
WKetel
10/28/2011 1:16 PM EDT
I agree that voice commands are very complex to handle, and the comment about "probabilistic algorithms" means that the system would always assume the most common command. One option that could help might be similar to what we saw in Star Trek, where Kirk would first say "computer", and then state his request.
Of course, all of this technology will certainly serve to provide more distractions and reduce safety, since most human minds can't handle large amounts of distraction without ignoring the more important task of driving. This is not just some wild assertion, it has been proved with research several times.