Cars are inherently dangerous… two-ton masses of metal shooting along at 30-plus miles an hour, controlled by distracted people trying to do far too many things at once. The real problem for drivers is the distractions, and the desire for additional productivity or entertainment that creates them. Much debate has gone on about what causes driver distraction. Mental distractions take our minds off driving, visual distractions take our eyes off the road, and physical distractions take our hands off the wheel.
Mental distraction has caused the greatest controversy. For example, does voice dialing really help if we’re just going to end up talking on a speakerphone and be distracted mentally? If we really want our cars to be safe, then we shouldn’t think and drive; but we’ll never get legislation that outlaws singing along with the radio or requires us to limit our thoughts while driving. The mental distractions, although quite serious, are somewhat unavoidable. Fortunately, visual and physical distractions can be reduced through voice user interfaces that let us control our cars without looking over at knobs and buttons, and without having to reach out and touch them.
Unfortunately, most implementations of speech recognition in cars have required a button press to activate the recognition. Many design approaches to the automotive speech recognition user experience have centered on “where to put the button?” The automotive industry has resigned itself to the button press, and the winning solution has been to place it on the steering wheel for convenience. A new generation of speech technologies, including truly hands-free voice controls, is now emerging that avoids buttons entirely, allowing drivers to better keep their eyes on the road and hands on the wheel.
Creating voice user interfaces
To create a truly hands-free, eyes-free user experience, there are a number of technology stages that need to be addressed.
Stage 1: Voice Activation.
This stage essentially replaces the button press, permitting the driver to keep hands on the wheel and eyes on the road. The "recognizer" needs to be always on and ready to call Stage 2 into operation. But the recognizer must be able to activate in very high noise situations (because the radio may be turned up loud), and the user shouldn’t be forced to lean over and touch the volume controls to make it work.
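The always-on loop described above can be sketched as a simple per-frame decision. Everything here is an illustrative assumption, not any vendor's API: the frame fields, the wake-word confidence score, and both thresholds are invented for the example.

```python
# Minimal sketch of a Stage 1 "always-on" activation loop.
# Frame fields, thresholds, and scores are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Frame:
    energy: float          # rough loudness of this audio frame
    keyword_score: float   # confidence that the wake word was heard

WAKE_THRESHOLD = 0.8       # assumed confidence cutoff for activation
NOISE_FLOOR = 0.1          # frames quieter than this are ignored outright

def detect_activation(frames):
    """Return the index of the first frame that triggers Stage 2, or None.

    The detector stays "always on": it decides per frame, because a
    button-like replacement has to respond in near real time.
    """
    for i, frame in enumerate(frames):
        if frame.energy < NOISE_FLOOR:
            continue                      # silence: nothing to do
        if frame.keyword_score >= WAKE_THRESHOLD:
            return i                      # hand off to the Stage 2 recognizer
    return None
```

A real detector would score the wake word against noisy audio rather than receive a precomputed confidence, but the control flow (scan every frame, trigger instantly, never block) is the point of the sketch.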
Another key criterion for this first stage is a very fast response: it must be real time, because if the function is to adequately replace a button, then the response time must match a button's, which is near instantaneous. Simple command-and-control functions can be handled by the embedded Stage 1 recognition system in the car, or by a more complex Stage 2 system, which can be embedded or cloud based.
Stage 2: Speech Recognition and Transcription.
The more power-hungry and more powerful Stage 2 recognizer translates what is spoken into text. If the purpose is text messaging or voice dialing, then the process can stop here. If the user wants a question answered or data accessed, the system moves on to Stage 3.
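The hand-off described here can be sketched as a routing decision on the transcript: utterances that are themselves the payload (a message to send, a number to dial) end the pipeline, while anything else continues to Stage 3. The command prefixes and return labels below are invented for illustration.

```python
# Hedged sketch of the hand-off after Stage 2 transcription.
# Command prefixes and labels are assumptions, not a real product's grammar.
def route_transcription(transcript: str) -> str:
    """Decide where a Stage 2 transcript goes next (illustrative rules)."""
    text = transcript.lower().strip()
    if text.startswith(("text ", "dial ", "call ")):
        return "done: execute locally"        # messaging / dialing ends here
    # anything else is treated as a question or data request
    return "stage 3: intent and meaning"
```

For example, "Call Mom" would be executed locally, while "What's the weather in Boston" would continue to the intent stage.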
Recognizers are pretty good at transcription, as long as the user speaks clearly, in a relatively unaccented voice, with a good signal-to-noise ratio (a loud voice spoken close to the mic and/or minimal background noise). Recognizers also tend to perform best on “grammars,” where the query types and phrase structures are well understood and highly predictable. Because the Stage 1 recognizer can respond in high noise, it can lower the volume of the in-car radio to assist Stage 2 recognition.
Stage 3: Intent and Meaning.
This is probably the biggest challenge in the process. The text may be accurately transcribed, but what does it mean? For example, what is the desired query for an Internet search? Today’s “intelligence” might try to modify the transcription to better fit what it thinks the user wants, yet computers are remarkably bad at figuring out intent. Possibly the best examples of this failure lie outside speech recognition, in simple text typing. If correcting typing is hard, then correcting speech is that much harder.
Stage 4: Data Search and Query.
Searching through data and finding the correct results can be straightforward or complex, depending on the query. Mapping data and directions can be quite reliable, because navigation is a well-understood grammar with the clear goal of a map search. With Google and other search providers pouring money and time into this area, results will only get better and better, even for the more unusual requests.
Stage 5: Voice Response.
A voice response to in-car queries is a nice alternative to displays, which take eyes off the road. Today's state-of-the-art Text-To-Speech (TTS) systems are highly intelligible and even quite natural sounding. Stage 5 is arguably in the best position today; improvements tend to focus on the “naturalness” of the speech, putting more personality and expression into the voices so that they are essentially indistinguishable from a real person.
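The five stages above can be tied together in a toy end-to-end walk-through. Every component here is a stub with invented names, wake phrases, and return values; it shows the flow of control between stages, not a real automotive speech stack.

```python
# Toy walk-through of the five stages, with every component stubbed out.
# Names, the wake phrase, and return values are assumptions for illustration.
def stage1_activate(audio: str) -> bool:
    return "hey car" in audio                     # assumed wake phrase

def stage2_transcribe(audio: str) -> str:
    return audio.split("hey car", 1)[1].strip()   # pretend-perfect recognizer

def stage3_intent(text: str) -> dict:
    # Navigation is a well-understood grammar; everything else falls
    # through to a generic search, mirroring the article's Stage 3/4 point.
    if text.startswith("navigate to"):
        return {"intent": "map_search", "query": text[len("navigate to"):].strip()}
    return {"intent": "web_search", "query": text}

def stage4_search(intent: dict) -> str:
    return f"results for {intent['query']}"       # stand-in for a real backend

def stage5_speak(result: str) -> str:
    return f"[TTS] {result}"                      # stand-in for a TTS engine

def pipeline(audio: str):
    if not stage1_activate(audio):
        return None                               # stay silent until woken
    text = stage2_transcribe(audio)
    return stage5_speak(stage4_search(stage3_intent(text)))
```

Speech that never contains the wake phrase never leaves Stage 1, which is what keeps an always-on system from reacting to ordinary conversation.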