Processing power: The enabler
Although such functionality is implemented in software, there is a "CPU hit" to run it, notes Radloff. Early voice recognition systems ran on 100 MIPS processors. That figure rose to the 500 MIPS range in the mid-2000s, and today's more capable processors range from 800 to 1,500 MIPS. Radloff sees this key enabling technology reaching 2,500 MIPS around mid-decade. These numbers are closer to the trailing edge than the leading edge of processor development because the devices must be robust enough for the temperature extremes and EMI of the taxing automotive environment.
So what will automotive speech recognition technology be capable of with such power? Greater processing power will enable more natural language interfacing, says Radloff, using less-structured phrases like, "I want to hear Van Morrison." There will also be 3G connectivity to the car and the off-board migration of some voice recognition features to servers in the "cloud," he adds (see page 3). "You can run very sophisticated applications then such as [Nuance] Dragon dictation for SMS texting," which is currently being demonstrated and should be operational in about 18 months. Here, a voice message is sent to a server that sends out the text message.
By going off-board, voice recognition is no longer "grammar bound" to a fixed vocabulary, as it is under the memory and CPU limits of an embedded system in the car. "SMS is 'general' grammar [i.e. any combination of letters]. So if you have the connectivity, take advantage of it" to do the processing needed off-board, notes Radloff. Cloud-based service also keeps navigation system points-of-interest (POIs) and construction site data up to date.
cloud-based services, no onboard navigation system is needed to deliver turn-by-turn navigation, via voice commands, to the driver.
Also in the offing, and nearer term, is the installation of more than one microphone, which allows more sophisticated noise cancellation and beam forming. Processing directs the "listening beam" (for instance, by manipulating the delay of the same sound between mics) to "focus" on the driver, reducing the tendency to pick up passenger voices.
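The delay manipulation described above is commonly done with a technique called delay-and-sum beam forming. The following is a minimal sketch of that technique, not the method used by any supplier quoted here. The function names, the whole-sample delays, and the 343 m/s speed of sound are illustrative assumptions.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly room temperature


def steering_delays(mic_positions_m, source_position_m, sample_rate_hz):
    """Return per-mic delays, in whole samples, that align sound from the source."""
    distances = [math.dist(mic, source_position_m) for mic in mic_positions_m]
    farthest = max(distances)
    # Sound reaches closer mics first, so delay them to match the farthest mic.
    return [
        round((farthest - d) / SPEED_OF_SOUND_M_S * sample_rate_hz)
        for d in distances
    ]


def delay_and_sum(channels, delays):
    """Shift each channel by its delay and average the aligned samples."""
    length = len(channels[0])
    output = []
    for n in range(length):
        total = 0.0
        for channel, delay in zip(channels, delays):
            index = n - delay
            if 0 <= index < length:
                total += channel[index]
        output.append(total / len(channels))
    return output
```

Sound from the steered position (the driver) lines up across channels and adds at full strength, while sound from other positions (a passenger) arrives with mismatched delays and is partly averaged away.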
More insight into the nuances of microphone installation is provided by Scott Pennock, senior hands-free standard specialist, Hands-Free and Speech Technology, for QNX Software Systems, which partners with Nuance and provides acoustic processing middleware in creating speech interfaces. One QNX focus is delivering better voice signals to the speech recognition system.
"Vehicle noise is diffuse, the same throughout the cabin," Pennock says. "The challenge with the far-field mics comes about because if you double the distance to the speaker (driver), you take a 6 dB hit in the signal-to-noise (S/N) ratio." Thus it is better to install a microphone in the headliner, about 12 inches from the driver's mouth, rather than on the rearview mirror, up to 24 inches away.
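As for adding another mic for beam forming on the driver, there is also an S/N benefit, adds Pennock. But this is only a 3 dB improvement because the "noise floor" is raised by 3 dB (i.e. the second mic picks up not only speech but noise as well). The aligned speech from two mics adds in amplitude, for about a 6 dB gain, while the diffuse noise adds only in power, for about 3 dB, leaving a net 3 dB.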
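The 6 dB and 3 dB figures quoted above can be checked with a short calculation. This sketch assumes an idealized case: speech level falls with the inverse of distance, the noise is diffuse (equal everywhere), and the noise picked up by each mic is uncorrelated. The function names are illustrative.

```python
import math


def snr_change_db(reference_distance, new_distance):
    """S/N change when the mic moves, assuming only the speech level changes."""
    return -20.0 * math.log10(new_distance / reference_distance)


def array_snr_gain_db(num_mics):
    """Net S/N gain from summing aligned mics with uncorrelated noise."""
    speech_gain_db = 20.0 * math.log10(num_mics)  # amplitudes add coherently
    noise_gain_db = 10.0 * math.log10(num_mics)   # powers add incoherently
    return speech_gain_db - noise_gain_db


# Headliner mic at 12 inches versus mirror mic at 24 inches.
print(round(snr_change_db(12, 24), 1))  # -6.0
# Two mics instead of one.
print(round(array_snr_gain_db(2), 1))   # 3.0
```

Under these assumptions, halving the mic distance is worth about twice as many decibels as adding a second mic, which is consistent with the emphasis on placement.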
or more microphones using audio processing can form a sensitivity "beam" to pick up the driver's voice and reject sounds from the background-talking
In developing speech recognition systems, Pennock cites another challenge that may not be all that obvious. A system is specified with a required accuracy rate, but determining if that rate has been achieved can be daunting. This task was easier when systems used set commands rather than natural language. Systems are now pitted against natural utterances as people speak normally.
Testing may be done with live subjects, but that is time consuming, and the sample may still not be large enough, given today's growing grammars, to ensure all accents are adequately covered, Pennock notes. It is more efficient to build a library of utterances that can be played back. The utterances should be collected in a vehicle noise environment, where people tend to talk louder and at a higher pitch. Interestingly, a person speaking a string of familiar phone number digits in a natural cadence produces a higher recognition rate than one deliberately slowing down, Pennock says.
Voice systems need to be tested under different operating conditions. These can range from idle to 70 mph with climate control fans on high; during rain, where the noise is not steady but a dynamically varying signal; and riding over louder concrete or quieter asphalt.
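Playing a recorded utterance library back through a recognizer and scoring it per operating condition might look like the sketch below. The `recognize` callable stands in for whatever recognition engine is under test, and the tuple layout of the library is a hypothetical choice, not a format described by Pennock.

```python
from collections import defaultdict


def recognition_rates(utterances, recognize):
    """Return the fraction of correctly recognized utterances per condition.

    utterances: iterable of (condition, audio, expected_text) tuples.
    recognize: callable that takes audio and returns the recognized text.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for condition, audio, expected in utterances:
        totals[condition] += 1
        # Compare case-insensitively and ignore surrounding whitespace.
        if recognize(audio).strip().lower() == expected.strip().lower():
            correct[condition] += 1
    return {c: correct[c] / totals[c] for c in totals}
```

Grouping results by condition (idle, 70 mph with fans on high, rain, concrete versus asphalt) shows where accuracy falls short of the specified rate rather than hiding it in one overall average.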
Good speech-recognition user-interface design is more than just high recognition rates, however. How a system recovers from errors has to take into account both expert and novice users, notes Pennock. A re-prompt from the system when a phrase is not recognized may at first be, "Did you say xyz?" By detecting pauses in the response, the system can infer that the user needs more verbal prompts, perhaps to learn the phrases, whereas an experienced user will just confirm or repeat a request. The system then transitions a user over time to a more expert level.
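The pause-based re-prompting described above could be sketched as follows. The two-second threshold, the function name, and the wording of the longer prompt are all assumptions made for illustration, not details given by Pennock.

```python
def choose_reprompt(heard_phrase, response_pause_s, novice_pause_s=2.0):
    """Pick a confirmation prompt; a long pause is read as a sign of a novice."""
    short_prompt = f'Did you say "{heard_phrase}"?'
    if response_pause_s < novice_pause_s:
        # Quick responders get the terse prompt an expert expects.
        return short_prompt
    # Slow responders get extra guidance on what they can say next.
    return short_prompt + ' You can answer "yes" or "no", or repeat your request.'
```

For example, `choose_reprompt("call home", 0.5)` returns only the short confirmation, while a 3-second pause adds the guidance. A fuller design would track pauses over many interactions so the user is moved gradually toward the shorter prompts.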
Pennock concludes that, with the multimode user interfaces available today, speech seems most effective for inputting more complex "information in a user's head" (such as requests for POIs, audio selections, or phone calls) through more natural language interaction, without resorting to distracting touch scrolling. A single, simple action (climate mode selection or temperature increase), by contrast, is done effectively with a quick touchscreen or switch stroke.
Similarly, Brigitte Richardson, Global Voice Control Technology/Speech Systems lead engineer for Ford, notes that some fans of voice recognition want to expand its use to such functions as seat adjustments and window control, applications she feels are an overuse of the technology because these simple tasks are handled adequately now with familiar, basic switches.
But one trend is apparent: speech recognition is an increasingly important enabler for interacting with new automotive features and user-device connectivity, offering ease of use in a minimally distracting, safer manner.