How many automobile accidents have something to do with a driver fiddling with the car radio, searching for directions, reading a map or using a cellular phone? How many physically impaired drivers and passengers would benefit from speech-activated features within the automotive environment?
To make highways safer, lawmakers nationwide and abroad are cracking down on motor vehicle operators distracted by dialing their wireless phones while driving. But hands-free technology developers have a better solution than more laws. Using speech recognition, these developers are creating alternative interfaces for in-dash wireless e-mail, compact disc, radio and cellular phone operation. They promise to make using such devices in a vehicle much easier by removing the need to dial the phone or change a compact disc or radio station while driving. However, the road is anything but clear.
Consumer electronic products such as PDAs and information appliances face a pair of seemingly conflicting objectives: providing more and more functionality in smaller and smaller packages. With small screens and little or no keyboard, it becomes difficult to provide a user-friendly interface for the next generation of computing devices.
A solution used by Microsoft in its AutoPC platform is to complement a minimal set of traditional tactile controls (buttons, pointers, keypads, etc.) with a speech user interface. By providing access to device functions through speech, the AutoPC is able to offer a safe, convenient and natural interface.
The AutoPC offers the form factor of a car radio and the functionality of a PC. Operating under Microsoft's Windows CE operating system, the AutoPC provides a speech user interface (SUI) to travel instructions, hands-free voice dialing, address books and car audio functions.
Though the AutoPC sits in the passenger space and not under the hood with the engine, it still requires a noise-robust, word-based recognizer, such as the ASR200, that provides speaker-independent command and control of system functions and applications. For implementation in the AutoPC, such a device has to meet several critical design requirements, including high accuracy in a noisy environment, speaker independence, speaker-dependent capability and high reliability, to name a few. All this, while consuming minimal system resources such as CPU MIPS and RAM.
Microphone on visor
Providing accurate recognition in varying acoustic conditions, such as an automobile operating under different driving conditions, requires proprietary signal-processing routines that normalize the automobile's interior noise and allow the AutoPC to operate with a "far-talk microphone" mounted on the visor, headliner or overhead console near the driver.
In addition to providing accurate recognition of spoken words, it is also important that these products provide accurate "rejection." Rejection is the ability of an engine to ignore extraneous speech, or speech that does not represent an intentional command meant for the device. In the automobile, extraneous speech could range from the driver talking to a passenger, to someone talking on a cell phone, to dialog coming from the entertainment system itself.
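One common way to implement rejection (illustrative only; the article does not describe L&H's actual algorithm) is to threshold the recognizer's confidence score: if no in-vocabulary command scores high enough, the audio is discarded as extraneous speech rather than forced to the nearest match. A minimal sketch in Python, with invented scores and threshold:

```python
def recognize_with_rejection(scores, threshold=0.75):
    """Illustrative confidence-threshold rejection (hypothetical values).
    `scores` maps each in-vocabulary command to a confidence in [0, 1];
    if the best candidate falls below the threshold, the utterance is
    rejected as extraneous speech instead of being forced to a match."""
    best_word = max(scores, key=scores.get)
    if scores[best_word] < threshold:
        return None  # reject: likely conversation, phone talk, or radio
    return best_word

# A clear command is accepted...
assert recognize_with_rejection({"dial": 0.91, "next": 0.12}) == "dial"
# ...while low-confidence, ambiguous audio is rejected.
assert recognize_with_rejection({"dial": 0.40, "next": 0.38}) is None
```

Tuning the threshold trades false rejections of real commands against false acceptances of background speech.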
To solve the problem of determining "who is talking to whom," the speech recognizer chip provides a "wake-up" word or a "push-to-talk" (PTT) switch, which allows the user to effectively grab the attention of the AutoPC application with a single operation. After responding to this initial command (wake-up or PTT), the AutoPC is alerted and ready to respond to any subsequent commands from the user. After a predetermined amount of time (usually around 15 seconds), the AutoPC then returns to the resting state, awaiting its next instruction.
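The attention behavior just described is essentially a small state machine: ignore everything until the wake-up word (or a PTT press), then treat speech as commands until a roughly 15-second timer lapses. A hypothetical sketch, with an invented wake-up word:

```python
class WakeUpStateMachine:
    """Sketch of the resting/alert attention model described above.
    The wake word "autopc" and the interface are invented for
    illustration; only the ~15-second timeout comes from the article."""

    def __init__(self, wake_word="autopc", timeout_s=15.0):
        self.wake_word = wake_word
        self.timeout_s = timeout_s
        self.awake_until = 0.0  # resting state initially

    def hear(self, utterance, now):
        """Returns a command to act on, or None if the speech is ignored."""
        if now >= self.awake_until:          # resting state
            if utterance == self.wake_word:
                self.awake_until = now + self.timeout_s  # wake up
            return None                      # nothing to execute yet
        self.awake_until = now + self.timeout_s  # refresh window on activity
        return utterance                     # alert: treat as a command

sm = WakeUpStateMachine()
assert sm.hear("dial home", 0.0) is None         # ignored while resting
assert sm.hear("autopc", 1.0) is None            # wake word heard, no command yet
assert sm.hear("dial home", 5.0) == "dial home"  # within the alert window
assert sm.hear("next track", 30.0) is None       # timed out, back to resting
```

Refreshing the timeout on each accepted command lets the user issue a series of instructions without repeating the wake-up word.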
To provide the best initial impression and usability for a consumer electronics product, it is desirable to provide speaker-independent vocabularies. By doing so, the consumer is not required to participate in any training or enrollment procedure and can begin using the product out of the box. To support speaker independence, it is necessary to program the device with pretrained vocabularies. The vocabularies developed by L&H, for example, are designed for use in multiple languages and offer a wide set of command words, digits and alphanumerics.
It is also possible to extend and customize the speaker-independent vocabulary, or even allow the end user to add or replace words with their own unique trained words. These trained words, or templates, created or replaced by the end user, are referred to as "speaker-dependent vocabularies." Since they depend on the speaker creating the word by speaking it, they are biased toward that individual speaker.
Essentially, speech recognition alone does not provide a complete natural interface; it must be combined with some method of speech output, via a speech synthesizer, to provide user feedback, retrieve and deliver information, and confirm user requests. In addition to providing a pleasant, natural voice, it is also important that a text-to-speech (TTS) engine is able to interpret and pronounce text correctly. Currencies, dates, e-mail addresses and abbreviations are only a few of the types of text that can be encountered by a TTS engine.
Fortunately, there are advanced text preprocessors, and even application-specific preprocessors such as e-mail, available today that can be utilized to deliver accurate spoken output.
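To make the normalization step concrete, here is a toy text preprocessor of the kind a TTS front end runs before synthesis. The rules and abbreviation lexicon are invented examples, not any product's actual rule set:

```python
import re

def normalize_for_tts(text):
    """Toy TTS text normalizer (illustrative rules only): expands a few
    abbreviations, reads simple dollar amounts, and spells out the
    punctuation in e-mail addresses so they can be spoken aloud."""
    abbreviations = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
    for abbr, expansion in abbreviations.items():
        text = text.replace(abbr, expansion)
    # Expand simple currency amounts like "$5" to spoken form.
    text = re.sub(r"\$(\d+)", r"\1 dollars", text)
    # Read e-mail addresses with spoken punctuation.
    text = text.replace("@", " at ").replace(".com", " dot com")
    return text

assert normalize_for_tts("Meet Dr. Smith at 5 Elm St.") == \
    "Meet Doctor Smith at 5 Elm Street"
assert normalize_for_tts("Send $5 to bob@example.com") == \
    "Send 5 dollars to bob at example dot com"
```

Production preprocessors are far more elaborate, handling context-dependent cases such as "St." meaning Street versus Saint, but the pipeline shape, rewriting written forms into speakable words before synthesis, is the same.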
Scott Pyles, Director of Product Management, Automotive Division, Lernout & Hauspie, Burlington, Mass.