CAMBRIDGE, Mass. The dominant user interface for 21st-century computing may well be voice, and it is coming soon. Not necessarily on the PC, although there are certainly add-on packages that handle voice input there. But voice interfaces are appearing first in a variety of information appliances and services that have one thing in common: no room for a keyboard.
A driving force in the transition to voice is the relentless increase in computing power available even in inexpensive devices. Voice recognition no longer requires a high-end PC with gobs of memory and a hard drive; a RISC chip with a few megabytes of memory can provide a serviceable interface for less than $20. Within a year or two, industry watchers say, a voice interface will not only take less space and be easier to use than a keyboard, it will be less expensive than adding a USB port and keyboard.
Information appliances are ideal candidates for voice interfaces. Even where a keyboard fits physically, such as in a WebTV, voice can provide a simpler interface for users who don't want to deal with the complexities associated with PCs for services like Internet access.
Command and control
Current voice-recognition systems excel at command and control. In these applications, the device accepts a limited number of voice commands that are appropriate to its function, such as "turn on," "select" or "connect." With only a few words to distinguish, recognition can be very accurate, even in noisy environments.
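The core idea can be sketched in a few lines. This is an illustrative toy, not any vendor's actual engine: the command set and the use of string similarity as a stand-in for a real acoustic score are assumptions for the example. With only a handful of legal commands, even a noisy hypothesis usually lands closest to the right one, and anything out of vocabulary can simply be rejected.

```python
from difflib import SequenceMatcher

# Hypothetical command set for a simple appliance. The tiny vocabulary
# is what makes command-and-control recognition robust in noise.
COMMANDS = ["turn on", "turn off", "select", "connect"]

def match_command(hypothesis, threshold=0.8):
    """Return the best-matching command, or None if nothing is close
    enough. SequenceMatcher stands in for a real acoustic/phonetic score."""
    best, score = None, 0.0
    for cmd in COMMANDS:
        s = SequenceMatcher(None, hypothesis.lower(), cmd).ratio()
        if s > score:
            best, score = cmd, s
    return best if score >= threshold else None

print(match_command("turn onn"))   # slightly garbled input still resolves
print(match_command("play jazz"))  # out-of-vocabulary input is rejected
```

The rejection threshold is the key design choice: a command-and-control device would rather ignore an utterance than act on the wrong one.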
Voice recognition has improved significantly over the past few years, due to a growing research effort in the field as well as the availability of cheap Mips to run the latest algorithms. Where older systems required users to speak one word at a time, a tedious task at best, current ones support natural or continuous speech. Newer systems also require far less training, often only a few minutes, to learn the speech patterns of a new user. In fact, many systems are now speaker-independent and require no training at all.
These systems recognize a variety of speech patterns, although they still have trouble with heavily accented speech. For example, MIT's Spoken Language Systems Group has developed a system called Jupiter that provides weather information over the telephone. The automated system uses a speaker-independent speech engine to process a request such as "What is the chance of rain in Chicago tomorrow?"
Jupiter recognizes about 2,000 words, including the names of hundreds of places around the world. It runs on a 500-MHz Pentium III in a standard PC, and the researchers have not optimized the code to reduce CPU or memory usage.
This type of application has many advantages for voice recognition. The limited context domain keeps the vocabulary to a manageable level. More important, it allows the software to interpret the meaning of the words, a task known as natural-language processing (NLP).
For example, the software might be unsure whether the user said "tomorrow" or "tornado." Both words are in the weather domain, but in the question cited above, "tomorrow" makes more sense than "tornado." If the software still cannot decode a phrase, the conversational interface provides natural opportunities to request clarification.
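The disambiguation step can be illustrated with a toy context model. This is not MIT's or any vendor's actual algorithm; the word-pair counts are invented for the example. The point is that within a narrow domain, a simple statistic over which words follow which resolves most acoustic confusions:

```python
# Assumed counts of word pairs from a hypothetical weather-domain corpus.
# A real system would use a full statistical language model.
CONTEXT_COUNTS = {
    ("chicago", "tomorrow"): 120,
    ("chicago", "tornado"): 4,
    ("warning", "tornado"): 85,
    ("warning", "tomorrow"): 3,
}

def disambiguate(prev_word, candidates):
    """Pick the candidate most often seen after prev_word in the domain."""
    return max(candidates, key=lambda w: CONTEXT_COUNTS.get((prev_word, w), 0))

# "...chance of rain in Chicago [tomorrow/tornado]?"
print(disambiguate("chicago", ["tomorrow", "tornado"]))
print(disambiguate("warning", ["tomorrow", "tornado"]))
```

Outside a limited domain, no such table is small enough to help, which is exactly the problem described later for dictation.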
This type of technology is being commercialized in an attempt to turn the telephone into the world's most widely installed information appliance. Motorola Inc. (Phoenix) is advertising "Mya, the 24-hour talking Internet." And Tellme Networks Inc. (Mountain View, Calif.) and a slew of other startups are racing to deploy voice portals to the Web.
Although the business models for the new phone-based services remain unproved, their voice technology is adequate today and will improve over time. These services run the voice interface on a remote server, taking advantage of the telephone's ability to make a voice connection. But there are many opportunities for voice interfaces that run locally on inexpensive computing devices.
Lernout & Hauspie, which recently purchased Dragon Systems Inc. (Burlington, Mass.), is a leading supplier of voice-recognition software. The company is based in Ieper, Belgium, the crossroads of Europe, where most citizens know three or four languages. This linguistic skill has helped L&H researchers develop speech engines for a variety of applications, said Klaus Schleicher, a director of product management there.
The company's low-end speech engine requires less than 200 kbytes of memory and less than 100 Mips of CPU power. Many 32-bit embedded processors, including the popular ARM7, can deliver this level of performance at prices well below $20.
Plenty of power
This speech engine provides speaker-independent recognition of up to 100 words. This may not sound like much, but it's plenty for many applications: programming the microwave, dialing the phone, sending a fax. For more complex devices, L&H offers a midrange speech engine with a vocabulary of up to 1,000 words, some of which may be entered by the user.
The larger vocabulary is suitable for accessing larger amounts of data. For example, a user could program a set-top box by saying the name of the show she wishes to record, or ask a digital audio system to play songs by naming a particular artist. This engine has its limitations, however. It is fine for command and control but not for generating e-mail messages or other arbitrary chunks of text.
In a true Internet appliance, users might have to spell out many URLs and words that are not in the limited vocabulary. To solve this problem, L&H offers a deluxe speech engine with a vocabulary of 20,000 words, twice as many as the average adult uses. This version does require some training, but only about five minutes for each new user.
Amazingly, even this engine can run on a relatively inexpensive device: L&H has built a demonstration unit code-named Nuk that runs this engine on a handheld computer containing a 200-MHz StrongARM processor and 32 Mbytes of DRAM. That's less than $50 worth of hardware at today's prices, likely to drop to $20 over the next few years.
Once systems go beyond command and control, however, voice input runs into problems. Even the best PC-based engines have an accuracy of 95 percent to 98 percent, and that means the user must correct several errors after dictating a few hundred words.
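The arithmetic behind that correction burden is worth making explicit. At 95 to 98 percent word accuracy, a few hundred words of dictation reliably leaves a handful of errors to fix by hand:

```python
# Expected corrections after dictation at a given word accuracy.
def expected_errors(words_dictated, accuracy):
    return round(words_dictated * (1 - accuracy))

for acc in (0.95, 0.98):
    print(f"{acc:.0%} accuracy, 300 words: ~{expected_errors(300, acc)} corrections")
```

Even at the optimistic end of the range, roughly six errors per 300 words means the user is constantly interrupting dictation to edit, which is why keyboards still win for bulk text entry.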
A key problem is that natural-language processing is much harder outside of a limited domain; an e-mail message could cover any topic. Without understanding the context, the software can do little more than try to match up nouns and verbs.
Today's programs can learn from their mistakes, improving their accuracy over time. Accuracy will also rise as faster processors become available. Voice recognition is a real-time process, so faster CPUs are needed to do more intelligent processing, particularly in NLP, to increase accuracy. More memory also helps, so voice interfaces will be aided by continuing declines in per-bit prices of memory.
Even at today's prices, there are many potential applications for voice input. Command-and-control interfaces are better suited to information appliances, which handle a small set of functions, than to the general-purpose PC. Even traditional appliances such as VCRs, microwave ovens and fax machines (in fact, any device with more than a few buttons on the front panel) could benefit from voice input.
Handheld computers are also good candidates. Today's devices typically require awkward handwriting recognition or tapping on a touch-screen keyboard. A voice interface with a 2,000-word vocabulary could handle all commands plus many common words, names and URLs. Other names could be spelled by voice ("Jack I-N-M-A-N"). Entering general text would require a lot of spelling out loud or using the touch screen. A Web pad will typically have a faster processor and more memory, so it could implement a larger vocabulary, making it more suitable for composing e-mail and other text.
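The spelled-name scheme is simple to sketch. This is a hypothetical illustration of how a front end might merge recognizer output, not a description of any shipping product: in-vocabulary words pass through as words, while runs of single letters ("I-N-M-A-N") are fused back into one out-of-vocabulary name.

```python
def assemble(tokens):
    """Join recognized tokens: vocabulary words pass through as-is,
    while consecutive single-letter tokens fuse into one spelled word."""
    out, letters = [], []
    for tok in tokens:
        if len(tok) == 1 and tok.isalpha():
            letters.append(tok)
        else:
            if letters:  # a spelled run just ended; emit it as a name
                out.append("".join(letters).capitalize())
                letters = []
            out.append(tok)
    if letters:  # flush a trailing spelled run
        out.append("".join(letters).capitalize())
    return " ".join(out)

# "send email to jack i n m a n" -> the name is reassembled
print(assemble(["send", "email", "to", "jack", "i", "n", "m", "a", "n"]))
```

A real interface would also need an explicit "spell mode" cue, since single-letter words like "a" and "I" are ambiguous in normal speech.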
Hands on the wheel
The car is one of the first areas that will benefit from voice input. As cell phones, navigation systems and Internet access move into the cockpit, it is important for drivers to keep their hands on the wheel. Voice input is available today on some Jaguars and will be in some Lincolns this fall. Delphi recently announced a Palm V car dock that features an L&H speech engine with a 200-word vocabulary, allowing drivers to access their address and date books on the go. Clarion's AutoPC, available now as an aftermarket upgrade, also features voice input.
PCs, ironically, may be one of the last devices to embrace voice input. The best use of voice on today's PCs would be command and control, replacing (or supplementing) pull-down menus and icons. Microsoft Corp. (Redmond, Wash.), however, has chosen not to offer this function. That will no doubt change in the future, since the OS vendor has bought a piece of L&H.
Dictation software, such as IBM's ViaVoice and Dragon's NaturallySpeaking, is becoming more popular, but keyboards remain the fastest and most accurate way for most people to enter lots of text.
Another concern is that PC users are often packed into cubicle farms, where lots of people talking to their computers could become annoying. For now, consumers will be talking to their cars, handhelds and other appliances. In time, even PC users will spend more time talking to their computers.