Eighty million Americans carry around a device that has access to the Internet. It offers vast amounts of information, instant messaging, e-mail and commerce. That device is a cell phone. Yet even though phones are made to be spoken to, the Internet isn't, so cell phone users who want directions to a restaurant or stock quotes have to punch in the query using the phone's buttons. Spelling with a tiny numeric keypad is an exercise that makes even the dentist seem like fun.
The solution is to use your voice to tell your phone what you want it to do. But how will speech recognition be implemented, on the client or the server? I believe it should be on the client.
There will certainly be a market for server-based speech recognition for nonmobile devices due to the large installed base of land-line telephones. Land-line phones can't do local processing, but they have the high-quality connection needed to execute high-performance speech recognition on remote servers. Server-based speech applications are perfect for calling in on a land line and checking a credit card balance, for example.
For mobile devices, however, server-based speech interfaces bring a number of problems. The connection between phone and server is inconsistent. As a result, the recognizer is constantly bombarded by speech that is noisy, interrupted, delayed or distorted. Under those conditions, server-based speech recognizers are bound to fail, and they do so frequently enough to deter most long-term users.
Cell phones have a CPU powerful enough for embedded speech applications, spare memory and a well-designed acoustic-capture system. They do not drop bits, delay packets or add acoustic interference between two channels. A local speech engine can get to know your personal speech idiosyncrasies (after all, the recognizer is yours) and can understand you accurately. It can also read the acoustics of the environment you're in and adjust to local conditions.
When speech is translated into data locally, that data can be sent across the network, and the reply can come back to the phone the same way. You can ask for a stock quote while running to catch a cab, and get the answer in digits, on your screen and in the memory of the phone, where you can use it.
Recent developments in speech recognition software designed specifically for mobile platforms, as well as advances in hardware performance, are about to overcome the barriers that made embedding speech recognition in mobile devices difficult. Cell phones will offer speech access that is personalized, fast and efficient. That will let the Internet do what it does best: move and manage data. The time has come to put speech recognition where it belongs: in your hand.
Jordan Cohen is Chief Technology Officer for Voice Signal Technologies Inc. (Cambridge, Mass.).