Speech technologies, which include automatic speech recognition (ASR) and text-to-speech (TTS), have experienced profound growth over the past several years. They are projected to become a $1.4B market segment worldwide by 2004, especially as so-called multimodal multimedia applications take hold in wireless phones and voice-enabled, small-footprint, Internet-centric personal computing appliances.
The establishment of standardized development and runtime environments such as the Speech Application Language Tags (SALT) specification and VoiceXML will serve to move the current level of deployment of speech-enabled applications to a new high, by lowering development costs both in terms of the need for specialized knowledge and the establishment of common development tools, as well as by leveraging infrastructure developments already in place to support web activity.
One of the primary venues for the deployment of speech technologies thus far has been Interactive Voice Response (IVR) systems, which enable callers to use their wired or wireless phone to retrieve information or complete business transactions.
The first generation of IVR applications relied on simple menu navigation through Dual Tone Multi Frequency (DTMF) signaling from the phone keypad. With the maturation of ASR and TTS algorithms, the Moore's Law price/performance increase in processing power, and the establishment of new development environments for speech applications, the latest generation of speech-enabled IVR applications has greatly expanded the scope of applications that can be realistically deployed.
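As a brief illustration of the DTMF signaling mentioned above: each keypad key is signaled as the sum of one low-group (row) tone and one high-group (column) tone. The sketch below maps a key to its standard frequency pair. The function name `dtmf_pair` is an illustrative choice and is not part of any IVR platform.

```python
from typing import Tuple

# Standard DTMF layout: keys in a row share a low tone, keys in a column share a high tone.
ROW_HZ = (697, 770, 852, 941)
COL_HZ = (1209, 1336, 1477, 1633)
KEYPAD = (
    "123A",
    "456B",
    "789C",
    "*0#D",
)

def dtmf_pair(key: str) -> Tuple[int, int]:
    """Return the (low, high) tone pair in Hz for a single keypad key."""
    if len(key) == 1:
        for row, keys in enumerate(KEYPAD):
            col = keys.find(key)
            if col != -1:
                return ROW_HZ[row], COL_HZ[col]
    raise ValueError(f"not a DTMF key: {key!r}")

print(dtmf_pair("5"))  # (770, 1336)
```

A first-generation IVR menu simply decodes these tone pairs back into digits and branches on them.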
Central to speech-enabled IVR applications is the "Voice Gateway." It includes application servers, which serve up IVR script "documents" containing the scripted interaction with the caller to the underlying speech platform. This includes the prompts to play to the caller (either as recorded audio or using TTS for dynamic text) as well as the expected caller input using ASR.
While speech platforms have existed in the past, the evolution to standardized development and runtime environments is the key to driving speech to the mainstream. Standardized environments enable application and platform interoperability between client/server architectures, and allow a large number of developers to create innovative applications that will drive customer demand. While there have long been standards for building the Web world of graphics and text, building IVR and speech applications has traditionally used proprietary interfaces, resulting in the inability to bring these two worlds together to create more powerful multimodal interfaces. The SALT specification, drafted and published by the SALT Forum in July 2002, addresses the need of adding speech to the world of the Web.
Established in October of 2001 by a group of industry leaders (Cisco, Comverse, Intel, Microsoft, Philips Speech Processing and SpeechWorks), the SALT Forum currently is supported by over 50 companies. The specification has been submitted to the World Wide Web Consortium (W3C) for consideration as a standard for the development of voice user interfaces. The W3C is currently in the process of evaluating the SALT specification, in conjunction with VoiceXML, a complementary speech application development language intended primarily for telephony-only applications, to establish a comprehensive speech application development language suitable for the widest range of deployment environments.
As a lightweight set of four tags, SALT is compatible with existing (and future) markup languages, and is intended to help web developers easily add speech capabilities to markup-based documents such as HTML and WAP.
To address a wide range of device capabilities, SALT offers two modes of operation. For devices such as cell phones and wireless personal computing devices with limited local processing power that use "down level" browsers, the "declarative mode" facilitates multimodal capabilities by binding speech input and output to graphical elements, albeit resulting in a more "directed dialog" question/response manner.
For more powerful devices capable of supporting "up level" browsers, SALT offers an "object mode" that gives the developer a richer set of user interaction capabilities through the individual SALT elements, and finer control that results in a more natural, conversational interface of the "mixed initiative" type; for example, user input can redirect the call flow.
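The difference between a "directed dialog" and a "mixed initiative" interaction can be sketched as slot filling. The code below is plain Python, not SALT markup or a SALT API; the slot names, the function names and the simulated recognition result are all hypothetical and serve only to illustrate the two styles.

```python
from typing import Dict, Optional

SLOTS = ["origin", "destination", "date"]  # hypothetical form fields

def next_prompt(form: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the first unfilled slot, or None when the form is complete."""
    for slot in SLOTS:
        if form.get(slot) is None:
            return slot
    return None

def directed_turn(form: Dict[str, Optional[str]], asked_slot: str,
                  result: Dict[str, str]) -> None:
    """Directed dialog: only the slot that was asked for is accepted."""
    if asked_slot in result:
        form[asked_slot] = result[asked_slot]

def mixed_initiative_turn(form: Dict[str, Optional[str]],
                          result: Dict[str, str]) -> None:
    """Mixed initiative: every recognized slot is bound, so the caller can
    volunteer extra information and thereby change what is asked next."""
    for slot, value in result.items():
        if slot in form:
            form[slot] = value

# Simulated recognizer output for "from Denver to Boston", spoken when asked for the origin.
result = {"origin": "Denver", "destination": "Boston"}

directed = dict.fromkeys(SLOTS)
directed_turn(directed, "origin", result)

mixed = dict.fromkeys(SLOTS)
mixed_initiative_turn(mixed, result)

print(next_prompt(directed))  # destination
print(next_prompt(mixed))     # date
```

In the directed case the system still has to ask for the destination, even though the caller already said it; in the mixed-initiative case the caller's extra input moves the dialog directly to the date.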
By providing the capability to play and record audio to/from the caller, recognize the caller's commands to the IVR system, and then bind those responses to standard markup languages, SALT readily provides developers with the ability to add the power of speech interfaces to their applications. SALT can be used in either traditional telephony-based IVR applications, where the caller only interacts with the system using a telephone, or in multimodal environments, where the user interacts by speaking to a device that provides text output on a screen. The ease with which SALT can be integrated into existing markup-based applications makes it a natural development environment for the migration of web applications to multimodal devices such as the next generation of handheld computers (Compaq iPAQ), tablet PCs and cell phones.
Using SALT within a traditional telephony-based IVR application presents so-called "Voice User Interface (VUI)" challenges that differ greatly from those in the case of multimodal applications. Looking first at the simpler case of multimodal, SALT enables the use of speech as alternate input (ASR) and output (TTS) media for the normal graphical user interface (GUI). In most instances, little modification between the GUI and VUI is required, since the use of speech to replace manual input serves mainly to ease the user's difficulty in physically interacting with the device, making the addition of speech relatively straightforward.
Memory issues
However, in the telephony-only paradigm, the VUI may, out of necessity, differ dramatically from the GUI based upon the amount of data that the GUI presents at any one time. Psychological studies have suggested that, due to human short-term memory limitations, providing more than seven choices at any given stage within an audio dialog will result in callers forgetting their choices, leading to IVR input errors. Common experience leads to the conclusion that most Web-based GUI applications provide far more than this number of choices. In addition, GUI data has "persistence" in that often, by looking at different parts of the screen, the user is able to retrieve information that they may need to make a decision about something else. This persistence for VUI-only applications is completely dependent upon the caller's ability to remember earlier parts of the interaction.
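One way a VUI designer might respect the seven-choice guideline is to break a long GUI list into several spoken menus. The sketch below is an illustrative assumption, not a technique prescribed by SALT: the function name `paginate_menu` and the "more options" wording are hypothetical.

```python
from typing import List

MAX_CHOICES = 7  # the limit cited in the text, treated here as a design guideline

def paginate_menu(options: List[str], max_choices: int = MAX_CHOICES) -> List[List[str]]:
    """Split options into spoken pages of at most max_choices items each.

    When more than one page is needed, every page except the last ends with
    a "more options" choice, which counts toward the limit.
    """
    if max_choices < 2:
        raise ValueError("max_choices must be at least 2")
    remaining = list(options)
    pages: List[List[str]] = []
    per_page = max_choices - 1  # leave room for "more options"
    while len(remaining) > max_choices:
        pages.append(remaining[:per_page] + ["more options"])
        remaining = remaining[per_page:]
    pages.append(remaining)
    return pages

pages = paginate_menu([f"item {i}" for i in range(10)])
print([len(p) for p in pages])  # [7, 4]
```

Paging trades a longer call for fewer memory errors; it does nothing for the "persistence" problem, which still has to be handled by letting the caller ask for a repeat.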
Development tools are an essential part of giving non-speech scientists the ability to rapidly create applications. The development of a VUI requires an understanding of human conversational interaction (how people are likely to respond to questions) and is much more art than hard science.
So-called "dialog components" seek to encapsulate this VUI experience into application objects that perform specific actions, such as collecting yes/no responses, zip codes and addresses, selecting items from a list, etc. Vendors, by providing these types of components, seek to enable non-speech scientists to quickly develop speech-enabled applications.
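To make the idea of a dialog component concrete, here is a minimal sketch of a yes/no collector. It is not any vendor's component: the word lists, the retry wording, and the `listen`/`say` callbacks (stand-ins for the platform's ASR and TTS) are all assumptions made for illustration.

```python
from typing import Callable, Optional

YES_WORDS = {"yes", "yeah", "yep", "correct", "right"}
NO_WORDS = {"no", "nope", "incorrect", "wrong"}

def interpret_yes_no(utterance: str) -> Optional[bool]:
    """Map recognized text to True/False, or None if it is unclear."""
    words = {w.strip(".,!?") for w in utterance.lower().split()}
    said_yes = bool(words & YES_WORDS)
    said_no = bool(words & NO_WORDS)
    if said_yes == said_no:  # neither heard, or both heard
        return None
    return said_yes

def ask_yes_no(question: str,
               listen: Callable[[], str],
               say: Callable[[str], None],
               max_attempts: int = 3) -> Optional[bool]:
    """Ask until a yes/no answer is understood or the attempts run out."""
    for attempt in range(max_attempts):
        prefix = "" if attempt == 0 else "Sorry, please say yes or no. "
        say(prefix + question)
        answer = interpret_yes_no(listen())
        if answer is not None:
            return answer
    return None

# Usage with scripted caller replies in place of a real recognizer.
replies = iter(["um, maybe", "yes please"])
spoken = []
result = ask_yes_no("Is this correct?", lambda: next(replies), spoken.append)
print(result, len(spoken))  # True 2
```

The value of such a component is that the retry logic and the accepted vocabulary are written once, so the application developer only supplies the question.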
In addition, the creation of development environments that support VUI tools such as dialog components in forms readily recognizable by experienced web developers seeks to facilitate speech adoption. For example, Microsoft has recently announced the availability of the .NET Speech Software Development Kit (SDK), which provides a rich development toolset for generating SALT-based applications, and includes question/answer dialog components, grammar creation / editing tools and audio editing capabilities seamlessly integrated with its widely used Visual Studio development product.
An important requirement for widespread application of a development language for speech technologies is in the area of call control. To support sophisticated features such as conferencing, call forwarding and find me/follow me, SALT uses two mechanisms: a formal call control object with well-defined methods (or actions) associated with it, and a general-purpose key/value messaging capability that enables an application to exchange messages with the voice gateway platform. This latter capability also allows for extension of platform capabilities to support features beyond those supported by the core specification, ensuring constant innovation and enhancement.
Wireless access represents the next generation of connectivity, but the design criteria for the devices to provide that access must deal with the conflicting requirements of extending battery life, reducing size and increasing processing power. For speech technologies, the engineering questions center on the appropriate architectures to optimize performance and usability.
Central to maximizing battery life is minimizing the amount of data that the device transmits to the receiving station. This suggests embedding the speech resources within the device itself, which assumes adequate processing power and potentially the ability to use high-bandwidth 2.5G and 3G networks to push grammars and other speech components to the device for local computation. Alternatively, handsets with minimal local processing capability that rely on network-based speech resources may allow for reduced size and battery requirements.
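A back-of-envelope comparison shows why the amount of transmitted data differs so much between the two architectures. The figures below are illustrative assumptions rather than measurements: telephone-quality audio at 8 kHz with 8-bit samples, a three-second utterance, and a short text result sent by a device that recognizes locally.

```python
SAMPLE_RATE_HZ = 8000   # assumed telephone-quality sampling rate
BYTES_PER_SAMPLE = 1    # assumed 8-bit samples, i.e. 64 kbit/s uncompressed

def audio_bytes(seconds: float) -> int:
    """Uplink payload if the raw utterance is sent to network-based ASR."""
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)

def text_bytes(result: str) -> int:
    """Uplink payload if the device recognizes locally and sends only the result."""
    return len(result.encode("utf-8"))

utterance_seconds = 3.0            # assumed utterance length
recognized = "destination=Boston"  # hypothetical recognition result

network_asr = audio_bytes(utterance_seconds)
embedded_asr = text_bytes(recognized)
print(network_asr, embedded_asr)  # 24000 18
```

The sketch ignores audio compression, the one-time download of grammars to the device, and the energy spent on local computation, all of which would have to be weighed in a real design.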
In addition, the very nature of multimodal applications may play a role in the optimal configuration, since in general a multimodal interface minimizes the complexity of the ASR task; selecting a known item from a drop-down list presents less of a challenge than recognizing open-ended input from a caller unable to see the selections. But then again, the ability to use the same device in a non-visual environment, such as while driving, suggests a wide range of implementations may be required.
See related chart