In a quiet, controlled environment, today’s speech recognition engines have become quite effective. Whether dictating with a quality headset in a quiet office or speaking search phrases to a smartphone in a silent room, users now commonly achieve hit rates close to 100 percent. However, adding even a few disturbances tends to degrade performance quickly.
The automobile environment is one of the most challenging in this respect. A variety of noise sources both outside of the car (passing cars, honking horns) and inside (multiple passengers talking, the air conditioning fan, the radio) along with audio reverberations off the hard surfaces result in the lackluster performance with which many car owners are familiar.
Further, in order to avoid false triggers, the driver of the car needs to push a button to trigger the speech command system. This is not just a nuisance but also a safety hazard.
Yet few applications could benefit more from speech recognition for voice command operation than the automobile. It is therefore of great value if technology can make speech recognition more effective in cars, detecting commands reliably in the presence of all these disturbances and without button presses. Although this is fundamentally a speech recognition problem, the performance improvements will come primarily from processing the voice input signal to remove noise and disturbances.
In recent years, Voice Input Processing (VIP) has been one of the key areas to which Conexant has applied its vast experience in audio technology. Through careful design from the microphone interface onward, providing clean bias signals, low-noise pre-amplification and gain control, and by implementing complex digital signal processing algorithms on its high-performance yet low-power DSPs, Conexant has been able to deliver VIP devices for a number of applications including TVs, home appliances and automobiles. Within those applications, one of the primary advantages of the Conexant solution is improved performance of speech recognition engines: the solution has been optimized for many of the common speech recognition algorithms for use in challenging environments.
To achieve superior performance, several algorithms are employed to enhance the desired input signal and suppress noise sources in a coordinated manner. Conexant’s Selective Source Pickup (SSP) algorithm is uniquely able to separate the desired signal from the noise sources by analyzing statistical and spatial information in the signal.
The interference coming from the local loudspeakers is cancelled with Conexant’s advanced Multi-channel Acoustic Echo Canceller (MAEC), reverberation is suppressed with a novel de-reverberation algorithm, and the remaining environmental noise is attenuated by a Non-Stationary Noise Reduction (NSNR) algorithm. Tuning these algorithms together, particularly for a specific speech recognition engine, can vastly improve the word hit rate without any changes to the speech recognition system.
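To make the echo cancellation step concrete, here is a minimal single-channel sketch using the textbook normalized LMS (NLMS) adaptive filter. It is a stand-in for illustration only: it is not Conexant’s MAEC, it handles one loudspeaker channel rather than several, and the function names, filter length and step size are assumptions made for this example. The idea is that the system already knows what it is sending to the loudspeaker, so it can learn the loudspeaker-to-microphone path and subtract the predicted echo from the microphone signal.

```python
import random


def simulate_echo(speaker, path):
    """Hypothetical echo path: the mic hears a short FIR-filtered copy of the speaker."""
    return [
        sum(h * speaker[i - k] for k, h in enumerate(path) if i - k >= 0)
        for i in range(len(speaker))
    ]


def nlms_echo_cancel(speaker, mic, taps=8, mu=0.5, eps=1e-6):
    """Subtract the estimated loudspeaker echo from the mic signal.

    speaker: samples sent to the loudspeaker (the known reference)
    mic:     samples picked up by the microphone
    Returns (residual, weights), where residual is the mic signal with the
    predicted echo removed and weights is the learned echo-path estimate.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent speaker samples, newest first
    residual = []
    for x, d in zip(speaker, mic):
        history = [x] + history[:-1]
        predicted_echo = sum(w * h for w, h in zip(weights, history))
        error = d - predicted_echo
        # Normalizing by the reference energy keeps the step size stable
        # regardless of how loud the loudspeaker signal is.
        energy = sum(h * h for h in history) + eps
        gain = mu * error / energy
        weights = [w + gain * h for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights


if __name__ == "__main__":
    rng = random.Random(1)
    radio = [rng.uniform(-1.0, 1.0) for _ in range(3000)]  # stand-in for radio audio
    mic = simulate_echo(radio, [0.6, -0.3, 0.1])           # assumed echo path
    cleaned, learned = nlms_echo_cancel(radio, mic)
    print("learned path (first taps):", [round(w, 3) for w in learned[:3]])
```

In this sketch the microphone hears only the echo, so the residual should shrink toward zero as the filter converges. A practical canceller also has to cope with the talker speaking at the same time as the loudspeaker (double-talk), much longer echo paths, and several loudspeaker channels, none of which is modeled here.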
Figure 1. Disturbances in automobile environment
Selective source pickup (SSP)
Independent Component Analysis (ICA) is an area of research within audio technology that attempts to separate or extract different voice or noise sources. Established in the early 90s, it is based on the idea that the underlying sources of a mixed signal are statistically independent. Using prior knowledge of the statistics of certain types of signals, combined with the measured correlation parameters, adaptive techniques can in fact separate or “de-mix” the combined signal to extract one or more of the underlying sources. Typically, however, ICA algorithms require so much processing power and memory that they are impractical for implementation in embedded real-time systems.
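As a rough illustration of the de-mixing idea, the following Python sketch separates two synthetic sources from two simulated microphone signals. It is not Conexant’s SSP algorithm. It assumes an instantaneous mixture with no delays or reverberation, which is far simpler than a real car cabin, and it uses kurtosis as the measure of statistical independence. The function names, the mixing levels and the synthetic sources are all hypothetical choices made for this example.

```python
import math
import random


def whiten(x1, x2):
    """Center two channels, decorrelate them, and scale each to unit variance."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    a = [v - m1 for v in x1]
    b = [v - m2 for v in x2]
    c11 = sum(v * v for v in a) / n
    c22 = sum(v * v for v in b) / n
    c12 = sum(p * q for p, q in zip(a, b)) / n
    # A rotation by phi diagonalizes the symmetric 2x2 covariance matrix.
    phi = 0.5 * math.atan2(2.0 * c12, c11 - c22)
    c, s = math.cos(phi), math.sin(phi)
    u = [c * p + s * q for p, q in zip(a, b)]
    v = [-s * p + c * q for p, q in zip(a, b)]
    su = math.sqrt(sum(t * t for t in u) / n)
    sv = math.sqrt(sum(t * t for t in v) / n)
    return [t / su for t in u], [t / sv for t in v]


def kurtosis(z):
    """Excess kurtosis of a zero-mean signal (0 for a Gaussian)."""
    n = len(z)
    m2 = sum(t * t for t in z) / n
    m4 = sum(t ** 4 for t in z) / n
    return m4 / (m2 * m2) - 3.0


def demix(x1, x2, steps=90):
    """Search for the rotation of the whitened channels that is least Gaussian."""
    w1, w2 = whiten(x1, x2)
    best = None
    for k in range(steps):
        # A quarter turn covers every case up to swapping or negating outputs.
        theta = (math.pi / 2.0) * k / steps
        c, s = math.cos(theta), math.sin(theta)
        y1 = [c * p + s * q for p, q in zip(w1, w2)]
        y2 = [-s * p + c * q for p, q in zip(w1, w2)]
        score = abs(kurtosis(y1)) + abs(kurtosis(y2))
        if best is None or score > best[0]:
            best = (score, y1, y2)
    return best[1], best[2]


def correlation(a, b):
    """Pearson correlation coefficient between two equal-length signals."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    den = math.sqrt(sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b))
    return num / den


def demo(n=4000, seed=0):
    rng = random.Random(seed)
    # Hypothetical sources: a spiky, speech-like signal and uniform "fan" noise.
    talker = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(n)]
    noise = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # Hypothetical mixing: each microphone hears both sources at different levels.
    mic1 = [1.0 * t + 0.6 * v for t, v in zip(talker, noise)]
    mic2 = [0.5 * t + 1.0 * v for t, v in zip(talker, noise)]
    y1, y2 = demix(mic1, mic2)
    return talker, noise, y1, y2
```

Calling `demo()` returns the two original sources and the two recovered signals. Each recovered signal should correlate strongly with one of the sources, up to an arbitrary sign, scale and ordering, which is the usual ambiguity of this kind of separation. The brute-force angle search is only workable because there are two channels; realistic mixtures with delays and reverberation need far heavier machinery, which is the cost the paragraph above refers to.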
Conexant’s SSP algorithm uses some of the fundamental ideas from ICA but reduces the processing and memory requirements to a practical level, while still delivering on the promise of separating one talker from another talker, or from the environmental noise, using only two microphones. The decision of which source to extract can be made in real time: the algorithm can simply extract the dominant talker, or it can use the position of the talker with respect to the microphones to decide which signal to extract. In effect, this allows the VIP to zoom in on a single talker in a room or car filled with interference from other sources, which can be extremely useful for a speech recognition application in an automobile environment.
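One generic way to obtain position information from two microphones is to measure which microphone the sound reaches first. The sketch below estimates that arrival-time difference with a plain cross-correlation search. It is an assumption-laden illustration rather than a description of how SSP works: the function names, the lag range and the "a"/"b" labels are hypothetical, and reverberation and competing talkers are ignored.

```python
import random


def estimate_delay(mic_a, mic_b, max_lag):
    """Return the lag in samples by which mic_b trails mic_a.

    A positive result means the sound reached mic_a first; a negative
    result means it reached mic_b first.
    """
    n = len(mic_a)
    best_lag, best_value = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        value = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                value += mic_a[i] * mic_b[j]
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


def closer_microphone(mic_a, mic_b, max_lag):
    """Label which microphone the source is nearer to (hypothetical labels)."""
    lag = estimate_delay(mic_a, mic_b, max_lag)
    if lag > 0:
        return "a"
    if lag < 0:
        return "b"
    return "center"


if __name__ == "__main__":
    rng = random.Random(2)
    source = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
    near = source                       # microphone next to the talker
    far = [0.0] * 3 + source[:-3]       # same sound arriving 3 samples later
    print(estimate_delay(near, far, 8), closer_microphone(near, far, 8))
```

With two microphones a single delay only indicates a direction relative to the microphone pair, for example driver side versus passenger side, not a full position.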