The market demand for voice-controlled applications is expected to triple in the next three years. Much of this growth will take place in the telephony market where, increasingly, phones will be controlled by voice commands. Other areas include toys and hand-held equipment such as calculators, voice-controlled climate and security systems, voice-controlled home appliances, and voice-controlled automobile functions (stereos, windows, climate control, lights, and cruise control).
Columns Ltd. (Singapore) is a marketer of low-cost "give-away" products that its customers typically distribute as promotional premiums. Columns was interested in adding low-cost, hand-held, voice-controlled products to its portfolio. The first would be a voice-controlled Euro-currency changer that converts Euros (the new currency of the European Union) to and from other European currencies (Deutsche marks, francs, lire).
The Euro-changer Columns envisioned had several constraints. It had to be extremely low-power, with a battery life of at least a year. It had
to be extremely low cost, so that the entire product could retail for $9 or less. And it had to have the flexibility to accurately recognize and synthesize speaker-dependent speech in multiple languages. Finally, since Columns wanted to create an entire portfolio of voice-controlled products, the design had to be reusable.
High-level synthesis and language
Columns, which has no internal electronic-design expertise, investigated a variety of speech-recognition alternatives with various design houses, including commercially available software and standard voice-recognition ICs. Unfortunately, the cost and power consumption of all these solutions were too high. This left a custom ASIC as the only realistic option, and the company selected Frontier Design BV (Holland) to develop the Euro-changer ASIC, as well as the end product. Frontier Design develops C-language based, system-level intellectual property (IP) and also provides design services for the integration of that IP into systems-on-a-chip (SOCs).
Implementing complex DSP algorithms in ASICs is frequently a demanding task, but we used Frontier's A|RT Designer, a high-level architectural-synthesis tool that helped us arrive at an optimized RT-level description very quickly. It also gave us the freedom to explore alternative architectures to optimize the design for the application, as demonstrated by the optimization of the FFT and exponential functions.
Additionally, the C-based design flow let us design in new features and optimize their hardware during the architectural-design phase, and gain a 50 percent reduction in silicon area at the expense of available clock ticks. The rapid cycle from C to prototype hardware also made it easier to meet demanding product specifications.
Developing the algorithms
Like most algorithm designers, Frontier Design develops algorithms in behavioral C. The company also maintains its algorithmic and system IP in C, rather than in VHDL or Verilog, because it believes this enhances reusability and makes highly optimized implementations achievable. Making even a slight change to an RT-level HDL IP core requires that the design be almost completely redone, while modifying a C-language IP core is only a matter of changing a few lines of code.
Figure 1 - Voice-command comparison. The Euro-changer's effectiveness is judged, partly, on its ability to compare a voice command to a stored database and then carry out that command.
Developing an algorithm that met the end-product requirements was critical to the success of the design. Since nobody wants a voice-controlled device that doesn't consistently recognize commands, the algorithm needed to deliver 98 percent (or better) accuracy. Special challenges included detecting and eliminating background noise, differentiating between the actual word and other noise (breathing, clicking, microphone rumbling), determining the beginning and end of words, and comparing the input against a stored "voice-finger-print" database to identify the word (see Figure 1). Several advanced, computationally-intensive DSP algorithms were adapted for the task:
- the Mel frequency cepstral coefficient (MFCC) algorithm, which consists of the calculation of a fast Fourier transform (FFT) power spectrum, followed by Mel scaling, a logarithm, and an inverse discrete cosine transform (iDCT);
- continuous noise-level estimation routines that use multiple estimates and a selection algorithm to continuously identify and eliminate background sounds and speech artifacts;
- coarse and fine word-boundary detection algorithms that perform detailed analyses of the energy levels during and surrounding the word;
- a dynamic time warp algorithm that compares series of vectors of unequal length and accommodates duration variations within the series.
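The dynamic time warp comparison named above can be sketched in a few lines of floating-point C. This is a textbook formulation under stated assumptions, not Frontier's code: the names `dtw_distance` and `frame_distance`, the frame size `DIM`, the length limit `MAX_FRAMES`, and the city-block local distance are all hypothetical choices made for illustration.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

#define DIM 2          /* coefficients per frame (hypothetical value) */
#define MAX_FRAMES 64  /* longest sequence handled (hypothetical value) */

/* City-block distance between two DIM-element feature vectors. */
static double frame_distance(const double *a, const double *b)
{
    double sum = 0.0;
    for (size_t k = 0; k < DIM; ++k) {
        double diff = a[k] - b[k];
        sum += (diff < 0.0) ? -diff : diff;
    }
    return sum;
}

static double min3(double a, double b, double c)
{
    double m = (a < b) ? a : b;
    return (m < c) ? m : c;
}

/* Accumulated cost of the cheapest alignment of x (n frames) with y (m frames).
 * Frames are stored back to back, DIM values each. Not reentrant: the cost
 * table is a static buffer. */
double dtw_distance(const double *x, size_t n, const double *y, size_t m)
{
    static double cost[MAX_FRAMES + 1][MAX_FRAMES + 1];

    assert(n >= 1 && n <= MAX_FRAMES && m >= 1 && m <= MAX_FRAMES);

    /* Mark every cell as unreached, then open the common starting point. */
    for (size_t i = 0; i <= n; ++i)
        for (size_t j = 0; j <= m; ++j)
            cost[i][j] = DBL_MAX;
    cost[0][0] = 0.0;

    for (size_t i = 1; i <= n; ++i) {
        for (size_t j = 1; j <= m; ++j) {
            /* A path may arrive by advancing in x only, in y only, or in both.
             * At least one of these predecessors is always reachable. */
            double best = min3(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]);
            cost[i][j] = frame_distance(x + (i - 1) * DIM, y + (j - 1) * DIM) + best;
        }
    }
    return cost[n][m];
}
```

A word spoken slowly produces more frames than the same word spoken quickly. Because the alignment path may dwell on a frame, the two versions can still score a distance of zero, which is the property that makes the technique attractive for matching a command against a stored voice print.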
The algorithms were written in floating-point C, which can be quickly compiled and simulated to verify algorithmic behavior while parameters are tuned and optimized. C code can also be run on a conventional PC, so the behavior of the speech recognition and synthesis algorithms can be tested in real situations. The final speech-recognition algorithm (see Figure 2) was tested on a 450 MHz Pentium and delivered 99 percent accuracy against the company's internal library of speech recordings.
Conversion to fixed-point
A silicon implementation requires that floating-point algorithms be converted to a fixed-point format - a non-trivial task that is frequently carried out using trial-and-error techniques. Both dynamic range and precision must be guarded. If the range of a regular fixed-point operator is too small, the operator may wrap around (for example, (max+1) yields (min)), resulting in serious clipping and errors. Fixed-point precision is equally important, especially in repetitive signal-processing calculations. When precision is insufficient, the repeated arithmetic propagates and accumulates errors, such that the signal information may gradually degrade to white noise - a fairly catastrophic situation for a voice-controlled product. Lots of dynamic and precision bits eat lots of silicon, however, so it's important to keep the word-widths as small as possible without altering the algorithm's behavior.
Frontier has a library of C++ classes, called A|RT Library, that provides tools for the analysis of the fixed-point behavior of the C-code. It allows the specification of multiple fixed-point data types, and offers bit-true modeling of multiple overflow behaviors, such as saturation and wrapping, and of multiple quantization models, such as truncation and round-to-zero. The original 32-bit floating-point speech-recognition algorithm had data coming in at 8 kHz with typical signal word-widths of 32 bits and memory-storage requirements of several kilobytes. The output of a typical speech user-interface is measured in several bytes per second.
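The difference between the wrapping and saturating overflow behaviors can be shown in plain C. This is an illustrative stand-in, not the A|RT Library interface, which the article does not show; the function names `add_wrap` and `add_saturate` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 16-bit add with wrap-around: (max + 1) yields (min). */
int16_t add_wrap(int16_t a, int16_t b)
{
    /* Sum in unsigned arithmetic, where overflow is well defined. */
    uint16_t sum = (uint16_t)((uint16_t)a + (uint16_t)b);
    /* Converting back is implementation-defined in C, but gives the
     * two's-complement value on mainstream compilers. */
    return (int16_t)sum;
}

/* 16-bit add with saturation: the result is clipped at the range limits. */
int16_t add_saturate(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;  /* cannot overflow in 32 bits */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}
```

With wrapping, a signal that slightly exceeds full scale flips to the opposite extreme. With saturation it is merely clipped, at the price of extra comparison logic in hardware.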
Merging code for successful execution
Analysis showed that 16 bits were sufficient for global data-types and arrays (1 bit for the sign, 10 bits for dynamics, and 5 bits for precision) to maintain the algorithm's accuracy without introducing noise. However, the highly-repetitive FFT sub-routines required eight dynamic bits and seven precision bits, plus a sign bit. Typically, this type of analysis would lead to the global use of an 18-bit word-width (1 sign bit, 10 dynamic bits, and 7 precision bits) to meet the maximum number of dynamic and precision bits required for any operation. Since A|RT Library allows the word-width to be changed dynamically, however, the global data-types were defined with 1 sign bit, 10 dynamic bits, and 5 precision bits, while the MAC results of FFTs were assigned 1 sign bit, 8 dynamic bits, and 7 precision bits. Thus, the word-width of the design (including the buses) was kept to just 16 bits. This resulted in considerable silicon savings.
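The trade-off between dynamic bits and precision bits inside one 16-bit word can be illustrated as follows. This is a minimal sketch in plain C rather than the A|RT Library classes; `to_fixed` and `from_fixed` are hypothetical helpers, and the choice of truncation toward minus infinity with saturation is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a real value to a 16-bit word holding frac_bits precision bits.
 * Quantization truncates toward minus infinity; overflow saturates. */
int16_t to_fixed(double value, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits <= 15);
    double scaled = value * (double)(1L << frac_bits);
    if (scaled >= (double)INT16_MAX) return INT16_MAX;
    if (scaled <= (double)INT16_MIN) return INT16_MIN;
    int32_t word = (int32_t)scaled;        /* the cast rounds toward zero */
    if ((double)word > scaled) word -= 1;  /* step down for negative values */
    return (int16_t)word;
}

/* Recover the real value that a fixed-point word represents. */
double from_fixed(int16_t word, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits <= 15);
    return (double)word / (double)(1L << frac_bits);
}
```

With 5 precision bits the step size is 1/32 and values up to roughly ±1024 fit in the word. With 7 precision bits the step shrinks to 1/128, but the range shrinks to roughly ±256. For instance, 1000.0 is representable in the first format and saturates in the second.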
When the conversion to fixed-point C was complete, the C code was compiled using regular C++ compilers and executed on a PC. (An HP or Sun machine could also be used.) The bit-true definition of all signals guaranteed a correct reference for hardware mapping, plus a direct interface to other elements of the digital world such as HDL-based compilers and simulators. The fixed-point recognition code was merged with the C-code of the Euro-changer application, yielding a complete executable of the final product. Thus, we could test and evaluate the product in the laboratory and demonstrate the functionality to the customer.
System considerations
Central to the project was the realization of a low-cost, low-power, battery-operated end product. In order to meet the target cost, a single-chip SOC solution was quickly found to be the only reasonable path. The SOC would have to integrate the following resources into no more than 25,000 gates to meet the cost constraints:
- Speech recognition and synthesis (SRS) core
- SRS program and Euro-changer code (max 30 Kbytes)
- Speech synthesis samples (max 30 Kbytes)
- RAM for the storage of voice-prints and as scratchpad (max 30 Kbytes)
- AD/DA converters
- Microphone interface
- Speaker interface
Power consumption was also a significant issue. The minimum in-use battery life for the Euro changer was set at one year. Adding reserve for storage and distribution, this meant the battery would have to last 1.5 years.
Meeting these relatively severe power constraints required efficient power-down, storing the voice-prints in RAM, a low clock frequency for the processor, and efficient amplification of the audio to the speaker. The most serious challenges were deemed to be: 1) implementing the basic SRS functionality; 2) working within the limited RAM; and 3) designing the speaker interface.
Getting SRS about processor architecture
Given the estimated processing requirements and the low-power constraints, selecting the target clock frequency was the first task. Based on initial power-consumption and processing estimates, a clock frequency of 2-4 MHz was deemed to be adequate. The specific frequency of 3.579 MHz was selected because it is the color-subcarrier frequency in National Television System Committee (NTSC) video systems, so the crystals are inexpensive.
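A quick calculation shows how tight this clock is. The sketch below assumes the standard NTSC crystal value of 3.579545 MHz and the 8 kHz input rate quoted earlier; `cycles_per_sample` is a hypothetical helper, and the per-sample figure is only an average budget, since real processing is typically organized in frames.

```c
#include <assert.h>

/* Whole clock cycles available, on average, between two input samples. */
unsigned long cycles_per_sample(unsigned long clock_hz,
                                unsigned long sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return clock_hz / sample_rate_hz;
}
```

At 3,579,545 Hz and 8,000 samples per second, the processor has about 447 cycles per incoming sample for all recognition work.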
Figure 2 - Don't bring the noise. The algorithm needs to detect and eliminate background noise.
In order to get the clock down to 3.5 MHz from the 450 MHz required using the Pentium, and to keep the core within the 25,000 gate budget, a dedicated architecture was required. Historically, designing a dedicated processor has been a time-consuming manual task that required re-writing the algorithms in an HDL, and the choice of architecture was left to the designer's best guess about what would work. Consequently, Frontier Design developed a C-language based architectural-synthesis tool, called A|RT Designer, that synthesizes a controller-based architecture directly from the behavioral C-language algorithm. The designer may then analyze the performance of that architecture for bottlenecks or underutilized resources, make trade-offs in the architecture, and quickly analyze the results. The design remains at the behavioral level in ANSI C until the optimization is complete. Once fully optimized, it can be automatically converted to Verilog or VHDL.
The designers used the A|RT Designer tool to synthesize a suitable architecture for the speech-recognition algorithm, prior to going to the RT-level description. This tool allocates the required data-path resources (multipliers, adders, ALUs, I/O, RAM, ROM, etc.), assigns all algorithmic operations to those resources, and then schedules the operations on the resources. A controller and its micro-code, which control the resource assignment and scheduling, are also automatically generated, along with registers, muxes, and buses.
Subsequently, the designer may interactively change the hardware resources and the assignment and/or scheduling of operations on the controller and micro-code and then quickly synthesize the results to see the effects of the changes. Graphical analysis tools and reports show the performance of the algorithm, as well as detailed information on the number of registers, muxes, buses and other
resources used.
The key goal in mapping the SRS algorithm onto the hardware architecture was to run the complete SRS code at the 3.5 MHz target clock frequency without exceeding the maximum budget of 25,000 gates. Using A|RT Designer's "load view", designers identified several multiple-cycle operations that represented performance bottlenecks. Highlighting the bottleneck location on the graphical view also highlights the relevant C-code and RT-level representation, so the designer can identify the cause of the bottleneck and try alternative solutions.
The most obvious bottleneck was the intensive FFT calculation of the Mel computations, which took 80 percent of the cycles of the real-time processing. By adding a second adder and a dedicated address-calculation unit (ACU), the FFT was optimized to only 10 percent of its original cycle count. The added hardware cost only 4,000 gates - well within the hardware budget. Even with this improvement, the total number of cycles was still too high to achieve the 3.5 MHz clock frequency.
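The effect of speeding up a single hot spot can be worked out with a one-line formula. The sketch below assumes that the 80 percent figure refers to the total real-time cycle count; `remaining_cycle_fraction` is a hypothetical helper, not part of any tool named in this article.

```c
#include <assert.h>

/* Fraction of the original cycle count that remains after one part of the
 * workload (hot_fraction of all cycles) is reduced to `scale` times its
 * original cost. */
double remaining_cycle_fraction(double hot_fraction, double scale)
{
    assert(hot_fraction >= 0.0 && hot_fraction <= 1.0 && scale >= 0.0);
    return (1.0 - hot_fraction) + hot_fraction * scale;
}
```

With the FFT at 80 percent of the cycles and reduced to 10 percent of its original count, 0.20 + 0.80 x 0.10 = 0.28 of the original cycles remain. The untouched 20 percent now dominates, which is why a second round of optimization was needed.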
Further analysis suggested that additional improvements could be achieved in the calculation of the logarithmic functions. This calculation takes about 1,000 cycles when running the C-language algorithm on a RISC DSP (NSC CR16B) - about 15 percent of real-time computation requirements. Adding a dedicated application-specific unit (ASU) reduced the cycle count for these functions to only 3 cycles, while adding only 200 gates. Together, these changes to the architecture brought the minimum clock frequency down to only 1.5 MHz - less than half the target frequency.
As a final optimization of both the gate count and the power consumption of the speech-recognition core, the number of register flip-flops was reduced. Flip-flops are expensive (10 gates each) and consume considerable power. A|RT Designer's "life-time view" was used to analyze the number of cycles that constitute each variable's life and the frequency with which that variable is used. By
storing infrequently used, but long-lived variables in local RAM, the total number of registers could be reduced, further decreasing the required silicon and power. This step saved an additional 50 percent of register gates while leaving plenty of headroom in the cycle budget.
The implementation of RAM compression
At the start of the design cycle, it was already clear that the 30 Kbyte RAM constraint was tight. The reference SRS C code generates about 1-2 Kbytes of voice-print data per word (about 1 second of speech), and 30 commands were required. This left no space for processor scratchpad SRAM. Since the budgeted 30 Kbytes of RAM already represented a significant proportion of the silicon area, however, no additional RAM could be added within the silicon budget.
Figure 3 - SRS SOC. The whole chip was manufactured in a standard 0.35µm CMOS process.
The only solution was to employ some form of speech compression. Voice-print data can be compressed in two ways: lossless or lossy. Lossless compression was examined first since it wouldn't affect speech-recognition performance. Several lossless compression methods were implemented in C, based on existing standard C-code sources, with sample voice-print data used as a reference. The best lossless algorithm yielded 30 percent compression. Adding lossy compression achieved another 20 percent without significant degradation of recognition quality. The lossy compression is completely scalable, enabling a variable compression that is dependent on the actual voice-print lengths or the vocabulary sizes. The resulting C-code algorithm was 500 lines long and yielded 50 percent compression on voice-print data. The next step was integrating the speech-compression and speech-recognition IP blocks.
The 500 lines of code were simply merged with the 10,000 lines of SRS code. The new functionality was a single subroutine, called before storing a voice print to, or when reading a voice-print from, RAM. The computational effort was considerable, however, requiring around 1.5 million cycles after initial optimization - about the same as the SRS processing. Fortunately, the nearly 2.5 MHz of headroom in the available clock frequency could handle the processing
without further optimization. The compression scheme cut the RAM requirements down to 20-25 Kbytes, leaving at least 5 Kbytes for processor scratchpad.
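The RAM budget can be checked with simple arithmetic. The sketch below assumes an average of 1.5 Kbytes per voice print (the midpoint of the 1-2 Kbyte range quoted earlier) and 1 Kbyte = 1,024 bytes; `voiceprint_ram_bytes` is a hypothetical helper.

```c
#include <assert.h>

/* Bytes of RAM needed for a vocabulary of voice prints after compression.
 * compression_percent is the share of the raw size that is removed. */
unsigned long voiceprint_ram_bytes(unsigned long words,
                                   unsigned long bytes_per_word,
                                   unsigned long compression_percent)
{
    assert(compression_percent <= 100);
    unsigned long raw = words * bytes_per_word;
    return raw - (raw * compression_percent) / 100;
}
```

Thirty uncompressed voice prints of 1,536 bytes each need 46,080 bytes (45 Kbytes), well over the 30 Kbyte budget. At 50 percent compression the same vocabulary needs 23,040 bytes (22.5 Kbytes), which falls inside the 20-25 Kbyte range reported above.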
Implementing the speaker interface
The single-battery power supply mandated an external offset network, and the maximum power efficiency that can be reached in that case is only 50 percent. A second issue was the large size of the digital-to-analog converter (DAC) and analog amplifiers. The digital process could cope with the required specifications, but at the expense of a large area. Instead of using this silicon-intensive approach, a straightforward C-language implementation of a full-digital, pulse-width modulation (PWM) speaker driver was written.
The key question was, how would it sound? The only way to tell how something sounds is by listening to it. Since the C code was relatively simple and required no hardware sharing logic, the C code was directly converted to VHDL using the company's A|RT Builder C-to-HDL
conversion tool. It was then synthesized using Exemplar's (San Jose) Leonardo Spectrum, and mapped to a Xilinx (San Jose) Virtex FPGA. Using a Xilinx FPGA board from the lab, we connected the speaker directly to two digital outputs, flipped the switch, and listened to the results.
The implementation didn't sound right the first time around. Through multiple iterations, we quickly developed the correct PWM conversion block using this design cycle. We automated this design cycle by writing a script that took us
from C code to actual listening in less than 15 minutes per iteration. The final PWM implementation oversamples 512 times, uses a dedicated noise-shaping algorithm, and calculates with 25 bits of internal precision. Its power efficiency is 99 percent and the hardware cost was only 2,000 gates, considerably less than an alternative DAC-plus-amplifier solution.
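The idea behind a noise-shaped PWM output can be sketched in C. This is a generic first-order error-feedback shaper, not the dedicated algorithm used in the product, whose details the article does not give. The 6-bit duty resolution, the 16-bit unsigned input, and the names `noise_shaper` and `pwm_duty` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DUTY_BITS 6                      /* duty-cycle resolution (hypothetical) */
#define SHIFT     (16 - DUTY_BITS)       /* input bits dropped by the quantizer */
#define DUTY_MAX  ((1 << DUTY_BITS) - 1) /* largest duty value */
#define STEP      (1 << SHIFT)           /* input units per duty step */

typedef struct {
    int32_t error;  /* quantization error carried to the next sample */
} noise_shaper;

/* Map one unsigned 16-bit sample to a PWM duty value in 0..DUTY_MAX. */
uint8_t pwm_duty(noise_shaper *ns, uint16_t sample)
{
    assert(ns != 0 && ns->error >= 0);

    /* Add back what earlier outputs failed to deliver. */
    int32_t wanted = (int32_t)sample + ns->error;

    /* Coarse quantization; wanted is never negative, so division truncates down. */
    int32_t duty = wanted / STEP;
    if (duty > DUTY_MAX) duty = DUTY_MAX;  /* clip at full scale */

    /* Remember the shortfall, bounded so clipping cannot wind it up. */
    int32_t error = wanted - duty * STEP;
    if (error > STEP - 1) error = STEP - 1;
    ns->error = error;

    return (uint8_t)duty;
}
```

A constant input halfway between two duty steps produces an output that alternates between the two neighboring duty values, so the average is exact even though each individual pulse is coarse. Oversampling pushes the resulting quantization noise above the audio band, where the speaker and the ear filter it out.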
Generating RT-level description
Once the engineers were satisfied with the behavior and architecture of the merged functionality for the speech-recognition SOC, they used the A|RT Designer tool to automatically generate an RT-level VHDL description, which was used for the final silicon. The tool generated the RT-level code for the controller, associated microcode, RAM, ROM, and datapath functions. It also generates test benches at every stage of the design flow, so simulations of the original floating-point algorithm can be compared with simulations of the fixed-point C and HDL versions.
The VHDL simulation corresponded exactly to the original floating-point C, meaning that the SOC would have the same accuracy as the floating-point algorithm.
The final architecture
All required functions were integrated on a single piece of silicon for the completed SRS ASIC (see Figure 3). Additionally, all the IP developed for this SOC is currently being reused. The SRS algorithm is used for speech recognition in DECT phones running on the CR16B RISC core. The data-compression functions have been reused as well, to further enhance a dedicated variable bit-rate ADPCM audio compression code (VADPCM). The VADPCM is also used in the SRS core. The PWM algorithm and implementation are useful in any digital system in which quality audio output must be generated without adding any analog components. The SRS implementation itself can be modified for future-generation products. Design cycles of the PWM, from C to Xilinx FPGA audio output, took less than 15 minutes.
The SRS architecture we used
offers new engineering opportunities in "soft perceptual" areas where evaluation of the final real-time performance is required as final proof of the design. These include audio compression, wireless communication, and video compression. In short, none of the architectural improvements in the speech recognition SOC could have been accomplished using an HDL-based or other design methodology. Nor could the reusability of the designs themselves have been so perfectly maintained.
Remco de Zwart is a DSP system engineer and technical design manager for Frontier Design's Netherlands design center.
Roel Janssen is a DSP design engineer at Frontier Design's Netherlands design center.
Andre Pool is a DSP design engineer at Frontier Design's Netherlands design center.
Cees Heikamp is the founder and CEO of Columns Asia Ltd, in Malaysia, and of Music Delux in Luxembourg. He founded Columns in 1992.
Copyright © 2000 Integrated System Design Magazine