Spectral subtraction, perceptual linear prediction and parameter filtering are well-known speech-recognition algorithms, and all of them are developed using floating-point and many divides and conditional branches. Low-cost hardware generally cannot process floating-point or integer divides without using slow software emulation, and conditional branches are slow on many embedded processors because they cause pipeline breaks.
Efficient operation of these algorithms on low-cost hardware generally requires that they be implemented without floating-point and with a minimum of integer divides and conditional branches. So let's look at some of the issues involved in using these algorithms in integer arithmetic with a minimum of divides and conditional branches.
Compared with floating-point implementations, using speech-recognition algorithms with integer arithmetic and a minimum of divides would reduce hardware costs. One technique is to use interpolated lookup tables to avoid floating-point computations, divides or conditional program branches.
In speech processing, three main algorithms are used: spectral subtraction, perceptual linear prediction and parameter filtering. Spectral subtraction suppresses additive stationary noise by subtracting an estimate of the noise spectrum from the signal spectrum magnitude. The noise spectrum is computed during nonspeech periods.
Perceptual linear prediction (PLP) is an alternative to linear predictive coding (LPC) based on human perception. PLP maps the spectrum into a perceptual spectrum that is then processed using autoregressive modeling and cepstrum computation.
Parameter filtering applies a linear filter to each of the cepstrum coefficients and removes components that change too slowly or too rapidly (they are likely to be the result of spectrum shaping in the channel, or noise). The filtered cepstrum values are the output of the front end and are used as the input to a speech recognizer, typically based on a Hidden Markov Model (HMM) or Dynamic Time Warp Algorithm (DTW). The filtered cepstrum values contain information about the input speech and effects resulting from noise and channel suppression.
You can define each of these algorithms in terms of their computational requirements. DSP operations for spectral subtraction, for example, require a fast Fourier transform and can be reduced to a fixed-point implementation. If you need only the magnitude of the spectrum, then the phase calculation, restoration and inverse FFT are not needed.
An integer implementation of the (radix 2) FFT is fairly straightforward. All of the multiplies in the FFT have one number that is a variable; the other number is a constant between plus/minus 1, called a "twiddle factor." For integer implementation to succeed, the twiddle factors must be scaled up so they can be stored as integers. Typically the scale factor is a power of two. In this case the multiplication by the twiddle factor can be implemented as an integer multiply (typically 16 bits x 16 bits) followed by a bit shift (typically 16 bits). To prevent overflows, shift right by one bit in each stage of the FFT. The same method is used in the inverse FFT.
The dynamic range of this type of integer FFT decreases by 1 of a bit for each stage where the number of stages is log base 2 of the number of data points. Thus for a 128-point FFT with 16-bit input data, the output has only 9 bits of dynamic range even though it is stored in 16 bits. Using block floating-point can mostly eliminate this loss of dynamic range. Left shift the data array as much as possible without overflowing before taking the FFT and the shift count stored. Keep in mind that the shift count is stored as a fractional number since some subsequent operations, such as cube root, result in fractional shift counts.
Searching for the correct shift count adds conditional branches to the code, so do it only when necessary; use fixed shift factors wherever possible. Similarly, in cases where the spectrum of the input is restricted, you may be able to circumvent this limitation by optimizing the scale factor, but for the general case of unrestricted input it holds. It is also possible to dynamically rescale the data at each stage of the FFT, but this adds computational load.
In normal processing you compute the phase of the captured signal from the spectrum before subtracting off the noise model. You then apply it to the subtracted magnitude before taking the inverse FFT. In many instances you'll need only the spectrum magnitude of the cleaned speech, in which case the math-intensive phase computation and restoration is unnecessary.
A straightforward way to compute the phase and magnitude from the real and imaginary parts of the spectrum involves a square root, a divide, an inverse tangent and a cosine calculation, which are difficult on an integer processor. If the accuracy requirements are not too great, use a math approximation to eliminate the square root.
But you can obtain better accuracy by using an interpolated lookup table algorithm. If f() is the function you are approximating, b is the number of bits in the unsigned input number x and t is the number of bits used to index the table, the table will have values of f(x) for x = n 2(b-t) for n = 0, 1, 2, ... 2t-1. Basically, the index for the table entry j ( 2(b-t) )) is just j left shifted by (b-t). To approximate f(x) using the table, just right shift x by (b-t) bits and then index into the table.
To interpolate between table entries, let the value obtained from the lookup table be f1, and let the next higher entry in the lookup table be f2. Obviously f(x) is between f1 and f2. To interpolate, use standard linear interpolation where xtrunc is x with the lower b-t bits zeroed out. Note that x-xtrunc is just the lower (b-t) bits of x that can be implemented as a bitwise AND. The division is by a power of two, so it is just a shift. It may be necessary to distribute the shift by (b-t) into two operations to prevent overflow in computing the numerator. The index for f2 is just one more than the index for f1.
With this method you can approximate any function with an interpolated lookup table without divides or conditional branches.
The PLP algorithm typically depends on block floating-point math. In operation, the spectrum is integrated over a small number (typically 17) of overlapping bands. Perceptual band weighting multiplies each band by a constant that is larger for higher frequencies than for lower ones.
Intensity to Loudness Compression is a cube root. Take the cube root by using an integer logarithm, multiplying by a constant, right shifting and then taking an inverse log. The logarithms are base 2. The perceptual weighting is also done in the log domain. You must divide the shift count by three to account for the cube root. Implement the divide by three as a multiply and shift.
To compute the log base 2 of an integer, first shift the number down until it is less than a threshold, and then use the remaining bits to index into a table. Add the shift count to the result. For inverse logarithm use the interpolated lookup table. Before taking the inverse logarithm of the array elements, subtract the integer part of the largest value from the array. This greatly reduces the dynamic range. Add the value subtracted to the shift count to keep track of the absolute magnitude of the array.
The output of the inverse DFT is an autocorrelation. Computing the autoregressive model and cepstrum coefficients from the autocorrelation is the same as in linear predictive coding (LPC) analysis.
Finally, parameter filtering requires feeding the acoustic parameters into an IIR filter. This can be described with traditional transfer functions and allows one to implement the feedback multiplication in integer math as a shift and a subtract operation.
See related chart