The transmission of speech from one point to another over a GSM mobile phone network is something that most of us take for granted. The complexity is usually perceived to be associated with the network infrastructure and management required in order to create the end-to-end connection, and not with the transmission of the payload itself. The real complexity, however, lies in the codec scheme used to encode voice traffic for transmission.
The GSM standard supports four different but similar compression technologies to analyse and compress speech: full-rate, enhanced full-rate (EFR), adaptive multi-rate (AMR), and half-rate. Despite all being lossy (i.e. some data is lost during the compression), these codecs have been optimized to accurately regenerate speech at the output of a wireless link.
In order to provide toll-quality voice over a GSM network, designers must understand how and when to implement these codecs. To help out, this article provides a look inside how each of these codecs works. We'll also examine how the codecs need to evolve in order to meet the demands of 2.5G and 3G wireless networks.
Speech Transmission Overview
When you speak into the microphone on a GSM phone, the speech is converted to a digital signal with a resolution of 13 bits, sampled at a rate of 8 kHz. This 104,000 b/s stream forms the input signal to all the GSM speech codecs. The codec analyses the voice and builds up a bit-stream composed of a number of parameters that describe aspects of the voice. The output rate of the codec depends on its type (see Table 1), ranging from 4.75 kbit/s to 13 kbit/s.
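The arithmetic behind these figures can be checked with a few lines of Python. This is only an illustrative sketch: the 20-ms frame length is taken from the full-rate codec description later in the article, and the names (`compression_ratio` and the constants) are invented for the example.

```python
SAMPLE_RATE_HZ = 8_000   # sampling rate quoted in the text
BITS_PER_SAMPLE = 13     # resolution quoted in the text
FRAME_MS = 20            # frame length (from the full-rate codec section)

# Raw bit rate presented to every GSM speech codec.
input_rate_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE

# Number of samples, and raw bits, in one 20-ms frame.
samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000
input_bits_per_frame = samples_per_frame * BITS_PER_SAMPLE


def compression_ratio(codec_rate_bps):
    """Ratio of the raw input rate to the codec output rate."""
    return input_rate_bps / codec_rate_bps


if __name__ == "__main__":
    print("input rate (b/s):", input_rate_bps)
    print("samples per frame:", samples_per_frame)
    for label, rate in (("13 kbit/s", 13_000), ("4.75 kbit/s", 4_750)):
        print(label, "compression ratio:", round(compression_ratio(rate), 1))
```

At the two ends of the quoted range, the codecs therefore compress the input by factors of roughly 8 and 22.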
Table 1: Different Coding Rates
After coding, the bits are re-arranged, convolutionally encoded, interleaved, and built into bursts for transmission over the air interface. Under extreme error conditions a frame erasure occurs and the data is lost. Otherwise, the original data is re-assembled, potentially with some errors in the less significant bits. The bits are arranged back into their parametric representation and fed into the decoder, which uses the data to synthesise the original speech information.
The Full-Rate Codec
The full-rate codec is a regular pulse excitation, long-term prediction (RPE-LTP) linear predictive coder that operates on a 20-ms frame composed of one hundred sixty 13-bit samples.
The vocoder model consists of a tone generator (which models the vocal cords), and a filter that modifies the tone (which models the mouth and nasal cavity shape) [Figure 1]. The short-term analysis and filtering determines the filter coefficients and an error measurement, while the long-term analysis quantifies the harmonics of the speech.
Figure 1: Diagram of a full-rate vocoder model.
As the mathematical model for speech generation in a full-rate codec shows a gradual decay in power for an increase in frequency, the samples are fed through a pre-emphasis filter that enhances the higher frequencies, resulting in better transmission efficiency. An equivalent de-emphasis filter at the remote end restores the sound.
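A first-order pre-emphasis filter and its matching de-emphasis filter can be sketched as follows. The coefficient of 0.86 is an assumption (it is the value commonly quoted for the full-rate codec), and the real codec works in fixed-point arithmetic rather than the floating point used here; the function names are illustrative only.

```python
BETA = 0.86  # assumed pre-emphasis coefficient, not taken from the article


def pre_emphasis(samples, beta=BETA):
    """Boost high frequencies: y[k] = x[k] - beta * x[k-1]."""
    out = []
    previous_input = 0.0
    for sample in samples:
        out.append(sample - beta * previous_input)
        previous_input = sample
    return out


def de_emphasis(samples, beta=BETA):
    """Inverse filter at the remote end: x[k] = y[k] + beta * x[k-1]."""
    out = []
    previous_output = 0.0
    for sample in samples:
        previous_output = sample + beta * previous_output
        out.append(previous_output)
    return out


if __name__ == "__main__":
    speech = [1.0, 2.0, -3.0, 0.5]
    print(de_emphasis(pre_emphasis(speech)))  # recovers the input (up to rounding)
```

Because the second filter is the exact inverse of the first, running a signal through both returns the original samples, which is the sense in which the de-emphasis filter "restores the sound".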
The short-term analysis (linear prediction) performs autocorrelation and Schur recursion on the input signal to determine the filter ("reflection") coefficients. The reflection coefficients are converted into log area ratios (LARs), which offer more favourable companding characteristics, and are transmitted over the air as eight parameters totalling 36 bits of information. The reflection coefficients are then used to apply short-term filtering to the input signal, resulting in 160 samples of residual signal.
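The chain of autocorrelation, Schur recursion, and LAR conversion can be sketched in floating point as below. This is not the bit-exact fixed-point procedure of the standard: the sign convention for the reflection coefficients, the exact `log10` form of the LAR (real implementations typically use a piecewise-linear approximation), and all function names are assumptions made for illustration.

```python
import math


def autocorrelation(frame, order=8):
    """Autocorrelation of one frame for lags 0..order."""
    return [
        sum(frame[i] * frame[i - lag] for i in range(lag, len(frame)))
        for lag in range(order + 1)
    ]


def schur_reflection(acf):
    """Reflection coefficients from autocorrelation values (Schur recursion)."""
    order = len(acf) - 1
    refl = [0.0] * order
    p = list(acf)  # p[0] tracks the prediction error energy
    k = list(acf)  # second working array; entries 1..order-1 are used
    for n in range(1, order + 1):
        if p[0] <= 0 or p[0] < abs(p[1]):
            break  # degenerate input: leave remaining coefficients at zero
        r = -p[1] / p[0]
        refl[n - 1] = r
        if n == order:
            break
        p[0] += p[1] * r
        for m in range(1, order - n + 1):
            new_p = p[m + 1] + r * k[m]
            k[m] += r * p[m + 1]
            p[m] = new_p
    return refl


def log_area_ratio(r):
    """Exact LAR of a reflection coefficient with |r| < 1."""
    return math.log10((1 + r) / (1 - r))


if __name__ == "__main__":
    frame = [math.sin(0.3 * i) + 0.5 * math.sin(1.1 * i) for i in range(160)]
    coeffs = schur_reflection(autocorrelation(frame))
    print([round(c, 3) for c in coeffs])
```

A useful sanity check is a first-order signal whose autocorrelation is `a**lag`: the recursion should return `-a` for the first coefficient and zero for the rest.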
The residual signal from the short-term filtering is segmented into four sub-frames of 40 samples each. The long-term prediction (LTP) filter models the fine harmonics of the speech using a combination of current and previous sub-frames. The gain and lag (delay) parameters for the LTP filter are determined by cross-correlating the current sub-frame with previous residual sub-frames.
The peak of the cross-correlation determines the signal lag, and the gain is calculated by normalising the cross-correlation coefficients. The parameters are applied to the long-term filter, and a prediction of the current short-term residual is made. The error between the estimate and the real short-term residual signal (the long-term residual signal) is applied to the RPE analysis, which performs the data compression.
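The lag and gain search described above can be sketched as a plain cross-correlation over past residual samples. The lag range of 40 to 120 samples is an assumption, the gain is left unquantised, and `ltp_parameters` and its arguments are hypothetical names used only for this example.

```python
def ltp_parameters(sub_frame, history, min_lag=40, max_lag=120):
    """Return (lag, gain) for one sub-frame.

    sub_frame: current short-term residual samples (40 in the full-rate codec).
    history:   earlier residual samples, oldest first, at least max_lag long.
    """
    n = len(sub_frame)
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        start = len(history) - lag
        past = history[start:start + n]
        corr = sum(c * p for c, p in zip(sub_frame, past))
        if corr > best_corr:  # the peak of the cross-correlation gives the lag
            best_lag, best_corr = lag, corr

    # Normalise the peak by the energy of the matching past segment.
    start = len(history) - best_lag
    past = history[start:start + n]
    energy = sum(p * p for p in past)
    gain = best_corr / energy if energy > 0 else 0.0
    return best_lag, gain


if __name__ == "__main__":
    history = [0.0] * 120
    history[68] = 1.0            # a single pulse, 57 samples before sub_frame[5]
    sub_frame = [0.0] * 40
    sub_frame[5] = 0.5           # the same pulse, repeated at half amplitude
    print(ltp_parameters(sub_frame, history))
```

In the usage example the current sub-frame repeats a pulse that occurred 57 samples earlier at half the amplitude, so the search should settle on that lag with a gain of one half.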
The Regular Pulse Excitation (RPE) stage involves reducing the 40 long-term residual samples down to four candidate sub-sequences of 13 samples each through a combination of interleaving and sub-sampling. The optimum sub-sequence is determined as having the least error, and is coded using APCM (adaptive PCM) into 45 bits.
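The sub-sampling step can be sketched as follows, assuming each candidate takes every third sample starting from one of four offsets and holds 13 samples, and that the candidate with the highest energy is the one that discards the least signal. The weighting filter that precedes this step in a real encoder is omitted, and the function name is illustrative.

```python
SUB_FRAME_LEN = 40
PULSES = 13       # samples kept per candidate sub-sequence (assumed)
DECIMATION = 3    # every third sample is kept (assumed)


def rpe_select_grid(residual):
    """Pick the decimated sub-sequence that retains the most energy.

    Returns (grid_position, sub_sequence) for a 40-sample long-term residual.
    """
    if len(residual) != SUB_FRAME_LEN:
        raise ValueError("expected a 40-sample sub-frame")
    candidates = [residual[m::DECIMATION][:PULSES] for m in range(4)]
    energies = [sum(s * s for s in seq) for seq in candidates]
    grid = max(range(4), key=lambda m: energies[m])
    return grid, candidates[grid]


if __name__ == "__main__":
    residual = [0.0] * SUB_FRAME_LEN
    residual[4] = 2.0
    residual[7] = -1.0
    grid, pulses = rpe_select_grid(residual)
    print(grid, pulses)
    # One plausible reading of the 45-bit figure: 13 pulses at 3 bits each
    # plus a 6-bit block maximum (an assumption, not stated in the article).
    print(PULSES * 3 + 6)
```

The chosen offset also has to be signalled to the decoder so that it can place the 13 pulses back on the correct grid.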
The resulting signal is fed back through an RPE decoder and mixed with the short-term residual estimate in order to source the long-term analysis filter for the next frame, thereby completing the feedback loop (Table 2).
Table 2: Output Parameters from the Full-Rate Codec
The Enhanced Full-Rate Codec
As processing power improved and power consumption decreased in digital signal processors (DSPs), more complex codecs could be used to give a better quality of speech. The EFR codec is capable of conveying more subtle detail in the speech, even though the output bit rate is lower than full rate.
The EFR codec is an algebraic code excitation linear prediction (ACELP) codec, which uses a set of similar principles to the RPE-LTP codec, but also has some significant differences. The EFR codec uses a 10th-order linear-predictive (short-term) filter and a long-term filter implemented using a combination of adaptive and fixed codebooks (sets of excitation vectors).
Figure 2: Diagram of the EFR vocoder model
The pre-processing stage for EFR consists of an 80 Hz high-pass filter, and some downscaling to reduce implementation complexity. Short-term analysis, on the other hand, occurs twice per frame and consists of autocorrelation with two different asymmetric windows of 30 ms in length concentrated around different sub-frames. The results are converted to short-term filter coefficients, then to line spectral pairs (for better transmission efficiency) and quantized to 38 bits.
In the EFR codec, the adaptive codebook contains excitation vectors that model the long-term speech structure. Open-loop pitch analysis is performed on half a frame, and this gives two estimates of the pitch lag (delay) for each frame.
The open-loop result is used to seed a closed-loop search for speed and reduced computation requirements. The pitch lag is applied to a synthesiser, and the results compared against the non-synthesised input (analysis-by-synthesis), and the minimum perceptually weighted error is found. The results are coded into 34 bits.
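The idea of seeding a narrow closed-loop search with the open-loop estimate can be sketched as below. This is a heavily stripped-down illustration: the synthesis and perceptual weighting filters are stood in for by a `synth` callable that defaults to the identity, only integer lags at least one sub-frame long are tried, and the search radius and all names are assumptions.

```python
def closed_loop_lag(target, past_excitation, open_loop_lag,
                    search_radius=3, synth=lambda v: v):
    """Analysis-by-synthesis search for the pitch lag near an open-loop guess.

    Returns (error, lag, gain) for the best candidate, or None if no
    candidate could be evaluated.
    """
    n = len(target)
    best = None
    first = max(n, open_loop_lag - search_radius)
    for lag in range(first, open_loop_lag + search_radius + 1):
        start = len(past_excitation) - lag
        if start < 0:
            continue  # not enough history for this lag
        candidate = synth(past_excitation[start:start + n])
        energy = sum(c * c for c in candidate)
        if energy == 0:
            continue
        # Optimal gain for this candidate, then the remaining squared error.
        gain = sum(t * c for t, c in zip(target, candidate)) / energy
        error = sum((t - gain * c) ** 2 for t, c in zip(target, candidate))
        if best is None or error < best[0]:
            best = (error, lag, gain)
    return best


if __name__ == "__main__":
    past = [0.0] * 143
    past[86] = 1.0           # pulse 60 samples before target[3]
    target = [0.0] * 40
    target[3] = 0.8
    print(closed_loop_lag(target, past, open_loop_lag=62))
```

Restricting the synthesis loop to a handful of lags around the open-loop estimate is what keeps the computation manageable; an exhaustive search would repeat the synthesis for every possible lag.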
The residual signal remaining after quantization of the adaptive codebook search is modelled by the algebraic (fixed) codebook, again using an analysis-by-synthesis approach. The resulting codebook index is coded as 35 bits per sub-frame, and the gain as 5 bits per sub-frame.
The final stage for the encoder is to update the appropriate memory ready for the next frame.
The Adaptive Multi-Rate Codec
The principle of the AMR codec is to use very similar computations for a set of codecs, to create outputs of different rates. In GSM, the quality of the received air-interface signal is monitored and the coding rate of speech can be modified. In this way, more protection is applied to poorer signal areas by reducing the coding rate and increasing the redundancy, and in areas of good signal quality, the quality of the speech is improved.
In terms of implementation, an ACELP coder is used. In fact, the 12.2 kbit/s AMR codec is computationally the same as the EFR codec. For rates lower than 12.2 kbit/s, the short-term analysis is performed only once per frame. For 5.15 kbit/s and lower, the open-loop pitch lag is estimated only once per frame. The result is that at lower output bit rates, there are fewer parameters to transmit, and fewer bits are used to represent them.
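The rules in this paragraph can be written down as a small lookup. The list of eight narrowband AMR rates and the 20-ms frame length are assumptions added for the example, and `analysis_schedule` is a hypothetical helper, not part of any codec API.

```python
# Assumed set of narrowband AMR modes, in kbit/s.
AMR_MODES_KBPS = (4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2, 12.2)
FRAME_MS = 20  # assumed frame length


def bits_per_frame(rate_kbps):
    """Speech bits produced per frame (kbit/s multiplied by ms gives bits)."""
    return round(rate_kbps * FRAME_MS)


def analysis_schedule(rate_kbps):
    """How often each analysis runs per frame, following the text above."""
    return {
        "bits_per_frame": bits_per_frame(rate_kbps),
        # Short-term (LP) analysis: twice at 12.2 kbit/s, once below that.
        "lp_analyses_per_frame": 2 if rate_kbps >= 12.2 else 1,
        # Open-loop pitch estimate: once at 5.15 kbit/s and lower, else twice.
        "open_loop_pitch_per_frame": 1 if rate_kbps <= 5.15 else 2,
    }


if __name__ == "__main__":
    for mode in AMR_MODES_KBPS:
        print(mode, analysis_schedule(mode))
```

Laying the modes out this way makes the trade-off visible: each step down in rate removes analysis work as well as bits from the frame.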
The Half-Rate Codec
The air transmission specification for GSM allows the splitting of a voice channel into two sub-channels that can maintain separate calls. A voice coder that uses half the channel capacity would allow the network operators to double the capacity on a cell for very little investment.
The half-rate codec is a vector sum excitation linear prediction (VSELP) codec that operates on an analysis-by-synthesis approach similar to the EFR and AMR codecs. The resulting output is 5.7 kb/s, which includes 100 b/s of mode indicator bits specifying whether the frames are thought to contain voice or no voice. The mode indicator allows the codec to operate slightly differently to obtain the best quality.
Half-rate speech coding was first introduced in the mid-1990s, but the public perception of speech quality was so poor that it is not generally used today. However, due to the variable bit-rate output, AMR lends itself nicely to transmission over a half-rate channel. By limiting the output to the lowest six coding rates (4.75 to 7.95 kbit/s), the user can still experience the quality benefits of adaptive speech coding, and the network operator benefits from increased capacity. It is thought that with the introduction of AMR, use of the half-rate air-channel will start to become much more widespread.
Table 3 shows the time taken to encode and decode a random stream of speech-like data, and the speed of the operations relative to the GSM full-rate codec.
Table 3: General Encoding and Decoding Complexity
The full-rate encoder operates on a non-iterative analysis and filtering, which results in fast encoding and decoding. By comparison, the analysis-by-synthesis approach employed in the CELP codecs involves repetitive computation of synthesised speech parameters. The computational complexity of the EFR/AMR/half-rate codecs is therefore far greater than that of the full-rate codec, and is reflected in the time taken to compress and decompress a frame.
The output of the speech codecs is grouped into parameters (e.g. LARs) as they are generated (Figure 3). For transmission over the air interface, the bits are rearranged so the more important bits are grouped together. Extra protection can then be applied to the most significant bits of the parameters, which will have the biggest effect on the speech quality if they are erroneous.
Figure 3: Diagram of vocoder parameter groupings.
The process of building the air transmission bursts involves adding redundancy to the data by convolutional coding. During this process, the most important bits (Class 1a) are protected most while the least important bits (Class 2) have no protection applied.
This frame building process ensures that many errors occurring on the air interface will be either correctable (using the redundancy), or will have only a small impact on the speech quality.
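The unequal protection can be sketched for a full-rate frame as below. The class sizes (50, 132, and 78 bits), the three parity bits, the four tail bits, and the generator polynomials are assumptions recalled from the full-rate channel coding scheme rather than figures given in this article. The parity computation and the bit reordering of the real scheme are left out, and the function names are illustrative.

```python
CLASS_1A, CLASS_1B, CLASS_2 = 50, 132, 78   # assumed class sizes (260 bits)
PARITY_BITS, TAIL_BITS = 3, 4               # assumed parity and tail lengths


def conv_encode_half_rate(bits):
    """Rate-1/2 convolutional encoder, two output bits per input bit.

    Assumed generators: G0 = 1 + D^3 + D^4 and G1 = 1 + D + D^3 + D^4.
    """
    state = [0, 0, 0, 0]  # state[0] is the previous bit, state[3] the oldest
    out = []
    for b in bits:
        g0 = b ^ state[2] ^ state[3]
        g1 = b ^ state[0] ^ state[2] ^ state[3]
        out.extend((g0, g1))
        state = [b] + state[:3]
    return out


def protect_frame(class_1a, class_1b, class_2, parity):
    """Protect Class 1 bits with the convolutional code; send Class 2 as-is.

    parity: check bits computed over class_1a elsewhere (not shown here).
    """
    if (len(class_1a), len(class_1b), len(class_2), len(parity)) != (
            CLASS_1A, CLASS_1B, CLASS_2, PARITY_BITS):
        raise ValueError("unexpected class sizes")
    protected = class_1a + parity + class_1b + [0] * TAIL_BITS
    return conv_encode_half_rate(protected) + class_2


if __name__ == "__main__":
    frame = protect_frame([1] * 50, [0] * 132, [1] * 78, [0, 0, 0])
    print(len(frame))  # coded bits per 20-ms speech frame
```

Under these assumptions, 189 protected bits double to 378, and the 78 unprotected bits bring the coded frame to 456 bits. Only the Class 1a bits carry the additional parity check, which is what lets the receiver detect a badly corrupted frame.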
The current focus for speech codecs is to produce a result that has a perceptually high quality at very low data rates by attempting to mathematically simulate the mechanics of human voice generation. With the introduction of 2.5G and 3G systems, it is likely that two different applications of speech coding will be developed.
The first will be comparatively low bandwidth speech coding, most likely based on the current generation of CELP codecs. Wideband AMR codecs have already been standardised for use with 2G and 2.5G technologies and these will utilise the capacity gains from EDGE deployment.
The second will make more use of the wide bandwidth employing a range of different techniques which will probably be based on current psychoacoustic coding, a technique which is in widespread use today for MP3 audio compression.
There is no doubt that speech quality over mobile networks will improve, but it may be some time before wideband codecs are standardised and integrated with fixed wire-line networks, leading to potentially CD-quality speech communications worldwide.
About the Authors
Richard Meston is a software engineer at Racal Instruments, working with GSM/GPRS/EDGE and CDMA test equipment. He primarily works on GSM mobile and base-station measurement and protocol testers, as well as speech coding and coverage analysis applications. Richard has an Electrical Engineering degree from the University of Sussex and can be reached at firstname.lastname@example.org.