# Cutting Power in Turbo Coding Architectures

Turbo coding architectures lie at the heart of all of the third-generation (3G) wireless standards, including UMTS and cdma2000. These coding architecture are called for because they allow systems to meet the tough bit-error-rate (10^{-6}) requirements and low signal-to-noise ratios (SNRs) placed on emerging 3G designs.

However, calling for turbo coding architectures and actually implementing them are worlds apart. Yes, turbo codes deliver excellent BER and SNR performance. But these achievements come at the cost of intense computing requirements. At high data rates, the turbo decoder alone may consume more power than the rest of the baseband transceiver.

With this in mind, designers need to rework their turbo coding architecture in order to meet stringent 3G transceiver requirements. Below we'll layout some techniques that allow designers to build more power-efficient turbo coding architectures. Additionally, we'll look at other techniques designers can employ to reduce power requirements in turbo decoding schemes.

**Turbo Complexity**

The optimum decoding of turbo codes is impossible to implement in practice due to its huge complexity, which is proportional to 2^{L}, where L is the frame length. The turbo codes in 3G have frame lengths ranging from 40 to 20730, thus even for the shortest frame length the optimum decoder is beyond the realm of implementation.

Practical turbo decoders are of iterative type; the decoder works on a received frame iteratively. In general, the errors in the frame decrease with the number of iterations (NOI). Typical choices for the NOI are 15 to 20, beyond which further improvement appears minuscule.

The computational complexity of the iterative turbo decoder remains huge. For example, at 2-Mbps data rate and 20 iterations, a turbo decoder would require about tens of BOPS. This kind of processing demand far exceeds what the state-of-the-art DSP could offer, making an ASIC the only viable implementation approach.

Even with the ASIC implementations, the power consumption is the top concern of the designers. The power consumption is proportional to the NOI. Thus as NOI increases, power consumption also increases. Therefore, to keep NOI down, designers must implement some new strategies. Specifically, designers need to consider implementing one of the various early-termination (ET) strategies in. Let's look at the ET strategies and the impact they have on turbo-coding architecture.

**ET, ET, ET...**

One of the clear myths in a turbo coding architecture is that all noisy frames must be corrected using the same NOI. This is actually inaccurate. Many noisy frames can be corrected with a few iterations while some of the more noisy frames need to experience a full NOI. Thus, if the decoder could stop the iteration as soon as the frame becomes correct, the average NOI would be reduced. For example, the average NOI over many frames can be less than 2, achieving more that 90% power reduction if the maximum NOI is 20.

Of course, the decoder will never know for sure if a decoded frame is correct or not. Therefore, various ET strategies have been developed based upon different indicators of the likelihood that a decoded frame is correct.

The first ET strategy, called comparison-based ET, compares the decoder frames from two consecutive iterations. If the two frames are "close" by a certain measure, the reason goes, then the frame is unlikely to improve upon further iterations. A popular measure of "closeness" is the number of differing bits in two frames. If the hard-decisioned bits of the current frame are the same as that of the frame of last iteration, the iteration will be stopped.

The drawback of the comparison-based ET is that the decision to stop the termination is not very reliable. The decoding can be terminated immaturely when further iteration would continue to improve the number of errors in the frame. This is especially true when the frame length is small. For example, use of comparison-based ET degrades the BER by more than an order of magnitude for a 320-b frame turbo code.

The second ET strategy, dubbed CRC-based ET, checks the CRC codes embedded in the turbo frames. If the decoded frame passes the CRC check, the decoder will stop the iteration. The reliability of the CRC-based ET depends on the number of CRC bits, C, since the probability of an incorrect frame passed the CRC check is well approximated by 2^{-C}. If, say, C = 30 to 40, the contribution to the overall BER of an incorrect frame slipping through the CRC check is negligible.

Long CRC bits, however, increase the transmission overhead, particularly for small frames. Typical choices of CRC bits in 3G standards are no more than 16, which is not sufficient for reliable ET.

One way to avoid increased transmission overhead is to wait out the first several iterations, and activate the CRC check when the frame is likely to be correct. The disadvantage of this is that the average NOI is always higher than the initial NOI, which has to be somehow conservatively determined.

There is also an added difficulty when applying the CRC-based ET to UMTS. The CRC bits in the UMTS signaling format are not intended to assist in terminating the turbo decoder early. The CRC frames are not aligned with the turbo-coded frames, even there can be up to 24 CRC bits in UMTS.

The third approach is called SNR thresholding. The decoded results in each iteration of the turbo decoder are soft-output. For binary turbo codes, this means the decoder outputs are not simply --1 or 1 but instead are centered around --1 and 1 after proper scaling. This output pattern is very similar to the BPSK signaling over the Gaussian channel and thus can be modeled as such.

It is observed that the decoder outputs are "noisy" initially and become more and more concentrated to --1 and 1 as the iteration progresses. In other words, the SNR of the decoder outputs improves with continued iterations. With SNR thresholding, the decoder will terminate the iteration process if the calculated SNR from the decoded output exceeds a pre-selected threshold.

Like comparison-based ET, SNR-thresholding saves power consumption at the price of degrading the BER. SNR thresholding simply allows a trade-off between the BER and power consumption. A lower SNR threshold decreases the average NOI but increases BER, and a higher SNR threshold does the converse. In other words, SNR thresholding is unable to reduce the power while holding the BER down.

Under the SNR thresholding approach, optimum thresholds are also channel dependent. Since the wireless channel is often fast changing even within a turbo code-frame, a fixed threshold will further degrade the BER.

It can be seen from the above discussions that existing ET strategies are either not very reliable or not very effective or both. Now let's layout a new turbo coding architecture that delivers a more reliable ET strategy in order to achieve lower NOI and better SNR.

**A New Turbo Decoder Design**

We now describe the performance of a new turbo decoder processor (TDP) design that implements an extremely reliable ET strategy, called adaptive ET. The adaptive ET retains all the benefits of existing ET strategies without their shortcomings. During the iteration decoding, the adaptive ET performs tests on the decoded frame. The tests are adaptive to the turbo-coded frames and are designed to maximize the termination reliability.

To show the benefits of the TDP implementing the advanced ET scheme, let's compare the TDP against a commercially available turbo decoder (design A). Figures 1 and 2 show the BER and average NOI, respectively, of design A and the TDP. Note that design A has multiple curves in each figure, corresponding to different SNR thresholds.

In **Figure 1**, it can be seen that different thresholds have different BER performances for design A. The TDP, however, outperforms design A by large margins for all thresholds.

A point of exception is at 1 dB on the SNR threshold = 100 curve. The abrupt change in direction at that point indicates that a statistical aberration may occur due to possibly insufficient samples, since the point is well below the BER under ideal circumstances.

*Figure 1: Comparison of BER of the turbo decoder processor (TDP) vs. traditional turbo coding schemes.*

*Click Here for a larger version of Figure 1*

**Figure 2** reveals the power/BER trade-off of SNR-thresholding. For design A, the best performance is obtained with the highest SNR threshold. However the average NOI remains the same as the maximum NOI, thus generating no power savings.

*Figure 2: Comparison of NOI of the turbo decoder processor (TDP) vs. traditional turbo coding schemes.*

*Click Here for a larger version of Figure 2*

With lower SNR thresholds, design A does have significantly lower average NOI, yet its BER performances are dismal. It takes an additional 1 dB to reach 10^{-6>} BER and barely stays under thereafter.

The TDP, on the other hand, beats the best average NOI of design A by 30 to 40% while surpassing the design A's best BER performance by more than an order of magnitude. The undesired power/BER trade-off is thus completely eliminated by the TDP in the process of designing 3G systems.

**Iteration Abnormality**

By using the adaptive ET strategy, a turbo decoder can stop the iteration almost immediately after the frame becomes correct. The average NOI is thus nearly minimized, resulting in minimized power consumption. There are certain "problematic" frames, however, that requires the maximum NOI and may still have errors in the end. It is those frames that contribute to the BER of the iterative turbo decoder.

The number of errors in a frame generally decreases with the NOI, which is the premise of the iterative decoding. This rule of thumb breaks down for the problematic frames. With the progression of the iterations, the number of errors in a problematic frame can go down and go up again, exhibiting a cyclic pattern. Since the problematic frames are the source of the BER, more iterations may not necessarily bring down the BER. In fact, BER may even be worse with more iterations. This phenomenon is called iteration abnormality.

Iteration abnormality is demonstrated in **Figure 3**, where the BER curves of a rate-1/2, 20730-bit frame turbo code are shown. This code is used in cdma2000 systems.

*Figure 3: Example of iteration abnormality on a cdma2000 signal.*

*Click Here for a larger version of Figure 3*

From Figure 3, it can be seen that the BER is indeed not monotonically decreasing with the NOI. It can also be observed from Figure 3 that at a given number of iterations, the BER may even be worse as the SNR increases.

The fact that some hard-squeezed dB would have an adverse effect on the system performance would certainly be annoying to a system designer. The good news is that an innovative iteration abnormality suppression scheme has been designed into the TDP so that the iteration abnormality can be successfully suppressed **(Figure 4)**.

*Figure 4: The TDP described here integrates an suppression scheme to handle iteration abnormality. This chart plots the impact of the suppression scheme on a BER vs. SNR chart.*

*Click Here for a larger version of Figure 4*

The suppression of iteration abnormality by the TDP is also shown in **Figure 5**, where curves of the BER vs. iterations are plotted.

*Figure 5: Plot of the iteration abnormality scheme on a BER vs. iteration chart.*

*Click Here for a larger version of Figure 5*

**Other Power-Saving Considerations**

The reliable adaptive ET strategy in the TDP reduces the power consumption to a small fraction of what it used to be, while maintaining the BER performance (with the iteration abnormality suppression scheme, the BER is actually getting better). When designing the turbo decoder architectures, carefully chosen parameters can further cut the power consumption substantially, if not as dramatic as the ET schemes.

One additional way that designers can cut power is to minimize data width. The data inside the turbo decoder are represented by a finite number of bits. The arithmetic operations consume power that is proportional to the data width. The smaller the width, the less the power consumption.

The majority of the operations in the turbo decoder are spent on the computation of the so called forward and backward probabilities (α and β, respectively) and the soft output based on α and β. Some turbo decoder designs conveniently chose the widths of α and β to be 16. If the width can be reduced to 10, for example, the power consumption will be reduced by 37.5%.

There are many constraints on minimizing the data width. One of them is the input data width. Large input data width makes the widths of α and β large as well. One way of properly specifying the input data width is to consider the dynamic range of the input signal. This is particularly important for maximum a posteriori (MAP)-based turbo decoder.

Unlike the Viterbi decoder in which the scaling of the input signal does not affect the decoding outcome in BPSK/QPSK signaling, the MAP-based turbo decoder requires that the input signal is properly scaled. Thus, after scaling, the input signal will be in the unit of E_{s}/N_{o} while the SNR per symbol will be in the linear scale. In other words, if E_{s}/N_{o} = 3 dB = 2.0, the scaled input samples to the turbo decoder will be centered around 2.0 or --2.0.

While certain input data width, such as 5 bits, was concluded to be sufficient in some study based on results on a particular code and a particular channel, for 3G turbo decoders the situation is more complex. A 3G turbo decoder must be able to decode with various frame lengths and under various channel conditions. The shorter the frame length, the higher the SNR for specified BER performance, meaning larger input data width. If the channel is fading, additional width is needed to deal with the signal fluctuations.

Another factor affecting the data width is the input quantization. Now the scaled input is in the unit of E_{s} /N_{o}, how many bits do we need to represent a fraction? Past study shows that 3 fraction bits is sufficient to make the quantization effects negligible. Innovative schemes involving 2 or even 1.5 fraction bits have further pushed the design envelope.

Through a combination of analysis, simulation, and design tricks, the TDP is able to minimize the data width while balancing out the input range and quantization issues. The BER performance of the TDP is kept within 0.08 dB from the performance of the ideal, floating point decoders.

** Sliding-Window Overhead**

In addition to reducing overhead, designers can turn to a sliding-window overhead to reduce power in turbo designs. For long frames the memory for storing the entire frame of the forward probability α or the backward probability β can be very large. Available products all use sliding-window version of the turbo decoder to reduce the memory requirements.

Sliding-window decoding induces the decoding overhead. In some turbo decoder designs this overhead can be as high as 25 to 33%, and increases the power consumption by the same amount. The TDP incorporates a sliding-window architecture with carefully chosen parameters, which results in only 3% overhead.

**Other Metrics for Turbo Coders**

Besides the power consumption and the BER, there are also other metrics for a turbo decoder design that indicates the applicable range of and the manufacturing cost of the device. Among them are memory size, gate count, and decoding speed.

Memory size dominates the size of the turbo decoder. Most of the memory is for storing the forward probability α or the backward probability β, and the interleave table. Alternatively, the interleave table can be replaced by a circuit that generates the interleave addresses in-line.

Both cdma2000 and UMTS have interleavers that can be generated algorithmically. The generating algorithms are implemented in the TDP, saving 65 kbits of memory for UMTS and 300 kbits for cdma2000.

Gate count relates to the design complexity of the turbo decoder. The TDP discussed here required about 20,000 gates.

Decoding speed is lower bounded by 1 bit/half-iteration/clock cycle for standard turbo decoders. There are designs that potentially have the speed below the lower bound by having multiple sessions of computing α and β in parallel. The random access nature of the soft output memory due to the interleaver, however, makes the single memory access per clock cycle still the best choice. Thus the lower bound remains a meaningful benchmark for decoding speed.

The decoding speed of the TDP approaches the lower bound with a 10 to 20% overhead. Thus, a 40 Mb-iteration/s decoder can be driven by a less than 100 MHz clock. The decoding speed can be further doubled by employing a parallel iteration architecture, making the TDP suitable for multi-channel decoding and the high data rate beyond 2 Mbps.

**About the Author**

**Xiao-an Wang** is a principal engineer at Communiverse Laboratories. He obtained an MSEE and Ph.D. in electrical engineering from Georgia Institute of Technology. Xiao-an can be reached at xiaoanwang@communiverse-labs.com.