Digital signal and image processing (DSP) is ubiquitous: from digital cameras to cell phones, from HDTV to DVDs, from satellite radio to medical imaging, the modern world is increasingly dependent on DSP algorithms.
Traditionally, special-purpose silicon devices such as digital signal processors, ASICs, or FPGAs have been used for data manipulation, but general-purpose processors (GPPs) can now also handle DSP workloads. Code is generally easier and more cost-effective to develop and support on GPPs than on large DSPs or FPGAs. GPPs can also combine general-purpose processing and digital signal processing on the same chip, a major advantage for many complex algorithms.
Case Study: Wireless Baseband Signal Processing
In wireless communication systems, the physical layer (PHY) (baseband signal processing) is usually implemented in dedicated hardware (ASICs), or in a combination of DSPs and FPGAs, because of its extremely high computational load. GPPs (such as Intel architecture) have traditionally been reserved for higher, less demanding layers of the associated protocols.
This section, however, shows that two of the most demanding next-generation baseband processing algorithms, the LTE Turbo encoder and channel estimation, can be implemented effectively on modern Intel processors.
The following discussion assumes Intel architecture as the target platform for implementation, and the parameters shown in Table 1.
Table 1. Parameters for LTE algorithm discussion
LTE Turbo Encoder
The Turbo encoder is an algorithm that operates intensively at the bit level, which is one of the reasons it is usually offloaded to dedicated circuitry. As will be shown below, however, there are software architecture alternatives that lead to an efficient realization on an Intel architecture platform.
The LTE standard specifies the Turbo encoding scheme depicted in Figure 1.
Figure 1. Block diagram of the LTE Turbo encoder (Source: 3GPP)
The scheme implements a Parallel Concatenated Convolutional Code (PCCC) using two 8-state constituent encoders in parallel, and comprises an internal interleaver. For each input bit, 3 bits are generated at the output.
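Each 8-state constituent encoder is a recursive systematic convolutional (RSC) encoder; per the standard, the feedback polynomial is g0(D) = 1 + D² + D³ and the feedforward polynomial is g1(D) = 1 + D + D³. A minimal sketch of one constituent encoder, ignoring trellis termination (the function name is illustrative, not from the source):

```python
def rsc_encode(bits):
    """One 8-state constituent encoder of the LTE Turbo code.

    Feedback polynomial  g0(D) = 1 + D^2 + D^3,
    feedforward polynomial g1(D) = 1 + D + D^3.
    Returns the parity stream; the systematic stream is the input itself.
    """
    r1 = r2 = r3 = 0                 # the three shift-register stages
    parity = []
    for x in bits:
        fb = x ^ r2 ^ r3             # recursive feedback (taps D^2, D^3)
        z = fb ^ r1 ^ r3             # parity output (taps 1, D, D^3)
        r1, r2, r3 = fb, r1, r2      # shift the register
        parity.append(z)
    return parity
```

Since the code is linear and the encoder starts in the all-zero state, an all-zero input produces an all-zero parity stream, a quick sanity check for any implementation.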
The relationship between the input bit index i and its interleaved position π(i) in the stream is defined by the quadratic permutation polynomial

π(i) = (f1 · i + f2 · i²) mod K
Here K is the input block size in bits (188 possible values, ranging from 40 to 6144). The constants f1 and f2 are fixed by the standard and depend solely on K.
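As a concrete illustration, the smallest block size K = 40 uses f1 = 3 and f2 = 10 in the standard's parameter table (these constants are taken from 3GPP, not from the text above); a sketch of the index computation:

```python
def qpp_interleave(K, f1, f2):
    """Quadratic permutation polynomial (QPP) interleaver indexes:
    pi(i) = (f1*i + f2*i^2) mod K, for i = 0 .. K-1."""
    return [(f1 * i + f2 * i * i) % K for i in range(K)]

# Smallest LTE block size: K = 40 with f1 = 3, f2 = 10.
pi = qpp_interleave(40, 3, 10)
```

The (f1, f2) pairs are chosen by the standard so that π is a bijection on {0, …, K−1}; sorting the output therefore recovers 0 … K−1 exactly, which makes a convenient self-test.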
At the cost of a slightly larger memory footprint (710 kB), the π(i) LUTs can be pre-generated for every allowed value of K. When processing a single data frame, only the portion of the table corresponding to the current K is used (at most 12 kB).
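The 710 kB figure can be reproduced from the standard's block-size table: the 188 allowed values of K step by 8 (40–512), 16 (528–1024), 32 (1056–2048), and 64 (2112–6144), and each LUT entry fits in 16 bits since K ≤ 6144. A quick check, with the step boundaries as assumed here:

```python
# Enumerate the 188 LTE Turbo-code block sizes K (per the 3GPP table).
block_sizes = (list(range(40, 513, 8)) +
               list(range(528, 1025, 16)) +
               list(range(1056, 2049, 32)) +
               list(range(2112, 6145, 64)))

# One 16-bit entry per index position: total LUT storage across all K.
total_bytes = 2 * sum(block_sizes)
# ~710 kB in total; the largest single table (K = 6144) is 12 kB.
```
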
Computing the permutation indexes at runtime would instead require 4 multiplications, 1 division, and 1 addition, for a total of 6 integer operations per bit.