Concern over the complexity of electronics system design is widespread, but it is felt particularly strongly in wireless communications, where engineers struggle to get closer and closer to the Shannon limit, the theoretical limit of information transmission in the presence of noise. Techniques such as orthogonal frequency-division multiplexing, code-division multiple access (CDMA) and advanced forward error-correction schemes such as Turbo coding have helped modern systems approach Shannon's theoretical maximum bit rate for a bandwidth-limited channel. But they demand increasing levels of signal-processing power. In communications, the MIPS requirement has grown tenfold every four years. Unfortunately, the frequency scaling and density improvements that Moore's Law makes possible with conventional processor technologies deliver a tenfold gain only every six years. The gap is widening.
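To put rough numbers on both ideas, the short Python sketch below evaluates Shannon's capacity formula, C = B log2(1 + S/N), and the ratio between demand that grows tenfold every four years and supply that grows tenfold every six. The function names and the 5-MHz, 0-dB example values are illustrative assumptions, not figures from any particular system.

```python
import math


def shannon_capacity_bps(bandwidth_hz, snr_linear):
    """Shannon capacity C = B * log2(1 + S/N) of a band-limited noisy channel."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)


def growth_factor(years, tenfold_period_years):
    """Growth after `years` for a quantity that rises tenfold every period."""
    return 10.0 ** (years / tenfold_period_years)


def demand_supply_gap(years):
    """Ratio of MIPS demand (10x per 4 years) to MIPS supply (10x per 6 years)."""
    return growth_factor(years, 4.0) / growth_factor(years, 6.0)


if __name__ == "__main__":
    # Assumed example: a 5 MHz channel at 0 dB signal-to-noise ratio (S/N = 1).
    print("capacity (bit/s):", shannon_capacity_bps(5e6, 1.0))
    for years in (4, 8, 12):
        print(years, "years: demand/supply ratio =", round(demand_supply_gap(years), 2))
```

Under these growth rates the shortfall itself grows tenfold every 12 years, which is what "the gap is widening" amounts to.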
Today's 3G systems are complicated, demanding at least 3,000 MIPS per channel, 300 times that of 2G, and wideband CDMA is the most complex of the 3G systems in production today. To achieve the necessary MIPS-per-channel rating, most basestations use a combination of digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and high-speed general-purpose processors. The FPGAs are used for chip-rate processing, which demands processing rates beyond those possible with DSPs. The DSPs are typically used in the symbol-rate processing section. The whole system is managed by the general-purpose processor.
Over time, DSPs have taken advantage of the logic-density improvements offered by Moore's Law (transistor density per chip will double every 18 months) to use more complex microarchitectures in a bid to deliver higher performance. They have evolved into very-long-instruction-word processors, with multiple execution units, complex instruction schedulers and deep pipelines. In some cases, the pipeline can be a scary 11 stages deep. But even that has not been enough. They have incorporated hardware accelerators such as dedicated Viterbi decoders, Turbo decoders and matrix multipliers. All these additions and changes make the DSP harder to program.
In many cases, the only reasonable way to access the internal features is to use precompiled intrinsic functions, with a consequent performance penalty. More subtly, because these additions make the processor's behavior harder to follow or predict, development and verification become harder. With each microarchitectural "improvement," a law of diminishing returns applies. Every 10 percent increase in performance requires more than a 20 percent increase in complexity, with a consequent impact on die area, cost and power.
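The diminishing-returns rule can be turned into a back-of-the-envelope estimate. The sketch below assumes, purely for illustration, that the rule compounds: each successive 10 percent performance step costs a 20 percent complexity step. Because the text says "more than" 20 percent, the result is a lower bound. The function name is hypothetical.

```python
import math


def complexity_for_speedup(speedup, perf_step=1.10, complexity_step=1.20):
    """Lower bound on relative complexity needed for a given relative speedup.

    Assumes (an assumption of this sketch) that every perf_step gain in
    performance costs at least a complexity_step rise in complexity, compounded.
    """
    steps = math.log(speedup) / math.log(perf_step)
    return complexity_step ** steps


if __name__ == "__main__":
    for speedup in (1.1, 2.0, 4.0):
        print(speedup, "x performance needs at least",
              round(complexity_for_speedup(speedup), 2), "x complexity")
```

On that assumption, doubling performance costs somewhat less than four times the complexity, and the penalty keeps steepening from there.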
The other primary architecture of choice is the FPGA. Famously versatile, these are the only devices capable of handling the extremely high-speed tasks of CDMA. Although they are well suited to some roles, the very-fine-grain nature of the FPGA's logic array makes it inefficient for designing complex systems, especially those that need complex control tasks, such as wideband code-division multiple access (W-CDMA). And the low level of abstraction imposes a design-time penalty.
It is clear that neither the FPGA nor the DSP is an optimum solution for a basestation design, because each is restricted to its own part of the baseband. FPGAs deliver performance and predictability, but they are designed at too low a level of abstraction, and they are power hungry and expensive. DSPs provide a reasonably familiar programming environment and have a wealth of code to draw on. But they do not have the horsepower needed and, with all their microarchitectural tweaks, their performance is hard to predict or test.
The split between FPGAs and DSPs requires significant effort in partitioning the system, which is neither trivial nor obvious. There are complex interactions between control and data paths in a W-CDMA basestation. Although the FPGA offers high-performance chip-rate processing, the standards require an ongoing dialog between the FPGA and the basestation's control functions. Examples include interactions between rake fingers and the "smarts" of the rake-finger manager. Traffic types further complicate the picture as processing may need to switch quickly among different chip rates and forward error-correction modes. Then there are the operations, administration and management requirements of the TS-215 standard, often conveniently forgotten in component vendor benchmarks because of the overhead they put on their devices.
Once the system has been partitioned, further system integration difficulties are presented by the different nature of each part of the baseband implementation. The ASIC, FPGA and DSP parts of the system use different design paradigms, programming models, tools and test frameworks. Integrating blocks takes time, particularly when statistical testing is needed. Because other units depend on the results, integration often forces redesigns of other blocks that worked fine in isolation but fail because of buffering or synchronization problems. Worse, the solution may require repartitioning or adding devices, which means not just a redesign but new interfaces.
The root of the problem is simple: sub-blocks are not orthogonal. The way design is supposed to work is in a hierarchy, where we abstract out key blocks and sub-blocks, which can be treated independently. Unfortunately, in conventional architectures they are not actually independent. Run-time dependencies and lack of predictability make an already complex situation worse.
The PicoChip platform was developed from the beginning to address these concerns and to provide a more effective environment to implement such complex systems, considering design, implementation and test.
The PicoArray itself is a massively parallel array of processors linked by a deterministic, high-speed interconnect fabric, with 400 processor cores on a single die. Each of these cores is a highly capable 16-bit device, roughly equivalent to an ARM9 for control tasks or a TI C5x for DSP roles. Because each of these cores can operate in parallel or in concert, and because of the high bandwidth of the on-chip buses, the PicoArray delivers processing power calculated at more than 100 billion operations per second. For DSP tasks, the aggregate throughput is 30 billion multiply-accumulate operations per second.
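A little arithmetic puts those aggregate figures in per-core terms. The sketch below simply divides the quoted totals by the 400 cores and, as an added assumption, by the standard W-CDMA chip rate of 3.84 million chips per second to show the budget available per chip period. The constant and function names are illustrative.

```python
CORES = 400                 # processor cores on the die (from the text)
TOTAL_OPS_PER_S = 100e9     # "more than 100 billion" operations/s, so a floor
TOTAL_MACS_PER_S = 30e9     # aggregate multiply-accumulates per second
CHIP_RATE_HZ = 3.84e6       # W-CDMA chip rate; an assumption added here


def per_core(total_per_second, cores=CORES):
    """Average share of an aggregate rate available to one core."""
    return total_per_second / cores


def per_chip_period(total_per_second, chip_rate_hz=CHIP_RATE_HZ):
    """Operations available across the array during one chip period."""
    return total_per_second / chip_rate_hz


if __name__ == "__main__":
    print("ops per core per second: ", per_core(TOTAL_OPS_PER_S))
    print("MACs per core per second:", per_core(TOTAL_MACS_PER_S))
    print("MACs per chip period:    ", per_chip_period(TOTAL_MACS_PER_S))
```

These are averages only; they say nothing about how a real design distributes work across the array.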
This architecture is more than capable of handling the most demanding high-speed tasks of the chip-rate section. But importantly, because the component elements of the PicoArray are completely programmable, using standard ANSI C or a familiar assembler, it is also suitable for symbol-rate processing and control-plane functions. This integration into a single environment reconciles the needs for both speed and control complexity, which explains why the PicoArray is so suitable for these systems.
The granularity of elements in the PicoArray lies between the very fine granularity of a universal FPGA and the "big chunks" of a powerful DSP. The operational elements in the PicoArray were expressly designed to match the complexity of the tasks in the system. Tasks are mapped directly onto processors as easily as drawing a block diagram. For example, a simple filter might be implemented in one processor. But it is easy to split a more complex filter across two or more processors by showing their roles in the block diagram. Signal-processing algorithms are well suited to such an approach, with inherent parallelism both within the algorithms and across them for multiple data streams.
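The filter example can be illustrated in a few lines of ordinary Python. This is a generic sketch of the partitioning idea, not PicoArray code or its tool flow: a FIR filter is computed once as a single block and once as two blocks, each owning part of the taps, with the second block fed a delayed copy of the input. All function names are hypothetical.

```python
def fir(samples, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * x[n - k], x[n] = 0 for n < 0."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples[n - k]
        out.append(acc)
    return out


def fir_split(samples, taps, split):
    """Same filter as two blocks, as if mapped onto two processors.

    Block A applies taps[:split] to the input. Block B applies taps[split:]
    to the input delayed by `split` samples. Their outputs are summed.
    """
    part_a = fir(samples, taps[:split])
    delayed = ([0] * split + list(samples))[:len(samples)]
    part_b = fir(delayed, taps[split:])
    return [a + b for a, b in zip(part_a, part_b)]


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6]
    h = [1, -2, 3, 4]
    print(fir(x, h))
    print(fir_split(x, h, 2))  # identical to the single-block result
```

Because the two partial sums are independent until the final addition, the split changes where the work is done but not the result.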
Crucially, the PicoArray is fully deterministic. There is no run-time scheduling or arbitration. This helps in two ways. First, test and verification of each element can be greatly accelerated. Engineers can code a block, simulate it and then see exactly how it will perform. It is not necessary to perform statistical tests to check if a combination of circumstances may cause a problem. As all interactions are explicitly defined in code, there are no side effects, timing dependencies or vagaries due to statistical multiplexing.
Second, because the interaction between blocks is defined and predictable, there is confidence in how they interact. This orthogonality of function in subsystem test and verification is lacking in other technologies. Test becomes a summation of verified sub-blocks, rather than a combinatorial explosion of test cases.
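The difference between a summation and a combinatorial explosion is easy to quantify with a toy model. The sketch below assumes, hypothetically, that each block has a fixed number of internal states to exercise: independent blocks are tested one at a time, so the counts add, while blocks with run-time dependencies must be tested in every combination, so the counts multiply. The block and state counts are made-up illustrations.

```python
import math


def independent_test_cases(states_per_block):
    """Blocks verified in isolation: test counts simply add up."""
    return sum(states_per_block)


def coupled_test_cases(states_per_block):
    """Blocks that interact at run time: every combination must be covered."""
    return math.prod(states_per_block)


if __name__ == "__main__":
    blocks = [8] * 10  # assumed example: 10 blocks, 8 states each
    print("independent:", independent_test_cases(blocks))
    print("coupled:    ", coupled_test_cases(blocks))
```

Even at this modest size the coupled case runs to more than a billion combinations, which is why statistical testing is the usual fallback when blocks are not orthogonal.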
The PicoChip platform also includes a full tool chain, with C compiler, assembler and debugger, as well as a comprehensive software reference design for a full production-quality UMTS Node B basestation.
Doug Pulley is chief technology officer of PicoChip Designs Ltd. (Bath, England).