News & Analysis
Reconfigurable Processors: Changing the Systems Design Paradigm
Dinesh Venkatachalam
5/3/2005 12:00 AM EDT
System designers are always looking for the most efficient way to get their products to market. When presented with a new algorithm, they have to juggle constraints such as total system cost, performance, power, and the time needed to get that algorithm implemented in silicon. No existing solution has provided all the features they crave. That is changing: a new class of devices called reconfigurable processors finally appears to meet all of these requirements.
Reconfigurable processors provide an array of programming elements that can be reconfigured to suit the application, with no downtime to accommodate changes. The entire process, from algorithm exploration to final realization in hardware, is handled in a single integrated development environment.
At Legend Silicon, we design ASICs and systems for wireless communications, primarily for digital terrestrial TV. We do a lot of emulation and field trials with the emulation platform, and we run the same code in the emulation devices that we use in the ASIC. For the emulation devices, we started with DSP devices and then migrated to large FPGAs. This worked well enough for the first two generations of devices, but when we moved to deep sub-micron processes for our ASICs, we started having issues with the FPGA-based emulation systems, since they could not run at the same speed as the current generation of ASICs. We started looking for the next class of devices that could provide the emulation solution we wanted. This led us right to reconfigurable processors. The FPGAs we use are also reconfigurable, but their performance and power efficiency are an order of magnitude below those of these reconfigurable processors.
Figure 1: Comparison of the relative performance of DSPs, FPGAs, and reconfigurable processors as the algorithm complexity increases
The RISC processor handles the swapping of the foreground and background images into the processing array as well as the standard housekeeping tasks. The processing engines provide the high degree of flexibility and data-level parallelism to enable mapping of different and distinct algorithms seamlessly to hardware.
The other noticeable difference was the lower power consumption compared to FPGAs at the same performance point. We are still exploring the less obvious benefits of reconfigurable processors, such as time slicing and multi-functionality. We use some basic time-slicing capability to run different functions in turn on the same hardware. In the multi-function mode, we are trying to use the dynamic reconfiguration capability of the device to expand the capability of the system without having to add external components. This is critical, since a new board design and layout takes time and money.
The second challenge was the conversion of the generic C code into C code capable of taking advantage of the reconfigurable processor's architecture. This meant rewriting certain sections and adding structure to unroll loops. The integrated development environment helped ease the pain, and since we could compare the performance of the original and the modified C code using MATLAB, we could simulate and eliminate bugs during the code conversion process.
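As an illustration of the kind of loop restructuring described here, the following is a minimal sketch in plain C, not code from our design and not specific to the DAP/DNA tool flow. The function names (dot_reference, dot_unrolled4) and the choice of a dot product as the kernel are assumptions made for the example. It unrolls a multiply-accumulate loop by four with independent accumulators, so that the four partial sums have no dependency on one another and could, in principle, be mapped to parallel processing elements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference version: one multiply-accumulate per iteration,
 * with every iteration depending on the previous value of acc. */
int64_t dot_reference(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Hypothetical restructured version: unrolled by four with four
 * independent accumulators, so the partial sums can proceed in parallel. */
int64_t dot_unrolled4(const int16_t *a, const int16_t *b, size_t n)
{
    assert(a != NULL && b != NULL);

    int64_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;

    /* Main body: four samples per iteration. */
    for (; i + 4 <= n; i += 4) {
        acc0 += (int32_t)a[i]     * b[i];
        acc1 += (int32_t)a[i + 1] * b[i + 1];
        acc2 += (int32_t)a[i + 2] * b[i + 2];
        acc3 += (int32_t)a[i + 3] * b[i + 3];
    }

    /* Tail: handle the remaining 0 to 3 samples. */
    for (; i < n; i++)
        acc0 += (int32_t)a[i] * b[i];

    return acc0 + acc1 + acc2 + acc3;
}
```

Because the arithmetic is integer, the unrolled version returns exactly the same value as the reference, which makes a side-by-side check of original and modified code straightforward. With floating-point data the reordered additions could differ in the last bits, so a comparison would need a tolerance.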
The DAP/DNA device consists of a RISC processor (DAP) tightly meshed with an array of 376 processing elements (DNA). The RISC processor controls the array's reconfiguration and also performs some basic housekeeping tasks. The processing elements of the two-dimensional DNA matrix are simple but efficient building blocks capable of working in a sequential or parallel fashion. They can be configured as Add/Sub/Mult, Storage RAM, and I/O elements. The DAP/DNA comes with very high-speed interfaces that provide industry-standard connections, such as Direct-IO, DDR, and PCI, to other devices.
Figure 2: Block diagram of the DAP/DNA
To fully utilize the capability of the device, we had to change our thought process and think in a parallel fashion, so that we could make effective use of the four parallel data streams the DAP/DNA can handle. We also learned that in our applications, since we were processing information in blocks and had some leftover cycles, we could very easily reconfigure the device and make it perform additional functions for which we previously needed a separate discrete device.
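The reasoning about leftover cycles can be made concrete with a small budget check. The sketch below is an illustration in plain C under stated assumptions: the type slice_budget, the functions spare_cycles and fits_in_slice, and the model of two reconfigurations per block period (one to load the secondary function, one to restore the primary) are all hypothetical and do not come from the DAP/DNA documentation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-block cycle budget for a time-sliced device. */
typedef struct {
    uint32_t period_cycles;   /* cycles available in one block period      */
    uint32_t reconfig_cycles; /* assumed cost of swapping a configuration  */
} slice_budget;

/* Cycles left over in a block period after the primary function runs. */
uint32_t spare_cycles(const slice_budget *b, uint32_t primary_cycles)
{
    if (primary_cycles >= b->period_cycles)
        return 0;
    return b->period_cycles - primary_cycles;
}

/* True if a secondary function fits into the leftover cycles, counting
 * one reconfiguration to load it and one to restore the primary function. */
bool fits_in_slice(const slice_budget *b,
                   uint32_t primary_cycles,
                   uint32_t secondary_cycles)
{
    uint64_t needed = (uint64_t)secondary_cycles
                    + 2u * (uint64_t)b->reconfig_cycles;
    return needed <= spare_cycles(b, primary_cycles);
}
```

For example, with a made-up block period of 10,000 cycles, a primary function that takes 7,000 cycles, and a reconfiguration cost of 100 cycles, a secondary function of up to 2,800 cycles would fit in the same period. Real figures depend on the device clock, the block size, and the measured reconfiguration time.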
The device also provided a low-power solution, with an average power consumption of around 2 W per device, and a smooth path from algorithm exploration in MATLAB/C to final device implementation in hardware. We also tried the exploration using Dataflow-C, and with the help of the applications team, we were able to convert the generic C code into Dataflow-C.
The better we understood the capabilities of the device, the more discrete devices we replaced with a DAP/DNA running in a time-sliced configuration. We ended up using the device in a time-sliced function mode (increasing throughput with fewer resources) as well as in a multi-function mode (using the array to perform two completely different tasks).
The learning curve was steep, but in the end the result was worth it. We arrived at the most effective solution in terms of power and performance, and at a building block from which production systems can be built.

