System designers are always looking for the most efficient way to get their products to market. When presented with a new algorithm, they have to juggle constraints such as total system cost, performance, power, and the time needed to implement that algorithm in silicon. None of the existing solutions has provided all the features they want. That is changing: a new class of devices called reconfigurable processors finally appears to meet all of these requirements.
Reconfigurable processors provide an array of programmable elements that can be reconfigured to suit the application, with no downtime to accommodate changes. The entire process, from algorithm exploration to final realization in hardware, is handled in an integrated development environment.
At Legend Silicon, we design ASICs and systems for wireless communications, primarily for digital terrestrial TV. We do a lot of emulation and field trials on an emulation platform, running the same code in the emulation devices that we use in the ASIC. For the emulation devices, we started with DSPs and then migrated to large FPGAs. This worked well enough for the first two generations of devices, but when we moved to deep sub-micron processes for our ASICs, the FPGA-based emulation systems could no longer run at the same speed as the current generation of ASICs. We started looking for a class of devices that could provide the emulation solution we wanted, which led us to reconfigurable processors. The FPGAs we use are also reconfigurable, but their performance and power efficiency are an order of magnitude below those of these reconfigurable processors.
Figure 1: Comparison of the relative performance of DSPs, FPGAs, and reconfigurable processors as the algorithm complexity increases
Basic Design of Reconfigurable Processors
The typical architecture of a reconfigurable processor consists of a RISC processor coupled with a two-dimensional array of processing elements (PEs). Each PE is either a computing engine or a memory. The devices also have very high-speed I/O interfaces, which let the system take advantage of the tremendous processing power of the PEs.
The RISC processor handles the swapping of the foreground and background images into the processing array as well as the standard housekeeping tasks. The processing engines provide the high degree of flexibility and data-level parallelism to enable mapping of different and distinct algorithms seamlessly to hardware.
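The foreground/background swap can be pictured with a small C model. This is only a sketch under stated assumptions: it assumes the "images" are configuration images held in two banks, and every name in it (`config_image`, `pe_array`, `load_background`, `swap_banks`, `NUM_PES`) is hypothetical rather than part of any vendor API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_PES 16  /* hypothetical array size, chosen only for the sketch */

/* One configuration image: a hypothetical opcode per processing element. */
typedef struct {
    unsigned char pe_opcode[NUM_PES];
} config_image;

/* Two banks: the array runs bank[active] (foreground) while the
 * controller fills the other bank (background). */
typedef struct {
    config_image bank[2];
    int active;  /* index of the foreground bank: 0 or 1 */
} pe_array;

/* Controller side: copy the next configuration into the background bank.
 * The foreground configuration is left untouched. */
static void load_background(pe_array *a, const config_image *next)
{
    assert(a != NULL && next != NULL);
    memcpy(&a->bank[1 - a->active], next, sizeof *next);
}

/* Exchange roles: the background bank becomes the foreground bank. */
static void swap_banks(pe_array *a)
{
    assert(a != NULL);
    a->active = 1 - a->active;
}

/* The configuration the array is currently executing. */
static const config_image *foreground(const pe_array *a)
{
    return &a->bank[a->active];
}
```

In this model the slow part (loading) happens off to the side, and the switch itself is a single index flip, which is the property that makes reconfiguration without downtime plausible.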
Benefits of Reconfigurable Processors
The first difference we noticed between the reconfigurable processor solution and the FPGA solution was deterministic place and route. We could map directly from algorithm to gates with no intermediate tweaks or iterations for timing closure. With our older flow, designs had to be mapped with considerable margin on internal data paths so that the device would still meet its timing requirements when new functions were added.
The other noticeable difference was lower power consumption than FPGAs at the same performance point. We are still exploring other, less obvious benefits of reconfigurable processors, such as time slicing and multi-functionality. We already use basic time slicing to run different functions in turn on the same array. In multi-function mode, we are trying to use the device's dynamic reconfiguration capability to expand what the system can do without adding external components. This is critical because a new board design and layout costs time and money.
Challenges of Reconfigurable Processors
The main challenge in converting an existing design to this new methodology is the availability of C-code and algorithm designers. There is no simple migration path from RTL to gates.
The second challenge is converting generic C-code into C-code that can take advantage of the reconfigurable processor's architecture. For us, this meant rewriting certain sections and adding structure to unroll loops. The integrated development environment eased the pain, and because we could compare the performance of the original and the modified C-code using MATLAB, we could simulate and eliminate bugs during the conversion.
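To illustrate the kind of restructuring involved, here is a minimal sketch in plain C of a multiply-accumulate loop before and after unrolling by four. It is not our production code and not Dataflow-C; the function names and the choice of a dot product are assumptions made for the example. Integer arithmetic is used so that the two versions can be compared for exact equality.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generic version: one multiply-accumulate per iteration, with every
 * iteration depending on the previous value of acc. */
static int64_t dot_generic(const int32_t *x, const int32_t *h, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int64_t)x[i] * h[i];
    return acc;
}

/* Unrolled by four with independent accumulators. The four partial sums
 * have no dependency on one another, so a parallel array could in
 * principle compute them side by side. Assumes n is a multiple of 4. */
static int64_t dot_unrolled4(const int32_t *x, const int32_t *h, size_t n)
{
    assert(n % 4 == 0);
    int64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        a0 += (int64_t)x[i]     * h[i];
        a1 += (int64_t)x[i + 1] * h[i + 1];
        a2 += (int64_t)x[i + 2] * h[i + 2];
        a3 += (int64_t)x[i + 3] * h[i + 3];
    }
    /* Combine the partial sums in a small reduction tree. */
    return (a0 + a1) + (a2 + a3);
}
```

Running both versions on the same test vectors and checking that the results match is the C-level analogue of comparing the original and modified code during conversion. With floating-point data the two versions can differ in the last bits because the order of additions changes, so a tolerance would be needed there.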
Device of Choice
The reconfigurable processor we selected is the DAP/DNA-2 from IPflex. The DAP/DNA device differs from other reconfigurable processors in that it is the only dynamically reconfigurable device capable of completely re-mapping within a single clock cycle to handle new algorithms or new functionality. This feature, when fully utilized, can keep the device and the system fully functional with no downtime, which is critical for certain applications. We chose this product mainly because it was the only working silicon that met our processing needs.
The DAP/DNA device consists of a RISC processor (DAP) tightly meshed with an array of 376 processing elements (DNA). The RISC processor controls the array's reconfiguration and also performs some basic housekeeping tasks. The two-dimensional processing elements of the DNA matrix are simple but efficient building blocks capable of working in a sequential or parallel fashion. They can be configured as Add/Sub/Mult, Storage RAM, and I/O elements. The DAP/DNA comes with very high-speed interfaces, such as Direct-IO, DDR, and PCI, that provide industry-standard connections to other devices.
Figure 2: Block diagram of the DAP/DNA
To fully utilize the device, we had to change our thought process and think in parallel, so that we could make effective use of the four parallel data streams the DAP/DNA can handle. We also learned that, because our applications process information in blocks and leave some cycles unused, we could easily reconfigure the device to perform additional functions that previously required a separate discrete device.
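The leftover-cycle argument comes down to simple arithmetic, sketched below in C. The structure, the function names, and all cycle counts are hypothetical and chosen only for illustration; the one figure taken from the device description is that a reconfiguration can complete in a single clock cycle. The sketch assumes two reconfigurations per block period: one to switch to the secondary function and one to switch back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-block cycle budget. */
typedef struct {
    uint32_t block_period_cycles;  /* cycles available per data block */
    uint32_t primary_cycles;       /* cycles the main function consumes */
    uint32_t reconfig_cycles;      /* cycles for one reconfiguration */
} block_budget;

/* Cycles left over in each block period after the primary function. */
static uint32_t spare_cycles(const block_budget *b)
{
    assert(b->primary_cycles <= b->block_period_cycles);
    return b->block_period_cycles - b->primary_cycles;
}

/* True if a secondary function fits in the spare cycles, counting one
 * reconfiguration to switch to it and one to switch back. */
static bool secondary_fits(const block_budget *b, uint32_t secondary_cycles)
{
    uint64_t needed = (uint64_t)secondary_cycles
                    + 2u * (uint64_t)b->reconfig_cycles;
    return needed <= spare_cycles(b);
}
```

For example, with a hypothetical 10,000-cycle block period, 7,000 cycles of primary work, and single-cycle reconfiguration, a secondary function of up to 2,998 cycles fits in the same period.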
The device also provided a low-power solution, with average power consumption of around 2 W per device, and a smooth path from algorithm exploration in MATLAB/C to final implementation in hardware. We also tried exploration using Dataflow-C and, with the help of the applications team, were able to convert the generic C-code into Dataflow-C.
The better we understood the capabilities of the device, the more discrete devices we replaced with a DAP/DNA running in a time-sliced configuration. We ended up using the device in a time-sliced mode (increasing throughput with fewer resources) as well as in a multi-function mode (using the array to perform two completely different tasks).
The learning curve was steep, but in the end the result was worth it. We arrived at the most effective solution in terms of power and performance, and a building block from which production systems can be built.
About the Author
Dinesh Venkatachalam is the VP of Engineering at Legend Silicon Corp. He received his M.S. in Electrical Engineering from the University of Missouri-Rolla. He has over 18 years of experience managing ASIC development teams in the field of digital communications. Over his career, he has designed or managed successful implementations of over 25 ASICs.