Chip architects are faced with many decisions when designing a system on a chip (SoC). The chip often contains some number of control processors, signal processors and peripheral cores. In addition to these cores, special function blocks often are required to support performance-critical functions. These special acceleration blocks are very important because they provide an opportunity for chip architects to add special algorithms and differentiate their products from others on the market.
There are two basic ways to design acceleration blocks. The traditional approach is to design the acceleration blocks from scratch, using Verilog or VHDL code (RTL). This is a well-understood but time-consuming process.
The second approach is to use configurable processors instead of RTL coding. Customization techniques, available in varying flavors from ARC, MIPS and Tensilica, now provide performance rivaling what can be obtained through RTL coding.
Our applications group studied the implementation of MPEG-4 video encoding as well as an Advanced Encryption Standard (AES) engine. This study compared a traditional RTL hardware approach with using Tensilica's configurable Xtensa processor to help decide which is better: RTL coding or configurable processors. The results were quite interesting and the answer is: it depends.
Sometimes, designing acceleration hardware with RTL is best. In other cases, equivalent performance can be obtained from a configurable processor with much less design effort. The goal of this article is to highlight the various trade-offs involved between these two alternatives and to help you make intelligent decisions about real SoC designs.
Why develop custom hardware?
SoC designers often implement custom hardware accelerators for the compute-intensive algorithms of MPEG-4 encoding because software-based encoding is too slow on any processor running at speeds below a few GHz. Once the hardware accelerators are ready, designers combine them with a main processor core that handles other MPEG-4 functions in software. The architectural block diagram shown in Figure 1 illustrates an MPEG-4 encoder SoC that incorporates hardware accelerators.
Figure 1 -- Block diagram of MPEG-4 encoder SoC using hardware accelerators
System performance can be increased by using multiple hardware accelerators to process many MPEG-4 encoder algorithms in parallel. However, splitting an application's algorithms across multiple hardware accelerators complicates the design and verification process. Hardware accelerators are often hard to create from complex specifications and they are very expensive to change once a product is deployed.
The alternative SoC design approach is to use configurable processors. The designer can create a customized processor by merging acceleration units into the processor as opposed to placing them outside the processor in separate RTL blocks. The internal acceleration logic becomes part of the processor's programming model as new instructions available for the processor to execute. This approach is shown in Figure 2.
Figure 2 -- Block diagram of MPEG-4 encoder SoC using specialized processors
The use of specialized processors in a SoC design simplifies the design cycle by mapping the MPEG-4 encoder's algorithms to optimized software. In general, it's much easier to change software than RTL descriptions coded in Verilog or VHDL. In addition, if the software resides in RAM during system operation, it's much easier to change in the field than hardware that's been cast in silicon.
An SoC designer must evaluate which of the design approaches (custom RTL hardware accelerators or application-tailored processors) offers the best mix of performance, power, die area, and development cost. In addition to these concerns, the designer must weigh the design risks of each approach. The following sections explore the pros and cons of both approaches based on our work on implementation alternatives for several types of algorithms.
In most applications, system performance is the most important issue. Our research and test cases showed that, unless the SoC design team creates full-custom hardware accelerators at the transistor level instead of using synthesized logic, both approaches provide comparable performance, assuming that comparable algorithm acceleration techniques are used.
Configurable processors and custom RTL approaches allow designers to create combinatorial logic to accelerate any algorithm. Acceleration logic operating outside the processor (traditional RTL) is equivalent in speed (measured in MHz) and performance (measured in operations per cycle) to acceleration logic operating inside a configurable processor. Additionally, both techniques can exploit microscopic parallelism of a given algorithm by creating parallel datapaths to process multiple operands. (Such algorithms apply the same operations across many operands.)
For example, a motion-estimation algorithm compares macroblocks (a 16x16 set of pixel elements) to neighboring macroblocks using a sum-of-absolute-differences calculation. Designers can accelerate the motion-estimation algorithm by creating acceleration logic that calculates the sum-of-absolute-differences across many macroblocks at the same time.
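The sum-of-absolute-differences calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the encoder's actual code: the function names `sad` and `best_match`, and the representation of a macroblock as 16 rows of 16 pixel values, are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Block = Sequence[Sequence[int]]  # assumed layout: 16 rows of 16 pixel values (0..255)


def sad(current: Block, candidate: Block) -> int:
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(c - r)
        for cur_row, ref_row in zip(current, candidate)
        for c, r in zip(cur_row, ref_row)
    )


def best_match(current: Block, candidates: List[Block]) -> Tuple[int, int]:
    """Return (index, sad) of the candidate block that differs least from `current`."""
    scores = [sad(current, cand) for cand in candidates]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    flat = [[100] * 16 for _ in range(16)]   # uniform grey macroblock
    near = [[101] * 16 for _ in range(16)]   # every pixel differs by 1
    far = [[120] * 16 for _ in range(16)]    # every pixel differs by 20
    print(best_match(flat, [far, near]))     # (1, 256)
```

Each call to `sad` touches 256 pixel pairs, and a motion search repeats it for every candidate position, which is why this inner loop is a natural target for acceleration.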
Figure 3 -- Datapath of acceleration logic for parallel calculation of sum-of-absolute-differences
Figure 3 shows a 128-bit data path for such a sum-of-absolute-differences acceleration block. If an SoC implements the data path shown in Figure 3 in RTL, the performance may be hindered by the limitations of the relatively narrow interfaces found on fixed processors. Without additional logic to maximize data throughput across the processor interface, bus bandwidth may easily become the performance bottleneck for the entire design. The main design challenge in this context is the design of hardware that keeps the acceleration block busy. Designers often use hardware accelerators with built-in DMA controllers and buffer RAMs to mitigate this problem.
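A rough calculation shows why the interface matters. The sketch below compares the operand bandwidth a 128-bit data path can consume with what a narrower bus can deliver. The 32-bit bus width, the shared 250 MHz clock and the one-operand-per-cycle simplification are illustrative assumptions, not figures from this study, and the function names are hypothetical.

```python
def bandwidth_bytes_per_s(width_bits: int, clock_hz: float) -> float:
    """Peak bandwidth of an interface that moves one word of `width_bits` per clock."""
    return width_bits / 8 * clock_hz


def datapath_utilization(datapath_bits: int, bus_bits: int, clock_hz: float) -> float:
    """Fraction of cycles the data path stays busy if the bus is its only operand source."""
    demand = bandwidth_bytes_per_s(datapath_bits, clock_hz)
    supply = bandwidth_bytes_per_s(bus_bits, clock_hz)
    return min(1.0, supply / demand)


if __name__ == "__main__":
    clock = 250e6  # assumed clock for both the bus and the data path
    print(bandwidth_bytes_per_s(128, clock) / 1e9)   # 4.0 (GB/s the data path can consume)
    print(datapath_utilization(128, 32, clock))      # 0.25
```

Under these assumptions the data path would sit idle three cycles out of four, which is the gap that DMA controllers and buffer RAMs are meant to close.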
A designer can incorporate the data path shown in Figure 3 directly into a configurable processor as a designer-defined acceleration unit. This unit appears in the processor's programming model as a SIMD (single instruction, multiple data) instruction. Through extensibility, a configurable processor's data-path performance is on par with an RTL accelerator, assuming that the interface between the processor core and instruction's acceleration logic can keep up.
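To make the SIMD idea concrete, the following Python sketch models a hypothetical designer-defined instruction, here called `sad16`, that takes two 128-bit register values (16 packed 8-bit pixels each) and returns their sum of absolute differences in one step. The name, the packing order and the register model are assumptions for illustration; they are not taken from the Xtensa instruction set.

```python
LANES = 16                  # 128 bits / 8 bits per pixel
MASK128 = (1 << 128) - 1


def pack_row(pixels: list) -> int:
    """Pack 16 8-bit pixels into one 128-bit integer, pixel 0 in the low byte."""
    assert len(pixels) == LANES and all(0 <= p <= 255 for p in pixels)
    return int.from_bytes(bytes(pixels), "little")


def sad16(reg_a: int, reg_b: int) -> int:
    """Model of one SIMD instruction: 16 absolute differences plus an adder tree."""
    a = (reg_a & MASK128).to_bytes(LANES, "little")
    b = (reg_b & MASK128).to_bytes(LANES, "little")
    return sum(abs(x - y) for x, y in zip(a, b))


def macroblock_sad(current: list, reference: list) -> tuple:
    """SAD of a 16x16 macroblock; returns (sad, number of sad16 issues)."""
    total = 0
    issues = 0
    for cur_row, ref_row in zip(current, reference):
        total += sad16(pack_row(cur_row), pack_row(ref_row))
        issues += 1
    return total, issues


if __name__ == "__main__":
    cur = [[10] * 16 for _ in range(16)]
    ref = [[13] * 16 for _ in range(16)]
    print(macroblock_sad(cur, ref))   # (768, 16)
```

In this model a macroblock costs 16 instruction issues rather than 256 scalar subtract-and-accumulate steps, provided the register file can supply two 128-bit operands per issue.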
A designer can create individual instructions that access data from several internal registers, perform simultaneous calculations on all the accessed data, and then write results to several other internal registers. Paths between the processor core and internal acceleration units are relatively short, thereby allowing data transfers at higher frequencies than external bus interfaces. Data transfers occur without the delays associated with transferring data across a slow bus interface. When external bus transactions are required, the configurable Xtensa processor boosts external data transfers with an optional 128-bit memory interface.
Besides the data paths, acceleration units also contain control functions. With RTL accelerators, control functions are implemented as hardware state machines. By contrast, configurable processors perform control functions in software. Control functions built with hardware run faster than those implemented with software running on conventional processor architectures. However, designers can construct new instructions for a configurable processor that accelerate control functions so software-based control functions can often approach the performance of hardware state machines.
Many applications also have macroscopic parallelism: they contain independent processes that may execute concurrently, and this parallelism is exploited by running multiple machines independently. For example, a designer can create multiple hardware accelerators with independent state machines to accelerate MPEG-4 bit-stream coding, allowing the SoC to process multiple video objects concurrently. However, a designer may also exploit such macroscopic parallelism using multiple configurable processors to process the video objects with similar performance.
If performance is the number one concern, which approach should be used? The RTL accelerator approach may have a performance edge due to a more cycle-efficient control implementation, but not by a huge margin, especially considering the limitations of conventional processor bus interfaces. Configurable processors may achieve comparable performance. Therefore, designers must consider other criteria besides performance before selecting the optimal approach.
The silicon real estate occupied by logic directly affects chip cost and overall system costs. Therefore, area is a major criterion when selecting between the use of a fixed processor with custom RTL hardware accelerators or configurable processors. By definition, the SoC already includes a processor so the incremental number of gates required for the acceleration logic is roughly equivalent whether that logic resides inside or outside the processor.
The algorithmic control functions accelerated with a configurable processor require some additional memory for code and data, but this increase is negligible compared to the memory that is required to support other functions already performed in software (operating system, file management, user interface). For example, the control code required for a complete Advanced Encryption Standard (AES) cipher on an Xtensa processor requires less than 700 bytes of additional memory.
Besides adding functions, designers can also finely tune configurable processors and toss out unnecessary and space-consuming features. The designer only selects features that significantly improve the system performance. At some point the designer will see diminishing gains such that adding even more features provides very little performance improvement. The figures below demonstrate this point using cache size as an example.
Figure 4 -- Performance & area vs. cache size profiles for AES encryption
Consider an SoC for wireless LANs. A designer can create a customized processor that is capable of handling the AES cipher at a rate of 1 Gbps by constructing new instructions that accelerate the cipher. Figure 4 shows graphs of system performance and area versus cache memory size (this data is from actual tests using Tensilica's Xtensa processor at 0.13 µm/250 MHz). Significant performance gains are achieved when direct-mapped cache size increases from 0 to 8 Kbytes of memory.
There is little need to use a 16- or 32-Kbyte cache memory because these larger cache sizes do not improve performance of the AES code. The sweet spot lies somewhere between 2 and 8 Kbytes of cache memory, depending on the relative importance of processing rate and die size. The designer selects a cache memory size by analyzing the area curve and comparing that to the area constraints for the SoC.
Additional cache memory would be wasteful, because it does not improve performance commensurate with the added area. The designer can also perform similar evaluation on myriad other processor options such as register-file size, size and distribution of local memories, and additional instructions. Proper selection of configuration options results in a processor core with the most "bang-per-gate."
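The selection process described above amounts to walking the performance and area curves until the marginal gain no longer justifies the marginal area. The sketch below shows one way to express that rule. The data points are made-up placeholders chosen only to show the shape of such a curve; they are not the measurements behind Figure 4, and the threshold and function name are likewise assumptions.

```python
from typing import List, Tuple

# (cache_kbytes, relative_performance, relative_area): PLACEHOLDER values, not measured data
Point = Tuple[int, float, float]


def pick_cache_size(points: List[Point], min_ratio: float = 0.25) -> int:
    """Return the largest cache size reached before returns diminish too far.

    `points` must be sorted by cache size. For each step up, the fractional
    performance gain is divided by the fractional area growth; the walk stops
    at the first step whose ratio falls below `min_ratio`.
    """
    chosen = points[0][0]
    for (_, p0, a0), (size, p1, a1) in zip(points, points[1:]):
        perf_gain = (p1 - p0) / p0
        area_growth = (a1 - a0) / a0
        if area_growth > 0 and perf_gain / area_growth < min_ratio:
            break
        chosen = size
    return chosen


if __name__ == "__main__":
    placeholder_curve = [
        (0, 1.00, 1.00),
        (2, 1.60, 1.10),
        (4, 1.80, 1.20),
        (8, 1.90, 1.40),
        (16, 1.91, 1.80),
        (32, 1.91, 2.60),
    ]
    print(pick_cache_size(placeholder_curve))        # 8
    print(pick_cache_size(placeholder_curve, 1.0))   # 4
```

A stricter threshold models a design where die area matters more than processing rate, so the chosen size moves toward the small end of the range.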
The area cost required to accelerate a particular algorithm is similar whether an SoC is designed with a configurable processor or a fixed processor with RTL accelerators. However, if the application consists of many processes that must run concurrently, a fixed processor with RTL acceleration will have area advantages due to its superior parallelism.
As stated earlier, a designer can create several RTL accelerators that execute multiple processes in parallel to increase system performance. Although the designer can achieve the same performance through the use of multiple configurable processors, the area required for such an approach is likely to be greater than for one fixed processor and several hardware accelerators.
Power efficiency is paramount for portable systems. Even when the system gets its juice from a wall socket, power efficiency is important because high power translates into increased system costs including bigger power supplies, fans, and even higher cabinet or case costs.
The more gates on a chip that switch every clock cycle, the more power a chip will consume. As mentioned earlier, designing a custom RTL block may have a gate-count advantage when the system requires the performance attained by executing many processes in parallel. It follows that the smaller gate-count requirement of a custom RTL accelerator translates into less power consumption than a multiple processor-based implementation, as long as this approach delivers the required performance.
In some cases, RTL implementations are more cycle efficient than software-based implementations running on a configurable processor. The equation below shows that the dynamic power dissipation for a CMOS gate is proportional to the processor's switching frequency (fp).
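The relationship referred to above is the standard first-order expression for CMOS dynamic (switching) power. It is given here as a sketch: $f_p$ is the switching frequency named in the text, while the activity factor $\alpha$, load capacitance $C_L$ and supply voltage $V_{DD}$ are the usual textbook symbols and are assumptions as far as this article's notation is concerned.

```latex
P_{\mathrm{dyn}} = \alpha \, C_L \, V_{DD}^{2} \, f_p
```

With $\alpha$, $C_L$ and $V_{DD}$ held fixed, halving $f_p$ halves the dynamic power.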
An RTL implementation may operate at a reduced frequency compared to a configurable processor implementation, and will consequently consume less power for a given performance level. In addition, the savvy SoC designer will further reduce power by throttling clocks to every circuit in the design. For this reason, a fixed processor with RTL acceleration may offer an advantage for low power designs.
Still, designing with a configurable processor may provide access to power saving features that minimize the power differences between the two approaches. Rather than designing power-efficient RTL, designers may configure processors with power-efficient features.
For example, designers can configure Xtensa processors that use functional clock gating to dynamically control clocks to various modules in the processor. If an MPEG-4 algorithm is not currently running, then the clocks to the MPEG-4 acceleration units in the configurable processor are gated off. These acceleration units are only active when the associated instructions execute. Unlike RTL design, Xtensa processor clock gating is configured without extra effort on the designer's part.
There is another power benefit to placing acceleration blocks inside the configurable processor as opposed to outside of a standard processor. Signals between the processing core and internal acceleration blocks don't need to travel across a long and highly loaded external bus. Paths between the processor core and acceleration blocks are relatively short, thereby reducing the power required to drive them.
SoC design and schedule
The traditional approach to SoC design starts with a specification that partitions the system into hardware and software. Hardware and software teams commonly work in isolation from each other at the beginning of a design cycle. Integration is scheduled at the project's end. This approach postpones difficult integration challenges until the tail end of the schedule. Yet unpleasant surprises during the integration phase are commonly cited as one of the main reasons for schedule slippage.
When using configurable processors, hardware acceleration blocks are built into the processor as new instructions so there is no integration phase, per se. Rather, the integration phase is replaced with a different set of challenges. When designers add acceleration blocks into a configurable processor, the software and simulation tools must be extended to support the new instructions. Depending on the number and complexity of the new instructions, the effort required to extend the development environment can be significant if these tools are not automatically generated to match the new instructions.
Tensilica's development tools (C compiler, instruction-set simulator, etc.) automatically extend to support new instructions, allowing designers to concentrate on accelerating their application rather than on the development environment. Designers can create several versions of instructions that result in different performance and area profiles. From these, the designer will choose instructions that provide the optimal blend of performance and area. In our experience, designers typically create between 10 and 30 new instructions to accelerate their applications.
Configurable processors offer a considerable time-to-market advantage when applications are described in C/C++. Freely distributed source code for standards-based algorithms such as MPEG-4, JPEG2000, and MP3 is easily found on the Internet. Often, this source code is actually part of the standard. This source code provides a head start when developing an SoC using configurable processors. A designer profiles the application for compute-intensive C expressions or functions and gradually replaces the compute-intensive versions with efficient new ones incorporating designer-defined instructions, while validating the equivalency of the new code to the original C code.
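The replace-and-validate loop can be sketched as a small regression harness: keep the original routine as the golden reference, substitute the version built on the new instruction, and compare the two over many inputs. In the Python sketch below, `reference_sad` stands in for the original C routine and `accelerated_sad` for the rewritten one. Both names, the `sad4` helper that models a designer-defined instruction, and the random-test strategy are assumptions for illustration.

```python
import random


def reference_sad(a, b):
    """Stand-in for the original, unmodified routine (the golden reference)."""
    total = 0
    for x, y in zip(a, b):
        total += abs(x - y)
    return total


def sad4(a4, b4):
    """Model of a hypothetical designer-defined instruction handling up to 4 pixels."""
    return sum(abs(x - y) for x, y in zip(a4, b4))


def accelerated_sad(a, b):
    """Rewritten routine: same result, expressed in terms of the new instruction."""
    total = 0
    for i in range(0, len(a), 4):
        total += sad4(a[i:i + 4], b[i:i + 4])
    return total


def validate(trials=1000, length=256, seed=0):
    """Return the number of random inputs on which the two versions disagree."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(trials):
        a = [rng.randrange(256) for _ in range(length)]
        b = [rng.randrange(256) for _ in range(length)]
        if reference_sad(a, b) != accelerated_sad(a, b):
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print(validate())   # 0
```

A result of zero means the rewrite matched the reference on every trial. Random testing of this kind raises confidence but does not prove equivalence.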
In contrast, the custom RTL accelerator designer must abandon the C source code entirely and develop RTL from the specification. Manual interpretation of a specification is often difficult and error-prone, hence increasing the challenge to the designer. Moreover, designers must wait until accelerator blocks are completed and integrated with other SoC components before performing tests across the entire application.
The configurable processor design approach is better suited for applications with standards that are in flux. Because of the improved methodology of constructing and validating acceleration logic from source code, designers can quickly modify their designs to track modifications to application standards.
A configurable processor design approach is also better suited for projects with goals that are moving targets. For example, an SoC designer may initially implement algorithms in software because simulation results show that performance goals are easily met using this approach. However, as the design phase nears completion, the competitive landscape may change and the performance goals of the design increase or new features may be required.
Rather than shelve the current design and re-architect the entire SoC, the designer can salvage a configurable processor-based architecture by adding new instructions to the configurable processor or changing the software code. A configurable processor based approach thus offers design insurance, allowing designers to hedge their bets against the unexpected.
Simulation and debug
When simulating an SoC containing a fixed processor/RTL combination, the hardware designer creates test benches that exercise each RTL block in isolation from the complete system. The test bench must accurately model how software and other hardware blocks interact with the hardware under test. Software designers prepare C behavioral models of the custom hardware accelerators, assuming that these models accurately model the operation of the actual hardware. However, simulation of individual isolated hardware blocks will not uncover problems that occur when the individual blocks are integrated.
At some later stage in the design cycle, after the design team has developed the hardware interface between the processor and the RTL block, designers can use co-verification tools to simulate both hardware and software domains simultaneously. But even with this environment, debug will be tediously slow if many complex hardware blocks must be simulated simultaneously.
Some configurable processors simplify simulation of acceleration logic by treating this logic as a core component of the processor. For example, Tensilica's software debugging environment is extended with complete visibility into all wires and states within the acceleration units, in much the same way that it provides access to the core register file. While single-stepping through new instructions, the software developer can see how the acceleration logic inside the processor operates. Through this mechanism, the software developer evaluates system performance and debugs problems in the acceleration units using familiar tools instead of analyzing timing waveforms generated from hardware simulation dump files.
Instead of a hardware simulator, the processor's instruction set simulator (ISS) simulates new instructions added to a configurable processor as part of an acceleration unit. Simulation of functions is orders of magnitude faster using an ISS, when compared to hardware simulation. Because of this speed advantage, the turn-around time between modification of new instructions and observing results over an entire application such as an MPEG-4 decoder is typically measured in minutes.
In contrast, turn-around times for design modifications of equivalent hardware blocks are typically much longer. Fast simulation of new instructions translates into acceleration units that are far more robust, because the instructions have been debugged across many more test cases than their custom hardware accelerator counterparts. For these reasons, the configurable processor approach has the advantage when it comes to simulation and debugging.
Verification and risk
With today's rising chip implementation and mask costs, prototype chips are vanishing. Designers are increasingly under pressure to get it right the first time, so a greater effort is going into chip verification. Design trends show that verification now consumes upwards of 80% of the overall design effort.
The beauty of the software-based approach is that the verification effort is vastly reduced because the processors running the software are pre-verified. Pre-verified cores save design teams a great amount of effort. Even when processor cores are synthesizable, they generally include a test bench that allows verification after synthesis.
Because software can be patched on a deployed product, risk is minimal. If a fatal bug is detected after a product has been deployed, a software patch can be reprogrammed into the system's memory. For this reason, designers are commonly implementing algorithms of even moderate complexity in software. However, algorithms mapped completely to software result in reduced system performance.
As stated earlier, RTL accelerators are designed to supplement processor performance. Each additional RTL block adds to the design risk, particularly the verification risk. Algorithms contain data-path and control functions. The data path (combinatorial RTL) functions are generally easy to verify and present a low risk when implemented in hardware. In contrast, control functions (sequential RTL) are significantly more difficult to verify and consequently incur more risk for hardware implementation. A simple state-machine design error may cause the state machine to enter an undefined state.
To mitigate this type of failure, verification engineers typically create many test vectors to exercise state transitions. With increasingly complex state machines, generating exhaustive test vectors becomes impractical. For complex state machines, verification engineers prepare vectors only for the most frequently traversed state transitions and accept the design risks.
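The risk described here can be illustrated with a table-driven state machine. The Python sketch below lists which (state, input) pairs have no defined transition and measures what fraction of the defined transitions a test sequence exercises. The states, inputs and function names are hypothetical and far simpler than a real accelerator's controller.

```python
from itertools import product

STATES = ("IDLE", "LOAD", "RUN", "DONE")
INPUTS = ("start", "data_ready", "finish", "reset")

# Transition table with deliberate gaps, e.g. nothing handles "reset" in state RUN.
TRANSITIONS = {
    ("IDLE", "start"): "LOAD",
    ("IDLE", "reset"): "IDLE",
    ("LOAD", "data_ready"): "RUN",
    ("LOAD", "reset"): "IDLE",
    ("RUN", "finish"): "DONE",
    ("DONE", "reset"): "IDLE",
}


def undefined_pairs(transitions=TRANSITIONS):
    """Every (state, input) pair the table does not cover."""
    return [pair for pair in product(STATES, INPUTS) if pair not in transitions]


def transition_coverage(stimulus, transitions=TRANSITIONS, start="IDLE"):
    """Run `stimulus` and return the fraction of defined transitions exercised.

    Raises KeyError on an undefined pair, the software analogue of a hardware
    state machine wandering into an undefined state.
    """
    state, seen = start, set()
    for symbol in stimulus:
        next_state = transitions[(state, symbol)]
        seen.add((state, symbol))
        state = next_state
    return len(seen) / len(transitions)


if __name__ == "__main__":
    print(len(undefined_pairs()))   # 10 of the 16 pairs are unspecified
    print(transition_coverage(["start", "data_ready", "finish", "reset"]))  # 4 of 6 transitions
```

Even this four-state example leaves most input combinations unspecified, and the number of pairs to check grows with the product of states and inputs, which is why exhaustive vectors quickly become impractical.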
SoC designers can reduce risk by implementing an algorithm's data path with hardware accelerators, where data is processed quickly, and control functions in software, which pose less of a design risk. But such partitioning of an algorithm's datapath and control functions is ineffective because of the delays incurred when software control functions interact with external hardware. Several cycles are required for software to issue commands and transfer data across external busses.
The bottom line is that the datapath and control functions must be integrated to fully exploit the performance potential of the acceleration logic. Using a configurable-processor design approach, designers integrate both datapath and control functions into the processor, accelerating the algorithm's datapath with new instructions. This approach significantly reduces design risk by handling the algorithm's control functions in software, which can be changed and re-verified more easily than RTL.
New instructions added to a configurable processor must be verified. Interface errors are likely to occur between the new instruction logic and the processor core. This is particularly true of configurable processors that require the designer to modify the RTL. The risk is not so great for configurable processors that provide a mechanism for automatically adding the hardware required for designer-defined instructions. For such a processor, the designer need not verify the underlying, machine-generated logic.
In addition to interface errors, functional errors are likely to occur when designers optimize instruction logic to reduce gate count and improve timing. To prevent functional errors, designers must verify that the optimized instructions indeed perform the desired operations. The verification of optimized instructions alone has the potential of becoming a huge verification task. However, configurable processors that provide a framework for performing equivalency checking between concise descriptions of the new instruction and the optimized implementation can significantly reduce instruction verification time.
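For a narrow operation, equivalence between the concise description and the optimized implementation can be checked exhaustively. The sketch below does this in Python for an 8-bit absolute difference. `absdiff_spec` plays the role of the concise description and `absdiff_optimized` a hand-optimized, branch-free version. Both are hypothetical examples and say nothing about how any vendor's equivalence-checking framework works internally.

```python
def absdiff_spec(a: int, b: int) -> int:
    """Concise description: absolute difference of two 8-bit unsigned values."""
    return abs(a - b)


def absdiff_optimized(a: int, b: int) -> int:
    """Branch-free variant modelled on 9-bit two's-complement hardware arithmetic."""
    diff = (a - b) & 0x1FF                 # 9-bit subtract result
    sign = (diff >> 8) & 1                 # 1 when a < b
    mask = -sign & 0x1FF                   # all ones when negative, else zero
    return ((diff ^ mask) + sign) & 0xFF   # conditional negate, truncated to 8 bits


def first_mismatch():
    """Exhaustively compare both versions; return the first failing pair or None."""
    for a in range(256):
        for b in range(256):
            if absdiff_spec(a, b) != absdiff_optimized(a, b):
                return a, b
    return None


if __name__ == "__main__":
    print(first_mismatch())   # None
```

A result of `None` means the two versions agree on all 65,536 operand pairs. Exhaustive enumeration only scales to narrow operands; wider instructions need formal equivalence-checking tools instead.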
Designers must carefully consider each algorithm and decide on either a high-performance, high-risk, hardware accelerator implementation, or a lower-performance, low-risk, software-based implementation. The configurable processor bridges the gap between these two extremes of performance and risk as shown in Figure 5. Configurable processors offer a means for designers to significantly reduce design risk while maintaining performance levels comparable to hardware-based design approaches.
Figure 5 -- Specialized instructions bridge the risk and performance gap
If the SoC designer's goal is to create the fastest and most area- and power-efficient chip on the market, then RTL-based accelerators may be the right approach, as long as the increased development effort required by this approach is acceptable. On the other hand, if the designer is more concerned about time-to-market, getting it right the first time, and being able to adapt quickly to specification changes, then the ease of design, reduced risk, and flexibility of a configurable processor design approach make it the better choice.
Designers can get the best of both worlds with a hybrid approach to accelerating compute-intensive algorithms: boost performance by implementing low-complexity algorithms with custom RTL accelerators, and minimize risk by implementing high-complexity algorithms with configurable processors. A designer can determine which algorithms are sufficiently complex and are best implemented with a configurable processor by surveying the design team about the verification effort for the algorithm in question. If verification is not an issue, a simple hardware accelerator for that algorithm is ideal. However, if there appear to be unrealistic expectations regarding verification, then configurable processors are likely the best bet.
Del Miranda is the Applications Engineering Manager responsible for international (non-U.S.) customer support for Tensilica, Inc. Prior to working at Tensilica, he was an Application Engineering Manager for Hitachi Semiconductor of America, supporting the Hitachi SuperH RISC product line. He has also held marketing and system engineering positions at Zilog Inc., promoting Z80-based microcontrollers.