The complexity of systems-on-chip for embedded wireless devices is increasing while, at the same time, greater flexibility of the implementation is being demanded. In addition, the cost penalty and the probability of design errors in ASICs are steadily rising. These factors are the key drivers for heterogeneous platforms consisting of programmable processor cores combined with dedicated hardware modules.
Heterogeneous platforms provide high computational performance for runtime-critical, data-flow-dominated tasks, combined with high flexibility for complex mixed control-/data-flow-oriented tasks. Furthermore, the software part of these platforms allows bug fixes and adaptations to changing requirements at low cost, and lets a design be reused across several product cycles.
These considerations figured prominently in the design of a flexible and reusable processor core for acquisition and tracking control in a Digital Video Broadcasting-Terrestrial (DVB-T) receiver chip. The processor is responsible for system control tasks and performs the timing synchronization of the sampling window, as well as the synchronization of sampling clock and carrier frequency. Application profiling revealed that these mixed control/data-flow-oriented tasks have tight time constraints that cannot be met with current off-the-shelf processor cores.
An alternative to an off-the-shelf processor would be to develop a processor with the required application-specific instructions to meet the time constraints. Traditionally, such an approach would have been a last resort because of the range of architectural, software and implementation skills necessary to complete an application-specific CPU design.
But the advent of system-level design tools that allow tasks to be modeled and tracked at the transaction level makes the development of such systems more tractable. And the emergence of design environments built specifically for generating application-specific processors and their language tools makes the actual CPU creation considerably more attractive.
We used this approach, employing SystemC in Synopsys Inc.'s CoCentric System Studio and the LISATek processor design environment. To meet the time constraints, we performed instruction-level optimization of the performance-critical computational kernels.
As an example, consider the loop body of an iterative coordinate rotation digital computer (CORDIC) computation in the DVB-T processor. Basically, the CORDIC algorithm needs conditional additions and subtractions combined with shift operations to either rotate a two-dimensional vector (rotate mode) or compute the magnitude and phase of this vector (vectoring mode). In addition, a look-up table of precomputed angles is needed.
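For reference, the vectoring-mode loop body can be sketched as a floating-point model in Python. The function name and table are illustrative only, not the production firmware; the point is that each iteration uses exactly the primitives named above: a shift (modeled here by scaling with 2**-i), a conditional add/subtract decided by the sign of y, and one table look-up.

```python
import math

# Look-up table of precomputed angles atan(2**-i), as held in a small ROM.
ANGLES = [math.atan(2.0 ** -i) for i in range(16)]

def cordic_vectoring(x, y, iterations=16):
    """Rotate (x, y) onto the x-axis; return (scaled magnitude, phase)."""
    z = 0.0
    for i in range(iterations):
        d = 1 if y >= 0 else -1          # decision bit = sign of y
        x, y, z = (x + d * y * 2.0 ** -i,   # conditional add/subtract
                   y - d * x * 2.0 ** -i,   # ... with chained shift
                   z + d * ANGLES[i])       # table look-up accumulates phase
    return x, z   # x holds K * magnitude, z holds the phase
```

The magnitude comes out scaled by the constant CORDIC gain K of about 1.647, which is normally compensated once, outside the loop.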
An optimized software implementation can take advantage of the inherent parallelism of this CORDIC computation by exploiting specialized instructions, which perform several operations simultaneously. One straightforward idea for the CORDIC is an instruction implementing a conditional addition/subtraction with the sign bit of a given register as the decision bit. Furthermore, a shift operation can be chained with this conditional operation and a table look-up can be performed in parallel. Obviously, even for this simple example, there are many options that need to be explored and evaluated to get an optimum implementation.
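As a sketch of how such a specialized instruction might behave, the Python model below fuses the shift and the sign-controlled addition/subtraction into a single operation. The mnemonic csas and the 16.16 fixed-point format are assumptions for illustration, not the actual instruction of the DVB-T core; on the real datapath the angle-table read would issue in parallel.

```python
import math

def csas(acc, src, shift, sel):
    """Model of a hypothetical fused conditional-shift-add/subtract
    instruction: an arithmetic right shift chained into an adder/subtractor
    whose direction is selected by the sign bit of sel."""
    term = src >> shift
    return acc + term if sel >= 0 else acc - term

# Angle ROM of atan(2**-i) values in 16.16 fixed point.
ANGLE_ROM = [round(math.atan(2.0 ** -i) * 65536) for i in range(16)]

def cordic_vectoring_fx(x, y, iterations=16):
    """Vectoring-mode CORDIC in 16.16 fixed point built from csas ops."""
    z = 0
    for i in range(iterations):
        d = y                                    # decision bit = sign of y
        x, y, z = (csas(x, y, i, d),             # x += sign(y) * (y >> i)
                   csas(y, x, i, ~d),            # y -= sign(y) * (x >> i)
                   csas(z, ANGLE_ROM[i], 0, d))  # z += sign(y) * atan(2**-i)
    return x, z
```

On a plain RISC, each csas line would expand into a shift, a sign test, a branch and an add or subtract, which is where the optimization potential of the fused form comes from.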
As a starting point for instruction-set optimization, a basic RISC-like instruction set with simple two-operand instructions and register-indirect addressing mode was used. This instruction set was supported by an optimizing high-level language compiler used to quickly obtain an initial software implementation for incremental optimization.
For the DVB-T application benchmark, the runtime of the initial CORDIC implementation is about 6.6 microseconds. After instruction-set optimization, the runtime is reduced to 0.82 microseconds, meeting the time constraints of the application, which corresponds to a relative reduction of about 87 percent. This optimization increases the silicon area by less than 5 percent, due to an additional adder/subtractor and several multiplexers.
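The reported figures are easy to check; the numbers below are taken directly from the benchmark above:

```python
t_base = 6.6   # microseconds, initial RISC-only CORDIC implementation
t_opt = 0.82   # microseconds, after instruction-set optimization

reduction = 1.0 - t_opt / t_base   # relative runtime reduction, about 87.6%
speedup = t_base / t_opt           # equivalent speedup factor, about 8x

print(f"{reduction:.1%} reduction, {speedup:.1f}x speedup")
```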
The basic principle of tailoring a processor's instruction set to the requirements of an application typically has the additional advantage of increasing energy efficiency. In today's instruction-set-oriented processors, the clock tree, the instruction memory accesses, the instruction fetch and decode operations, and the data routing account for the major part of the total energy consumption, typically 60 to 90 percent. In contrast to this "overhead energy," which is not directly useful for computations, the so-called "intrinsic energy" is consumed in the functional units to perform the actual computations. Ideally, this intrinsic energy can be regarded as the minimum energy required to perform a certain computation. It is a function only of the kind of operation, the technology and operating conditions, and the architecture of the operator implementation; it is largely invariant to how the operations are scheduled.
A detailed power evaluation of the previously mentioned CORDIC application yields an overhead energy of 68 percent, including the on-chip instruction memory. The CORDIC instruction-set optimization results in energy savings of about 85 percent by minimizing this overhead energy. Comparable studies in industry and academia show that this kind of instruction-set optimization can be viewed as a joint energy-performance optimization.
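As a back-of-the-envelope check, the decomposition below simply plugs in the article's percentages:

```python
E_total = 1.0
E_overhead = 0.68 * E_total           # clock tree, fetch/decode, memories, routing
E_intrinsic = E_total - E_overhead    # 0.32: energy spent in the functional units
E_after = (1.0 - 0.85) * E_total      # 0.15: energy left after optimization

# Note that E_after < E_intrinsic: the reported savings exceed the
# baseline's overhead share alone, consistent with the joint
# energy-performance view of the optimization.
```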
Because the instruction-set optimizations are performed incrementally, that is, without removing the original basic RISC instructions, they do not impair the high flexibility of the original processor.
The final task of platform-based application-specific instruction-set processor (ASIP) design is system integration and verification, with the optimized ASIP embedded in the system context. This important task is easily performed with the previously mentioned tools because of the tight coupling between the ASIP simulators and the system design environment.
Essentially, ASIPs can be used as modules for platform-based design, increasing the design flexibility with programmability. Compared to fixed-processor cores, instruction-set optimized ASIPs also provide a significantly increased computational performance and energy efficiency.
See related chart