General-purpose instruction processors have dominated computing for a long time. However, they tend to lose performance when dealing with nonstandard operations and nonstandard data that is not supported by the instruction set format. The need for customizing instruction processors for specific applications is particularly acute in embedded systems, such as cell phones, medical appliances, digital cameras and printers.
One way of supporting customization is to augment an instruction processor with programmable logic for implementing custom instructions. Several vendors are offering a route to such implementations. The processors involved are usually based on existing architectures, such as those from ARM, IBM and MIPS. These fixed-instruction processor cores are interfaced with programmable logic, which provides the resources that implement a set of custom instructions for a given application.
Another way to support customization of instruction processors is to implement them using existing FPGAs. In this case, it is possible to customize the entire instruction processor at compile time or at run time.
Customizing ASIPs
Recent work on application-specific instruction processors (ASIPs) demonstrates the benefits of their customization. The trade-offs involved in designing ASIPs differ from those of general-purpose processors. Similarly, trade-offs involved in application-specific integrated circuit (ASIC) implementations of ASIPs differ from those of FPGA implementations.
While many ASIPs have been developed manually, an alternative approach called flexible instruction processor (FIP) is being developed at Imperial College in London. It provides an automatic method for instruction processor design and optimization and is based on the idea of capturing the instruction interpretation process as a parallel program, which a hardware compiler can turn into a circuit for implementation on an FPGA. The approach is particularly suitable for customizing an ASIP to meet given performance requirements and resource constraints.
Basically, the FIP consists of a processor template and a set of parameters. Different processor implementations, such as stack-based or register-based, can be produced by varying the parameters for that template, or by combining and optimizing existing templates. The parameters for a template are selected to transform a skeletal processor into a processor suited for its task.
Possible parameterizations include addition of custom instructions, removal of unnecessary resources, customization of data and instruction widths, optimization of op-code assignments and variation of the amount of pipelining.
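As a rough illustration of the template-plus-parameters idea, the sketch below models a parameter set as a small immutable record and derives a customized variant from a skeletal default. The class name, field names and default values are hypothetical and chosen only for illustration; they are not taken from the FIP framework itself.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FipParams:
    """Hypothetical parameter set for a processor template (illustrative only)."""
    style: str = "stack"                          # "stack" or "register"
    data_width: int = 32                          # bits per data word
    instr_width: int = 8                          # bits per instruction word
    pipeline_stages: int = 1                      # amount of pipelining
    custom_instructions: frozenset = frozenset()  # added custom instructions
    removed_resources: frozenset = frozenset()    # resources the application never uses


# The skeletal processor uses the defaults.
skeletal = FipParams()

# Derive a variant for a narrow-data embedded task: smaller data path,
# two pipeline stages, one assumed custom instruction, no floating point.
embedded = replace(
    skeletal,
    data_width=16,
    pipeline_stages=2,
    custom_instructions=frozenset({"mac16"}),
    removed_resources=frozenset({"fpu"}),
)
```

Keeping the record immutable means each variant is a separate, comparable design point, which suits exploring several parameterizations side by side.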
FIPs are assembled from a processor template, with modules connected by communicating channels. When a FIP is assembled, the required instructions are included from a library that contains implementations of these instructions in various styles. Depending on which instructions are included, resources such as stacks and different decode units are instantiated. Channels provide a mechanism for managing dependencies between instructions and resources. The efficiency of an implementation is often highly dependent on the style of the processor selected.
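The library-driven assembly described above can be sketched as follows. This is a minimal model under stated assumptions: the instruction names, the resource names and the `assemble` helper are all hypothetical, and real library entries would carry hardware descriptions rather than plain resource labels.

```python
# Hypothetical instruction library: each entry lists the resources
# that its implementation needs to have instantiated.
LIBRARY = {
    "push":  {"stack"},
    "pop":   {"stack"},
    "iadd":  {"stack", "alu"},
    "load":  {"stack", "memory"},
    "store": {"stack", "memory"},
    "fmul":  {"stack", "fpu"},
}


def assemble(required):
    """Return the set of resources to instantiate for the requested instructions."""
    missing = sorted(set(required) - set(LIBRARY))
    if missing:
        raise KeyError(f"instructions not in library: {missing}")
    resources = set()
    for name in required:
        resources |= LIBRARY[name]
    return resources


# An application that only pushes, pops and adds needs no memory unit or FPU.
resources = assemble(["push", "pop", "iadd"])
```

Only the resources reachable from the selected instructions are returned, which mirrors the removal of unnecessary resources mentioned earlier.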
Specialized processor styles, such as the Three Instruction Machine, are designed specifically to execute a particular language, in this case a functional language. The efficiency of different processor templates depends on the application and the implementation medium. Hence, for a given application the choice of the processor style is an important decision. Issues such as the resources and speed requirements are affected by the decision.
The FIP framework is currently implemented using Handel-C tools available from Celoxica Inc. Handel-C is a language based on C that contains language extensions to support efficient hardware implementation. It has been chosen because it enables the entire design process to be captured at a high level of abstraction, which benefits both the design of the processor and the inclusion of custom instructions. Handel-C also facilitates rapid prototyping of designs. Current research is focused on providing FIPs that are customized for specific applications, particularly lightweight implementations for embedded systems.
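To make the idea of capturing instruction interpretation as a program more concrete, here is a minimal sketch of a stack-based interpreter loop. It is written in Python rather than Handel-C and runs sequentially; in a hardware description the fetch, decode and execute steps could run in parallel. The instruction names and the `run` function are assumptions made for this example only.

```python
def run(program):
    """Interpret a list of (opcode, operand) pairs on a tiny stack machine."""
    stack = []
    pc = 0
    while pc < len(program):
        op, arg = program[pc]          # fetch
        pc += 1
        if op == "push":               # decode and execute
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "halt":
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack


# Computes (2 + 3) * 4 and leaves the result on the stack.
program = [
    ("push", 2), ("push", 3), ("add", None),
    ("push", 4), ("mul", None), ("halt", None),
]
result = run(program)
```

Adding a custom instruction in this model amounts to adding one more branch to the decode step, which is the kind of change a high-level description makes cheap.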
The source code for an application can be captured in various forms. It can be a conventional language such as C or Java, or it can be a block diagram language such as Simulink for MATLAB applications. The compilation from source code to hardware consists of two steps. First, create an appropriate FIP; second, generate the code that runs on the FIP from the source code. There are two possible compilation paths: using an existing compiler, or using a FIP-specific compiler.
In one approach, an existing compiler can be used to compile the source code. This compiler can be a standard one, or a compiler used in a previous FIP design. A design environment has been developed to evaluate the compiled code to determine possible optimizations for the FIP. It also reorganizes the code to exploit instruction-level parallelism and other optimization opportunities. This is similar to the idea of just-in-time compilation for the Java Virtual Machine (JVM). The advantage of this strategy is that existing compilers can be used and compiled code can execute on the processor without modification. However, since it is often difficult to identify possible optimizations in compiled code, this approach may yield a less optimal solution than using a FIP-specific compiler.
In another approach, the source code is annotated with relevant information, such as the frequency of the use of instructions, common instruction groups and shared resources. Source code then includes useful information for optimizing both the compiled code and the FIP.
In this way both the FIP and the code that runs on it can be optimized to meet given performance and resource constraints. The advantage of this strategy is that no information is lost during the entire design flow, enabling the optimization process to be as effective as possible.
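One way such frequency annotations might feed into op-code assignment is sketched below. The policy shown, giving the most frequently used instructions the lowest op-code numbers and sizing the op-code field to the instructions actually used, is an assumption for illustration and not necessarily the method used in the FIP tools. The trace and function names are hypothetical.

```python
from collections import Counter


def assign_opcodes(trace):
    """Map each instruction to an op-code, most frequent first.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(trace)
    ordered = sorted(counts, key=lambda name: (-counts[name], name))
    return {name: code for code, name in enumerate(ordered)}


def opcode_bits(opcodes):
    """Smallest op-code field width (in bits) that can encode every op-code."""
    return max(1, (len(opcodes) - 1).bit_length())


# Hypothetical instruction usage gathered from annotated source code.
trace = ["iload", "iadd", "iload", "istore", "iload", "iadd"]
opcodes = assign_opcodes(trace)
width = opcode_bits(opcodes)
```

With only three distinct instructions in the trace, a two-bit op-code field is enough, which illustrates how usage information can drive customization of instruction widths.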
Researchers at Imperial College have developed methods for automating the generation of FIP-specific tools, such as the compiler that produces FIP-specific code from the annotated source code. To evaluate the FIP approach, they have implemented various FIPs based on the JVM specification. Many parameterizations and optimizations have been investigated, including removal of unnecessary resources, customization of data and instruction widths, optimization of op-code assignments and variation of the amount of pipelining.
Several versions of FIP-based JVMs are described here for illustration. The first version of the JVM involves segregated resources that are shared. This provides good area utilization at the expense of speed, because of routing congestion. The second version of the JVM introduces two stages of pipelining and shares only irreplaceable resources, such as the stack and main memory. Stack-based processors are intrinsically sequential. Speed optimization of the JVM tends to introduce parallelism that manifests as register-style implementations of instructions.
The third version of the JVM incorporates deeper pipelines for certain instructions and "register-style" improvements such as having top-of-stack registers. Such registers are replicated. Instructions can be read from different top-of-stack registers but are written back to the stack directly. Replicated registers are updated during the fetch cycle. Most instructions are processed by four pipeline stages, although certain instructions, such as those for invoking functions, require deeper logic and their implementations have been partitioned into five or six pipeline stages. Routing has also been pipelined to reduce the effects of congestion.
The evolution of the three versions of the JVM demonstrates trade-offs between the possible parameterizations. For instance, pipelining is useful for reducing clock cycle time. However, resources such as stacks may have operation dependencies that limit the amount of overlapping between instructions, and they introduce latency when pipelined. Most of the customized JVMs have been successfully implemented using the RC1000-PP system, which contains a Xilinx Virtex FPGA and multiple banks of memory. The Handel-C compiler includes support for the RC1000-PP to simplify the implementation process.
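The pipelining trade-off mentioned above can be illustrated with a simple first-order timing model. Everything here is an assumption for the sake of the example: the formula, the register overhead and the numbers are hypothetical and are not measurements of the JVM designs.

```python
def run_time_ns(n_instr, stages, logic_delay_ns, stalls_per_instr, reg_overhead_ns=1.0):
    """Estimate execution time under a first-order pipeline model.

    Splitting the logic into more stages shortens the clock cycle, but each
    instruction may also incur stall cycles caused by dependencies on shared
    resources such as the stack.
    """
    cycle_ns = logic_delay_ns / stages + reg_overhead_ns
    cycles = (stages - 1) + n_instr * (1 + stalls_per_instr)
    return cycles * cycle_ns


# 1,000 instructions with 40 ns of logic per instruction (made-up figures).
unpipelined = run_time_ns(1000, stages=1, logic_delay_ns=40.0, stalls_per_instr=0.0)
pipelined_ideal = run_time_ns(1000, stages=4, logic_delay_ns=40.0, stalls_per_instr=0.0)
pipelined_stalled = run_time_ns(1000, stages=4, logic_delay_ns=40.0, stalls_per_instr=0.5)
```

In this model the four-stage pipeline is still faster than the unpipelined design even with stalls, but the dependencies give back a noticeable share of the gain.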
The performance of the third version of the above FIP-based JVM has been compared with a software-based JVM running on a 300-MHz Pentium processor. Using the CaffeineMark 3.0 Java benchmark, the FIP-based JVM at 33 MHz runs twice as fast as the Pentium, and a FIP with a deeper pipeline is estimated to be seven times faster.
Another performance comparison involves the FIP-based JVM and an ASIC-based Java processor from Hyundai Microelectronics, the GMJ30501SB, which is based on the PicoJava1 core from Sun Microsystems. Although the ASIC runs four times faster than the FIP running on an FPGA, there are some points to keep in mind.
First, the ASIC processor is running at 200 MHz, compared with the FIP at 33 MHz. Second, the ASIC processor has fixed instructions while the FIP can incorporate custom instructions; the speedup provided by a FIP is expected to increase as further custom instructions are added. Third, additional custom hardware, such as external interface logic, can be included in the FPGA on which the FIP is based to provide a single-chip solution. Finally, there is much scope for optimizing the current FIP implementations, for instance by storing the FIP instructions in the fast on-chip memory of the FPGA.
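The clock-rate caveat can be made concrete with a little arithmetic on the figures quoted above. Normalizing benchmark speed by clock frequency is a crude comparison and is used here only as a sketch; the variable names are illustrative.

```python
# Figures quoted in the text.
fip_mhz = 33.0
asic_mhz = 200.0
pentium_mhz = 300.0
asic_speed_vs_fip = 4.0      # ASIC runs four times faster than the FIP
fip_speed_vs_pentium = 2.0   # FIP runs twice as fast as the software JVM

# The ASIC clock is about 6.06 times faster than the FIP clock ...
asic_clock_ratio = asic_mhz / fip_mhz

# ... so per clock cycle the FIP does about 1.5 times the work of the ASIC.
fip_per_cycle_vs_asic = asic_clock_ratio / asic_speed_vs_fip

# Against the Pentium, the FIP is twice as fast at roughly one-ninth the
# clock rate, or about 18 times the work per clock cycle.
fip_per_cycle_vs_pentium = fip_speed_vs_pentium * (pentium_mhz / fip_mhz)

print(f"ASIC/FIP clock ratio:          {asic_clock_ratio:.2f}")
print(f"FIP work per cycle vs ASIC:    {fip_per_cycle_vs_asic:.2f}")
print(f"FIP work per cycle vs Pentium: {fip_per_cycle_vs_pentium:.2f}")
```

On this rough measure, the fourfold gap to the ASIC is more than accounted for by the difference in clock frequency.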
See related chart