Programmable system-on- chip (SoC) components, containing a microcontroller and reconfigurable hardware, promise the flexibility of software and the performance of hardware. But the reconfigurable hardware is not truly hard since it is programmable, so from where does its benefit come?
Its advantage is that it enables a spatial model of computing that is traditionally associated with hardware, compared to a microprocessor's temporal fetch-and-execute model. Applications often contain a mix of computation, some better suited to spatial computation and some better suited to temporal computation, prompting the development of hybrid architectures that can support both efficiently.
Our group designed such an architecture, Garp, which features a single-issue microprocessor with a rapidly reconfigurable array attached as a coprocessor for loop acceleration.
Since our research group is investigating reconfigurable hardware primarily in the context of general-purpose computing, compiler development is an integral part of our architecture research. In contrast, automatic compilation to reconfigurable architectures has been considered less important in the embedded domain because most designers have hardware expertise. However, as embedded applications grow in size, the productivity benefits of automatic compilation also grow. Designers can start with automatic compilation, then focus their efforts on improving a few loops while benefitting from the compiler's speedup on the remainder. Further, with the compiler's support, the designer's required effort is reduced in many cases to simply restructuring the code or embedding compiler directives.
The Garp compiler's challenge is to exploit the reconfigurable hardware's parallelism starting from a sequential software program. We drew heavily from compilation techniques for very long instruction word (VLIW) architectures, which also exploit operation-level parallelism.
Garp's coprocessor is a two-dimensional array of configurable logic blocks. Array details dictate that modules run horizontally along a row. Fast, flexible carry chains running along each row enable fast sums and comparisons, while horizontal wire channels between rows enable fast shifts between adjacent modules. Vertical buses of various lengths statically connect modules to each other to form a standard bit slice layout.
Memory buses, time-multiplexed cycle to cycle, provide the path into and out of the reconfigurable array. Garp's array has four 32-bit data buses and one 32-bit address bus. While the array is idle, the processor can use the memory buses to load configurations or to transfer data between processor registers and array registers. While the array is active, it is master of the memory buses and uses them to access memory-the same memory system viewed by the microprocessor, including all levels of caching. Garp also supports streaming memory accesses.
One of Garp's novel features is the array's fixed clock frequency and simplified timing model. The timing model enumerates exactly what combinations of routing and computation delay in series can be traversed in one clock cycle or less. For example, a module can receive an operand over a short vertical bus and perform a carry chain computation within a single cycle.
The Garp compiler essentially targets loops for acceleration using the reconfigurable array. The compiler uses a straightforward approach where the one thread of control hops back and forth between the microprocessor and the array. When a loop utilizing the array is encountered during the execution of a program, the correct configuration is loaded; this takes minimal time if it was the last one used or if it is resident in the configuration cache. Live values are then transferred to the array, the array is activated and the microprocessor goes to sleep. When the loop exits, the microprocessor wakes up, determines which exit was taken, retrieves live variables from the array as appropriate for that exit, and continues software execution.
The Garp-specific part of the compiler begins by recognizing natural loops in the control flow graph. These are recognized whether they originate from FOR loops, WHILE loops or even backward GOTO statements in the C source code. All are candidates for acceleration. Break and continue statements are allowed. All loops are treated as WHILE loops-having data-dependent exits.
From each loop we form a hyperblock, which contains only the commonly executed paths in the loop. Only the hyperblock is implemented on the array. When an excluded path must be executed, it is executed on the microprocessor and transfers control back to software; control is returned to the array at the beginning of the next iteration.
Next, a data-flow graph is constructed from the basic blocks selected for the hyperblock. As the data-flow graph is built, all control dependence is converted to data dependence through the introduction of predicates. Peephole optimizations are then applied to the data-flow graph.
The optimized data-flow graph is then implemented as a direct, fully spatial network of modules on the array-every operation gets its own hardware, which never has to be time-multiplexed to execute other operations. Our fully spatial approach allows groups of adjacent data-flow graph nodes to be merged into optimized modules, reducing the area and critical-path delay. This spatial approach also simplifies pipelining.
In the case of Garp, pipelining is accomplished mainly through rescheduling the execution of the modules on the array, which is one of the last compilation steps. However, to get the most out of pipelining, two changes are needed during earlier phases of compilation. Both result from the fact that pipelined performance is usually governed by the longest feedback loop in the computation (the critical cycle) instead of the longest path through one iteration (the critical path). When considering a path to include or reject from the loop, the hyperblock selection algorithm now evaluates the impact on the critical cycle rather than the critical path. The module-mapping algorithm, when packing data-flow graph nodes into feasible modules, similarly adjusts its goal to minimizing the critical cycle instead of minimizing the critical path.
After module mapping, each module is considered as an atomic unit and its scheduling in time as well as its placement on the array are still flexible. Recall that modules are completely pipelined, with a register inserted every cycle of delay on internal paths.
The sequencer, a simple FSM triggering modules during the correct cycle of each iteration, is modified slightly to support having multiple iterations in flight simultaneously. Iterative modulo scheduling is used to resolve conflicts for the memory address bus between overlapping iterations. Register chains are inserted between modules as necessary to ensure that operands from the same iteration arrive at each module synchronously.
Pipelined computation, whether in software or hardware, characteristically has a prologue, the period from the start of computation to the time the first iteration completes, at which point the pipeline achieves a steady state. During the prologue latter portions of the pipeline should not be active, since the first iteration has not yet reached them.
With VLIW processors, a number of schemas for handling the prologue have been put forward. One category emits the prologue as separate code, with inactive pipeline stages simply eliminated. Another solution, termed "kernel-only," uses the same instructions for both the prologue and the steady-state execution, but uses stage predicates to suppress inactive stages during the prologue. The sequencing technique used by the Garp compiler more closely resembles the latter, since each iteration maintains the same schedule, even those started during the prologue. In some sense, the sequencer combines the functions of program counter with stage predicates.
Software pipelining of counted loops traditionally involves the generation of an epilogue: special code to finish off the iterations that are in progress at the point when the last iteration is started. The Garp compiler takes a simpler approach that does not require an epilogue: It simply allows the last iteration to complete using the array, then discards work that has started speculatively on subsequent iterations. This approach is possible because of the Garp architecture's support for speculative execution, and in particular, speculative loads-a subsequent iteration might execute a load with an invalid address. And, there is another benefit: It works for loops with data-dependent exits, allowing a uniform pipelining scheme for all loops.
It could be argued that a special epilogue should be generated for counted loops, since it can be optimized to avoid doing unnecessary work. However, with Garp, in most cases an optimized epilogue in software is still much slower than simply completing the last iteration in hardware, even if "extra" work is done in the latter case.
To quantify the effects of pipelining, we compiled and simulated execution of a representative embedded application, a wavelet image compression. This benchmark stresses reconfigurability by splitting execution time among several loops that together capture 87 percent of the original software execution time.
Taking into account all cache effects and configuration overhead, average speedup over software for the loops using the reconfigurable array increased from 2.1 without pipelining to 4.1 with pipelining. Almost half of the loops performed 10 operations per cycle or greater (neglecting stalls), a level of instruction-level parallelism beyond that achievable by most contemporary VLIW processors.
From the compiler's point of view, the main difference compared to VLIW modulo scheduling is that our case has very few and very simple resource conflicts. We do not face per-cycle instruction issue limits or complicated instruction-bundling rules. Each module has its own resources and so does not interfere with the execution of other modules. There is not even a limit on the number of exits (branches out of the hyperblock) evaluated per cycle. The single global shared resource that must be allocated among the modules is the address bus used for non-streamed memory accesses.
It is not surprising that pipelining is more straightforward with Garp's reconfigurable hardware, since pipelining is essentially a spatial concept. In contrast to a hardware pipeline where all operators can execute simultaneously, the VLIW processor can perform only a small number each cycle.
The VLIW challenge
While this is a limitation, we realized that for a VLIW processor, moving the data through the virtual pipeline is even more of a challenge than processing the data. The register chains added to the module pipelines in Garp represent enormous data transfer bandwidth, possible only because each register has its own read and write port and thus all can be accessed simultaneously. While reconfigurable hardware handles this naturally, VLIW architectures typically require special support-rotating register files-to sustain the same effective transfer bandwidth.
The compiler techniques were adapted from VLIW compilation, and in many cases simplified in the process. We consider it to be a positive result that our compiler can use simpler techniques to achieve greater parallelism in many loops, targeting what is in fact a simpler architecture than most contemporary VLIW architectures.
The Nimble Compiler project undertaken at Synopsys, based on an early version of the Garp compiler, has produced a version that targets not only Garp but also a hybrid platform using a Xilinx Virtex chip as its reconfigurable coprocessor. However, the Nimble Compiler's approach to pipelining is slightly different. Further, little of the Garp compiler's flow assumes bit-level reconfigurability; it could be easily retargeted to coarser-grained reconfigurable fabrics.