The key to a successful architecture is the compiler: it determines how effectively the architecture can be programmed and how efficiently it is utilized. In the case of Improv, where the target architecture contains multiple, heterogeneous very-long-instruction-word (VLIW) processors, the compiler provides the bridge between intuitive application development in a high-level language (Java) and an elegant but complex architecture.
Improv's compiler, which we have designated Solo, is the linchpin technology opening up the Jazz architecture to a wide range of application developers.
The Solo compiler was designed with the aggressive goal of extending state-of-the-art compilation to a new level. Among the goals set out were:
- automatic allocation and partitioning of application tasks onto the different processing engines on the Jazz architecture;
- retargetability to different Jazz configurations;
- high utilization of the Jazz VLIW instruction set;
- handling of user-defined, application-level constraints during compilation;
- and advanced optimization and feedback support.
It was clear from the start that we were talking about an entirely new approach to compilation that would tie together system-level control- and data-flow analysis, aggressive optimization techniques, schedule-based algorithms and flow management.
The compiler is designed to consider a complete application and to use the best resource allocation available to meet the application data and performance constraints on a target Jazz platform. The application designer uses the Java design environment (Improv's Application Development Framework) to specify and verify the applications. A control- and data-flow graph is generated along with the Java class files used by the compiler. The division of labor is simple: The application designer specifies the function, data flow, control flow and constraints of the application; Solo is responsible for creating a task-level schedule and allocation that implement the application on the Jazz platform.
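The division of labor above hinges on the task-level graph that the design environment hands to the compiler. A minimal sketch of such a control/data-flow graph follows; the class and method names (`Cdfg`, `Task`, `topologicalOrder`) are invented for illustration and are not Improv's actual API:

```java
import java.util.*;

// Hypothetical sketch of the task-level control/data-flow graph (CDFG)
// produced by the design environment. Nodes are application tasks;
// edges are control- or data-flow dependencies between them.
public class Cdfg {
    static class Task {
        final String name;
        final List<Task> successors = new ArrayList<>();
        Task(String name) { this.name = name; }
    }

    final List<Task> tasks = new ArrayList<>();

    Task addTask(String name) {
        Task t = new Task(name);
        tasks.add(t);
        return t;
    }

    // A dependency: 'to' may not start before 'from' has produced its data.
    void addEdge(Task from, Task to) { from.successors.add(to); }

    // A topological order of the tasks -- one order in which an allocator
    // may consider them without violating flow dependencies.
    List<Task> topologicalOrder() {
        Map<Task, Integer> indegree = new HashMap<>();
        for (Task t : tasks) indegree.putIfAbsent(t, 0);
        for (Task t : tasks)
            for (Task s : t.successors)
                indegree.merge(s, 1, Integer::sum);
        Deque<Task> ready = new ArrayDeque<>();
        for (Task t : tasks) if (indegree.get(t) == 0) ready.add(t);
        List<Task> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            Task t = ready.poll();
            order.add(t);
            for (Task s : t.successors)
                if (indegree.merge(s, -1, Integer::sum) == 0) ready.add(s);
        }
        return order;
    }

    public static void main(String[] args) {
        Cdfg g = new Cdfg();
        Task fir = g.addTask("fir");
        Task fft = g.addTask("fft");
        Task mix = g.addTask("mix");
        g.addEdge(fir, mix);   // mix consumes fir's output
        g.addEdge(fft, mix);   // and fft's output
        List<Task> order = g.topologicalOrder();
        System.out.println(order.get(2).name);  // prints "mix"
    }
}
```

In this sketch the allocator would walk the topological order, choosing an engine for each task as it becomes ready.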
To achieve the best results, Solo uses estimated information for task size in terms of data and code as well as task duration. It can even use data-dependent estimations and is designed to work in iterative mode to converge as estimates become more accurate.
Solo combines classic compiler technology with behavioral-synthesis and VLIW code-generation technology. A number of major design issues are addressed in each phase of the compiler. This article will focus on three: retargetability, coarse-grained allocation and fine-grained allocation.
Most programmable solutions on the market today have an inflexible tool environment in which a specific compiler is matched with a specific processing engine. Recently, some DSP-core providers have fielded support for generation of custom compilers for custom cores created using their technology. But that approach still requires a specific compiler for a single processing core.
Further, there are no compilers on the market today that handle automatic allocation of tasks among multiple, heterogeneous processing engines. Rather than take a traditional approach, Improv has developed a solution that allows the same compilation system to target different configurations of the Jazz architecture.
Jazz is highly modular. Specific configurations of the architecture can be used that specify the number of processing engines, the size of data and instruction memories, and the particular collection of datapath operators for any given engine.
Solo had to be designed to be flexible and adaptable to those configuration decisions. All of the programs contained within Solo can access the specific information about the target configuration through a configuration file. That file is processed by the different phases of the compiler to adjust automatically to the target Jazz configuration.
Also, to allow for maximum flexibility, Solo uses several layers of instruction formats and pseudo-operations targeting a virtual platform. As a result, only the assembly- and object-code generation depend on the actual target Jazz configuration.
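The configuration-file mechanism described above might be sketched as follows. The file format, field names and defaults here are invented for illustration; the actual Jazz configuration format is not described in this article:

```java
import java.util.*;

// Hypothetical sketch of how compiler phases might query a target Jazz
// configuration: engine count, memory sizes and datapath operators are
// read from a simple key=value file. All names are illustrative.
public class JazzConfig {
    final int engines;
    final int dataMemKb;
    final int instrMemKb;
    final Set<String> operators;

    JazzConfig(int engines, int dataMemKb, int instrMemKb, Set<String> operators) {
        this.engines = engines;
        this.dataMemKb = dataMemKb;
        this.instrMemKb = instrMemKb;
        this.operators = operators;
    }

    // Parse lines such as "engines=4" or "operators=mac,shift,alu".
    static JazzConfig parse(List<String> lines) {
        Map<String, String> kv = new HashMap<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq > 0) kv.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
        }
        return new JazzConfig(
            Integer.parseInt(kv.getOrDefault("engines", "1")),
            Integer.parseInt(kv.getOrDefault("dataMemKb", "32")),
            Integer.parseInt(kv.getOrDefault("instrMemKb", "32")),
            new HashSet<>(Arrays.asList(kv.getOrDefault("operators", "alu").split(","))));
    }

    // A code-generation phase would consult this before emitting an
    // operation for a configurable datapath operator.
    boolean supports(String op) { return operators.contains(op); }

    public static void main(String[] args) {
        JazzConfig cfg = JazzConfig.parse(Arrays.asList(
            "engines=4", "dataMemKb=64", "operators=mac,shift,alu"));
        System.out.println(cfg.engines + " " + cfg.supports("mac"));  // prints "4 true"
    }
}
```

Because every phase reads the same configuration object, only the final assembly- and object-code generation needs to know the concrete target.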
System-on-chip (SOC) platforms today often contain different on-chip processors, such as microcontrollers, DSPs or other custom blocks. The system designer is left to partition an application into different pieces and manually allocate tasks to the processors on which they will be run.
The Jazz architecture also contains multiple, heterogeneous processing engines, but a key design goal was to automate the allocation of tasks to processing engines based on the application developer's constraints. That frees the developer from an arduous task and can lead to much more efficient allocations, since the compiler can computationally analyze control- and data-flow dependencies to maximize performance and resource utilization.
The coarse-grained allocator in Solo handles the allocation of the application tasks and data to the available resources. Based on the application control and data flow, a complete control data-flow graph (CDFG) of the application is created in which the nodes are individual tasks within the application. The tasks themselves are stored in an abstract syntax tree representation based on the Stanford University Intermediate Format (SUIF).
The coarse-grained allocator in Solo performs a mapping of the CDFG onto the target Jazz architecture. That mapping consists of two main parts: the allocation of tasks to engines and of data to memories, and transformations of the graph.
One of the key challenges in designing the coarse-grained allocator was the definition of the objective function used during the allocation procedure. The objective function has three levels of operation: a demand-driven pipe dictating the desirable order of task allocation; a cost function for allocating tasks to an engine; and a cost function to break ties. The objective function uses a combination of heuristics and formalisms to create an allocation based on a weighted assessment of task constraints, critical paths, data weights attached to each task, the impact on the predicted schedule of the executing tasks, and the number of data transfers that need to take place.
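The layered objective function might be sketched as below: a primary cost for placing a task on an engine (predicted finish time plus a data-transfer penalty), with a tie-breaker on engine load. The specific terms, weights and field names are illustrative assumptions, not Solo's actual formula:

```java
// Hypothetical sketch of a two-level allocation cost: a primary cost
// function for placing a task on an engine, and a tie-breaking rule.
// All terms and weights are invented for illustration.
public class AllocCost {
    static class Engine {
        long busyUntil;      // predicted cycle at which the engine frees up
        int residentTasks;   // tie-breaker: prefer less-loaded engines
    }

    // Primary cost: when would this task finish on this engine, plus a
    // penalty for the data transfers the placement would force.
    static long cost(Engine e, long taskCycles, int transfers, long transferPenalty) {
        return e.busyUntil + taskCycles + transfers * transferPenalty;
    }

    // Pick the cheapest engine; break exact ties on resident-task count.
    static int pick(Engine[] engines, long taskCycles, int[] transfers, long penalty) {
        int best = 0;
        for (int i = 1; i < engines.length; i++) {
            long ci = cost(engines[i], taskCycles, transfers[i], penalty);
            long cb = cost(engines[best], taskCycles, transfers[best], penalty);
            if (ci < cb || (ci == cb && engines[i].residentTasks < engines[best].residentTasks))
                best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        Engine a = new Engine(); a.busyUntil = 100; a.residentTasks = 3;
        Engine b = new Engine(); b.busyUntil = 100; b.residentTasks = 1;
        // Equal primary cost, so the tie-breaker picks the engine with
        // fewer resident tasks.
        int chosen = pick(new Engine[]{a, b}, 50, new int[]{2, 2}, 10);
        System.out.println(chosen);  // prints "1"
    }
}
```

The demand-driven pipe of the real compiler would sit above this, deciding which task to present to `pick` next.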
One of the first design roadblocks we encountered was the need to assign a time cost to a task to determine how long it would occupy a given target engine. The accuracy of that cost assignment is critical to the efficiency of the allocation overall.
The only guaranteed performance estimate, however, is a worst-case assessment. We have incorporated three approaches to incrementally improve the quality of the task-performance estimation.
The first step is to perform a detailed operator and branch analysis to determine the worst-case branching characteristics and an initial estimate for the number of cycles required when the worst-case branches are taken. The second step is to feed back the actual results of generating VLIW instructions for the task on the target engine to refine the estimate and, possibly, to reallocate based on the new results.
Finally, the coarse-grained allocator can incorporate run-time statistical data collected during simulations of the application on which the task instructions have been measured and saved.
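The three-stage refinement might look like the following sketch: start from the guaranteed worst-case cycle count and tighten it as better information arrives. The numbers and the simple take-the-minimum update rule are illustrative assumptions, not Solo's actual model:

```java
// Hypothetical sketch of three-stage task-cost refinement: begin with a
// guaranteed worst case, then tighten using code-generation feedback
// and simulation statistics. The update rule is illustrative only.
public class CycleEstimate {
    final long worstCase;   // stage 1: analytical worst-case branch analysis
    long cycles;            // current best estimate, never above worstCase

    CycleEstimate(long worstCaseCycles) {
        this.worstCase = worstCaseCycles;
        this.cycles = worstCaseCycles;
    }

    // Stage 2: feed back the cycle count actually obtained after VLIW
    // code generation for the chosen target engine.
    void refineFromCodegen(long generatedCycles) {
        cycles = Math.min(cycles, generatedCycles);
    }

    // Stage 3: fold in run-time statistics collected during simulation,
    // e.g. the measured average over representative input data.
    void refineFromSimulation(long measuredAverage) {
        cycles = Math.min(cycles, measuredAverage);
    }

    public static void main(String[] args) {
        CycleEstimate e = new CycleEstimate(1000);  // worst case from branch analysis
        e.refineFromCodegen(700);                   // generated code is tighter
        e.refineFromSimulation(450);                // typical inputs take fewer cycles
        System.out.println(e.cycles);               // prints "450"
    }
}
```

Each refinement can trigger another allocation pass, which is how the compiler converges in its iterative mode.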
To aid allocation and scheduling, Solo has the freedom to partition and merge tasks and to transform the task graph as required, creating a more malleable set of constraints. In addition to satisfying the data- and control-flow requirements, Solo creates an allocation map that meets the application's timing, data and size constraints; task constraints expressed as hard limits on latency and on exclusivity with respect to other tasks are folded into the allocation cost function.
Control and data-flow transformations of the graph can be performed in two places during the compilation. A prepartitioning stage organizes the graph before allocation. It performs the splitting of tasks into microtasks, based on estimates of the time and size of the executable code.
Transformations can also be performed during allocation of the graph. As allocation proceeds, the graph can be transformed to achieve certain goals, such as data locality and time sharing of the engine resources.
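One prepartitioning transformation, splitting an oversized task into microtasks, might be sketched as follows. The sizes, the word-based units and the even-split policy are illustrative assumptions, not Solo's actual algorithm:

```java
import java.util.*;

// Hypothetical sketch of prepartitioning: a task whose estimated code
// size exceeds an engine's instruction memory is split into microtasks
// that each fit. All sizes and the splitting policy are illustrative.
public class Prepartition {
    // Returns the estimated code sizes (in instruction words) of the
    // microtasks produced from one oversized task.
    static List<Integer> split(int taskSizeWords, int instrMemWords) {
        List<Integer> micro = new ArrayList<>();
        int remaining = taskSizeWords;
        while (remaining > instrMemWords) {
            micro.add(instrMemWords);
            remaining -= instrMemWords;
        }
        if (remaining > 0) micro.add(remaining);
        return micro;
    }

    public static void main(String[] args) {
        // A 2500-word task on an engine with 1024 words of instruction
        // memory becomes three microtasks.
        System.out.println(Prepartition.split(2500, 1024));  // prints "[1024, 1024, 452]"
    }
}
```

In the real compiler the split points would of course respect task structure and data flow, not just raw size.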
The key to task-level performance on a VLIW micro-architecture is to maximize the number of operations that can be executed in a single instruction. Each processing engine on the Jazz architecture is a VLIW micro-architecture capable of executing 12 to 15 operations per cycle. Once Solo has generated a sequence of operations implementing a given task, the fine-grained allocator is responsible for transforming that sequence into a collection of VLIW instructions for the target processing engine.
The primary goal here is to map basic operations in the task code to available instruction slots, minimize the number of cycles required for the task execution, and minimize the number of overall required slots and instructions.
The fine-grained allocator considers a number of factors within the sequence of operations, including the life cycle of data instances, data dependency, control dependency and available resources to create the best fit. The key activities at this level are micro-scheduling, operator substitution, speculative evaluation, register assignment and software pipelining.
In addition, the fine-grained allocator exploits specific capabilities of the hardware to minimize overhead and maximize instruction level parallelism including hardware loops, byte addressing, block updates, conditional execution and stacked results registers.
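The core packing step of fine-grained allocation can be sketched as a greedy list scheduler that places a dependency-ordered sequence of operations into VLIW instructions with a fixed number of slots. The slot count, operation names and scheduling policy here are illustrative; Solo's actual micro-scheduler also handles operator substitution, speculation, register assignment and software pipelining, which this sketch omits:

```java
import java.util.*;

// Hypothetical sketch of fine-grained allocation: greedy list scheduling
// of operations into VLIW instructions of a fixed slot width, honoring
// data dependencies. Names and widths are illustrative.
public class VliwPack {
    static class Op {
        final String name;
        final List<Op> deps = new ArrayList<>();
        int issueCycle = -1;
        Op(String name) { this.name = name; }
    }

    // Pack ops (given in a valid sequential order) into instructions of
    // 'slots' parallel operations each; an op may not issue until the
    // cycle after all of its dependencies have issued.
    static List<List<Op>> pack(List<Op> ops, int slots) {
        List<List<Op>> instrs = new ArrayList<>();
        for (Op op : ops) {
            int earliest = 0;
            for (Op d : op.deps) earliest = Math.max(earliest, d.issueCycle + 1);
            int cycle = earliest;
            // Skip instructions whose slots are already full.
            while (cycle < instrs.size() && instrs.get(cycle).size() >= slots) cycle++;
            while (instrs.size() <= cycle) instrs.add(new ArrayList<>());
            instrs.get(cycle).add(op);
            op.issueCycle = cycle;
        }
        return instrs;
    }

    public static void main(String[] args) {
        Op a = new Op("load x"), b = new Op("load y"), c = new Op("mul");
        c.deps.add(a); c.deps.add(b);
        // The two independent loads share one instruction; the multiply,
        // which depends on both, issues in the next cycle.
        List<List<Op>> instrs = pack(Arrays.asList(a, b, c), 4);
        System.out.println(instrs.size());  // prints "2"
    }
}
```

Filling 12 to 15 slots per cycle, as the Jazz engines allow, is what makes dependency analysis at this level so valuable.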
In the final analysis, it is the quality of application mapping that matters. For Solo, the goal is to utilize the macro-level resources (engines) and micro-level resources (VLIW slots) with high efficiency and consistency and to meet application constraints. At the micro level, Solo can generate results comparable to high-end DSP chips in both code density and cycle count. It is more difficult to evaluate Solo's performance at the macro level, since no comparable tool is available.
In internal benchmarks on our target applications, Solo consistently achieves a minimum engine utilization above 60 percent, with average utilization in the 65-to-75 percent range.