Java compiler synthesizes a 300-MHz FPU

By Don Davis, Jonathan Harris, and Ian Miller

High-level design has become the next great hurdle for modern EDA. To put high-level design to the test, we synthesized a 40,000-gate floating-point unit (FPU) using the LavaLogic Forge Hardware Compiler, which compiles high-level algorithmic and functional descriptions into synthesizable RTL Verilog. Forge produced results that were generally within 10 percent of hand-coded designs, with big gains in designer productivity due to reduced iteration times.

The idea of using a high-level software language to design hardware is not new. But the challenge has been to automatically produce good-quality hardware results while retaining the ability to leverage the power of the high-level software constructs. Approaches that require low-level elaboration may produce good hardware but fail to leverage high-level design capabilities; essentially, they create a new syntax for current register-transfer-level (RTL) methodologies.

Our approach has been to define hardware semantics for all common software constructs and produce excellent hardware implementations through an extensive compilation, analysis and optimization process. We have implemented this compilation approach in Forge, which accepts input in languages such as C, C++ or Java.

To test our approach with a real-world design, we needed a design that met the following criteria: complexity over 20,000 gates, an available hand-coded implementation and an available high-level description. After some discussions, the microelectronics group at Sun Microsystems Inc. (Mountain View, Calif.) suggested that the FPU for the picoJava core would be a good candidate. The design was about 40,000 gates, the hand-coded RTL Verilog was readily available through Sun's Community Source License (CSL), and Sun could provide us with a C functional description.
Our challenge was to take the high-level description, compile it into RTL Verilog using Forge and synthesize it to gates. We could then compare our automatically compiled results to the hand-coded CSL version. For all benchmarking activities described in this article, we used Forge HC beta 1.19.2.7, which is available for download from our Web site.

During the development of the picoJava FPU, Sun had internally coded a C software model and test vectors to represent and test the desired functionality. This model used case statements to represent muxes, built a multiplier through the use of shifts and adds, had a microsequencer to control its operation, and implemented the instruction ROMs using arrays. Sun shared this model with us, and it became our starting point.

Sun engineers had translated the C model into Verilog by hand. This CSL Verilog became our baseline, and its synthesis results gave us the standard to compare against. The goal was to come close to the CSL results in terms of speed and area while dramatically improving designer productivity by working at the level of software design instead of RTL design.
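To illustrate what a multiplier built from shifts and adds can look like in software, here is a minimal Java sketch. It is not Sun's model: the class and method names are hypothetical, and the 32-bit unsigned operand width is an assumption chosen for illustration.

```java
// Hypothetical illustration of a shift-and-add multiplier; not taken from the picoJava model.
final class ShiftAddMultiply {

    /** Multiplies two 32-bit values, treated as unsigned, using only shifts and adds. */
    static long multiply(int a, int b) {
        long multiplicand = a & 0xFFFFFFFFL; // zero-extend, because Java ints are signed
        long multiplier = b & 0xFFFFFFFFL;
        long product = 0;
        while (multiplier != 0) {
            if ((multiplier & 1L) != 0) {
                product += multiplicand;     // add one partial product per set multiplier bit
            }
            multiplicand <<= 1;              // weight of the next multiplier bit
            multiplier >>>= 1;               // logical shift to examine the next bit
        }
        return product;                      // 64-bit result; the bit pattern is the unsigned product
    }

    public static void main(String[] args) {
        System.out.println(multiply(6, 7));                                     // prints 42
        System.out.println(Long.toHexString(multiply(0xFFFFFFFF, 0xFFFFFFFF))); // prints fffffffe00000001
    }
}
```

Each loop iteration handles one multiplier bit, so the loop body amounts to one conditional add and two shifts.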
Our first step was to port the C model to Java, since Java bytecodes are the input to Forge. To do this, we had to remove the pointers in the code, move the global variables to object field variables, change the C integer conditional tests to "compares against zero" and modify the unsigned integer C variables to work in Java, whose int and long types are signed. We took a "modify as little as possible" approach to this exercise.

When we had ported the code, we verified the Java model against the C model. The original model was coded to compare against the UltraSparc FPU results during the execution of the model. While we could have done the same, it seemed cleaner to compare output result vectors for each model, so we added output vector generation to both the C and Java models. We then ran the vector set on the C and Java models on an UltraSparc workstation and compared the vector results. They matched. We also ran the vector set on the CSL Verilog using the Verilog-XL simulator to make sure that it matched the results of the C and Java models.

Not surprisingly, functional verification was much faster with the C and Java models than with the CSL Verilog. The C and Java programs took about 3.5 minutes, while the Verilog simulation took over 32 hours.

At this point, we needed to add bus sizing information and a cycle-accurate interface specification to the Java model. We added the bus sizing information by simply masking variables to the appropriate size. The C and Java models represented the FPU as a function (method) call. Because we masked the input variables, Forge automatically propagated the sizing information throughout the entire design. Alternatively, the output variables could have been masked and the sizing information would have been back-propagated through the design. The cycle-accurate interface was specified with a limited class library that describes the interface signals and the timing relationships between these signals.
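The porting changes and the masking approach described above can be sketched in Java as follows. This is an illustration only, not code from the picoJava model: the class, field and method names are hypothetical, and the field widths are those of the IEEE 754 single-precision format, used here as an assumed example of sizing by masking.

```java
// Hypothetical porting sketch; none of these names come from the picoJava FPU model.
final class PortingSketch {

    // A C global variable becomes an object field.
    private int status;

    // C: "if (flags)" becomes an explicit compare against zero.
    static boolean isSet(int flags) {
        return flags != 0;
    }

    // C: unsigned "a < b". Flipping the sign bit maps unsigned order onto signed order.
    static boolean unsignedLessThan(int a, int b) {
        return (a ^ Integer.MIN_VALUE) < (b ^ Integer.MIN_VALUE);
    }

    // Masking states the intended width: 1-bit sign, 8-bit exponent, 23-bit mantissa.
    static int sign(int bits)     { return (bits >>> 31) & 0x1; }
    static int exponent(int bits) { return (bits >>> 23) & 0xFF; }
    static int mantissa(int bits) { return bits & 0x007FFFFF; }

    void recordStatus(int flags) {
        status = isSet(flags) ? 1 : 0;
    }

    int status() {
        return status;
    }

    public static void main(String[] args) {
        int bits = Float.floatToIntBits(-1.5f);        // bit pattern 0xBFC00000
        System.out.println(sign(bits));                // prints 1
        System.out.println(exponent(bits));            // prints 127
        System.out.println(Integer.toHexString(mantissa(bits))); // prints 400000
        System.out.println(unsignedLessThan(1, -1));   // prints true: -1 is 0xFFFFFFFF unsigned
    }
}
```

Note the use of `>>>` instead of `>>`: a logical shift is what a C right shift on an unsigned value does, while `>>` would sign-extend.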
Although the entire design could be described using this class library, by its very nature it works at a low level of design abstraction, so using it alone would defeat the point of high-level design.

Going to hardware

Now we were ready to compile the Java program into hardware. We used a standard Java compiler to compile the Java source into a class file of Java bytecodes, which are the machine code for the Java virtual machine. We used Sun's JDK 1.2 for this step. Forge was then used to compile the class file into RTL Verilog.

The Java program for the picoJava FPU was about 3,500 lines of code. It produced approximately 42,000 lines of Verilog. For comparison, the hand-coded CSL Verilog was about 27,000 lines. We then verified that Forge's automatically generated Verilog was correct by running the testbench for the CSL Verilog on the Forge Verilog. The vectors and the clock cycle timing matched.

Of paramount importance, both versions of the design were taken through the same synthesis process to achieve a gate-level netlist representation. The synthesis process used the Design Compiler from Synopsys Inc. (Mountain View, Calif.) to target both versions of the design to the 0.18-micron standard-cell library of TSMC (Taiwan) from Artisan Components Inc. (Sunnyvale, Calif.). In both cases, the same command script file was used to ensure that the results could be accurately compared.

Top-down method

The synthesis strategy employed a top-down (root first) approach, in which the design files were read in and elaborated, the target frequency was defined, the target area was set to zero, and compilation was started. The ROMs used in the design were treated as "black box" elements in the synthesis process, and as such added no area or timing information to the results.
Although the top-down approach to synthesizing large designs is not the most efficient, in this case it allowed for a simplified testing environment and a clearer implementation path that could be used, unchanged, for both designs.

The characterization of both designs involved synthesizing the designs to a wide range of target frequencies. Some of these frequencies were achievable, while others were beyond the capabilities of the design and technology. The area and frequency numbers were obtained through Design Compiler by using the "report_area" and "report_timing" features.

Some of the synthesis performance metrics remained relatively constant across all the target frequency runs. Specifically, the Forge-generated HDL synthesis runs required 408 Mbytes of memory, while the hand-coded HDL used more than twice as much memory, at 941 Mbytes. The time required to complete the synthesis process was 6.5 hours for the Forge HDL and 8.75 hours for the CSL HDL using a Sun Enterprise 450 with four 300-MHz processors and over 3 Gbytes of memory.

Getting results

Two data points of particular interest were the target frequencies of 600 MHz and 300 MHz. The first data point represents an unachievable target frequency for both versions of the design. The Forge-produced design achieved 351 MHz in 439k area. By comparison, the hand-coded HDL achieved 386 MHz in 490k area. The second data point, 300 MHz, was achievable for both designs. The area for Forge was 373k, while the area for the CSL Verilog was 317k. The results of other target frequencies are shown in the charts.

Regardless of the target frequency of the design, Forge precisely met the design's functional requirements and clock-cycle performance. In the case where the design was primarily constrained by the target frequency (600 MHz), the Forge-generated HDL was 9.13 percent slower, but resulted in a design that was 10.48 percent smaller than the hand-coded design.
On the other hand, the 300-MHz target was achievable with both the Forge HDL and the CSL HDL, but Forge incurred a 17.84 percent area overhead.

Area, frequency

The remaining data points, ranging from 150 MHz to 320 MHz, show the relationship between the two versions of the design and the area and frequency results. The maximum achieved frequency of either version was 390 MHz. From a target frequency of 150 MHz up to 300 MHz, both designs were easily able to achieve the desired frequency, while above 300 MHz the achievable frequency begins to level out.

The area metrics show that at the lower target frequencies the area overhead introduced by Forge is less than 10 percent. As the data points move into the high-performance range, Design Compiler adds more gates to the Forge HDL to achieve the maximum frequency. However, the number of gates added to the Forge HDL design quickly levels off, while the number of gates added to the hand-coded HDL continues to rise rapidly. This causes the highest-performance Forge HDL design to have a smaller area than the highest-performance hand-coded HDL design.

It is important to note that none of the analysis or optimizations used in Forge were developed specifically for this design, or even this type of design. Indeed, we expected data-flow intensive algorithms, not the complex control path instantiated by the FPU. So while Forge produced a design functionally identical to the hand-coded design, the two designs were not structurally identical. The more centralized state machines used in the hand-coded CSL Verilog were distributed throughout the logic in the automatically generated Forge HDL. Having the control flow follow the data through the system reduced the decode logic that a more complex state machine would require. Additionally, high-performance data structures such as high-speed multiplexers were used throughout the hand-coded CSL Verilog to improve timing.
These optimizations were not available in Forge at the time of this exercise, but may be included in future releases.

An analysis of the critical path for both the Forge and CSL implementations reveals that for Forge, the critical path is through one of these distributed control flows. Again, since this control flow is something that Forge generates, it is something that Forge could, through appropriate optimizations, remove. The next longest critical path is through the main multiplier in the FPU and matches the critical path for the CSL implementation.

Don Davis is Director of Engineering, Jonathan Harris is Principal Engineer, and Ian Miller is Senior Engineer at LavaLogic, a business unit of TSI Telsys Inc., Columbia, MD.
All materials on this site Copyright © 2009 TechInsights, a Division of United Business Media LLC. All rights reserved.