System Design
Building Up Chips Using VHDL and Synthesis

One key to using VHDL to build chips is to remember what the lines of code mean in terms of hardware.

by Doug Warmke

Designers just starting with VHDL are often worried about using the language effectively. They are afraid of writing unsynthesizable code, or code that will generate too many gates, or a design that is less efficient than they could generate by hand.

Five hardware designers at IKOS Systems Inc. (Cupertino, CA) recently completed the design of the company's next-generation hardware accelerator, called Nsim. A pure VHDL methodology was used to build four complex ASICs in less than 18 months. None of the designers had VHDL or synthesis experience when starting the project. In the end, the project was a great success. There were no chip turns, and the first beta unit shipped less than 2 months after the last chip was back from the foundry.

The Nsim project taught us how to write VHDL code effectively for high-performance ASICs. This article will outline different strategies and offer design tips gained from our experience. It assumes the reader has some familiarity with VHDL syntax and characteristics. After reading this article, you should feel more comfortable writing VHDL code that will synthesize well enough to meet and even exceed your design goals.

Design and methodology overview

The Nsim project is a new design based on an existing family of hardware simulators. The project consists of designing four large ASICs (one 100, one 70, one 50, and one 40 kgate array) and integrating them into a complete system. All ASICs are packaged in 476-pin ceramic PGAs. The process is American Microsystems Inc.'s (AMI, Pocatello, ID) 1µm CMOS with 3-layer metal (TLM). There are numerous embedded RAMs in 2 of the chips. The operating frequency is 20 MHz. A 50 ns clock period is needed to accomplish SRAM accesses every cycle without the complication of wait-states.
The design is all-synchronous, and with the exception of the I/O pad rings, all of the chips are implemented 100 percent in RTL source (so there is no hand-drawn circuitry). ASIC test is done with a full-scan methodology. AMI was contracted to provide the test vectors and to build the clock trees (grid implementation).

The project design philosophy is pragmatic and conservative. There is only one clock phase. There are no tri-state drivers or asynchronous logic anywhere in the chip cores. This conservative design philosophy was shaped by time-to-market pressure, the complexity of the project, and the fact that the designers were first-time VHDL/synthesis users. Note, however, that conservative design rules don't exclude designing for the highest possible system performance.

The following is a simplified 10-step overview of the design methodology used:
2. Design top-down using block-capture tools (ECS).
3. Fill out and simulate the "leaf" VHDL entities using RTL code.
4. Start validating larger and larger chunks, bottom-up style.
5. Synthesize in parallel with validation (Synopsys 2.2a).
6. Use synthesizer's timing and gate count reports to judge synthesis results.
7. If needed, re-code VHDL; work on synthesis constraint files, go back to step 3 or 5.
8. Simulate chips and system (Voyager 1.30a).
9. Run static timing analyzer on post-layout netlist (Motive 3.5).
10. Run system simulation regressions on post-layout netlist (Voyager 1.30a).

The system has a parallel processing architecture in the form of a cube (see Figure 1). The basis of the system is the evaluator unit, which is the fundamental event processor. An event processor is fairly similar in complexity to a pipelined microprocessor. The user's network under test supplies the processor with its "instructions." The network is stored in RAM. Its "instructions" consist of primitive opcodes, delays, etc. for each gate in the simulation. As simulation progresses, the event processor updates each primitive's state as necessary. The primitive pin value memories may be thought of as "data" for the processor. During the simulation, the user may trace the state of any primitive's pins to determine if the network is functioning as desired.

The simulator architecture also includes an I/O unit acting as a two-way interface with the host workstation. It is used to source stimulus into the gate-level simulation, and it collects the gate responses and sends them to host processing software.

In such a processor architecture, there is heavy pipelining and heavy memory access. The ASIC design therefore consists mostly of control and datapath elements (i.e., state machines, muxes, adders, etc.). There is some numerical calculation in one of the chips, but it is not as complex as that of a graphics or DSP processor.
Let us consider the cube architecture. Each of the cube's vertices is called a cluster. Clusters contain 8 processors and a message router. The processors may be either event processors (evaluators) or I/O processors. There may be up to 256 clusters per cube. The initial product has 8 clusters and allows up to 8 simultaneous users. There are 64 processors in the cube shown in Figure 1.

All processors communicate with each other via "messages" that flow between processors on 45-bit wide busses. There are two bus specifications: CLINK and EBUS. ASICs are partitioned across the design. The 8 evaluators on each cluster are connected to each other via the EBUS. Each cluster also has a router chip, which routes EBUS messages in and out of the cluster in coordination with other neighbor clusters. The routers communicate with each other via bidirectional CLINK busses.

Partitioning an ASIC for synthesis

A top-down chip design begins with constructing a block diagram or model of the ASIC. Chip modularity is usually based on functionality, e.g., a bus controller and all its associated logic exist in one module. For success, there are two things a designer must keep in mind when blocking out a design for synthesis. The leaf modules should be reasonably sized, and effort should be made to limit the number of separate entities a timing path can traverse.

By limiting module size, a designer gains improvements in both the synthesizer runtimes and the quality of the synthesis results. The average entity size in the project was around 2 kgates. The largest entity was around 12 kgates. In general there was a broad range of sizes, including 140 gates, 2 kgates, 6 kgates, and 10 kgates. We needed a Sun Microsystems SPARC 690 server with around 500 Mbytes of swap space and 256 Mbytes of core memory to synthesize the largest blocks. We started out partitioning the chips on a purely functional basis.
If we hadn't had such powerful computing resources available, we would have been forced to pay more attention to up-front chip partitioning.

By limiting the number of separate entities that a timing path can traverse, a designer can improve synthesis results in both timing and area. The synthesizer does much better on logic that is contained within one synthesis run. If a certain timing path travels through four separate modules, the synthesizer will make its best effort within each of the modules, but the overall results won't be as good as the case where the whole path is in a single module. Often, bad timing paths that span multiple modules aren't obvious at the start of a design. Later on, after most of the design is complete, the designer will have to revisit the design's hierarchy structure, modifying the module connectivity and functionality in order to work down the long timing paths.

There is a trade-off between these two factors. The synthesizer always does better on smaller-sized entities. However, it also does better when all of a path's logic is contained within one entity. In the case of three of our chips, the final partitioning was affected by these synthesis concerns.
Thinking in hardware

It turns out that writing effective VHDL RTL is easy. The key is to realize that the code describes hardware, not software. The main idea of "thinking in hardware" is to always have a good idea of what the lines of code really mean in terms of hardware. You must know when you are laying down an adder. You must be aware of the cost of a magnitude comparator. You must be conscious of the relative timing of different signals with respect to the clock edge. After writing a sequence of RTL, you should be able to sketch out a quick block diagram of what the code physically represents.

The fundamental unit of an RTL description is the VHDL "process." When thinking in hardware, there are two kinds of processes: a combinational cloud process and a register-instantiating process. As shown in Figure 2, the combinational processes have sensitivity lists that include all signals used in right-hand-side operations, while register-instantiating processes only have the clock and/or reset signal in their sensitivity lists. Note that the register processes may include some combinational logic, but all effects of that logic must end up in the D-input of a register instantiated in the process. The inputs should not feed non-registered output logic. If they did, the output logic wouldn't simulate correctly, since its inputs are not in the process's sensitivity list.

Combinational processes make up the bulk of the gates of the design, and all complexity is described in them. One must structure combinational processes intelligently to avoid gate explosion and poor timing paths. Processes in a given design should all have the same form. Use variables to structure datapath elements such as adders and comparators. Always group such resources together at the top of the process and assign their outputs to intermediate variables. Later, in the meat of the process, these variables are used wherever the result of the datapath resource is required.
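The two process styles described above can be sketched as follows. This is a minimal illustration, not code from the Nsim source: the entity name, port names, and 16-bit widths are hypothetical, the asynchronous reset is an assumption, and the IEEE numeric_std package is assumed for the arithmetic (a 1990s Synopsys flow may well have used a vendor arithmetic package instead).

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: all names and widths are illustrative only.
entity addr_step is
  port (
    clk      : in  std_logic;
    reset    : in  std_logic;
    base     : in  unsigned(15 downto 0);
    offset   : in  unsigned(15 downto 0);
    max_addr : in  unsigned(15 downto 0);
    enable   : in  std_logic;
    addr_q   : out unsigned(15 downto 0)
  );
end entity addr_step;

architecture rtl of addr_step is
  signal addr_d : unsigned(15 downto 0);  -- feeds the register's D-input
  signal addr_r : unsigned(15 downto 0);  -- register output
begin

  -- Combinational cloud: every signal read on a right-hand side
  -- appears in the sensitivity list.
  comb : process (base, offset, max_addr, enable, addr_r)
    variable sum    : unsigned(15 downto 0);
    variable at_top : boolean;
  begin
    -- Datapath resources grouped at the top of the process:
    -- exactly one adder and one magnitude comparator.
    sum    := base + offset;
    at_top := sum >= max_addr;

    -- The control logic below only reuses the variables above,
    -- so no further adders or comparators are implied.
    if enable = '0' then
      addr_d <= addr_r;
    elsif at_top then
      addr_d <= max_addr;
    else
      addr_d <= sum;
    end if;
  end process comb;

  -- Register-instantiating process: only clock and reset
  -- are in the sensitivity list.
  reg : process (clk, reset)
  begin
    if reset = '1' then
      addr_r <= (others => '0');
    elsif rising_edge(clk) then
      addr_r <= addr_d;
    end if;
  end process reg;

  addr_q <= addr_r;

end architecture rtl;
```

Reading the top of the comb process gives the datapath cost at a glance (one 16-bit adder, one 16-bit comparator), and the block diagram follows directly: an adder and comparator feeding a small mux, feeding a register.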
Grouping datapath resources this way keeps resource use to a minimum, and it gives the designer a quick reference to all gate-hogging components inside the process.

Timing: still everything

Timing is still supremely important. Perhaps that is the most important part of "thinking in hardware." Writing RTL that does not take a signal's timing into consideration is not useful. Signals may stabilize at any point in the clock cycle. Register-driven signals always have the earliest stability time, and different signals may have different maturity times depending on how much combinational logic precedes them in the timing path.

Writing an RTL process that uses four 32-bit adders in series is usually impractical. The designer must recognize that a clock stage is needed to break up the timing path. A register should be called out to hold an intermediate term, and the final addition(s) should be done in the next clock cycle. The concept of relative signal timing is nothing new in hardware design, but it is easy to have the perception that VHDL will somehow magically take care of the issue.
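As a hedged sketch of breaking up such a path, the entity below sums five 32-bit operands (four additions) across two clock stages instead of one. The entity name, port names, and the choice to split after the second adder are assumptions made for illustration, and numeric_std is assumed for the arithmetic. Whether two adders in series actually fit in one clock period depends on the target library and must be checked in the synthesizer's timing report.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: names and the stage split are illustrative only.
entity sum5_pipe is
  port (
    clk           : in  std_logic;
    a, b, c, d, e : in  unsigned(31 downto 0);
    total         : out unsigned(31 downto 0)
  );
end entity sum5_pipe;

architecture rtl of sum5_pipe is
  signal partial_r : unsigned(31 downto 0);  -- intermediate term
  signal d_r, e_r  : unsigned(31 downto 0);  -- operands delayed to stay aligned
  signal total_r   : unsigned(31 downto 0);
begin

  pipe : process (clk)
  begin
    if rising_edge(clk) then
      -- Stage 1: two adders in series; the result is held in a register.
      partial_r <= a + b + c;
      d_r       <= d;
      e_r       <= e;

      -- Stage 2: the remaining two adders operate on the values
      -- registered in the previous cycle.
      total_r   <= partial_r + d_r + e_r;
    end if;
  end process pipe;

  total <= total_r;

end architecture rtl;
```

Because signal assignments in a clocked process read the values from before the clock edge, total reflects the inputs sampled two clock edges earlier. The cost of the shorter timing path is one extra cycle of latency and three extra 32-bit registers.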
VHDL takes care of gate-level complexity. It does not take care of gate-level timing. It is easy to forget this simple fact. However, after the first few synthesizer runs that result in a minimum clock period 180 percent longer than the target, relative signal timing becomes very clear.

In the design, it was especially easy for us to make mistakes with signals arriving late from other VHDL entities. Late-arriving signals should be run into a minimum of logic before a register stage. Miscalculations of this kind are usually only evident after the first full-chip timing report is made.

Building the actual chips

Each chip consists of several dozen entities and at least 20,000 lines of RTL code, including comments. To synthesize an entire chip, a certain methodology must be decided upon and used consistently. Synopsys provides two fundamental means to achieve complete-chip synthesis: iterative entity compile and hierarchical compile. The project used the iterative entity compile approach. The idea of this approach is as follows:
2. "Link" the design hierarchy. This is not "synthesis"; rather, it is a means for the synthesizer to gather the chip's high level structural and timing information into one database. A complete-chip timing report is available here.
3. "Characterize" each entity. This synthesizer process calculates all I/O information pertinent to the entity being characterized (output signal loads, input signal drive strengths and arrival times, and delay constraints).
4. Create a synthesizer script file that contains all the characterize data.
5. Go back to step 1 and use the characterized data as input to the synthesizer.

The main work involved in complete-chip synthesis is to beat down long timing paths. This process takes a lot of time and CPU cycles, especially for big chips. A designer beats down timing using a combination of two methods. The first is to provide the synthesizer with an optimum set of script files containing constraints and accurate environmental data. The second method is to restructure the hardware with the VHDL code.

At the end of the iteration process, there will be one script file for each entity in the design. These script files are generated by using the synthesizer's "characterize" and "write_script" commands. They do not have any constraints or design rules in them. They only contain environmental data such as arrival times, loading factors, etc. The accuracy of this data is important for the synthesis process.

To start generating correct environmental data, the designer must supply arrival times and loading factors at the chip core's boundary. This information is entered into a special top-level script file. It is a good idea to add pessimistic loading factors to the chip core's ports, since the chip layout may include big delays in the paths from the chip core to the pad ring. The user-provided boundary factors work their way into the chip core with each iterative chip build.
Eventually the various arrival times and output delays in the chip core reach a stable and realistic point, but it does take several iterations.

In addition to the automatically generated environmental script file, each entity needs its own set of constraints. The synthesizer needs a little help in figuring out where best to spend its effort. Only the designer can provide such knowledge. Some of the constraints we found most effective were "max_transition" and "set_output_delay." (These commands have been renamed in newer versions of the software.) The entity-specific constraints get placed in another special script file. Once these files are generated, they don't need to be adjusted too often.
The next step in building a chip is to set up a build process. We use Unix "make" to build chips. Makefiles clearly spell out hierarchical dependencies, and they give a concise summary of how each chip is built from its source. We write all our makefiles by hand, adding comments to explain what is going on. There is one makefile per chip.

Once the chip is building and all the synthesizer script files are in place, a designer needs to improve timing by modifying the VHDL source. There are two parts to modifying the VHDL. The first is to concentrate on improving delay within each entity. The second is to redistribute logic among the entities in the hierarchy. Deciding what changes to make in the code is a design-specific challenge. Here is a partial summary of things to try:
2. Move logic into an earlier pipe stage.
3. Duplicate logic in different entities of the chip to avoid long paths.
4. Add extra pipe stage(s).
5. Move logic into a later pipe stage.
6. Reduce entity sizes by adding sub-hierarchy or new entities.
7. Re-implement or re-design algorithms.

Refer to the sidebar for concrete examples of how these ideas were applied to the Nsim chips.

Notice the methodology implications of this chip building process. Engineers design in a top-down fashion, but building the chip is a bottom-up process. The necessities of making timing at the bottom levels of the design reflect back up into the top levels of the hierarchy. In some cases the top levels are profoundly affected. The result is that the final design may not be as readable or as simply structured as a true top-down design would be. This is not always the case. For example, the IOCHIP needed a lot of work to get its timing in shape, but the original design hierarchy was scarcely affected.
The design group dealt with this issue via the extensive use of regression simulations. After the first cut at a chip, when the design is structured in the simplest fashion, we establish a known-working regression suite. Subsequent design iterations will gradually move the design away from its simple and intuitive structure. The design is periodically run through the regression simulations to ensure that the restructuring does not break the functionality.

Don't forget the hardware

The abstract nature of VHDL and the power of synthesis allowed the design team to implement algorithms in hardware that were previously considered impossible. The key to using VHDL to build chips is to remember that it describes real hardware and to establish a methodology based on this fact. VHDL allows designers to think at a more algorithmic level, but it does not allow them to forget they are designing hardware.

In the future, synthesis technology will certainly make many advances. Engineers will build chips from high-level descriptions without being required to significantly manipulate source code, or perhaps even understand the hardware they are creating. Currently, "donkeying" around with the source code is a key part of chip-building methodology. However, this is a small price to pay for the gains that VHDL and synthesis provide.

Doug Warmke is the manager of hardware development and integration at IKOS Systems Inc. (Cupertino, CA), where he has been for 7 years. He was responsible for the design of the Nsim simulation engine.
Copyright © 1996 - Integrated System Design Magazine
All materials on this site Copyright © 2009 TechInsights, a Division of United Business Media LLC. All rights reserved.