Design Article
Chip synthesis: A new approach to RTL implementation
Paul van Besouw, president and CEO, Oasys Design Systems
2/16/2010 5:23 AM EST
Traditional synthesis is coming apart at the seams, especially for designs larger than
Synthesis: a little bit of history
From early days, all synthesis tools have been built basically the same way: turn the RTL code into gates using fairly naïve algorithms, and then optimize the gates to meet the constraints. It's as if C language compilers all worked by turning the C straight into machine instructions, and then optimizing the machine instructions. In principle, with enough runtime and enough clever optimization techniques, working at the machine instruction level might discover a higher-level optimization such as pulling a constant sub-expression out of a loop. However, it is much better just to do higher-level optimizations at the higher level in the first place. Modern C compilers are indeed built this way, with global optimizers that look at a high-level representation of the program, and a straightforward peephole optimizer cleaning up final details at the machine instruction level at the very end.
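To make the compiler analogy concrete, here is a minimal Python sketch of the high-level optimization mentioned above: pulling a loop-invariant sub-expression out of a loop. The function names and values are illustrative assumptions, not taken from any particular compiler.

```python
def scale_naive(values, a, b):
    # The sub-expression a * b does not depend on the loop variable,
    # yet it is recomputed on every iteration.
    out = []
    for v in values:
        out.append(v * (a * b))
    return out


def scale_hoisted(values, a, b):
    # An optimizer that sees the loop structure can compute the
    # invariant once, before the loop starts.
    k = a * b
    out = []
    for v in values:
        out.append(v * k)
    return out
```

Both functions return the same result. The second form is obvious when the loop is visible as a loop, and much harder to rediscover once the program has been flattened into individual machine instructions.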
The first logic synthesis tools in the late 1980s simply optimized gate-level netlists derived from schematics. RTL synthesis was added on top of that foundation of logic optimization. The RTL code was read in and reduced to a control/dataflow data-structure, which was then turned into gates. Finally, the gate-level optimizer would grind away until the design met its timing constraints. Since the impact of wires on timing was almost entirely capacitive (resistance was not yet an issue; timing analyzers didn't even take it into account), simple wire-load models were used and the gates were not physically placed until the next step, place and route.
When physical information became more important, placement was merged into the gate-level optimization step so that instead of using wire load models, an estimated route including resistance could be calculated. For the last twenty years, synthesis has been built around a core of gate-level optimization.
There are two big disadvantages of this approach. Firstly, gate-level optimization is a low-level optimization, and secondly gate-level optimization requires an enormous amount of data to be simultaneously accessible in memory. This means that run times are too long and capacity is too low.
As a result, with traditional synthesis, designs need to be split up into smaller blocks to work around tool capacity limitations. And it keeps getting worse: in 1990, traditional synthesis capacity was about 10K gates and a chip was about 100K gates, meaning the design would need to be split into 10 blocks. In 2009, traditional synthesis capacity is up to about 500K gates but chips are 100M gates, meaning 200 blocks. This makes for a horrible problem of time-budgeting to control the synthesis. Then place and route has to take those 200 blocks, assemble them together and meet the overall global timing constraints. This cycle simply does not close without an unacceptably large number of iterations that can take months. A new approach is required.
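The block counts above follow directly from dividing chip size by tool capacity. The short Python sketch below reproduces that arithmetic. The pair count it also prints is an illustrative assumption, not a figure from this article: it is simply an upper bound on the number of block pairs that could share a timing path and so need a budget between them.

```python
def blocks_needed(chip_gates, tool_capacity_gates):
    # Ceiling division: every block must fit within the tool's capacity.
    return -(-chip_gates // tool_capacity_gates)


def block_pairs(n_blocks):
    # Upper bound on distinct pairs of blocks (illustrative assumption).
    return n_blocks * (n_blocks - 1) // 2


blocks_1990 = blocks_needed(100_000, 10_000)        # 100K-gate chip, 10K capacity
blocks_2009 = blocks_needed(100_000_000, 500_000)   # 100M-gate chip, 500K capacity

print(blocks_1990, block_pairs(blocks_1990))
print(blocks_2009, block_pairs(blocks_2009))
```

Capacity grew 50 times over the period but chip size grew 1,000 times, so the number of blocks grew 20 times.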
The chip synthesis solution
Chip synthesis works very differently. Once the RTL code has been parsed, it is partitioned, based on connectivity, into smaller partitions that will eventually be reduced to gates. Each partition is small enough that it won't contain any long wires, which would lead to high variability in timing, but large enough to have implementations with potentially different area-time tradeoffs. Each partition is largely independent of the others. Of course, the timing numbers from all the other partitions are required to be able to time the whole chip, but the detailed internals of every partition are not required simultaneously. Because it is no longer necessary to look at the whole chip at the gate-level at the same time, the memory requirements are hugely reduced.
This RTL partitioning approach is the main reason that chip synthesis can be so fast and so effective. By operating at a higher level, it intelligently synthesizes and times the design one partition at a time. Then it re-synthesizes, re-places (updating the global routes) and perhaps re-partitions parts of the design until the timing constraints are met.
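The iterate-until-timing-is-met loop can be sketched in a few lines of Python. This is a deliberately simplified model under stated assumptions, not the tool's implementation: the partitions are assumed to lie in series on a single timing path, each partition offers a hypothetical list of `(area, delay)` implementations ordered from smallest and slowest to largest and fastest, and only re-synthesis is modeled (re-placement and re-partitioning are left out).

```python
def close_timing(options, clock_period):
    """Pick one implementation per partition so the path meets timing.

    options: {partition: [(area, delay), ...]}, slowest implementation first.
    Returns (chosen_index_per_partition, path_delay), or None if even the
    fastest implementations cannot meet the clock period.
    """
    choice = {p: 0 for p in options}

    def path_delay():
        # Only each partition's summary delay is needed, not its gates.
        return sum(options[p][choice[p]][1] for p in options)

    while path_delay() > clock_period:
        best, best_gain = None, 0
        for p, impls in options.items():
            i = choice[p]
            if i + 1 < len(impls):
                gain = impls[i][1] - impls[i + 1][1]  # delay saved by upgrading
                if gain > best_gain:
                    best, best_gain = p, gain
        if best is None:
            return None  # nothing left to re-synthesize
        choice[best] += 1  # "re-synthesize" that partition to a faster version

    return choice, path_delay()


# Hypothetical partitions with (area, delay in ps) tradeoffs.
options = {
    "fetch": [(100, 400), (130, 320)],
    "decode": [(80, 300), (90, 280)],
    "execute": [(200, 600), (260, 450), (340, 380)],
}
result = close_timing(options, clock_period=1100)
print(result)
```

The point of the sketch is that the loop only ever touches one partition at a time and reads a single timing number from each of the others, which is why the whole chip never has to be held in memory at the gate level.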
Traditional synthesis does not take placement into account during the RTL synthesis step; it is only considered later when gate-level optimization is being done. Chip synthesis pulls placement forward into RTL synthesis and so enables high-level optimization equivalent to the more powerful transformations that modern software compilers can make to programs. Chip synthesis partitions the RTL code into placeable pieces, and then refines those down into actual library cells so that there is always a complete placement to go along with the timing values.
Working at a higher level produces orders of magnitude better performance: an ordinary 32-bit PC can synthesize designs of tens of millions of gates in an hour or two. This compares to 64-bit workstations requiring literally weeks of run time to achieve, or often fail to achieve, an acceptable result. Moreover, the technology's extremely efficient memory utilization allows enormous designs to be handled with just a modest memory footprint.
Here is a specific example of a TSMC 65nm 700K instance design: with traditional synthesis it took over 14 hours to synthesize, followed by two weeks per iteration of over-the-wall physical design, and required 10 iterations for a total of 20 weeks. And when all was done, the worst negative slack was a huge -300ps.
That same design synthesized with chip synthesis took just 20 minutes to complete, and closed physical design in a single iteration with just -7ps of negative slack.
Chip synthesis: under the hood
Chip synthesis reads in the entire design, along with the floorplan. If there is no floorplan, which is often the case early in the design process, then one is generated automatically. But by the time production RTL code is being synthesized to production gates, a good floorplan is a necessity. High-level modules in the input RTL code are assigned to regions if constrained by the floorplan, and then the whole RTL code is partitioned, already using this coarse placement information in timing and congestion analysis. In a modern process, any timing value without associated physical information is little more than a guess, so it is very important to have a physical location for every element before calculating any timing.
Hard macros (larger blocks that will not be implemented using standard cells) are also provisionally placed so that they can be considered from both a physical obstruction and a timing point of view. If there is physical hierarchy in the design, then this is honored. RTL partitions are correctly assigned to be within the physical boundaries of the appropriate partitions so that synthesis matches up with place and route.
Unlike other synthesis approaches, a fully detailed netlist of each RTL partition is available at all times, and is used to accurately time the design.
Next, to improve the design until it meets the design constraints, the original RTL partitions are re-synthesized given their current physical and timing constraints. On top of that, the RTL partitions themselves are merged, repartitioned and re-placed in order to meet timing constraints and reduce congestion. In a final refinement step, all gates undergo a legal placement. Since the final placement already avoids excessive routing congestion, it should not subsequently be badly perturbed by the place and route tool.
The chip synthesis design and its placement are fed to the team's choice of place and route tool. The placement is so self-consistent that the place and route tool runs faster than normal, another useful productivity gain. The timing resulting from the final detailed routing with all the parasitics included will be very close to the predictions from the front end.
Being able to synthesize entire chips in a matter of a few hours, as opposed to taking days to synthesize the chip in separate blocks, is more than just a numeric increase in productivity: it lets the designer focus on the design and not the limitations of the design tool. The time saved can either be used to pull in the schedule or to explore the design space more extensively.
High-level synthesis
For designs using high-level synthesis (HLS) from C or SystemC, chip synthesis is the missing link in the methodology. HLS tools all have an extremely coarse view of implementation tradeoffs because, by definition, they operate at a high level. Using traditional synthesis for design space exploration entails implementing candidate architectures in order to validate their relative efficacy. This is just too slow to use at this early point in the design cycle when fast iteration is important. Chip synthesis yields the speed and accuracy that make the application of HLS methodologies practical for a much wider range of design types.
Results
Chip synthesis is at least 20 times and sometimes 100 times faster than mainstream synthesis tools, and produces equal or better quality of results (area and timing), along with a placement that runs more smoothly through industry standard place and route tools. It has the capacity to handle full-chip designs of up to 100 million gates.
The following table gives representative detailed results for two designs. For timing, a perfect result is zero slack, meaning the design precisely meets the timing constraints; the more negative the slack, the further the design is from meeting timing and so the worse the result. Both designs synthesized over 40 times faster than with conventional synthesis, with better or equivalent results.
[Table: representative synthesis results for the two designs; image not reproduced]
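The slack convention used for the timing results can be stated in a few lines of Python. This is a minimal sketch using the standard definition (slack is required time minus arrival time); the endpoint values are made up for illustration and are not the results from the table.

```python
def slack(required_ps, arrival_ps):
    # Zero means the path exactly meets timing; negative means it is late.
    return required_ps - arrival_ps


def worst_negative_slack(endpoints):
    """endpoints: iterable of (required_ps, arrival_ps) pairs.

    Returns the most negative slack, or 0 if every endpoint meets timing.
    """
    return min(0, min((slack(r, a) for r, a in endpoints), default=0))


# Hypothetical endpoints, all with a 2000 ps requirement.
endpoints = [(2000, 1900), (2000, 2040), (2000, 2000)]
wns = worst_negative_slack(endpoints)
print(wns)
```

Here one endpoint has 100 ps to spare, one is exactly on time and one arrives 40 ps late, so the worst negative slack for the design is -40 ps.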
Summary
Chip synthesis outperforms traditional synthesis because it can synthesize and optimize the whole chip at once, as opposed to forcing the designer to split the design up into a large number of sub-blocks that need to be stitched back together again for physical design.
The key underlying technology is that optimization is done at the register-transfer level, as opposed to the gate level where traditional synthesis spends all its time. This allows for much faster and more effective design exploration. Also, since the entire design is always placed, the timing values are very accurate, and the congestion analysis ensures that the placement used by synthesis to time the design will be close to the final placement after place and route, leading to greater predictability.
Oasys Design Systems' RealTime Designer is a chip synthesis solution available today.
About the author:
Paul van Besouw is president, CEO and co-founder of Oasys Design Systems. He managed the synthesis and physical synthesis teams at Cadence and was a member of the Ambit engineering team.
He holds an M.S. in Electrical and Computer Engineering from the Eindhoven University of Technology, the Netherlands.





