United Business Media EE Times



Hierarchical Physical Design for Megagate ASICs

Raw design size in megagate ASICs can cripple physical design and timing closure - physical hierarchy is the solution.

By Sam Appleton


Cutting-edge ASIC projects face a host of challenges in design, verification and test. Design complexity is forcing reuse methodologies to try to keep up with increases in silicon density as process technology goes below 0.25 µm. On top of these front-end complexity issues, physical design has become more problematic as design size and performance increase. Timing and area convergence, as well as signal integrity, have become critical issues in the back-end. This additional challenge at the physical layer can add months to time-to-market for ASIC products.

Challenges in physical design

CAD tool companies have risen to the physical design challenge by increasing the performance and capacity of their tools, as well as adding new features such as integrated synthesis and placement to get faster timing convergence. This has allowed large ASIC designs to remain flat at the back-end, simplifying the CAD flow and minimizing the cost of tools. Despite these improvements, iteration time between synthesis and place-and-route, a critical factor to achieve timing and area convergence, is still increasing as design sizes rise. Larger designs take significantly longer to complete synthesis followed by place-and-route.

Thus, the backend may require multiple iterations to close on timing. With a slower iteration loop, this can be a significant hit to the schedule. In some cases, designs may not converge on area, timing - or both - leaving the team at an impasse. This impact on the project schedule can be devastating.

Furthermore, very deep submicron (VDSM) process technology often introduces new, complex problems to VLSI layout: signal integrity problems due to capacitive coupling and inductive effects. Since coupling capacitance continues to rise as a percentage of total wire capacitance, delay and noise issues on global wiring have arisen, causing real problems in post-tapeout design debug. CAD tools need to do a better job of analyzing and fixing violations.

Physical hierarchy - partitioning the design into two or more levels of layout hierarchy - offers a solution that may bring a range of benefits to the physical design process.

The good, the bad, and the ugly

The most critical gain realized in switching to hierarchically structured layout is "divide and conquer".

Modules or blocks of a design can be built independently, and timing closure on these blocks can be achieved independently of other blocks of the design. The top level of the design is where all the blocks are interconnected and where chip assembly is done.

Figure 1 - Top-level power distribution
Rings enlarge channel area considerably. A grid approach uses upper metal layers for power, reducing channels considerably and leaving adequate metal for routing. (left) Ring power distribution    (right) Grid power distribution

Top-level timing closure can commence once an initial top-level netlist is available, even if RTL for the blocks is in flux or not available. Early feedback on design area can be given by estimating block size for uncompleted sections of the RTL.

A top-level layout is required to instantiate and to interconnect all the blocks of the hierarchy. Invariably, this involves some area loss from channels, which are regions used to route connections between the blocks. This is commonly cited as one drawback of hierarchical layout. However, with processes having six or more layers of metal, intelligent layer assignment can mitigate or eliminate this penalty altogether. In addition, by eliminating power rings, commonly used for power distribution to blocks, another major contributor to channel space is removed (see Figure 1a and Figure 1b).

Many designers feel that physical hierarchy is not worth the additional effort and CAD tools required at the backend. For ASICs in the 100,000 to 750,000-gate range, this may indeed be true. However, with increasing design size and more aggressive timing targets, hierarchical layout will become essential. In addition, new physically aware synthesis tools have sweet spots in the 200,000 to 300,000 instance range, allowing hierarchical layout to take full advantage of these new tools since block size typically falls here.

One drawback is the expertise required to develop a hierarchical backend CAD flow. However, once developed, such a methodology can be brought forward for multiple future generations of hierarchical ASICs, enabling faster timing closure and better back-end predictability.

Design flow

In hierarchical physical design, we partition the design into two (or more) levels of hierarchy, the top and the block level. The tool requirements and issues for these two levels are different, and affect how we get from RTL to a timing-clean layout.

Layout of the block netlist uses standard cell place-and-route, as well as new physical synthesis solutions. These tools are well understood and we can commence timing closure on each block in the hierarchy when RTL becomes available, thus treating each block as its own "mini-chip". Each block may contain hard macros as well as other levels of physical hierarchy if required.

Figure 2 - RTL-to-layout flow
All RTL passes through synthesis tools, but the block and top level are built differently.

The top-level of the layout hierarchy requires a different set of functionality than the block level.

The top-level should mainly contain large blocks of the design and any critical pieces of top-level functionality, like PLLs, I/O pads, and repeaters. We used Synopsys' Flexroute top-level router and floorplanner on our last ASIC tapeout with good success; an area-based gridless top-level router with additional floorplanning functionality proved critical to meet our density and schedule targets.

A typical design flow includes a range of tools (see Figure 2). RTL is synthesized to block-level gate netlists as well as a top-level netlist, which contains the top-level blocks and their interconnections. Timing constraints are applied in synthesis as well as layout to speed timing convergence. The block-level floorplans are generated interactively, drawing on both the top-level floorplanner and block-level decisions that affect the block floorplan. Each part of the design proceeds to timing analysis and iteration, and then to the final physical verification stages for customer-owned tooling (COT) designs.

Floorplanning is crucial

Floorplanning becomes a critical step in hierarchical layout flows. Since the blocks must be connected at the top-level, pin assignment, floorplan evaluation, and the routing quality become critical. Flexroute provides a number of features essential for producing an excellent top-level floorplan.

Figure 3 - Pin assignment operations
Yellow pins are soft or movable pins, red pins are hard or immovable pins, and purple pins must retain their "edge" of the block. Pin assignment minimizes routed net distance (and thus routing resource usage) while obeying these pin constraints as well as pin keep-out areas, shown in grey. (left) Before pin assignment     (right) After pin assignment

Routing-driven pin assignment minimizes the routing resources required at the top-level by minimizing the routed net length between connected pins of blocks. This process is applied to the soft pins of each block - in other words, pins that can be moved freely in the floorplan. This technique can be applied to virtually eliminate channels in the top-level design by allowing the top-level designer to reduce channel size down to what is required purely for the nets traversing the channel (see Figures 3a and 3b).
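
As a rough illustration of the idea, the following Python sketch assigns a single soft pin. It does not describe how Flexroute works internally; it simply picks, from a set of candidate slots along a block edge, the legal slot with the smallest Manhattan distance to the pin it connects to, skipping any slot in a keep-out area. The function names, coordinates and slot spacing are all hypothetical.

```python
from typing import Iterable, Optional, Tuple

Point = Tuple[int, int]


def manhattan(a: Point, b: Point) -> int:
    """Manhattan (rectilinear) distance, a common proxy for routed length."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def assign_soft_pin(candidates: Iterable[Point],
                    target: Point,
                    keep_out: Iterable[Point] = ()) -> Optional[Point]:
    """Return the legal candidate slot closest to the connected pin.

    Slots listed in keep_out are excluded. Returns None if no slot is legal.
    Ties go to the first candidate in iteration order.
    """
    blocked = set(keep_out)
    legal = [c for c in candidates if c not in blocked]
    if not legal:
        return None
    return min(legal, key=lambda c: manhattan(c, target))


# Hypothetical example: slots every 2 units along the right edge of a block.
right_edge = [(10, y) for y in range(0, 11, 2)]
target_pin = (25, 7)          # pin on a neighbouring block (assumed position)
keep_out_slots = [(10, 6)]    # slot covered by a pin keep-out area

best_slot = assign_soft_pin(right_edge, target_pin, keep_out_slots)
print(best_slot)  # (10, 8)
```

A real pin assigner would optimize all soft pins together and account for routing congestion; this sketch handles one pin in isolation only to show the objective being minimized.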

A fast global router allows pin assignment and routing density in channels to be evaluated. The results can then be used either to increase channel space in over-congested regions or to allocate more area to critical or already over-dense blocks. The combination of pin assignment and global-routing-driven floorplan evaluation allowed a density increase in our most recent design of as much as 20 percent compared with our earlier hierarchical design. This density increase was achieved by removing unused channel area, which made room for extra functionality. Critical to this was the speed at which we could evaluate the floorplan and provide feedback to the logic team for area-intensive optimizations. Most of the top-level channel area that proved unnecessary was virtually eliminated using pin assignment.

Timing optimization allows the timing of the top-level nets to be evaluated and improved using a combination of wiring optimizations and repeater insertion. Wiring optimizations improve the RC delay of net segments, whereas repeater insertion breaks a long net, whose RC delay grows quadratically with length, into delay-optimized segments that give a near-linear delay as a function of routed net length.
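
The quadratic-versus-linear behaviour can be seen in a small Python sketch. It uses a simplified textbook model, not the delay calculator of any particular tool: the Elmore delay of a distributed RC wire is taken as 0.5 * r * c * L^2, and each repeater adds a fixed delay `t_rep`. Driver resistance and load capacitance are ignored, and the per-millimetre values below are hypothetical round numbers rather than data from the design described here.

```python
import math


def unrepeated_delay(r: float, c: float, length: float) -> float:
    """Elmore delay of a distributed RC wire: grows with length squared."""
    return 0.5 * r * c * length ** 2


def repeated_delay(r: float, c: float, length: float,
                   t_rep: float, n: int) -> float:
    """Delay of the same wire split into n equal segments,
    each segment followed by a repeater with fixed delay t_rep."""
    segment = length / n
    return n * (unrepeated_delay(r, c, segment) + t_rep)


def best_segment_count(r: float, c: float, length: float, t_rep: float) -> int:
    """Integer segment count minimizing repeated_delay.

    The continuous optimum is n = L * sqrt(0.5 * r * c / t_rep);
    the integers on either side of it are compared.
    """
    n_real = length * math.sqrt(0.5 * r * c / t_rep)
    candidates = {max(1, math.floor(n_real)), max(1, math.ceil(n_real))}
    return min(candidates,
               key=lambda n: repeated_delay(r, c, length, t_rep, n))


# Hypothetical values: ohm/mm, pF/mm, ps (ohm * pF = ps), lengths in mm.
R_PER_MM, C_PER_MM, T_REP = 100.0, 0.2, 40.0

for length_mm in (5.0, 10.0):
    n = best_segment_count(R_PER_MM, C_PER_MM, length_mm, T_REP)
    print(length_mm,
          unrepeated_delay(R_PER_MM, C_PER_MM, length_mm),
          n,
          repeated_delay(R_PER_MM, C_PER_MM, length_mm, T_REP, n))
```

Under these assumptions, doubling the net length from 5 mm to 10 mm quadruples the unrepeated delay but roughly doubles the repeated delay, which is the near-linear behaviour described above.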

Signal integrity and timing

The top level of the physical hierarchy contains relatively long nets, since they traverse major portions of the chip to connect the blocks. This can result in significant coupling capacitance between adjacent nets travelling to the same or similar destinations, and thus noise and spurious data-dependent signal delays (see Figures 4a and 4b). An aggressor signal can cause faster or slower signal transitions on victim lines, depending on topology and distance from the driver. When the victim is quiescent, the aggressor causes noise spikes on the victim, potentially glitching downstream logic or producing incorrect values if the glitch is sampled.
Figure 4 - Signal Integrity
With the aggressor and victim simultaneously switching, the victim suffers from delay effects. With the victim quiescent, noise results from aggressor switching. (a) Top-level nets subject to delay and noise effects       (b) Waveforms on the victim net

The magnitude of the effects depends on the coupled capacitance as well as the resistance to the driver. Thus, we can increase the wire spacing (reducing coupled capacitance), widen the wire (reducing resistance from the source), or add repeaters (reducing distance to the source) to address signal integrity on these long, parallel routed nets. Flexroute's gridless approach allows a fine degree of control over width and spacing for all nets of the design, enabling simultaneous optimization of signal integrity, timing and top-level congestion. This is not possible with a gridded router, which requires a more restrictive set of wiring rules, compromising either delay and signal integrity or design density.
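
A first-order Python sketch of these two effects follows. It uses two common textbook approximations rather than anything specific to the tools discussed here: a Miller factor on the coupling capacitance for the delay effect (0 when the aggressor switches in the same direction as the victim, 1 when the aggressor is quiet, 2 when it switches in the opposite direction), and a capacitive-divider bound on the noise spike, which is pessimistic because it treats the victim as undriven. The capacitance values, the 2.5 V supply, and the assumption that doubling the spacing roughly halves the coupling capacitance are all hypothetical.

```python
def effective_capacitance(c_ground: float, c_couple: float,
                          miller_factor: float) -> float:
    """Capacitance seen by the victim driver during a transition.

    miller_factor: 0 = aggressor switches with the victim,
                   1 = aggressor quiet,
                   2 = aggressor switches against the victim.
    """
    return c_ground + miller_factor * c_couple


def noise_bound(vdd: float, c_ground: float, c_couple: float) -> float:
    """Capacitive-divider upper bound on the noise spike coupled onto a
    quiescent victim (victim treated as floating, so this is pessimistic)."""
    return vdd * c_couple / (c_ground + c_couple)


# Hypothetical net: capacitances in fF, supply in volts.
VDD, C_GROUND = 2.5, 100.0
C_COUPLE_MIN_SPACING = 60.0
C_COUPLE_DOUBLE_SPACING = 30.0  # assumes coupling roughly halves

for c_couple in (C_COUPLE_MIN_SPACING, C_COUPLE_DOUBLE_SPACING):
    fast = effective_capacitance(C_GROUND, c_couple, 0)
    slow = effective_capacitance(C_GROUND, c_couple, 2)
    print(c_couple, fast, slow, noise_bound(VDD, C_GROUND, c_couple))
```

In this model, wider spacing narrows the gap between the best-case and worst-case load on the victim driver and lowers the noise bound, which is why spacing control along long parallel routes is useful.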

Automation

Physical hierarchy involves a number of dependent and parallel tasks. To eliminate the "black art" from the physical layout process, all tool flow should be isolated to scripts (for tool functionality) and Makefiles (for tool flow).

Figure 5 - Pin optimization process
Pins were optimized both at the block and the top level, depending on connectivity and constraints.

Makefiles allow all of the build dependencies and source files to be explicitly noted as part of the build process, and they also provide a degree of self-documentation of the physical design process. Since tools need some level of wrapping to customize them for the target project, a script can be used to encompass each major portion of the design flow (like placement, clocks, routing, and extraction). The Makefile then tracks dependencies between these tasks and automates the netlist-to-verified-layout process. Top-level Makefiles can be used to build all elements in parallel, reducing tool interaction to a minimum. This frees up valuable designers to improve tool flow, analyze and improve the design, and fix build failures. Our last tapeout automated the RTL-to-netlist, netlist-to-layout, and layout verification processes entirely. We were, therefore, able to concentrate on tool issues and on timing and area closure.
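
The dependency tracking described above can be illustrated with a short Python sketch. This is a stand-in for what make does, not the flow used on the project: it uses the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later) to group build targets into "waves" that could run in parallel. The target names and the dependencies between them are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical build graph: target -> set of targets it depends on.
FLOW = {
    "blockA.layout": {"blockA.netlist"},
    "blockB.layout": {"blockB.netlist"},
    "top.layout": {"top.netlist", "blockA.layout", "blockB.layout"},
    "chip.verified": {"top.layout"},
}


def build_waves(deps):
    """Group targets into waves; every target in a wave has all of its
    prerequisites built in earlier waves, so a wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for wave in build_waves(FLOW):
    print(wave)
```

With make itself, the same graph would be written as targets and prerequisites, and parallel execution would come from running independent targets concurrently.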

ASIC tapeout

We used all of these techniques (except for repeater insertion) on the tapeout of an SGI ASIC, Krypton. The design contains 15 million transistors with around 750,000 placeable instances, and uses a six-layer metal 0.25-µm CMOS process with flip-chip bump packaging technology on a die measuring 13.6 mm x 13.6 mm.

Previous projects had used hierarchical physical design, but this design was the first to employ Flexroute. Using the tool in combination with improvements to the block-level place-and-route flow allowed us to pack in 50 percent more transistors than the previous design while increasing clock speed by 33 percent to 133 MHz, despite an unchanged die size and process geometry (although the transistors in the process did improve considerably).

We used a wide variety of techniques to optimize our top-level layout, mostly focused on pin assignment and density improvements. We optimized almost all of the 17,000 pins in the top-level design. Our pin assignment process was both bottom-up and top-down. We used block-level optimizations to improve block routability for hard IP blocks and critical nets like clocks, and then optimized other pins at the top level using the tool's routing-driven pin assignment.

Manual routing features were used to pre-route critical top-level nets that traversed nearly the entire width or height of the core area. Blocks which connect to these nets then tap directly to the pre-routed segments. Manual pre-routing allowed total control over net topology, as well as net width and spacing along every segment of the pre-route. This permitted us to adjust width/spacing in channels where congestion became an issue.

Timing closure was relatively easy at the block level. We used timing-driven place-and-route extensively, and most timing issues were fixed by changing the physical timing constraints, changing the floorplan given to the place-and-route tool, or optimizing in synthesis. Some "sticky" timing issues can result when connecting to large IP blocks, as wireload models do not account for net length. Such problems require in-place optimization (IPO) techniques or gate instantiation to fix.

Top-level timing closure was slightly more difficult. Many problems were traced to inconsistent boundary constraints between the synthesis and physical timing constraints, which were actually fairly easy to fix and evaluate. Final timing problems included certain pins with large (greater than 4) fanout at the top-level, resulting in significant net length and thus delay due to topology, and long inter-block paths, fixed by path breaking, repeater insertion, or changing the top-level routing topology (sometimes by pin optimization).

Central to our ability to close timing quickly was the speed of our iteration loop, enabled by physical hierarchy. Total build time for top-level blocks was 4 to 24 hours, including timing analysis. Top-level in-context analysis then took from 8 to 24 hours, which generated a top-level timing report. Therefore, we were able to turn the complete chip around in at most 2 days. Our automation approach allowed this to occur with minimal intervention on the designers' part. We used multiple CAD licenses to build blocks in parallel, keeping our iteration time almost constant despite changes in many blocks for each turn.
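
The two-day figure follows from simple arithmetic once blocks are built in parallel, as the Python sketch below shows. It assumes enough licenses to build every block at once, so the build phase takes as long as the slowest block; the individual block times are hypothetical values inside the 4-to-24-hour range quoted above.

```python
def turnaround_hours(block_build_hours, top_analysis_hours, parallel=True):
    """Hours for one full-chip iteration.

    parallel=True  : all blocks build at once, so the slowest block dominates.
    parallel=False : blocks build one after another, so the times add up.
    """
    if parallel:
        build = max(block_build_hours)
    else:
        build = sum(block_build_hours)
    return build + top_analysis_hours


# Hypothetical block build times, in hours; top-level analysis at its 24 h worst case.
blocks = [4, 10, 24]

print(turnaround_hours(blocks, 24))                  # 48 hours, i.e. 2 days
print(turnaround_hours(blocks, 24, parallel=False))  # 62 hours
```

The parallel case depends only on the slowest block, which is why the iteration time stays almost constant however many blocks change in a given turn.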

Once timing was closed, parallel physical verification enabled preparation of the final GDS II within 24 hours, ready to be released to the foundry.

Where to from here

We achieved significantly higher density on this design by improving our hierarchical layout flow and adopting new place-and-route solutions at the block level.

Sharing upper layers between power and top-level routing - and some block-level routing - may take us closer to eliminating channels in the design and allow us to continue to improve density. We did not implement layer sharing on this design and used a dedicated power/ground layer to distribute power to all the blocks.

Timing analysis and optimization in the Flexroute environment would enable us to get a faster start on top-level timing closure without deploying the extraction, delay calculation, and analysis loop. In addition, we could insert repeaters earlier in the design flow to fix timing violations while block netlists are still in flux.

Internal CAD improvements will allow us to close timing faster in our next design. In particular, we need to have a clean way of separating top-level timing analysis from our block-level results, which would in turn enable us to turn our top-level loop faster.


Sam Appleton is a member of the technical staff at SGI, Mountain View, CA. He currently is working on the logic and physical design of the next-generation MIPS processors at SGI.

To voice an opinion on this or any other article in Integrated System Design, please e-mail your comments to sdean@cmp.com.


Copyright © 2000 Integrated System Design Magazine
