System Design

Timing Analysis for the PA-8000

HP engineers developed two timing analysis methodologies to optimize the PA-8000 microprocessor.

by Clay McDonald, Tom Indermaur, and Mike Buckley


This article by Clay McDonald, Tom Indermaur, and Mike Buckley is the second in a series of three that describe the design of Hewlett-Packard's PA-8000 microprocessor--the heart of the company's latest generation workstations. The first appeared in our January 1997 issue. The third will run in the March 1997 issue. The three articles are extracted from presentations being made at Design Supercon97 (Santa Clara, CA) this month.

As metal line widths decrease and transistor counts increase, timing verification of state-of-the-art microprocessors becomes both more critical and more challenging. RC delay, once a second-order effect, now dominates global signal budgets and requires highly detailed analyses. At the same time, chip complexity continues to grow, increasing capacity and data management requirements. To meet these challenges for the PA-8000 microprocessor, the CPU and tool development teams at Hewlett-Packard Co. (Fort Collins, CO) developed two global timing analysis methodologies excelling in accuracy, completeness, and flexibility.

Our early global timing methodology addressed only global interconnect analysis, generating feedback to improve routing, floorplanning, and global signal timing and budgeting. These processes performed well and enabled us to achieve a high confidence level in our global timing design before silicon was available.

Most importantly, the PA-8000 met the target frequency of 180 MHz with margin on first silicon. In addition, the timing model accurately predicted many critical paths later observed and measured on silicon.

At first tape release, we set out to develop a new methodology to provide greater accuracy, detail, and visibility. Our goal was to aid silicon characterization of the PA-8000 as well as devise a new methodology for future development projects.

This second methodology was successful in producing greater verification coverage and model accuracy, and played a key role in ensuring delivery of a high-quality PA-8000 design to our customers.

The design
The PA-8000 is an extremely large and complex full-custom design, containing 3.8 million logic transistors. Fabricated using a 0.5-µm technology with five layers of metal, it operates at 180 MHz and issues instructions out of a 56-entry reorder buffer to achieve leading-edge performance.

RC challenges
The PA-8000's size and complexity required a hierarchical design approach. At the top level of the design hierarchy, approximately 6,000 global routed signals are devoted to communication between the various functional units. The task of modeling all the interconnect, as well as the full-custom circuitry employed, presented a great challenge.

Interconnect resistance is a complicating factor in timing analysis. Additional nodes must be defined to describe the R and C element interconnections. The resulting volume of data can quickly overwhelm a computer's resources. In addition, RC nets are often split across hierarchy, introducing a wealth of corner cases and modeling issues. Guaranteeing correct connectivity of the complex RC networks in an evolving design presents more difficulties.

Finding the "wall"
One of the most useful outcomes of a global timing analysis is finding the "wall." This is the point of diminishing returns--the point beyond which pushing the frequency by fixing speed paths becomes increasingly difficult.

This figure shows a hypothetical failure histogram with frequency on the X axis and the number of failures on the Y axis. The histogram clearly shows that the target frequency (highlighted with the vertical bar) could be raised about 15 percent before hitting the point of diminishing returns. The shaded area represents everything beyond the "wall."

The "Wall": impossible to find without a complete, accurate model.
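As a rough illustration of how such a histogram might be read, the Python sketch below counts failing paths at each candidate frequency and reports the highest frequency that stays within a fixed repair budget. The function names, the sample delays, and the idea of defining the "wall" by a maximum number of fixable paths are assumptions made for illustration; they do not describe the tools used on the PA-8000.

```python
def failures_at(path_delays_ns, freq_mhz):
    """Count paths whose delay exceeds the clock period at freq_mhz."""
    period_ns = 1000.0 / freq_mhz
    return sum(1 for d in path_delays_ns if d > period_ns)


def find_wall(path_delays_ns, candidates_mhz, max_fixes):
    """Highest candidate frequency with no more than max_fixes failing paths.

    Assumption: the "wall" is approximated by a repair budget (max_fixes).
    Returns None if even the lowest candidate exceeds the budget.
    """
    ok = [f for f in candidates_mhz
          if failures_at(path_delays_ns, f) <= max_fixes]
    return max(ok) if ok else None


# Hypothetical worst-case path delays in nanoseconds.
sample_delays_ns = [4.0, 4.2, 4.5, 4.6, 4.7, 4.8, 4.8, 4.9]

for f in range(180, 230, 10):
    print(f, "MHz:", failures_at(sample_delays_ns, f), "failing paths")
print("wall near", find_wall(sample_delays_ns, range(180, 230, 10), 3), "MHz")
```

In this made-up data set no path fails at 180 MHz or 200 MHz, while the failure count climbs quickly above 210 MHz, which is the shape the sidebar describes.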

Our toolset extracts parasitics at each level of design hierarchy, providing efficient management of the large number of RC elements in the full design. Because our RC data is hierarchical, single RC networks are often split across multiple levels of hierarchy. For example, feedthroughs occur when a global-routed signal makes use of metal placed inside a circuit block. Frequently, a long wire will separate the driving transistor from the port of the block, creating a large RC delay at the block port. Since an RC delay is highly affected by the network's environment, most notably the driver strength and the receiver capacitance, our approach analyzed the entire network as a single unit to obtain accurate results.

Any approach to RC simulation must include some form of node foliation. Node foliation replaces the original "logical" node (as would appear in a schematic) with a network of resistors, capacitors, and artificially created and named nodes. These nodes represent all of the internal points of the RC network, as well as differentiate each of the termination points (see Figure 1). The development of new tools for formatting and analyzing complex network connectivity enabled us to efficiently manage large netlists as the design evolved.
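A minimal Python sketch of the foliation idea follows. It assumes each extracted wire segment is given as two endpoints plus a resistance and a capacitance, and that the segment capacitance is split evenly between its two ends; the naming scheme (`net#endpoint`) and the data layout are hypothetical, not those of the HP toolset.

```python
from collections import defaultdict


def foliate(net, segments):
    """Replace logical node `net` with resistors, capacitors, and new nodes.

    segments: list of (end_a, end_b, r_ohms, c_ff) wire pieces, where the
    ends are terminal names or internal junction labels.
    Returns (resistors, caps): resistors is a list of
    (name, node_a, node_b, r_ohms); caps maps each foliated node to its
    grounded capacitance in fF.
    """
    def node(key):
        # Artificially created node name, unique within the logical net.
        return f"{net}#{key}"

    resistors = []
    caps = defaultdict(float)
    for i, (a, b, r_ohms, c_ff) in enumerate(segments):
        resistors.append((f"R_{net}_{i}", node(a), node(b), r_ohms))
        # Assumption: half of each segment's capacitance goes to each end.
        caps[node(a)] += c_ff / 2
        caps[node(b)] += c_ff / 2
    return resistors, dict(caps)


# One driver, one junction, two receivers on a hypothetical net "data7".
resistors, caps = foliate("data7", [
    ("drv", "j1", 10.0, 4.0),
    ("j1", "rcvA", 20.0, 8.0),
    ("j1", "rcvB", 5.0, 2.0),
])
```

Each termination point (`data7#drv`, `data7#rcvA`, `data7#rcvB`) now has its own node, so delays to different receivers can be told apart.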

Connecting the ports
Very often in custom designs, a circuit block will have multiple physical ports for a given "logical" port. In many cases, these ports are separated by significant global RC networks that require accurate modeling.

Connecting circuit-block timing specifications defined with logical ports to multiple physical block ports creates an interconnect modeling problem. In this case, there are at least three possible ways to model the connections, each with disadvantages that introduce inaccuracy. The physical ports may be shorted together and connected to the logical port, providing a simple solution but ignoring the global route resistance separating the physical ports. Instead, the logical port may be connected to the most critical physical port. However, it may be difficult to identify the critical port on a network connected to multiple drivers and receivers. Another alternative is to connect the logical port to the middle of a network, modeling the internal block connectivity to the physical ports.

Figure 1. RC simulation requires foliation, in which a complex network of resistors, capacitors, and artificially created ("foliated") nodes replaces a "logical" node.

A better general solution is to provide multiple circuit-block timing specifications for the multiple physical ports and connect them appropriately to the global routing.

Black boxes
Our early global timing analysis used a gate-based static timing analyzer, which also read "black box" timing specifications. These specifications modeled the full-custom block designs based on designers' budgets, estimates, or data drawn from Spice simulation. The final PA-8000 model contained a set of approximately 80 major circuit-block timing specifications. Paired with each major specification was another file containing RC networks, which modeled feedthroughs and long ported wires, and resolved multiple physical port connections.

Interconnect delays were precalculated by a network analyzer that combined RC data with driver strength and port capacitance information and generated a report of point-to-point delays.
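The article does not say which delay model the network analyzer applied. As one plausible first-order stand-in, the Python sketch below computes Elmore delays on an RC tree, treating driver strength as an effective output resistance and folding each receiver's port capacitance into the capacitance at its node. All names and values are assumptions for illustration.

```python
def elmore_delays(tree, caps, root, r_driver):
    """Elmore delay from the driver to every node of an RC tree.

    tree: {parent: [(child, r_ohms), ...]} describing the wire resistances
    caps: {node: capacitance}, including receiver port capacitance
    root: node at the driver output
    r_driver: effective driver resistance (assumed model of driver strength)
    With ohms and pF, the returned delays are in ps.
    """
    def downstream(n):
        # Total capacitance at n and everything beyond it.
        return caps.get(n, 0.0) + sum(downstream(c) for c, _ in tree.get(n, []))

    delays = {}

    def walk(n, t):
        delays[n] = t
        for child, r in tree.get(n, []):
            # Each resistor charges all capacitance downstream of it.
            walk(child, t + r * downstream(child))

    walk(root, r_driver * downstream(root))
    return delays


# Hypothetical net: one driver, one branch point, two receivers.
tree = {"drv": [("j1", 40.0)], "j1": [("rcvA", 20.0), ("rcvB", 80.0)]}
caps = {"drv": 0.0, "j1": 0.25, "rcvA": 0.5, "rcvB": 0.25}
point_to_point = elmore_delays(tree, caps, "drv", r_driver=100.0)
```

Because the driver resistance multiplies the total network capacitance, the same wire reports a different delay when the driver or any receiver changes, which is why the delays depend on driver strength and port capacitance as well as on the extracted RC data. The recursion recomputes downstream capacitance at every step, which is fine for a sketch but would be cached in a production tool.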

The "black box" circuit-block specifications contained setup times, clock-to-out delays, combinational delays, and port capacitances for each logical port. These specs abstracted away any internal structures. The clock-to-out and combinational delays were modeled with two-coefficient equations consisting of constant and load-dependent delay terms.
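A delay equation with a constant term and a load-dependent term can be written as delay = constant + slope × load. The Python sketch below shows that form, plus one hypothetical way to derive the two coefficients from a pair of simulated points; the class name, units, and fitting helper are assumptions, not the format of the actual specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoCoeffDelay:
    """delay = constant_ps + ps_per_pf * load_pf (assumed units)."""
    constant_ps: float
    ps_per_pf: float

    def delay_ps(self, load_pf):
        return self.constant_ps + self.ps_per_pf * load_pf


def fit_two_points(p1, p2):
    """Derive coefficients from two (load_pf, delay_ps) measurements."""
    (load1, d1), (load2, d2) = p1, p2
    slope = (d2 - d1) / (load2 - load1)
    return TwoCoeffDelay(constant_ps=d1 - slope * load1, ps_per_pf=slope)


# Hypothetical clock-to-out arc characterized at two output loads.
clk_to_out = fit_two_points((0.25, 450.0), (0.75, 650.0))
print(clk_to_out.delay_ps(0.5))
```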

Strong points
This early methodology displayed significant strengths. For example, using extracted parasitic data improved the RC delay calculation accuracy, particularly for complex RC networks with multiple branches and destinations. Such cases are not handled well by methodologies relying on estimated or lumped resistances and capacitances.

The early methodology excelled as a budgeting and communication tool. All circuit-block specifications and timing assumptions for each global signal were crosschecked, including the RC delays of the global route. The timing model provided a tracking mechanism and central database for global signal timing.

Our experience also identified several issues not adequately addressed by our early methodology. First, the large amount of budgeted and estimated data contained in the circuit-block specifications was time-consuming to generate, verify, and keep current. Estimated data was often overly worst-cased, skewing the position of the "wall" on the frequency histogram (see "Finding the 'wall'"). Second, our toolset was stressed by our large design, resulting in longer-than-desired loop times. Third, our analysis tools did not take full and efficient advantage of our design hierarchy. This limitation contributed significantly to our decision to develop a new methodology based on hierarchical analysis tools.

Finally, the problem of connecting single logical circuit-block specifications to multiple physical-block ports remained. Resolving each case manually added processing time and increased loop cycle times. We needed a general, efficiently automated, and accurate solution integrated with a more-comprehensive, hierarchical timing methodology.

Figure 2. Ported RC networks are pulled out of the netlist for a child block and are moved up, or "promoted," to the next level of the hierarchy.

A new methodology
At first tape release, the global timing team set out to address the limitations of our early timing approach while building on its strengths. As with our pre-silicon approach, we opted for static timing analysis to achieve complete path coverage. To avoid the problems associated with estimated timing specifications, we required a static timing analyzer capable of analyzing our custom designs at the transistor level, as well as at the gate level for library-based designs. Early floorplanning still requires support for black-box and large memory structure models to reduce capacity limitations. To further improve efficiency, we decided to automate promotion of block-ported RC networks to higher levels of the hierarchy.

Our early timing methodology was limited to one level of hierarchy, forcing us to flatten our database before running the static timing analyzer, resulting in a large monolithic data set that slowed analysis. In addition, low-level design issues involving tightly coupled blocks were often identified by the global timing model, stealing post-processing bandwidth from the team and delaying feedback to the designers of the low-level block. Clearly, we wanted a flexible methodology to partition the hierarchy as needed and seamlessly integrate the block-level analysis into the global timing model.

Choosing an analyzer
Central to achieving these goals was selecting a static timing analyzer capable of understanding our full-custom design styles and encapsulating them for global analysis. Of the commercial tools available, we opted for PathMill, from EPIC Design Technology Inc. (Sunnyvale, CA), for a number of reasons:

  • PathMill analyzes high-performance circuit design styles, such as transparent latches, dynamic logic, pseudo-NMOS, and gated clocks. Nearly all of our custom techniques either fall into one of these categories or require some minor work-arounds.
  • PathMill handles RC parasitic networks. Our tools are well-suited to generating netlists containing RC parasitics, so there is no need to calculate and backannotate RC delays.
  • PathMill generates an abstraction of a block, called a gray box, which contains only the worst-case timing arcs between latches and ports. Using a gray box for each major block in a full-chip model reduces the amount of data for simulation.
  • PathMill models may contain an integration of gray boxes, transistors, and RC parasitics. This flexibility provides an elegant structure for timing verification of hierarchical designs.

By creating and analyzing hierarchical timing models, PathMill not only reduces capacity requirements, it also enables us to iterate between design and analysis of tightly coupled blocks independent of the higher level timing model.
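The gray-box idea of keeping only the worst-case timing arc between each pair of endpoints can be sketched in a few lines of Python. This is an illustration of the data reduction only; the tuple layout and function name are assumptions and do not reflect PathMill's file formats or algorithms.

```python
def gray_box(paths):
    """Collapse analyzed paths into worst-case arcs.

    paths: iterable of (start, end, delay_ps), where start and end are
    block ports or latches.
    Returns {(start, end): worst_delay_ps}.
    """
    arcs = {}
    for start, end, delay_ps in paths:
        key = (start, end)
        # Keep only the slowest path seen for this endpoint pair.
        if delay_ps > arcs.get(key, float("-inf")):
            arcs[key] = delay_ps
    return arcs


# Hypothetical block: several internal paths share the same endpoints.
block_paths = [
    ("in_a", "latch1", 300),
    ("in_a", "latch1", 420),
    ("latch1", "out_x", 510),
    ("latch1", "out_x", 480),
]
model = gray_box(block_paths)
```

Four internal paths reduce to two arcs here; on a block with thousands of paths between the same ports and latches, the reduction is what keeps the full-chip model manageable.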

Figure 3. For analysis at the top level of the hierarchy, each child block and its promoted RC network is instantiated.

Data flow
The data flow in this timing methodology is bidirectional. Detailed timing information from low-level blocks is pushed upward, while context information from the top level is pushed down. Low-level blocks are simulated and analyzed with PathMill before generation of gray box models, which are incorporated into the top-level timing model. Block context information is determined at the top level and returned for the next iteration of lower level block simulation and gray box generation.

For tightly coupled blocks, we introduced a level of hierarchy called a minichip. At this level, the gray boxes of child blocks are instantiated along with any transistors and RC parasitics that were not abstracted at the lower levels. As in the block analysis, a minichip gray box is generated and submitted for use in the top-level timing model.

The top-level netlist contains a mixture of RC parasitics from the global route; transistors for simple structures, such as signal repeaters; and gray boxes for top-level blocks and minichips.

When simulating the top-level model, PathMill generates context files for each block and minichip. These files contain loading information, input signal arrival times, and output signal timing requirements. This information improves the driver and receiver gray box characterization and communicates port timing to the block designer during low-level block analysis.
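To make the role of the context data concrete, the Python sketch below holds the three kinds of information named above and uses the arrival and required times to compute the slack at each block output. The structure, field names, and slack calculation are hypothetical; the actual context file format is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class BlockContext:
    """Assumed in-memory form of one block's context data."""
    input_arrival_ps: dict = field(default_factory=dict)    # port -> arrival
    output_required_ps: dict = field(default_factory=dict)  # port -> deadline
    output_load_pf: dict = field(default_factory=dict)      # port -> load


def port_slacks(ctx, arcs):
    """Slack per output port: required time minus worst arrival time.

    arcs: {(in_port, out_port): delay_ps} for the block.
    """
    worst = {}
    for (src, dst), delay_ps in arcs.items():
        if src in ctx.input_arrival_ps:
            t = ctx.input_arrival_ps[src] + delay_ps
            worst[dst] = max(worst.get(dst, t), t)
    return {out: ctx.output_required_ps[out] - t
            for out, t in worst.items() if out in ctx.output_required_ps}


ctx = BlockContext(
    input_arrival_ps={"a": 1000, "b": 1400},
    output_required_ps={"y": 4000},
    output_load_pf={"y": 0.5},
)
slacks = port_slacks(ctx, {("a", "y"): 2500, ("b", "y"): 2800})
```

A negative slack on an output tells the block designer which port misses its top-level requirement without the full-chip model having to be rerun.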

While building the PathMill netlist for a lower level block, RC networks based on parasitic extraction are generated to connect the block ports to the internal transistor circuits. Because delays through these RC networks are difficult to characterize until the complete network is present, ported RC networks are promoted to the higher level model via an automated process.

Consider the example shown in Figure 2. Before applying PathMill analysis to the block, the ported RC networks are removed from the netlist and promoted to the next level of hierarchy. PathMill encapsulates only the shaded portion into a gray box.

Minichips use the same RC promotion process. Full RC networks are assembled at the appropriate level of hierarchy, solving a host of problems associated with feedthroughs, long ported wires, and multiple ported signals.
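The promotion step can be pictured with the short Python sketch below. It assumes a block's extracted RC data is a mapping from net name to a list of resistor elements and that a ported network is simply one whose net name is a block port; the data layout, the `block/net` naming, and the stitching helper are illustrative assumptions rather than the actual automated process.

```python
def promote_ported_rc(block_name, rc_networks, ports):
    """Split a block's RC networks into kept and promoted sets.

    rc_networks: {net: [(node_a, node_b, r_ohms), ...]}
    ports: set of net names that are ports of the block
    Returns (kept, promoted); promoted keys carry the block name.
    """
    kept, promoted = {}, {}
    for net, elems in rc_networks.items():
        if net in ports:
            promoted[f"{block_name}/{net}"] = elems
        else:
            kept[net] = elems
    return kept, promoted


def stitch(global_networks, promoted, port_to_global):
    """Attach promoted networks to the global nets they connect to."""
    full = {net: list(elems) for net, elems in global_networks.items()}
    for name, elems in promoted.items():
        full.setdefault(port_to_global[name], []).extend(elems)
    return full


# Hypothetical block "alu" with one ported and one internal RC network.
kept, promoted = promote_ported_rc(
    "alu",
    {"clk_in": [("clk_in", "n1", 35.0)], "int_bus": [("n7", "n8", 12.0)]},
    ports={"clk_in"},
)
full = stitch({"gclk": [("drv", "clk_in", 60.0)]}, promoted,
              {"alu/clk_in": "gclk"})
```

After stitching, the global net `gclk` holds both the global-route resistor and the wire inside the block, so the complete network can be analyzed as a single unit at the parent level.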

The top-level netlist instantiates the gray boxes and RC networks for all top-level blocks and minichips (see Figure 3). Signal repeaters and other simple blocks may remain in transistor form and are not abstracted into gray boxes. At this level, the model contains a set of all worst-case timing arcs represented in each low-level block gray box and a complete RC network for each global routed signal, including the promoted lower level RC networks.

In summary
This methodology provides a more comprehensive timing model where all speed paths are visible and reported with high accuracy. The path histograms correlated well with manufactured parts and predicted failure frequencies within 10 percent of silicon measurements. Higher accuracy translates into improved ability to identify the "wall," or point of diminishing returns. This type of information is invaluable for selecting product frequencies, scoping follow-on projects, and identifying opportunities to increase performance and yield.

Using PathMill, we generated circuit-block timing specifications directly from the transistor-level analysis, making use of our hierarchical RC extraction toolset. Combined with those specs, our RC promotion solution seamlessly integrates block and global timing, greatly increasing the accuracy and completeness of the entire process. The methodology also allows us to define minichips wherever their use is most productive.

These gains were accomplished while reducing the amount of handwork required to generate block-timing specifications and consolidate them into the global timing model.

By smoothing the path from block to global timing and dividing the hierarchy appropriately, we will be able to provide global timing feedback earlier and more often. This ability translates directly into faster time to market.

This methodology will continue to be refined. One of the biggest issues is the large amount of resources consumed. For the post-silicon analysis of the PA-8000, we generated approximately 50 Gbytes of data. More importantly, developing the methodology and completing the analysis required about ten engineers for six months. However, we believe the engineering requirements will be considerably lower as these processes become part of the incremental development and analysis of each circuit block.

This methodology proved its usefulness by identifying several timing paths that were not optimized during product development and would have been more costly to identify during silicon characterization. Due to its success, the methodology will be used on all of our current development projects.

Clay McDonald, Tom Indermaur, and Mike Buckley are design engineers at Hewlett-Packard Co. (Fort Collins, CO).


integrated system design  February 1997



Copyright © 1997 Integrated System Design Magazine
