Defining, developing, and verifying large and complex ASICs requires a disciplined and well-understood methodology with which to approach the ideal goal of first-pass success. At Network Virtual Systems (NVS), we begin by evolving the specification into a behavioral prototype early in the design cycle. We implement a divide-and-conquer strategy, addressing architectural issues and design completeness at a high-level behavioral stage, then using RTL simulation to verify design correctness. The behavioral model serves as groupware for the design team and enables us to evolve a robust and complete architecture.
NVS partners with large OEMs to develop server architectures and design the chipsets that implement such architectures. The growth of the Web and e-commerce has made the target market for these scalable server products both competitive and highly visible. The potential cost of a design failure is colossal. It's imperative that we create architectures that meet the performance and functionality criteria, and implement designs that function correctly over a very wide range of operating conditions.
Design overfloweth
Early in the architecture design cycle we develop a behavioral model, which we use to evaluate competing architectures, to characterize the performance impact of sizing (the impact of increasing a cache size from 32 Mbytes to 64 Mbytes, for example), and to analyze the performance sensitivity of design parameters. When we converge on an architecture, we sketch out the needed functions and describe the architecture and each functional block.
While the detailed descriptions for each functional block are under development, we code the top-level Verilog, which contains the I/O pins, the boundary scan registers, the functional block headers, the block ports, and the connectivity between these top-level elements. At this point the design is well described by a combination of top-level Verilog code, module descriptions, and an architectural document.
Next comes implementation, where we code the empty modules and edit the recycled modules. During implementation, we develop the system-level verification framework using a set of simple system-level transactional tests available for debug prior to completion of coding. We "avoid" module-level testing, beginning debug only when a complete section of the design is ready for system-level testing. While coding the modules, we develop a floor plan from the top-level Verilog, which contains all the connectivity information, and from our estimates of the size of each module.
When we settle on a floor plan, we perform global routing using the top-level Verilog and then extract the routed global delays. This step is critical: the synthesis tools can't generate accurate estimates for global delays, and without accurate estimates timing convergence is impossible. We perform synthesis at the full-chip level and incorporate the global timing to perform a gate-level mapping that meets our timing goals when we place and route the chip.
All the architecture's a stage
The early portion of our development cycle is devoted to characterizing architectural options generated in the concept phase (see Figure 1). The key requirement is the ability to evaluate competing options without having to invest time in modeling the details of an implementation. To achieve this efficiency we develop analytic models of the architectures. Just as spreadsheets allow financial analysts to explore financial strategies, the analytic models allow us to ask "what if" questions and evaluate architectural tradeoffs.
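As an illustration only, a "what if" question of this kind can be posed with a few lines of code. The sketch below uses the textbook average-access-time formula (hit time plus miss ratio times miss penalty); the class name, the function name, and every number are hypothetical and are not taken from the NVS models.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CacheConfig:
    """Hypothetical single-level cache parameters (illustrative values only)."""
    hit_time_ns: float      # latency when the access hits
    miss_penalty_ns: float  # additional latency when the access misses
    miss_ratio: float       # fraction of accesses that miss, 0.0 to 1.0


def average_access_time_ns(cfg: CacheConfig) -> float:
    """Textbook estimate: hit time + miss ratio * miss penalty."""
    return cfg.hit_time_ns + cfg.miss_ratio * cfg.miss_penalty_ns


# Baseline option (made-up numbers).
baseline = CacheConfig(hit_time_ns=10.0, miss_penalty_ns=200.0, miss_ratio=0.10)

# "What if" the cache were larger? Assume, purely for illustration, that the
# miss ratio halves while the hit time grows slightly.
bigger = replace(baseline, hit_time_ns=12.0, miss_ratio=0.05)

for name, cfg in (("baseline", baseline), ("bigger", bigger)):
    print(f"{name}: {average_access_time_ns(cfg):.1f} ns")
```

Because the model is a closed-form expression, each alternative evaluates instantly, which is what makes this style of exploration practical. What matters in such a comparison is the relative ranking of the options, not the absolute figures.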
In conjunction with the analytic models, we develop trace-driven models that explore the behavior of critical architectural elements. These traces are address reference traces collected on real machines running workloads that represent the target market for these servers. The traces are necessarily large, since they must capture the long-term (~1 minute) behavior of the workloads, and each trace requires many gigabytes of storage. Today's servers are compared on their performance running benchmarks that represent database activity (TPC and SAP) and Web server activity (SPECweb). We derive our traces from machines running these benchmarks.
Figure 1 - Architectural development flow. The architectural design flow employs extensive C modeling as well as analysis to produce an optimum implementation.
The trace-driven models are used to parameterize the workloads represented by the traces. The models parse the traces into transactions, which are then evaluated by the models to generate parameters such as hit/miss ratios, replacement rates, copy back rates, invalidation rates, and update success rates. In addition, we characterize the workloads to develop appropriate setup criteria for an analytic model. For example, the address traces are characterized to provide statistical numbers such as cycle mix, cycle ratios, traffic density profiles, and arrival rate distributions. We use in-house "C" tools to perform the characterization.
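To make the idea concrete, here is a minimal sketch of a trace-driven model: it replays a list of addresses through a set-associative cache with LRU replacement and reports hits, misses, and replacements. It is written in Python for brevity and is not the in-house tooling; the function name, the LRU policy, the 64-byte line size, and the toy trace are all assumptions made for illustration.

```python
from collections import OrderedDict


def simulate_cache(trace, num_sets, ways, line_bytes=64):
    """Replay byte addresses through a set-associative LRU cache and count events."""
    # One ordered dict per set; insertion order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    stats = {"hits": 0, "misses": 0, "replacements": 0}
    for addr in trace:
        line = addr // line_bytes
        index = line % num_sets
        tag = line // num_sets
        entries = sets[index]
        if tag in entries:
            stats["hits"] += 1
            entries.move_to_end(tag)         # now the most recently used line
        else:
            stats["misses"] += 1
            if len(entries) >= ways:
                entries.popitem(last=False)  # evict the least recently used line
                stats["replacements"] += 1
            entries[tag] = True
    total = stats["hits"] + stats["misses"]
    stats["hit_ratio"] = stats["hits"] / total if total else 0.0
    return stats


# Toy trace (made up); a real trace would hold many gigabytes of references.
toy_trace = [0x0000, 0x0040, 0x0000, 0x1000, 0x2000, 0x0000]
two_way = simulate_cache(toy_trace, num_sets=1, ways=2)
four_way = simulate_cache(toy_trace, num_sets=1, ways=4)
print(two_way)
print(four_way)
```

Running the same trace with different values of `ways` or `num_sets` shows how parameters such as the hit ratio and replacement count respond to associativity and size. Those outputs are the kind of parameter that feeds an analytic model.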
The analytic model then uses these parameters, in conjunction with the characterization data, to generate performance estimates for each of the key workloads. Within each architecture under evaluation, we further explore the performance advantages of elements such as the associativity of a cache, size of cache, and update policy. We then must examine each of the elements that lead to a significant performance advantage to understand the complexity of implementation and verification. Elements may be very difficult to design and implement, leading to an unacceptable schedule, or conversely may be simple to design and implement but extremely difficult to verify, leading to an equally unacceptable schedule. Our process here involves balancing the benefits of innovative architecture against the associated schedule risks, and balancing the performance benefits of aggressive structures, such as larger, more associative caches, against increased power and cost.
The performance projections would be more accurate if we created a cycle-accurate model of the proposed architecture. Constructing such a model is a difficult and time-consuming task and best serves to validate a particular architecture rather than as an exploration tool to rapidly evaluate architectural options. A cycle-accurate model runs relatively slowly on a multi-gigabyte transaction trace, which prevents us from using the "what if" technique in exploring architectural options. At our current level of abstraction, we can create models quickly and take just minutes to execute the models that parameterize the address trace data sets. When we evaluate the merits of competing solutions, the absolute accuracy of the projections is less critical than the relative accuracy.
Modeling at a behavioral level allows us to match the level of abstraction to the objective. The objective is to evolve an architecture that will meet the performance and functionality goals, with the confidence that it can be designed, implemented, verified, mapped to gates, and timed, within a schedule window.
What a piece of work is design
When we have an architecture that meets the performance and functional objectives, we partition the functionality into modules. We reuse some of the blocks from previous designs and enhance existing internal IP as well. Based on experience and calculation, we estimate a size for each of the blocks. We create a floor plan based on these size estimates, then globally route the design (see Figure 2).
Figure 2 - Floorplan. Behavioral modeling and the use and reuse of verified blocks allow us to floorplan with confidence.
At this stage of the design cycle we feel confident that the architecture will compete in the areas of functionality and performance. We also enjoy increased confidence that the implementation will achieve the cost, size, and schedule goals. For each of the modules we possess timing and size constraints, port lists, and a specification of the module's functionality. Based on lessons learned at the architectural stage, we also create white papers for each module detailing interesting test cases.
We have previously considered creating C models for all the submodules and assembling them to form a behavioral model of the micro-architecture, which would be cycle-accurate at the external and intermodule boundaries. The appeal of a C model that is cycle-accurate at the boundaries lies in its high speed. A C model could yield many thousands of cycles per second once it's compiled to the native code of the workstation, compared with only tens of cycles per second for an RTL simulator. The available C programmers far outnumber the Verilog engineers. Consequently, C modeling of the design implementation is common.
The practical benefits of a cycle-accurate C model would include increased throughput, enabling faster discovery of design errors. A C model that is cycle-accurate at the inter-module boundaries could also double as a test harness for module-level verification.
These benefits are attractive, but it takes considerable time to implement a cycle-accurate C model of a complex architecture. Our existing strategy is to divide and conquer. The first pass is the high-level behavioral modeling used to evolve the architecture. The second pass is the RTL modeling used to verify correctness of both the design and the implementation. We don't see sufficient benefit in extending this two-phase methodology into the three-phase methodology created by adding a C model of the design. We like our two-phase analytic modeling approach because it yields a solid architecture. Even though a C-level model of the implementation would outpace the RTL, it would remain slower than the analytic model and unsuitable both for our "what if" method and for analysis of the large trace database.
A C model would uncover design errors more quickly than an RTL event simulation. However, we must simulate the RTL to uncover implementation errors and we believe that it's more efficient to verify the RTL for both design and implementation errors simultaneously. For example, a "write ordering" problem may result from a design error or an implementation error. A C model or RTL simulation could uncover a design error such as a misinterpretation of the coherency protocol. However, only RTL simulation could uncover a "write ordering" problem stemming from an implementation error (for example, logical AND versus bit-wise AND).
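The logical-versus-bit-wise slip is easy to show in miniature. In Verilog the two operators are `&&` and `&`; the sketch below uses Python's `and` and `&` as a stand-in. The grant function and its mask arguments are hypothetical and are not drawn from the NVS design.

```python
def grant_bitwise(request_mask: int, enable_mask: int) -> bool:
    """Intended behavior: grant only if some bit is set in both masks."""
    return (request_mask & enable_mask) != 0


def grant_logical(request_mask: int, enable_mask: int) -> bool:
    """Buggy variant: true whenever both masks are nonzero, even with no shared bit."""
    return bool(request_mask and enable_mask)


# Stimulus where the two versions agree, so the bug stays hidden.
print(grant_bitwise(0b11, 0b01), grant_logical(0b11, 0b01))

# Stimulus where requester 1 asks but only requester 0 is enabled: they differ.
print(grant_bitwise(0b10, 0b01), grant_logical(0b10, 0b01))
```

The two versions differ only for particular input combinations, so the error surfaces only when the stimulus happens to exercise one of them. A reference model written separately from the RTL would not contain the slip at all, which is why this class of error has to be found in the RTL itself.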
The RTL is our golden model, carried forward and mapped into gates, while a C design model, which can't be carried forward, is a dead end. Using a C model as a reference model could potentially help us at module-level verification. However, a C model is likely to be subject to the same specification errors and design errors as the RTL model. In general, spec-based verification, or verification based on comparison to an alternate interpretation of the spec, isn't sufficient, and a C model is just such an alternate implementation of the spec. We don't see that a C model of the design, which is accurate at intermodule boundaries, would enhance our functional coverage.
In general, simulation throughput is only one part of the equation. Typically, schedule depends on test creation time, simulation throughput, debugging time, and fix implementation time. Historically, test design and test implementation created the biggest time sink for us. We have started using a commercial test-bench tool that has automated these tasks. By automating the test creation we more than compensate for the runtime penalty of RTL simulation.
The crucible of silicon
Our goal at system-level verification is to confirm the correctness of both the design and the implementation and to find and fix all functional defects. We use the iControl tool, from iMODL in San Jose, for system-level verification. Using a commercial tool enables us to focus on the value we add, which is the design. If we created an in-house test bench for a large server it would demand almost as much effort as the creation of the design itself.
The verification environment for a multiprocessing server must provide a stimulus stream to represent each of the processors and the I/O buses (see Figure 3). In general, the stimulus streams are independent and asynchronous from a transaction point of view. However, to target specific scenarios (transaction collisions or queue full conditions, for example) we must coordinate and synchronize these concurrent streams. We use continuous protocol monitors to verify that the design conforms to the bus protocol of the target processor at all times. Our complete verification environment is complex, but we must add the complexity incrementally. When we bring up a design we start with one processor and add processors as the design stabilizes. We use iControl for the test bench; we use some of the iMODL bus-functional models (BFMs) and monitors to represent the processors and to check our design for compliance with the bus protocol. This allows us to modularly expand the size and complexity of the test bench. To add an extra processor, we simply instantiate an extra processor BFM and connect it to the appropriate port in the design; iControl takes care of the rest. Being able to increase the complexity of the verification environment without having to restructure the test bench avoids dead time in the integration phase. Having the test bench implemented in C minimizes the simulation load, which is key for large N-way simulations.
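The structural idea of growing the bench one processor at a time can be sketched in a few lines. Everything below is hypothetical: the class names, the transaction format, and the single alignment rule checked by the monitor are invented for illustration and bear no relation to the iControl or iMODL interfaces.

```python
class ProcessorBFM:
    """Hypothetical bus-functional model: emits bus transactions, executes no code."""

    def __init__(self, port: int):
        self.port = port

    def transactions(self, addresses):
        for addr in addresses:
            yield {"port": self.port, "op": "read", "addr": addr}


class ProtocolMonitor:
    """Hypothetical always-on checker; the one rule (line alignment) is illustrative."""

    def __init__(self, line_bytes: int = 64):
        self.line_bytes = line_bytes
        self.violations = []

    def check(self, txn):
        if txn["addr"] % self.line_bytes != 0:
            self.violations.append(txn)


class SystemBench:
    """Holds any number of BFMs; adding one requires no restructuring."""

    def __init__(self):
        self.bfms = []
        self.monitor = ProtocolMonitor()

    def add_processor(self) -> ProcessorBFM:
        bfm = ProcessorBFM(port=len(self.bfms))  # next free port
        self.bfms.append(bfm)
        return bfm

    def run(self, addresses) -> int:
        """Drive every BFM's stream past the monitor; return transactions issued."""
        issued = 0
        for bfm in self.bfms:
            for txn in bfm.transactions(addresses):
                self.monitor.check(txn)
                issued += 1
        return issued


bench = SystemBench()
bench.add_processor()
one_way = bench.run([0x0000, 0x0040])
bench.add_processor()                 # bring up a second processor later
two_way = bench.run([0x0000, 0x0040])
print(one_way, two_way, len(bench.monitor.violations))
```

In this sketch the second processor is added with a single call and the existing stimulus and checking code are untouched, which is the property the paragraph above describes. A real bench would also interleave and synchronize the streams to force collisions, which this sketch does not attempt.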
Figure 3 - Multiprocessor block diagram. Low simulation overhead enables N-way system simulation for a design under test.
We create complex system-level traffic scenarios at a high level using the iMODL tool. We can duplicate traffic patterns we observed at the architectural stage, or generate completely new traffic scenarios using the templates that iControl provides. During the simulation run, iControl implements the traffic scenario automatically by generating self-checking transaction sequences. This automation enables us to create multiple scenarios and launch parallel verification runs on multiple workstations. We aren't limited by our ability to create new functional tests.
For this reason we haven't been constrained by the defect discovery rate. We launch a battery of verification runs overnight; while we sleep iControl is creating new tests to increase the functional coverage and identify system-level errors (see Figure 4). To monitor and control the verification process we categorize failures into a number of different error classes: design, implementation, syntax, documentation, environment, simulation, and other.
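A tally over these classes takes only a few lines. The sketch below uses the seven class names from the text; the record format, the run identifiers, and the sample failures are made up for illustration.

```python
from collections import Counter
from enum import Enum


class ErrorClass(Enum):
    DESIGN = "design"
    IMPLEMENTATION = "implementation"
    SYNTAX = "syntax"
    DOCUMENTATION = "documentation"
    ENVIRONMENT = "environment"
    SIMULATION = "simulation"
    OTHER = "other"


def error_distribution(failures):
    """Count failures per class; classes with no failures appear with a count of 0."""
    counts = Counter(error_class for _, error_class in failures)
    return {error_class: counts.get(error_class, 0) for error_class in ErrorClass}


# Hypothetical overnight results: (run identifier, assigned class).
overnight = [
    ("run-001", ErrorClass.DESIGN),
    ("run-002", ErrorClass.IMPLEMENTATION),
    ("run-003", ErrorClass.ENVIRONMENT),
    ("run-004", ErrorClass.IMPLEMENTATION),
]

for error_class, count in error_distribution(overnight).items():
    print(f"{error_class.value:15s} {count}")
```

Reporting every class, including those with zero failures, keeps successive reports directly comparable as the distribution shifts over the design cycle.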
And all the people merely testers
Figure 4 - Design stability. As the design cycle progresses we chart the discovery rate in terms of simulation cycles run, the fix rate, and the error distribution among the classes.
We have three categories of functional tests for system-level debugging: regression tests, deterministic tests, and validation tests. The regression suite consists of test cases designed to verify bug fixes and classes of related failure mechanisms. The deterministic test suite covers basic operability of all functional primitives and consists of around 170 individual self-checking tests. This deterministic test suite runs rapidly and is part of the database check-in process for bug fixes. The system validation tests are automatically generated from high-level descriptions of general traffic scenarios. These automated system validation tests are the mainstay of our verification and presilicon validation strategy.
For a server design it would be humanly impossible for us to conceive all of the possible test cases. Nor do we have sufficient manpower or schedule to code and debug all the test cases we could dream up. A random generation tool would help. However, the set of legal transaction sequences for a multi-processor server design is much smaller than the set of possible input sequences, so purely random stimulus would produce many illegal sequences and the false errors that go with them. It's likely that separating false errors from real errors would diminish the advantage of the random exploration of all combinations and permutations. We use the iMODL tool because it offers automated generation and direct leveraging of architectural knowledge. We can create targeted validation suites because the tool can understand enough about the architecture to generate and verify protocol-compliant sequences. Invariably, an automated tool, with the ability to leverage architecture, generates interesting test cases that we wouldn't have even considered. This coverage has enabled us to break through false plateaus in the defect rate and to extrapolate the plot of the defect rate against the log of actual cycles to predict design stability.
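One simple way to perform such an extrapolation is a least-squares line through defect counts plotted against the logarithm of cycles run. The sketch below assumes that cumulative defects found grow roughly linearly with log10 of simulation cycles; that functional form, the function names, and the sample history are assumptions for illustration, not data or methods reported here.

```python
import math


def fit_defects_vs_log_cycles(points):
    """Least-squares fit of cumulative_defects = a + b * log10(cycles)."""
    xs = [math.log10(cycles) for cycles, _ in points]
    ys = [float(defects) for _, defects in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # defects gained per tenfold increase in cycles
    a = mean_y - b * mean_x
    return a, b


def predict_defects(a, b, cycles):
    """Extrapolate the fitted line to a given number of simulation cycles."""
    return a + b * math.log10(cycles)


# Made-up history: (simulation cycles run, cumulative defects found).
history = [(1e4, 10), (1e5, 20), (1e6, 30), (1e7, 40)]
a, b = fit_defects_vs_log_cycles(history)
projected = predict_defects(a, b, 1e8)
print(f"slope per decade: {b:.2f}, projected defects at 1e8 cycles: {projected:.1f}")
```

In this framing, a slope that keeps shrinking as more decades of cycles are run suggests the design is stabilizing. A flat stretch followed by a jump is the kind of false plateau that new test scenarios are meant to expose.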
At the architectural stage, both the stimulus and the design are modeled behaviorally, so we parse the traces into transactions and run them on a trace model. At system-level verification, the stimulus is behavioral and the design is in RTL. We use BFMs to bridge the gap between the behavioral stimulus and the RTL design. BFMs allow the creation of interesting transaction sequences independent of whether a real application program could ever achieve the same I/O transaction density or timing. For example, the verification of "deferred response ordering" requires an access to a cache line with an outstanding deferred phase. In this way BFMs compress simulation time and enable efficient use of the available verification cycles. If we used full functional models or instruction set simulators for the processors we would incur a greater simulation overhead and the chances of having access collisions would be very small.
For many projects the product has to be designed twice, once at the architectural stage and once at the RT level. The key to effective modeling is to identify all the questions that you want answered by the model and to narrow the focus of the model accordingly. The winning strategy is to divide and conquer: to identify the goals at each level of abstraction and to eliminate overlap.
The value derived from a model is only as good as the stimulus applied to the model. Test-bench creation and stimulus creation require time and effort. These expenses are sometimes hidden and absorbed into design activity. It's important to recognize all the verification costs up front and allocate the dollars wisely. Anything that can be done to accelerate the schedule and the completeness of the validation will pay dividends.
Robert Quinn is CEO of Network Virtual Systems in San Jose. NVS architects and designs scalable server chipset solutions for the OEM market.
Copyright © 2000 Integrated System Design Magazine