Modern processor design demands more than just a good model.
By Scott Taylor and Nicholas Jamba
As memory architectures become more advanced, the communication protocols used within them become more complicated. In processor design specifically, our engineering team at Compaq (Shrewsbury, MA) found that modeling Rambus DRAM (Rdram) devices in a system simulation environment posed new challenges for our memory controller design process. One of the more important facets of our Rdram design experience was the adoption of memory subsystem models supplied by third-party vendors. After a great deal of work, we concluded that these models can increase the productivity of a design team, but only if the models are accurate, detailed, and well supported. The following description details our successful use of this approach to verify the functionality and timing of our Alpha 21364 microprocessor's memory controller.
The nimble Rdram
Standard DRAM devices receive complete commands and don't maintain much internal state. Rdram devices, however, consist of a memory block containing sense amps and banks, and a control block that decodes and executes commands to access the core data. These commands are in the form of packets containing fields for reading and writing the memory core, precharging and activating banks, and controlling on-chip power-conservation modes. Command packets are transmitted on time-multiplexed row, column, and data buses.
Additional control packets can also be sent on a slower serial bus to modify the internal control registers of the Rdram device during initialization.
The Rambus architectural specification allows for many different types of Rdram devices, each with a different core-access time and core organization of banks, rows, and columns. Multiple Rdram parts are grouped together to form a Rambus channel. Signals are then propagated down each Rambus channel to all of the Rdram devices, each of which decodes the packets to determine whether a packet is destined for it. The control complexity and timing of the Rdram architecture make verification of an Rdram memory controller a significant problem.
Our goal in simulating an Rdram environment was to model each device as accurately as possible, while still maintaining a high level of simulation performance. A primary concern was to be able to model all conceivable Rambus implementations that could be manufactured under the Rambus architectural specification. Actual Rambus parts being manufactured at the time didn't necessarily correspond to the types of parts that might exist during the lifetime of the microprocessor. Therefore, we intended to design our Alpha 21364 processor to support any possible future implementation. The simulation environment had to be flexible and allow the specification of different randomized configurations at run time.
Meeting the goals
Random configurations fell into one of two categories: the configuration tool could randomly choose from a set of pre-defined configurations that corresponded to actual parts being manufactured by various companies; or the tool could generate a completely new configuration by randomizing the individual Rambus parameters. This second type of device would still obey the Rambus architectural specification. Randomization of the second type was weighted to favor proposed future generations of Rdram devices.
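The two categories of random configuration described above can be sketched as follows. This is a minimal illustration, not the tool we used: the parameter names, value ranges, weights, and part names are all hypothetical and are not taken from the Rambus specification.

```python
import random

# Hypothetical pre-defined parts; names and values are illustrative only.
PREDEFINED = [
    {"name": "part_a", "banks": 16, "rows": 512, "cols": 64},
    {"name": "part_b", "banks": 32, "rows": 1024, "cols": 128},
]

# Hypothetical legal values per parameter, paired with weights that
# favor the larger ("future generation") values.
LEGAL = {
    "banks": ([16, 32, 64], [1, 2, 3]),
    "rows": ([512, 1024, 2048], [1, 2, 3]),
    "cols": ([64, 128], [1, 2]),
}


def pick_configuration(rng, p_predefined=0.5):
    """Return either a known part or a freshly randomized legal part."""
    if rng.random() < p_predefined:
        # Category 1: choose one of the pre-defined configurations.
        return dict(rng.choice(PREDEFINED))
    # Category 2: randomize each parameter independently, with weighting.
    config = {"name": "randomized"}
    for param, (values, weights) in LEGAL.items():
        config[param] = rng.choices(values, weights=weights, k=1)[0]
    return config


if __name__ == "__main__":
    rng = random.Random(21364)  # fixed seed so a failing run can be replayed
    for _ in range(3):
        print(pick_configuration(rng))
```

Seeding the generator explicitly is one way to keep a randomized run reproducible when a failure needs to be debugged.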
We also wanted the option to simulate the full reset and initialization functionality of Rdram devices. This was necessary to test the processor's interactions with the devices during system boot. It was expected, however, that most processor bugs wouldn't fall within this narrow area of functionality. As a result, we wanted to avoid the lengthy serial I/O initialization process for the majority of processor simulations. This meant that the memory model needed to support two types of initialization: via the normal serial I/O process, and via a "model magic" mechanism wherein the devices were initialized by a non-hardware "back-door" mechanism under the control of the simulator. By using this model magic, the simulation performance for the vast majority of simulations wouldn't be affected, but accurate boot-time behavior could be attained at the user's discretion.
The Rdram channel has many complex timing constraints and clock domains. The CPU's memory controller design must be aware of these constraints and be able to configure the channel correctly. The memory controller therefore contains several FIFOs and queues to maintain data ordering and timing. These functional blocks can be tested only under conditions where the memory model exhibits the proper timing behavior (for example, wire delays, clock skews, and CPU-to-Rambus clock ratios). This functional verification requirement means that the memory model must contain accurate timing information. As with the serial initialization sequence, this accurate timing model would ultimately impact simulation speed. Therefore, it was decided that such a model would only be used for a subset of simulations that were specifically testing the associated processor logic.
Additional back-door mechanisms were needed to allow the simulation to easily pre-load, examine, or modify memory contents. In addition, we needed a complete set of debugging tools, including graphical displays of packet communication, log files of Rdram core accesses, and the ability to query the state of the Rdram devices. The environment also required protocol checkers that would watch the packet communication with the Rdram and report any architectural timing violations in the Rambus packet interactions. Finally, the Alpha 21364 processor was designed to exist in a multiprocessor system that uses more than one channel per CPU. The simulation environment would need to track the different Rambus channels and map them to the correct processor pins.
Model development
Different processors could conceivably use different types of Rambus devices (varying device memory size and number of devices per channel). This meant that a simulation must be able to instantiate different types of Rambus devices in a given simulation run (see Figure 1). When we started development of our Alpha 21364 processor simulation model, we began by internally developing a Rambus model. Because of our proprietary simulation language, it was believed that interfacing the model to third-party vendor software would be impossible. Unknown to Compaq engineers, Denali Software, Inc. (Palo Alto, CA) was also developing a model of the Rdram architecture with many of the same simulation goals in mind.
Figure 1 - Alpha Rambus memory subsystem. The multiple channels within our processor required a simulation environment that could map all channels to their corresponding pins.
We had the responsibility of developing Compaq's Rambus model. It was designed to meet the memory model goals listed above. There were also secondary goals related to interoperability of the model with the existing Alpha simulation environment. The simulator was built on a C++ cycle-based simulation engine developed internally by our Alpha CAD group. The model was completed and used extensively in the initial testing of the Alpha 21364 microprocessor.
Denali markets its Rdram model as part of its Memory Modeler platform. Built to cover an increasingly wide array of memory architectures, Memory Modeler is delivered as pre-compiled C objects on Unix and NT, and co-simulates with commercial Verilog, VHDL, and HW/SW simulators. In addition, via the Yukon event-driven API, Memory Modeler can be integrated into proprietary simulation environments, such as that used at Compaq. Denali delivers one parameterized class-based model per unique memory architecture. The class is then instantiated as a particular memory component by having its parameters defined in a SOMA file.
Normally, the SOMA file is constructed through the Denali Memory Maker tool. Denali and several memory manufacturers offer these SOMA files on their web sites. Since the SOMA file is saved as an ASCII file, Compaq was able to readily generate random SOMA files on the fly to represent any possible Rdram device (see Figure 2).
As part of its modeling technology, Denali built runtime switches into its architecture for running the models in either pure functional or full-timing mode. Cycle-based simulators can then integrate directly with the Yukon interface. For users of commercial HDL simulators, Denali delivers the simulation tool already integrated. Other than this limited amount of code customized for each simulator, the exact model code is shared by all Denali users. All of the memory models share a back-door data interface in C. We used this API for data-driven verification where the contents of the memories could be set, queried, overridden, and checked throughout the simulation.
DAC spreads the word
At the DAC '98 conference and trade show, we became aware of the existence of Denali and its memory modeling tools. Several meetings were set up to discuss the possibilities of using their models.
Figure 2 - Use of SOMA in Denali simulation environment. The SOMA file defines the parameters of a specific memory component.
It became evident that there were several advantages to be gained by adopting Denali's Rdram model and discontinuing our internal model development. Denali has engineers devoted full-time to the development of memory models. They would maintain, debug, and extend the model as problems were found or as the Rambus architecture evolved. This freed up Compaq resources to handle other tasks. Additionally, Denali's C-based model is reused across both Verilog and VHDL simulators and across different simulation environments, including event-driven, cycle-accurate, full-timing, HW/SW, and proprietary simulators. Denali's model was already completed, while our model still lacked certain functional components such as the serial line interface. Finally, the Denali model had more comprehensive assertion checks and timing checks than our model.
Changes to simulation models
The Alpha simulation environment is based on a cycle-based simulation engine. The Denali-supplied model is based on an event-driven engine. This resulted in a small amount of re-work of our model-interface code to connect the two models. The mechanism by which random configurations were generated also had to change. The environment needed to generate pseudo-random SOMA specification files to configure the Denali model. This file would then be loaded into the simulator at run-time to configure the memory subsystem.
The Denali simulation model supports three-state memory data. Our initial simulation model was only a two-state model, so unknown bit states had to be converted into a recognizable two-state value. Denali provided a simulation callback hook to allow the Compaq model to override the Rambus data with a value of our own choosing.
Figure 3 - Pseudophase model. The model measures small time increments by splitting the cycle into two or more high and low phases.
A specific data pattern was chosen to represent an unknown data element. This pattern could then be detected during simulation and trigger an error. Another difficulty arose when managing multiple Rambus channel instances. Each CPU in a system can support more than one Rambus channel, and there can be multiple CPUs in a single simulation. Pins from each processor had to be tied to specific channel instances. This required a set of data structures to manage the collection of Denali Rambus instances and track which instances belonged to a particular CPU instance.
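The unknown-data handling described above can be sketched as follows. The sentinel value, the word width, and the function names are hypothetical, chosen only for illustration; the sketch does not reproduce the Denali callback interface.

```python
# Hypothetical sentinel pattern and word width; the actual values used
# in our environment are not given here.
UNKNOWN_SENTINEL = 0xDEADBEEF
WIDTH = 32


def to_two_state(bits):
    """Convert a string of '0', '1', and 'x' characters to an integer.

    If any bit is unknown, the whole word is replaced by the sentinel,
    because a two-state model cannot carry a per-bit unknown.
    """
    if len(bits) != WIDTH:
        raise ValueError("expected %d bits, got %d" % (WIDTH, len(bits)))
    if any(b in "xX" for b in bits):
        return UNKNOWN_SENTINEL
    return int(bits, 2)


def check_read(value):
    """Flag an error when the processor consumes the sentinel pattern."""
    if value == UNKNOWN_SENTINEL:
        raise RuntimeError("read of unknown (uninitialized) memory data")
    return value


if __name__ == "__main__":
    print(hex(to_two_state("0" * 28 + "1010")))
    print(hex(to_two_state("x" * 32)))
```

One trade-off of this approach is that legitimate data equal to the sentinel would be reported as an error, so the pattern should be one that tests are unlikely to write.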
The Denali model library also needed a few changes. The Alpha simulation environment supports save and restore of the entire model state. Denali added hooks to its model to allow its state to be written to file as a part of this save/restore process. Denali also added several debugging hooks and logfile messages to support the Compaq simulator.
Increasing model performance
One of our primary concerns with adopting a third-party tool was its effect on overall simulation performance. We were unwilling to sacrifice simulation speed due to changes in a memory model implementation, so there was considerable work done to improve the overall performance of the combined Compaq/Denali code base.
The most dramatic enhancement provided in the Rdram model (Denali markets this as the Turbo option) was accomplished by collapsing a set of models representing individual devices into a single model representing a complete channel. By modeling a complete channel at once, the pin activity to be communicated between the Compaq simulator and the Denali model was reduced by a factor equal to the number of devices on the channel.
Furthermore, the Rambus architecture guarantees that the commands driven on the input ROW, COL, and DQ buses propagate in the same order across all devices on a channel. The per-channel approach took advantage of this by parsing the incoming pin activity once and then applying the resulting Rambus command(s) against each unique, individual device representation. The per-channel model characteristically provides an order of magnitude improvement compared to the per-device model, and in the Compaq environment, overall simulation improvements of over 20 percent were consistently measured.
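The difference between the per-device and per-channel approaches can be sketched as follows. This is an illustrative toy, not Denali's implementation: the `Decoder` and `Device` classes and the `(device, op)` pin encoding are assumptions made for the example.

```python
class Decoder:
    """Toy packet decoder that counts how often it is invoked."""

    def __init__(self):
        self.calls = 0

    def decode(self, pins):
        # Hypothetical encoding: pin activity is a (device, op) tuple.
        self.calls += 1
        device, op = pins
        return {"device": device, "op": op}


class Device:
    """Toy device that records the commands addressed to it."""

    def __init__(self, device_id):
        self.device_id = device_id
        self.executed = []

    def apply(self, command):
        if command["device"] == self.device_id:
            self.executed.append(command["op"])


def per_device_step(devices, pins, decoder):
    # Every device model parses the same pin activity on its own.
    for dev in devices:
        dev.apply(decoder.decode(pins))


def per_channel_step(devices, pins, decoder):
    # The channel parses the pin activity once, then fans the command out.
    command = decoder.decode(pins)
    for dev in devices:
        dev.apply(command)


if __name__ == "__main__":
    slow, fast = Decoder(), Decoder()
    per_device_step([Device(i) for i in range(8)], (3, "ACT"), slow)
    per_channel_step([Device(i) for i in range(8)], (3, "ACT"), fast)
    print(slow.calls, fast.calls)
```

With eight devices on the channel, the per-device path decodes the packet eight times and the per-channel path decodes it once, while each device ends up in the same state.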
During each simulation cycle, the processor may be reading or writing data to any number of pins on various Rambus channel instances. Denali supplies mechanisms to access specific Rambus devices or pins by name, but this involved costly string compares within the Denali library. Instead, we improved the data structures that track Denali instances. The structures were expanded to allow table-style lookups of CPU, channel, device, and pin via numeric ID numbers. This table was calculated once at simulator initialization time, and resulted in additional run-time speedups of approximately 2 percent.
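A table-style lookup of this kind can be sketched as follows. The instance naming scheme, the `Handle` class, and the pin names are hypothetical; the by-name dictionary merely stands in for the library's name-based access.

```python
class Handle:
    """Stand-in for a model pin handle; the real handle type differs."""

    def __init__(self, name):
        self.name = name
        self.value = 0


def build_tables(n_cpus, n_channels, n_devices, pin_names):
    """Build the lookup tables once, at simulator initialization.

    Returns (by_name, by_id), where by_id[cpu][channel][device][pin]
    yields the same handle that the name-based lookup would.
    """
    by_name = {}
    by_id = []
    for cpu in range(n_cpus):
        channels = []
        for ch in range(n_channels):
            devices = []
            for dev in range(n_devices):
                pins = []
                for pin in pin_names:
                    # Hypothetical naming scheme for illustration.
                    handle = Handle("cpu%d.ch%d.dev%d.%s" % (cpu, ch, dev, pin))
                    by_name[handle.name] = handle
                    pins.append(handle)
                devices.append(pins)
            channels.append(devices)
        by_id.append(channels)
    return by_name, by_id


if __name__ == "__main__":
    by_name, by_id = build_tables(2, 2, 4, ["ROW", "COL", "DQ"])
    # Per-cycle access uses numeric indices only; no string handling.
    handle = by_id[1][0][2][1]
    print(handle.name)
```

The per-cycle path then costs a few index operations, and the string work is paid only once when the table is built.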
Realism and pseudophase simulations
A cycle-accurate memory model is a good first step, but our engineers desired more accuracy in a number of areas. They wanted to test setup and hold times on data. They wanted realistic timing delays on data from different Rambus devices. Multiple clock domains and skews were also an issue. Several methods were developed to address these issues in a cycle-based simulation environment.
The most critical problem to overcome was measuring small time increments in a cycle-based simulator. Our simulation engine supports a concept of pseudophases per simulation cycle. For most simulations, we want two phases per cycle (a high phase and a low phase); these form the basis for the simulator's clock. It's possible, however, to support more than two phases. In the case where the simulator supports eight pseudophases, the phases can be mapped to the original clock by creating a derived clock, which is high for the first four phases, and low for the last four phases. This yields four time units per clock phase to handle detailed timing analysis. The simulator can then drive data with reasonable setup and hold times to a CPU clock edge and receive data from the event-driven Denali model at the correct times (see Figure 3). If we are simulating, for example, a 1-GHz processor with eight pseudophases, this corresponds to a time resolution of 125 ps, which is more than adequate for Rambus timing parameters.
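The arithmetic above can be checked with a short sketch. The function names are ours, introduced only for this illustration.

```python
def derived_clock(phase, phases_per_cycle=8):
    """Return 1 during the first half of a cycle's pseudophases, else 0."""
    return 1 if (phase % phases_per_cycle) < phases_per_cycle // 2 else 0


def resolution_ps(cpu_freq_hz, phases_per_cycle):
    """Time represented by one pseudophase, in picoseconds."""
    period_ps = 1e12 / cpu_freq_hz
    return period_ps / phases_per_cycle


if __name__ == "__main__":
    # Eight pseudophases: high for phases 0-3, low for phases 4-7.
    print([derived_clock(p) for p in range(8)])
    # A 1-GHz clock has a 1000-ps period, so each pseudophase is 125 ps.
    print(resolution_ps(1e9, 8))
```

With two pseudophases the same function reduces to the ordinary high/low clock, which is why the common case needs no special handling.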
A second reason for pseudophase simulation became apparent as we were designing the logic to connect the Rambus clocking domain to the CPU's clocking domain. Since the two domains are independent, a "gearbox" mechanism was needed to pass data between the domains. A series of clock ratios was devised to closely match Rambus speed bins with expected speed bins of the 21364 processor. The closest matches were often non-integer clock ratios (for example, 2.5 CPU cycles to one Rambus clock cycle). This requires driving data partway through a CPU clock cycle.
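To see why a non-integer ratio needs sub-cycle resolution, the following sketch places successive Rambus clock edges on the pseudophase grid. It assumes eight pseudophases per CPU cycle and the 2.5 ratio from the example above; the function name is ours.

```python
from fractions import Fraction


def rambus_edges(ratio, phases_per_cycle, count):
    """Return (cpu_cycle, pseudophase) for successive Rambus clock edges.

    ratio is the number of CPU cycles per Rambus clock cycle.
    """
    step = Fraction(ratio) * phases_per_cycle
    if step.denominator != 1:
        raise ValueError("ratio does not land on the pseudophase grid")
    edges = []
    for n in range(count):
        # Split the absolute pseudophase count into cycle and phase.
        edges.append(divmod(int(step) * n, phases_per_cycle))
    return edges


if __name__ == "__main__":
    print(rambus_edges(Fraction(5, 2), 8, 4))
    # prints [(0, 0), (2, 4), (5, 0), (7, 4)]
```

Every second Rambus edge falls at pseudophase 4, halfway through a CPU cycle, which a simulator with only whole-cycle resolution cannot represent.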
Figure 4 - Wire and clock delay. Inserting delays between the device and processor pins is necessary to consistently test the configuration registers.
In real hardware, this is achieved through the use of a delay-lock loop (DLL). This analog structure isn't possible to model in a cycle-based simulator, but the effect can be mimicked by using a derivative of the pseudophase clock as the output of the DLL for scheduling the drive of data. The pseudophase simulation could've been accomplished by simply defining a clock at the pseudophase frequency, and then dividing down that clock to yield the CPU clock. In a level-sensitive latch methodology, however, the entire RTL model is evaluated at each pseudophase, resulting in a tremendous increase in simulation time. By building pseudophase support into the simulator, it's possible to define the CPU clock latches to only evaluate on a particular pseudophase edge, resulting in much more efficient simulation.
Wire delays
It takes a finite amount of time to drive data from a Rambus device to the processor pins. This time delay increases for Rdram parts at the far end of a channel. There are a number of configuration registers on the Rdram device and on the processor to handle these timing delays such that all devices may be treated the same by the memory controller. In a zero-delay, cycle-based simulator, it would be impossible to test these registers. However, by combining the pseudophase model with additional code to inject delays between Rambus devices and the processor, it's possible to create realistic timing models of signal propagation (see Figure 4).
In the Denali model, the data delay elements were modeled by creating a fake internal Rdram register at address 0xf. This fake register could be configured to set the necessary delay time that the instance should wait before driving data onto the pins of the device. Each device could be configured with a different value, which allows the user to set increasingly larger delays on devices that are farther out on the channel. Clock delays are handled on the Compaq simulator end by introducing time delays in the clock signals before reporting the clocks to the Denali memory instances.
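The effect of per-device data delays can be sketched as follows. This is a simplified illustration with delays counted in pseudophases; the `DelayedChannel` class is hypothetical and does not model the actual register interface.

```python
import heapq


class DelayedChannel:
    """Toy channel that delays each device's data by a per-device amount."""

    def __init__(self, device_delays):
        # device_delays[i] is the delay, in pseudophases, for device i.
        self.device_delays = device_delays
        self._pending = []  # heap of (arrival, sequence, device, data)
        self._seq = 0       # tie-breaker that preserves drive order

    def drive(self, now, device, data):
        """Device starts driving data at pseudophase 'now'."""
        arrival = now + self.device_delays[device]
        heapq.heappush(self._pending, (arrival, self._seq, device, data))
        self._seq += 1

    def arrivals(self, now):
        """Return the (device, data) pairs that have reached the pins."""
        out = []
        while self._pending and self._pending[0][0] <= now:
            _, _, device, data = heapq.heappop(self._pending)
            out.append((device, data))
        return out


if __name__ == "__main__":
    # Devices farther out on the channel get larger delays.
    channel = DelayedChannel([1, 2, 3])
    channel.drive(0, 2, "far")
    channel.drive(0, 0, "near")
    for phase in range(4):
        print(phase, channel.arrivals(phase))
```

Both devices drive at pseudophase 0, but the near device's data reaches the pins at pseudophase 1 and the far device's at pseudophase 3, which is the behavior the configuration registers must compensate for.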
Wire delays also affect clock propagation. The Rambus architecture supports several clock domains on a single channel, and there is hardware in a Rambus device to lock onto the correct clock edges. The simulator must accurately model clock delays and skews between different clocks in the memory system to be able to verify the clock domain logic. This can be accomplished by modeling clock delays to and from Rambus devices, and by allowing different channels to be skewed from each other on a single processor. By varying the delay values on each channel (to simulate different etch lengths, for example), the user can create channels that return data at slightly different times.
The clock and wire delay models play a large role in the initialization sequence for a Rambus memory system. The initialization must query different devices and determine the delays and timings that the processor must adhere to for a functional system. This can only be accomplished if all of the previous timing issues have been addressed.
Processor design is needy
Today's microprocessor verification teams require additional features such as protocol assertion checkers, timing information, and wire-delay elements in order to test the full functional range of a design. Simulation models must be able to cover a wide range of memory device instances. In addition to meeting these functional requirements, such models must also be easily integrated with a customer's simulator, provide necessary simulation options, and serve to maximize the overall performance of the verification environment.
Acknowledgments
The authors would like to extend their thanks to the engineers at Rambus, Inc. (Mountain View, CA) for many long hours of discussions about the internal timings of direct Rdram devices. Without this detailed knowledge, accurate modeling of Rdram timing would have been impossible.
Scott Taylor is a senior engineer in Compaq's Alpha microprocessor verification group (Shrewsbury, MA). He has worked on the verification of several generations of Alpha processors in the areas of caches, branch prediction logic, built-in self-test/repair mechanisms, integer-operation units, and system testbench modeling. He is currently in charge of multi-processor verification for a next-generation Alpha microprocessor.
Nicholas Jamba is currently working on the verification of high-speed terabit routing ASICs at Avici Systems (N. Billerica, MA). Previously, in Compaq's Alpha development group, he worked on the verification infrastructure and modeling of external Rdrams for the 21364 Alpha microprocessor.
Copyright © 2000 Integrated System Design Magazine