With the explosive growth in Internet usage and the merging of data with voice, designs for the networking industry are going through major changes. With chip capacity reaching greater than 3 million gates, design verification has been the major bottleneck in bringing new network products to market. Fortunately, new verification technologies, such as reconfigurable computing (RCC) coprocessors, accelerate design verification over previous methods, and are helping to reduce time to market.
Networking functionality has matured to the extent that networking vendors are striving to create added value in their products by producing systems with higher port densities, higher bandwidth ports, longer packet sizes, and advanced traffic policing features to enable the integration of voice and data on a data network. Implementing all of these features drastically increases the verification effort over that required for past designs, not only in the number of tests required to verify the growing feature sets, but also in the simulation time required to complete them individually and in regression suites.
Of the methods available to boost simulation performance, RCC coprocessor technology has demonstrated the most promise in closing our verification gap. The associated ease of use, preservation of current design methodology, high performance, and debugging tools such as dynamic checking, hot swapping, and waveform extraction, have made RCC our choice to verify large system-on-a-chip (SOC) designs.
Integration equals complexity
In the past, networking devices were relatively simple. They typically consisted of a network port, a memory port, a port for CPU configuration, and one or more proprietary system ports. The main function of the devices was to deliver network data as it came off the network onto the system port. In general they provided a FIFO architecture to move the data from the network to the system in an orderly manner. Verifying these devices was a straightforward process, accomplished by generating and sending packets of various sizes at various rates through the interfaces and making sure that the packets exited the system intact at the correct destination port.
An aggressive networking system based on SOC technologies may contain dozens of network ports, each with multiple queues, a large and fast memory port with a linked-list architecture, advanced IP security features (IPsec), a system or fabric port, and one or more embedded processors or DSPs. In the past, such a system might have been developed by separate teams or companies and integrated by the networking vendor on a PCB using multiple chips, with each chip designed and verified by its own design team or company. Enabled by dense SOC technology, all of the components mentioned are now integrated into a single chip. The networking vendor may develop the individual modules in-house or obtain them as intellectual property, but after the design is integrated, the vendor's verification team must functionally verify all of the components and interfaces.
Multiple network ports and higher bandwidth requirements ultimately result in the need for more advanced and complex memory management subsystems. Instead of managing one or two data path flows as in designs of the past, this subsystem may have to manage dozens of queues and data paths, multiple control/data buffer descriptors, circular buffers, and other features. Features such as IP security with data encryption and compression need to be integrated architecturally into the data path, as do data flow algorithms to minimize the added processing time and latency. With these additional complexities comes the need to verify every combination of memory access and data path-related features of the SOC.
The use of embedded processor subsystems is also increasing. Designers program the embedded processor to perform system management and interfacing, control the parsing of packets, determine priorities, and ensure that the packets are scheduled into the appropriate queues. The programmable processor introduces the coverification methodology, which not only requires the hardware team to verify their modules, but demands that the software team, in simulation, run as much runtime application code as possible before tape-out. With the software and hardware so tightly dependent, if some feature or interface isn't verified completely, the chip may come back "dead" or, more often, either doesn't support all of the features intended or impairs the system's performance. As with any SOC project, incomplete verification leads to respins and added development costs.
One method that attempts to increase functional verification coverage employs random simulations, which randomize as many parameters of the design and test bench as possible in an attempt to discover situations not previously identified. Most random simulations use a random-number generator supplied by an HDL. Ensuring that the random simulations are truly random requires the execution of a large number of repeated simulations.
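As an illustration of how a random regression can stay repeatable, the following minimal Python sketch derives one seed per run and builds the packet stimulus from that seed, so any failing run can be replayed exactly. The function names, the packet fields, and the value ranges are assumptions made for the example; they are not taken from the design or from any particular HDL random-number generator.

```python
import random


def make_stimulus(seed, n_packets=5, n_ports=8):
    """Build a reproducible list of (port, length, marking) packets.

    The field layout and value ranges are illustrative assumptions.
    """
    rng = random.Random(seed)  # private generator: runs cannot disturb each other
    return [
        (
            rng.randrange(n_ports),              # destination port
            rng.randint(64, 1518),               # packet length in bytes (assumed range)
            rng.choice(("marked", "unmarked")),  # priority marking
        )
        for _ in range(n_packets)
    ]


def regression_seeds(base_seed, n_runs):
    """Derive one distinct seed per run so a failing run can be rerun by seed."""
    return [base_seed + i for i in range(n_runs)]
```

Logging the seed with each simulation result is what makes a long random regression debuggable: the same seed always yields the same packet sequence.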
Verifying the new
Networking vendors are implementing traffic-shaping algorithms in their devices to deliver differentiable types of service. Guaranteed bandwidth and quality of service are the cornerstone capabilities for next-generation systems. One scheme that can guarantee bandwidth is adaptive-packet marking, which uses modest support from the network in the form of priority handling for appropriately marked packets and relies on intelligent transmission control mechanisms at the edges of the network to achieve the desired throughput levels. Coupled with an algorithm called Random Early Drop (RED), this queue management scheme provides an evolutionary advancement in differentiating service. When the queue length exceeds a certain threshold, the RED scheme drops packets randomly with a given probability. The drop probability depends on the queue length and the time elapsed since the last packet was dropped. This queue management mechanism gives preferential treatment to marked packets since it assigns them significantly smaller drop probabilities than those of unmarked packets. Higher priority packets receive the guaranteed bandwidth.
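The drop decision described above can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm implemented in the design: it uses the instantaneous queue length, a linear ramp between two thresholds, and the count of packets accepted since the last drop as the "time elapsed" term, following the commonly published form of RED. The thresholds and the per-class ceilings in `MAX_P` are hypothetical values chosen only to show marked packets receiving a smaller drop probability.

```python
import random


def red_drop_probability(queue_len, min_th, max_th, max_p, since_last_drop=0):
    """Return the drop probability for one arriving packet.

    Below min_th nothing is dropped; at or above max_th everything is.
    In between, the probability ramps linearly up to max_p and grows with
    the number of packets accepted since the last drop.
    """
    if queue_len < min_th:
        return 0.0
    if queue_len >= max_th:
        return 1.0
    base = max_p * (queue_len - min_th) / (max_th - min_th)
    denom = 1.0 - since_last_drop * base
    return 1.0 if denom <= 0 else min(1.0, base / denom)


# Hypothetical ceilings: marked packets get a much smaller drop probability.
MAX_P = {"marked": 0.02, "unmarked": 0.10}


def should_drop(queue_len, packet_class, rng, since_last_drop=0,
                min_th=20, max_th=80):
    """Randomly decide whether to drop a packet of the given class."""
    p = red_drop_probability(queue_len, min_th, max_th,
                             MAX_P[packet_class], since_last_drop)
    return rng.random() < p
```

Because the decision is probabilistic, a test bench has to observe many packets at each queue depth before the measured drop rate can be compared with the intended one, which is one reason such algorithms call for long simulation runs.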
The deployment of such advanced networking algorithms makes design verification more challenging because traditional verification techniques don't cover all possible conditions. Previously, verification techniques separated individual design functions into independent tests and ran them concurrently on multiple workstations. Most network companies have adopted this technique to schedule parallel execution of multiple simulation runs on a farm of workstations. However, the increasing use of more advanced algorithms such as RED and adaptive packet marking mandates much longer simulation runs to verify the efficacy of the implemented algorithms and to discover previously unidentified cases by way of random simulations.
Time to converge
Pure data traffic places fewer constraints on packet size than voice traffic does. Combining voice and data, however, changes the dynamics of packet size, depending on what type of information is being sent. The intelligibility of packetized speech depends, in part, on packet loss rate and the time spent waiting for late packets. The acceptable packet loss rate is a function of packet size. Speech losses as high as 50 percent can be tolerated for very small packets containing 20 ms of voice data. In an Internet Protocol network with adaptive congestion control, the packet size may vary depending on network conditions.
On the other side of the packet-size spectrum is a trend toward larger packet sizes, which reduce overhead processing and increase throughput for bulk data transfers. Simulating long data packets together with packetized speech, to verify device behavior at the two extremes simultaneously, requires more simulation time and more computing resources.
Increased system complexity and verification requirements, longer random/regression simulation runs, newer high-level algorithm implementation, embedded processor coverification, and large packet size usage have all combined to severely impair verification throughput. Under the fastest compiled simulator on the fastest workstation, simulation performance of two to ten cycles per second is normal for system-level simulation. Any method that boosts simulation performance without changing design methodology is the appropriate choice.
Accelerated simulation methods
Hardware-based systems provide the best way to accelerate simulation by an order of magnitude. We evaluated three distinct hardware categories: hardware emulation, hardware acceleration, and reconfigurable computing (see the table).
The newest entry, the reconfigurable computing (RCC) coprocessor technology, uses a coprocessor that contains a massively parallel structure of computing elements specially configured for each design. A computing element is a small compact processor dedicated to performing one function, such as the simulation of Verilog RTL "case" and "if" statements.
After reviewing the various accelerated simulation technologies, we decided to try RCC. We ran accelerated simulation on Axis Systems' Xcite, a software- and hardware-based verification package that fits directly into a Sun Microsystems workstation and provides transparent access to the RCC technology. The tool includes its own compiled simulator running on the Sun microprocessor for running and verifying behavioral Verilog and C applications, and accelerates the RTL and gate-level verification with RCC technology.
The design we tested was a million-gate Verilog SOC design for a gigabit switching router. This design contains 500 Kbits of memory. The logic block consists of approximately one million ASIC gates with nine physical clocks, including some gated clock circuitry. We described the design in a Verilog RTL format and used a test bench to stimulate the chip in behavioral Verilog as well as custom C application code linked through a Verilog programming language interface (PLI).
Simulation is golden
Before we wrote any Verilog code, we generated a simulation model written in C to simulate the architecture of the network router. We simulated and verified the architecture implementation (including the number of data queues, the encryption and compression methods, and the packet priority methods) with the C reference model before proceeding with the hardware design. With the full model in place, architecture simulation verified the expected packet latency with a minimum soft guaranteed bandwidth. The architecture simulation also tested RED, with random packets being dropped based on queue length.
Because of the high complexity of this design, the architectural C model served as a reference for expected results. Thus the C model was integrated within the simulation session to detect simulation mismatches when they occurred. We generated all tests from the C environment, applying the same stimulus to the golden C model as well as to the Verilog RTL design.
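The idea of driving the golden model and the design with identical stimulus can be sketched as a small comparison loop. The Python below is only an illustration under stated assumptions: `reference_forward` stands in for the architectural C model, `buggy_forward` stands in for an RTL design with a deliberately injected fault, and the four-port forwarding rule is invented for the example.

```python
def compare_against_reference(stimulus, reference_model, dut_model):
    """Apply each stimulus item to both models; return the first mismatch or None."""
    for index, item in enumerate(stimulus):
        expected = reference_model(item)
        actual = dut_model(item)
        if expected != actual:
            return {"index": index, "stimulus": item,
                    "expected": expected, "actual": actual}
    return None


def reference_forward(packet):
    """Hypothetical golden model: forward to port (dest mod 4), payload intact."""
    dest, payload = packet
    return (dest % 4, payload)


def buggy_forward(packet):
    """Hypothetical design under test with an injected fault on port 3."""
    dest, payload = packet
    port = dest % 4
    if port == 3:
        port = 0  # injected bug: port 3 traffic is misrouted to port 0
    return (port, payload)
```

Stopping at the first mismatch mirrors the goal of detecting a divergence when it occurs, rather than discovering it in the final results of a long run.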
Figure 1 - The parallel-case detective. The dynamic checker can detect gate-level parallel-case violations at the register-transfer level during simulation.
To help debug mismatches, the architectural C model contains internal states that are matched with the hardware model. Comparing the internal design states usually reveals the cause of a mismatch, although a history of all key internal states must be captured during RTL simulation to determine why the RTL simulation reached a faulty state.
Before using the RCC simulation acceleration technology, we typically would distribute simulation jobs onto a farm of more than twenty Sun workstations and servers. These simulations sometimes didn't detect design bugs until weeks into a session. By that time, the RTL design had usually changed from the original simulation model and we had to resimulate the latest design or fix, a process requiring weeks of simulation to verify that the fix was in place. Obviously, a faster turnaround time would boost productivity.
Purifying RTL designs
In an RTL design, designers insert synthesis directives to guide the translation to gates during logic synthesis, maximizing performance and minimizing gate count. Unfortunately, they add these synthesis directives to the RTL design as comments that aren't interpreted as part of the simulation model. As a result, extra directives often lead to simulation mismatches when comparing RTL simulation results against gate-level simulation.
To identify and isolate these design problems early, at the RT level, without running either logic synthesis or gate-level simulation, a new class of tool sets can perform dynamic checking during simulation (see Figure 1). One such tool set is the Axis Xsim Xaminer. A dynamic checker offers the ability to pinpoint potential design implementation problems during RTL simulation. Compared with static design checking, the RTL dynamic checker has the built-in intelligence to simulate according to synthesis directives and automatically detect differences in simulation results. Dynamic checking can also detect design problems previously undetectable with static tools.
Figure 2 - Extracting history. RCC compression enables the on-demand extraction of waveform history information during simulation.
The tool set enabled us to easily detect and correct parallel-case violations, full-case violations, design race conditions, and reset sequences that had not been properly exercised, all at the RT level without resorting to gate-level simulation. By running the dynamic checker, we were able to isolate potential gate-level problems at the RT level. Thus, we detected design implementation issues early in the design process, eliminating costly synthesis iterations.
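To make the two case-related checks concrete, here is a minimal Python sketch of what a dynamic check amounts to each time a case statement is evaluated. A parallel-case directive tells synthesis that at most one case item can match, and a full-case directive tells it that at least one always matches; a violation is any simulated evaluation that breaks the assumption. The function names and the one-hot selector example are assumptions for illustration and say nothing about how the Axis tool is implemented.

```python
def check_case_directives(item_matches, parallel_case=True, full_case=True):
    """Return the directive assumptions broken by one case evaluation.

    item_matches holds one boolean per case item, True where that item's
    condition matches the current selector value.
    """
    hits = sum(1 for matched in item_matches if matched)
    violations = []
    if parallel_case and hits > 1:
        violations.append("parallel_case")  # more than one item matched
    if full_case and hits == 0:
        violations.append("full_case")      # no item matched
    return violations


def one_hot_matches(select, width):
    """Model a case statement with one item per select bit (hypothetical example)."""
    return [bool((select >> bit) & 1) for bit in range(width)]
```

For a three-bit one-hot selector, `check_case_directives(one_hot_matches(0b011, 3))` reports a parallel-case violation because two items match at once, a condition that would simulate one way at RTL and behave differently in the synthesized gates.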
Simulation acceleration
Compiling the design into RCC technology proved simple since the design was already simulated with the Synopsys VCS simulation environment. With some minor modification and setup, it took less than two hours to compile the design. During this time, the RCC compiler automatically determined all the RTL and gate components for mapping into RCC computing elements, set up proper communication sequences between the native compiled simulator and RCC technology, and performed placement and routing for all programmable logic devices (PLDs).
RCC performs functional simulation, which complements static timing analysis well. We were thus able to separate functional and timing verification into two steps, allowing us to focus first on obtaining correct RTL functionality, the most time-consuming part of the design process.
The lack of debugging tools and lengthy setup time made previous hardware-assisted simulation difficult to use. In contrast, RCC simulation technology blends ease of use and advanced debugging tools. RCC enabled us to hot-swap simulation states from the compiled simulator into the RCC during simulation. The swapping capability provides the best of two worlds: fast simulation execution and full-circuit debugging.
We could simulate our reset sequence within the compiled simulator, swap a simulation state into the RCC after reset had completed, accelerate simulation in the RCC, and swap back from the RCC into the compiled simulator for design debugging.
RCC can also compress all node changes and extract waveform files at any time region during simulation. With these functions, we simulated as fast as possible up to the time of a design error and then had full visibility of all node changes without needing to restart the simulation. Waveform extraction is simple and fast and costs little in simulation performance and disk space overhead. For example, one of our simulation runs ran for more than five hours before a design error occurred. We instantly extracted all the node changes in a waveform format within any time range and design hierarchy, from time zero, without restarting simulation (see Figure 2).
RCC technology dramatically boosted simulation throughput, allowing the stimuli to include long packets as well as long runs with a random packet injection/rejection ratio. Using our golden compiled simulator, we achieved a maximum of 200 clock cycles per second. In contrast, with the RCC technology producing identical results, performance reached 12,000 clock cycles per second (a 60-fold increase in speed).
With RCC, we verified the long random tests overnight, compared with the two-week turnaround time of a compiled simulator. We also added several random tests into the regression suite and still finished the complete regression run overnight.
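The arithmetic behind the overnight turnaround can be checked with a short Python sketch. It assumes, purely for illustration, that the two-week run executed at the compiled simulator's peak rate of 200 cycles per second; the actual rate of those tests is not stated, so the result is an estimate rather than a measurement.

```python
def wall_clock_hours(cycles, cycles_per_second):
    """Convert a cycle count and a simulation rate into elapsed hours."""
    return cycles / cycles_per_second / 3600.0


COMPILED_RATE = 200   # cycles per second, compiled simulator (from the text)
RCC_RATE = 12_000     # cycles per second, RCC (from the text)

speedup = RCC_RATE / COMPILED_RATE                # 60.0

# Assumption: a two-week run at the compiled simulator's peak rate.
two_week_cycles = 14 * 24 * 3600 * COMPILED_RATE  # 241,920,000 cycles
rcc_hours = wall_clock_hours(two_week_cycles, RCC_RATE)  # about 5.6 hours
```

Under that assumption, two weeks of compiled simulation shrinks to roughly five and a half hours, which is consistent with finishing the long random tests overnight.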
Verification has become one of the biggest design bottlenecks within the networking industry. The strong demand to prioritize routed packets and increase packet length has increased the simulation time needed to expose all corner-case conditions. RCC coprocessor technology provided us with an easy-to-use, speedy, and thorough method of streamlining our verification flow.
Eric Shieh is the principal at Shieh Resources, Inc. (San Jose), a consulting firm that specializes in the verification of networking-related devices.
Copyright © 2000 Integrated System Design Magazine