
cover story

Coverification Tames Cache and Bus Interface Controller Design

The highly complex interfaces in today's designs require significant attention to the verification flow. A hardware-software coverification environment allows more thorough verification.

by Steven P. Larky



As designs continue to grow more complex and time-to-market windows shrink, designers no longer enjoy the luxury of designing software and hardware independently, then spending months to debug the two. Increasingly, the two design efforts must be combined. Hardware and software coverification provides several benefits: more complete product verification, early feedback of hardware-software interactions to help refine features and functions, testing of the actual code in the logic simulator environment, and architectural verification as a simulated system.

At Anchor Chips we implemented this practice in the design of the AN3041Q Co-Mem device, a caching bus interface controller that interfaces to a variety of microprocessors and DSPs. In creating the design, we needed to verify four external interfaces: PCI bus master, PCI target, processor interface, and I²C interface. The verification was further complicated by speed differences--the PCI bus operates at 33 MHz, whereas the local bus runs at up to 40 MHz. Thus we needed to check the asynchronous interfaces between the two halves of the chip.

Designing a PCI bus interface chip
The Co-Mem device (see Figure 1), which runs software drivers as bus master and bus slave, contains approximately 100,000 logic gates and 135,000 bits of dual-port SRAM. One design constraint required the chip to be fully compatible and compliant with the PCI specification. Furthermore, the design had to accommodate design supplements, class specifications, and software coding guidelines imposed by the system designers using the chip.

System simulation combining hardware and software addressed those constraints. The involvement of our software engineers early in the hardware design gave them insight into how to structure the software interface to the hardware. In addition, they gained an early start on verifying their software architecture, design, and code. They were able to verify both the software that runs directly on the system and the interactions with the operating system of the host PC (Windows 95 or NT).

Structuring the system simulation around a test bench that closely mirrored the first development boards enabled us to verify the system architecture and get an early start on board certification.

On the PCI bus side of the device, we needed to verify that the external interface complied with the PCI specification. To perform verification, we varied the number of wait states and data availability. We used a PCI bus monitor to confirm that every cycle on the bus had the correct protocol and timing.

The PCI bus master performs five functions: filling the instruction and data caches, writing back changed data from the data cache, writing around data from the instruction cache, reading data to or writing data from the local-to-host direct access mechanism, and reading the page tables before performing the address translation from the local address to the PCI address. The PCI target interface allows the host on the PCI bus to access the shared memory, the instruction and data caches, and the internal operation registers.

The processor interface is a slave-only interface, but its numerous configurations allow direct connection to a variety of processors. We had to make address decoding very configurable on the processor interface to allow the maximum flexibility for the system designer. To test the different interface possibilities, we employed several models of the local processor side.

Figure 1 Co-Mem system block diagram

The Co-Mem device provides a cache between the local processor and the PCI bus, with main memory on the PCI bus serving as backing store for the cache.

The last external interface is the two-wire serial (I²C) interface for connecting to nonvolatile memory that contains initialization parameters. Verifying the I²C interface required us to run an initialization sequence at system reset to personalize the PCI configuration. We also needed to check the interface for compatibility with the I²C specification.

Cache challenges
Cache management proved to be the most complex portion of the design. During writes to the data cache, the fetch must be completed before the write can occur in the cache memory. For every access, we employed a least recently used (LRU) stack to prioritize pages for replacement and filling. While the local processor is accessing the cache, the cache controller is busy doing the background work of trying to fetch the missing cache lines to increase the hit rate and to save back modified data to speed up swapping out pages in case of a miss. Given the two clock domains, the interaction between foreground and background operations was tricky.
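The LRU stack described above can be pictured with a small software model. This is only an illustrative sketch, not the Co-Mem's hardware implementation: the class name LruStack, its method names, and the page count of four are assumptions made for the example.

```python
class LruStack:
    """Toy model of a least-recently-used stack over a fixed set of cache pages.

    Index 0 holds the most recently used page; the last entry is the next
    candidate for replacement. The page count is a hypothetical parameter.
    """

    def __init__(self, num_pages):
        self.order = list(range(num_pages))

    def touch(self, page):
        """Record an access by moving the page to the top of the stack."""
        self.order.remove(page)
        self.order.insert(0, page)

    def victim(self):
        """Return the least recently used page, the one to replace next."""
        return self.order[-1]


lru = LruStack(4)
for page in (2, 0, 3, 2):
    lru.touch(page)
print(lru.order)     # pages listed from most to least recently used
print(lru.victim())  # page that would be replaced on the next miss
```

A model like this is also a convenient reference when observing the real LRU stack in simulation, since the two should agree after every access.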

There are no work-arounds for a "mostly correct" cache that provides instructions and data to a processor, so our highest priority during verification was to ensure that the data was always correct--simple in theory, but not so simple in practice. Every read from the Co-Mem during simulation had to contain an expected value, and every write needed to include a value to be written.

Figure 2 Co-Mem verification environment

The verification environment mimics the Co-Mem system block diagram to allow complete and accurate system simulation.

The expected value on a read at a given address would change over time as the test case ran, since writes to the address may have occurred. Simulation needed to account for these changes. Our approach generated the data algorithmically, based on the address. Then, if the location was written, the new value would need to be associated with that address and a flag bit would have to be set, indicating that the stored value rather than the algorithmic value should be expected.
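The expected-value scheme just described can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the test bench code: the class name ExpectedMemory, the 32-bit word width, and the particular address-to-data function are all hypothetical. A dictionary entry plays the role of the flag bit, marking addresses whose stored value overrides the algorithmic one.

```python
class ExpectedMemory:
    """Reference model that predicts what a read at any address should return."""

    def __init__(self):
        # An address present in this dict acts as the "flag bit": its stored
        # value is expected instead of the algorithmic value.
        self.written = {}

    @staticmethod
    def algorithmic(addr):
        """Hypothetical address-to-data function; any deterministic mix works."""
        return ((addr * 0x9E3779B1) ^ (addr >> 3)) & 0xFFFFFFFF

    def write(self, addr, value):
        """Record a write so that later reads expect the stored value."""
        self.written[addr] = value & 0xFFFFFFFF

    def expected(self, addr):
        """Return the stored value if the address was written, else compute it."""
        if addr in self.written:
            return self.written[addr]
        return self.algorithmic(addr)


model = ExpectedMemory()
before = model.expected(0x1000)   # never written: algorithmic value
model.write(0x1000, 0xDEADBEEF)
after = model.expected(0x1000)    # written: stored value takes over
```

Because unwritten locations are computed on demand, only the written locations consume storage, which is what keeps the number of preloaded memory locations small.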

For the page tables, however, we needed to set up the initial value ahead of time in order for the test case to run. Internal registers were unique in that their value could vary. They could contain the defaults at power-on or reset, the value most recently written to the register, or a value dependent on the current state of the cache (an internal base address register, for example).

As stated before, our highest priority during verification was to ensure that the data was always correct; but having correct data wasn't sufficient to ensure full functionality of the design. Cache memory can provide correct data even if only one of its pages is being used or if the LRU algorithm is faulty. The background operations that fill the cache could be faulty in several ways--not running at all, for example--yet the data would always be correct.

The verification environment
The heart of the verification environment was a full-system simulation that instantiated the device under test in a test bench. The test bench consisted of a PCI host model that allowed us to configure the device at power-on/reset and to modify operation registers during run time. A PCI slave model acted as the system memory. Generating the data algorithmically allowed loading of far fewer memory locations--typically just those that corresponded to the page table in host memory.

Figure 3 Co-Mem simulation

These simulation waveforms show about 3 µs of device operation taken from the Verilog-XL simulation as displayed by the Signalscan waveform viewer. The first access, a cache miss, triggers three PCI bus cycles: page look-up, data word fetch, and cache line fill.

On the local side, we used processor models to initiate read and write cycles to the device, developing our own bus-functional model that we configured to emulate different processors. The risk in using an internally developed model, however, is that its behavior tends to mimic the behavior of the design logic, since both come from the same set of specs.

This risk led us to augment our simulation with bus-functional processor models from Logic Modeling. In addition, we used a fully functional 68020 model to wring out the function and interface, shedding light on system-level issues such as the initial state of the processor and the initial configuration of the bus interface unit. It also provided an easy means of verifying data supplied by the cache--as long as the processor was running correctly, the cache was working.

The verification environment intentionally resembles the Co-Mem system block diagram (see Figure 2). We separated the design and verification teams to encourage independent points of view and expectations of slightly different behaviors. The third-party models extend this concept for the processors to take advantage of the entire model user base, a verification resource not available within a single company.

The involvement of software engineers in the verification cycle added another perspective and made the entire design team more responsive to design and architectural problems. The software engineers also helped to translate data from one domain to the other and to automate regression runs.

Tools
For simulation we chose Cadence Design Systems' Verilog-XL Turbo. Given the challenges outlined earlier, we wanted to avoid tacking on problems arising from the simulator. There are faster Verilog simulators available, but they suffer from slight incompatibilities with the other software we use, causing simulation results to vary slightly.

For waveform viewing we used Signalscan from Design Acceleration. The tool offers an intuitive interface, allows alphabetical signal search with wild cards, and can display signals in just about any format (binary, hex, ASCII, mnemonic, octal, and so on). We had no trouble viewing the simulation results, even when the simulation trace file topped 500 Mbytes.

We used Anchor Chips' Mapper software, which translates memory address spaces between the Co-Mem and the PCI host as well as the local processor, to load initial register values and set up the page table. We used the PCI bus host, slave, and monitor model from Logic Modeling exclusively on the PCI bus side of the chip. For the local processor side, we augmented our own internal model with Logic Modeling's bus-functional models for the i186, i486, 68040, and i960, as well as the fully functional model of the 68020. We ran all of our simulations on Unix workstations.

Test cases
Although we simulated everything at the system level, we did write some simple test cases to verify the basic functions. To test the I²C interface, we loaded the configuration registers from an external EEPROM to confirm that the device could be personalized in the end-user system. We used PCI host cycles to check that the PCI registers could be read from and written to, and to configure the device. Once the chip was configured, we could read and write internal registers from both the PCI side and the local processor side. Our next-most-complex task was to verify that we could read and write the shared memory in both single-cycle and burst access.

Basic cache test cases consisted of memory read and write operations. A single memory read triggers an avalanche of internal operations. On decoding an address belonging to the cache, the cache controller detects the miss, then signals the PCI logic to fetch the data. Even if no additional activity occurs on the local processor bus, background operations begin and fill the entire page in the cache. Although simple to describe, these test cases exercise a great deal of logic.
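The chain of events behind a single read can be made concrete with a toy model. The sketch below is a simplification under stated assumptions, not a description of the Co-Mem internals: the line and page sizes, the class name ToyCache, and the rule that the page table is consulted only on the first access to a page are all hypothetical. The three cycle names follow the sequence the article reports for a first-access miss.

```python
LINE_WORDS = 4   # hypothetical number of 32-bit words per cache line
PAGE_LINES = 4   # hypothetical number of lines per cache page


class ToyCache:
    """Logs the PCI cycles that a local read would trigger."""

    def __init__(self):
        self.valid_lines = set()    # (page, line) pairs present in the cache
        self.mapped_pages = set()   # pages whose translation is already known
        self.pci_log = []           # PCI cycles in the order they were issued

    def read(self, addr):
        """Model a foreground read from the local processor."""
        line = (addr // 4) // LINE_WORDS
        page = line // PAGE_LINES
        key = (page, line % PAGE_LINES)
        if key in self.valid_lines:
            return "hit"
        if page not in self.mapped_pages:
            self.pci_log.append(("page_table_lookup", page))
            self.mapped_pages.add(page)
        self.pci_log.append(("miss_data_read", addr))
        self.pci_log.append(("line_fill", key))
        self.valid_lines.add(key)
        return "miss"

    def background_fill(self, page):
        """Model the background work that fills the rest of the page."""
        for line in range(PAGE_LINES):
            key = (page, line)
            if key not in self.valid_lines:
                self.pci_log.append(("line_fill", key))
                self.valid_lines.add(key)


cache = ToyCache()
first = cache.read(0x00)                       # cold cache: a miss
cycles = [name for name, _ in cache.pci_log]   # cycles caused by that one read
cache.background_fill(0)                       # remaining lines of page 0
second = cache.read(0x30)                      # last line of page 0: now a hit
```

Even in this reduced form, one read produces three bus cycles and the background fill produces three more, which is why such a simple test case exercises so much logic.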

Figure 4 Dhrystone simulation

These simulation waveforms show just under 3 ms of device operation, offering a view of the entire cache operation.

The cache test cases grow more complex very quickly with the addition of burst and nonburst accesses on the local processor side, variable wait states on both the local processor and PCI bus sides, random addresses causing cache hits and misses, and adjustable burst lengths around and near cache line and page boundaries. When we started the cache test cases, we spent time observing the internal functions of the chip to ensure that the LRU stack was loading properly and that the asynchronous interface between the local processor and the PCI bus was correctly handled.
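One way to picture how these dimensions combine is a seeded random stimulus generator. The following is a sketch under assumptions rather than the generator used on the project: the line and page sizes, the field names, and the 50 percent bias toward boundaries are all hypothetical choices.

```python
import random

LINE_BYTES = 16    # hypothetical cache line size
PAGE_BYTES = 256   # hypothetical cache page size


def make_access(rng, num_pages=8, max_burst=8, max_wait=3):
    """Build one randomized local-bus access, biased toward boundaries."""
    page_base = rng.randrange(num_pages) * PAGE_BYTES
    if rng.random() < 0.5:
        # Start one to three words before a line or page boundary so that
        # longer bursts run up to or across it.
        boundary = rng.choice([LINE_BYTES, PAGE_BYTES])
        offset = (boundary - 4 * rng.randint(1, 3)) % PAGE_BYTES
    else:
        # Otherwise pick any word-aligned offset within the page.
        offset = 4 * rng.randrange(PAGE_BYTES // 4)
    return {
        "addr": page_base + offset,
        "write": rng.random() < 0.5,
        "burst_len": rng.randint(1, max_burst),
        "local_wait": rng.randint(0, max_wait),
        "pci_wait": rng.randint(0, max_wait),
    }


def make_test(seed, count=100):
    """Return a reproducible list of accesses for the given seed."""
    rng = random.Random(seed)
    return [make_access(rng) for _ in range(count)]


accesses = make_test(seed=7)
```

Fixing the seed makes a failing sequence repeatable, which matters when the same test case has to be rerun in a regression suite after a fix.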

The most complex test cases we ran closely modeled a real system using the fully functional 68020 model, providing an easy means of verifying the data supplied by the cache. Again, as long as the processor was running correctly, the cache was working. We mapped all of the instructions to the instruction cache and the processor stack, and the application data to the data cache.

The initialization of the device came directly from the Mapper software. The code was written in C and compiled into a hex file. We loaded the hex file into the PCI slave memory based on the page table translation, then loaded the page table itself. At that point, the reset register was written and the 68020 model began executing code. The first instruction fetch was a cache miss; the cache fetched the data and then supplied the data to the 68020, which then executed the instruction.

Running the fully functional model enabled us to verify the entire system operation, ensuring that the process contained no holes. We first ran a memory test, which by definition is self-checking, stresses the data cache, performs a mixture of instruction and data operations, and is relatively easy to debug when things go wrong. Once the memory test was running correctly, we added a Dhrystone benchmark to verify performance and some nail-and-flush tests to verify additional cache functions.
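To show what "self-checking" means for a memory test, here is a minimal sketch. It is not the test that ran on the 68020 model, which the article says was written in C; the function name, the address-in-address pattern, and the inverted second pass are assumptions chosen for illustration.

```python
def memory_test(read, write, base, num_words):
    """Return a list of (address, expected, actual) mismatches; empty means pass.

    Pass 1 writes each word's own address and reads it back.
    Pass 2 repeats with the inverted address so every data bit toggles.
    """
    errors = []
    for invert in (False, True):
        for i in range(num_words):
            addr = base + 4 * i
            write(addr, (~addr if invert else addr) & 0xFFFFFFFF)
        for i in range(num_words):
            addr = base + 4 * i
            expected = (~addr if invert else addr) & 0xFFFFFFFF
            actual = read(addr)
            if actual != expected:
                errors.append((addr, expected, actual))
    return errors


# A correct memory, modeled as a plain dictionary.
good = {}
good_errors = memory_test(good.get, good.__setitem__, 0x2000, 64)

# A faulty memory with data bit 0 stuck at zero, to show the test catches it.
bad = {}
bad_errors = memory_test(bad.get, lambda a, v: bad.__setitem__(a, v & ~1), 0x2000, 64)
```

The test carries its own expected values, so no external reference data is needed, and each reported mismatch names the failing address directly, which is what makes this kind of test easy to debug.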

Three microseconds of waveform data, taken from the Verilog-XL simulation and displayed by the Signalscan waveform viewer, illustrates the device's operation (see Figure 3). The top half of the waveforms represents PCI bus signals, the bottom half local processor signals. On the left side, the local processor access begins and triggers several PCI cycles, after which the data becomes available on the local side and RDYOUT is asserted. The first access is a cache miss, which causes three PCI bus cycles: page table look-up, miss data read, and cache line fill. Additional fetches on the local side trigger the subsequent PCI bus cycles.

Just under 3 ms of waveform data shows device operation running the Dhrystone benchmark (see Figure 4). The waveforms represent the overall cache operation instead of individual bus cycles. In the first half of the waveforms, as the test case begins, most of the local processor accesses are cache misses, causing PCI cycles to occur to fill the cache. The second half of the waveforms includes only a few cache misses, as all the instructions and most of the data fit within the Co-Mem's internal cache memory.

Finally, we ran dual Co-Mem test cases. An additional Co-Mem and a processor model were "plugged into" the PCI bus. The chips were then individually initialized from the PCI host. We used the local processor models to trigger cache operations. Additional features we exercised included peer-to-peer access to the shared memory, direct access to each other's operations registers, and nail-and-flush operations using the cache region as the source and destination of the PCI bus master cycles.

Using two Co-Mems at once allowed us to verify coexistence with another bus master (albeit only a master that behaved just like ours). We also verified that we could interoperate with ourselves and that our master could tolerate our target wait-state profile and vice versa.

First-pass fixes
The first-pass parts received from the fab worked well. We were able to demonstrate PCI compatibility right away and had the 68020 up and running the first day. We quickly moved on to general cache operations, which were also successful.

Our first real-world test was the memory test already simulated prior to tape-out. Then we added tests to verify other features, using the shared memory for a printf buffer so that we could send text from the 68020 to the Windows 95 display. The second processor we tested was the i960. This test enabled us to verify another local processor interface, as well as burst cycles on our local bus.

But not everything worked on the first pass. As we continued to write and run more tests, we found that a few features were not 100 percent operational. In every case, when we went back to the simulation to try to understand why we missed the bug, we discovered that the test case either didn't cover or weakly covered that particular area. Fundamentally, if you haven't simulated it, chances are that it won't work.

Finding and fixing the problems was relatively easy once we identified their nature. Our additional testing strategies included checking internal states that didn't automatically propagate to the pins, either by inspection or through bus watchers: small Verilog modules that check for anomalous behavior inside the chip. A viable alternative, which we didn't use, is to write test cases whose success depends on the correctness of an internal state.
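The idea of a watcher that flags anomalous internal state can be illustrated outside the simulator. The sketch below is hypothetical: the watchers on this project were Verilog modules, whereas this example is written in Python and checks a recorded trace, and the invariant it checks (the LRU stack must always be a permutation of the page numbers) is an assumption chosen because a violation would never show up as wrong data on the pins.

```python
def watch_lru(samples, num_pages):
    """Yield (time, message) for every sample whose LRU stack is malformed.

    Each sample is a (time, stack) pair, where stack lists page numbers
    from most to least recently used.
    """
    expected = set(range(num_pages))
    for time, stack in samples:
        if len(stack) != num_pages or set(stack) != expected:
            yield time, f"LRU stack is not a permutation of the pages: {stack}"


# Hypothetical trace of internal state sampled during a simulation run.
trace = [
    (10, [0, 1, 2, 3]),
    (20, [2, 0, 1, 3]),
    (30, [2, 2, 1, 3]),   # injected fault: page 0 lost, page 2 duplicated
]
violations = list(watch_lru(trace, 4))
for time, message in violations:
    print(time, message)
```

A check like this runs on every test case at no extra authoring cost, so it can catch a broken replacement policy even when the data returned to the processor is correct.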

In the system
As one of the first applications we ran to debug the system, the memory test from the full-system verification suite helped to bring up and debug the chip in the system fairly quickly. Experience with good software and the full-system operations from the simulations taught us where to look for problems. One large difference between the simulations and the ASIC debugging occurred when the chip froze.

During simulation we could rewrite the 68020 code or the HDL and restart fairly easily. However, in the actual system, the freeze required a reboot of the PC, a very time-consuming process. In the second pass of the chip, we eliminated this delay by adding a soft reset to the chip to enable a clean start without the reboot. We learned that it isn't sufficient to run real-world applications; we must be able to debug those applications on the hardware as well.

One other test we ran stressed the actual chip in a real system using two different host processors. We used two Co-Mem boards, one with a 68020 and one with an i960. The boards then independently looped through a series of tests: reading data from host memory, writing data to host memory, reading data from the other board, and writing data to the other board. Since the processors ran at different speeds, the tests didn't stay in sync; at times both local processors were moving data to and from each other.

Although the application accomplishes minimal work--it moves the same data back and forth--it demonstrates some of the key features of the chip. These include direct connection to a local processor and a PCI bus, and a local processor running code with no memory on the add-in board. From a chip verification standpoint, the application makes it clear that the first-pass parts were very functional.

We found three primary benefits to hardware-software cosimulation. At the top of the list is chip verification. Simulating in a full-system environment that closely matches the real world is the best way to ensure first-pass success. The second advantage is that the architecture can be verified prior to tape-out. Software and hardware engineers work together to hammer out the design, eliminating unwieldy interfaces between hardware and software. Running the full system confirms that system-level performance and function targets are met and that the architecture is complete. The third advantage is the ability to run and verify some portion of the code before the chips arrive. Software development is often the critical path for releasing developer kits, so any reduction in this time can shorten the time to market and increase revenues of the end product.


Steven Larky, the IC design manager at Anchor Chips, Inc. in San Diego, has 14 years' experience in chip design and verification. He holds 12 patents and has completed designs using both schematic entry and HDLs targeted to gate arrays, standard cells, and full-custom chips.

To voice an opinion on this or any Integrated System Design article, please email your message to miker@isdmag.com.


integrated system design  September 1998



Copyright © 2000 Integrated System Design
