Co-verification is a relatively new technique that gives software developers early access to a hardware design. Before co-verification became available, embedded-system developers often had to wait for a physical hardware prototype before they could write software to run the system. But shrinking project schedules and increasing complexity have driven developers to look for ways to begin writing and debugging software sooner.
Co-verification works by executing target system software on an embedded-system model that itself runs on a simulator. A special processor model interacts with both hardware and software and, through an ordinary software debugger, gives the software developer visibility into the operation of both. To increase performance, co-verification employs, among other techniques, cycle hiding, which removes bus-access cycles from hardware simulation. While it improves performance dramatically, cycle hiding desynchronizes the hardware and software simulations, and this can be a major drawback when verifying systems that depend on timers in the hardware. This is especially true in systems that employ real-time, multitasking operating systems. In fact, many developers consider it impossible to run an RTOS on simulated hardware.
Developers at In-System Design (Boise, ID) wanted to use co-verification for early testing of the VxWorks operating system in a system-on-a-chip (SOC) ink-jet printer controller. Using the Seamless Co-Verification Environment (Seamless CVE) and other tools from Mentor Graphics (Wilsonville, OR), they were able to verify the hardware-software interface and characterize RTOS performance as it switched between tasks.
Co-verification combines a special-purpose processor model with an embedded-system model that runs in a logic simulator. Hardware developers create the embedded-system model using a hardware description language, or HDL. An executable image of the target-system software is loaded into memory within the embedded-system model and runs as if it were on a prototype of the actual system. In other words, the software runs on a "virtual" in-circuit emulator that allows the software to be tested before the physical prototype is available.
In co-verification, the embedded-processor model serves three main functions. First, of course, it has to model the behavior of the processor as it executes instructions. The most common way to do this is to use an instruction-set simulator. Second, the processor model allows the software designer to view and affect the state of the hardware and software, usually through a graphical debugger. Finally, the processor model needs to interact with the hardware design. A bus interface model provides this function. The bus interface model translates the bus-cycle activity of the processor into a sequence of pin-state changes that mimic the processor's bus behavior in the logic simulator. The bus interface model also needs to handle resets, interrupts, and other asynchronous events that are generated in the hardware design (see Figure 1).
The co-verification tool links the processor model with the rest of the design model, so they can work cooperatively. It usually starts the logic simulator and software debugger and establishes communications between the hardware and software as well. The designer runs the design by starting the logic simulation, loading an executable image of the software (generally through the software debugger) and advancing both the hardware and software simulation from the debugger. The designer observes the processor registers, memory contents and variables through the software debugger. Both the logic simulator and software debugger retain all of the debugging functions available in the standalone environment.
While a complete description of the hardware design, with fully functional models of the processor and memory, could run entirely within a logic simulator and execute code from the design's memories, this would be far too slow. Real-world designs typically simulate at about five instructions per second. Simple math shows that we could run only about 144,000 instructions in eight hours. Considering the amount of software that most systems contain, this rate of execution is too slow to verify anything but tiny snippets of code. Besides, using just a logic simulator provides no debugging visibility into the software.
Enhancing simulation performance
All successful co-verification approaches rely upon "cycle hiding" to improve performance and allow meaningful amounts of software to be executed. Cycle hiding processes certain memory transactions directly in a host-system memory array, rather than running them through the logic simulator. It turns out that most bus cycles can be suppressed without affecting simulation accuracy. Consider the example of an instruction fetch. When a fetch runs in the logic simulator, a bus cycle needs to propagate through the memory control logic and one or more memory elements. The memory then returns its contents to the processor, which interprets them as the next instruction. To execute the fetch cycle, the logic simulator has to process hundreds of simulation events, and this consumes a lot of compute time on the host computer. However, the fetch produces no significant state changes in the hardware, and can safely be omitted from the logic simulation.
For co-verification, a simple data array set up outside of the hardware simulation to hold the embedded-software memory image allows the instruction-set simulator to fetch opcodes much faster, and without affecting simulation results. The instruction-set simulator processes opcodes and determines what kinds of data movements and bus cycles it needs to generate. It passes this information to the co-verification tool, which decides whether to direct the transaction to the data array (a "software" region) or to memory modeled within the logic simulation (a "hardware" region). If the transaction takes place in the hardware region, the bus interface model drives a series of signal transitions from the pin interface into the logic simulation to emulate the bus cycle. In a read cycle, the data returns to the instruction-set simulator, along with any interrupt or other exception that may have occurred during the processing of the bus cycle.
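The dispatch just described can be sketched in a few lines. This is a minimal model, not the Seamless implementation: the class name, the address ranges, and the region table are all invented for illustration.

```python
# Sketch of cycle hiding: the co-verification kernel routes each bus
# transaction either to a fast host-side memory array (a "software" region)
# or into the logic simulator (a "hardware" region). Addresses and names
# are illustrative assumptions, not the actual Seamless configuration.

class CoVerKernel:
    def __init__(self):
        # (start, end, kind): code/data space is "software", I/O is "hardware"
        self.regions = [(0x00000000, 0x00FFFFFF, "software"),
                        (0x31000000, 0x31FFFFFF, "hardware")]
        self.host_array = {}   # fast memory image visible to the ISS
        self.sim_cycles = 0    # bus cycles actually run in the logic simulator

    def kind(self, addr):
        for start, end, kind in self.regions:
            if start <= addr <= end:
                return kind
        return "hardware"      # anything unmapped goes to the HDL by default

    def read(self, addr):
        if self.kind(addr) == "software":
            return self.host_array.get(addr, 0)  # hidden: no simulator events
        self.sim_cycles += 1                     # visible: full bus cycle in HDL
        return self.run_bus_cycle(addr)

    def write(self, addr, value):
        if self.kind(addr) == "software":
            self.host_array[addr] = value        # hidden write
        else:
            self.sim_cycles += 1
            self.run_bus_cycle(addr, value)      # visible write

    def run_bus_cycle(self, addr, value=None):
        # stand-in for the bus interface model driving pin transitions
        return 0
```

With this model, thousands of fetches and data references to the "software" region leave `sim_cycles` at zero, while a single I/O write to the "hardware" region costs a full simulated bus cycle.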
Figure 2 shows the effect of eliminating just the fetch cycles from simulation. Executing all instructions from address 100, the hardware design sees a total of 11 bus cycles. If we model the code space as a "software" region, which means it is in a fast memory array accessible to the instruction-set simulator, the hardware simulation has to process only three bus cycles. Most data references can be masked as well, which in this case reduces the hardware bus-cycle count to two.
So, what percentage of the bus cycles can actually be hidden from the hardware simulation? Experience shows that 99.9 percent or more of the bus cycles generated by embedded software can be masked from the hardware. Bus cycles that directly access a memory array outside of hardware simulation can run about 10,000 times faster than they would in the logic simulation. Assume a program generates 1000 bus cycles, and the ratio of I/O cycles to fetch and data cycles is typical, about 1 to 1000. In this case, a co-verification session executes 999 code and data references at a rate of 100,000 transactions per second, and one I/O cycle at 5 transactions per second. This adds up to 0.2 seconds for the I/O cycle (in hardware simulation) and 0.00999 seconds for the code and data references (in software-only simulation), for a total run time of 209.99 milliseconds.
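The arithmetic above can be reproduced directly, including the effect of a hypothetical 1000x-faster instruction-set simulator:

```python
# Reproducing the run-time estimate from the text: 999 hidden code/data
# references at 100,000 transactions/s, plus one I/O cycle at 5/s.
hidden_refs, hidden_rate = 999, 100_000   # transactions, transactions/s
io_cycles, io_rate = 1, 5

t_hidden = hidden_refs / hidden_rate      # 0.00999 s in software-only simulation
t_io = io_cycles / io_rate                # 0.2 s in hardware simulation
total = t_hidden + t_io                   # 0.20999 s = 209.99 ms

# Speed up the instruction-set simulator by 1000x: barely any difference,
# because the logic simulator still dominates.
total_fast = t_io + t_hidden / 1000       # 0.20000999 s = 200.00999 ms

print(f"total: {total * 1000:.2f} ms, logic-sim share: {t_io / total:.1%}")
```

Running this prints a logic-simulator share of over 95 percent, which is the point of the section that follows: the hardware simulation, not the instruction-set simulator, is the bottleneck.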
It is easy to see that the logic simulator is the performance bottleneck, since it consumes over 95 percent of the run time. The instruction-set simulator is also relatively slow, but the co-verification run as a whole is still more than 10 times slower than the instruction-set simulator running alone. Contrary to common assumptions, replacing the instruction-set simulator with some faster mode of software execution does not materially change co-verification performance. Even if the performance of the instruction-set simulator in our example increased by a factor of 1000, the overall run time would shrink by less than 5 percent (from 209.99 milliseconds to 200.00999 milliseconds).
We can conclude that most bus cycles must be masked from the hardware for co-verification to be effective. Fortunately, the nature of embedded code makes this possible, allowing us to boost performance by a factor of 100. But this relies on the fact that the hardware and software are synchronized mostly by events rather than the passage of time. When time synchronization between the hardware and software is critical, as it is with systems that employ real-time, multitasking operating systems, realistic verification scenarios require some way to handle the "warping" of time that cycle hiding brings.
Accounting for hidden cycles
As we have seen, hiding bus cycles during co-verification has little or no effect on the logic simulation, since the kinds of memory transactions we mask do not affect the state of the simulated hardware. But suppose the design depends on a hardware timer that is driven by the system clock. If co-verification suppresses hardware simulation activity for most bus cycles, the clock no longer ticks, and the timer becomes seriously out of synchronization with the software running in the instruction-set simulator.
The co-verification tool needs to do at least two things to compensate for "time warping." First, it has to compute the missing bus-cycle time in detail, accounting for the number of cycles that a given instruction takes, as well as cache timing, pipeline effects, bus-acquisition times, and other factors. Second, the co-verification tool has to update the timer registers in the hardware design and allow the logic simulation to be active around the time of any critical event, such as an interrupt generated by the timer.
The level of timing detail we need from simulation depends on what we are trying to verify. To keep an operating-system timer updated, we could assume a given number of clocks per bus cycle and simply count bus cycles. This gives us a rough estimate correct to perhaps 10 or 20 percent. But characterizing system performance, interrupt latencies in our case, requires a more accurate timing model. Our ARM-based processor model includes timing models for its internal pipeline, and it accounts for wait-states and bus-acquisition times as well. Caches were not a consideration, since our processor does not have a data cache. In addition to computing hidden cycle time, we had to ensure that the hardware was not in a "hidden" bus cycle at a time when the timer needed to trigger an interrupt.
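The bookkeeping described above can be sketched as follows. The clock rate, cycle counts, and class name are illustrative assumptions; a real model would draw per-instruction cycle counts from the pipeline and wait-state models mentioned in the text.

```python
# Sketch of hidden-cycle time accounting: each instruction executed in the
# ISS contributes clock cycles even when its bus activity is hidden, and the
# accumulated time is folded into the hardware timer before any event that
# depends on it. A 50-MHz clock (20 ns per cycle) is an assumed example.

CLOCK_NS = 20  # assumed system clock period in nanoseconds

class TimerSync:
    def __init__(self, tick_period_ns):
        self.hidden_ns = 0                 # time the hardware never saw
        self.timer_ns = 0                  # software-synchronized timer state
        self.tick_period_ns = tick_period_ns

    def account(self, instr_cycles, wait_states=0):
        # accumulate the cycles consumed by hidden instructions and accesses
        self.hidden_ns += (instr_cycles + wait_states) * CLOCK_NS

    def sync(self):
        # fold hidden time into the timer, as the function calls added to the
        # timer HDL do; returns True when an OS tick is now due, meaning the
        # logic simulation must be active for the resulting interrupt
        self.timer_ns += self.hidden_ns
        self.hidden_ns = 0
        if self.timer_ns >= self.tick_period_ns:
            self.timer_ns -= self.tick_period_ns
            return True
        return False
```

For a 16-millisecond OS tick, the timer fires after 800,000 hidden clock cycles at the assumed rate, regardless of how few bus cycles the logic simulator actually processed.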
Our hardware design is an SOC controller for an ink-jet printer that takes a subset of PCL 3 input and builds a raster image for printing. It consists of an ARM-based embedded processor, with ASIC logic that implements video frame-processing logic, memory control logic and external interfaces. The external interfaces include serial, parallel, and USB interfaces, as well as Print Mechanism Controller (PMC) and front-panel interfaces (see Figure 3).
We ran the design's Board Support Package (BSP), which included the interface drivers for the ASIC, an API for a third-party imaging application and the VxWorks operating system from WindRiver (Alameda, CA). To characterize overall system performance and observe the effects of the co-verification tools, we created several small tasks to run in the RTOS.
The goals of the co-verification sessions were to debug the BSP and verify that the system would boot up to the VxWorks level. To do this, we needed to prove out the interrupt controller and its drivers, along with the serial port and the timers. We also wanted to see task creation and scheduling take place within co-verification and decide whether the performance was sufficient to run application-level code.
Seamless CVE required no changes to the system software to run the co-verification sessions, which was important: we were running the actual target-system software. We did, however, make some changes to work around parts of the hardware design that were not yet available, and we changed some of the software to improve simulation performance. Several memory tests were either shortened or omitted to allow the system to boot faster. A short section of code that prints the VxWorks banner to the universal asynchronous receiver-transmitter (UART), but which takes a couple of minutes of simulation time, was omitted after running it once. We modified the UART driver to output to an address that could be snooped from the bus, and omitted a UART test that checked all possible baud rates. Finally, we bypassed the C run-time startup code that fills a large region of memory with zeros, and instead used a debugger macro to initialize the memories.
The hardware design required some modifications to run in co-verification. We replaced the processor instances with co-verification models provided by ARM. This required just the addition of a new VHDL architecture and a two-line configuration change. We also needed to replace several memory instances with co-verification models. These models give the software debugger access to hardware memory without introducing speed-robbing bus cycles into the logic simulation. They also allow the co-verification tool to maintain a coherent view of memory from both hardware and software. This was straightforward and accomplished in less than a day.
So that the hardware timers could be synchronized with the software, we added several function calls to the timer HDL descriptions, which allowed us to update the timer state with a software-synchronized time value. This involved changing about 40 lines of HDL source code and took a couple of hours.
Finally, we added a bus monitor to the hardware that watched for write cycles at address 0x31000000, the modified address of the UART, where the debugging output from the RTOS would be written. Written in the Tcl/Tk extension language of the ModelSim logic simulator, the bus monitor opened a window on the Sun workstation, allowing us to see the output from the RTOS as it executed.
We needed to work around several time-synchronization problems to get correct simulation results. Obviously, the hardware timers needed to be kept up to date. We also needed to maintain synchronization during what we call the "atomic-swap," a memory swap that must be performed as an atomic operation with respect to both hardware and software. Several direct-memory access (DMA) channels in our design move data into and out of the system. We needed to make sure the hardware had sufficient time (or simulation clock cycles) to complete all DMA transfers. Finally, performance analysis requires time synchronization, and we were interested in measuring the maximum interrupt latency of the system.
Our design contains three timers, which track time in the hardware and generate the operating-system tick, which has a 16-millisecond period. Assuming a typical ratio of hidden to non-hidden cycles of 1000 to 1, the software would see about 16 seconds elapse before the OS tick arrived. It is tempting to reduce the hardware OS tick to 16 microseconds and hope all goes well. This fails, of course, if any task performs a hardware operation that consumes more than 16 microseconds, which is quite probable. Leaving the OS tick at 16 milliseconds effectively forces the task-scheduling algorithm into a "run-to-completion" model, since tasks rarely see a tick before finishing. We cannot support a run-to-completion model because many tasks simply run forever, and to properly verify our system the tasks need to run for the correct amount of time. Our task-switching model is preemptive-priority scheduling, with round-robin scheduling for tasks of the same priority.
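The scheduling policy just named can be sketched in a few lines. This is a generic model of preemptive-priority scheduling with round-robin rotation among equal priorities, not VxWorks internals; task names and the lower-number-is-higher-priority convention are assumptions.

```python
# Minimal sketch of preemptive-priority scheduling with round-robin among
# tasks of equal priority. Lower numbers are assumed to mean higher priority.
from collections import deque

class Scheduler:
    def __init__(self):
        self.ready = {}                     # priority -> deque of task names

    def add(self, name, prio):
        self.ready.setdefault(prio, deque()).append(name)

    def next_task(self):
        # the highest-priority non-empty queue always wins (preemption);
        # tasks within that queue rotate round-robin on each OS tick
        top = min(self.ready)
        queue = self.ready[top]
        task = queue.popleft()
        queue.append(task)                  # back of the line for next time
        return task
```

With two equal-priority tasks and one lower-priority task ready, the two high-priority tasks alternate on every tick and the third never runs, which matches the behavior described for the OS-tick-driven time slicing.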
At system reset, a ROM containing the reset code for booting the system resides at address 0x0. However, VxWorks and the embedded ARM processor exception table need to run with RAM at that address. As part of the boot process, we move the ROM from 0x0 to 0x800000, and the RAM from 0x800000 to 0x0. While this swap takes place, we execute code that resides in ROM. To swap memory correctly, without tripping the software, the hardware and software must be in complete synchronization and the co-verification tool has to be able to reconfigure the memory regions during simulation.
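The remapping requirement can be sketched as follows. The class and method names are hypothetical; the point is that the swap must happen as a single step, with no hidden cycles in flight, so that the very next fetch resolves against the new map.

```python
# Sketch of the boot-time ROM/RAM swap: the co-verification tool must
# reconfigure its memory regions atomically during simulation. Addresses
# follow the text (ROM boots at 0x0, RAM at 0x800000); the API is invented.

class MemoryMap:
    def __init__(self):
        self.regions = {"ROM": 0x000000, "RAM": 0x800000}

    def atomic_swap(self):
        # performed with hardware and software fully synchronized: a single
        # reassignment, so no access can ever observe a half-swapped map
        self.regions["ROM"], self.regions["RAM"] = (
            self.regions["RAM"], self.regions["ROM"])

    def resolve(self, region, offset):
        # translate a region-relative offset to a physical address
        return self.regions[region] + offset
```

Before the swap, address 0x0 resolves into ROM so the boot code can run; after it, 0x0 resolves into RAM, where VxWorks and the ARM exception table live.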
The software sets up a DMA transfer by writing pointer and counter values to the DMA controller registers; it assumes the hardware will complete the operation. To complete the operation, the hardware runs some number of bus cycles. However, if the co-verification masks a large number of cycles, the DMA operation may not run to completion. For this reason, the co-verification tool had to be able to suspend cycle hiding for the duration of the DMA transfer. We accomplished this by setting breakpoints in the software to disable cycle hiding at the start of a DMA. In the hardware, we set breakpoints that watched for DMA completion and re-enabled cycle hiding. This allowed DMAs to run correctly without manual intervention.
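The breakpoint-driven workaround can be sketched like this. The hook names and the `dma_start`/`dma_done` labels are invented; Seamless exposes the mechanism through its own breakpoint facilities.

```python
# Sketch of the DMA workaround: a software breakpoint where the driver
# starts a DMA disables cycle hiding, and a hardware breakpoint on the
# completion signal re-enables it, so the transfer gets every bus cycle
# it needs. Hook and symbol names are illustrative assumptions.

class CycleHidingControl:
    def __init__(self):
        self.hiding = True      # normal state: most bus cycles are hidden
        self.log = []

    def on_sw_breakpoint(self, symbol):
        # fires when the software writes the DMA pointer/counter registers
        if symbol == "dma_start":
            self.hiding = False
            self.log.append("hiding off")

    def on_hw_breakpoint(self, signal):
        # fires when the hardware asserts its DMA-completion signal
        if signal == "dma_done":
            self.hiding = True
            self.log.append("hiding on")
```

Because both breakpoints fire automatically, DMA transfers run to completion at full hardware fidelity without any manual intervention, exactly as described above.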
With the design configured for co-verification, we ran a number of tests on a Sun Ultra 60, with two processors running at 360 MHz, 2 GB of RAM and Solaris 2.6. All of the software tools were from Mentor Graphics: the hardware simulator was ModelSim Version 5.4, the software instruction-set simulator/debugger was XRAY Version 4.4, and the co-verification tool was Seamless CVE, Version 4.0. The software was written in C, C++ and ARM assembler. We used register-transfer-level models of the design, in both VHDL and Verilog versions.
Beating the clock
Approximately 18 minutes of wall-clock time elapsed between reset and creation of the first VxWorks task. Hardware initialization and diagnostics consumed most of this time. At this point, we were 1.6 milliseconds into the logic simulation. The OS tick for our system triggered every 16 milliseconds. We estimated that by continuing to run the simulation with clocks running in the hardware (in other words, without direct timer updating), it would take approximately 150 minutes (2 1/2 hours) to reach the first OS tick and almost 3 hours to reach the second. With the ability to update the timers directly to account for hidden bus cycles, we were able to spin the idle loop of the RTOS without dragging along the slow logic simulator. When we did this, the OS tick triggered at a rate of 4 timer ticks per wall-clock minute.
We created and ran two tasks so we could measure RTOS performance as it switched between tasks. The tasks were trivial and simply wrote a character to the UART, which then printed the characters in a window on the workstation. Each task relinquished its time once the character was delivered to the UART. By watching as the characters were displayed, we were able to gauge performance as the system switched between the tasks. We got 16 task switches per wall-clock second.
In another experiment, we set the time slice of the operating system to 10 ticks, or 160 milliseconds, and configured the RTOS to switch between two simple tasks. Each task went through a simple loop 220,000 times in one time slice. Disassembling the code, we found that the innermost loop of the task contained 11 instructions, for a total of roughly 2.4 million instructions per time slice. The simulated design completed each time slice in 170 seconds.
David Harris serves as In-System Design's chief technical officer. In 1988, he helped found Noninvasive Medical Technology Corp.
DeVerl Stokes is an engineer with In-System Design, Inc. Prior to working for In-System Design, he worked at Hewlett-Packard Company developing printer drivers.
Russell Klein is a technical marketing engineer and has been with Mentor Graphics for the past eight years working on hardware/software co-verification. He was on the team that created the original prototype and received two patents on this work, which provided the technology basis for the Seamless CVE co-verification product.