United Business Media EE Times



special section

EDA Platform Benchmark: Synthesis

The second EDA on NT benchmark shows that the hardware is as capable as any other platform, but questions remain about the operating system.

by James Lee and Bob Peterson



This, the second installment in our series on PC-based design flows, evaluates synthesis performance on several 400-MHz Pentium machines. Note in particular that the PCs produced eye-popping results on disk/memory trade-offs along the way. If you've ever wondered how much memory you should invest in for synthesis, our results should help answer that question.

We posed the question: Does Windows NT really boost the PC into the ranks of world-class EDA platforms? Specifically, we asked whether Synopsys Design Compiler is viable on NT. To add a little more variety to the view, we also compared the performance of two 300-MHz Pentium PCs with the 400-MHz versions.

We began the series with simulation (March) because that tends to be the most challenging EDA task for any platform. Surprisingly though, we've probably found more behavioral differences between NT and Unix on this installment's benchmarks. It's not a question of whether Design Compiler runs well on NT; it does. Despite the fact that we were testing early beta code, Design Compiler turned in an exemplary performance. You simply need to be aware that NT differs from Unix in many ways, and if you make the switch to NT, you'll need to watch those differences closely.

Picking benchmarks
Finding appropriate benchmarks for a synthesis run is harder than you might think. One difficulty is that synthesis benchmarks should have a certain reality to them. For the simulation benchmarks, we could pick a simple circuit and replicate it as many times as necessary to exercise the system. That method, however, doesn't work for synthesis unless you make the instances unique, which we did in one case to create a special memory/disk challenge. To get the characteristics we wanted in the synthesis tests, we needed to choose three new benchmark circuits.

The biggest difficulty with finding appropriate synthesis benchmarks came from a self-imposed constraint that we avoid the common practice of using proprietary circuits. If you're evaluating computing hardware or synthesis tools for internal purposes, you'll probably want to use one of your own proprietary designs so the results will reflect your design style.

By using nonproprietary tests here, though, we made it possible for other people to run the same benchmarks (on other platforms, for example) and compare the results. Nonproprietary benchmarks also help ensure that no one is manipulating the tests to make one vendor look better than another.

As the first installment pointed out, vendors can publicize the benchmarks that make their software or hardware look better--and conveniently fail to mention the benchmarks on which they fared poorly. Our benchmarks were strictly chosen to provide synthesis runs that were long enough to stress various aspects of the hardware and software. Predictably, that approach favored no particular PC vendor.

Nonproprietary RTL code is available from several sources. We could have created designs of our own--an especially appropriate task for the Seva Technology designers participating in the benchmark program--but that approach was too labor-intensive for the level of complexity we wanted. Other practical sources include universities, EDA vendor demonstrations, and people who post designs on the Internet for one reason or another.

After informally trying many different possibilities, we settled on two designs from university sources, one from a designer who posted his freeware RTL code on the Internet, and one that co-author James Lee designed several years ago for a Cadence demo. All of the designs are available on the Internet, but bear in mind that some of them have changed since we downloaded the copies used in the benchmark tests. If you want the versions we used, get them from ISD's Web site at www.isdmag.com/edabenchmark.

Table 1 lists the statistics for each benchmark circuit so you can get an idea of how the designs might compare with your own. The values for area represent library units. The report size refers to the size of the log file created by Design Compiler, and the values for the disk space used represent the database files created by Design Compiler. The file sizes give an idea of the amount of disk activity that took place even in the CPU-intensive tests.

The RAW benchmark
The first benchmark comes from a source that we tapped last time: the Reconfigurable Architecture Workstation (RAW) project at the Massachusetts Institute of Technology. The RAW benchmark suite was designed to facilitate the comparison, validation, and improvement of reconfigurable computing systems. With that objective in mind, the benchmarks were designed to be small and easy to understand, parameterizable to generate designs that would consume a range of hardware resources, and portable to any reconfigurable computer as a behavioral Verilog netlist. You can access the suite and other RAW information at cag.lcs.mit.edu/raw/.

Table 1 Benchmark circuit statistics
Design | Number of lines | Area (units) | Report size (bytes) | Memory used (kbytes) | Disk space used (bytes) | Scripts
RAW PIC | 1,658 | 60,691 | 245k | 86,156 | 176k | Top-down compile
DES | 1,603 | 133,555 | 232k | 94,144 | 289k | Top-down compile, small-block characterization, and recompile
TORCH Dpath | 3,963 | Not applicable | 1915k | 213,332 | 3.3M | Top-down compile of many small components
TORCH Regfile | 3,109 | 441,971 | 161k | 124,600 | 1.18M | Top-down compile
All TORCH | 16,162 | Not applicable (see text) | 3.5M | 306,040 | 58M | See text
rpu256 | Not applicable (see text) | Not applicable (see text) | 1.46M | 849,796 | Not applicable (see text) | Top-down uniquify to estimate memory usage

Even though we weren't dealing with reconfigurable hardware in our benchmarks, the designs suited our needs quite well. Because the Game of Life design we used last time is too short for a good synthesis test, we chose another design from the RAW suite for this installment. The new design, a Data Encryption Standard (DES) module, is still fairly small, taking about seven to eight minutes to run. Our main intention with this benchmark was to check run time for a fairly small design.

The DES module also made sense for the tests because its datapath and control logic use a lot of math features found in Design Compiler. The DES algorithm repeatedly applies substitution and permutation techniques, one on top of the other, for a total of 16 cycles. The software algorithm used in this benchmark was adapted from Eric Young's fast encryption package.

Another reason why DES made a good synthesis test was that it includes a large number of nets. Because the success of deep-submicron designs depends so much on optimizing the interconnect, we wanted to include designs that stressed the ability of the system to deal with many nets.

Crashing the TORCH
The second benchmark also came from a university source--in this case the Stanford University TORCH project. TORCH is an experimental superscalar processor architecture that's interesting because it combines the strengths of static and dynamic instruction scheduling. Static scheduling takes advantage of the compiler's ability to efficiently schedule operations across many basic blocks; thus TORCH relegates all instruction scheduling to the compiler. On the other hand, dynamic scheduling offers the advantage of efficiently supporting speculative execution. To take advantage of that efficiency, TORCH includes hardware that allows the compiler to schedule any instruction before preceding branches--an operation the TORCH architects refer to as boosting. Boosted instructions are conditionally committed to the result of later branch instructions, thereby removing the scheduling constraints imposed by dependencies on conditional branches and simplifying aggressive instruction scheduling by the compiler. Among the other interesting aspects of that architecture are the use of 40-bit-wide instruction words and instructions that constitute a superset of the MIPS R2000 RISC instruction set.

If you suspect that all of this speculative instruction execution might require a lot of hardware, you're correct. The design includes two integer execution units connected to a six-port register file, a floating-point unit with an associated register file, and separate instruction and data caches (see Figure 1). The two integer execution units have different resources. The design is available at www-flash.stanford.edu:80/torch/.

The TORCH architecture provided a design that we thought would challenge Design Compiler in every way, and we were right. When we fed the entire 16,000-line design to Design Compiler as a single block, all we got back was an internal DC error message. The PC on which we were running this informal test was a 400-MHz machine with 512 Mbytes of RAM. Design Compiler kept 800 Mbytes of swap space, so we had to reboot the system.

Out of a sense of devious and perhaps morbid curiosity, we ran the same test on a Sun Ultra 60 SPARCstation, also with 512 Mbytes of RAM. We achieved the same result, except that the swap space was freed after the run. If you prefer that your internal errors be civil enough to let go of swap space, you can take this experience as an endorsement of Unix--bearing in mind that the PC version of Design Compiler was a beta copy. All these considerations notwithstanding, the final result was the same.

To make the TORCH benchmark more realistic, we extracted two of the modules from the design: regfile and dpath. Using those modules resulted in a combination that stressed Design Compiler without causing mortal harm.

To get a complete structured design that wasn't too big to run through Design Compiler in one chunk, we turned to a freeware design by Tom Coonan, an engineer with Scientific Atlanta. As a Verilog synthesizable model of the PIC 16C5X RISC processor, Tom's design provided a relatively small circuit that he estimates at about 1,500 equivalent gates, not including memories.

Figure 1 TORCH benchmark architecture

The experimental superscalar processor design from the Stanford University TORCH project offers an ideal way to stress the hardware and software used in the synthesis benchmarks. Because of the complexity of this design, only the datapath (dpath) and register file (regfile) modules were used for benchmarking purposes.

"If you want massive MIPS and sophisticated instruction sets, go look at the ARM or the Oak or commercial IP," Tom points out. "This is a simple processor that's easily comprehended and easy to work with." He also notes that the code changes on a daily basis. If you want the version used in our benchmark, go to ISD's Web site. For Tom's latest version, see www.mindspring.com/~tcoonan/.

The disk stressor
Our last benchmark circuit, the rpu designed by co-author James Lee, is the only one we kept from the first round of benchmarks. We instantiated the simple 5,000-gate RISC CPU 256 times to create a design that was too big to fit into a 512-Mbyte memory. The uniquify function in Design Compiler allowed us to force the tool to make each instance unique so that the circuit was synthesized 256 times.

Because such a large circuit would take hours to compile, we read in only the design and did uniquify without actually compiling. While that approach is a nonstandard use of Design Compiler, it allowed us to use a maximum amount of memory in a minimum amount of time. The rpu256 benchmark served as a good memory- and disk-intensive test, prompting the most spectacular results of any of our tests.

We designed the benchmarks to compare Design Compiler's performance on NT PCs with its performance on a Unix workstation, as well as to compare the convenience of the two operating systems. By convenience we actually refer to the differences between Unix and NT that will make extra work for designers who are switching from one environment to the other.

Most designers currently use Unix as their EDA platform, and almost all use Unix to run Design Compiler because the NT version is only now becoming generally available. Because EDA tools take Unix for granted (including formal parts of the operating system as well as items that are indistinguishable from the operating system), switching to NT isn't just a matter of porting the tools' code to NT. The inevitable differences between the two operating environments will to some extent inconvenience designers.

Inconveniences
We encountered several examples of such inconvenience in the course of running the benchmarks. The first indication of the differences ahead involved the use of MKS Toolkit and NFS. Our goal was to use NT right out of the box with no additions, just as we used the unaugmented Solaris 2.5.1 on the SPARCstation. That naked-OS policy violates the installation instructions for the Design Compiler NT beta, however. According to the beta notes, "For interoperability between Unix and Windows NT, it's recommended that you install Hummingbird NFS Maestro-Solo v5.1.3 or higher version."

In addition to challenging our goal of using NT out of the box, the Hummingbird recommendation was unappealing because we preferred using Samba for sharing files between the PCs and our Unix systems. Seva Technology has used Samba for some time. In addition to being free and working well, it provides a convenient server-side solution and makes separate PC and Unix passwords unnecessary. To use Hummingbird, we would have had to install it on each PC, but since we were only using the network to download the benchmarks from our Unix server to the PCs and not for actually running the benchmarks, we ignored the Hummingbird recommendation.

As for the MKS Toolkit, the Synopsys beta notes said, "If you are relying on Unix commands in your script, we also recommend that you install MKS Toolkit 5.2 or higher." We did, in fact, rely on Unix commands in our scripts, but we avoided the requirement by translating the scripts to DOS and writing a simple program (described in the next section) to return the current time.

Listing Environment setup batch file
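rem Point Design Compiler at the local copy of the license key file
set SYNOPSYS_KEY_FILE=c:\Synopsys\admin\license\license.dat

rem HOME must be set, or Design Compiler flags an internal error
set HOME=C:\

rem Launch dc_shell, passing through up to six command-line arguments
"C:\Synopsys\msvc50\syn\bin\dc_shell_exec.exe" -r "c:\Synopsys" %1 %2 %3 %4 %5 %6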

We should also point out that, beyond the installation notes that we've quoted here, the Hummingbird and MKS Toolkit software is on the list of minimum requirements for running Design Compiler on NT. As with all such EDA requirements, however, they really mean that this is the configuration in which the vendor has tested the software--what Synopsys terms the Qualified System Configuration. We made sure first that we had reasonable alternatives for file sharing and Unix interoperability, and then we tested our work.

Another part of Design Compiler's Qualified System Configuration on the PC was potentially more challenging than the NFS and MKS Toolkit requirements. The beta notes gave the following information about NT file systems: "The Synopsys NT software is intended to be installed and run using the NTFS file system and/or network file system. Currently, we do not support the FAT file system." Even if you're not up on PC talk, you can probably figure out that NTFS is the NT file system, which provides such reliability features as transaction logs to help users recover from disk failures and access control features for directories and individual files. On the other hand, you would have to really know something about PCs to know that FAT is the file allocation table that has been the basis for the DOS and Windows file system since life first emerged from the ocean near Redmond, Wash., in 1977.

File system issues
The Synopsys installation notice was a concern because the disk drives in one of the 400-MHz PCs used in our benchmark tests incorporated NTFS, while two had the FAT file system. It makes sense to ship PCs with FAT if you don't know which operating system the user will run, because both Windows 95 and NT can read FAT partitions, but only NT can recognize NTFS. We could have reformatted the FAT drives, but we decided to tempt fate and use the systems the way we got them. As with the software requirements, the FAT file system proved to be no problem, and it's unclear whether it made a difference in the benchmark results. Our contacts at Compaq report that in their Design Compiler tests, they've found no performance difference between NTFS and FAT.

Windows NT: Adding to the EDA Arsenal
By Robert B. Baden

In the past few years, there has been some debate over whether Windows NT will overtake Unix as the preferred design environment in EDA. From the EDA users' perspective, the products they're struggling to design are more important by far than the operating system they're running on--be it Unix or Windows NT. In fact, ask any busy chip designer which would be the operating system of choice, and he or she'll probably answer, "Who cares?" Designers are focused on the tools that get the job done rather than the underlying platforms, and Windows NT is really just another operating system. [For a very different view from the designers themselves, see "Engineers Speak Out: Linux vs. Windows NT, Part I."]

It's the responsibility of EDA companies to deliver tools that operate transparently across all commercial platforms (operating systems) that designers and their companies want. High-end EDA tools have been standardized on the Unix environment. They must use common data formats (file structures) and common user interfaces as they become available on Windows NT so that designers can easily move back and forth at will.

On the other hand, business and personal productivity systems have standardized on the Windows NT/PC environment virtually across the board, which results in many chip designers having a Windows NT platform readily available, either in their briefcases or on their desks, in addition to their Unix engineering workstation. But in the EDA world, complex designs have traditionally been created on Unix systems, because that's where the tools were. Now those tools are migrating to Windows NT, and companies are starting to ask why they need two pieces of computer hardware on their designers' desks.

The separate systems cost factor becomes compelling when considering real estate on the desktop, management and administration, and maintenance. For many companies, applications on Unix can be accessible to the Windows NT business systems through an X terminal emulator on Windows NT, providing direct access to the Unix servers. However, that level of applications interoperability doesn't always meet customers' expectations. That doesn't mean that companies will instantly discard the Unix systems that they've been using for years. Instead, the natural conclusion is that the chip design world will evolve to a mixed Unix and Windows NT environment for the foreseeable future.

There's a need to improve application interoperability, as well as to increase transparency between Unix and Windows NT environments in the design flow. If the design tools aren't transparent, then it's not practical for designers to move between these incompatible operating systems to access a series of systemically incompatible tools. The inherent "tax" on interoperability is the conversion of data files and dealing with different user interfaces. The Unix/Windows NT choice is based on more than just having a single footprint and a lower cost of management. Because customers will not migrate all systems to Windows NT, they'll be forced to deal with interoperability in a mixed environment. They will also demand it.

There are two primary issues that need to be addressed. First, if the move between platforms is for the same application, data files must be readable in both environments. The Synopsys files (.db and .lib, for instance) are readable in both environments, independent of whether the client or server is Unix or Windows NT. Second, the user interface must be the same in both environments to avoid user retraining. Synopsys intends to deliver such a user interface.

Still, the underlying platforms are different, and several issues remain to be worked out:

  • The incompatibility of file naming conventions and path structures
  • The fact that Unix utilities aren't widely used on Windows NT
  • The fact that critical design flow control scripts in Unix are required in Windows NT

The real kicker is that the myriad savings Unix design shops expect may be more quickly achieved through interoperability in a mixed environment. But accomplishing that doesn't come for free. If individual companies develop one-off solutions to those problems, it will cost the industry much more than simply converting 100 percent to Windows NT. And, in any case, that's a step most companies aren't willing to take immediately.

So interoperability becomes the critical element in the move to the mixed Unix and NT environment that designers and their companies want and need [but see again "Engineers Speak Out: Linux vs. Windows NT, Part I"]. Synopsys has developed and intends to deploy a unique architecture for data file transparency and user interface transparency. Designers also require the ability to move between Synopsys tools and tools from other EDA vendors, adding another dimension to the interoperability challenge. The solutions Synopsys and its partners in the Interoperability Task Force are developing for Synopsys's tools may be transferable to other applications in EDA design, as well as other Unix applications, such as finance, M-CAD, and graphics.

There are seven members working on the Synopsys Interoperability Task Force to solve those specific interoperability issues for users of Synopsys's tools: Compaq, Digital Equipment Corp., Hewlett-Packard, IBM, Intel, Microsoft, and Siemens. Together, the group will provide solutions to many of the issues chip designers will face in the migration to a mixed Unix and Windows NT environment.

It's still true that today's super high-performance chip designs require experienced designers. By solving interoperability, we can add real value for the ranks of experienced designers. EDA tools have a legacy of addressing multiple Unix platforms. As the EDA community moves to address the same issues in NT, it becomes more and more apparent that NT is one more platform to add to EDA's design arsenal.


Robert Baden is the product line manager, new product development, at Synopsys, Inc. in Mountain View, Calif. (baden@synopsys.com).

Telling time
Unix has built-in commands that aren't available for NT, and the lack of a few of those proved to be inconvenient for our benchmarking efforts. Specifically, the Unix time command is perfect for timing benchmarks because one command (time x) returns the time taken to run task x. Similarly, the Unix date command returns the current date and time to a resolution of seconds, allowing us to log benchmark runs precisely.

Although the Time/T command in NT does return the current time, it does so only to a resolution of minutes. That resolution is inadequate, even for benchmarks that run more than an hour, because the differences from one platform to another were often very slight in our tests.

To overcome that problem, James wrote about four lines of C code to get the time. (He might have saved himself the trouble if we had known about Microsoft's NT Resource Kit at the time.) Coupled with the NT Date/T command to get the date, the C code allowed the benchmarks to go forward unimpeded. Those considerations will hardly affect EDA tasks, but other differences between NT and Unix will have an impact on anyone who moves from one environment to the other.
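The original program was not published with the article. As a minimal sketch of what such a utility might look like, assuming nothing beyond the standard C library, the following formats the current time to one-second resolution. The function names format_hms and print_now are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Format t as HH:MM:SS in local time.
   Returns 0 on success, -1 if the buffer is too small or the
   conversion fails. (Hypothetical helper, not the authors' code.) */
int format_hms(time_t t, char *buf, size_t size)
{
    struct tm *lt;

    assert(buf != NULL);
    lt = localtime(&t);
    if (lt == NULL)
        return -1;
    /* strftime returns the number of characters written, 8 for HH:MM:SS,
       or 0 if the result does not fit in the buffer. */
    return strftime(buf, size, "%H:%M:%S", lt) == 8 ? 0 : -1;
}

/* Print the current wall-clock time, one line, seconds included. */
void print_now(void)
{
    char buf[16];

    if (format_hms(time(NULL), buf, sizeof buf) == 0)
        puts(buf);
}
```

Calling print_now() from main before and after a benchmark run would bracket the run with timestamps that a batch file can redirect into a log.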

Note that getting the time information in seconds was not an issue in our last round of benchmarks because Verilog-XL has built-in statistical reporting. We have since learned, however, that Verilog-XL reports CPU time on Unix but wall clock time on NT. Our error in equating those measurements probably skewed the results in favor of the SPARCstation Ultra 2 used in the first round of benchmarks. The advantage gained by the CPU time measurement would be especially noticeable in tests involving a great deal of disk I/O, and that's where the SPARCstation gave the best relative performance.
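The gap between the two measurements is easy to reproduce. The sketch below, which assumes a POSIX system for sleep(), times the same task with both clocks; the waiting task stands in for a disk-bound job, and all names are hypothetical.

```c
#include <assert.h>
#include <time.h>
#include <unistd.h>   /* sleep(); POSIX only */

/* One task, measured two ways. */
typedef struct {
    double cpu_seconds;   /* processor time charged to this process */
    double wall_seconds;  /* elapsed real time, one-second resolution */
} timing;

/* Run work() once and record both CPU time and wall-clock time.
   Note: on Unix, clock() counts processor time. Microsoft's C runtime
   documents clock() as elapsed wall-clock time, so this sketch shows
   the Unix behavior only. */
timing time_both(void (*work)(void))
{
    timing t;

    assert(work != NULL);
    clock_t cpu_start = clock();
    time_t wall_start = time(NULL);
    work();
    t.cpu_seconds = (double)(clock() - cpu_start) / CLOCKS_PER_SEC;
    t.wall_seconds = difftime(time(NULL), wall_start);
    return t;
}

/* Stand-in for an I/O-bound task: the process waits, using almost no CPU. */
void wait_two_seconds(void)
{
    sleep(2);
}
```

Passing wait_two_seconds to time_both yields a wall-clock figure of about two seconds and a CPU figure near zero, which is why a CPU-time report flatters any run that spends much of its time waiting on the disk.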

In future benchmark installments, we'll correct the erroneous data by rerunning the Verilog-XL benchmarks using the new time measurement method. In the meantime, the benchmark tests in this installment provide apples-to-apples measurements that offer a compelling case for adopting the PC as an EDA platform.

Installing the software
In addition to the Design Compiler installation notes that we described earlier, we dealt with some other installation issues that reflect differences between Unix and NT. As foreshadowed by our pet peeve about the time and date shell scripts, we encountered a script-related problem involving a directory name on NT.

When installing Design Compiler, we specified the following directory: Program Files\Synopsys Inc\Synopsys beta. We quickly found that the application wouldn't run at that location.

The reason for the failure resided in the script file that finds all the items needed to run the application. Because Unix file and directory names rarely contain spaces, scripts routinely use spaces as delimiters. Running the usual Unix start-up script on NT thus caused the software to look for each piece of the directory name as a separate location. The application went off on a fruitless search for Program, Files\Synopsys, Inc\Synopsys, and beta.
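The effect can be sketched in a few lines of C. This is only an illustration of space-delimited splitting, not the actual start-up script, and the function name split_on_spaces is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Split line in place on spaces, the way a naive script treats its input.
   Stores up to max token pointers in out and returns how many were found. */
size_t split_on_spaces(char *line, char *out[], size_t max)
{
    size_t n = 0;
    char *tok;

    assert(line != NULL && out != NULL);
    for (tok = strtok(line, " "); tok != NULL && n < max;
         tok = strtok(NULL, " "))
        out[n++] = tok;
    return n;
}
```

Given the installation directory described above, the function returns four tokens rather than one path, matching the four locations the application went looking for.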

By reinstalling Design Compiler in C:\Synopsys, we fixed the problem. Designers need to be aware of the differences between NT and Unix and exercise due caution. You can't assume that shell scripts proven under one operating system will work correctly on the other. That being said, once we changed the installation directory on NT, all of our libraries, design files, and scripts worked perfectly on both operating systems without modification.

Setting up the environment
In the release version of Design Compiler NT, Synopsys has promised an installation utility for setting up each application's run environment, as did Cadence for the NT version of Verilog-XL. We want to commend this practice and suggest that other vendors follow the example. The installation utility was not yet available for the beta version of Design Compiler NT, so we had to work from a list of variables on paper (HOME, SYNOPSYS_KEY_FILE, SYNOPSYS_FILE_NAME_DELIM, SYNOPSYS_CONSPEC, SYNOPSYS_CONSPEC_SWITCH, and SYNOPSYS_RESERVE_SIZE). Naturally, we neglected to set one of them (HOME), and Design Compiler flagged an internal error. When we called Synopsys's technical support for help, our contact immediately knew what was wrong. To avoid further environment mishaps as we installed Design Compiler on each platform, we wrote our own batch file that set all the variables.
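A forgotten variable is cheap to catch before the tool starts. The sketch below checks a list of names with the standard getenv call and reports any that are unset. The names come from the list above, written in upper case; treating every one of them as mandatory is an assumption made for illustration, and count_missing is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Variable names from the beta list quoted above. Whether each one is
   strictly required is an assumption made for this sketch. */
static const char *const required_vars[] = {
    "HOME",
    "SYNOPSYS_KEY_FILE",
    "SYNOPSYS_FILE_NAME_DELIM",
    "SYNOPSYS_CONSPEC",
    "SYNOPSYS_CONSPEC_SWITCH",
    "SYNOPSYS_RESERVE_SIZE",
};

/* Count how many of the n names are unset or empty in the environment,
   printing each missing name to stderr. */
size_t count_missing(const char *const names[], size_t n)
{
    size_t missing = 0;
    size_t i;

    assert(names != NULL);
    for (i = 0; i < n; i++) {
        const char *value = getenv(names[i]);
        if (value == NULL || value[0] == '\0') {
            fprintf(stderr, "not set: %s\n", names[i]);
            missing++;
        }
    }
    return missing;
}
```

A small wrapper could call count_missing(required_vars, 6) and refuse to launch the tool when the result is nonzero, turning an opaque internal error into a one-line diagnosis.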

On the subject of environment variables, we set up the license file in a different way than the conventional approach, in which the SYNOPSYS_KEY_FILE resides on one server. While that centralized arrangement is preferred in a normal working environment, having the license file only on one server would have forced each machine in the benchmark tests to repeatedly access the file over the network (and would have required the use of NFS or Samba, as well).

Because we didn't want the vagaries of the network to influence the benchmark times, we copied the key file to each platform. Each copy of Design Compiler still had to access the server once to get a user token, but after that all license queries during the benchmark runs were handled by the local key file. We used a simple batch file to set up the runs (see the listing).

Before starting a benchmark run, we disconnected all network drives to eliminate all network overhead. We then rebooted the systems to ensure that we had a clean starting point for the test.

On the subject of license files, note that Synopsys made them equivalent on Unix and NT. Obtaining a Design Compiler license is therefore transparent to users on either type of platform. In contrast, Cadence made the Unix and NT licenses different for some reason, so you have to obtain the correct license for the type of platform on which you're working. In a heterogeneous environment like the one at Seva, the Synopsys approach simplifies EDA work.

In this round of benchmarks, we tested three 400-MHz PCs, a 300-MHz Sun SPARCstation Ultra 60, and two 300-MHz PCs from our first benchmark installment (see Table 2). Sun declined to participate in the benchmarks, but Intel loaned us the new SPARCstation Ultra 60. The results of the benchmarks made it clear why Intel was eager to fill the gap.

A fast start
To prepare for the PC benchmark tests, we got in line for 400-MHz PCs on April 16--the day after Intel revealed the 400-MHz Pentium II processor. Knowing that system makers probably had this chip for more than a month, we expected that the PCs would be ready to roll, and we were almost right. It turned out, though, that we were pushing the availability envelope a bit, so Compaq, Hewlett-Packard, and IBM deserve a lot of credit for managing to provide machines in time to meet our tight deadline.

Table 2 Benchmark test subjects
System | Speed | Number of processors | Memory | Disk subsystem
Hewlett-Packard Kayak XU | 400 MHz | 2 | 512 Mbytes | Two 4.5-Gbyte drives configured as hardware RAID, FAT file system
IBM Intellistation M Pro | 400 MHz | 2 | 512 Mbytes/1 Gbyte | Two 9-Gbyte drives configured as hardware RAID, FAT file system
Compaq Professional Workstation AP400 | 400 MHz | 1 | 512 Mbytes | Two drives configured as software RAID, NTFS
Sun SPARCstation Ultra 60 | 300 MHz | 2 | 512 Mbytes | Two 4.2-Gbyte drives, no RAID
Compaq 5100 | 300 MHz | 2 | 512 Mbytes | No RAID
Hewlett-Packard Kayak | 300 MHz | 2 | 512 Mbytes | No RAID

Figure 2 Design Compiler cache advantage

After one run of a benchmark, Design Compiler's cache of compiled Designware components sped up subsequent runs. For comparison purposes, the shortest run for a given platform on each benchmark was used.

As an introduction to the three 400-MHz PCs, it's worthwhile to look at the processor and support chips Intel released on April 15, because all three PCs offer nearly identical configurations based on the new chips. In addition to boosting the Pentium II's clock speed from 300 to 400 MHz, the new processor improves the system bus speed from 66 to 100 MHz.

As for the support chips, Intel claims that its new 440BX chip set enables the 100-MHz system bus to increase peak processor data transfers to the rest of the system by 50 percent. The chip set promises to improve bandwidth among the Pentium, the Accelerated Graphics Port, 100-MHz SDRAM, and the PCI bus using enhanced bus arbitration, deeper buffers, an open-page memory architecture, and ECC memory control. The 440BX supports both 100- and 66-MHz bus speeds.

All three 400-MHz PCs use the BX chip set and take advantage of the 100-MHz system bus option. The PCs are all dual-processor systems, although the Compaq machine had only one processor installed--in keeping with our request for single-processor systems. We cover that discrepancy in more detail later, but in short, we believe that the number of processors made little if any difference in the single-threaded, CPU-intensive benchmark results. In future reports, we'll compare various benchmarks with one and two CPUs.

Figure 3 CPU-intensive test of 400-MHz PCs

All three of the 400-MHz PCs tested achieved almost identical results on the CPU-intensive benchmarks. Those benchmarks involved relatively little disk I/O.

As PC designers gain more experience with a new system's architecture, the highly conservative values used in the BIOS can be relaxed in some cases to achieve better performance. During our testing, updated BIOS versions for the IBM and Compaq systems became available, so we flashed the BIOS on each of those machines. That change noticeably improved the performance of both machines, and we used the faster run times in the benchmark results.

Memory and disk capacity
All of the 400-MHz PCs incorporated 512 Mbytes of SDRAM, though we did test one system (the IBM) with 1 Gbyte of SDRAM. The advantage of having more memory on memory-bound runs is obvious, but the magnitude of the advantage might surprise you.

An interesting variable relating to the memory itself is whether the DIMMs are registered or unbuffered. In unbuffered SDRAM, because the CAS signals pass directly to the memory in the same clock cycle, the CAS latency for the DIMM equals the SDRAM CAS latency, resulting in timings of 5/1/1/1. In the registered SDRAM, the DIMMs buffer all address and control signals, which adds a cycle to the CAS latency for 6/1/1/1 timing. The buffering is needed to decrease the capacitive loading of the DIMM on the input address and control signals.

The registered/unbuffered issue is important if you want more than 512 Mbytes of SDRAM in your system. At memory capacities up to 512 Mbytes, you can use 128-Mbyte SDRAMs, which can be unbuffered. To achieve higher capacities, however, you need 256-Mbyte SDRAMs or larger, and they must be registered. Thus a PC with 1 Gbyte of SDRAM has an extra wait state, compared with a PC with half as much SDRAM.

On the other hand, that wait state adds only one clock to an eight-clock burst. Additionally, the system performance will suffer from that wait state only on cache misses, so the overall effect should be relatively minor. That proved to be the case. (See the results for the IBM system below.)
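
To put rough numbers on that argument, here is a minimal Python sketch of the burst arithmetic. The timing strings come from the text above; the function names and the memory-stall fraction are our own illustrative assumptions, not measured values.

```python
def burst_cycles(timing: str) -> int:
    """Total bus clocks for one burst, given a timing string such as '5/1/1/1'."""
    return sum(int(part) for part in timing.split("/"))


def miss_penalty_increase(unbuffered: str = "5/1/1/1",
                          registered: str = "6/1/1/1") -> float:
    """Fractional increase in burst length caused by the registered DIMM's extra clock."""
    base = burst_cycles(unbuffered)
    return (burst_cycles(registered) - base) / base


def overall_slowdown(memory_stall_fraction: float) -> float:
    """Estimated whole-run slowdown.

    memory_stall_fraction is a hypothetical input: the share of run time
    spent waiting on main memory after cache misses.
    """
    return memory_stall_fraction * miss_penalty_increase()


if __name__ == "__main__":
    print(burst_cycles("5/1/1/1"), burst_cycles("6/1/1/1"))
    print(miss_penalty_increase())
    # Assumed example: 10 percent of run time stalled on main memory.
    print(overall_slowdown(0.10))
```

Under the assumed 10 percent stall fraction, the extra clock costs a little over 1 percent of total run time, which is in line with the "relatively minor" effect described above.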

All of these PCs also have RAID 0 subsystems based on two 10,000-RPM disks that use an Ultra-Wide SCSI interface. The Compaq PC differs in that its RAID striping is managed by software, whereas the IBM and HP RAIDs are managed by hardware. That difference put the Compaq system at a disadvantage in our one disk-intensive test. Remember that RAID 0 doesn't provide data redundancy but should offer a performance improvement over standard disk subsystems by alternately striping data across two (or more) drives.
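
For readers unfamiliar with striping, the following Python sketch shows one common way a RAID 0 layout maps a logical block onto two drives. The function name and the stripe size are assumptions for illustration; neither NT's software striping nor the hardware controllers in these PCs necessarily use this exact layout.

```python
def locate_block(logical_block: int,
                 num_drives: int = 2,
                 blocks_per_stripe: int = 1) -> tuple:
    """Return (drive, block offset on that drive) for a logical block in RAID 0.

    Consecutive stripes alternate across the drives, so sequential I/O
    is spread over all of them. There is no redundancy.
    """
    stripe_index = logical_block // blocks_per_stripe
    drive = stripe_index % num_drives
    stripe_on_drive = stripe_index // num_drives
    offset = stripe_on_drive * blocks_per_stripe + logical_block % blocks_per_stripe
    return drive, offset


if __name__ == "__main__":
    for block in range(6):
        print(block, locate_block(block))
```

Because neighboring stripes land on different drives, a long sequential transfer can keep both disks busy at once, which is where the performance gain over a single drive comes from.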

Figure 4 Small differences on CPU-intensive benchmarks

By normalizing run times to the fastest 400-MHz PC on every benchmark, the chart shows the tiny variations among the systems. The biggest difference on any of the benchmarks was barely more than 1 percent.

As mentioned earlier, the Compaq PC uses NTFS, whereas the HP and IBM PCs use FAT. The only major difference between the HP and IBM RAID subsystems is that the former used two 4-Gbyte Seagate Cheetah drives and the latter used two 9-Gbyte Cheetah drives. The smaller drives apparently gave the HP system a slight edge on disk-intensive tasks.

The test setup
In all of the synthesis benchmarks, we used a 0.25-µm TSMC library provided by Silicon Access (www.siliconaccess.com). The library represents the kind of deep-submicron technology we wanted to test, offering the wide range of gate sizes that are vital for good deep-submicron optimization in synthesis.

We didn't try to take advantage of that optimization capability in these tests, however, because the goal was to simply measure run times. The scripts used to run the tests were therefore straightforward. The basic sequence was:

  • Read the design file(s)
  • Set the current design
  • Link designs
  • Set some default constraints
  • Uniquify the design
  • Compile with varying map_efforts
  • Report results
  • Write out the design netlist
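
The sequence above can be sketched as a small Python helper that emits a dc_shell-style batch script. This is not the script used in the tests. The file names, design name, clock port, clock period, and choice of reports are hypothetical, and the exact command syntax depends on the Design Compiler version and shell mode, so treat the output as an outline rather than a ready-to-run script.

```python
def build_script(design_files, top, map_effort="medium",
                 clock_port="clk", period_ns=10.0, netlist="netlist.v"):
    """Return the text of a batch script that follows the benchmark sequence.

    All names passed in are placeholders chosen by the caller.
    """
    if map_effort not in ("low", "medium", "high"):
        raise ValueError("map_effort must be low, medium, or high")

    lines = [f"read -format verilog {path}" for path in design_files]  # read design files
    lines.append(f"current_design {top}")                              # set the current design
    lines.append("link")                                               # link designs
    lines.append(f"create_clock -period {period_ns} {clock_port}")     # assumed default constraint
    lines.append("uniquify")                                           # uniquify the design
    lines.append(f"compile -map_effort {map_effort}")                  # compile
    lines.append("report_area")                                        # report results
    lines.append("report_timing")
    lines.append(f"write -format verilog -hierarchy -output {netlist}")  # write the netlist
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # One script per effort level, mirroring "varying map_efforts".
    for effort in ("low", "medium", "high"):
        print(build_script(["design.v"], "top", map_effort=effort))
```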

Note that as in our first benchmark installment, we didn't evaluate graphics performance on any of the platforms. Like most Synopsys users, we ran the benchmarks in batch mode.

We ran each benchmark three times on each platform to ensure that the results were consistent. That was a good practice, because we discovered that the results were often inconsistent. Sometimes the second and third runs were slightly faster than the first one--but not always. The reason lay in Design Compiler's cache. After a run, the application saves compiled Designware components, which are then reused on subsequent runs to provide a speed advantage. We didn't see that effect on every series of benchmark runs, because sometimes we ran a benchmark on a platform informally before starting the actual test series. In those cases, the cache was already populated, so its speed advantage benefited every run in the series.

The best way to deal with the cache effect would be to delete the cache before every run. Because we discovered this factor late in our testing, though, we compensated by using the shortest run on each platform for comparison purposes. The results of some of the test runs exhibit the benefit of the Design Compiler cache (see Figure 2).
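
The compensation we describe amounts to taking the minimum of each platform's runs. A minimal Python sketch follows; the platform labels and times in the example are made up for illustration and are not benchmark results.

```python
def shortest_runs(run_times):
    """Map each platform to its shortest run time (seconds)."""
    return {platform: min(times) for platform, times in run_times.items()}


def cache_speedup(times):
    """Fraction of the first run's time saved by the best later run.

    A value near zero suggests the cache was already warm on the first run.
    """
    first, later = times[0], times[1:]
    return (first - min(later)) / first


if __name__ == "__main__":
    # Hypothetical timings: three runs per platform.
    runs = {
        "pc_a": [100.0, 96.0, 97.0],   # first run cold, later runs use the cache
        "pc_b": [80.0, 80.5, 79.8],    # cache already warm before the series
    }
    print(shortest_runs(runs))
    print(cache_speedup(runs["pc_a"]))
```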

PCs are the clear winners
The synthesis benchmark had several important results:

  • All three 400-MHz PCs performed about equally well in the CPU-intensive tests.
  • If the IBM and HP machines had had their L2 cache ECC units enabled, the Compaq PC might have outperformed them slightly in most of the tests.
  • The Compaq PC fell short in the rpu256 benchmark, which required a lot of paging to disk, most likely because of its software-based RAID.
  • Both the 300- and 400-MHz PCs consistently outperformed the 300-MHz SPARCstation Ultra 60.
  • The 400-MHz PCs outperformed the 300-MHz PCs by about the degree indicated by the difference in their processor clock frequencies, except on the rpu256 benchmark.

It's not surprising that the 400-MHz PCs all performed at about the same level on the CPU-intensive tests, given that they all have about the same configuration (see Figure 3). To highlight the tiny differences among the PCs, you can view the same results as a percentage of the fastest system on each of the tests (see Figure 4).
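
The normalization behind that percentage view is simple enough to sketch in Python. The benchmark and system labels and the times below are placeholders, not our measured data.

```python
def normalize_to_fastest(results):
    """Convert run times to a percentage of the fastest system on each benchmark.

    results: {benchmark: {system: seconds}}
    returns: {benchmark: {system: percent}}, where the fastest system is 100.0
    """
    normalized = {}
    for benchmark, times in results.items():
        fastest = min(times.values())
        normalized[benchmark] = {
            system: 100.0 * seconds / fastest
            for system, seconds in times.items()
        }
    return normalized


if __name__ == "__main__":
    # Hypothetical run times in seconds.
    sample = {"bench1": {"system_a": 200.0, "system_b": 202.0}}
    print(normalize_to_fastest(sample))
```

Normalizing per benchmark, rather than against one overall winner, keeps a system that is fastest on only some tests from distorting the comparison on the others.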

Despite the fact that the 400-MHz PCs always performed to within 1.5 percent of each other on the CPU-intensive tests, the results may hide some deviations. Late in our benchmark work, for example, someone brought to our attention the fact that both the L2 cache and main memory have their own ECC functions. Turning off either of those ECC functions improves memory bandwidth and thus allows applications to run faster. While turning off the main-memory ECC would be risky, the rapid data turnover in the cache makes that memory less vulnerable to errors.

Figure 5 The advantage of hardware RAID 0

Two of the 400-MHz PCs incorporated hardware-based RAID 0 (striping), whereas the other used a RAID 0 subsystem managed by Windows NT. The hardware approach had a huge advantage on the rpu256 disk-intensive benchmark.

A Hewlett-Packard spokesperson observed that the "chances of an error happening in 512-kbyte cache, in particular one resulting in a problem, are exceedingly small." HP decided that any extra protection afforded by the cache ECC was not worth the performance penalty, so the company disabled the function in our test system and plans to ship the Kayak XU that way to customers.

To check the effects of that choice, we tested the IBM Intellistation with and without its cache ECC enabled and found that the system's performance changed by about 2 to 3 percent on the CPU-intensive benchmarks. With the small differences among the 400-MHz PCs on these benchmarks, a 2 to 3 percent advantage is significant. The cache ECC is enabled by a register setting in the BX chip set, and the BIOS controls setting of the register bit.

Should the L2 cache ECC be on or off? According to an Intel spokesperson, "Intel takes the reliability of its products seriously, especially for what we call business-critical computing. In these applications, it's unacceptable for the system to produce an undetected wrong result. We strongly recommend that L2 ECC be enabled. This is the default, supported configuration for OEM-branded workstations and servers."

Figure 6 Comparing 300- and 400-MHz PCs

Adding the 300-MHz PC results to the chart shown in Figure 3 shows that the two 300-MHz systems took about 25 percent longer to run the CPU-intensive benchmarks. The results for the 400-MHz PCs are so close that they appear as a single value.

Are two processors faster?
As mentioned earlier, we asked each of the PC vendors for single-processor systems, but only Compaq actually delivered a system with just one Pentium installed. Even though the synthesis benchmarks we ran were single-threaded and therefore shouldn't have experienced any speed improvement from a second processor, the NT Performance Monitor showed that the second processor was performing I/O tasks.

In most of the benchmarks, the overhead of managing the two processors probably equaled or even exceeded the contribution of the second processor. On the disk-intensive test, the ability to offload I/O tasks to the second processor could have been significant, especially in light of Compaq's use of software-managed RAID. On the other hand, the CPU utilization was around 7 percent during the disk-intensive test, so it was not as if the single processor was being overtaxed.

In fact, we might speculate that the IBM and HP systems suffered a performance penalty from the overhead associated with running dual processors on single-threaded tests. From that perspective, the Compaq PC could have lost a couple of percent on performance as a consequence of having the L2 cache ECC enabled and gained a couple of percent from avoiding the two-processor overhead. Our conclusion: It was all just a wash.

Hardware or software RAID?
The most important factor in Compaq's shortfall on the disk-intensive test was probably its use of NT for managing the disk striping in the RAID subsystem. HP and IBM both use hardware solutions based on Adaptec Array1000CA RAID controllers. Relying on NT to manage the RAID penalized Compaq significantly on disk I/O (see Figure 5).

On the memory-intensive test (the rpu256), which forced a lot of paging to disk, the benchmark times for the Compaq system with software-based RAID were only slightly better than those for a 300-MHz system with no RAID. Compared with the software-based RAID, hardware-based RAID gave the IBM and HP PCs about a 50 percent advantage--exactly what Adaptec claims for disk-intensive applications.

Figure 7 The benefits of more memory

To highlight the advantage offered by keeping a large task resident in memory, one can average the results of the memory/disk-intensive rpu256 benchmark. The memory-resident result came from a PC with 1 Gbyte of SDRAM, whereas all the other systems had 512 Mbytes of SDRAM. Giving the memory-resident result a weighting of 1, the figure shows that the 400-MHz PCs with hardware-based RAID 0 took five times longer than the memory-resident system.

Adaptec uses two approaches to accelerate the RAID subsystem. First, a RAID coprocessor board employs an on-board BIOS and SRAM for RAID command execution. Unlike NT's software approach, the hardware can create a bootable array that stripes the operating system files (as well as other disk files) to speed up NT's performance.

Second, the Adaptec technology provides a disk cache on the RAID coprocessor board in addition to the caches already maintained in the disk drives and the PC's main memory. Adaptec explains that the NT cache in main memory accumulates a lot of dirty blocks during heavy disk I/O. The coprocessor cache moves those dirty blocks out of main memory faster because the data transfers to the coprocessor board at PCI speeds rather than waiting on the SCSI channel all the way to disk; the coprocessor transfers data to disk faster by applying elevator sorting, which minimizes disk head movement. The coprocessor cache also prefetches sequential data blocks to make them available to the main memory cache at PCI speeds.
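
Elevator sorting is easy to illustrate. The Python sketch below uses the textbook form of the algorithm, which services pending requests in one sweep direction before reversing. We don't know the details of Adaptec's implementation, so the function names, the block numbers, and the starting head position are all illustrative assumptions.

```python
def elevator_order(pending, head, moving_up=True):
    """Order pending block requests the way an elevator services floors.

    Requests in the current direction of travel are handled first, in order;
    the remainder are handled on the return sweep.
    """
    if moving_up:
        first = sorted(b for b in pending if b >= head)
        second = sorted((b for b in pending if b < head), reverse=True)
    else:
        first = sorted((b for b in pending if b <= head), reverse=True)
        second = sorted(b for b in pending if b > head)
    return first + second


def head_travel(order, head):
    """Total distance the head moves when servicing requests in the given order."""
    total = 0
    for block in order:
        total += abs(block - head)
        head = block
    return total


if __name__ == "__main__":
    queue = [98, 183, 37, 122, 14, 124, 65, 67]   # arrival order (made-up positions)
    start = 53
    print(head_travel(queue, start))                          # first come, first served
    print(head_travel(elevator_order(queue, start), start))   # elevator order
```

In this made-up queue, servicing requests in arrival order moves the head 640 positions, while the elevator order needs 299, less than half the travel.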

Figure 8 Waiting for disk

Running a task similar to the rpu256, which fit nicely in 1 Gbyte of RAM, the user could finish in about 8 minutes. With only 512 Mbytes of RAM, the user would have to wait 40 minutes to an hour and 20 minutes, depending on whether the system had hardware- or software-based RAID 0. The 300-MHz PCs did surprisingly well, considering that they used no RAID at all.

How the 300-MHz PCs fared
Another interesting disk-related result was that the difference between the 400- and 300-MHz PCs decreased in the memory-intensive benchmark, indicating that you won't get as big a performance boost from the faster systems as you might expect if your applications do a lot of disk thrashing. (Considering a system running 1 Gbyte of memory, as we do in the next section, throws further light on the comparison between 400- and 300-MHz PCs on disk-intensive tasks.)

On the CPU-intensive benchmarks, though, the performance difference between the 400- and 300-MHz PCs is directly proportional to clock frequency (see Figure 6). Apparently, the other enhancements in the 400-MHz Pentium II and the BX chip set serve mainly to let the higher clock frequency deliver its full benefit.

The extraordinary value of memory
As part of the benchmark tests, we tried the IBM Intellistation with both 512 Mbytes and 1 Gbyte of SDRAM. Most of the benchmark results reported in this article used the smaller memory to maintain uniformity among the various platforms. But we also wanted to see what advantage the 1-Gbyte memory delivered in a synthesis test that required the 512-Mbyte systems to perform a lot of paging to disk.

The results were stunning (see Figures 7 and 8). Compared with the PCs having 512 Mbytes of SDRAM and hardware-based RAID, the system with 1 Gbyte of memory slashed the run time in this test about fivefold. Clearly, if you are going to run large designs through Design Compiler, you'll get more bang for your buck by buying more memory for a 300-MHz PC rather than upgrading to a faster processor.
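
The arithmetic behind those ratios can be checked with the approximate run times quoted in the Figure 8 caption (about 8 minutes, 40 minutes, and an hour and 20 minutes). The short Python sketch below uses those rounded figures; the dictionary labels and function names are our own.

```python
# Approximate run times in minutes, taken from the Figure 8 caption.
RUN_MINUTES = {
    "1 Gbyte, memory-resident": 8,
    "512 Mbytes, hardware RAID 0": 40,
    "512 Mbytes, software RAID 0": 80,
}


def slowdown_vs(baseline, runs):
    """Express every run time as a multiple of the baseline configuration."""
    return {name: minutes / runs[baseline] for name, minutes in runs.items()}


def time_saved(slow, fast):
    """Fraction of run time saved by moving from the slow to the fast configuration."""
    return (slow - fast) / slow


if __name__ == "__main__":
    print(slowdown_vs("1 Gbyte, memory-resident", RUN_MINUTES))
    print(time_saved(RUN_MINUTES["512 Mbytes, software RAID 0"],
                     RUN_MINUTES["512 Mbytes, hardware RAID 0"]))
```

With these rounded numbers, hardware RAID takes five times as long as the memory-resident run, software RAID takes ten times as long, and hardware RAID saves half the run time relative to software RAID.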

Figure 9 Difference among platform classes

By looking at the average results for all the machines as a percentage of the fastest platform on each benchmark, one can see that the PCs consistently turned in the fastest time on every benchmark by a significant margin.

The test used the rpu256 benchmark. It turned out to be ideal for showing off the benefits of the larger memory, because the benchmark required more than 800 Mbytes of memory--enough to cause lots of disk paging in the 512-Mbyte systems but not too much for the 1-Gbyte system.

The data in Figure 7 also help answer the question of whether Windows NT is ready for the EDA big time. We explained earlier that the way we measured test times in our previous set of benchmarks may have skewed the results in favor of the SPARCstation, and the benchmark results in this installment confirm that suspicion (see Figure 8).

The SPARCstation Ultra 60 consistently lagged behind even the 300-MHz PCs in the synthesis benchmarks, and the performance difference was substantial (see Figure 9). The SPARCstation's shortfall was especially profound on the disk-intensive test, in which the Unix system seemed to excel previously. By running some of the simulation benchmarks we used last time, we verified that the Ultra 60's performance was consistent with that of the Ultra 2 in our first installment.

If you are buying an EDA workstation based strictly on performance, the PC is the clear choice. With 400-MHz PCs expected to sell for around $8,000 with dual processors, 512 Mbytes of SDRAM, and hardware-based RAID, the Ultra 60's $26,000 price tag seems a bit on the high side.

Of course, you have to consider many other factors when choosing EDA workstations (see "Windows NT: Adding to the EDA Arsenal"). Our experience with setting up networking and myriad other details under Windows NT also leads us to say that the operating system can be annoying if you're used to Unix. Still, the performance of Design Compiler on the NT PCs we tested was compelling. Pay a little extra for all the SDRAM you can get and the PC will reward your big synthesis runs with stellar performance.

The authors wish to thank Ramprasad Rangarajan for his expert assistance in conducting the synthesis tests. As project manager at Seva Technologies, Inc. in Fremont, Calif., Ram's many years of experience with Synopsys Design Compiler helped make the benchmark tests possible.

James Lee is a senior consulting engineer at Seva Technologies. He has 12 years' experience working with Verilog and was one of the first employees at Gateway Design Automation, which developed Verilog. Prior to joining Seva, he was with Cadence Design Systems. He is the author of Verilog Quickstart and is a part-time instructor in Verilog at the University of California at Santa Cruz.

Bob Peterson is a freelance writer based in Monterey, Calif. Formerly the assistant managing editor of EDN magazine, he has written on a wide variety of technical topics for many publications and companies for the past 16 years.

To voice an opinion on this or any Integrated System Design article, please email your message to miker@isdmag.com.


integrated system design  July 1998



Copyright © 2000 Integrated System Design
