special section
This, the third installment in our series on PC-based design flows, evaluates synthesis performance on several 400-MHz Pentium machines. Note in particular that the PCs produced eye-popping results on disk/memory trade-offs along the way. If you've ever wondered how much memory you should invest in for synthesis, we can answer that query beyond all doubt. We posed the question: Does Windows NT really boost the PC into the ranks of world-class EDA platforms? Specifically, we asked whether Synopsys Design Compiler is viable on NT. To add a little more variety to the view, we also compared the performance of two 300-MHz Pentium PCs with the 400-MHz versions. We began the series with simulation (March) because that tends to be the most challenging EDA task for any platform. Surprisingly, though, we probably found more behavioral differences between NT and Unix on this installment's benchmarks. It's not a question of whether Design Compiler runs well on NT; it does. Despite the fact that we were testing early beta code, Design Compiler turned in an exemplary performance. You simply need to be aware that NT differs from Unix in many ways, and if you make the switch to NT, you'll need to watch those differences closely.
Picking benchmarks
The biggest difficulty with finding appropriate synthesis benchmarks came from a self-imposed constraint that we avoid the common practice of using proprietary circuits. If you're evaluating computing hardware or synthesis tools for internal purposes, you'll probably want to use one of your own proprietary designs so the results will reflect your design style. By using nonproprietary tests here, though, we made it possible for other people to run the same benchmarks (on other platforms, for example) and compare the results. Nonproprietary benchmarks also help ensure that no one is manipulating the tests to make one vendor look better than another. As the first installment pointed out, vendors can publicize the benchmarks that make their software or hardware look better--and conveniently fail to mention the benchmarks on which they fared poorly. Our benchmarks were strictly chosen to provide synthesis runs that were long enough to stress various aspects of the hardware and software. Predictably, that approach favored no particular PC vendor.
Nonproprietary RTL code is available from several sources. We could have created designs of our own--an especially appropriate task for the Seva Technology designers participating in the benchmark program--but that approach was too labor-intensive for the level of complexity we wanted. Other practical sources include universities, EDA vendor demonstrations, and people who post designs on the Internet for one reason or another. After informally trying many different possibilities, we settled on two designs from university sources, one from a designer who posted his freeware RTL code on the Internet, and one that co-author James Lee designed several years ago for a Cadence demo. All of the designs are available on the Internet, but bear in mind that some of them have changed since we downloaded the copies used in the benchmark tests. If you want the versions we used, get them from ISD's Web site at www.isdmag.com/edabenchmark.
Table 1 lists the statistics for each benchmark circuit so you can get an idea of how the designs might compare with your own. The values for area represent library units. The report size refers to the size of the log file created by Design Compiler, and the values for the disk space used represent the database files created by Design Compiler. The file sizes give an idea of the amount of disk activity that took place even in the CPU-intensive tests.
The RAW benchmark
Even though we weren't dealing with reconfigurable hardware in our benchmarks, the designs suited our needs quite well. Because the Game of Life design we used last time is too short for a good synthesis test, we chose another design from the RAW suite for this installment. The new design, a Data Encryption Standard (DES) module, is still fairly small, taking about seven to eight minutes to run. Our main intention with this benchmark was to check run time for a fairly small design. The DES module also made sense for the tests because its datapath and control logic use a lot of math features found in Design Compiler. The DES algorithm repeatedly applies substitution and permutation techniques, one on top of the other, for a total of 16 cycles. The software algorithm used in this benchmark was adapted from Eric Young's fast encryption package. Another reason DES made a good synthesis test was that it includes a large number of nets. Because the success of deep-submicron designs depends so much on optimizing the interconnect, we wanted to include designs that stressed the ability of the system to deal with many nets.
Crashing the TORCH
If you suspect that all of this speculative instruction execution might require a lot of hardware, you're correct. The design includes two integer execution units connected to a six-port register file, a floating-point unit with an associated register file, and separate instruction and data caches (see Figure 1). The two integer execution units have different resources. The design is available at www-flash.stanford.edu:80/torch/. The TORCH architecture provided a design that we thought would challenge Design Compiler in every way, and we were right. When we fed the entire 16,000-line design to Design Compiler as a single block, all we got back was an internal DC error message. The PC on which we were running this informal test was a 400-MHz machine with 512 Mbytes of RAM. Design Compiler held on to 800 Mbytes of swap space after the error, so we had to reboot the system. Out of a sense of devious and perhaps morbid curiosity, we ran the same test on a Sun Ultra 60 SPARCstation, also with 512 Mbytes of RAM. We achieved the same result, except that the swap space was freed after the run. If you prefer that your internal errors be civil enough to let go of swap space, you can take this experience as an endorsement of Unix--bearing in mind that the PC version of Design Compiler was a beta copy. All these considerations notwithstanding, the final result was the same. To make the TORCH benchmark more realistic, we extracted two of the modules from the design, regfile and dpath. Using those modules resulted in a combination that stressed Design Compiler without causing mortal harm.
To get a complete structured design that wasn't too big to run through Design Compiler in one chunk, we turned to a freeware design by Tom Coonan, an engineer with Scientific Atlanta. As a Verilog synthesizable model of the PIC 16C5X RISC processor, Tom's design provided a relatively small circuit that he estimates at about 1,500 equivalent gates, not including memories.
"If you want massive MIPS and sophisticated instruction sets, go look at the ARM or the Oak or commercial IP," Tom points out. "This is a simple processor that's easily comprehended and easy to work with." He also notes that the code changes on a daily basis. If you want the version used in our benchmark, go to ISD 's Web site. For Tom's latest version, see www.mindspring.com/~tcoonan/.
The disk stressor
Because such a large circuit would take hours to compile, we read in only the design and did uniquify without actually compiling. While that approach is a nonstandard use of Design Compiler, it allowed us to use a maximum amount of memory in a minimum amount of time. The rpu256 benchmark served as a good memory- and disk-intensive test, prompting the most spectacular results of any of our tests. We designed the benchmarks to compare Design Compiler's performance on NT PCs with its performance on a Unix workstation, as well as to compare the convenience of the two operating systems. By convenience we actually refer to the differences between Unix and NT that will make extra work for designers who are switching from one environment to the other. Most designers currently use Unix as their EDA platform, and almost all use Unix to run Design Compiler because the NT version is only now becoming generally available. Because EDA tools take Unix for granted (including formal parts of the operating system as well as items that are indistinguishable from the operating system), switching to NT isn't just a matter of porting the tools' code to NT. The inevitable differences between the two operating environments will to some extent inconvenience designers.
Inconveniences
In addition to challenging our goal of using NT out of the box, the Hummingbird recommendation was unappealing because we preferred using Samba for sharing files between the PCs and our Unix systems. Seva Technology has used Samba for some time. In addition to being free and working well, it provides a convenient server-side solution and makes separate PC and Unix passwords unnecessary. To use Hummingbird, we would have had to install it on each PC, but since we were only using the network to download the benchmarks from our Unix server to the PCs and not for actually running the benchmarks, we ignored the Hummingbird recommendation. As for the MKS Toolkit, the Synopsys beta notes said, "If you are relying on Unix commands in your script, we also recommend that you install MKS Toolkit 5.2 or higher." We did, in fact, rely on Unix commands in our scripts, but we avoided the requirement by translating the scripts to DOS and writing a simple program (described in the next section) to return the current time.
We should also point out that, beyond the installation notes that we've quoted here, the Hummingbird and MKS Toolkit software is on the list of minimum requirements for running Design Compiler on NT. As with all such EDA requirements, however, they really mean that this is the configuration in which the vendor has tested the software--what Synopsys terms the Qualified System Configuration. We made sure first that we had reasonable alternatives for file sharing and Unix interoperability, and then we tested our work. Another part of Design Compiler's Qualified System Configuration on the PC was potentially more challenging than the NFS and MKS Toolkit requirements. The beta notes gave the following information about NT file systems: "The Synopsys NT software is intended to be installed and run using the NTFS file system and/or network file system. Currently, we do not support the FAT file system." Even if you're not up on PC talk, you can probably figure out that NTFS is the NT file system, which provides such reliability features as transaction logs to help users recover from disk failures and access control features for directories and individual files. On the other hand, you would have to really know something about PCs to know that FAT is the file allocation table that has been the basis for the DOS and Windows file system since life first emerged from the ocean near Redmond, Wash., in 1977.
File system issues
Telling time
Although the Time/T command in NT does return the current time, it does so only to a resolution of minutes. That resolution is inadequate, even for benchmarks that run more than an hour, because the differences from one platform to another were often very slight in our tests. To overcome that problem, James wrote about four lines of C code to get the time. (He might have saved himself the trouble if we had known about Microsoft's NT Resource Kit at the time.) Coupled with the NT Date/T command to get the date, the C code allowed the benchmarks to go forward unimpeded. Those considerations will hardly affect EDA tasks, but other differences between NT and Unix will have an impact on anyone who moves from one environment to the other. Note that getting the time information in seconds was not an issue in our last round of benchmarks because Verilog-XL has built-in statistical reporting. We have since learned, however, that Verilog-XL reports CPU time on Unix but wall clock time on NT. Our error in equating those measurements probably skewed the results in favor of the SPARCstation Ultra 2 used in the first round of benchmarks. The advantage gained by the CPU time measurement would be especially noticeable in tests involving a great deal of disk I/O, and that's where the SPARCstation gave the best relative performance. In future benchmark installments, we'll correct the erroneous data by rerunning the Verilog-XL benchmarks using the new time measurement method. In the meantime, the benchmark tests in this installment provide apples-to-apples measurements that offer a compelling case for adopting the PC as an EDA platform.
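The difference between wall-clock time and CPU time is easy to see in a few lines of code. The sketch below is only an illustration in Python, not the C program James wrote; the function names `timestamp` and `measure` are made up for this example. It stamps the current time to one-second resolution and measures both kinds of time around a task, assuming a task that mostly waits (here a sleep stands in for disk I/O).

```python
import time
from datetime import datetime


def timestamp(now=None):
    """Return the date and time to one-second resolution."""
    now = now or datetime.now()
    return now.strftime("%Y-%m-%d %H:%M:%S")


def measure(task):
    """Run task() and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # CPU time charged to this process
    task()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


if __name__ == "__main__":
    print("start:", timestamp())
    # Sleeping stands in for waiting on disk I/O: it costs wall-clock
    # time but almost no CPU time.
    wall, cpu = measure(lambda: time.sleep(0.2))
    print(f"wall clock: {wall:.2f} s, CPU: {cpu:.2f} s")
    print("end:  ", timestamp())
```

A task that spends most of its time waiting shows a large wall-clock figure and a small CPU figure, which is why a CPU-time report flatters a machine on I/O-heavy tests.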
Installing the software
When installing Design Compiler, we specified the following directory: Program Files\Synopsys Inc\Synopsys beta. We quickly found that the application wouldn't run at that location. The reason for the failure resided in the script file that finds all the items needed to run the application. Because Unix file and directory names can't have spaces, scripts routinely use spaces as delimiters. Running the usual Unix start-up script on NT thus caused the software to look for each piece of the directory name as a separate location. The application went off on a fruitless search for Program, Files\Synopsys, Inc\Synopsys, and beta. By reinstalling Design Compiler in C:\Synopsys, we fixed the problem. Designers need to be aware of the differences between NT and Unix and exercise due caution. You can't assume that shell scripts proven under one operating system will work correctly on the other. That being said, once we changed the installation directory on NT, all of our libraries, design files, and scripts worked perfectly on both operating systems without modification.
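The failure mode is simple to reproduce. The following Python sketch is not the Synopsys start-up script; it only mimics a script that treats every space as a delimiter, and the helper names are invented for the illustration.

```python
# Directory names taken from the text above.
INSTALL_DIR = r"Program Files\Synopsys Inc\Synopsys beta"
SAFE_DIR = r"C:\Synopsys"


def split_on_spaces(text):
    """Mimic a script that treats every space as a delimiter."""
    return text.split(" ")


def has_unsafe_spaces(path):
    """Flag a path that such a script would break into pieces."""
    return " " in path


if __name__ == "__main__":
    for path in (INSTALL_DIR, SAFE_DIR):
        print(path, "->", split_on_spaces(path))
```

The first path falls apart into four separate "locations", while the path without spaces survives as a single item.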
Setting up the environment
On the subject of environment variables, we set up the license file in a different way than the conventional approach, in which the SYNOPSYS_KEY_FILE resides on one server. While that centralized arrangement is preferred in a normal working environment, having the license file only on one server would have forced each machine in the benchmark tests to repeatedly access the file over the network (and would have required the use of NFS or Samba, as well). Because we didn't want the vagaries of the network to influence the benchmark times, we copied the key file to each platform. Each copy of Design Compiler still had to access the server once to get a user token, but after that all license queries during the benchmark runs were handled by the local key file. We used a simple batch file to set up the runs (see the listing). Before starting a benchmark run, we disconnected all network drives to eliminate all network overhead. We then rebooted the systems to ensure that we had a clean starting point for the test. On the subject of license files, note that Synopsys made them equivalent on Unix and NT. Obtaining a Design Compiler license is therefore transparent to users on either type of platform. In contrast, Cadence made the Unix and NT licenses different for some reason, so you have to obtain the correct license for the type of platform on which you're working. In a heterogeneous environment like the one at Seva, the Synopsys approach simplifies EDA work.
In this round of benchmarks, we tested three 400-MHz PCs, a 300-MHz Sun SPARCstation Ultra 60, and two 300-MHz PCs from our first benchmark installment (see Table 2). Sun declined to participate in the benchmarks, but Intel loaned us the new SPARCstation Ultra 60. The results of the benchmarks made it clear why Intel was eager to fill the gap.
A fast start
As an introduction to the three 400-MHz PCs, it's worthwhile to look at the processor and support chips Intel released on April 15, because all three PCs offer nearly identical configurations based on the new chips. In addition to boosting the Pentium II's clock speed from 300 to 400 MHz, the new processor improves the system bus speed from 66 to 100 MHz. As for the support chips, Intel claims that its new 440BX chip set enables the 100-MHz system bus to increase peak processor data transfers to the rest of the system by 50 percent. The chip set promises to improve bandwidth among the Pentium, the Accelerated Graphics Port, 100-MHz SDRAM, and the PCI bus using enhanced bus arbitration, deeper buffers, an open-page memory architecture, and ECC memory control. The 440BX supports both 100- and 66-MHz bus speeds. All three 400-MHz PCs use the BX chip set and take advantage of the 100-MHz system bus option. The PCs are all dual-processor systems, although the Compaq machine had only one processor installed--keeping with our request for single-processor systems. We cover that discrepancy in more detail later, but in short, we believe that the number of processors made little if any difference in the single-threaded, CPU-intensive benchmark results. In future reports, we'll compare various benchmarks with one and two CPUs.
As PC designers gain more experience with a new system's architecture, the highly conservative values used in the BIOS can be relaxed in some cases to achieve better performance. During our testing, BIOS updates for the IBM and Compaq machines became available, so we flashed the BIOS on each machine. That change resulted in noticeable improvements in the performance of the two machines, and we used the faster run times in the benchmark results.
Memory and disk capacity
An interesting variable relating to the memory itself is whether the DIMMs are registered or unbuffered. In unbuffered SDRAM, because the CAS signals pass directly to the memory in the same clock cycle, the CAS latency for the DIMM equals the SDRAM CAS latency, resulting in timings of 5/1/1/1. In the registered SDRAM, the DIMMs buffer all address and control signals, which adds a cycle to the CAS latency for 6/1/1/1 timing. The buffering is needed to decrease the capacitive loading of the DIMM on the input address and control signals. The registered/unbuffered issue is important if you want more than 512 Mbytes of SDRAM in your system. At memory capacities up to 512 Mbytes, you can use 128-Mbyte SDRAM DIMMs, which can be unbuffered. To achieve higher capacities, however, you need 256-Mbyte DIMMs or larger, and they must be registered. Thus a PC with 1 Gbyte of SDRAM has an extra wait state, compared with a PC with half as much SDRAM. On the other hand, that wait state adds only one cycle to an eight-cycle burst. Additionally, the system performance will suffer from that wait state only on cache misses, so the overall effect should be relatively minor. That proved to be the case. (See the results for the IBM system below.)
All of these PCs also have RAID 0 subsystems based on two 10,000-RPM disks that use an Ultra-Wide SCSI interface. The Compaq PC differs in that its RAID striping is managed by software, whereas the IBM and HP RAIDs are managed by hardware. That difference put the Compaq system at a disadvantage in our one disk-intensive test. Remember that RAID 0 doesn't provide data redundancy but should offer a performance improvement over standard disk subsystems by alternately striping data across two (or more) drives.
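The arithmetic behind the wait-state penalty can be written out in a few lines. This Python sketch uses only the 5/1/1/1 and 6/1/1/1 timings quoted above; the `overall_slowdown` function and its `memory_fraction` parameter are a crude assumption of ours (the share of run time spent waiting on memory bursts), not a measured figure.

```python
UNBUFFERED = (5, 1, 1, 1)   # cycles per transfer in a four-transfer burst
REGISTERED = (6, 1, 1, 1)   # registered DIMMs add one cycle up front


def burst_cycles(timing):
    """Total clock cycles for one burst."""
    return sum(timing)


def burst_penalty(slow, fast):
    """Extra cycles of the slower burst as a fraction of the faster one."""
    return (burst_cycles(slow) - burst_cycles(fast)) / burst_cycles(fast)


def overall_slowdown(memory_fraction):
    """Rough overall slowdown if only memory_fraction of the run time
    is spent on memory bursts (that is, on cache misses)."""
    return burst_penalty(REGISTERED, UNBUFFERED) * memory_fraction


if __name__ == "__main__":
    print(burst_cycles(UNBUFFERED), burst_cycles(REGISTERED))
    print(burst_penalty(REGISTERED, UNBUFFERED))
```

The burst grows from eight cycles to nine, a 12.5 percent penalty per burst, and that penalty shrinks in proportion to how rarely the processor actually has to go to main memory.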
As mentioned earlier, the Compaq PC uses NTFS, whereas the HP and IBM PCs use FAT. The only major difference between the HP and IBM RAID subsystems is that the former used two 4-Gbyte Seagate Cheetah drives and the latter used two 9-Gbyte Cheetah drives. The smaller drives apparently gave the HP system a slight edge on disk-intensive tasks.
The test setup
We didn't try to take advantage of that optimization capability in these tests, however, because the goal was to simply measure run times. The scripts used to run the tests were therefore straightforward. The basic sequence was:
Note that, as in our first benchmark installment, we didn't evaluate graphics performance on any of the platforms. Like most Synopsys users, we ran the benchmarks in batch mode. We ran each benchmark three times on each platform to ensure that the results were consistent. That was a good practice, because we discovered that the results were often inconsistent. Sometimes the second and third runs were slightly faster than the first one--but not always. The reason lay in Design Compiler's cache. After a run, the application saves compiled DesignWare components, which are then reused on subsequent runs to provide a speed advantage. We didn't see that effect on every series of benchmark runs, because sometimes we ran a benchmark on a platform informally before starting the actual test series. In those cases, the cache's speed advantage benefited every run in the series. The best way to deal with the cache effect would be to delete the cache before every run. Because we discovered this factor late in our testing, though, we compensated by using the shortest run on each platform for comparison purposes. The results of some of the test runs exhibit the benefit of the Design Compiler cache (see Figure 2).
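The shortest-run rule can be stated precisely in code. The sketch below is an illustration in Python with made-up run times; none of the numbers come from our results, and the function names are hypothetical.

```python
def representative_time(run_times):
    """Pick the shortest run of a series, as described above."""
    return min(run_times)


def cache_speedup(run_times):
    """Fraction by which the best later run beat the first run.

    A value near zero suggests the cache was already warm (or made no
    difference); a positive value suggests the first run paid the cost
    of filling it.
    """
    first, later = run_times[0], run_times[1:]
    return (first - min(later)) / first


if __name__ == "__main__":
    # Hypothetical series of three runs, in seconds.
    runs = [452.0, 440.0, 441.0]
    print(representative_time(runs))
    print(f"{cache_speedup(runs):.1%}")
```

Taking the minimum removes the cold-cache penalty from whichever run happened to pay it, at the cost of reporting a best case rather than an average.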
PCs are the clear winners
It's not surprising that the 400-MHz PCs all performed at about the same level on the CPU-intensive tests, given that they all have about the same configuration (see Figure 3). To highlight the tiny differences among the PCs, you can view the same results as a percentage of the fastest system on each of the tests (see Figure 4). Despite the fact that the 400-MHz PCs always performed to within 1.5 percent of each other on the CPU-intensive tests, the results may hide some deviations. Late in our benchmark work, for example, someone brought to our attention the fact that both the L2 cache and main memory have their own ECC functions. Turning off either of those ECC functions improves memory bandwidth and thus allows applications to run faster. While turning off the main-memory ECC would be risky, the rapid data turnover in the cache makes that memory less vulnerable to errors.
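Normalizing run times to the fastest system is a one-line calculation. The Python sketch below assumes that "percentage of the fastest system" means the fastest time divided by each system's time, so the fastest machine scores 100; the system names and times are placeholders, not our measurements.

```python
def percent_of_fastest(times):
    """Map each system's run time to a percentage of the fastest system.

    times: dict of system name -> run time in seconds.
    The fastest system scores 100.0; slower systems score less.
    """
    fastest = min(times.values())
    return {name: 100.0 * fastest / t for name, t in times.items()}


if __name__ == "__main__":
    # Placeholder figures for illustration only.
    sample = {"PC A": 400.0, "PC B": 404.0, "PC C": 406.0}
    for name, pct in percent_of_fastest(sample).items():
        print(f"{name}: {pct:.1f}%")
```

On this scale, differences of a percent or two that vanish in a bar chart of raw times become visible.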
A Hewlett-Packard spokesperson observed that the "chances of an error happening in 512-kbyte cache, in particular one resulting in a problem, are exceedingly small." HP decided that any extra protection afforded by the cache ECC was not worth the performance penalty, so the company disabled the function in our test system and plans to ship the Kayak XU that way to customers. To check the effects of that choice, we tested the IBM Intellistation with and without its cache ECC enabled and found that the system's performance changed by about 2 to 3 percent on the CPU-intensive benchmarks. With the small differences among the 400-MHz PCs on these benchmarks, a 2 to 3 percent advantage is significant. The cache ECC is enabled by a register setting in the BX chip set, and the BIOS controls the setting of that register bit. Should the L2 cache ECC be on or off? According to an Intel spokesperson, "Intel takes the reliability of its products seriously, especially for what we call business-critical computing. In these applications, it's unacceptable for the system to produce an undetected wrong result. We strongly recommend that L2 ECC be enabled. This is the default, supported configuration for OEM-branded workstations and servers."
Are two processors faster?
In most of the benchmarks, the overhead of managing the two processors probably equaled or even exceeded the contribution of the second processor. On the disk-intensive test, the ability to offload I/O tasks to the second processor could have been significant, especially in light of Compaq's use of software-managed RAID. On the other hand, the CPU utilization was around 7 percent during the disk-intensive test, so it was not as if the single processor was being overtaxed. In fact, we might speculate that the IBM and HP systems suffered a performance penalty from the overhead associated with running dual processors on single-threaded tests. From that perspective, the Compaq PC could have lost a couple of percent on performance as a consequence of having the L2 cache ECC enabled and gained a couple of percent from avoiding the two-processor overhead. Our conclusion: It was all just a wash.
Hardware or software RAID?
On the memory-intensive test (the rpu256), which forced a lot of paging to disk, the benchmark times for the Compaq system with software-based RAID were only slightly better than those for a 300-MHz system with no RAID. Compared with the software-based RAID, hardware-based RAID gave the IBM and HP PCs about a 50 percent advantage--exactly what Adaptec claims for disk-intensive applications.
Adaptec uses two approaches to accelerate the RAID subsystem. First, a RAID coprocessor board employs an on-board BIOS and SRAM for RAID command execution. Unlike NT's software approach, the hardware can create a bootable array that stripes the operating system files (as well as other disk files) to speed up NT's performance. Second, the Adaptec technology provides a disk cache on the RAID coprocessor board in addition to the caches already maintained in the disk drives and the PC's main memory. Adaptec explains that the NT cache in main memory accumulates a lot of dirty blocks during heavy disk I/O. The coprocessor cache moves those dirty blocks out of main memory faster because the data transfers to the coprocessor board at PCI speeds rather than waiting on the SCSI channel all the way to disk; the coprocessor transfers data to disk faster by applying elevator sorting, which minimizes disk head movement. The coprocessor cache also prefetches sequential data blocks to make them available to the main memory cache at PCI speeds.
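Elevator sorting is a textbook technique, and a small example shows why it cuts head movement. The Python sketch below is a generic one-sweep elevator ordering, not Adaptec's implementation, which we have not seen; block numbers stand in for head positions, and the request list is an arbitrary example.

```python
def elevator_order(requests, head):
    """Order block requests the way a simple elevator sweep would:
    service everything at or above the head position on the way up,
    then everything below it on the way back down."""
    up = sorted(r for r in requests if r >= head)
    down = sorted((r for r in requests if r < head), reverse=True)
    return up + down


def head_travel(order, head):
    """Total distance the head moves when servicing requests in order."""
    total = 0
    for block in order:
        total += abs(block - head)
        head = block
    return total


if __name__ == "__main__":
    pending = [98, 183, 37, 122, 14, 124, 65, 67]  # arrival order
    start = 53
    print("arrival order travel:", head_travel(pending, start))
    swept = elevator_order(pending, start)
    print("elevator order:", swept)
    print("elevator travel:", head_travel(swept, start))
```

For this request list, servicing in arrival order moves the head 640 blocks, while the elevator order moves it 299, which is the kind of saving a controller with its own cache has room to exploit.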
How the 300-MHz PCs fared
On the CPU-intensive benchmarks, though, the performance difference between the 400- and 300-MHz PCs is directly proportional to clock frequency (see Figure 6). Apparently, the other enhancements in the 400-MHz Pentium II and the BX chip set only make it possible for the increase in clock frequency to tell its tale.
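If performance scales directly with clock frequency, run time scales inversely with it. The Python sketch below expresses that relationship; the 480-second figure is an arbitrary example, not one of our results, and the function name is made up for the illustration.

```python
def predicted_time(measured_seconds, measured_mhz, target_mhz):
    """Predict run time at target_mhz from a run at measured_mhz,
    assuming run time is inversely proportional to clock frequency."""
    return measured_seconds * measured_mhz / target_mhz


if __name__ == "__main__":
    # Arbitrary example: a 480-second run on a 300-MHz machine.
    print(predicted_time(480.0, 300, 400))
```

Under that assumption, a 400-MHz machine finishes in three-quarters of the time a 300-MHz machine needs, so a 480-second run drops to 360 seconds.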
The extraordinary value of memory
The results were stunning (see Figures 7 and 8). Compared with the PCs having 512 Mbytes of SDRAM and hardware-based RAID, the system with 1 Gbyte of memory slashed the run time in this test about fivefold. Clearly, if you are going to run large designs through Design Compiler, you'll get more bang for your buck by buying more memory for a 300-MHz PC rather than upgrading to a faster processor.
The test used the rpu256 benchmark. It turned out to be ideal for showing off the benefits of the larger memory, because the benchmark required more than 800 Mbytes of memory--enough to cause lots of disk paging in the 512-Mbyte systems but not too much for the 1-Gbyte system. The data in Figure 7 also help answer the question of whether Windows NT is ready for the EDA big time. We explained earlier that the way we measured test times in our previous set of benchmarks may have skewed the results in favor of the SPARCstation, and the benchmark results in this installment confirm that suspicion (see Figure 8). The SPARCstation Ultra 60 consistently lagged behind even the 300-MHz PCs in the synthesis benchmarks, and the performance difference was substantial (see Figure 9). The SPARCstation's shortfall was especially profound on the disk-intensive test, in which the Unix system seemed to excel previously. By running some of the simulation benchmarks we used last time, we verified that the Ultra 60's performance was consistent with that of the Ultra 2 in our first installment. If you are buying an EDA workstation based strictly on performance, the PC is the clear choice. With 400-MHz PCs expected to sell for around $8,000 with dual processors, 512 Mbytes of SDRAM, and hardware-based RAID, the Ultra 60's $26,000 price tag seems a bit on the high side. Of course, you have to consider many other factors when choosing EDA workstations (see "Windows NT: Adding to the EDA Arsenal"). Our experience with setting up functions, such as networking and a myriad of other details under Windows NT, also leads us to say that the operating system can be annoying if you're used to Unix. Still, the performance of Design Compiler on the NT PCs we tested was compelling. Pay a little extra for all the SDRAM you can get and the PC will reward your big synthesis runs with stellar performance. 
The authors wish to thank Ramprasad Rangarajan for his expert assistance in conducting the synthesis tests. Ram is project manager at Seva Technologies, Inc. in Fremont, Calif., and his many years of experience with Synopsys Design Compiler helped make the benchmark tests possible.
James Lee is a senior consulting engineer at Seva Technologies. He has 12 years' experience working with Verilog and was one of the first employees at Gateway Design Automation, which developed Verilog. Prior to joining Seva, he was with Cadence Design Systems. He is the author of Verilog Quickstart and is a part-time instructor in Verilog at the University of California at Santa Cruz.
Bob Peterson is a freelance writer based in Monterey, Calif. Formerly the assistant managing editor of EDN magazine, he has written on a wide variety of technical topics for many publications and companies for the past 16 years.
To voice an opinion on this or any Integrated System Design article, please email your message to miker@isdmag.com.
Integrated System Design, July 1998. Copyright © 2000 Integrated System Design.