Special Section
The value of the PC as an EDA platform has been debated since the first PCs hit the street and the first simple schematic capture package running under DOS made its debut. That was some 20 years ago, give or take a couple of years, and we've come a long way since then. Today, Pentium II and Pentium Pro platforms are available, offering a choice of one to four processors running Windows NT at 300 MHz, with a half gigabyte of RAM, Wide Ultra SCSI hard disk drives that provide several gigabytes of storage (and if one drive isn't enough, you can use RAID), and your choice of fast graphics cards, with Accelerated Graphics Port (AGP) quickly becoming available.
Nevertheless, the debate goes on and Unix remains the undisputed leader in the EDA industry. According to Collett International, Inc. in Santa Clara, Calif., a well-known consulting firm in the EDA market, over 65 percent of all the design teams in North America are currently using Unix. According to data from the EDA Consortium (San Jose), Unix-based tools garnered more than 90 percent of new license revenues in the third quarter of 1997, compared with less than 10 percent for all DOS, Windows 3.x, Windows 95, and Windows NT tools combined. On the hardware side, Sun Microsystems, Inc. (Palo Alto, Calif.) enjoys the premier position, accounting for more than 40 percent of the workstations shipped in 1996, according to International Data Corp. in Framingham, Mass.
A world of opportunity
Despite Unix's entrenched advantage and Sun's strong market position, there are several compelling reasons to consider Intel/Windows NT-based platforms. The Pentium II and Pentium Pro processors allow scalable systems and offer performance that matches, and in some cases exceeds, the performance of SPARC processors. Furthermore, Windows NT is a robust, scalable operating system, and EDA tool vendors have been aggressively porting their tools to it. With Windows NT, the world of PC software is open to a designer, and it's no longer necessary to switch between a technical workstation for EDA tasks and a PC for ordinary office tasks such as e-mail, spreadsheets, and word processing. Windows NT-based machines are considerably less expensive than Unix-based machines: A study of business users by Deloitte & Touche, for example, reports cost savings of nearly 40 percent for workstations running Windows NT compared with those running Unix. Furthermore, future platforms will be based on the IA-64 architecture and the 64-bit version of Windows NT, which should provide an enormous jump in performance.
Because of those advantages, the growth in Windows 3.x, Windows 95, and Windows NT tools has been dramatic--roughly 160 percent between 1996 and 1997, according to data from the EDA Consortium. (Although the percentage growth is dramatic, the numbers are still small, with new license revenues of only $39 million in the third quarter of 1997.) But the pace will continue and even accelerate, and Collett International predicts that Windows NT-based EDA software revenues will equal those of Unix-based EDA software in 2001. The migration from Unix-based platforms to Intel/Windows NT-based platforms isn't without its challenges, however. They involve ease of use, data and application migration, interoperability, staffing and training, networking, platform stability, and functionality that lags behind Unix. But over and above those challenges, the fundamental issue of platform performance still needs to be resolved. Quite simply: Do Pentium-based workstations measure up to the challenges of EDA?
Setting up the benchmark
In a world void of budget constraints, we would have acquired workstations for benchmarking the same way you would acquire workstations for your design team--we would have bought or leased them. Since buying a half-dozen Pentium II workstations was never really an option, and leasing the specific workstations we wanted to benchmark proved too difficult, we asked the vendors participating in the benchmark to supply us with the units we needed. Those vendors were Compaq, Dell, Gateway, Hewlett-Packard, and IBM. We specified these units at 300 MHz, with a single Pentium II processor, 512 kbytes of level-two cache, and 512 Mbytes of RAM. The memory type--EDO or synchronous DRAM--was not specified.
Although we didn't specify the type of disk drive (EIDE or SCSI), the graphics controller, the monitor, or the network interface card, we did point out that these machines were going to be benchmarked as EDA workstations and left it up to the individual vendors to outfit them as they deemed appropriate for this type of application (see Table 1). Since our objective was to benchmark the machines only to determine their computational performance, we didn't concern ourselves with the graphics or peripheral performance. As it turned out, however, we should have paid more attention to I/O and disk drive performance because they had an impact on overall performance in a couple of our benchmark cases.
Obtaining the machines from the vendors has its disadvantages, because the machines might not all have the same "backgrounds." In other words, a machine we received could have been factory-fresh and right off the production line, or it could have a questionable ancestry because it was used as a demo machine and bounced around from site to site. Or it could have been carefully tweaked by the vendor because it was going to be used in a benchmark. Whereas the first two scenarios are realistic and would duplicate your experience, depending on whether you bought or leased a brand new workstation or leased a used one, the last scenario is anything but representative of the real world. As far as we know, none of the workstations used in this benchmark received any special attention. We do know that at least one of them was used as a demo model.
Five Pentium II machines running Windows NT were benchmarked at Seva Technologies, Inc. in Fremont, Calif.: the Compaq 5100, Dell Workstation 400, Gateway E5000, Hewlett-Packard Kayak, and IBM Intellistation. We were able to account for the differences we observed among the first four machines, and on the whole, those four vendors agreed with our analysis. Unfortunately, Sun would not provide us with a 300-MHz Ultra 2 (based on the UltraSPARC II processor) to use in the benchmarks, and it was literally impossible to lease one because of the high demand for those workstations. Our solution was to use a machine that Compaq itself had used in a series of benchmarks.
The benchmarks
There are a couple of criteria that need to be applied when selecting benchmark suites. The first is that the benchmark suite should, ideally, exhaustively exercise the hardware you're benchmarking. Because you can configure a PC using a virtually unlimited array of hardware in the form of disk drives, graphics cards, and network interface cards, our benchmark was restricted to exercising the basic Pentium II and memory architecture of the machines using benchmark suites that were computation- and memory-intensive. The most significant differences we observed between the benchmark suites, in fact, occurred when the benchmark involved significant writes to disk (which we hadn't taken into account).
The second criterion is that the benchmarks should be readily available and, ideally, in the public domain. Unfortunately, good EDA benchmark suites or circuits are hard to come by, so we used a mix of proprietary, vendor, and publicly available benchmarks. The proprietary benchmark was a behavioral memory model developed by Seva for a 128-Mbit virtual channel SDRAM. (For details about the data size of this model and the other models used, see Table 2). The benchmark actually involved some significant disk writes (about 120 Mbytes) and was one that clearly revealed some differences among the disk drives used in the benchmarked platforms. The vendor benchmark was a basic RISC processor used for demonstration purposes and developed by author James Lee several years ago while he was at San Jose-based Cadence Design Systems. The design comprises about 5,000 gates, and the benchmark consisted of simulating the design at the gate and behavioral levels and as a mixed gate- and behavioral-level simulation, in which the ALU was simulated at the behavioral level (the ALU represented about 1,600 gates). This design was then replicated eight times (to yield approximately 40,000 gates), 160 times (to yield approximately 800,000 gates), and 256 times (to yield approximately 1.3 million gates); and the same gate-level, behavioral, and mixed simulations were run. (For convenience, the four benchmarks are referred to as 5k, 40k, 800k, and 1.3M RISC.) You can obtain the Verilog code for this design directly from Integrated System Design's Web site ( www.isdmag.com/edabenchmark ). The public-domain benchmark we used is called Life and was developed at MIT as part of its Reconfigurable Architecture Workstation (RAW) project. The RAW benchmark suite was designed to facilitate the comparison, validation, and improvement of reconfigurable computing systems. 
With that objective in mind, the benchmarks were designed to be portable to any reconfigurable computer as a behavioral Verilog netlist, small and easy to understand, and parameterizable to generate designs that would consume a range of hardware resources. Each benchmark was designed in both C and behavioral Verilog.
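For readers unfamiliar with the Life benchmark, the update rule it implements (stated in the footnote at the end of this article) can be sketched in a few lines of Python. This is only an illustration of the rule, not the RAW suite's C or Verilog source; the function name `life_step` and the unbounded-grid representation are assumptions made here for brevity, whereas the benchmark itself works on a fixed array of elements.

```python
from collections import Counter


def life_step(alive):
    """Advance Conway's Game of Life by one generation.

    `alive` is a set of (row, col) tuples for the living cells.
    The grid is treated as unbounded (an assumption of this sketch).
    """
    # Count, for every cell adjacent to a living cell, how many living
    # neighbors it has among the eight surrounding positions.
    neighbor_counts = Counter(
        (r + dr, c + dc)
        for (r, c) in alive
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # A dead cell with exactly three living neighbors becomes alive;
    # a living cell survives with exactly two or three living neighbors.
    return {
        cell
        for cell, n in neighbor_counts.items()
        if n == 3 or (n == 2 and cell in alive)
    }


# A horizontal "blinker" flips to vertical and back again.
blinker = {(1, 0), (1, 1), (1, 2)}
next_gen = life_step(blinker)
```

In hardware, every cell applies this rule in parallel on each clock, which is why the number of elements translates so directly into gate count and simulation data-structure size.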
The 34 benchmarks in the RAW suite run the gamut of algorithms found in general-purpose computing. They include source code and synthesized netlists in the range of a few thousand to a million gates for binary heap, bubble sort, merge sort, DES encryption, integer FFT, Jacobi relaxation, integer matrix multiplication, and Conway's Game of Life,* as well as netlists (only) for shortest path, multiplicative shortest path, and transitive-closure graph problems. The Life benchmark, which was chosen because of the size of its data structure, implements Conway's Game of Life program in FPGA hardware. Five versions of Life are contained in the suite consisting of 1, 6, 32, 48, and 128 elements. For our benchmark, we used Life48 and Life128, which represent designs of approximately 354,000 and 976,000 total gates, as detailed in the RAW documentation. It should be noted, though, that author James Lee estimates the total gates for the two designs at closer to 225,000 and 600,000. Additional details and the necessary downloadable Verilog code can be found on the RAW Web site (http://www.cag.lcs.mit.edu/raw). We ran the benchmark cases consecutively in batch mode, and we ran each case three times to obtain an average. We ran the 40k RISC design concurrently with a second 40k RISC design and then repetitively and concurrently with the 800k RISC. We chose the number of 40k RISC iterations to complete in the same amount of time estimated for the 800k RISC. We ran the two concurrent benchmarks in the foreground and the background, with the simulation running in the foreground, given the maximum foreground performance boost under Windows NT. We took this approach under the assumption that a typical user would be running the simulations in the background and would want to give any foreground applications, such as coding or word processing, priority. 
We also ran the benchmarks with the Pentium II-based workstations running Windows NT connected to a Unix server to get a feel for using this heterogeneous environment. Figures 1 and 2 present the raw data for the simulation benchmarks. Because the results plotted in this straightforward manner don't show the differences in performance adequately, we've also plotted the performance of each workstation, including the Sun Ultra 2, normalized to the performance of the Compaq 5100 (see Figures 3 and 4). The results for the concurrent simulations are shown in Figure 5.
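The averaging and normalization steps described above amount to simple arithmetic, sketched below. The machine names and timings are placeholders invented for illustration, not measurements from our benchmarks, and the choice to express relative performance as baseline time divided by machine time (so that higher means faster) is an assumption of this sketch.

```python
from statistics import mean


def average_runs(run_times):
    """Average the repeated runs (in seconds) recorded for each machine."""
    return {machine: mean(times) for machine, times in run_times.items()}


def relative_performance(avg_times, baseline):
    """Express each machine's performance relative to the baseline machine.

    A value above 1.0 means the machine finished faster than the baseline.
    """
    base = avg_times[baseline]
    return {machine: base / t for machine, t in avg_times.items()}


# Placeholder timings (seconds) for three runs each; not real results.
runs = {
    "baseline": [100, 110, 90],
    "machine_a": [80, 80, 80],
    "machine_b": [200, 200, 200],
}
averages = average_runs(runs)
normalized = relative_performance(averages, "baseline")
```

With these placeholder numbers the baseline scores 1.0, machine_a 1.25, and machine_b 0.5.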
The data in all of these charts (and the results obtained from running the benchmarks in our Unix server/NT client configuration, which aren't shown) can be summarized as follows:
After carefully reviewing the architectures and configurations of the workstations, the results, including the exceptions, can be accounted for by the differences in the memory controllers and memory types used and by the differences in I/O performance and disk drives.
The overall differences in performance, except for the difference due to I/O performance, are most likely attributable to differences in memory architectures, and in particular, memory controllers. The Compaq 5100 uses two memory controllers from Reliance Computer Corp.; the IBM, Hewlett-Packard, and Gateway workstations use Intel's 440LX AGPset chip set; and the Dell workstation uses Intel's 440FX PCIset. The Compaq 5100, like the other members of Compaq's workstation line, uses what Compaq calls a Highly Parallel System Architecture to maximize system bandwidth. Most Pentium II workstations running Windows NT support two CPUs for concurrent instruction processing, but the overall system bandwidth is limited because each CPU must compete for access to critical subsystems, such as memory and I/O. Compaq's parallel architecture addresses the need for greater overall system bandwidth by using dual memory controllers and dual-peer PCI buses. The dual memory controllers process memory requests in parallel, significantly increasing overall memory bandwidth. Each of the memory controllers has a bandwidth of 533 Mbytes/s, and with the memory distributed equally between two DRAM banks, the total memory bandwidth increases to 1.07 Gbytes/s. In contrast, the other Pentium II workstations with only one memory controller can offer memory bandwidths of only 533 Mbytes/s.
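The 1.07-Gbytes/s figure follows directly from doubling the per-controller bandwidth, as the short calculation below shows. The function name is hypothetical, and the result is a peak figure that assumes memory requests are spread evenly across both controllers; sustained bandwidth in a real workload would be lower.

```python
def aggregate_bandwidth(per_unit_mbytes_s, units):
    """Peak aggregate bandwidth when identical units operate in parallel."""
    return per_unit_mbytes_s * units


# Two memory controllers at 533 Mbytes/s each, versus a single controller.
dual_controller = aggregate_bandwidth(533, 2)
single_controller = aggregate_bandwidth(533, 1)

# Convert to Gbytes/s using 1 Gbyte = 1,000 Mbytes and round to two places.
dual_gbytes_s = round(dual_controller / 1000, 2)
```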
Because of the increased memory bandwidth, Compaq opted to use EDO DRAM rather than the somewhat more expensive, albeit faster, SDRAM. With SDRAM, the Compaq 5100 may have achieved even higher performance. Compaq also uses dual-peer PCI buses in its workstations to increase system I/O bandwidth. A single PCI bus provides an I/O bandwidth of 133 Mbytes/s, which must be shared by peripherals such as the graphics controller and hard disk drive. With dual-peer PCI buses, each bus can provide peak bandwidth in parallel with the other controller, allowing twice the bandwidth of single-bus architectures or an aggregate I/O bandwidth of 267 Mbytes/s. In contrast to Compaq's use of a non-Intel memory controller, the Gateway, Hewlett-Packard, and IBM workstations use the Intel 440LX AGPset. The 440LX is the first in a series of AGP chip sets that, according to Intel, are intended to optimize the performance of the Pentium II processor for business and home users. The 440LX extends the system bandwidth to the graphics controller, complementing Intel's Double Independent Bus architecture, which consists of a processor-to-cache bus and a processor-to-memory bus. It optimizes the system bandwidth and concurrency through the implementation of Quad Port Acceleration (QPA) which provides four-port concurrent arbitration of the processor bus, graphics, PCI bus, and SDRAM memory subsystem. AGP gives PCs the head room to handle memory-intensive 3D graphics applications. Had we benchmarked performance that was more graphics-oriented, rather than computation- and memory-oriented, it's not clear that the Compaq 5100 would have shown an advantage over the Gateway, Hewlett-Packard, or IBM workstations. This is a difficult call, however, because the 440LX is balanced to optimize graphics performance, the PCI bus, and the memory subsystem. 
On the other hand, the Compaq machine has a dual-memory controller to increase memory bandwidth and dual-peer PCI buses to increase the bandwidth for PCI devices, but it doesn't offer AGP for graphics.
The 440FX PCIset is an integrated chip set intended to deliver Pentium II and Pentium Pro processor performance to mainstream business systems at an "affordable price," according to Intel. The 440FX consists of three components: a PCI bridge and memory controller, a data bus accelerator, and a PCI-ISA-IDE accelerator. The memory controller supports FPM (fast page mode), EDO, and burst EDO DRAM, but not SDRAM. The worst performer in our benchmarks, the Dell Workstation 400, was the only machine using that controller, although the Compaq 5100 also uses EDO RAM. It's our opinion that the combination of the controller and EDO RAM negatively affected the Dell workstation's performance. (Dell does, however, make workstations with the 440LX chip set, but we don't know if they also use EDO RAM.)
Impact of I/O on performance
Although performance variations caused by differences in memory architectures and memory types are readily apparent from the data for overall simulation time, accounting for some of the exceptions requires a closer look at the three elements that make up the total simulation time--compilation time, link time, and run time. Though differences in compilation time don't significantly affect overall simulation time, the differences in link and run times do, because they each involve different amounts of I/O and writes to disk.
Let's first examine the performance of the Pentium II machines and the Sun Ultra 2 for the behavioral memory model. This simulation involved a data dump of about 130 Mbytes for each simulation, and the I/O and disk drive performance played a major role in determining the benchmarked performance. The Gateway E5000 and HP Kayak had faster disk drives than the Compaq 5100 and therefore showed better overall performance on the behavioral memory benchmark. The IBM Intellistation also showed performance comparable to that of the Compaq machine for the 800k RISC gate-level simulation and superior to that of the Compaq machine for the 1.3M RISC gate-level simulation. In these two cases, the data structure size (approximately 524 Mbytes and 841 Mbytes) exceeded the 512 Mbytes of memory available to the simulation and required some paging to disk. The RAID used in the IBM workstation (the only one so equipped) appears to have given it an advantage.
What's especially interesting, however, is that for the behavioral memory benchmark, the Sun workstation "blew away" all of the Pentium II machines, including the Compaq 5100. The reasons start to become clear when the link time and the actual simulation run time are broken out for the Compaq 5100 and Sun Ultra 2 (see Figure 6a). Because the link process for the behavioral memory model is essentially computation- and memory-intensive and accounts for only a small portion of the total simulation time, the differences in computational and memory performance between the Compaq 5100 and the Sun Ultra 2 aren't significant. But the difference in run time, which involves a large data dump to disk, gives the Sun machine a major advantage--roughly two times--over the other machines.
RISC and Life analysis
Although less dramatic, the difference is also apparent in the RISC and Life benchmarks, where some paging to disk is required because of the large data structure sizes.
For these cases, especially in the gate-level simulations, the performance advantage of the Compaq machine lay in its superior computational and memory performance or memory allocation during linking. To look at I/O performance and the effect of the disk drive on the benchmark performance, we swapped the 7,000-rpm drive used in the Compaq machine with a 10,000-rpm drive. We also created two partitions on the disk, one NTFS and one FAT, to evaluate NTFS versus FAT. The benchmark performance improved somewhat with the faster disk, and the FAT was slightly faster than NTFS for the data dump. The observation that NTFS is slower than FAT was also supported by some experiments on the IBM RAID, in which we were unable to make it write any faster than the internal IDE disk (which was FAT). Another interesting observation involves the 1.3M RISC gate-level simulation. Verilog-XL uses the full (peak) virtual memory during linking when it builds the data structure. About 90 percent of the memory then goes unused if the design is gate-level. When comparing link times, the IBM was the fastest among the Pentium II machines, with a link time of about 1,000 seconds (see Figure 6b) and a simulation time of about 5,500 seconds. But the Sun Ultra 2 came in with a link time of less than 300 seconds and a simulation time of about 4,800 seconds. Again, it was a big winner because of the superior I/O performance.
Since the Sun Ultra 2, the Compaq 5100, and all but one of the other Pentium II machines were equipped with Wide Ultra SCSI disk drives (remember that the IBM Intellistation used an EIDE drive for its first level of mass storage), the Sun may have performed better because of caching algorithms and the processor-memory-I/O interface. According to Sun, Solaris uses two special algorithms, silo and elevator, to speed disk writes. Silo tags the FIFO buffer queue with "high watermark" and "low watermark" thresholds. The buffer continues to accept I/O requests from the application program until the high watermark on the I/O buffer is reached. The buffer is then drained until the low watermark is reached and then restarts the application and lets it continue sending I/O to the operating system. Instead of writing all of the I/O in the queue in the right order or in exactly the order in which it's sent, the elevator algorithm looks at where the blocks reside on the disk and schedules which blocks are written at which time. The algorithm first schedules all of the writes moving in one direction and then schedules the writes moving in the reverse direction. Throughput is increased because it takes less time to move the head one track at a time in the same direction than to reverse the direction. Another advantage the Sun Ultra 2 may have is in the way information is transferred from memory to I/O. Sun interfaces the SCSI controller and the memory using its UltraSPARC Port Architecture (UPA). The UPA is based on a crossbar switch with a bandwidth of 100 MHz and results in a total throughput of 1.6 Gbytes/s. In addition, the bus between the processor and the memory in a Pentium II machine is the same as the bus between the processor and I/O, namely, a 33-Mbyte/s PCI bus. Here, Sun also uses its 100-MHz UPA crossbar switch. 
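The two write-scheduling ideas described above can be sketched as follows. This is a simplified model built only from the description in the text, not Solaris's actual implementation; the names `elevator_order` and `SiloBuffer`, the use of plain block numbers, and the queue-length thresholds are all assumptions made for illustration.

```python
def elevator_order(pending_blocks, head_position):
    """Order writes as one upward sweep followed by one downward sweep."""
    upward = sorted(b for b in pending_blocks if b >= head_position)
    downward = sorted(
        (b for b in pending_blocks if b < head_position), reverse=True
    )
    return upward + downward


class SiloBuffer:
    """Write queue that stalls the application between two watermarks."""

    def __init__(self, low, high):
        if not 0 <= low < high:
            raise ValueError("need 0 <= low < high")
        self.low = low
        self.high = high
        self.queue = []
        self.accepting = True

    def submit(self, block):
        """Queue a write; return False if the application must wait."""
        if not self.accepting:
            return False
        self.queue.append(block)
        if len(self.queue) >= self.high:
            # High watermark reached: stop accepting until drained.
            self.accepting = False
        return True

    def drain(self, head_position):
        """Write blocks in elevator order down to the low watermark."""
        ordered = elevator_order(self.queue, head_position)
        count = max(0, len(ordered) - self.low)
        written, self.queue = ordered[:count], ordered[count:]
        self.accepting = True  # the application may resume
        return written


buf = SiloBuffer(low=1, high=4)
accepted = [buf.submit(b) for b in (50, 10, 90, 30, 70)]
written = buf.drain(head_position=40)
```

Here the fifth request is refused because the high watermark of four queued blocks has been reached, and the drain then writes blocks 50 and 90 on the upward sweep and block 30 on the way back, leaving block 10 queued at the low watermark.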
What's still not clear is how much of the Ultra 2's advantage in I/O performance is due to its hardware architecture and how much is due to the Solaris operating system and the disk write algorithms it implements. Either way, it's clear that if your simulations require more than 512 Mbytes of memory and you have to page to disk (or need to perform large data dumps), you'll fall off a higher cliff with Windows NT-based machines than with Unix-based machines running Solaris.
Vendor benchmarks
It's common practice for a vendor to run a benchmark that measures its own products against its competitors' products. If the vendor's hardware or software doesn't match the competitors', the benchmark never sees the light of day. But if the benchmark comes out in the vendor's favor, it's almost certain that it will be used in the vendor's promotional materials and sales presentations. A vendor's benchmarks should be viewed with skepticism because they are, after all, self-serving and not completed under independent, unbiased supervision. Nevertheless, they can provide some additional insight if they agree with independent benchmarks. With that view in mind, we looked at several benchmarks that Compaq ran to compare its 5100 workstation under Windows NT with several other workstations, in particular, the Sun Ultra 2 (Model 2300) running Solaris.
Compaq benchmarked several EDA packages. Among them were three simulation packages that are highly computation-intensive and so come closest to the benchmarks that we ran. These are VCS from Viewlogic Systems, Inc. in Marlboro, Mass.; Verilog-XL from Cadence Design Systems; and QuickHDL from Mentor Graphics Corp. in Beaverton, Ore. We ran most of our Verilog-XL benchmarks singly, whereas Compaq ran four simulation jobs concurrently and looked at both single- and dual-processor workstations.
Gate-level simulation
The benchmark Compaq used consisted of a gate-level Pentium chip set (three chips of approximately 120,000 gates each) and included a bus-functional Pentium model. A bus-cycle simulation of the Pentium chip set was performed in regression mode.
For the VCS benchmark, the Pentium-based workstations, including the Compaq 5100, were configured with only 128 Mbytes of RAM, whereas the Sun Ultra 2 had 256 Mbytes. The compiled executable size was 12.4 Mbytes on the Compaq 5100 and 19.9 Mbytes on the Sun Ultra 2. The Windows NT machine had peak memory usage (including the operating system) of 75 Mbytes, and the machine running Solaris had 80 Mbytes. For both the single- and dual-processor machines, the Compaq 5100 was roughly 10 to 20 percent slower than the Sun Ultra 2 (see Figure 7a).
For the Verilog-XL benchmark, all the workstations were configured with 512 Mbytes of RAM. Each of the concurrent simulation jobs used 112.5 Mbytes of memory for the data structure, and the total memory usage was 450 Mbytes in both the Compaq and Sun workstations. In the single-CPU configuration, the Ultra 2 was about 5 percent faster than the 5100, and in the two-CPU configuration about 10 percent faster (see Figure 7b).
For the QuickHDL benchmark, all of the workstations were again configured with 512 Mbytes of RAM. Both the Compaq 5100 and the Sun Ultra 2 used 832 Mbytes of memory (including virtual memory). In this case, the 5100 was about 7 percent faster than the Ultra 2 in the single-CPU configuration but about 2 percent slower in the two-CPU configuration (see Figure 8). These results for a single-CPU configuration agree with our observations that the Ultra 2 was about 10 percent faster, on average, than the Compaq 5100.
Some conclusions
It's clear from our benchmarks, as well as those done by Compaq, that 300-MHz Pentium II-based platforms are a match for 300-MHz UltraSPARC II-based workstations and have the horsepower for EDA applications, including those as computation- and memory-intensive as Verilog simulation. Pentium II-based platforms also offer excellent performance for their price and currently enjoy a significant price advantage of roughly two to three times over the UltraSPARC II-based platform.
However, there are several things to consider when moving from Unix to Windows NT.
Caveats
The first consideration is design size. Currently, Windows NT is limited to 500 Mbytes per process, and the Pentium II can accommodate a total address space of only 1 Gbyte. Windows NT 5.0 will soon overcome Unix's current limitation of 4 Gbytes of address space, but that version is not yet available. What's more, Solaris is already a 64-bit operating system, whereas Windows NT is still 32-bit.
Sun's Ultra 2 workstations have roughly a 50 percent advantage in floating-point performance over equivalent-speed Pentium II workstations. That advantage isn't significant in digital simulations like Verilog, perhaps, but it's definitely an advantage if you're doing Spice simulations, for example.
Windows NT servers are still not equivalent to Solaris servers, and it's always an advantage to have the compute farm servers fully compatible with the front-end machines. If the simulation environment includes standard Unix tools--such as make, perl, awk, and sed--shell scripts, or C programs, it may be difficult to deploy a mixed Unix-NT environment.
Just because Pentium II-based workstations are significantly less expensive than the current generation of Ultra 2 workstations, you shouldn't jump to the conclusion that Windows NT-based EDA software will ever be significantly less expensive than Unix-based EDA software. Conversely, even though you can now obtain a wealth of "shrink-wrapped" office productivity software that will run under NT, the equivalent for Unix-based workstations will be a long time coming. Furthermore, no one should expect Sun to stand still in the face of the competition posed by Intel and Microsoft and the challenges presented by Pentium-based Windows NT workstations, especially in terms of price and performance.
*Conway's Game of Life is represented on a 2D array of cells, each cell being alive or dead at any given time. The program begins with an initial configuration for the cells, and henceforth obeys the following set of rules: A living cell remains alive if it has exactly two or three living neighbors; otherwise it dies. A dead cell becomes alive if it has exactly three living neighbors; otherwise it remains dead.
James Lee is a senior consulting engineer at Seva Technologies, Inc. in Fremont, Calif. He has 12 years' experience working with Verilog and was one of the first employees at Gateway Design Automation, which developed Verilog. Prior to joining Seva, he was with Cadence Design Systems. He has written a book on Verilog and is a part-time instructor in Verilog at the University of California, Santa Cruz.
Contributing Editor John Miklosz has 17 years' experience working on electronics publications, including several years as the editor-in-chief of Computer Design.
To voice an opinion on this or any Integrated System Design article, please e-mail your message to miker@isdmag.com.
Integrated System Design, March 1998
Copyright © 2000 Integrated System Design