Viewpoint: Mass GPUs, not CPUs, for EDA simulations

 

Simulation has always been about speed. A program that forecasts tomorrow's weather but takes 26 hours to complete is useless, but one that takes 26 minutes is invaluable. It's the same with EDA. If you can get simulation results faster than spinning a board or a chip, you add value. If you don't, you don't.

There are basically three ways to make the simulation go faster: better algorithms, faster processor clocks and parallelism. As David A. Patterson, professor of computer science at the University of California at Berkeley, says: "No one knows how to design a 15-GHz processor, so the other option is to retrain all the software developers" to program parallel machines.

We agree. Processor speeds have topped out. Clock them much faster, and you wouldn't be able to get the heat out fast enough to keep the chip from burning up.

As for algorithms, improvement is predictable and incremental, and sometimes there are breakthroughs. But you can't write a business plan based on that kind of breakthrough.

So the new trend is clearly toward parallel machines. The obvious target is multicore central processing units (CPUs).

We are also seeing a trend to leverage graphics processing units (GPUs). These chips originated in the video game industry for high-performance graphics calculations. They have hundreds of cores. And it turns out they can do tasks unrelated to their original target market of rendering a moving 3-D scene onto streaming 2-D screen images.

If you've been in the industry awhile, you may be getting a feeling of déjà vu these days. This configuration is a bit like the vector supercomputers of decades ago, and you may be wondering, "Well, if supercomputers didn't go mainstream, then why GPUs now?"

It's different this time around because vector machines started at the top of the price/performance curve with Pentagon funding and didn't migrate down. GPUs, on the other hand, have a different economic model. They started at the bottom, were sold to millions of gamers, and now have a terrific price/performance ratio.

Why are GPUs different this time around? Moore's Law states that the number of transistors on a chip doubles every two years. (For decades, smaller transistors meant not only more transistors per chip but also faster ones. We got faster CPUs at the same time we got more sophisticated CPUs.)

CPU architecture has also changed from complex, highly pipelined designs to simpler cores replicated across multicore CPUs. Moore's Law can therefore be interpreted as "the number of cores on a chip will double every two years."

A typical application may run 1.9 times faster with two cores (95 percent efficiency), 3.1 times faster with four cores (78 percent efficiency) and 4.5 times faster with eight cores (56 percent efficiency). As more cores are added, applications can't use all of them efficiently. One of the major reasons for the lack of scalability is limited memory bandwidth: the computer's main memory can't feed the cores data fast enough to keep them fully utilized.
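The efficiency percentages above follow from dividing each speedup by its core count. A minimal sketch of that arithmetic, using only the figures quoted in the text (the function and variable names are illustrative, not from any tool):

```python
# Speedup figures quoted in the text: number of cores -> speedup over one core.
speedups = {2: 1.9, 4: 3.1, 8: 4.5}


def parallel_efficiency(speedup, cores):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / cores


# Efficiency for each core count mentioned in the article.
efficiencies = {n: parallel_efficiency(s, n) for n, s in speedups.items()}

for cores, eff in efficiencies.items():
    print(f"{cores} cores: {speedups[cores]}x speedup, {eff:.1%} efficiency")
```

The point of the calculation is the trend: each doubling of cores buys a smaller share of the ideal gain.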

A GPU can be added inexpensively to an existing computer; a leading-edge GPU costs only $350. There's no need to replace the entire computer. With this upgrade, computer applications, including EDA tools, get access to a processor that can compute at 1 teraflop.

The leading multicore CPU can deliver only 100 gigaflops. This gives the GPU a 10x performance advantage over the CPU. (For a historical perspective, the first computer to deliver 1 teraflop was ASCI Red at Sandia National Laboratories, which became operational in December 1996. It cost $55 million and took up 2,500 square feet of floor space.)

GPUs excel at providing high memory bandwidth. A GPU needs to transform and render millions of geometry primitives 60 times per second to keep video output running in real time. GPUs have their own dedicated memory and a super-wide memory bus (512 bits for a GPU versus 64 bits for a CPU) to feed data to the processor. A state-of-the-art GPU has a memory bandwidth of 159 GBytes/s, compared with 25 to 32 GBytes/s for the leading multicore CPU. That gives the GPU a 5- to 6-fold advantage in memory bandwidth.
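The compute and bandwidth advantages quoted above are simple ratios. A short sketch that reproduces them from the article's own numbers (variable names are illustrative):

```python
# Peak compute figures from the text, in gigaflops.
gpu_gflops = 1000.0  # 1 teraflop
cpu_gflops = 100.0

# Memory bandwidth figures from the text, in GBytes/s.
gpu_bandwidth = 159.0
cpu_bandwidth_low, cpu_bandwidth_high = 25.0, 32.0

# GPU advantage in raw compute.
compute_ratio = gpu_gflops / cpu_gflops

# GPU advantage in memory bandwidth, against the fastest and slowest CPU figure.
bw_ratio_min = gpu_bandwidth / cpu_bandwidth_high
bw_ratio_max = gpu_bandwidth / cpu_bandwidth_low

print(f"compute advantage: {compute_ratio:.0f}x")
print(f"bandwidth advantage: {bw_ratio_min:.1f}x to {bw_ratio_max:.1f}x")
```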

The programming environment for GPUs makes them relatively easy to program for some highly parallel EDA applications. Some of the core calculations that are handled by GPUs can run 20 times faster than on a CPU. Not all applications will run on the GPU, but critical bottlenecks can be identified and moved from the CPU to the GPU. We've been working with Acceleware and Nvidia to port our EMPro and ADS Transient-Convolution Simulator products to GPUs. Nvidia not only makes the chips but is also a big consumer of computing power to design its next-generation chips. Technological bootstrapping!
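How much a 20x kernel speedup helps the whole run depends on how much of the runtime the moved bottleneck accounts for. One common way to estimate this is Amdahl's law, which is a modeling choice brought in here for illustration and not something the simulators themselves are stated to use. The fractions below are hypothetical, not measurements of any product:

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: whole-run speedup when only part of the runtime is accelerated.

    accelerated_fraction: share of the original runtime moved to the GPU (0.0 to 1.0).
    kernel_speedup: how much faster that share runs on the GPU.
    """
    remaining = 1.0 - accelerated_fraction          # part that stays on the CPU
    accelerated = accelerated_fraction / kernel_speedup  # part that shrinks
    return 1.0 / (remaining + accelerated)


# Hypothetical fractions of runtime spent in the bottleneck; 20x is the
# kernel speedup quoted in the text.
for fraction in (0.50, 0.90, 0.98):
    print(f"{fraction:.0%} of runtime on GPU -> "
          f"{overall_speedup(fraction, 20.0):.1f}x overall")
```

Under this model, the part left on the CPU sets the ceiling, which is why identifying the critical bottlenecks matters more than porting everything.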

Software developers have been helping us port the code, and chip designers have been our lead beta site. They've achieved a 14-fold improvement in simulation time.

Not quite 26 hours to 26 minutes, but we're getting there.

—Larry Lerner is R&D senior manager at Agilent Technologies, EEsof EDA division.