Graphics Processing Units (GPUs) hold great promise in the field of high-performance computing (HPC). Packing immense computational power and high memory bandwidth into a commoditized hardware platform, GPUs have already been successfully adopted by oil and gas companies, computational finance firms, and similar organizations seeking the best compute bang for their buck. EDA applications have a lot in common with these HPC applications: huge quantities of computation and memory-bound operations.
EDA applications have traditionally been implemented on regular processors. Most, if not all, of these applications were not designed for parallel or vector processors. For example, HDL simulators, being event-driven, manage a single queue of events and handle them one at a time, serially. To utilize a massive multi-core architecture, it is not sufficient to rewrite the software, even completely; the algorithms themselves must be rethought with parallelism at the heart of the process.
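To make the serial nature concrete, here is a minimal sketch of the classic event-driven loop in C++ (all type and function names are illustrative, not any particular simulator's internals):

#include <queue>
#include <vector>

struct Event {
    long time;      // simulation timestamp
    int  signal;    // index of the signal that changed
    bool value;     // new value of the signal
    bool operator>(const Event& o) const { return time > o.time; }
};

// One global queue, one event at a time: evaluating an event may schedule
// new events, so every iteration depends on the outcome of the previous one.
void simulate(std::priority_queue<Event, std::vector<Event>,
                                  std::greater<Event> >& queue) {
    while (!queue.empty()) {
        Event e = queue.top();
        queue.pop();
        // A real simulator would now evaluate the fan-out of e.signal and
        // push any resulting transitions back onto the queue, e.g.:
        // evaluate_fanout(e, queue);   // hypothetical helper
    }
}

The loop itself is the obstacle: because each event can create the next one, there is no obvious way to hand events to thousands of cores.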
GPUs emerged in the late 1990s. At first, they were designed solely to function as coprocessors to the CPU, offloading graphics algorithms onto custom hardware. Optimized for this mission, they consisted of a pipeline of several types of processors, each dedicated to a specific stage of the algorithm flow: vertex processing, texturing, shading, and so on. As it turned out, this architecture was not maximally utilized, because some game scenes were geometrically rich while others required mostly texture-related computation. To better handle such imbalanced scenarios, it made sense to move to a more general architecture, and newer GPUs adopted a massive multi-core design built from "general-purpose" SIMD cores. Since 2007, NVIDIA's CUDA, followed by OpenCL, has let us program these immense stream processors in C/C++ instead of "impersonating" pixels and triangles.
The performance bottleneck of EDA applications comes from two directions. First, most of these applications are single-threaded, while CPU and GPU architectures now offer tens to thousands of parallel cores. Second, these applications are bound by memory latency. CPUs are designed for a 90%-100% cache-hit working point: half the die of a modern CPU is cache memory, and another quarter is invested in all sorts of cache-related optimizations. Unfortunately, in EDA applications the dataset is too large to fit in the cache, and in the absence of data-access locality the cache-hit assumption fails. As a result, the single CPU core is even further underutilized, waiting hundreds of cycles for each memory load. This is why the main contributor to simulation performance gains in the past few years has been growing cache sizes.
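A hypothetical gate-evaluation inner loop shows why the locality assumption breaks (the data layout below is illustrative, not any real simulator's):

struct Gate {
    int  fanin[2];   // indices of driving gates, scattered across memory
    bool value;      // current output value
};

// Evaluating one AND gate issues two loads into a netlist array that can
// be gigabytes in size. The fan-in indices point at effectively random
// locations, so there is no spatial or temporal locality for the cache or
// prefetcher to exploit, and the core stalls on nearly every load.
bool evaluate_and(const Gate* netlist, const Gate& g) {
    return netlist[g.fanin[0]].value && netlist[g.fanin[1]].value;
}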
GPUs are perfectly suited for data-parallel algorithms over huge datasets. The most recently developed GPUs have more than a thousand processing cores, organized in SIMD groups. All that is required is to launch several million short-lived, independent threads that need not communicate with each other. Memory latency can then be hidden almost entirely by switching, very efficiently, from waiting threads to ready ones. Instead of optimizing the latency of a single thread, the GPU optimizes for throughput: the number of threads processed in a given time. In applications such as graphics, you can reach 100% utilization of the GPU's hundreds of cores and make use of every bit of the 190 GB/sec data rate that the GDDR5 bus allows.
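This sweet spot is easy to see in code. The CUDA kernel below is a trivially data-parallel example: millions of short-lived threads, one array element each, with no communication between them (a generic illustration, not simulator code):

// Each thread scales one element; threads never interact. While some warps
// wait on memory, the scheduler runs ready warps, hiding the latency.
__global__ void scale(float* data, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= k;    // independent read-modify-write per thread
}

// Launch enough 256-thread blocks to cover n elements; with n in the
// millions, the GPU is oversubscribed with ready warps:
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);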
Sounds easy? Well, if your algorithm naturally breaks into parallel threads, each working on its own subset of the data, then it is. The bad news is that in most EDA applications, not all parts of the flow can be broken into independent parallel threads. From the in-depth experience we gained developing RocketSim, we concluded that you must start from a blank sheet and rethink the algorithm for running logic simulations; only then could we break the problem into parallel threads. And even then, these threads were not really independent. Quite the opposite: the number of dependencies in the data-flow graph we had to deal with was enormous. We put a lot of effort into optimally partitioning the data-flow graph so that the number of dependencies among threads is kept to a minimum.
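Our partitioner itself is proprietary, but a toy levelization sketch conveys the idea of trading dependencies for parallelism: group gates by their depth in the data-flow graph, so that all gates within one level are mutually independent and can be evaluated by parallel threads, with synchronization needed only between levels. The code below assumes a gate graph already in topological order and is an illustration, not our algorithm:

#include <algorithm>
#include <vector>

// fanin[g] lists the gates driving gate g. Returns the gates grouped by
// level: level 0 has no fan-in; level k depends only on levels below k.
std::vector<std::vector<int> > levelize(
        const std::vector<std::vector<int> >& fanin) {
    int n = static_cast<int>(fanin.size());
    std::vector<int> level(n, 0);
    int max_level = 0;
    for (int g = 0; g < n; ++g) {          // topological order assumed
        for (int d : fanin[g])
            level[g] = std::max(level[g], level[d] + 1);
        max_level = std::max(max_level, level[g]);
    }
    std::vector<std::vector<int> > levels(max_level + 1);
    for (int g = 0; g < n; ++g)
        levels[level[g]].push_back(g);
    return levels;
}

In practice, the hard part is exactly what this toy version ignores: choosing the partition so that cross-thread edges, and hence synchronization, stay rare.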
There is a lot of potential in GPUs for those EDA applications that have inherent parallelism. The GPU architecture is ideal for data-parallel processing; it is an incredible throughput machine if you give it the right code to run. However, a major effort is needed to redesign not only the software but the underlying algorithms as well.
For us at Rocketick, this redesign effort paid off. Today we can simulate the largest chip designs in the world 10 to 30 times faster than the leading simulators on the market. Thanks to a complete redesign of the entire simulation database, our product is fully scalable and can run over multiple GPUs. With every new GPU generation, it automatically leverages the ever-growing gap between GPU and CPU in processing power and memory bandwidth.
About the author
Uri Tal, founder of Rocketick and its CEO since its inception, has 14 years of experience in the management, design, and implementation of hardware-acceleration technologies. Prior to Rocketick, Uri was a system architect at Siliquent/Broadcom. He previously managed a large R&D team that developed FPGA-based acceleration solutions for the intelligence corps of the Israel Defense Forces. Uri holds a B.Sc. (summa cum laude) and an M.Sc. in Electrical Engineering from the Technion, the Israel Institute of Technology.