EDA DesignLine Blog
urital.rocketick
Using GPUs to accelerate EDA applications
Uri Tal
5/3/2012 12:11 PM EDT
Graphics Processing Units (GPUs) hold great promise in the field of high-performance computing (HPC). With immense computational power and high memory bandwidth packed into a commodity hardware platform, GPUs have already been used successfully by oil and gas companies, computational finance firms, and similar organizations seeking the best compute bang for their buck. EDA applications have a lot in common with other HPC applications: huge quantities of computation and memory-related operations.
EDA applications have traditionally been implemented on regular processors. Most, if not all, of these applications were not designed for parallel or vector processors. For example, HDL simulators, being event-driven, manage a single queue of events and handle events one at a time, serially. To utilize a massive multi-core architecture, even a complete rewrite of the software is not sufficient; the algorithms must be rethought with parallelism at the heart of the process.
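To make the serial bottleneck concrete, here is a minimal sketch of an event-driven gate simulator built around a single time-ordered queue. It is an illustration only, not the code of any real HDL simulator: the `simulate` function, the gate table layout, and the tiny two-gate circuit are all assumptions made up for this example.

```python
import heapq

def simulate(gates, initial_events):
    """Serial event-driven simulation (illustrative sketch).

    gates: output signal -> (function, input signal names, delay)
    initial_events: list of (time, signal, value) tuples
    Returns the final signal values and the number of events handled.
    """
    # Build a fanout map: which gates must be re-evaluated when a signal changes.
    fanout = {}
    for out, (_, inputs, _) in gates.items():
        for sig in inputs:
            fanout.setdefault(sig, []).append(out)

    values = {}
    queue = list(initial_events)
    heapq.heapify(queue)          # the single, global event queue
    processed = 0

    while queue:
        # Exactly one event is handled per iteration: this loop is the serial part.
        time, signal, value = heapq.heappop(queue)
        processed += 1
        if values.get(signal) == value:
            continue              # no change, nothing to propagate
        values[signal] = value
        for out in fanout.get(signal, []):
            func, inputs, delay = gates[out]
            if all(i in values for i in inputs):
                new_value = func(*(values[i] for i in inputs))
                heapq.heappush(queue, (time + delay, out, new_value))
    return values, processed

# Hypothetical circuit: y = NOT(a AND b), unit delay per gate.
example_gates = {
    "n1": (lambda a, b: a & b, ("a", "b"), 1),
    "y": (lambda n: 1 - n, ("n1",), 1),
}
final_values, handled = simulate(example_gates, [(0, "a", 1), (0, "b", 1)])
print(final_values["y"], handled)
```

Every event passes through the same `while` loop and the same heap, so adding cores does nothing for this structure unless the algorithm itself is changed.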
GPUs emerged in the late 1990s. At first, they were designed solely to function as a coprocessor to the CPU, offloading graphics algorithms onto custom hardware. Optimized for this mission, they consisted of a pipeline of several types of processors, each dedicated to a specific stage in the algorithm flow: vertex processing, texturing, shading, and so on. As it turned out, this architecture was not fully utilized, because some game scenes were geometrically rich while others required mostly texture-related computations. To deal better with such imbalanced workloads, it made sense to move to a more general architecture. The newer GPUs consisted of “general-purpose” SIMD cores; GPUs had adopted a massive multi-core architecture. Since 2007, NVIDIA’s CUDA and OpenCL have enabled us to program these immense stream processors in C++, instead of “impersonating” pixels and triangles.
The performance bottleneck of EDA applications comes from two directions. First, most of these applications are single-threaded, while CPU and GPU architectures have tens to thousands of parallel cores. Second, these applications are bottlenecked by memory latency. CPUs are designed to operate at a 90%-100% cache hit rate. Half of the die area of a modern CPU consists of cache memory, and another quarter is invested in all sorts of cache-related optimizations. Unfortunately, in EDA applications the data set is too large to fit in the cache and, in the absence of data-access locality, the “cache-hit” assumption fails. As a result, the single CPU core is even further underutilized, because it has to wait hundreds of cycles for each memory load operation. This is why the main contributor to the increase in simulation performance in the past few years has been the increase in cache size.
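The cost of a failing cache-hit assumption can be sketched with the usual weighted-average model of memory access time. The cycle counts below (4 cycles for a hit, 300 for a miss) are assumptions chosen only to match the "hundreds of cycles" order of magnitude mentioned above; they are not measurements of any particular CPU.

```python
def average_access_cycles(hit_rate, hit_cycles, miss_cycles):
    """Average cycles per load: a weighted mix of cache hits and misses."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles

# Assumed, illustrative latencies (not measured values).
HIT_CYCLES = 4
MISS_CYCLES = 300

for hit_rate in (0.99, 0.90, 0.50):
    cycles = average_access_cycles(hit_rate, HIT_CYCLES, MISS_CYCLES)
    print(f"hit rate {hit_rate:.0%}: about {cycles:.1f} cycles per load")
```

Under these assumed numbers, dropping from a 99% to a 50% hit rate makes the average load more than twenty times slower, which is the effect described above for large data sets without locality.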
GPUs are perfectly suited to data-parallel algorithms with huge datasets. The most recently developed GPUs have more than a thousand processing cores, organized in SIMD groups. All that is required is that you launch several million short-lived, independent threads that need not communicate with each other. Memory latency can then be hidden very efficiently by switching from “waiting” threads to “ready” threads. Instead of optimizing for the latency of a single thread, you optimize for throughput: the number of threads that can be processed in a given period of time. In applications such as graphics, you can reach 100% utilization of the GPU's cores and make use of every bit of the 190 GB/sec data rate that the GDDR5 bus allows.
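The latency-hiding argument can be illustrated with a simple textbook-style model: each thread computes for a few cycles, then waits on memory, and a core stays busy as long as some other thread is ready. The function names and the cycle counts (10 compute cycles, 400 cycles of memory latency) are assumptions for illustration, not figures for any specific GPU.

```python
import math

def core_utilization(num_threads, compute_cycles, memory_latency_cycles):
    """Fraction of time a core is busy when it can switch between threads for free.

    Each thread alternates between compute_cycles of work and
    memory_latency_cycles of waiting; other threads fill the waiting time.
    """
    busy = num_threads * compute_cycles
    period = compute_cycles + memory_latency_cycles
    return min(1.0, busy / period)

def threads_to_hide_latency(compute_cycles, memory_latency_cycles):
    """Smallest number of threads per core that keeps the core fully busy."""
    return math.ceil((compute_cycles + memory_latency_cycles) / compute_cycles)

# Assumed, illustrative numbers.
COMPUTE, LATENCY = 10, 400
print(core_utilization(1, COMPUTE, LATENCY))        # a single thread mostly waits
print(threads_to_hide_latency(COMPUTE, LATENCY))    # threads needed per core
```

In this model one thread keeps the core busy only a few percent of the time, while a few dozen threads per core are enough to cover the wait entirely, which is why throughput-oriented hardware wants far more threads than cores.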
Sounds easy? Well, if your algorithm naturally breaks into parallel threads, where each thread works on its own subset of the data, then it is. The bad news is that in most EDA applications not all parts of the flow can be broken into independent parallel threads. From the in-depth experience we gained when developing RocketSim, we came to the conclusion that you must start from a “blank sheet” and rethink the algorithm for running logic simulations. This was the only way we could break the problem into parallel threads. And even then, these threads were not really independent. In fact, the situation was quite the opposite: the number of dependencies in the data-flow graph that we had to deal with was enormous. We put a lot of effort into optimally partitioning the data-flow graph so that the number of dependencies among threads would be kept to a minimum.
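As a rough illustration of what dependencies in a data-flow graph mean for parallel threads, the sketch below does two generic things: it groups nodes into levels whose members do not depend on each other, and it counts how many dependency edges cross between partitions. This is a common textbook approach, not RocketSim's actual partitioning algorithm, and the helper names, the example graph, and the two partitions are hypothetical.

```python
def levelize(deps):
    """Group nodes into levels; nodes in one level do not depend on each other.

    deps: node -> list of nodes it reads from (an acyclic data-flow graph).
    Nodes that appear only as sources (primary inputs) land in level 0.
    """
    level = {}

    def depth(node):
        if node not in level:
            sources = deps.get(node, [])
            level[node] = 1 + max((depth(s) for s in sources), default=-1)
        return level[node]

    for node in deps:
        depth(node)
    if not level:
        return []
    levels = [[] for _ in range(max(level.values()) + 1)]
    for node, lvl in level.items():
        levels[lvl].append(node)
    return [sorted(nodes) for nodes in levels]

def cut_size(deps, partition):
    """Count dependency edges whose two ends sit in different partitions."""
    return sum(
        1
        for node, sources in deps.items()
        for src in sources
        if partition[src] != partition[node]
    )

# Hypothetical graph: y depends on n1 and n2, which depend on primary inputs.
graph = {"n1": ["a", "b"], "n2": ["c", "d"], "y": ["n1", "n2"]}
print(levelize(graph))

# Two hypothetical ways to split the same nodes across two threads.
by_cone = {"a": 0, "b": 0, "n1": 0, "y": 0, "c": 1, "d": 1, "n2": 1}
interleaved = {"a": 0, "b": 1, "n1": 0, "y": 0, "c": 0, "d": 1, "n2": 1}
print(cut_size(graph, by_cone), cut_size(graph, interleaved))
```

Both partitions assign the same number of nodes to each thread, but the first one needs far fewer values exchanged between threads, which is the quantity a partitioner tries to keep small.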
GPUs hold a lot of potential for those EDA applications that can be parallelized. The GPU architecture is ideal for data-parallel processing; it is an incredible throughput machine, if you give it the right code to run. However, a major effort is needed to redesign not only the software, but the underlying algorithms as well.
For us at Rocketick, this redesign effort paid off. Today we are able to simulate the largest chip designs in the world 10 to 30 times faster than the leading simulators on the market. Thanks to a complete redesign of the entire simulation database, our product is completely scalable and can run across multiple GPUs. With every new GPU generation, our product automatically leverages the ever-growing gap between the GPU and the CPU in terms of processing power and memory bandwidth.
About the author
Uri Tal, founder of Rocketick and its CEO from its inception, has 14 years of experience in management, design, and implementation of hardware acceleration technologies. Prior to Rocketick Uri was a system architect at Siliquent/Broadcom. He previously managed a large R&D team, which developed FPGA-based acceleration solutions for the intelligence corps of the Israeli Defense Forces. Uri holds a B.Sc. (Summa Cum Laude) and M.Sc. in Electrical Engineering from the Technion – the Israeli Institute of Technology.
Karl Fergusen
5/3/2012 3:05 PM EDT
We started using Jacket a few months ago to accelerate MATLAB code on the GPU at L-3. Awesome speedups!
Les_Slater
5/5/2012 9:20 PM EDT
The problem of rethinking algorithmic foundations is an interesting one. This needs to be taken on as a general formality of a geometry of problem space.
TingLu
5/7/2012 12:17 AM EDT
Simulation acceleration has been dominated by FPGAs. I am wondering how well they compare with GPUs.
urital.rocketick
5/7/2012 1:31 AM EDT
Hi TingLu,
FPGA-based accelerators let you run chip designs at MHz speeds and debug system-level scenarios in the lab, but they are not simulators. It is just a different product category.
Pros:
- You can reach 1-10 MHz speeds with them and can therefore debug your driver, and even your application, in embedded systems.
Cons:
- They are very expensive.
- They require significant ramp-up time, and if you then change your code or libraries you are not really debugging your real silicon design.
- They do not work alongside your existing test bench (verification environment), and if they do, you cannot reach MHz speeds.
- They are limited in capacity (to scale you need to add more FPGAs/boxes, but then you trade off speed).
- They lack support for non-synthesizable code.
- No support for 4-state logic.
- They lack full visibility.
- Long compilation time (synthesis and place-and-route are required).
docdivakar
5/11/2012 2:09 AM EDT
Existing (not optimally parallel) EDA tools can still exploit the operating system to benefit from parallelism. Examples abound, such as pattern-based DRC and, in the TCAD area, computational lithography.
MP Divakar