Reducing power in AMD processor core with RTL clock gating analysis
Steve Kommrusch - AMD, Inc.
2/4/2013 10:45 AM EST
PowerPro results
The first test we ran was cpu-halt, because it was among the easiest ways to make significant improvements in clock gating. Figure 2 shows a snapshot of the clock-gating improvement process as tracked by PowerPro. The thirteen blocks shown were leveraged from a previous design, Bobcat, into Jaguar. By tracking progress frequently, even as functionality and timing work was progressing, the team was able to drive down active clock counts dramatically during product development.

Figure 2. Clock-gating improvements based on cpu-halt regressions.
We also ran the cpu-halt test after adding a new block, the shared L2 cache controller, which was not leveraged from the previous processor core. The significant drop in activity from Month3 to Month4 marks the point at which the new block's functionality was nearly complete and design work began focusing on power (Figure 3).

Figure 3. Average clocked flops after adding “newblock”: the shared L2 cache controller.
We then ran various applications on PowerPro (Table 1). The goal was to minimize the average number of flops clocked each cycle, either by optimizing away flops or by improving clock-gating efficiency. Designers could review the flop-efficiency details of the RTL as written, along with recommended improvements for gating efficiency. The design owner’s name was associated with each block to establish a clear assignment of responsibility for reviewing and improving clock-gating results.

Table 1. Summary of PowerPro AppTyp results. (Note: “newblock” is not part of the CPU core total.)
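The metric described above, the average number of flops clocked each cycle, can be sketched in a few lines. This is only an illustration under stated assumptions: the `ClockGroup` model, the block names, and all numbers are hypothetical and do not reflect PowerPro's actual report format or the values in Table 1.

```python
from dataclasses import dataclass


@dataclass
class ClockGroup:
    """Hypothetical model: a set of flops sharing one gated clock."""
    flops: int           # number of flops driven by this gated clock
    enabled_cycles: int  # cycles in which the clock gate was open


def avg_clocked_flops(groups, total_cycles):
    """Average number of flops receiving a clock edge per cycle."""
    return sum(g.flops * g.enabled_cycles for g in groups) / total_cycles


def pct_flops_active(groups, total_cycles):
    """Average clocked flops as a percentage of all flops in the groups."""
    total_flops = sum(g.flops for g in groups)
    return 100.0 * avg_clocked_flops(groups, total_cycles) / total_flops


# Illustrative numbers only (not measured data).
TOTAL_CYCLES = 1000
blocks = {
    "blockA": [ClockGroup(flops=1000, enabled_cycles=100),
               ClockGroup(flops=500, enabled_cycles=1000)],
    "blockB": [ClockGroup(flops=2000, enabled_cycles=50)],
}

for name, groups in blocks.items():
    print(name,
          avg_clocked_flops(groups, TOTAL_CYCLES),
          pct_flops_active(groups, TOTAL_CYCLES))
```

In this toy data, the 500 always-clocked flops in `blockA` dominate its average, which is the kind of hot spot a per-block summary is meant to expose to the block owner.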
At the same frequency, AMD Bobcat and AMD Jaguar have similar maximum power levels for the virus case. (Due to timing work, Jaguar can run at lower voltages for the same frequency, but it also has higher instructions per clock (IPC) architecturally.) Table 2 shows the result of our clock-gating efforts on the AMD Jaguar core. For typical applications, even though IPC improved from one core to the next, the percentage of active flops decreased by approximately 25%.

Table 2. Comparison of clock-gating improvements. (Note: % of Flops Active is approximate.)
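To make the arithmetic behind a statement like "decreased by approximately 25%" concrete, the short sketch below computes a relative reduction between two percent-active figures. It assumes the 25% is a relative change rather than a difference in percentage points, and the input values are placeholders, not the entries of Table 2.

```python
def relative_reduction_pct(previous, current):
    """Relative reduction, in percent, going from `previous` to `current`."""
    return 100.0 * (previous - current) / previous


# Placeholder values: 20% of flops active before, 15% after.
prev_pct_active = 20.0
new_pct_active = 15.0

# A 5-percentage-point drop from 20% is a 25% relative reduction.
print(relative_reduction_pct(prev_pct_active, new_pct_active))
```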
In addition to running a snapshot of RTL code through the PowerPro flow, the AMD gates team would do synthesis, placement, routing, and gate simulations from the same tag. These PTPX runs included accurate gate and wire capacitance for the actual tape-out netlist. However, getting an accurate PTPX result can take several weeks, because the design must be synthesized and routed through a back-end flow capable of achieving the high frequencies at which the AMD Jaguar cores can run. The general PTPX results demonstrate that PowerPro was useful as a quick estimate for power work. Also, based on Bobcat silicon results, reasonable correlation (±10%) between silicon and PTPX results has been observed.
In summary, AMD’s efficient RTL clock-gating analysis flow had these key advantages:
- RTL analysis could run over the weekend and analyze key power benchmark tests.
- Output format was easy to parse and summarize for designer use.
- Recommended improvements had value as suggestions and showed possible optimizations.
- Correlation between active clock count and total power used was good.
- Ultimately, even given IPC and frequency improvements, PowerPro helped achieve an approximately 20% reduction in typical dynamic application power compared to an already-tuned low-power X86 CPU.
About the author
Steve Kommrusch received his BS from the University of Illinois in 1987 and his master's degree from the Massachusetts Institute of Technology in 1989. Steve has worked as a lead engineer on low-power processors for over 15 years. At Hewlett Packard, Steve worked on a 3 ARM core ASIC for the CapShare 910 handheld scanner. With National Semiconductor, Steve worked on the Geode LX, an SoC with 2D graphics, X86 processor, and integrated display control, which was in the OLPC laptop (One Laptop per Child). Most recently, Steve architected the clock, reset, and power control signals for the AMD Jaguar processor. All of these products made
extensive use of clock gating to improve battery life.

yjchen
2/8/2013 2:21 AM EST
Hi Steve,
As you mentioned, the correlation between silicon and PTPX is about +/-10%. From your experience, what's the correlation between PowerPro and PTPX? And between PowerPro and silicon?
Besides, in your flow, the input to PowerPro is SAIF. Why don't you use a real waveform, like VCD or FSDB? Thanks.
yjchen
GMN
2/22/2013 3:54 PM EST
PowerPro does use VCD and FSDB for more accurate analysis. However, if you are primarily concerned about clock-gating efficiency and not looking for peak power analysis, then SAIF is faster and more efficient.
SteveKo
3/1/2013 12:38 PM EST
GMN had a good reply for SAIF usage; we used the Calypto-recommended flow for that technical decision.
For PowerPro to PTPX, we were not using their newer version, which estimates actual power; we were looking at clock-gating efficiency. However, there was useful correlation there. As per Table 2, we achieved about a 25% reduction in flop activity rate from one design to the next, and that correlated with about 25% lower dynamic power for typical applications. (As a short point of interest, we did check how much power tended to be used per active flop on one of our early runs, but it varied a fair bit from block to block. As one would expect, blocks with lots of combinational logic, like floating point, had more gate fanout capacitance per flop than other blocks.)
daleste
2/11/2013 10:29 PM EST
Good work improving the efficiency of the design. Whatever happened to the clockless logic that was supposed to make all of this unnecessary?
SteveKo
3/1/2013 12:42 PM EST
:-) About 10 years ago we did some serious looking into a clockless X86 design for deep low power, but the toolsets for efficient timing closure weren't there. And providing sufficiently robust async timing for state machines eats into perceived benefits. I think clock trees and meshes with optimized gating strategies will be with us for a while.
Frank Eory
2/13/2013 9:18 AM EST
I find it amazing that after all these years of using clock gating to reduce power, the tools & methodologies continue to improve to such a degree that these types of large power reductions are still possible.
SteveKo
3/1/2013 12:46 PM EST
Indeed, it's interesting that even our max power virus pattern only needs 15% of the flops clocked. There was a lot of designer work optimizing clock gating, but Calypto's SLEC methodology helped show what could be done too.
Tools evolve and designers gain experience, leading to ever lower active flop counts.