Reducing power in AMD processor core with RTL clock gating analysis
Steve Kommrusch - AMD, Inc.
2/4/2013 10:45 AM EST
Lowering the power consumption of consumer products and networking centers is an important design consideration, and this effort begins with many of the chips that go into these devices. Semiconductor design innovators like AMD want to improve on previous generation designs in terms of faster performance in a given power envelope, higher frequency at a given voltage, and improved power efficiency through clock gating and unit redesign.
With these aims, the AMD low-power core design team used a power analysis solution that helped analyze pre-synthesis RTL clock-gating quality, find opportunities for improvement, and generate reports that the engineering team could use to decrease the operating power of the design. Because it targets pre-synthesis RTL, the analysis can be run more often and over a larger number of simulation cycles, and it completes more quickly and with fewer machine resources than tools that rely on synthesized gates. The focus on clock gating and the quick turnaround of RTL analysis allowed AMD to achieve measurable power reductions for typical applications of a new, low-power X86 AMD core.
The AMD Jaguar X86 core is a flexible, high-frequency processor aimed at system-on-a-chip designs for low-power markets and cloud clients. It is built on a 28-nm process technology and has a small die area (3.1 mm²). Compared to the previous generation of this core, AMD Bobcat, many blocks were redesigned for improved power efficiency, including the IC loop buffer, store queue, and L2 clocks. The Jaguar compute unit (CU) includes four independent Jaguar cores and a shared-cache unit with four L2 databanks and an L2 interface tile. The L2 interface block runs at the core clock speed. The L2 databanks run at half-clock to save power and are clocked only when required, reducing power even further.

Figure 1. AMD Jaguar compute core architecture.
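The saving from the half-clock, on-demand L2 databanks described above can be illustrated with the standard first-order CMOS relation, in which dynamic power scales with switching activity, capacitance, the square of the supply voltage, and clock frequency. The sketch below is only that first-order model. The function name and the 30% clocked fraction are illustrative assumptions, not AMD figures.

```python
def relative_clock_power(freq_ratio: float, clocked_fraction: float) -> float:
    """Clock power relative to an always-clocked, full-speed baseline.

    First-order model: dynamic power is proportional to
    activity * C * V^2 * f, so at fixed capacitance and voltage it scales
    with the clock frequency ratio times the fraction of time the clock
    is actually running.
    """
    if freq_ratio < 0.0:
        raise ValueError("freq_ratio must be non-negative")
    if not 0.0 <= clocked_fraction <= 1.0:
        raise ValueError("clocked_fraction must be between 0 and 1")
    return freq_ratio * clocked_fraction


# A half-clock databank that is always clocked.
half_clock_only = relative_clock_power(0.5, 1.0)

# Hypothetical: the same bank clocked only 30% of the time (assumed value).
half_clock_gated = relative_clock_power(0.5, 0.3)

print(f"half clock only:      {half_clock_only:.2f}")
print(f"half clock and gated: {half_clock_gated:.2f}")
```

In this model the two techniques multiply, which is why clocking the databanks only when required reduces power beyond what the half-rate clock achieves on its own.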
Because the design goals for this generation of the core included increasing both the frequency and the instructions per clock cycle (IPC), designers worked on timing and on minimizing the gates between flops. The goal at the start of the project was to lower typical application power by 10%. Ultimately, using a design methodology that included deployment of PowerPro® from Calypto®, AMD was able to lower the typical power by approximately 20% while increasing frequency at the given voltage by over 10%.
The power analysis flow
In AMD’s overall design flow, engineering managers would, at selected intervals, pick a tag from which to run synthesis, and a snapshot of the relevant RTL code would be run through PowerPro. Because PowerPro can analyze RTL in a matter of hours, AMD could run weekend regressions that both confirmed all of the simulations passed and produced a power analysis of the RTL design. Designers then increased clock-gating efficiency by iteratively adjusting the existing clock gates based on the PowerPro recommendations. These weekend regressions also allowed rapid analysis of design alternatives, resulting in significant performance and power improvements, including optimizations that could not have been done at the gate level or that might not have been detected and targeted without the PowerPro reports.
Typically, to run the power analysis of a given RTL snapshot, the following steps were completed using short AMD-internal scripts.
1. Run builtIt script
- Checks out the RTL view from the Perforce design repository
- Builds simulation model using pre-processor scripts and VCS
- Builds pre-synthesis view of the RTL code using pre-processor scripts
- Runs 39 tests using LSF to spawn jobs out to simulation farm machines
- Captures FSDB data that starts and ends at an instruction-count boundary
- Converts FSDB to SAIF files used by PowerPro
- Reads in IP.f and run.tcl files for each block and SAIF files for each simulation set
- Uses LSF to spawn PowerPro jobs out to simulation farm machines
- Creates output report directories and files with improved clock gating for review
- Analyzes PowerPro outputs and organizes results into summary tables to help track clock-gating improvements month to month and per IP
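The final summarizing step in the list above can be pictured with a small sketch. It assumes each analysis run has already been reduced to one record per IP block holding the flop count and the average number of flops clocked per cycle. The record layout, block names, snapshot labels, and numbers are all hypothetical; they do not reflect PowerPro's actual report format or AMD's data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockResult:
    snapshot: str             # label of the RTL snapshot (hypothetical)
    block: str                # IP block name (hypothetical)
    total_flops: int          # number of flops in the block
    avg_clocked_flops: float  # average flops receiving a clock edge per cycle


def gating_efficiency(result: BlockResult) -> float:
    """Fraction of flop-cycles in which the clock was gated off."""
    if result.total_flops <= 0:
        raise ValueError("total_flops must be positive")
    return 1.0 - result.avg_clocked_flops / result.total_flops


def summarize(results):
    """Return {block: {snapshot: efficiency}} for tracking over time."""
    table = defaultdict(dict)
    for r in results:
        table[r.block][r.snapshot] = gating_efficiency(r)
    return dict(table)


def format_table(table, snapshots):
    """Render one row per block and one column per snapshot."""
    lines = ["block".ljust(8) + "".join(s.rjust(10) for s in snapshots)]
    for block in sorted(table):
        cells = []
        for s in snapshots:
            eff = table[block].get(s)
            cells.append(("n/a" if eff is None else f"{eff:.1%}").rjust(10))
        lines.append(block.ljust(8) + "".join(cells))
    return "\n".join(lines)


# Illustrative values only, not measured data.
results = [
    BlockResult("snap_A", "blk_x", 1000, 300.0),
    BlockResult("snap_B", "blk_x", 1000, 200.0),
    BlockResult("snap_A", "blk_y", 2000, 500.0),
    BlockResult("snap_B", "blk_y", 2000, 400.0),
]
summary = summarize(results)
print(format_table(summary, ["snap_A", "snap_B"]))
```

Keeping one column per snapshot makes a regression in any single block visible at a glance, which is the kind of month-to-month, per-IP tracking the flow describes.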
yjchen
2/8/2013 2:21 AM EST
Hi Steve,
As you mentioned, the correlation between silicon and PTPX is about +/-10%. From your experience, what's the correlation between PowerPro and PTPX? And between PowerPro and silicon?
Also, in your flow, the input to PowerPro is SAIF. Why not use real waveforms, like VCD or FSDB? Thanks.
yjchen
GMN
2/22/2013 3:54 PM EST
PowerPro does use VCD and FSDB for more accurate analysis. However, if you are primarily concerned about clock-gating efficiency, and not looking for peak power analysis, then SAIF is faster and more efficient.
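To illustrate the trade-off: a waveform file keeps every transition in time order, while a SAIF-style summary keeps only per-signal totals such as time spent at 0, time spent at 1, and a toggle count. The sketch below reduces a list of value changes to those totals. The function name and input format are assumptions for illustration, not any tool's API.

```python
def summarize_signal(changes, end_time):
    """Reduce a value-change list to SAIF-style totals.

    changes:  list of (time, value) pairs sorted by time, value 0 or 1;
              the first entry gives the initial value.
    end_time: time at which the capture window closes.
    Returns time at 0 (T0), time at 1 (T1), and toggle count (TC).
    """
    if not changes:
        raise ValueError("need at least the initial value")
    durations = {0: 0, 1: 0}
    toggles = 0
    start, value = changes[0]
    if value not in durations:
        raise ValueError("values must be 0 or 1")
    last_time = start
    for time, new_value in changes[1:]:
        if new_value not in durations:
            raise ValueError("values must be 0 or 1")
        if time < last_time:
            raise ValueError("changes must be sorted by time")
        last_time = time
        if new_value == value:
            continue  # no transition, so the current interval keeps running
        durations[value] += time - start
        toggles += 1
        start, value = time, new_value
    if end_time < last_time:
        raise ValueError("end_time precedes the last change")
    durations[value] += end_time - start  # close the final interval
    return {"T0": durations[0], "T1": durations[1], "TC": toggles}


# Hypothetical clock-enable signal over a 50-unit window.
waveform = [(0, 0), (10, 1), (30, 0), (35, 1)]
summary = summarize_signal(waveform, end_time=50)
print(summary)
```

The totals are enough to compute average activity, but the order and timing of individual transitions are gone, which is why a summary of this kind suits clock-gating efficiency rather than peak power analysis.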
SteveKo
3/1/2013 12:38 PM EST
GMN had a good reply on SAIF usage; we used the Calypto-recommended flow for that technical decision.
For PowerPro to PTPX, we were not using the newer version that estimates actual power; we were looking at clock-gating efficiency. However, there was useful correlation there. As per Table 2, we achieved about a 25% reduction in flop activity rate from one design to the next, and that correlated with about 25% lower dynamic power for typical applications. (As a short point of interest, we did check how much power tended to be used per active flop on one of our early runs, but it varied a fair bit from block to block. As one would expect, blocks with lots of combinational logic, like floating point, had more gate fanout capacitance per flop than other blocks.)
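A first-order way to read that correlation is to treat dynamic power as the sum, over blocks, of active flops times a per-flop energy weight. The sketch below uses that simplified model. The block names, weights, and activity values are hypothetical and are chosen only to show the arithmetic.

```python
def relative_dynamic_power(active_flops, energy_per_flop):
    """Sum over blocks of active-flop rate times a per-flop energy weight."""
    if active_flops.keys() != energy_per_flop.keys():
        raise ValueError("both inputs must cover the same blocks")
    return sum(active_flops[b] * energy_per_flop[b] for b in active_flops)


# Hypothetical weights: a logic-heavy block costs more per active flop.
weights = {"fp": 2.0, "other": 1.0}

before = {"fp": 100.0, "other": 100.0}
uniform_cut = {"fp": 75.0, "other": 75.0}  # every block down 25%
uneven_cut = {"fp": 90.0, "other": 60.0}   # total flop activity also down 25%

base = relative_dynamic_power(before, weights)
uniform_ratio = relative_dynamic_power(uniform_cut, weights) / base
uneven_ratio = relative_dynamic_power(uneven_cut, weights) / base

print(f"uniform 25% activity cut: power ratio {uniform_ratio:.2f}")
print(f"uneven 25% activity cut:  power ratio {uneven_ratio:.2f}")
```

When activity falls by the same fraction in every block, the power ratio matches the activity ratio regardless of the weights. When the savings are concentrated in blocks with a low per-flop cost, the same overall activity reduction yields a smaller power reduction.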
daleste
2/11/2013 10:29 PM EST
Good work improving the efficiency of the design. Whatever happened to the clockless logic that was supposed to make all of this unnecessary?
SteveKo
3/1/2013 12:42 PM EST
:-) About 10 years ago we did some serious looking into a clockless X86 design for deep low power, but the toolsets for efficient timing closure weren't there. And providing sufficiently robust async timing for state machines eats into perceived benefits. I think clock trees and meshes with optimized gating strategies will be with us for a while.
Frank Eory
2/13/2013 9:18 AM EST
I find it amazing that after all these years of using clock gating to reduce power, the tools & methodologies continue to improve to such a degree that these types of large power reductions are still possible.
SteveKo
3/1/2013 12:46 PM EST
Indeed, it's interesting that even our max power virus pattern only needs 15% of the flops clocked. There was a lot of designer work optimizing clock gating, but Calypto's SLEC methodology helped show what could be done too.
Tools evolve and designers gain experience, leading to ever lower active flop counts.