Lowering the power consumption of consumer products and networking centers is an important design consideration, and this effort begins with many of the chips that go into these devices. Semiconductor design innovators like AMD want each new design to improve on the previous generation: faster performance within a given power envelope, higher frequency at a given voltage, and better power efficiency through clock gating and unit redesign.
With these aims, the AMD low-power core design team used a power analysis solution that helped analyze pre-synthesis RTL clock-gating quality, find opportunities for improvement, and generate reports that the engineering team could use to decrease the operating power of the design. Because it targets pre-synthesis RTL, the power analysis can be run more often and over a larger number of simulation cycles, and it completes more quickly and with fewer machine resources than tools that rely on synthesized gates. The focus on clock gating and the quick turnaround of RTL analysis allowed AMD to achieve measurable power reductions for typical applications of a new, low-power X86 AMD core.
The AMD Jaguar X86 core is a flexible, high-frequency processor aimed at system-on-a-chip designs for low-power markets and cloud clients. It is built on 28-nm process technology and has a small die area (3.1 mm²). Compared to the previous generation of this core, AMD Bobcat, many blocks were redesigned for improved power efficiency, including the IC loop buffer, store queue, and L2 clocks. The Jaguar compute unit (CU) includes four independent Jaguar cores and a shared-cache unit with four L2 databanks and an L2 interface tile. The L2 interface block runs at the core clock speed. The L2 databanks run at half the core clock rate to save power and are clocked only when required, which reduces power even further.
Figure 1. AMD Jaguar compute core architecture.
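To see why half-rate clocking and on-demand clocking of the L2 databanks compound, it helps to look at the standard first-order model of dynamic switching power, P = α·C·V²·f. The sketch below is only an illustration of that model: the function name, the normalized capacitance and voltage, and the 30% activity figure are assumptions chosen for the example, not AMD data.

```python
def relative_clock_power(freq_ratio: float, active_fraction: float) -> float:
    """Clock-network dynamic power relative to an ungated, full-rate clock.

    First-order model: P = alpha * C * V^2 * f. With capacitance C and
    voltage V held constant, relative power scales with the clock frequency
    and with the fraction of time the clock is actually toggling.
    """
    if not 0.0 <= freq_ratio <= 1.0:
        raise ValueError("freq_ratio must be between 0 and 1")
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    return freq_ratio * active_fraction


# Hypothetical numbers for illustration only.
half_rate_always_on = relative_clock_power(freq_ratio=0.5, active_fraction=1.0)
half_rate_on_demand = relative_clock_power(freq_ratio=0.5, active_fraction=0.3)

print(f"half-rate, always clocked: {half_rate_always_on:.2f} of baseline")
print(f"half-rate, clocked 30% of the time: {half_rate_on_demand:.2f} of baseline")
```

The model ignores leakage and the overhead of the gating logic itself, so it should be read as a bound on the trend rather than a prediction.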
As design goals included increasing the frequency and instructions per clock cycle (IPC) in this generation of the core, designers worked on timing and minimizing the gates between flops. The goal at the start of the project was to lower typical application power by 10%. Ultimately, using a design methodology that included deployment of PowerPro® from Calypto®, AMD was able to lower the typical power by approximately 20% while increasing frequency at the given voltage by over 10%.
The power analysis flow
In AMD’s overall design flow, engineering managers would pick a tag at selected intervals from which to run synthesis, and a snapshot of the relevant RTL code would be run through PowerPro. Because PowerPro can analyze RTL in a matter of hours, AMD could run weekend regressions both to confirm that all of the simulations passed and to analyze the power of the RTL design. The team then increased clock-gating efficiency by iteratively adjusting the existing clock gates based on the PowerPro recommendations. These weekend regressions also allowed rapid analysis of design alternatives, resulting in significant performance and power improvements, including optimizations that could not have been done at the gate level or that might not have been detected and targeted without the PowerPro reports.
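The idea of measuring clock-gating efficiency and ranking the registers most worth regating can be sketched in a few lines. The example below assumes one common definition of the metric (the bit-weighted fraction of cycles in which a register's clock is gated off) and treats cycles where the clock toggles but the stored value does not change as wasted. The class, the function names, and the register data are hypothetical; this is not PowerPro's report format or its algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RegisterActivity:
    name: str                # hypothetical register name
    bits: int                # register width
    total_cycles: int        # simulated cycles
    clocked_cycles: int      # cycles in which the register clock toggled
    data_change_cycles: int  # cycles in which the stored value changed


def gating_efficiency(regs: List[RegisterActivity]) -> float:
    """Bit-weighted fraction of cycles in which register clocks were gated off."""
    total = sum(r.bits * r.total_cycles for r in regs)
    clocked = sum(r.bits * r.clocked_cycles for r in regs)
    return 1.0 - clocked / total if total else 0.0


def wasted_clocking(reg: RegisterActivity) -> int:
    """Bit-cycles in which the clock toggled but the data did not change."""
    return reg.bits * (reg.clocked_cycles - reg.data_change_cycles)


def rank_candidates(regs: List[RegisterActivity]) -> List[RegisterActivity]:
    """Registers ordered from most to least wasted clocking."""
    return sorted(regs, key=wasted_clocking, reverse=True)


# Made-up activity data for three registers over 1000 cycles.
regs = [
    RegisterActivity("store_queue_entry", 64, 1000, 1000, 100),
    RegisterActivity("loop_buffer_tag", 32, 1000, 200, 150),
    RegisterActivity("l2_if_state", 8, 1000, 1000, 900),
]

print(f"clock-gating efficiency: {gating_efficiency(regs):.1%}")
for reg in rank_candidates(regs):
    print(f"{reg.name}: {wasted_clocking(reg)} wasted bit-cycles")
```

Ranking by wasted bit-cycles rather than by raw efficiency puts wide, frequently clocked registers first, which is where an added or tightened enable condition is likely to save the most.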
Typically, to run the power analysis of a given RTL snapshot, the following steps were completed using short AMD-internal scripts.
1. Run builtIt script
- Checks out the RTL view from the Perforce design repository
- Builds simulation model using pre-processor scripts and VCS
- Builds pre-synthesis view of the RTL code using pre-processor scripts
2. Run simIt script
- Runs 39 tests using LSF to spawn jobs out to simulation farm machines
- Captures FSDB data that starts and ends at an instruction-count boundary
- Converts FSDB to SAIF files used by PowerPro
3. Run powerProIt script
- Reads in IP.f and run.tcl files for each block and SAIF files for each simulation set
- Uses LSF to spawn PowerPro jobs out to simulation farm machines
- Creates output report directories and files with improved clock gating for review
4. Run sum.pl script
- Analyzes PowerPro outputs and organizes results into summary tables to help track clock-gating improvements month to month and per IP
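The final summary step can be pictured as a small tabulation over per-block results. The sketch below is a Python stand-in for that kind of month-to-month, per-IP table; it does not reproduce the AMD sum.pl script, and the input structure, block names, and percentages are assumptions made up for illustration.

```python
from typing import Dict, List


def summarize(results: Dict[str, Dict[str, float]]) -> List[str]:
    """Build a plain-text table of clock-gating efficiency per IP and month.

    `results` maps an IP block name to {month label: efficiency in percent}.
    Month labels are assumed to sort chronologically as strings (YYYY-MM).
    """
    months = sorted({m for per_ip in results.values() for m in per_ip})
    header = f"{'IP':<12}" + "".join(f"{m:>10}" for m in months) + f"{'change':>10}"
    lines = [header]
    for ip in sorted(results):
        per_ip = results[ip]
        cells = "".join(
            f"{per_ip[m]:>10.1f}" if m in per_ip else f"{'-':>10}" for m in months
        )
        seen = [per_ip[m] for m in months if m in per_ip]
        # Change between the first and last month that have data for this IP.
        change = seen[-1] - seen[0] if len(seen) >= 2 else 0.0
        lines.append(f"{ip:<12}{cells}{change:>+10.1f}")
    return lines


# Hypothetical per-IP results for two monthly snapshots.
example = {
    "ip_a": {"2012-01": 40.0, "2012-02": 55.5},
    "ip_b": {"2012-02": 30.0},
}
for line in summarize(example):
    print(line)
```

Keeping one row per IP and one column per snapshot makes regressions visible at a glance: a negative value in the change column flags a block whose gating got worse between tags.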