Making ESL power optimization a reality
Shawn McCloud, Bryan Bowyer and Vikas Tyagi - Calypto
1/7/2013 8:00 AM EST
Additional basic concepts
4. Micro-architecture: Optimizing and exploring the micro-architecture is a powerful technique for improving an algorithm’s performance and adjusting for power. For algorithms, it’s not just about operating frequency. The important measurements are latency (how long it takes to get the first result) and throughput (how often new data can be fed in).
An eight-tap FIR filter with eight multipliers may have a latency of one cycle if the clock period is long enough for the adder tree to complete in the same cycle as the multipliers. With a shorter period it might have a latency of two or three cycles, or even more if pipelined multipliers are used, yet the throughput can remain constant at one result per clock cycle.
If only one multiplier is used and the coefficients and tap registers are stored in a single RAM, then the latency might be nine or ten clock cycles (or more), with a similarly longer interval between results, but this comes with the benefit of considerably reduced area and power [Figure-2].

Figure 2: FIR serial versus FIR parallel implementation
One may reduce clock frequency to reduce power. This may then require increased parallelism in the design (using loop unrolling in the HLS tool) to keep latency within target. Setting loop pipelining and unrolling constraints in an HLS tool makes it quick to produce these implementations. Designers can then compare the power, performance, and area of each one. The right implementation depends on the design goals for frequency, latency, and throughput taken together.
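As a rough illustration of the serial versus parallel trade-off, the sketch below shows an eight-tap FIR written as plain C++. It is not taken from the article: the names `kTaps`, `fir_step`, and `mac_cycles` are hypothetical, and the cycle estimate is a simplification that counts only multiply-accumulate steps. The loop is the piece an HLS tool could either unroll fully (eight multipliers) or leave rolled (one multiplier); the tool-specific directives for doing so are deliberately not shown.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kTaps = 8;  // eight-tap filter, as in the text

// Shift one new sample into the delay line and return the filter output.
// The multiply-accumulate loop is what an HLS tool could unroll
// (parallel, eight multipliers) or keep rolled (serial, one multiplier).
inline std::int32_t fir_step(std::array<std::int16_t, kTaps>& taps,
                             const std::array<std::int16_t, kTaps>& coeffs,
                             std::int16_t sample) {
    // Advance the tap delay line by one position.
    for (std::size_t i = kTaps - 1; i > 0; --i) {
        taps[i] = taps[i - 1];
    }
    taps[0] = sample;

    // Sum of products over all taps.
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < kTaps; ++i) {
        acc += static_cast<std::int32_t>(taps[i]) * coeffs[i];
    }
    return acc;
}

// Assumed, simplified estimate of multiply-accumulate cycles per output:
// ceil(taps / multipliers). It ignores RAM access, the adder tree, and
// multiplier pipelining, which add the extra cycles the text mentions.
constexpr std::size_t mac_cycles(std::size_t taps, std::size_t multipliers) {
    return (taps + multipliers - 1) / multipliers;
}
```

Under this simplified model, `mac_cycles(8, 8)` is 1 and `mac_cycles(8, 1)` is 8; the nine or ten cycles quoted for the serial case would include overhead that the estimate leaves out.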
5. Frequency: The golden source code (SystemC/C++/C) is independent of technology details. The same code can be retargeted to different technologies [Figure-3] because frequency is just a parameter. Through frequency exploration, designers can set or adjust the clock frequency; the HLS tool then schedules the operations to fit within the clock period. Also, since the implementation can be controlled down to the resources used, designers can experiment with different operators, such as pipelined multipliers and adders.

Figure 3: Target optimized RTL code generation
For example, if the analysis tools show that one implementation has extra timing slack, the designer can reduce the voltage to save power. Or, with a slightly faster implementation, they can share more operators. In this way, they can balance dynamic power against parallelism for better performance.
6. Block hierarchy: Having hierarchical blocks naturally lends itself to multi-clock design. More advanced HLS tools support running the blocks at different clock speeds and handling the data transfer between blocks through FIFOs. Designs with decimation are well suited to multi-clock design [Figure-4].

Figure 4: Decimation
Blocks with lower data rates may run on a slower clock, which reduces switching power and, because the block can be made smaller, static power as well. More generally, the clock frequency of each block can be tuned to match the best implementation for throughput or latency, and for power, on the target technology, using the same source code.
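The following is a minimal behavioral sketch of the decimation case, assuming a fast-clock block that keeps every N-th sample and a slow-clock block that reads from a FIFO once every N fast ticks. It is a software model only, not HLS input: `Decimator`, `consume`, and `run_decimation` are hypothetical names, the `std::queue` stands in for the hardware FIFO, and the low-pass filtering a real decimator would need is omitted.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Fast-clock block: forwards every `factor`-th sample into the FIFO.
// `factor` is assumed to be at least 1.
class Decimator {
public:
    explicit Decimator(std::size_t factor) : factor_(factor) {}

    // Called once per fast-clock tick.
    void push(int sample, std::queue<int>& fifo) {
        if (count_ == 0) {
            fifo.push(sample);
        }
        count_ = (count_ + 1) % factor_;
    }

private:
    std::size_t factor_;
    std::size_t count_ = 0;
};

// Slow-clock block: called once per slow-clock tick.
// Returns false if the FIFO had nothing to read.
inline bool consume(std::queue<int>& fifo, int& out) {
    if (fifo.empty()) {
        return false;
    }
    out = fifo.front();
    fifo.pop();
    return true;
}

// Drive both blocks: the slow clock ticks once per `factor` fast ticks.
inline std::vector<int> run_decimation(const std::vector<int>& input,
                                       std::size_t factor) {
    Decimator decimator(factor);
    std::queue<int> fifo;
    std::vector<int> output;
    for (std::size_t tick = 0; tick < input.size(); ++tick) {
        decimator.push(input[tick], fifo);
        if ((tick + 1) % factor == 0) {
            int value = 0;
            if (consume(fifo, value)) {
                output.push_back(value);
            }
        }
    }
    return output;
}
```

With a factor of 4, the downstream block does useful work on only one fast tick in four, which is the situation where giving it a clock one quarter as fast costs no throughput.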
7. LVFS (Low Voltage Frequency Scaling): In low power mode, the HLS tool can insert an idle signal (a 1-bit output port) in the design. This signal is asserted when the block is in an idle state (not processing any data, not reading any input, and not writing to any output). It can be used in a system-level power management strategy, such as LVFS or gating the clock or power to a block.
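The idle condition described above can be written down directly. The sketch below is an assumption about how such a signal might be modeled in software, not the output of any particular HLS tool; `BlockActivity`, `is_idle`, and `clock_enable` are hypothetical names.

```cpp
// Snapshot of what a block is doing in the current cycle.
struct BlockActivity {
    bool processing;      // computing on data
    bool reading_input;   // consuming from an input port
    bool writing_output;  // producing on an output port
};

// The 1-bit idle signal: asserted only when the block is doing
// none of the three activities listed in the text.
inline bool is_idle(const BlockActivity& activity) {
    return !activity.processing && !activity.reading_input &&
           !activity.writing_output;
}

// One possible use by a system-level power manager (an assumption):
// let the clock through only while the block is not idle.
inline bool clock_enable(const BlockActivity& activity) {
    return !is_idle(activity);
}
```

A power manager could equally use the same signal to trigger a lower voltage or frequency setting rather than gating the clock outright.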