Design Article
Tell us What You Think
We want to know what you thought about this Design. Let us know by adding a comment.
Making ESL power optimization a reality
Shawn McCloud, Bryan Bowyer and Vikas Tyagi - Calypto
1/7/2013 8:00 AM EST
Low power is a central concern of digital design, especially for handheld and wireless devices, but also for servers and other computation intensive applications where the cost of cooling and packaging can be quite high. As a consequence, power optimization is an essential factor in meeting and improving quality of results as well as for optimizing performance and area.
Thus far, power optimization efforts have centered on RTL models and gate-level netlists, which are not sufficient for achieving optimal power savings. Optimizing for power should occur at all levels of design—from architecture to board. It is at the architecture, or electronic system level (ESL), where the potential for power savings are the greatest. Indeed, the opportunities for optimizing low power are significantly higher at the architectural level of abstraction—with as much as a 10X improvement over gate level optimizations. Yet, ironically, this is where low power methodologies and tools are the weakest. This deficiency drives the need for tools that not only allows designers to explore the best architecture for power at a higher level of abstraction but also automatically implements lower level transformations, like sequential clock gating, in the RTL produced.
The answer is found in the integration of high-level synthesis (HLS) and power analysis to create a new HLS product capable of optimizing across three dimensions - power, performance and area (PPA). HLS allows designers to synthesize different RTL architectures from C++ or SystemC electronic system level (ESL) models. The different hardware architectures are generated through user constraints which specify such things as clock period, resource limitations, IO protocol and the level of desired concurrency. Such a low power HLS solution can implement a generous range of low power techniques into synthesized RTL; including bit-width optimization, multiple clock domain partitioning, memory access minimization, resource sharing, frequency exploration, power gating, and clock gating.
In this article, we will discuss, in general, the ESL to RTL low power design flow, and then share the results of two case studies using real customer designs to evaluate the efficacy of a unique solution for ESL synthesis and power architecting.
7 basics of architectural power exploration
There are seven basic concepts that designers should focus on when looking for ways to save power, while satisfying performance goals, during architectural exploration.
1. Numerical refinement: The first design step for controlling power is numerical refinement. Algorithmic C bit-accurate data types support arbitrary precision, allowing designers to specify any desired bit width for both integer and fixed point data types. SystemC data types can be used interchangeably. At higher design abstraction levels, this allows using only numbers represented by a minimum number of bits to minimize area and power and remain within error tolerances.
2. Interfaces: If a design’s interface is hammering the bus or memory, the designer can expand the bit-width of the interface to do several read and writes at once and store the data locally. In pure C++ designs, this can be achieved simply by using HLS interface synthesis technology without modifying the source code. In SystemC, constraints may work in limited cases but can always be implemented by changing the source code.
3. Memory architecture: For many algorithms, power, performance, and area are highly dependent on memory architectures. For example, a FIR filter can be implemented using a shift register, a rotational shift register, or a circular buffer [Figure-1].

A shift register based implementation can consume higher power at higher frequencies because all taps will switch with each shift. This is typically suited for filters with a smaller number of taps and gives the highest performance.
Rotational shift is an intermediate solution. This removes the MUX feeding the multiplier. This becomes a bottleneck as the number of taps becomes larger. Rotation occurs as part of a MAC loop after the +=. A circular buffer based implementation is good for a filter with a large number of taps and is ideal for mapping into a memory. This uses one pointer to set a write point (advances forward) and another pointer to set a read point (decrements in reverse to round the array).
Thus far, power optimization efforts have centered on RTL models and gate-level netlists, which are not sufficient for achieving optimal power savings. Optimizing for power should occur at all levels of design—from architecture to board. It is at the architecture, or electronic system level (ESL), where the potential for power savings are the greatest. Indeed, the opportunities for optimizing low power are significantly higher at the architectural level of abstraction—with as much as a 10X improvement over gate level optimizations. Yet, ironically, this is where low power methodologies and tools are the weakest. This deficiency drives the need for tools that not only allows designers to explore the best architecture for power at a higher level of abstraction but also automatically implements lower level transformations, like sequential clock gating, in the RTL produced.
The answer is found in the integration of high-level synthesis (HLS) and power analysis to create a new HLS product capable of optimizing across three dimensions - power, performance and area (PPA). HLS allows designers to synthesize different RTL architectures from C++ or SystemC electronic system level (ESL) models. The different hardware architectures are generated through user constraints which specify such things as clock period, resource limitations, IO protocol and the level of desired concurrency. Such a low power HLS solution can implement a generous range of low power techniques into synthesized RTL; including bit-width optimization, multiple clock domain partitioning, memory access minimization, resource sharing, frequency exploration, power gating, and clock gating.
In this article, we will discuss, in general, the ESL to RTL low power design flow, and then share the results of two case studies using real customer designs to evaluate the efficacy of a unique solution for ESL synthesis and power architecting.
7 basics of architectural power exploration
There are seven basic concepts that designers should focus on when looking for ways to save power, while satisfying performance goals, during architectural exploration.
1. Numerical refinement: The first design step for controlling power is numerical refinement. Algorithmic C bit-accurate data types support arbitrary precision, allowing designers to specify any desired bit width for both integer and fixed point data types. SystemC data types can be used interchangeably. At higher design abstraction levels, this allows using only numbers represented by a minimum number of bits to minimize area and power and remain within error tolerances.
2. Interfaces: If a design’s interface is hammering the bus or memory, the designer can expand the bit-width of the interface to do several read and writes at once and store the data locally. In pure C++ designs, this can be achieved simply by using HLS interface synthesis technology without modifying the source code. In SystemC, constraints may work in limited cases but can always be implemented by changing the source code.
3. Memory architecture: For many algorithms, power, performance, and area are highly dependent on memory architectures. For example, a FIR filter can be implemented using a shift register, a rotational shift register, or a circular buffer [Figure-1].

Figure 1: Filter tap implementation in FIR filtering
A shift register based implementation can consume higher power at higher frequencies because all taps will switch with each shift. This is typically suited for filters with a smaller number of taps and gives the highest performance.
Rotational shift is an intermediate solution. This removes the MUX feeding the multiplier. This becomes a bottleneck as the number of taps becomes larger. Rotation occurs as part of a MAC loop after the +=. A circular buffer based implementation is good for a filter with a large number of taps and is ideal for mapping into a memory. This uses one pointer to set a write point (advances forward) and another pointer to set a read point (decrements in reverse to round the array).
Navigate to related information

