Guest editorial: Low power is everywhere
Mary Ann White
4/18/2012 11:22 AM EDT
Architectural and implementation techniques trade off power vs. complexity
There are several techniques and architectural methodologies that can be used to provide various degrees of power savings – either dynamically or statically (e.g. leakage). The tables below show some of the techniques that can be used to achieve power savings.

Many of these techniques have been around for a while and are mainstream, such as clock gating and CTS for dynamic power savings, and multi-threshold-voltage (multi-Vt) cell library usage for leakage savings. Some of the newer techniques have come about because of the effects of advanced process geometries discussed in the previous section. For example, final-stage leakage recovery takes advantage of the many channel length and gate bias variations available in processes at 28nm and below.
The effectiveness of a power saving technique must be weighed against the design complexity it adds and how easily it can be deployed in a typical design flow. For example, biasing requires extra routes to the bias taps, which can decrease utilization and increase area. It also requires a methodology change, along with differently characterized versions of the libraries for more accurate analysis of operation.
Biasing techniques were initially deployed as back-biasing to save leakage power; this was particularly effective for memories and larger process nodes (>90nm). Body- and source-biasing started to be adopted when they provided savings of about 15-25% in active leakage at the 90nm and 65nm process nodes. But as process nodes continued to shrink towards 20nm, the savings became negligible for traditional MOS-based processes. FD-SOI and FinFET technologies have adapted the use of biasing, so the technique may once again become effective.
Synopsys performs a global user survey every year, collecting data from the design community that reflects current design trends. The graph below shows which low-power techniques are used across various application market segments, confirming the trend that design for power extends beyond mobile applications. Note that respondents were asked to list all techniques used, so the data adds up to more than 100%.

As previously noted, mainstream techniques such as clock gating are prevalent, but what is surprising is the high level of adoption of many of the more advanced techniques. Multi-voltage designs (usually driven by a standards-based power intent format such as UPF) are now used in approximately 50% of the respondents’ designs. Adoption of these advanced techniques across applications indeed shows that design for power is needed everywhere!
Increasing the number of clock domains helps achieve performance and power targets, which is also reflected in the user survey: approximately 30% of the respondents now design with more than 10 clock domains. Advanced users have reported using up to 1000 different clock domains.

In addition to clock domains, the survey shows the adoption of multi-voltage design: most designs use anywhere from 1 to 3 voltage domains operating at different supply voltage levels. More than 30% of the respondents reported designing with more than 3 different voltage domains.
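As a rough illustration of why running a block in a lower-voltage domain saves dynamic power, the sketch below uses the standard first-order CMOS switching-power relation P = alpha * C * V^2 * f. The function name and all numeric values are assumptions chosen for illustration; they do not come from the survey or from any particular design.

```c
/* First-order CMOS switching power: P = alpha * C * V^2 * f.
 *   alpha    activity factor (fraction of capacitance switched per cycle)
 *   c_farads total switched capacitance
 *   v_volts  supply voltage of the domain
 *   f_hz     clock frequency
 * Leakage and short-circuit power are deliberately ignored here. */
static double dynamic_power_watts(double alpha, double c_farads,
                                  double v_volts, double f_hz)
{
    return alpha * c_farads * v_volts * v_volts * f_hz;
}
```

With hypothetical values alpha = 0.2, C = 1 nF and f = 500 MHz, dynamic_power_watts(0.2, 1e-9, 1.0, 500e6) is about 0.1 W, while the same block in a 0.8 V domain comes to about 64% of that figure, since switching power scales with the square of the supply voltage.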


Paul A. Clayton
4/19/2012 5:41 PM EDT
A few other reasons that power use can be important:
*form factor (lowering cooling requirements can facilitate a lighter, smaller, and/or less exposed system; this is a factor with servers where data center space costs money as well as consumer electronics and deeply embedded systems)
*product cost (larger heat sinks add cost, fans add cost--material (including inventory) and assembly--, heat management adds design cost)
*reliability (keeping temperatures down reduces soft errors and hard failures; in addition, power saving techniques like DVFS can increase MTTF by reducing electromigration et al.; removal of active cooling can remove a point of failure--particularly one with a moving part--and even reducing active cooling requirements can improve resilience; tighter integration facilitated by lower power can also reduce vulnerability to mechanical stresses and external electromagnetic interference)
*performance (when performance is limited by TDP, energy-efficiency can increase performance; in addition, if the number of external connections (pins) for power and ground can be reduced, more pins can be available for signals increasing available signal bandwidth; lower power can also facilitate tighter integration which can improve latency and bandwidth)
There is also a distinction between chemical batteries and other power sources. Energy harvesting techniques and radioisotope power cells have different constraints than chemical batteries.
I realize that including all of the above in the introduction would have added too much length, but it is easy to forget how multifaceted power concerns are.
BrianBailey
4/19/2012 5:58 PM EDT
I think what you are pointing out is that so many issues associated with complete product design are interrelated and that the consumption of power and the removal of the heat it generates impacts every facet of system design. Thanks for adding some of those dependencies.
Paul A. Clayton
4/21/2012 6:33 PM EDT
Yes, it must be difficult for professionals to handle so much complexity (made worse by communication barriers even within organizations)--and with severe time limits and pressure to predict the result more than a year in advance. I am just a thinker (not even an academic), and even the limited complexity of which I am aware makes my head hurt (almost literally).
Paul A. Clayton
4/19/2012 7:16 PM EDT
While this article focuses on low-level techniques--as is reasonable coming from someone at Synopsys--there might be interest in overviews of higher-level (architectural, microarchitectural, and software) techniques.
Techniques like approximate computation (mainly for audio/visual but also sometimes applicable to sensor data analysis) and analog computation (as in Lyric Semiconductor's error correction technology) seem to show some promise. (These can also apply to predictive structures like branch predictors.)
Asynchronous design, "Power Balanced Pipelines" (Sartori et al.), and other general microarchitectural techniques look interesting (at least to someone with an academic interest in computer architecture).
Techniques to improve performance can also improve power efficiency.
Software techniques can include optimizations to improve cache utilization (code density and code and data layout can help) and the scheduling of work to reduce the number of power transitions.
Software optimizations which improve performance can also improve power efficiency by avoiding unnecessary work and improving hurry-up-and-go-to-sleep effectiveness.
Even the little I have read in this area indicates that there are a lot of interesting techniques for managing power use.
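The hurry-up-and-go-to-sleep point above can be sketched with a simple energy model: over a fixed period, the system is active for some time and asleep for the rest. The function name and the power figures used below are hypothetical, and the model assumes the optimized code draws the same active power as the unoptimized code, which will not always hold.

```c
/* Energy consumed over one fixed period in which the system is active
 * for t_active_s seconds and asleep for the remainder.
 * Transition (wake/sleep) energy is ignored in this simplified model. */
static double period_energy_joules(double p_active_w, double p_sleep_w,
                                   double t_active_s, double period_s)
{
    return p_active_w * t_active_s + p_sleep_w * (period_s - t_active_s);
}
```

Under assumed figures of 1.0 W active and 0.01 W asleep over a 1 s period, code that finishes in 0.5 s costs about 0.505 J, while code optimized to finish in 0.25 s costs about 0.2575 J, because the saved time is spent in the low-power state.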
BrianBailey
4/20/2012 9:55 AM EDT
I did run two articles on software and power a couple of weeks ago:
Efficient C code for ARM devices http://eetimes.com/design/eda-design/4370230/EDADL-Efficient-C-code-for-ARM-devices?
and
Optimizing performance, power, and area in SoC designs using MIPS® multi-threaded processors http://eetimes.com/design/eda-design/4370392/Optimizing-performance--power--and-area-in-SoC-designs-using-MIPS--multi-threaded-processors?
Paul A. Clayton
4/21/2012 7:09 PM EDT
The former paper was somewhat interesting (I was surprised that 16-bit local variables would be expanded to 32-bit even in the cache) and points to some unfortunate limits of C and its compilers.
The latter article was more focused on the specific topic of exploiting the benefits of MIPS MT. I had already understood the principles, but the examples were interesting.
One problem seems to be that this information is scattered. Because the information content is vast and has complex interconnection, it seems that something like a wiki could be useful. Such a project would be outside the scope of EE Times (alone).
I do not know that such would be useful to anyone. Since I am just an information junkie, my feelings should have little weight.
cshore
4/23/2012 12:00 PM EDT
I am the author of the first of those papers which Brian cited. Glad you found it interesting.
I'm interested in your comment about local variables being expanded to 32-bit in the cache. Can you expand on that a bit more because I don't believe it has to be that way.
Paul A. Clayton
4/23/2012 12:46 PM EDT
I based my comment on the statement "Remember, too, that local variables, regardless of size, always take up an entire 32-bit register when held in the register bank and an entire 32-bit word in memory when spilled on to the stack." (page 4 of "Efficient C Code for ARM Devices")
If it meant callee spilling, I could understand the constraint. (This limitation could motivate a compiler optimization that would preferentially allocate 32-bit values into callee save registers.) I could also understand how such could make debugging easier. (Also on ARM, code density goals--or even performance goals, since store/load multiple has sometimes been implemented using paired word operations--might promote use of store/load multiple word.)
(The ABI forcing such expansion for function parameters may be a concession to simplify debuggers or perhaps compilers. In theory, one does not need to use the ABI, at least for internal functions.)
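To make the quoted statement concrete, here is a minimal sketch of two otherwise identical functions, one using a 16-bit local index and one using a natural-width index. The function names are hypothetical and not taken from the paper. Since a local occupies a full 32-bit register either way, the narrow type saves no register space, and in some code the compiler may have to add instructions to keep the value wrapped to 16 bits; whether it does so here depends on the compiler and optimization level, so the generated assembly would need to be inspected.

```c
#include <stdint.h>

/* 16-bit local index: still held in a full 32-bit register. */
uint32_t sum_narrow(const uint8_t *data, uint16_t count)
{
    uint32_t total = 0;
    for (uint16_t i = 0; i < count; i++)
        total += data[i];
    return total;
}

/* Natural-width local index: same register usage, no narrowing required. */
uint32_t sum_word(const uint8_t *data, unsigned int count)
{
    uint32_t total = 0;
    for (unsigned int i = 0; i < count; i++)
        total += data[i];
    return total;
}
```

Both functions return the same result for any count that fits in 16 bits; the only difference is the declared width of the locals.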
cshore
4/26/2012 8:59 AM EDT
Thanks for the response.
I was referring to any spilling of variables onto the stack. All ARM stack accesses are 32-bit so any spilled variable (or parameter, or variable allocated to the stack) takes up a full word.
To my knowledge, the register allocator does not take this into account when allocating registers to variables within procedures. If it is possible to save/spill a pair of variables using LDRD/STRD, that is sometimes down to serendipity as I understand it (some forms of these instructions require that the registers be a consecutive odd/even pair).
You are right that you don't need to stick to the ABI for internal functions. Not doing so is obviously potentially dangerous, as I'm sure you are aware!
Leaving the stack aligned to anything less than a word boundary when interrupts are enabled can be especially perilous.
Chris
Paul A. Clayton
4/26/2012 1:22 PM EDT
I think we may have different definitions of spilling. I think of spilling as any moving of a register value into memory (e.g., due to register pressure). I am guessing that you may mean something else, perhaps saving callee save registers (where the callee cannot conveniently know the size of register contents nor if the value already has a slot allocated in a previous stack frame--interprocedural optimization might be able to discover such).
I also do not understand your statement "All ARM stack accesses are 32-bit" since ARM provides LDRH/STRH using the stack pointer, which is just a GPR after all (I doubt even AArch64--which makes SP a non-GPR--prohibits sub-word accesses using SP). (Pushing and popping smaller values would be problematic in making SP unaligned.)
By the way, my gmail.com address is 'paaronclayton'.
cshore
4/27/2012 10:18 AM EDT
I think we're close on our definitions. I am thinking of any accesses to the stack carried out by code running on an ARM system which complies with the ABI. That covers parameters, spills, automatic variables, caller/callee-saved registers etc.
The ABI says that the stack pointer must be word aligned at all times (and doubleword-aligned at external boundaries). It doesn't actually say that you can't push/pop two halfwords at once in a pair of atomic operations but doing so would be impractically difficult while sticking to the ABI.
Yes, you can use halfword memory accesses indexed via SP, in the sense that the instruction set permits it. But it isn't possible (or at least practical) to do so in a way which doesn't violate the ABI.
The ABI for AArch64 specifies quadword alignment for SP at all times (whether externally visible or not) so, although instructions may exist for sub qword stack accesses, they aren't practically usable in this context.
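The alignment rules described in this thread can be sketched as a rounding helper: whatever a frame actually needs, the stack pointer has to move by a multiple of the required alignment. The constant and function names below are hypothetical; the byte values follow from the word, doubleword, and quadword alignments mentioned above, taking a word as 4 bytes.

```c
#include <stddef.h>

/* Alignment requirements in bytes, as described in the comments above. */
enum {
    SP_ALIGN_WORD = 4,        /* 32-bit ARM: SP word-aligned at all times */
    SP_ALIGN_DOUBLEWORD = 8,  /* 32-bit ARM: at external boundaries */
    SP_ALIGN_QUADWORD = 16    /* AArch64: SP quadword-aligned */
};

/* Round size up to the next multiple of align.
 * align must be a nonzero power of two. */
static size_t align_up(size_t size, size_t align)
{
    return (size + align - 1) & ~(align - 1);
}
```

For example, a single halfword local still costs align_up(2, SP_ALIGN_WORD) = 4 bytes of stack, a 12-byte frame at an external boundary grows to align_up(12, SP_ALIGN_DOUBLEWORD) = 16 bytes, and a 20-byte frame on AArch64 grows to align_up(20, SP_ALIGN_QUADWORD) = 32 bytes.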