One big area that is missing is the large share of the power budget consumed by clock tree buffers for clock distribution. Companies like Cyclos Semi are working on LC resonant tank implementations that can reduce clock distribution power by 80%, and overall power by 15-20%, in GHz-clock CPUs and SoCs.
The power consumption of a semiconductor chip largely depends on where the chip is used, at what voltage, at what frequency, and so on; the list keeps growing. That makes it very hard to consolidate in a single article, and a book would be a better way to explain it. Still, the article is smartly written and covers all the different angles.
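As a quick sanity check on those two figures, here is a minimal sketch of the arithmetic. It assumes the overall saving comes entirely from the clock network and that the rest of the chip's power is unchanged; the function name is made up for illustration.

```python
def implied_clock_share(overall_saving, clock_saving=0.80):
    """Fraction of total chip power spent on clock distribution,
    implied by a given clock-network saving and overall saving.

    Assumes nothing else on the chip changes (an assumption, not a
    claim from the vendor).
    """
    return overall_saving / clock_saving


if __name__ == "__main__":
    for overall in (0.15, 0.20):
        share = implied_clock_share(overall)
        print(f"overall saving {overall:.2f} -> clock share {share:.4f}")
```

Under that assumption, the quoted numbers imply that clock distribution accounts for roughly a fifth to a quarter of total chip power.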
One way to handle leakage is temperature control. One may not need to go all the way to liquid nitrogen temperatures to get benefits. Of course, this is probably more than ten years out, and cooling adds overall power as well. Sigh.
Planar, fully-depleted silicon-on-insulator (FD-SOI). As I just blogged (see http://bit.ly/xse0uI), the SOI Consortium's most recent results show a 40% power reduction on 28nm complex circuits including ARM cores and DDR3 memory controllers. It lets you run all digital device designs, including SRAMs, at very low Vdd (e.g., 0.6 volt). And see Steve Leibson's blog (http://bit.ly/wG22yL), in which IBM shows you get a 10x reduction in leakage power with back biasing on planar FD-SOI. Also, FinFETs (the vertical flavor of fully-depleted) on SOI are even lower power than FinFETs on bulk. Lots of info on www.soiconsortium.org.
Approximate computation is another technique for power saving. This can take the form of limited-precision computation or approximate arithmetic. Approximate arithmetic can be hardwired or can result from reducing voltage. Inaccurate results can either be simply tolerated or be corrected; correction pays off when the approximate answer is accurate often enough that the energy cost of occasional correction is less than the cost of always-correct arithmetic. Carry prediction could be an example of this.
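To make the carry prediction idea concrete, here is a minimal software sketch of one possible scheme, not any particular vendor's design. Each sum bit sees a carry computed from only the previous `window` bits, and a conservative detector flags inputs that might need correction. The names `approx_add` and `may_be_wrong`, and the 16-bit width and 4-bit window, are assumptions for illustration.

```python
def approx_add(a, b, width=16, window=4):
    """Add two unsigned integers with a carry chain limited to `window` bits.

    The carry into bit i is computed only from bits [i-window, i-1],
    assuming no carry enters that window. Long carry chains are
    therefore mispredicted.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    result = 0
    for i in range(width):
        lo = max(0, i - window)
        span = i - lo                      # number of bits in the window
        seg_mask = (1 << span) - 1
        seg_a = (a >> lo) & seg_mask
        seg_b = (b >> lo) & seg_mask
        carry = (seg_a + seg_b) >> span    # carry out of the window into bit i
        bit = ((a >> i) ^ (b >> i) ^ carry) & 1
        result |= bit << i
    return result


def may_be_wrong(a, b, width=16, window=4):
    """Conservative detector: True if `window` consecutive bit positions
    all propagate (a XOR b == 1), the only case where the limited carry
    chain can differ from the exact sum."""
    propagate = (a ^ b) & ((1 << width) - 1)
    run = 0
    for i in range(width):
        run = run + 1 if (propagate >> i) & 1 else 0
        if run >= window:
            return True
    return False


if __name__ == "__main__":
    for x, y in [(0x1234, 0x1111), (0x00FF, 0x0001)]:
        exact = (x + y) & 0xFFFF
        print(hex(x), hex(y), hex(approx_add(x, y)), hex(exact), may_be_wrong(x, y))
```

In hardware, the short carry window is what would shorten the critical path (allowing lower voltage), and the detector would trigger the occasional slower exact addition. Whether that trade wins depends on how often long carry chains actually occur in the workload.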
Approximation can be used for predictive functions (e.g., branch prediction and motion estimation [in video compression]) and for approximate results (e.g., output to humans).
Approximation can also apply to storage. Not only might the accuracy of least significant bits be sacrificed but also predictive and caching structures could lose accuracy. Obviously in the predictive case, the loss of accuracy must not hurt performance so much that the power savings in predictor storage are more than lost by the extra power from misspeculation. Analog storage and computation have been proposed for some uses (like perceptron branch predictors)--mainly for performance reasons, but such techniques may also have energy efficiency benefits.
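The break-even condition for a less accurate predictor can be written down in a few lines. This is a rough sketch with a deliberately simple model: a fixed energy cost per misprediction and a fixed storage saving per predicted branch. All names and energy units are hypothetical.

```python
def net_energy_change(storage_saving_per_branch, extra_mispredict_rate, flush_energy):
    """Energy change per branch from shrinking the predictor.

    Negative means a net saving. `flush_energy` is the assumed energy
    wasted by one misspeculation, in the same arbitrary units as the
    storage saving.
    """
    return extra_mispredict_rate * flush_energy - storage_saving_per_branch


def break_even_mispredict_increase(storage_saving_per_branch, flush_energy):
    """Largest increase in misprediction rate that still saves energy."""
    return storage_saving_per_branch / flush_energy


if __name__ == "__main__":
    # Hypothetical numbers: save 2 units per branch, a flush costs 100 units.
    print(break_even_mispredict_increase(2.0, 100.0))
    print(net_energy_change(2.0, 0.01, 100.0))
```

This ignores the performance side (longer run time also means more leakage energy), so the real break-even point would be tighter than this model suggests.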
Improved prediction, early misspeculation detection, and pre-determination (applied to branches, cache way selection, prefetching, and other areas) can increase energy efficiency by reducing unnecessary work.
Along the lines of interconnects, greater integration and appropriate placement of components can reduce the cost of communication. Processor-in-memory (e.g., Intelligent RAM and, more recently, Venray Technologies' proposals) and processor-near-memory (e.g., on a DIMM or in the logic chip of something like Micron's Hybrid Memory Cube) are usually proposed for improved performance but can also improve energy efficiency.
Even reducing on-chip communication can have an impact on energy efficiency. Placing communicating components close together can not only reduce the energy per communication but also reduce the latency of communication (which may reduce the duration of computation--facilitating a longer period of deeper sleep) and the unpredictability of communication (which may allow tighter scheduling of activity when chip-internal network congestion issues are not a concern--knowledge is power, or at least facilitates power-saving optimizations).
Clever and limited use of clocking can also reduce power. I think one of the grid layout many-core vendors uses a simple left-to-right (rather than tree) clocking because clock skew only matters locally. Asynchronous design at various granularities has been considered for power saving. Although variation may limit its application, there may still be some place for wave pipelining even in synchronous designs.
To reduce chip power, I have a few suggestions:
- Apply a top-level H-tree structure with high-drive clock buffers to minimize buffer usage
- Use the intermediate metal layers for clock tree routing
- Redesign a special DFF to reduce clock toggle power; this may require additional P&R support
- Apply the gated-clock latch approach, which normally replaces about 40%~60% of the clock buffers
- Use the correct DFM guidelines to reduce design margin, so that redundant logic can be eliminated