# Reducing Power and Area in Cell-Based Design

Power consumption is a major problem for emerging complex designs, particularly those with tight power budgets. The problem will grow as process technologies shrink into the ultra-deep-submicron range, increasing transistor counts and current densities.

In a typical synthesis and place-and-route (SPR) flow, a static set of library cells is used to map a given design into its final physical implementation. Both the number of cells and the amount of optimization that can be done are limited. This approach is considered less efficient than arbitrary full-custom design. In the past, the limitations of static libraries were necessary because SPR tools couldn't efficiently automate design at the transistor level. As a result, cell generation hasn't been included in the EDA design flow and has been largely done by hand.

Furthermore, with the advent of third-party library companies, library creation is often out-sourced.

However, bringing library creation back into the design flow will remove the restrictions of a pre-determined set of library elements and provide several advantages: performance improvement of 10 to 15 percent, time-to-market reduction due to reduced design iterations, area reductions of up to 25 percent, and power reduction of 25 percent or more. The approach works with any circuit or logic style.

While standard-cell libraries generally have a wide variety of logical functions, the primary issue with static libraries is that there is a limited number of discrete transistor sizes for any given logical function. For a typical 300-cell library, each logical function (for example, a 4-input OR gate) will have from 1 to 10 electrical variants. However, there are millions of possible variations of transistor sizes, producing radically different timing behavior. For example, for a given logic function, the drive-strength of the cell can be varied, as can the beta-ratio (the ratio between the p-transistor widths and the n-transistor widths). Cells that comprise more than one stage of logic can also be varied by altering the ratio of the drive strengths between stages. There are easily hundreds or thousands of potentially useful electrical variants per logical cell.

The optimum choice of transistor sizes depends on the context of the cell, including the load of the cell and the drive-strength of the previous stage. Larger transistors drive their load faster, but they load and slow the previous stage and use more power. Having a limited set of choices produces a design that has longer cycle times, uses more power, and has more area than a fully optimized design.

**Power consumption**

The power consumption of a static CMOS block can be approximated by CV²f, where C is the total capacitance of the block, V is the voltage, and f is the frequency of the design. Assuming the voltage and frequency are fixed for a given design, the power consumption is proportional to the total capacitance. There are additional factors that contribute to power consumption, such as power-to-ground shorting during switching and switching activity. However, if the total transistor width and interconnect capacitance are reduced, these factors will decrease as well.

There are four major benefits of using arbitrary transistor sizes, rather than a fixed library of elements:

1) Increased performance, by minimizing the delay through critical paths.

2) Improved timing and design closure, by eliminating cases where no suitable library element exists. Such cases leave timing paths grossly over budget and force manual rework of the design.

3) Reduced power, by reducing the total transistor size of the design.

4) Reduced area, by reducing the total transistor size of the design. This in turn reduces the interconnect capacitance, which further reduces power.

The effects of electrical variants on timing, power, and area will be demonstrated by an example.

Figure 1 shows an example of a timing path through a standard-cell block. Suppose that the initial inverter drive-strength is fixed at 4x (for example, the output driver of a flip-flop). Also, suppose that the input capacitance of the gates being driven totals 48 units, where 1 unit is the input capacitance of a 1x inverter.

Assuming P-transistors have half the drive strength of N-transistors and that the 4-input NOR gate has equal rise and fall times, its input capacitance is 3 times the capacitance of a similar drive-strength inverter (logical effort of the NOR gate is 3.0). As interconnect capacitance is becoming a significant part of delay and power, assume each wire has a capacitance of 4 units. We will attempt to find the best solution for the drive-strengths (m and n) of the two gates. The delay through the path is therefore expressed as:

`T = (3.0 * m + 4) / 4 + (3.0 * n + 4) / m + (48 + 4) / n`

For example, if the first gate is an 8x NOR gate and the second gate is a 16x NOR gate, the total delay would be:

T = (3.0 * 8 + 4) / 4 + (3.0 * 16 + 4) / 8 + 52 / 16

= 7.0 + 6.5 + 3.25

= 16.75

Ignoring the fixed capacitance of the previous and following stages, the input capacitance of this path is as follows:

C = 4.0 + 3.0 * m + 4.0 + 3.0 * n units

The power consumption is proportional to the total capacitance and can be expressed in terms of units, where 1 unit is the power consumption of a 1x inverter.
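
The two expressions above are easy to check numerically. The following is a minimal sketch that transcribes them directly. The function and parameter names are illustrative and do not come from any tool mentioned in this article.

```python
def path_delay(m, n, driver=4.0, wire=4.0, load=48.0, effort=3.0):
    """Delay of the example path: 4x driver -> NOR (m) -> NOR (n) -> 48-unit load."""
    stage1 = (effort * m + wire) / driver  # the driver sees the first NOR plus a wire
    stage2 = (effort * n + wire) / m       # the first NOR sees the second NOR plus a wire
    stage3 = (load + wire) / n             # the second NOR sees the final load plus a wire
    return stage1 + stage2 + stage3


def path_capacitance(m, n, wire=4.0, effort=3.0):
    """Capacitance of the path, ignoring the fixed previous and following stages."""
    return wire + effort * m + wire + effort * n


if __name__ == "__main__":
    print(path_delay(8, 16))        # 16.75
    print(path_capacitance(8, 16))  # 80.0
```

Because power is taken as proportional to capacitance, `path_capacitance` doubles as the power estimate in the units used here.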

As the drive-strength of the gate increases, its delay decreases, but the delay of the previous stage increases. In a typical library, there are several drive strengths for each cell. Assume that the drive strengths available are typical: 1x, 2x, 4x, 8x, 16x, and so forth. The optimum timing solution for such a library is:

m = 8x, n = 16x

This solution generates a delay of 16.75 units, with a power consumption of 80.0 units (see Figure 2).

If the target cycle time is 16.75, it's unlikely that such a path using the fixed library elements would attract any attention; timing is met and the drive-strengths taper up nicely. The second NOR gate uses a fair amount of power, but no other solution in the library meets timing, so that power cost can't be avoided.
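
The fixed-library result can be confirmed by brute force. The sketch below tries every pair of drive strengths in a power-of-two library. Stopping the library at 64x is an assumption made for illustration, since the text only says "and so forth", and the function names are hypothetical.

```python
from itertools import product

# Assumed extent of the power-of-two library ("1x, 2x, 4x, 8x, 16x, and so forth").
LIBRARY = [1, 2, 4, 8, 16, 32, 64]


def path_delay(m, n):
    """Delay of the example path for NOR drive strengths m and n."""
    return (3.0 * m + 4) / 4 + (3.0 * n + 4) / m + (48 + 4) / n


def fastest_pair(sizes):
    """Return the (m, n) pair with the lowest path delay."""
    return min(product(sizes, repeat=2), key=lambda mn: path_delay(*mn))


def pairs_meeting(sizes, target):
    """Return every (m, n) pair whose path delay does not exceed the target."""
    return [mn for mn in product(sizes, repeat=2) if path_delay(*mn) <= target]


if __name__ == "__main__":
    print(fastest_pair(LIBRARY))          # (8, 16)
    print(pairs_meeting(LIBRARY, 16.75))  # [(8, 16)]
```

Under these assumptions, only one pair in the library meets a 16.75-unit cycle time, which is why the 80-unit power cost looks unavoidable.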

One of the advantages of creating derivative cells is improved timing and design closure. Consider the case where the standard-cell library contains only a 1x 4-input NOR gate. In such a case, m = 1x, n = 1x (the only solution possible), which gives a delay of 60.75 units, roughly 3.6 times the required cycle time. Clearly, this is unacceptable and the design will need to be reworked.

Inserting buffers will sometimes alleviate the problem, but the buffers create additional delay, resulting in timing that still fails. It doesn't help to add to the library higher drive gates that consist of a smaller drive gate followed by a buffer. In practice, this is the same as with buffer insertion, with the added constraint of not being able to vary the drive-strengths independently. The lack of single-stage, higher drive-strength variants for many gates, which are expected to be used infrequently, is a common problem with many standard-cell libraries.

Automatically building the derivative cell eliminates this problem and makes design closure easier. If the library elements can be arbitrarily sized, the drive-strengths that give the optimum delay are as follows:

m = 7.0x, n = 11.0x

This gives a delay of 16.3 units, and a power consumption of 62.0 units.
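
One way to locate this unconstrained optimum is to set each partial derivative of the delay expression to zero and alternate between the two conditions until the sizes settle. This is a minimal sketch of that idea, offered as an illustration and not as the method used by any particular tool. The function names are hypothetical.

```python
import math


def path_delay(m, n):
    """Delay of the example path for NOR drive strengths m and n."""
    return (3.0 * m + 4) / 4 + (3.0 * n + 4) / m + (48 + 4) / n


def fastest_sizes(m=8.0, n=16.0, iterations=50):
    """Alternate the two stationarity conditions of the delay expression."""
    for _ in range(iterations):
        n = math.sqrt(52.0 * m / 3.0)               # dT/dn = 3/m - 52/n^2 = 0
        m = math.sqrt(4.0 * (3.0 * n + 4.0) / 3.0)  # dT/dm = 3/4 - (3n + 4)/m^2 = 0
    return m, n


if __name__ == "__main__":
    m, n = fastest_sizes()
    print(round(m, 1), round(n, 1), round(path_delay(m, n), 1))
```

Starting from the library solution, the iteration settles close to the sizes quoted above, with a delay that rounds to 16.3 units.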

Arbitrary transistor sizing has several benefits here. The resulting solution is faster. The total transistor size is 25 percent less (54 units versus 72), resulting in 22 percent less power (including the interconnect capacitance), and should result in up to 25 percent less area. One alternative is to take advantage of the optimum sizing by reducing the overall cycle time of the design. However, if the target cycle time is 16.75 and is fixed in some other part of the design, there is no advantage to reducing the delay through this example path.

If we use the 16.75 cycle time, we can further reduce the power consumption. The optimum solution is then:

`m = 5.3x, n = 8.1x`

This gives a delay of 16.73 units, with a power consumption of 48.2 units, a staggering 40 percent reduction in power. Furthermore, the transistor sizes have been reduced by 44 percent, allowing up to a 44 percent reduction in area, depending on the layout style of the standard-cell library and the degree of routing congestion. This will reduce the capacitance due to routing, which will further reduce power (see Figure 3).

For paths that aren't in the critical timing path of the design, even more power reduction is possible. For example, if the cycle time is 16.9 units, instead of 16.75 units, the optimum power solution is:

m = 5.0x, n = 7.8x

This gives a power consumption of 46.4, a 42 percent reduction in power.

Even if we restrict the granularity of the solution, such that the number of electrical variations is restricted, nearly all of the benefits can be realized. For example, restricting the drive strengths to integers (for example: 1x, 2x, 3x, 4x, 5x) results in the following solution:

m = 5x, n = 9x

This gives a power consumption of 50.0 units, an improvement of 37 percent instead of 40 percent.
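
The constrained results above can be reproduced with an exhaustive search: try every size on a grid, discard pairs that miss the cycle time, and keep the ones with the smallest total size. This is a sketch under the article's delay and capacitance expressions. The grid resolution, the 16x upper bound, and the function names are assumptions made for illustration.

```python
def path_delay(m, n):
    """Delay of the example path for NOR drive strengths m and n."""
    return (3.0 * m + 4) / 4 + (3.0 * n + 4) / m + (48 + 4) / n


def min_power_sizings(target, steps_per_unit, max_size=16):
    """Return (power, pairs): the lowest power meeting target and every pair achieving it.

    Sizes are multiples of 1/steps_per_unit, so 10 gives a 0.1x grid and 1 gives integers.
    """
    best_total, winners = None, []
    limit = max_size * steps_per_unit
    for i in range(1, limit + 1):
        for j in range(1, limit + 1):
            m, n = i / steps_per_unit, j / steps_per_unit
            if path_delay(m, n) > target:
                continue  # misses the cycle time
            total = i + j  # compare sizes as integers so that ties are exact
            if best_total is None or total < best_total:
                best_total, winners = total, [(m, n)]
            elif total == best_total:
                winners.append((m, n))
    power = 8.0 + 3.0 * best_total / steps_per_unit  # C = 4 + 3m + 4 + 3n
    return power, winners


if __name__ == "__main__":
    for target, steps in [(16.75, 10), (16.9, 10), (16.75, 1)]:
        power, pairs = min_power_sizings(target, steps)
        print(target, steps, round(power, 1), pairs)
```

On each grid, several pairs tie for the minimum power. The pairs quoted above are among them, so the power figures of 48.2, 46.4, and 50.0 units do not depend on which tied pair is chosen.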

The amount of power reduction that can be expected on average will depend on the design and the static library used. Suppose that the SPR tool picked the lowest-power cells that meet timing. For each cell, there exists a replacement cell that matches the timing of the worst-case path through that stage, but has minimum power consumption. This replacement cell can be sized anywhere from slightly larger than the next smaller cell in the library, up to the size of the static library cell used. For example, the 16x cell in the above example can be replaced with an 8.7x cell and still meet timing:

m = 8.0x, n = 8.7x

This gives a delay of 16.75 units. Repeating this process for the other stage yields:

m = 5.0x, n = 8.7x

This gives a delay of 16.75 units, with a power consumption of 49.1 units. This isn't the true optimum, because it doesn't consider the interactions between the stages.
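
The stage-by-stage replacement just described can be sketched as follows. Each cell is shrunk in turn to the smallest size, between the next smaller library cell and its current size, that still meets the cycle time. The 0.1x step, the halving rule for the next smaller cell, and the function names are assumptions made for illustration.

```python
def path_delay(m, n):
    """Delay of the example path for NOR drive strengths m and n."""
    return (3.0 * m + 4) / 4 + (3.0 * n + 4) / m + (48 + 4) / n


def smallest_feasible(delay_of, lower, upper, target, steps_per_unit=10):
    """Smallest size in (lower, upper] on the grid whose delay meets the target."""
    first = int(lower * steps_per_unit) + 1
    last = int(upper * steps_per_unit)
    for k in range(first, last + 1):
        size = k / steps_per_unit
        if delay_of(size) <= target:
            return size
    return upper  # nothing smaller works, so keep the library cell


def shrink_per_stage(m, n, target):
    """Shrink the last stage first, then the first stage, one cell at a time."""
    # The next smaller library cell is assumed to be half the size (2x steps).
    n = smallest_feasible(lambda size: path_delay(m, size), n / 2, n, target)
    m = smallest_feasible(lambda size: path_delay(size, n), m / 2, m, target)
    return m, n


if __name__ == "__main__":
    m, n = shrink_per_stage(8.0, 16.0, 16.75)
    print(m, n, round(8.0 + 3.0 * (m + n), 1))  # 5.0 8.7 49.1
```

Each stage is sized with the other held fixed, which is why the result is slightly worse than a joint optimization of both stages.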

Using this simplistic algorithm with a library whose drive-strengths vary by 2x, the process replaces each cell with one that is up to, but not quite, 50 percent smaller. Because of the nature of the delay curve, it's statistically more likely that the replacement cell is closer to the 50 percent smaller cell than to the original (0 percent smaller) cell, but a 25 percent reduction in power and area is a reasonably conservative estimate. This result can be improved by sorting the cells by power consumption. More elaborate algorithms will yield significantly higher improvements.

This entire discussion has so far focused on drive-strength, which is only one of several methods of providing electrical variants. Other methods include varying the P-transistor to N-transistor width (the beta ratio) and varying the transistor widths to either skew or balance the delays through different inputs of a gate. Each of these alternate methods provides additional timing, power, and area reduction when using arbitrary transistor sizes. For example, if the critical path in the previous example was only for the final load falling from logic high to logic low, then the P-transistors of the last stage can be reduced. This will improve timing and reduce power, because the P-transistors of that 4-input NOR gate represent 89 percent of the total transistor width. In such a case, the total power consumption and area of the last cell can be reduced by up to 80 percent or so, reducing the total power by another 40 percent (64 percent total reduction). Attempting to include all possibilities in a static library would result in an unacceptable explosion of the number of cells in the library. This can be avoided by generating only the cells that are actually needed for the given context.
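
The figures in this paragraph follow from simple width arithmetic. The sketch below assumes the sizing rule stated earlier (P-transistors have half the drive strength of N-transistors, and rise and fall times are equal) and takes the 80 percent cut in the last cell as given. The function names are illustrative.

```python
def nor_widths(inputs, pn_ratio=2.0):
    """Per-transistor widths of a NOR gate matched to a unit inverter (N=1, P=pn_ratio)."""
    n_each = 1.0                # parallel pull-downs: each one matches the inverter
    p_each = pn_ratio * inputs  # series pull-ups: each must be `inputs` times wider
    return n_each, p_each


def logical_effort(inputs, pn_ratio=2.0):
    """Input capacitance of the NOR relative to an inverter of equal drive."""
    n_each, p_each = nor_widths(inputs, pn_ratio)
    return (n_each + p_each) / (1.0 + pn_ratio)


def p_width_share(inputs, pn_ratio=2.0):
    """Fraction of the total transistor width taken by the P-transistors."""
    n_each, p_each = nor_widths(inputs, pn_ratio)
    return p_each / (n_each + p_each)  # the gate has equal numbers of P and N devices


def skewed_savings(last_stage, sized_total, library_total, cut=0.80):
    """Extra and overall power reduction when the last cell shrinks by `cut`."""
    saved = cut * last_stage
    extra = saved / sized_total
    overall = 1.0 - (sized_total - saved) / library_total
    return extra, overall


if __name__ == "__main__":
    print(logical_effort(4))                   # 3.0
    print(round(100 * p_width_share(4)))       # 89
    extra, overall = skewed_savings(3.0 * 8.1, 48.2, 80.0)
    print(round(100 * extra), round(100 * overall))  # 40 64
```

Here the last stage is the 8.1x NOR from the 48.2-unit solution, and the 80-unit total is the fixed-library solution it is compared against.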

The best approach to extracting the additional timing, power, and area gains is to individually size every transistor in the design. However, this presents a number of challenges in SPR, as well as in timing and power analysis. Prolific, Inc.'s automated library generation tool takes a cell-based approach that minimizes the impact on the standard flow. The tool is in use building libraries at 0.10 µm and below and produces layouts that are OPC and PSM ready (see Figure 4).

Depending on the capabilities of the tools used to implement the flow, synthesis creates a design using the base library, which is essentially a typical static library. After the place and route phase, buffer-sizing or in-place optimization is done to insert cells that are appropriately sized for the given routing context. This flow isn't restricted to using the base library for buffer-sizing. Instead, this flow uses a virtual library that represents all possible variations of drive-strength, beta-ratio, taper factors, and so forth. Depending on the capabilities of the synthesis, place and route tools, this can be represented as an abstraction, or simply as a very large set of cells. Cells are chosen based on reducing area and power, while meeting the cycle time. Those cells that aren't in the base library are automatically generated by the ProGenesis standard-cell layout generation tool, and re-characterized by whatever characterization method was used by the base library.

The resulting library makes it simpler for the SPR tools to find a high-quality solution. Furthermore, by limiting the number and granularity of new cells created, the total number of additional cells created can be kept reasonable. Including all possible electrical variants in a base library would require thousands or even millions of cells. In our experience, by only picking those cells that are needed to meet timing or power requirements, a typical 300 cell library would only need to be augmented by 50 to 150 cells to get the majority of the benefits.

By eliminating the restrictions of a fixed library, an automated approach allows for full-custom transistor sizing. These benefits accrue regardless of the circuit or logic style used to eliminate other sources of power consumption. Once SPR tools are able to optimize and characterize at the transistor level, a base library will only consist of a set of logic functions, unsized schematics, and a layout architecture. All sizing and optimization will be done by the fully integrated flow, which will include integrated layout generation, all the way to the transistor level. Cell generation is returning to the mainstream EDA flow.

Paul de Dood is president and CEO of Prolific, Inc. (Newark, CA). Previously, he managed the library and full-chip integration group for the Sun Microsystems' UltraSparc and UltraSparc-II product lines.