United Business Media EE Times



Global Distribution: Clocks and Power

The challenges of deep submicron design are intimately tied to the clock and power distribution on the IC.

by R.T. "Tets" Maniwa


When ASICs are built on a deep submicron process with over 5 million gates and a clock frequency greater than 250 MHz, the designer must consider many details in the clock and power circuitry. Normally these circuit elements are not given much thought; power simply comes from the rails drawn across the top and bottom of the schematic page, and the clock has ideal characteristics: a square wave running at the specified frequency.

In reality, many other effects need to be evaluated in the clocking and power distribution areas of the design when the total chip power consumption will be in the range of tens to over 100 W and the clock power can be as much as half of the total power consumption. The clocking scheme cannot be assumed to be a clean, uniform signal network. It might be a complicated distribution structure with architectures ranging from a large distributed clock buffer for the high-performance chips to a complex system with multiple derived sub-clocks to help manage power consumption. The interaction between clocks and power consumption may require the ability to generate clock signals which can be stopped in the inactive sections to minimize power consumption. The power and ground system will take up about half of the available package pins to be able to handle the tens of amperes of average current consumed by the IC.

Many side effects of the basic IC process will have to be addressed to make the chip meet all the requirements of speed, power, and silicon area. If the supply voltage is reduced to take advantage of the power savings available at a lower supply voltage, noise margins and leakage currents may become significant problems. The various secondary effects within the system, like voltage drops on the supply lines, ground bounce, crosstalk, and glitches, may exacerbate the problems by adding enough noise to the system to decrease the clock slew rates, lengthening the clock rise and fall times. This further exacerbates the power consumption problems by making the big clock buffers stay in the high current linear portions of their transfer curves for greater amounts of each clock cycle. In addition, the clock network has many signals switching simultaneously, adding large power surges and a very large potential for crosstalk and interference to the clock and power distribution sub-systems.

What are the problems?

The most basic problems facing designers are managing the skew in the clock system (which entails getting the clock everywhere at the same time), supplying sufficient clock drive to operate all of the clocked elements in the system, and getting operating power to all active circuit elements. For single-frequency clock systems, the tradeoffs are speed, power consumption, and area. The problems with skew and the process of balancing the delays across the chip occur in parallel with the increases in density and complexity.

Tom Katsioulas, marketing director of the IC design group of Cadence Design Systems (San Jose, CA), notes that the timing, density, clock, and power are intricately related in the following ways:

  • Downsizing cells to reduce power may degrade timing.

  • Upsizing cells to improve timing degrades power and area.

  • Timing-driven placement may increase wire length and power.

  • Clock generation before placement yields high skew.

  • Clock creation without wire delays affects power and delay.

  • Clock creation after placement adds area and affects density.

  • Large load count yields high clock delay and affects timing.

  • Post placement ECOs may require clock redesign.

Some of the clocking problems of complex, high speed circuits are associated with the physics of the devices and interconnections. At 250 MHz, the clock period is only 4 ns. Accounting for clock skew and set-up and hold times leaves very little of that period for buffer and propagation delays. An example of a successful high speed, single-frequency clocking system is the DEC Alpha microprocessor. Its latest version runs at 300 MHz, draws a total average chip power of 50 W, and has a clocking system which draws an average current of seven amps from a total chip supply of about 15 to 16 amps. DEC's efforts to develop a minimum skew clock system resulted in a single large clock buffer block, physically distributed around the chip.
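The arithmetic behind these figures can be sketched in a few lines of Python. The 250 MHz period and the Alpha current figures come from the text; the skew, set-up, and clock-to-output values are hypothetical placeholders chosen only to illustrate how quickly a 4 ns cycle is used up.

```python
def clock_period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def remaining_budget_ns(freq_mhz, skew_ns, setup_ns, clk_to_q_ns):
    """Time left in one cycle for buffer, logic, and wire delay."""
    return clock_period_ns(freq_mhz) - skew_ns - setup_ns - clk_to_q_ns


def clock_current_share(clock_amps, total_amps):
    """Fraction of the total supply current drawn by the clock system."""
    return clock_amps / total_amps


period = clock_period_ns(250.0)  # 4 ns, as stated in the text

# Hypothetical overheads (not from the article): 0.3 ns skew,
# 0.4 ns set-up time, 0.5 ns clock-to-output delay.
budget = remaining_budget_ns(250.0, skew_ns=0.3, setup_ns=0.4, clk_to_q_ns=0.5)

# Alpha figures from the text: 7 A of clock current out of 15 to 16 A total.
share_low = clock_current_share(7.0, 16.0)
share_high = clock_current_share(7.0, 15.0)

print(f"period = {period:.2f} ns, remaining budget = {budget:.2f} ns")
print(f"clock share of supply current: {share_low:.0%} to {share_high:.0%}")
```

With these placeholder overheads, well over a quarter of the cycle is gone before any logic is evaluated, and the Alpha clock system accounts for a little under half of the supply current.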

"In ideal cases," says Bill Bowhill, a consultant engineer and the implementation leader for Alpha products at Digital Semiconductor (Marlboro, MA), "you want to drive the whole chip from a single clock distributed across the whole chip. The clock needs to have clean edges and low skew. DEC developed the clock as an 'H' configuration with a central first buffer, feeding multiple stages of buffering to eventually drive the clock buffer strings in the legs of the 'H' to achieve the miminum skew. They are driving for maximum performance from the chip."

According to Dr. Sanjay Dhar, project manager, LSIM Power Analyst, Mentor Graphics (Wilsonville, OR), the designer needs to be concerned about power dissipation, noise, and reliability in addition to skew and delay when distributing the clock signal across the chip. The current methodology is to make sure that the layout meets the delay and skew budgets. This approach is running into problems because the size and speed of today's chips require very large drivers to drive large RC networks. The use of large buffers addresses the major issue of skew and delay, but may be unacceptable for power consumption and reliability requirements. The large clock buffers lead to high power consumption, often as large as 30 to 50 percent of the total chip power consumption, as well as noise problems due to the large current spikes generated when the buffers switch. An alternative approach is to distribute the buffers into the clock tree. This reduces the power consumption by requiring buffers of smaller size and also helps the reliability aspects by reducing the size of the current spike.

The clock system and the wide word datapaths all switching at the same time increase the possibility for glitches and higher peak switching currents. They put additional loads on the power delivery systems. The resulting datapath skews will require close scrutiny of the datapath localization and grouping, as well as careful analysis of pipeline lengths. The careful analysis of the signal paths relative to the clocks is critical to making a working integrated circuit.

"In synthesized circuits," says Charlie Xiaoli Huang, a senior architect at Epic Design Technology (Santa Clara, CA), "the software tries to make all paths the same length. This makes all data paths complete at the same time, which generates glitches and power surges at the end of each clock cycle. This effect gets worse at higher speeds."

To develop a working clock distribution topology, EDA and ASIC vendors have many tools for implementing a balanced clock tree. This technology seeks the physical location of the delay path's electrical middle. The nature of this optimization work has been mainly manual adjustment of the generated clock structure. The path delay associated with the clock lines and buffer stages needs to be adjusted so each buffer sees the same path lengths and loads. The number and strengths of the buffers can create a long first cycle latency delay.
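One common way to express the search for a wire's "electrical middle" is a zero-skew merge under the Elmore delay model: given two subtrees joined by a wire, find the tap point at which the delays to both sides are equal. The sketch below illustrates that idea in Python; it is not a description of any particular vendor's tool, and the wire and load values are hypothetical.

```python
def elmore_delay(r_wire, c_wire, subtree_cap, subtree_delay):
    """Elmore delay through a wire segment into a subtree (SI units)."""
    return subtree_delay + r_wire * (c_wire / 2.0 + subtree_cap)


def balance_point(r_per_um, c_per_um, length_um, t1, cap1, t2, cap2):
    """Fraction of the wire, measured from subtree 1, where both delays match.

    A result outside 0..1 means the wire alone cannot balance the two
    subtrees and extra wire or buffering would be needed.
    """
    r_total = r_per_um * length_um
    c_total = c_per_um * length_um
    numerator = (t2 - t1) + r_total * (cap2 + c_total / 2.0)
    return numerator / (r_total * (c_total + cap1 + cap2))


# Hypothetical values: 0.1 ohm/um, 0.2 fF/um, 2000 um of wire,
# subtree loads of 50 fF and 150 fF with no internal delay.
R, C, LENGTH = 0.1, 0.2e-15, 2000.0
x = balance_point(R, C, LENGTH, t1=0.0, cap1=50e-15, t2=0.0, cap2=150e-15)

d1 = elmore_delay(R * LENGTH * x, C * LENGTH * x, 50e-15, 0.0)
d2 = elmore_delay(R * LENGTH * (1 - x), C * LENGTH * (1 - x), 150e-15, 0.0)
print(f"tap at {x:.3f} of the wire; delays {d1 * 1e12:.2f} ps and {d2 * 1e12:.2f} ps")
```

The tap lands closer to the heavier load, which is why the electrical middle generally differs from the geometric middle once the loads are unequal.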

"Clock tree synthesis at the physical design state allows the extraction of parameters which can be back annotated to the synthesis netlist," says Erach Desai, director of product marketing at Arcsys (Sunnyvale, CA). "The synthesizer users do not want to 'corrupt' their 'golden' netlist, so they cannot use the back annotated information. There is a need to review data flow design to see if a clock flow based design and layout might be a better methodology. The problems of data skew will require all localized inputs and outputs for the various datapaths."

Figure 1. The clock network is divided into equal delay paths in a clock tree.

T.C. Lee, vice president of engineering at SVR (Mountain View, CA), suggests that the buffers initially generated for the clocks are inaccurate. At the beginning, the designer doesn't know the loading due to the interconnects, only the loading from the gates. Physical information is required to change and optimize the buffers.

"The designer needs accurate calculations of IR drop and current densities. In addition, they now need to also look at the signal activity with SPICE or Powermill. These are all post layout tools. The key to a good clock and power network is signal analysis cross-coupled with the place and route tools" (see "Issues in deep submicron timing verification").

Another issue is the current density on power and ground lines. This is becoming more critical due to concerns about local current density exceeding the maximum safe level and causing electromigration. Lead wires can fuse if current density exceeds safe operating limits. This is exacerbated on the chip by the thinner and narrower lines for the clock and power distribution nets. Even with extra thick and extra wide lines, the available cross-sectional area is much less than a square mil and susceptible to electromigration (the displacement of the metal due to the current density and associated power dissipation). Imbalances in the current flows can also create ground loops from the multiple ground points. Figure 2 shows how the ground loops are generated. The fast switching signals also create noise and ground offset because the capacitance and inductance in the various supply lines prevent an instantaneous change in voltage across any resistance, such as the metal traces. Unfortunately, the packaging exacerbates the problem by partially isolating the IC from the external decoupling capacitors. The same obstacles to clean power also apply to the power rail.
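A rough current-density check makes the scale of the problem concrete. The Python sketch below computes the current density in a metal line and compares its cross-section with a square mil. The line dimensions, the current, and the 1 mA per square micron limit are hypothetical illustration values, not process data.

```python
UM_PER_MIL = 25.4  # one mil is 0.001 inch


def current_density(current_ma, width_um, thickness_um):
    """Average current density in mA per square micron."""
    return current_ma / (width_um * thickness_um)


def area_square_mils(width_um, thickness_um):
    """Cross-sectional area expressed in square mils."""
    return (width_um * thickness_um) / (UM_PER_MIL ** 2)


def min_width_um(current_ma, thickness_um, limit_ma_per_um2):
    """Narrowest line that keeps the density at or below the limit."""
    return current_ma / (limit_ma_per_um2 * thickness_um)


# Hypothetical "extra wide" power line: 20 um wide, 1 um thick, 30 mA.
J_LIMIT = 1.0  # assumed electromigration limit, mA/um^2 (placeholder)
j = current_density(30.0, 20.0, 1.0)
area = area_square_mils(20.0, 1.0)
needed = min_width_um(30.0, 1.0, J_LIMIT)

print(f"density = {j:.2f} mA/um^2 (assumed limit {J_LIMIT})")
print(f"cross-section = {area:.3f} square mil")
print(f"width needed at the assumed limit = {needed:.0f} um")
```

Even this wide line has a cross-section of only a few hundredths of a square mil, so modest currents can approach the assumed limit.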


Figure 2. The voltage from A to D must equal the voltage from A to B to C to D. Excessive voltage drop across path A-D causes additional currents through path A-B-C-D to make the voltages match.
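The relationship in Figure 2 is Kirchhoff's voltage law applied to two parallel ground paths. As a minimal sketch, assuming both paths can be treated as purely resistive (the segment resistances and the 1 A return current are hypothetical), the current split can be computed as follows.

```python
def split_ground_current(total_amps, r_direct, r_loop_segments):
    """Split a return current between path A-D and path A-B-C-D.

    Both paths connect the same two nodes, so the voltage across each
    must be equal; the current divides inversely with resistance.
    """
    r_loop = sum(r_loop_segments)
    i_direct = total_amps * r_loop / (r_direct + r_loop)
    i_loop = total_amps - i_direct
    v_ad = i_direct * r_direct
    return i_direct, i_loop, v_ad


# Hypothetical resistances in ohms: A-D = 0.1; A-B, B-C, C-D = 0.1, 0.2, 0.1.
i_direct, i_loop, v_ad = split_ground_current(1.0, 0.1, [0.1, 0.2, 0.1])
print(f"A-D carries {i_direct:.2f} A, A-B-C-D carries {i_loop:.2f} A")
print(f"voltage from A to D = {v_ad * 1000:.0f} mV on either path")
```

Raising the resistance of the direct path pushes more current around the loop, which is the additional loop current the caption describes.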

Reducing clock power consumption

Clocking schemes and power distribution are going to be affected by the system requirements. The areas for compromise are power, area, and performance. If one of the areas is defined as much more critical than the others, it will drive the design. For example, if performance is the key parameter, a single point clock with sufficient buffers to drive all the circuitry would be the best choice. The tradeoff would be in a clock system which draws up to half of the total system current. An intermediate solution might be a multiply driven clock spine (see "Deep submicron design clocking techniques").

If all of the circuitry did not need to run at the same speed, derived multiple clocks could be generated from the master reference clock. The sections will get clocks appropriate for their functions. Why have a 250 MHz clock for a serial I/O channel controller? This could save some more power since the frequency term in the power equation has now been reduced for much of the on-chip circuitry.
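The "frequency term in the power equation" refers to the standard CMOS dynamic power relation, P = a · C · V² · f, where a is the switching activity, C the switched capacitance, V the supply voltage, and f the clock frequency. The short Python sketch below applies it to a block clocked at full speed and at a derived slower clock. The 200 pF capacitance, 3.3 V supply, and 25 MHz derived clock are hypothetical values.

```python
def dynamic_power_w(cap_farads, vdd_volts, freq_hz, activity=1.0):
    """CMOS dynamic power: activity * C * V^2 * f."""
    return activity * cap_farads * vdd_volts ** 2 * freq_hz


# Hypothetical serial I/O controller: 200 pF of clocked capacitance at 3.3 V.
CAP = 200e-12
VDD = 3.3

p_fast = dynamic_power_w(CAP, VDD, 250e6)  # clocked at the 250 MHz master rate
p_slow = dynamic_power_w(CAP, VDD, 25e6)   # clocked from an assumed 25 MHz derived clock

print(f"at 250 MHz: {p_fast * 1000:.1f} mW")
print(f"at  25 MHz: {p_slow * 1000:.1f} mW")
print(f"saving: {(p_fast - p_slow) * 1000:.1f} mW")
```

Because the relation is linear in f, a block that can tolerate one tenth of the clock rate draws one tenth of the clock-related dynamic power, all else being equal.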

The designer can also gate the clock signals to unused sections of the chip, which keeps the clocks from toggling the inputs of sections with no data changes. This comes with the understanding that the gate delay will exacerbate the clock skew and clock edge uncertainty for those sections. If the gate is used in place of a buffer in the clock tree section, the clock tree does not require an additional level of buffers to match the delays due to the extra gate levels.

Deep submicron design clocking techniques

Clock network design is one of the most critical steps in deep submicron ASIC design. It involves both physical implementation and clock delay measurement/verification.

Physical implementation of a clock network requires novel approaches to balance the tradeoffs between minimization of skew, small latency and power usage. One innovative approach is a clock network driven from multiple clock driver pads, also known as a multiply-driven clock spine network. Its benefit is that it can reduce both skew and latency.

One reason this technique produces low skew is that the clock signal is driven from multiple points on the chip, thereby reducing the effective distance between drivers and clock signal receivers (otherwise known as flip-flops). Additionally, the clock signal arrival time difference between the first flip-flop and the last flip-flop is much smaller, minimizing the skew. In multiply-driven clock networks, latency is reduced because fewer layers of buffer trees are needed to drive the clock net from multiple ends.
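A first-order feel for the benefit comes from the Elmore delay of a uniform RC line. The Python sketch below compares a spine driven from one end with the same spine driven by identical drivers at both ends. It assumes a uniformly loaded wire and ideal, perfectly aligned drivers, and the resistance and capacitance values are hypothetical; real multiply-driven nets need the special tools discussed in this sidebar.

```python
def single_driver(r_driver, r_wire, c_wire):
    """Return (latency, skew) in seconds for a uniform RC spine driven at one end.

    Latency is the Elmore delay to the far end; skew is the difference
    between the far end and the point next to the driver.
    """
    near = r_driver * c_wire
    far = near + r_wire * c_wire / 2.0
    return far, far - near


def driven_from_both_ends(r_driver, r_wire, c_wire):
    """Same spine with an identical driver at each end.

    By symmetry no current crosses the midpoint, so each driver
    effectively sees half of the wire.
    """
    return single_driver(r_driver, r_wire / 2.0, c_wire / 2.0)


# Hypothetical spine: 50 ohm drivers, 200 ohm and 2 pF of total wire.
lat1, skew1 = single_driver(50.0, 200.0, 2e-12)
lat2, skew2 = driven_from_both_ends(50.0, 200.0, 2e-12)

print(f"one driver : latency {lat1 * 1e12:.0f} ps, skew {skew1 * 1e12:.0f} ps")
print(f"two drivers: latency {lat2 * 1e12:.0f} ps, skew {skew2 * 1e12:.0f} ps")
```

Under these assumptions the wire-related skew falls by a factor of four, because both the resistance and the capacitance seen by each driver are halved.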

Clock networks for deep submicron designs are typically inserted during physical layout. This may be done with a clock tree place and route tool or manually inserted in physical layout of the design. After place and route of the design the RC values for the clock network are extracted and measured.

Multiply-driven clock spine network delays are very difficult to model because analytical RC algorithms only work for a net with a single driver. Circuit (Spice) simulation has been used as an alternative to analyze multiply-driven clock nets, but the Spice results must be manually analyzed and back-annotated to timing analysis tools. One alternative is a manual solution that breaks the multiply-driven net into multiple subnets and extracts the subnet segments for RC analysis. This method totally breaks down when more than a few drivers drive a single clock net. For accurate skew and latency analysis, special EDA tools are needed to model multiply-driven clock networks automatically, and the extracted data needs to be back-annotated to timing analysis tools.

Multiply-driven clock networks can be designed with very small skew and latency, but special tools beyond RC extraction and analysis are required to ensure that such networks meet the requirements of high-performance deep submicron designs.

Shahid Khan is the product marketing manager at Compass Design Automation, (San Jose, CA).



A phase-locked loop (PLL) is useful to resynchronize clocks and to generate multiples of the base system clock. The PLL can develop a clock with zero or even negative effective skew by adjusting the phase comparator response. One caveat is that one must monitor the phase jitter and noise associated with the PLL and clock regeneration circuitry. The jitter and synchronization can create repeatable phase relationships within the clock network for continuous signals. However, PLLs consume a lot of power, making them less attractive for low power applications.

According to John Harrington, manager of ASIC products at AT&T Microelectronics (Reading, PA), "PLLs are useful for clock doublers and triplers [and other multiples]. This can help by reducing external clock frequencies and allow lower cost crystals which can normally go up to 40 MHz. Three-fourths of their designs have a PLL to synchronize and/or align clock edges. The designer needs to be careful of PLL latency and lock times for those situations where the clock is not continuous."
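The clock multiplication arithmetic is simple enough to sketch in Python. Assuming an integer-only multiplier (an assumption made here for simplicity) and the 40 MHz crystal ceiling quoted above, the smallest usable multiplier for a given on-chip clock is found as follows.

```python
import math


def pll_multiplier(target_mhz, max_crystal_mhz=40.0):
    """Smallest integer multiplier that keeps the crystal at or below its limit.

    Returns (multiplier, crystal_frequency_mhz).
    """
    multiplier = math.ceil(target_mhz / max_crystal_mhz)
    return multiplier, target_mhz / multiplier


n, crystal = pll_multiplier(250.0)
print(f"250 MHz on chip: multiply by {n} from a {crystal:.2f} MHz crystal")

n2, crystal2 = pll_multiplier(80.0)
print(f" 80 MHz on chip: multiply by {n2} from a {crystal2:.2f} MHz crystal")
```

A doubler covers the 80 MHz case, while a 250 MHz clock needs a higher multiple than a doubler or tripler if the crystal is to stay within the quoted range.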

Jim Smith, ASIC product manager at Hitachi America (Brisbane, CA), agrees, noting, "We try to add PLLs to compensate and resync the clocks where possible. For multiple clocks, the problem is the latency and lock times for the clocks as well as the added jitter errors. The jitter errors add to the total clock skew."

If power consumption and/or management is the most important concern, then the complicated scheme described in the introduction should be considered. This could be multiple clocks, with multiple frequencies so only those circuits requiring extremely high performance would get the highest-speed clocks. Other areas would have lower-speed clocks and gated clocks and power-down circuitry to minimize the capacitive charging currents. Analyzing the intricacies of multiple clock interactions requires more detail and different techniques than is available in the standard ASIC flow (see "Issues in deep submicron timing verification").

If power consumption is minimized in the design through whatever techniques are available, it ameliorates the power distribution problems. The use of the "unused" gates as local decoupling capacitors mitigates the package isolation problems and minimizes the local IR drops. This additional on-chip capacitance reduces the effects of the synchronous power surges and the associated noise on the power and ground lines. The additional metal to the distributed local decoupling devices helps to reduce total supply and ground resistance, which reduces the potential for electromigration and improves overall manufacturability.
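The effect of local decoupling can be estimated from the charge relation ΔV = I · Δt / C: a current surge of I lasting Δt, supplied entirely from a capacitance C, pulls the local supply down by ΔV. The Python sketch below uses that relation. It deliberately ignores recharge through the package during the surge, and the surge current, duration, capacitance, and 5 percent droop target are hypothetical values.

```python
def supply_droop_v(surge_amps, duration_s, decap_farads):
    """Voltage droop when a surge is supplied only by local capacitance."""
    return surge_amps * duration_s / decap_farads


def decap_needed_f(surge_amps, duration_s, max_droop_v):
    """Capacitance needed to hold the droop to max_droop_v."""
    return surge_amps * duration_s / max_droop_v


# Hypothetical clock-edge surge: 2 A for 0.5 ns, with 10 nF of on-chip capacitance.
droop = supply_droop_v(2.0, 0.5e-9, 10e-9)

# Capacitance needed to keep the droop within 5% of an assumed 3.3 V supply.
needed = decap_needed_f(2.0, 0.5e-9, 0.05 * 3.3)

print(f"droop with 10 nF: {droop * 1000:.0f} mV")
print(f"capacitance for a 5% droop limit: {needed * 1e9:.2f} nF")
```

The estimate shows why even a few nanofarads gained from otherwise unused gates can matter when surges are synchronized to the clock edge.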

Issues in deep submicron timing verification

With decreasing device sizes and operating voltages, interconnect-related effects create regions of uncertainty during device switching. These effects include crosstalk, simultaneously switching outputs and signal noise. Specifically, at the deep-submicron level, device switching behavior is better described as a transition region rather than a transition point and this produces less correlated delays on the same chip (see Sidebar Figure 1).

Because the delay data in current ASIC cell libraries is based on the interval from one transition point to another, the libraries do not accurately model the uncertainties and the loss of delay correlation introduced by transition regions. This ambiguity can create timing variations that significantly lower production yields or lead to device failures. The solution to this problem is to model these effects in either ASIC libraries or in timing verification. The better solution is to handle it in timing verification using current libraries.

The two primary timing verification methods for eliminating the hazards today are static timing analysis and Verilog gate-level timing simulation (at best- and worst-case process, voltage and temperature conditions). Both methods use discrete delay values, and neither method can effectively model regions of uncertainty on interconnect delays.

Further, static timing analysis generates unrealistically pessimistic delays that cause large numbers of false paths. Gate-level timing simulation assumes that delays are perfectly correlated across the die. In deep submicron ASICs, gate delays are no longer fully correlated because of signal transition regions. Thus, traditional timing simulators miss many timing hazards in deep submicron ASICs.

The solution is to utilize range-delay simulation, which accurately models the effects of ambiguous switching regions and delay correlations. A range-delay simulator models delays with a range of values that represent switching uncertainties.

The range-delay simulator can also apply histograms to eliminate false paths. In this technique, a histogram of five equal segments represents the probability distribution of submicron ASIC delays. The simulator then combines the histograms to produce realistic delay ranges across paths, thus avoiding the false paths that would result from using a single pessimistic delay value.
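The idea of combining five-segment histograms along a path can be sketched in Python. This is only an illustration of the general approach, not the algorithm of any particular simulator: it assumes the two delays are statistically independent, represents each segment by its center value, and uses hypothetical delay ranges and weights.

```python
from itertools import product


def five_segment_histogram(d_min, d_max, weights):
    """Return [(segment_center_delay, probability)] for five equal segments."""
    if len(weights) != 5:
        raise ValueError("expected exactly five segment weights")
    total = float(sum(weights))
    width = (d_max - d_min) / 5.0
    return [(d_min + (i + 0.5) * width, w / total) for i, w in enumerate(weights)]


def combine_in_series(hist_a, hist_b):
    """Distribution of the sum of two delays, assuming independence."""
    combined = {}
    for (da, pa), (db, pb) in product(hist_a, hist_b):
        key = round(da + db, 6)  # merge sums that differ only by rounding
        combined[key] = combined.get(key, 0.0) + pa * pb
    return sorted(combined.items())


def delay_at_confidence(hist, confidence):
    """Smallest delay whose cumulative probability reaches the confidence."""
    cumulative = 0.0
    for delay, prob in hist:
        cumulative += prob
        if cumulative >= confidence - 1e-9:
            return delay
    return hist[-1][0]


# Hypothetical gate: delay between 1.0 and 2.0 ns, most likely near the middle.
gate = five_segment_histogram(1.0, 2.0, [1, 2, 4, 2, 1])
path = combine_in_series(gate, gate)  # two such gates in series

worst_case = 2.0 + 2.0  # single pessimistic value per gate
likely = delay_at_confidence(path, 0.95)
print(f"discrete worst case: {worst_case:.1f} ns")
print(f"95% of the combined distribution lies at or below {likely:.1f} ns")
```

The combined distribution concentrates near the middle of the range, so a path that looks critical when every gate is assigned its single worst-case value may carry very little probability of actually being that slow.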

Using histogram-based range simulation, designers can quickly identify and debug timing hazards due to deep submicron effects prior to sign-off. Eliminating these hazards can greatly improve ASIC production yields and reduce time to volume.

Allen Wu is an ASIC programs applications engineer at Nextwave Design Automation (San Jose, CA).


Sidebar Figure 1. Above 1.0µm (1990), signal transitions occur at a point. In the deep submicron realm (1994), signal transitions occur during a wide timing region.



Methodologies

The complexity and interactivity of the logic and physical design implementation phases require more changes in the design flows. A number of EDA and ASIC companies are starting to advocate new methodologies for the whole area of clock and power distribution. The latest methodologies tend towards early development of the clock system, using chip location and loading as the key drivers for the clock design. After the basic clocking scheme is developed and the first pass parasitic parameters are extracted, the rest of the circuitry is synthesized and routed. At this point, the first of the iterations for speed and power optimization is started. Area is no longer one of the driving issues, because a majority of the area is now fixed and dedicated to the logic and interconnections. Any changes in the clocking will be in the direction of trading off area for increased speed with little change in total power consumption.

According to Cadence's Tom Katsioulas, current technologies and methodologies are geared for a little automation and a lot of manual intervention. "The low amount of automation is a result of things like the number of gates (less than 50k); and the number of bits in the bus has been fairly small (eight or 16), so a lot of lines don't switch at the same time. Today clock trees are designed in pieces. Especially for large chips with multiple clocks, the tools generate one clock per block. They start with the largest block with the most loads, which generates the maximum delays, then match other blocks [automatically with minimal constraints]."

Physical design manager Herman Lam of Fujitsu (San Jose, CA) says that they are encouraging place and route of the clock system first, then the rest of the signals. For high performance functions, they think a large clock buffer driving a minimum size clock tree is the best way to accomplish the clocking. They place virtual flip-flops at the ends of the clock lines for loads, then let the software move the virtual flip-flops to optimal locations based on the actual logic use. When people try to get the logic interconnections first, then try to balance the clock trees for matched delays, the resulting circuit has a much larger clock tree with associated parasitics, which increase power consumption.

Clock and power glossary

Clock buffer - A circuit element to isolate and amplify an incoming clock signal.

Clock tree - A design technique to achieve balanced delays and loads in the clock buffers.

Gated clock - A clock line that can control clock transmission to the operating circuits.

Ground bounce - The change in ground (VSS) reference levels due to current in the ground line.

Ground loop - The noise caused in the ground line(s) due to unbalanced IR drops in the ground line.

Insertion delay - The time from clock pad to individual flip-flops.

IR drop - The voltage drop caused by the current I through the resistor R.

Jitter - The change in period to period timing in a clock signal.

Latency - The time for a clock to become available in the circuit.

Multiphase clock - A clocking system with more than one phase; the phases may be overlapping or non-overlapping. Biphase: clock and complement. Quadrature: clocks separated by a phase angle of 90°.

PLL - Phase-Locked Loop, a variable frequency generator locked to a source signal.

Skew - The maximum difference in clock arrival time between any two flip-flops.

Slew rate - The rate at which a signal changes from one level to the other; often quoted as the rise time or fall time, the time for a signal to go from one level to the other level.




"Clock and power must be planned early in the submicron design process regardless of the implementation technique that is used," says George Janac, vice president of engineering at High Level Design Systems (Santa Clara, CA). "As a result, floorplanning tools must allow designers to accurately analyze clock and power distribution from the HDL level down to the physical design level."

The issues of clock and power distribution can have a disproportionate influence on the performance and power consumption in a large IC. A lack of attention to the details of implementation can increase die size and power while causing performance to be less than desired or specified. The interactions and tradeoffs in a clock and power distribution subsystem provide a rich environment for the designer to exercise creativity in conjunction with the tools and understanding of the physical effects.

Tets Maniwa is a technical editor for Integrated System Design.

To voice an opinion on this or any Integrated System Design article, please e-mail your message to: michael@asic.com.


integrated system design  August 1995





For more information about isdmag.com e-mail cam@isdmag.com
For advertising information e-mail amstjohn@mfi.com
Comments on our editorial are welcome.
Copyright © 1996 - Integrated System Design Magazine


All materials on this site Copyright © 2009 TechInsights, a Division of United Business Media LLC All rights reserved.