With the increasing complexity and power dissipation of modern electronic designs, controlling peak temperature and predicting the on-chip temperature profile early in the design process are becoming critical for ensuring system reliability. As chip complexity scales according to Moore's Law, both the power density and the total power dissipation of chips continue to increase. This creates severe challenges for the thermal design of packages and chips. SoC designers address this problem with one or more low-power design methodologies, such as switching off parts of the design when not in use. These low-power design methodologies create their own thermal challenges, as they may generate "hot" and "cool" regions on the chip.
Thermal Impacts on Modern SoC Designs
At the 90nm process node and below, thermal integrity considerations introduce additional design constraints, such as leakage power. In these systems, 30% or more of the total power dissipation can be caused by gate leakage, and leakage current is a strong function of local temperature. Because the dependency between leakage power and temperature is highly non-linear, knowing the average temperature of a system is not sufficient. Predicting the leakage power, and thus the total system power, requires detailed and accurate knowledge of the temperature distribution. Local hotspots, with higher-than-average temperatures, contribute disproportionately to the total power dissipation of the system. Furthermore, local temperature distributions affect the performance and reliability of integrated circuits. As a result, accurate thermal analysis is an essential part of the design process for modern SoC designs.
The key design parameters affected by heat production and dissipation characteristics, and the resulting temperature variations across the chip, are:
- Leakage power
- Timing closure
- Reliability (electromigration)
These thermal issues are described more completely in the following sections.
Leakage Power
The contribution of leakage power to the total power dissipation of the chip is expected to increase as processes scale beyond the 90nm node. Figures 1 and 2 show a comparison of the power dissipation and temperature dependency trends for 90nm and 45nm nodes. Clearly, because of the increasing leakage power contribution, it is not sufficient to calculate the total power of a system based on an average temperature assumed constant over the entire chip. Only with the knowledge of the detailed temperature distribution over the chip can designers accurately calculate the leakage power at every location, and thus the total power of the chip.
Figure 1. At the 90nm process node, leakage power increases more than linearly with temperature.
Figure 2. At the 45nm process node, leakage power increases quadratically with temperature, approaching the dynamic power consumption at higher temperatures.
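The effect of a non-uniform temperature distribution on total leakage can be illustrated with a minimal sketch. The leakage model below (power doubling every 20°C) and the per-tile temperatures are hypothetical values chosen only for illustration; they are not taken from Figures 1 and 2. Because the model is convex in temperature, evaluating leakage at the average temperature underestimates the sum of the per-tile leakage.

```python
def leakage_power(temp_c, p_ref=1.0, t_ref=25.0, doubling_c=20.0):
    """Hypothetical leakage model (assumption): leakage power of one tile,
    in arbitrary units, doubling every `doubling_c` degrees Celsius."""
    return p_ref * 2.0 ** ((temp_c - t_ref) / doubling_c)


# Hypothetical per-tile temperatures in degrees Celsius; one tile is a hotspot.
tile_temps = [60.0, 62.0, 65.0, 70.0, 75.0, 80.0, 95.0, 125.0]

# Estimate 1: assume every tile sits at the chip-average temperature.
avg_temp = sum(tile_temps) / len(tile_temps)
leak_from_average = len(tile_temps) * leakage_power(avg_temp)

# Estimate 2: evaluate leakage at each tile's local temperature and sum.
leak_per_tile = sum(leakage_power(t) for t in tile_temps)

print(f"average temperature:          {avg_temp:.1f} C")
print(f"leakage from average temp:    {leak_from_average:.2f}")
print(f"leakage from local temps:     {leak_per_tile:.2f}")
print(f"underestimate by averaging:   {leak_per_tile / leak_from_average:.2f}x")
```

The size of the gap depends entirely on the assumed model and temperature spread; the point of the sketch is only that the hottest tile dominates the per-tile sum.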
In addition, as leakage current increases with local temperature, local heat generation increases as well. Power dissipation and temperature distribution are therefore interdependent, and must be considered simultaneously to accurately predict both the power consumption and the temperature distribution on the chip. Current design analysis methodologies, however, typically assume an average system temperature based on estimates of the chip power dissipation and the thermal dissipation of the chip/package system. They neither consider the dependency between local temperature and leakage power nor model thermal hotspots in the design, even though hotspots contribute significantly to the leakage power and the heat dissipation in the system. As a result, current thermal design methodologies fail to predict the correct total power dissipation and the maximum temperatures of the system.
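The interdependence of power and temperature can be sketched as a fixed-point iteration. The example below is a simplified single-node model, not a description of any particular tool: it assumes a lumped thermal resistance `r_th`, a hypothetical exponential leakage model, and illustrative parameter values. It compares a one-shot estimate, with leakage evaluated at ambient temperature, against the self-consistent solution.

```python
def leakage_power(temp_c, p_leak_ref=0.5, t_ref=25.0, doubling_c=25.0):
    """Hypothetical leakage model (assumption): watts, doubling every 25 C."""
    return p_leak_ref * 2.0 ** ((temp_c - t_ref) / doubling_c)


def solve_electrothermal(p_dynamic, r_th, t_ambient, tol=1e-6, max_iter=200):
    """Iterate T = T_ambient + R_th * (P_dynamic + P_leak(T)) to convergence.

    Returns (temperature in C, total power in W, iterations used).
    """
    temp = t_ambient
    for iteration in range(1, max_iter + 1):
        power = p_dynamic + leakage_power(temp)
        new_temp = t_ambient + r_th * power
        if new_temp > 500.0:
            # Leakage grows faster than the package can remove heat.
            raise RuntimeError("thermal runaway: no stable operating point")
        if abs(new_temp - temp) < tol:
            return new_temp, power, iteration
        temp = new_temp
    raise RuntimeError("electrothermal iteration did not converge")


# Illustrative values (assumptions): 2 W dynamic power, 10 C/W, 25 C ambient.
P_DYNAMIC, R_TH, T_AMBIENT = 2.0, 10.0, 25.0

# One-shot estimate: leakage evaluated once, at ambient temperature.
one_shot_power = P_DYNAMIC + leakage_power(T_AMBIENT)
one_shot_temp = T_AMBIENT + R_TH * one_shot_power

temp, power, iterations = solve_electrothermal(P_DYNAMIC, R_TH, T_AMBIENT)

print(f"one-shot:        {one_shot_temp:.2f} C, {one_shot_power:.3f} W")
print(f"self-consistent: {temp:.2f} C, {power:.3f} W ({iterations} iterations)")
```

A full-chip analysis would solve the same kind of coupled problem over a spatial temperature grid rather than a single node.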
Timing Verification and Closure
With rising temperature, transistor drive strengths decrease due to carrier mobility degradation, leading to slower slew rates and longer gate delays. Interconnect delay and slew also increase with rising temperature due to increased metal resistivity. This affects the setup and hold time margins of circuits. Because temperature variation affects the slew rates of signals, it also affects the cross-talk noise between signal lines.
In current design methodologies, timing and cross-talk are analyzed at various process and temperature corners, assuming a homogeneous temperature distribution across the chip for each corner. This methodology does not account for local variation: thermal hotspots can be significantly above the average temperature, while the peak temperature can be well below the worst-case corner temperature. The design risk is more severe for early-mode analysis, which checks race conditions between signals and their timing reference (the clock). If the data and clock paths experience different temperatures, so that the clock path becomes slower and/or the data path faster, the resulting hold-time violation can cause chip failure.
Figure 3. Thermal conditions impact the timing of launch and capture paths.
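A hold check that passes at every uniform temperature corner can still fail under a temperature gradient. The sketch below illustrates this with a hypothetical linear delay derating (0.2% per degree Celsius) and made-up path delays; neither the coefficient nor the delays come from a real library or design.

```python
def scaled_delay(delay_ns, temp_c, t_nom=25.0, coeff_per_c=0.002):
    """Hypothetical linear derating (assumption): delay grows 0.2% per C."""
    return delay_ns * (1.0 + coeff_per_c * (temp_c - t_nom))


def hold_slack(launch_clk_ns, data_ns, capture_clk_ns, hold_ns,
               t_launch, t_data, t_capture):
    """Hold slack in ns: earliest data arrival minus required hold time.

    Each path segment is derated with its own local temperature.
    """
    arrival = (scaled_delay(launch_clk_ns, t_launch)
               + scaled_delay(data_ns, t_data))
    required = scaled_delay(capture_clk_ns, t_capture) + hold_ns
    return arrival - required


# Made-up nominal delays at 25 C, in nanoseconds.
PATH = dict(launch_clk_ns=1.00, data_ns=0.20, capture_clk_ns=1.10, hold_ns=0.05)

# Conventional analysis: one uniform temperature per corner.
for corner_c in (-40.0, 25.0, 125.0):
    slack = hold_slack(**PATH, t_launch=corner_c, t_data=corner_c,
                       t_capture=corner_c)
    print(f"uniform {corner_c:6.1f} C: hold slack = {slack:+.3f} ns")

# Local temperatures: launch clock and data path run cool, capture clock
# branch crosses a hotspot and slows down.
gradient_slack = hold_slack(**PATH, t_launch=60.0, t_data=60.0, t_capture=110.0)
print(f"gradient 60/110 C: hold slack = {gradient_slack:+.3f} ns")
```

With these assumed numbers every uniform corner reports positive slack, while the gradient case is negative, which is the failure mode that constant-temperature corner analysis cannot expose.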
Clock nets in particular typically cover the entire area of the chip, and can be exposed to the full chip temperature variation. Clock nets, and branches of the clock network exposed to different local temperatures, will show different delays, causing clock skew. Low power techniques such as clock gating, power gating, dual threshold voltage, and voltage islands increase the temperature variation and local hot spots, causing an even greater impact on timing. For these reasons it is not sufficient to analyze timing at one or more constant temperature points for the entire design. It is essential to perform accurate analysis of local temperature variations on all critical timing paths.
Electromigration
Electromigration (EM) in the metal and via interconnects is a major limiting factor for the reliability of integrated circuits. EM describes the transport of mass in metals under the stress of high current density, causing metallization failure. Electromigration increases exponentially with temperature. In general, conductor lifetime, or mean-time-to-failure (MTTF), is used to measure the EM effect, which is modeled by Black's equation:
$$\mathrm{MTTF} = \frac{A}{J_{avg}^{\,n}} \exp\left(\frac{E_a}{k_B T_m}\right)$$
where $J_{avg}$ is the average current density, $T_m$ is the metal/via temperature, $A$ is a constant which depends on the geometry and microstructure, $n$ is the current-density exponent (2 in Black's original formulation), $E_a$ is the activation energy, and $k_B$ is Boltzmann's constant. To satisfy the required conductor lifetime, an upper bound is placed on current density. Because MTTF is an exponential function of temperature, the safe current density limit must be reduced exponentially as temperature increases in order to maintain the same reliability, as shown in Figure 4.
Current VLSI design methodologies assume a constant temperature corner for analysis, for example a temperature of 105°C for the entire chip. At hotspots with a temperature higher than 105°C, the design may fail even though it passed the conventional reliability analysis. On the other hand, the constant-temperature assumption can also be pessimistic, leading to overdesign. For example, assuming a constant temperature of 105°C demands a low current density limit for all wires in the design. For most wires this is very pessimistic, and can lead to thousands of false violations.
Figure 4. EM current density limit as a function of local temperature. An increase in temperature of 25 degrees Celsius decreases the current density limit by approximately a factor of three.
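The temperature dependence of the current density limit can be sketched directly from Black's equation, taken here in the form MTTF = A · J^(-n) · exp(E_a / (k_B · T)). Holding MTTF fixed and solving for J gives a limit that scales as exp(E_a / (n · k_B · T)). The activation energy (0.9 eV) and exponent (n = 2) below are illustrative assumptions, so the resulting ratios will not necessarily match Figure 4.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K


def current_limit_ratio(temp_c, ref_temp_c=105.0, activation_ev=0.9, n=2.0):
    """Allowed current density at temp_c relative to the limit at ref_temp_c,
    for the same target MTTF under Black's equation.

    activation_ev and n are illustrative assumptions, not process data.
    """
    temp_k = temp_c + 273.15
    ref_k = ref_temp_c + 273.15
    exponent = activation_ev / (n * BOLTZMANN_EV_PER_K)
    return math.exp(exponent * (1.0 / temp_k - 1.0 / ref_k))


# Hypothetical local wire temperatures in degrees Celsius.
for wire_temp_c in (65.0, 85.0, 105.0, 125.0):
    ratio = current_limit_ratio(wire_temp_c)
    print(f"{wire_temp_c:6.1f} C: limit = {ratio:.2f} x the 105 C limit")
```

Wires cooler than the 105°C corner can tolerate a higher current density than the corner-based limit allows, which is the source of false violations, while wires in hotter regions need a tighter limit than the corner provides.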
Low Power Designs
Low-power designs make extensive use of clock gating and power gating. This increases the temperature variation over the design by generating regions on the chip with vastly different activity and power dissipation profiles, which can have a large impact on the timing behavior and sign-off of the system, as discussed above. It also means that leakage current estimation based on an average temperature will be too optimistic, since clock gating generates cooler and hotter regions, and the hot regions contribute disproportionately to the leakage current of the system. Thermal integrity analysis is therefore important not only for high-temperature, high-performance designs like microprocessors, but also for low-power designs, which are even more exposed to thermal effects that influence the performance, reliability, functionality, and power dissipation of the system.