Design for reliability – the golden age of simulation driven product design
Arvind Shanmugavel
5/7/2012 10:06 AM EDT
The design and implementation process for integrated circuits (ICs) has been honed for decades by the design and electronic design automation (EDA) communities. However, the reliability verification process has been slow to catch up, largely because of the complex nature of failure mechanisms. Chip designers in the past were willing to take risks when it came to reliability verification because a reliability problem was not seen as a functional failure or something that caused yield fallout. But times have changed, and the EDA and simulation software community has swiftly responded to the need for a simulation-driven reliability analysis model. This article will delve into what it takes to design for reliability today.
Over a decade ago, the IC design and verification process would include design margin for almost every form of physical verification. Margins were added for several checks such as timing, IR drop, and decap requirements. Essentially, these margins were built into the verification sign-off process because the true operating condition could not be modeled accurately. For example, the voltage drop at the full-chip level was only simulated using a static analysis. Neither the tools nor the compute power were adequate to simulate the entire chip, package, and system in a transient analysis, so design margins were built into the static IR drop analysis to account for dynamic behavior. Another example is timing sign-off. Margins for set-up and hold times were built in to account for voltage drop effects and aggressor-induced slowdown or speed-up of interconnect delays. Reliability verification for electro-migration and self-heat was typically done with worst-case switching, temperature, and recovery factors. There was no clear way to obtain realistic switching behavior or a realistic die temperature profile when signing off electro-migration and self-heat effects.
Fast forward to today, the age of simulation-driven product design. Every IC designer has a toolbox of EDA products to help simulate and verify various reliability phenomena. Reliability verification for ICs covers not only classic electro-migration and self-heat, but also power / ground noise, substrate noise, thermal reliability, electro-magnetic interference (EMI), and electrostatic discharge (ESD) events.
Multi-physics modeling techniques such as electro-magnetic, thermo-mechanical, electro-mechanical, and thermo-electric analysis are mature in the simulation industry, albeit still evolving. Failure mechanisms in ICs are caused by one physical phenomenon affecting another. For example, the effect of temperature on the electrical resistance of wires and the effect of current flow on heat dissipation in wires (Joule heating) are both thermo-electric multi-physics phenomena. Other examples include the impact of temperature on IC mechanical failures and electro-magnetic interference between multiple ICs in a system. Simulation tools no longer analyze one phenomenon in isolation. They are able to straddle different domains of analysis in order to model the true behavior of the system. Multi-physics principles and complex model exchanges are being used to simulate failure mechanisms.
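The thermo-electric coupling described above can be illustrated with a small sketch. It assumes a single wire with a linear temperature coefficient of resistance and a single lumped thermal resistance to ambient. The function name, parameter names, and all numeric values are hypothetical and chosen only for illustration; a production tool would solve this over a full extracted network.

```python
def wire_steady_state(current_a, r_ref_ohm, tcr_per_k, theta_k_per_w,
                      t_ambient_c, t_ref_c=25.0, tol=1e-9, max_iter=1000):
    """Find the self-consistent wire temperature and resistance.

    Assumed models (illustrative only):
      R(T) = r_ref * (1 + tcr * (T - t_ref))   temperature raises resistance
      T    = t_ambient + theta * I^2 * R(T)    Joule heating raises temperature
    """
    def resistance_at(temp_c):
        return r_ref_ohm * (1.0 + tcr_per_k * (temp_c - t_ref_c))

    temp_c = t_ambient_c
    for _ in range(max_iter):
        power_w = current_a ** 2 * resistance_at(temp_c)
        new_temp_c = t_ambient_c + theta_k_per_w * power_w
        if abs(new_temp_c - temp_c) < tol:
            return new_temp_c, resistance_at(new_temp_c)
        temp_c = new_temp_c
    # The fixed-point loop diverges when heating outruns cooling.
    raise RuntimeError("no convergence (possible thermal runaway)")


# Hypothetical wire: 100 ohm at 25 C, 0.4 %/K, 1000 K/W to ambient, 10 mA.
temperature_c, resistance_ohm = wire_steady_state(0.01, 100.0, 0.004, 1000.0, 25.0)
```

Solving the two effects separately would give a 10 K rise for these values. The coupled solution settles slightly higher because the heated wire is also more resistive, which is the kind of interaction a single-domain analysis misses.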
Design for Reliability
The IC design community has started to rely on the mantra of 'first silicon success' with a keen focus on 'design for reliability'. Every IC being designed is analyzed for various reliability failure mechanisms using a simulation-driven approach. Designers are no longer building in margins to account for unknown phenomena or designing with a ‘correct by construction’ approach. State-of-the-art EDA tools not only have the ability to simulate complex failure mechanisms with multi-physics interactions, but also have the capacity to simulate the entire IC subsystem, including chip, package, and board. Issues such as power delivery noise, substrate noise, electro-magnetic interference, and thermal stability can only be accurately simulated when the entire IC subsystem is considered. Reliability failure mechanisms in ICs can be broken down into three major types.
Operational Reliability Failures
Operational reliability failures are very different from functional failures in the IC world. A functional failure occurs when an improper logic condition happens during normal operation of a circuit. An operational failure, on the other hand, occurs when the operating condition of an IC is outside the normal range of operation. Functional failures are very uncommon today, thanks to sophisticated logic verification, synthesis, and test tools. However, operational failures are more complex to capture and model due to the uncertainties of operating conditions and multi-physics interactions.
The most common operational failure is transient voltage noise on the power delivery networks (PDNs) of ICs. PDN noise is very complex to model and simulate. Different noise coupling pathways exist for every aggressor and victim, so the entire PDN, from the system and package all the way to the die, needs to be modeled and simulated.
When dealing with power delivery noise, we need to model three major aspects of an IC subsystem. The first part to model is the ‘source of noise’. This includes modeling all the switching instances or switching modules using appropriate current models. The second part is the ‘medium of propagation’. This includes modeling all the noise propagation media such as on-die power grid networks, substrate networks, package traces, and board components. The third part is to model the ‘impact on the victim’. These victims can be neighboring instances, decap elements, or any active or passive component that is connected to the PDN. If we are able to model these three aspects of the system, the overall simulation will be as accurate as the underlying models. The same concept can be applied when modeling any type of system noise, be it electrical, thermal, EMI, or signal transmission noise, although thermal noise modeling also requires an understanding of the boundary conditions of the system.
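The source / medium / victim split can be sketched with a deliberately crude lumped model. It assumes the whole PDN collapses into one series resistance and one series inductance with no decoupling capacitance, which is far simpler than anything a sign-off tool would use. The function name and all values are hypothetical.

```python
def supply_at_victim(vdd_v, current_a, dt_s, r_ohm, l_h):
    """Supply voltage seen by a victim for each sample of the aggressor current.

    source : current_a, the sampled switching current in amperes
    medium : r_ohm and l_h, a lumped series R and L standing in for the PDN
    victim : the returned voltage samples, V = Vdd - R*i - L*di/dt
    """
    voltages = []
    previous = current_a[0]
    for present in current_a:
        di_dt = (present - previous) / dt_s  # backward difference
        voltages.append(vdd_v - r_ohm * present - l_h * di_dt)
        previous = present
    return voltages


# Hypothetical case: a 1 A step within 1 ns through 10 mOhm and 50 pH.
samples = supply_at_victim(1.0, [0.0, 0.0, 1.0, 1.0], 1e-9, 0.01, 50e-12)
worst_case_v = min(samples)
```

With these numbers the supply dips to about 0.94 V during the step and then recovers to 0.99 V. The inductive L·di/dt term dominates the transient, which is why a static IR drop analysis alone would not show it.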
Understanding the operating scenario of a die is also critical in identifying power and ground noise issues. Capturing noise-critical scenarios such as high power cycles, high power transition cycles, and power-up scenarios is important when performing transient noise simulations. Models that accurately capture these critical cycles across millions of operations are necessary to sign off on the power integrity of the chip. A mobile processor is a typical case in point. Complex interaction between cores, intellectual property (IP), I/Os, and memories causes the processor to go into different states including idle, standby, wake-up, and peripheral access modes. Behavioral models that capture the activity during these state transitions should be used for operational reliability sign-off.
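As a rough illustration of picking noise-critical cycles out of a long activity trace, the sketch below scans a per-cycle current profile for the highest-current cycle and for the largest cycle-to-cycle change. The function name and the trace values are hypothetical, and real tools work from far richer activity data than a single list of numbers.

```python
def noise_critical_cycles(current_per_cycle_a):
    """Return (peak_cycle, transition_cycle) as indices into the trace.

    peak_cycle       : cycle with the highest current (a high power cycle)
    transition_cycle : cycle whose current differs most from the previous
                       cycle (a high power transition cycle)
    """
    count = len(current_per_cycle_a)
    if count < 2:
        raise ValueError("need at least two cycles")
    peak_cycle = max(range(count), key=lambda k: current_per_cycle_a[k])
    transition_cycle = max(
        range(1, count),
        key=lambda k: abs(current_per_cycle_a[k] - current_per_cycle_a[k - 1]),
    )
    return peak_cycle, transition_cycle


# Hypothetical trace: idle, wake-up, sustained activity, partial power-down.
trace = [0.1, 0.1, 0.2, 1.5, 1.6, 1.7, 0.9]
peak, transition = noise_critical_cycles(trace)
```

Here the wake-up edge (cycle 3) is the largest transition, while the highest current occurs later (cycle 5). The two are different cycles, so both would need to be covered in a transient noise simulation.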
Advanced EDA tools provide the ability to capture the behavior of these state transitions across millions of cycles and simulate the noise response of the IC subsystem. These EDA tools not only provide the simulation infrastructure for modeling noise scenarios, but also provide a root-cause mechanism that can accurately identify the reason for a failure. For example, transient noise on a PDN can be caused by one of several factors such as simultaneous switching of devices, weakly connected power straps, missing decoupling capacitors, or high AC impedance through the package. Understanding these different aspects of the design and ranking their impact on the power noise provides valuable information to the designer.
Electro-magnetic interference is another example of operational failure for IC reliability. When the electro-magnetic fields from an IC exceed a certain threshold for near- or far-field radiation, the IC does not comply with the EMI standard for its device class. Typically, electro-magnetic interference is caused when the electro-magnetic field from one IC interferes with the electrical operation of another IC within a certain proximity. These failure scenarios are difficult to model without understanding the complete subsystem of the IC. A very accurate current signature on all the metal interconnects of the die, along with the currents flowing through the three-dimensional package traces, needs to be modeled in order to simulate the near- and far-field radiation patterns of the IC.
A structured approach that models the die using accurate chip emission models and the package using full-wave electro-magnetic solvers is important. For example, the EMI radiation pattern for an application processor in a smartphone during stand-by mode could be very different from that in call-answer mode or WiFi access mode. EMI filters to apply on the package should be selected only after careful analysis of the energy spectrum for the die and package simulated together. First silicon success does not ensure EMI compliance as a rule. A simulation-driven approach needs to be taken to address reliability issues related to EMI.
Time Based Reliability Failures
Time-based reliability failures are those that develop over the lifetime of the device. These types of failures typically include electro-migration, self-heat, channel hot carrier, and negative bias temperature instability (NBTI). Technology migration in general has been accompanied by complex extraction rules for metal interconnects and complex rules for electro-migration and self-heat. The electro-migration problem has worsened over successive process nodes because metal resistivity and current densities have increased at the same time. A complete simulation-driven approach that can handle these complex rules is needed in order to accurately capture electro-migration issues.
Temperature also has an inverse exponential relation to the maximum allowable current in a metal interconnect: the allowable current falls steeply as the wire heats up. Temperature therefore directly affects the mean time to failure of an IC. Accurate operating die temperature needs to be considered when performing electro-migration simulations for reliability. The die temperature assumed can drastically change the number of electro-migration violations that are flagged for fixing. Sometimes, even a worst-case temperature assumption may not be a true localized worst case. High current densities could lead to localized hot spots that are only a few square microns in area. Failing to capture these localized temperature gradients when simulating electro-migration effects may compromise the lifetime of a device.
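The temperature sensitivity described above is commonly modeled with Black's equation, MTTF = A · J^(-n) · exp(Ea / (k·T)). The article does not name a model, so the sketch below is an assumption in that respect. The activation energy and current-density exponent are illustrative placeholders, not foundry values.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5


def mttf_ratio(t_hot_c, t_ref_c, ea_ev):
    """MTTF(t_hot) / MTTF(t_ref) at the same current density (Black's equation)."""
    t_hot_k = t_hot_c + 273.15
    t_ref_k = t_ref_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_hot_k - 1.0 / t_ref_k))


def current_density_derating(t_hot_c, t_ref_c, ea_ev, n):
    """J_max(t_hot) / J_max(t_ref) that keeps the MTTF unchanged.

    Setting the two MTTFs equal in Black's equation gives
    (J_hot / J_ref)^n = exp(Ea/k * (1/T_hot - 1/T_ref)).
    """
    return mttf_ratio(t_hot_c, t_ref_c, ea_ev) ** (1.0 / n)


# Assumed values for illustration: Ea = 0.9 eV, n = 2, hot spot 20 C above
# a 105 C sign-off temperature.
lifetime_factor = mttf_ratio(125.0, 105.0, 0.9)
allowed_current_factor = current_density_derating(125.0, 105.0, 0.9, 2.0)
```

Under these assumed parameters, a 20 °C local hot spot cuts the predicted lifetime to roughly a quarter of the value at the sign-off temperature. This is why a single die-wide temperature assumption can hide real violations.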
A combined die-package thermal analysis needs to be performed so that the impact of temperature is known before violations are fixed. Leading-edge reliability platforms have the ability to perform IC-package-system level thermal simulations and to back-annotate the spatial temperature at micron resolution while performing EM and self-heat analysis. These platforms also have the ability to handle complex extraction and electro-migration rules for advanced process nodes.
Event Based Reliability Failures
Event-based reliability failures are typically catastrophic events that can render the IC inoperable after the event. An electrostatic discharge (ESD) on an IC is the most common form of event-based reliability failure during normal operation. The protection mechanism against an ESD failure is usually a low-impedance discharge path for the ESD currents. Verification of this protection is complex, considering the number of voltage domains we have in today’s ICs and the shrinking ESD margin due to technology scaling.
ESD design has quickly changed from an ‘art form’ to being ‘simulation-based’. ESD designers no longer have to rely on manual checks, looking through the power / ground grid structure and auditing the connection to the ESD cells and bump pads. Simulation tools are smart enough to show the resistance bottlenecks during the ESD events and are powerful enough to perform millions of resistance calculations between zap points in a package-die subsystem.
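The resistance calculation between two zap points can be sketched on a toy resistor network. The approach below is standard nodal analysis: inject 1 A at one point, ground the other, and read off the voltage. It uses plain Gaussian elimination and is not a description of how any particular tool works; the function name and the example network are hypothetical.

```python
def effective_resistance(num_nodes, resistors, node_a, node_b):
    """Effective resistance in ohms between node_a and node_b.

    resistors is a list of (node_i, node_j, ohms) tuples.
    """
    if node_a == node_b:
        return 0.0
    # Conductance (Laplacian) matrix of the network.
    g = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i, j, ohms in resistors:
        c = 1.0 / ohms
        g[i][i] += c
        g[j][j] += c
        g[i][j] -= c
        g[j][i] -= c
    # Ground node_b by dropping its row and column; inject 1 A at node_a.
    keep = [n for n in range(num_nodes) if n != node_b]
    size = len(keep)
    aug = [[g[r][c] for c in keep] + [1.0 if r == node_a else 0.0] for r in keep]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-15:
            raise ValueError("network is disconnected")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution for the node voltages.
    volts = [0.0] * size
    for r in range(size - 1, -1, -1):
        rhs = aug[r][size] - sum(aug[r][c] * volts[c] for c in range(r + 1, size))
        volts[r] = rhs / aug[r][r]
    # With 1 A injected, the voltage at node_a equals the resistance.
    return volts[keep.index(node_a)]


# Hypothetical path: pad (0) to clamp (2), directly through 2 ohm and via node 1.
pad_to_clamp = effective_resistance(3, [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 2.0)], 0, 2)
```

The two parallel 2 ohm paths in this example combine to 1 ohm. A check would then compare the result against a resistance limit for the discharge path, and a full-chip tool repeats this for millions of pad and clamp pairs on the extracted grid.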
A current density check during ESD events is also an important aspect that needs to be verified in today’s designs. The ESD current and voltage standards have remained the same from one technology node to another. The same amount of charge needs to be shunted through a discharge path regardless of the technology node. The device not only needs the ability to discharge the current through the ESD clamps but also needs to reliably carry the current through the metal interconnects without burning out. Tools are smart enough to simulate the entire ESD event and check the current density on all die metal interconnects for failures.
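A first-order version of the current density check can be sketched as follows. It assumes a human-body-model style zap, for which the peak current is approximately the zap voltage divided by the 1.5 kΩ series resistor of that model. The wire list, the share of current each wire carries, and the current density limit are all hypothetical inputs; a real tool derives the current distribution by simulating the extracted network.

```python
HBM_SERIES_RESISTANCE_OHM = 1500.0  # series resistor of the human body model


def hbm_peak_current_a(zap_voltage_v):
    """Approximate peak discharge current for an HBM zap."""
    return zap_voltage_v / HBM_SERIES_RESISTANCE_OHM


def overstressed_wires(wires, zap_voltage_v, j_limit_a_per_um2):
    """Return (name, current_density) for wires above the limit.

    wires is a list of (name, width_um, thickness_um, current_share), where
    current_share is the assumed fraction of the zap current in that wire.
    """
    peak_a = hbm_peak_current_a(zap_voltage_v)
    flagged = []
    for name, width_um, thickness_um, current_share in wires:
        density = peak_a * current_share / (width_um * thickness_um)
        if density > j_limit_a_per_um2:
            flagged.append((name, density))
    return flagged


# Hypothetical discharge path for a 2 kV zap and an assumed 1 A/um^2 limit.
path = [("m9_strap", 10.0, 1.0, 1.0), ("m2_jumper", 0.5, 0.2, 0.5)]
violations = overstressed_wires(path, 2000.0, 1.0)
```

The zap current is set by the test standard rather than by the process node, so a wire that has shrunk with the technology sees a higher current density. In this example the narrow lower-level jumper is flagged while the wide top-level strap passes.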
The number of voltage domains in today’s ICs has also risen sharply, and the complexity involved in modeling cross-domain ESD checks is correspondingly high. Designers not only need to perform electrical rule checks (ERC) for the presence of clamps or back-to-back diodes between these domains, but they also need to check the validity of placement by performing appropriate resistance calculations between all these domains.
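The presence half of the cross-domain check is essentially a set comparison. The sketch below assumes, for simplicity, that every pair of domains needs a clamp or back-to-back diode pair; in practice the required pairs depend on the design. The domain names and the data model are hypothetical.

```python
from itertools import combinations


def missing_cross_domain_clamps(domains, clamps):
    """Return the domain pairs that have no clamp between them.

    domains : iterable of domain names
    clamps  : iterable of (domain_a, domain_b) pairs; order does not matter
    """
    protected = {frozenset(pair) for pair in clamps}
    return [
        pair
        for pair in combinations(sorted(domains), 2)
        if frozenset(pair) not in protected
    ]


# Hypothetical design with three domains and a single cross-domain clamp.
unprotected = missing_cross_domain_clamps(
    ["VDD_CORE", "VDD_IO", "VDD_PLL"],
    [("VDD_IO", "VDD_CORE")],
)
```

This reports the two pairs involving VDD_PLL as unprotected. A clamp that is present but poorly placed would still pass this check, so the resistance between domains also has to be calculated.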
The Golden Age
Low cost and time-to-market are critical to the success of OEM IC suppliers today. If a particular failure mechanism cannot be modeled and predicted in an IC, then the reliability of the system is compromised. With such tight schedules, IC designers no longer have the luxury of testing the die in a lab to figure out reliability issues. A simulation-driven approach to product design, testing, and verification sign-off is imperative in this climate. With today’s multi-physics technologies, creative thinkers, and innovative doers, we are one step closer to the golden age of simulation-driven product design!
About the author
Arvind Shanmugavel is Director of Applications Engineering at Apache Design, Inc., a subsidiary of ANSYS, Inc., supporting the RedHawk™ and Totem™ product platforms. Prior to Apache he worked at Sun Microsystems, leading design initiatives for advanced microprocessor designs. He holds an MSEE from the University of Cincinnati, Ohio.