The importance of reliability can best be demonstrated using an anecdote I was told by a friend back in 2008. While working for a major IC firm from San Francisco, he received a shipment of new desktop PCs that turned out to be somewhat problematic.
Within months these PCs had started to crash. The IT department was rolled in to fix the assumed operating system gremlins and/or viruses that were affecting these new computers -- to no effect. After much investigation, and with many a stripped-down PC, it was eventually revealed that the problem was caused by substandard bulk capacitors in the AC/DC power supply. These had deteriorated in use, and were causing the supply rails to be out of regulation, producing the random crashes.
The episode highlights that, while power supplies may not have the glamour, nor get the attention that processors and displays receive, they are just as vital to system operation. Here we look at reliability in power supplies, how it's measured, and how it can be improved.
Predicting the power supply's expected life
First, a few definitions:
Reliability, R(t). The probability that a power supply will still be operational after a given time.
Failure rate, λ. The proportion of units that fail in a given time. Note, there is a high failure rate in the burn-in and wear-out phases of the cycle -- see figure 1.
MTTF, 1/λ. The mean time to failure.
MTBF (mean time between failures) is also commonly used in place of MTTF and is useful for equipment that will be repaired and then returned to service. MTTF is technically the more correct term mathematically, but the two are equivalent in all but a few situations, and MTBF is the term more commonly used in the power industry.
Figure 1: The bathtub curve, failure rate plotted against time with the three life-cycle phases: infant mortality, useful life, and wear-out.
A supply's reliability is a function of multiple factors: a solid, conservative design with adequate margins, quality components with suitable ratings, thermal considerations with necessary derating, and a consistent manufacturing process.
To calculate reliability -- the probability that a component has not failed by a given time -- the following formula is used:
$$R(t) = e^{-\lambda t}$$
For example, the probability that a component with an intrinsic failure rate of $10^{-6}$ failures per hour will not have failed after 100,000 hours is 90.5%. After 500,000 hours this decreases to 60.7%, and after 1 million hours of use to 36.8%.
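The figures in this example can be reproduced with a few lines of Python. This is a minimal sketch that assumes a constant failure rate; the function name `reliability` is purely illustrative.

```python
import math


def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability that a unit is still working after `hours`,
    assuming a constant failure rate (the exponential model)."""
    return math.exp(-failure_rate_per_hour * hours)


FAILURE_RATE = 1e-6  # failures per hour, the value used in the example

for hours in (100_000, 500_000, 1_000_000):
    print(f"{hours:>9,} h: R = {reliability(FAILURE_RATE, hours):.1%}")
```

Because the exponent is simply the product of failure rate and time, the same function serves for any unit of time as long as both arguments use it consistently.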
Going through the mathematics reveals some interesting realities. First, with a constant failure rate, survival decays exponentially, so only about 37% of the units in a large group will last as long as the MTBF figure. Second, for a single supply, the probability that it will last as long as its MTBF rating is likewise only about 37%. Third, put another way, there is only a 37% confidence level that it will reach its MTBF rating. Additionally, half the components in a group will have failed after just 0.69 of the MTBF.
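Both the 37% and the 0.69 figures follow directly from the reliability formula. A short derivation, assuming a constant failure rate so that MTBF = 1/λ (the symbol $t_{1/2}$ is introduced here for the time at which half the units have failed):

```latex
% Survival probability at t = MTBF = 1/\lambda
R\!\left(\tfrac{1}{\lambda}\right) = e^{-\lambda \cdot \frac{1}{\lambda}} = e^{-1} \approx 0.368

% Time at which half the units have failed
R(t_{1/2}) = e^{-\lambda t_{1/2}} = \tfrac{1}{2}
\quad\Longrightarrow\quad
t_{1/2} = \frac{\ln 2}{\lambda} \approx 0.693 \times \mathrm{MTBF}
```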
Figure 2: Curve showing the probability that a component is still operational over time.
It should also be noted that this formula and curve can be adapted to calculate the reliability of a system:
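$$R(t) = e^{-\lambda_A t}$$
where $\lambda_A$ is the sum of all the components' failure rates, each multiplied by the number of that component used ($\lambda_A = \lambda_1 n_1 + \lambda_2 n_2 + \dots + \lambda_i n_i$).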
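As a rough illustration of the system formula, the sketch below sums per-component failure rates weighted by quantity and feeds the total into the same exponential. The part names, failure rates, and quantities are hypothetical placeholders, not values from any datasheet, and the model assumes that any single component failure takes down the whole supply.

```python
import math

# Hypothetical bill of materials: name -> (failure rate per hour, quantity).
# Illustrative numbers only; real values would come from component data.
PARTS = {
    "bulk capacitor": (2.0e-7, 2),
    "switching MOSFET": (1.0e-7, 1),
    "optocoupler": (5.0e-8, 1),
}


def system_failure_rate(parts: dict) -> float:
    """Sum of each component's failure rate times its quantity."""
    return sum(rate * count for rate, count in parts.values())


def system_reliability(parts: dict, hours: float) -> float:
    """Probability the whole system survives `hours`, assuming constant
    failure rates and that any single failure stops the system."""
    return math.exp(-system_failure_rate(parts) * hours)


lambda_a = system_failure_rate(PARTS)
print(f"System failure rate: {lambda_a:.2e} failures per hour")
print(f"System MTBF:         {1 / lambda_a:,.0f} hours")
print(f"R(100,000 h):        {system_reliability(PARTS, 100_000):.1%}")
```

Adding a part to the dictionary can only raise the total failure rate, which is why component count matters as much as individual component quality in this model.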