While the overriding importance of thermal design is well understood by the experts who specialize in the field, its finer nuances still baffle most hardware development professionals. The fact that most hardware engineers are introduced to thermal engineering shortly after being scorched by smoking components is only partly to blame. Thermal management has yet to become a standard part of a well-balanced electronics engineering curriculum, creating a need for other forms of education.
This article attempts to address that gap by providing an overview of the key aspects of thermal management. It explains the critical factors in producing a thermally robust hardware design, so that designers can plan their projects properly. Most importantly, the relationships among cooling, reliability, and robust design are examined closely to help engineers of all backgrounds sharpen their intuition.
Today's hardware landscape
Very little can be said about the state of today's leading-edge technologies that was also true a decade ago, except maybe for one thing. To stay in the business of technology, in addition to walking on water, you have to keep packing more into less. By that, I mean you have to offer more ports within the same footprint, put a whole shelf's worth of capability in one pizza box, and squeeze a full cabinet's worth of hardware into a pair of slots.
But there are serious side effects to this practice, not the least of which is deciding what to do about cooling. You may have heard the saying: "Once it smokes, strap a fan on!" But before you do, consider that the success of your product in the field will depend on the degree to which your approach contradicts that infamous adage. So, let's examine the situation in more detail.
This lasting drive to miniaturize is playing out at every level of the design. Devices have to deliver greater functionality within a higher density, lower profile package. Boards have to accommodate a larger number of components, and be subjected to finer spacing rules. Shelves have to house more boards that are slotted at a lower pitch. Frames have to pack it all within a smaller footprint. Let's face it; everything is getting squeezed. And it can't be all that cool!
From a thermal management perspective, there is a significant price to pay for this increase in functionality. To get a sense of the real story, take a look at the heat flux in your equipment. Only a few years ago, packing 10-30 kW/m2 into electronic equipment was unheard of. But we are talking about reaching levels twice that high within a couple of years. Device operating frequency and gate count are also increasing in a compounded fashion. And where these factors lead, power dissipation follows. Pretty soon we will all be specifying a good number of 2.4 GHz, 160 W processors for our low-speed boards! But before we do, let's talk thermal.
General factors in thermal management
When it comes to thermal management, the science is very straightforward and can easily become second nature to most designers. But the process of thermal design may not be as intuitive. Since we want to change that, let's examine a handful of key factors:
- Time-to-market considerations
- Equipment reliability objectives
- Equipment thermal environment and temperature limits
- Thermal design approach and methodology
- Methods of cooling & related constraints
Impact on time to market
In the good old technology boom days, we learned that "Time-To-Market" was the key to a new product's success. The question to ask is simply: "Who wants to be second to market, and who is willing to accept lower market share as the consequence?" Not us! So, it is best to avoid last minute problems -- such as the one that kept one of our customers from shipping $20M worth of product until we helped them resolve a simple thermal problem (Figure 1).
Figure 1. The dense packaging led to inadequate airflow, causing test failures that prevented beta shipments. An ultra-low-profile heat sink (right) was added, and fan placement was optimized to reduce the worst-case temperature by 18 °C, eliminating the test failures. Shipment to customers was, however, delayed by thirty days.
From a thermal management perspective, avoiding problems early is the critical way we contribute to the overall health of the project. We should all know by now that the cost of correcting problems grows dramatically as the design matures. A board re-spin or a card rack layout change beyond the beta stage will set you back by astronomical sums. The best strategy is to catch problems in the concept stage, where they cost pennies to fix. But believe me, some still wait until much later in the process to do their thermal homework. Just remember that the cost of failing the inevitable thermal test will bite much harder than the few hundred dollars it costs to do the analysis up front.
Impact on reliability
Very few of us operate without competition nowadays. And the real impact of product failure is felt when the next set of big orders go to the competitor who is doing a better job designing reliable systems. For larger systems, getting your product through the "Customer Evaluation" hurdle, where reliability and functionality get evaluated, is a prerequisite to your success. Any failures, especially during this critical period, will give new meaning to the phrase "getting egg on your face!" So let's revisit what temperature acceleration can do to your system from a reliability perspective.
For your system to operate, N components have to perform reliably. In probability modeling, assuming independent failures, the probability that the system survives a given interval (t) is the product of the individual components' survival probabilities. So a system with only five components, each having a 0.02 chance of failure, has roughly a 90 percent chance of survival. This drops to about 82 percent if the same system uses ten components. It's a wonder anything works!
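The series-reliability arithmetic above can be sketched in a few lines of Python. The 0.98 per-component survival probability is the illustrative figure from the text, not real device data:

```python
def system_survival(p_component: float, n_components: int) -> float:
    """Probability that a series system of n independent components
    all survive a given interval, each with survival probability p."""
    return p_component ** n_components

# Each component has a 0.02 chance of failure (0.98 survival):
print(round(system_survival(0.98, 5), 3))   # 5 components -> ~0.904
print(round(system_survival(0.98, 10), 3))  # 10 components -> ~0.817
```

The multiplicative form is why failure rates that look negligible per device become significant at the system level.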
Of course, real component failure rates are much lower than those in this simple example. Your takeaway should be that with today's complex systems, we must pay close attention to device reliability. But device reliability is a function of operating temperature, and the relationship between temperature and failure rate has been well characterized.
Equation 1 -- Temperature acceleration factor (Arrhenius model): AF = exp[(Ea/k) x (1/T1 - 1/T2)], where Ea is the device activation energy, k is Boltzmann's constant, and T1 and T2 are the reference and elevated junction temperatures in Kelvin.
Unless you are writing your doctoral thesis, you don't actually need to plug numbers into the above equation. You can always use the well-known rule of thumb that for every 10 °C increase in junction (operating) temperature, you can expect a 1.5-2x increase in device failure rate; keep in mind that for a large system, that factor makes a very big overall difference. The point is that even if you are designing well within the envelope, you still need to consider the importance of running everything as cool as possible. And you can only decide what running cool means after you have done your thermal analysis.
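Both the Arrhenius acceleration factor and the 10 °C rule of thumb are easy to sketch in code. Note the 0.7 eV activation energy below is an illustrative assumption for this example, not a datasheet value; real activation energies vary by device and failure mechanism:

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_af(t1_c: float, t2_c: float, ea_ev: float = 0.7) -> float:
    """Arrhenius temperature acceleration factor between junction
    temperatures t1 and t2 (deg C). ea_ev is the activation energy;
    0.7 eV here is an illustrative assumption, not a datasheet value."""
    t1, t2 = t1_c + 273.15, t2_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t1 - 1.0 / t2))

def rule_of_thumb_af(delta_t_c: float, factor_per_10c: float = 2.0) -> float:
    """Failure-rate multiplier using the 1.5-2x per 10 deg C rule."""
    return factor_per_10c ** (delta_t_c / 10.0)

# A 10 deg C rise from 85 to 95 deg C, both ways:
print(round(arrhenius_af(85.0, 95.0), 2))  # in the 1.5-2x ballpark
print(rule_of_thumb_af(10.0))              # 2.0
```

The two agree to within the rule of thumb's stated range, which is why the shortcut works for back-of-the-envelope reliability budgeting.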
Thermal environment and temperature limits
Let's get specific by looking at the case of air-cooled systems. In general, the equipment's operating environment should give us a clue about our thermal design challenge. Simply put, it all comes down to device operating temperature limits -- a.k.a. the maximum allowable device junction temperature (Figure 2). Device manufacturers publish this information along with other data, such as device power dissipation and package thermal characteristics, to equip you with what you need for your analysis (see Equation 2 -- device junction temperature). But the ultimate key to finding the junction temperature is determining the temperature and velocity of the approach air for any given device. How to do that is covered in the sections that follow.
Figure 2 - Device-level macrograph showing the component die (left) and its thermal map (right). Two distinct hot spots appear after power is applied. The uneven temperature distribution makes for a more challenging thermal design. Pictures were acquired using liquid crystal thermography, which resolves features down to about one micron.
Equation 2 -- Simple model, device junction temperature: Tj = Ta + (θja x P), where Ta is the local ambient (approach air) temperature, P is the device power dissipation, and θja is the junction-to-ambient thermal resistance of the package.
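The simple junction-temperature model can be captured in a one-line function. The device numbers below are hypothetical, chosen only to show the arithmetic:

```python
def junction_temperature(t_ambient_c: float, power_w: float,
                         theta_ja_c_per_w: float) -> float:
    """Simple model: Tj = Ta + theta_ja * P, where theta_ja is the
    junction-to-ambient thermal resistance (deg C/W) at the device's
    local approach-air temperature and velocity."""
    return t_ambient_c + theta_ja_c_per_w * power_w

# Hypothetical device: 55 deg C local air, 8 W, theta_ja = 4.5 C/W
tj = junction_temperature(55.0, 8.0, 4.5)
print(tj)  # 91.0 -> compare against the maximum allowable junction temp
```

The catch, as the text notes, is that θja is only meaningful at a known approach-air condition, which is exactly what the flow analysis has to supply.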
Thermal design approach and methodology
Let's look at our problem in layers. At the highest level, for every electronic system, there is a set of operating environment assumptions that describes the worst-case scenario. For some equipment, such as telecommunication products, the operating conditions are well controlled by the customer's existing office environment. But for others, as in the case of laptops, combat aircraft, or automobiles, these conditions can undergo extreme shifts, adding to your challenge. Suffice it to say that if you conduct worst-case analysis at the card rack, board, and device levels, you can rest assured of having a robust design. And while you must confirm your predictions through testing, it is very important to understand the potential problems well before the design has solidified.
Bay & card rack design considerations: The critical factors can range from the obvious, such as physical architecture, power dissipation profile, and fan placement, to the less obvious, which include venting, adjacent system spacing, or even the color of the enclosure (which can affect surface emissivity characteristics). What we are after is to develop an understanding of the volumetric flow within the system and decide where the hotspots may be located. Airflow, like electron flow, follows the path of least resistance. So, an uneven distribution of airflow passageways can deprive certain portions of the system of the benefit of convective cooling.
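The electron-flow analogy can be made concrete with a toy parallel-resistance model. This is intuition only, under a deliberately simplified assumption: real air paths have pressure drops that scale roughly with flow squared, so a linear resistance model understates how starved a blocked path becomes:

```python
def flow_split(resistances: list[float], total_flow: float) -> list[float]:
    """Split a total volumetric flow across parallel passageways using a
    linear resistance analogy (flow ~ 1/R), like current dividing among
    parallel resistors. A sketch for intuition, not a design calculation."""
    conductances = [1.0 / r for r in resistances]
    g_total = sum(conductances)
    return [total_flow * g / g_total for g in conductances]

# Three parallel card slots; one is half-blocked (double resistance):
print(flow_split([1.0, 1.0, 2.0], 100.0))  # blocked slot gets only 20%
```

Even in this linear sketch, the restricted path quietly loses half its airflow while the fans keep spinning, which is exactly how hidden hotspots form.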
Computational Methods: To get a good picture of our airflow distribution, we can utilize a combination of experimental and computational techniques. Computational Fluid Dynamics (CFD) comes in the form of standard software programs sold at reasonable prices. Generally speaking, these programs are better suited for component level (rather than system level) analysis. And, they are best utilized early in the conceptual design stage, as a way to avoid problems. My favorite technique is direct numerical simulation, which can lead to a very accurate picture at all levels of the design both quickly and inexpensively (Figure 3).
Figure 3 - DNS model of power supply with variety of heat sinks. Notice the uneven flow pattern being predicted, leaving some heat sinks without the convective cooling benefits of higher speed flows.
Experimental Techniques: In addition to computational methods, experimental techniques can be used to get a speedy determination of air velocity profile within the equipment. An early version of the product or a geometrical mockup is adequate to take your measurements. Also, wind tunnel or water flow analyses can be used where better flow visualization is required.
The key here is to take enough measurements to be able to find the actual problem region. Some problem spots are not readily detectable. And the secret to finding them is to stick to non-intrusive measurement techniques (such as micro sensors) that do not change the flow characteristics (Figure 4). And once there is a good understanding of the airflow profile, we have enough to refine our focus to look at problem components more closely. The equipment mock-ups I mentioned earlier can also be used to conduct "what if" analyses to optimize component placement from a fluid flow perspective (Figure 5).
Figure 4 - Multiple stagnation points and flow reversals occur around components. These are the high pressure, high temperature regions of the board. To identify these problem spots, a non-intrusive measurement technique is the key. Micro sensors, such as air temperature-velocity sensors from ATS, are ideal for taking such accurate measurements.
Figure 5 -- Wind or water tunnel analysis can be done to optimize board layout. The original layout (top) was changed with minimal cost to allow for better alignment from an airflow perspective (bottom).
Methods of cooling and related constraints
Regardless of which techniques you use, your thermal analysis is complete when you have done the following:
- You have a good understanding of air velocity profile within the system.
- You have identified all potential hotspots.
- You have verified your temperature/velocity predictions through actual measurement (with no more than a 10% error).
Once you know the problem components, you need to decide what to do with them. For air-cooled systems, fan optimization tends to solve a variety of the problems. And some components may require you to find an off-the-shelf heat sink, costing a few pennies.
Variations of Equation 2 should give you a sense of how low a thermal resistance you need in a heat sink. If you need a custom-made, high-performance heat sink, there are many manufacturers, such as ATS, who can design one for you rather quickly. Over the years, we have never seen a "reasonable" problem that could not be fixed through a combination of high-performance heat sinks and flow optimization. Just keep in mind that not all heat sinks are created equal. There are many low-profile, high-performance heat sinks on the market today, and the best are the fan-tailed designs (Figure 6).
Figure 6. Performance heat sinks can achieve very low thermal resistance at low range air velocities. In high-density designs, a small performance difference can make or break the packaging concept. (Source: ASME Conference -- June 2000)
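One such variation of Equation 2 budgets the largest sink-to-ambient resistance a heat sink may have. The device values below are hypothetical, stand-ins for what a real datasheet and interface material would supply:

```python
def required_theta_sa(tj_max_c: float, t_ambient_c: float, power_w: float,
                      theta_jc_c_per_w: float, theta_cs_c_per_w: float) -> float:
    """Rearranged junction-temperature budget: the maximum sink-to-ambient
    thermal resistance that keeps the junction at or below tj_max.
    theta_jc (junction-to-case) comes from the device datasheet;
    theta_cs is the case-to-sink interface resistance."""
    return (tj_max_c - t_ambient_c) / power_w - theta_jc_c_per_w - theta_cs_c_per_w

# Hypothetical: Tj_max 105 C, 55 C air, 20 W, theta_jc 0.8, interface 0.2
print(required_theta_sa(105.0, 55.0, 20.0, 0.8, 0.2))  # 1.5 C/W
```

The answer is then matched against vendor heat sink curves at the approach-air velocity your flow analysis predicts, which is why the flow work has to come first.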
The bottom line
So, the next time you begin a new project, here are the things you should keep in mind:
- Plan thermal design into the hardware development process. Put aside a couple of weeks of up-front work, followed by 3-4 weeks of verification and optimization. Add a few extra days if you need to find heat sinks or move a few components. All in all, you should plan on 5-8 weeks of focused thermal work over a one-year, system-level development project.
- Do predictive analysis up front if you have the budget. There are many modeling techniques that can help you predict problem spots well before the design solidifies. DNS techniques can be applied rather quickly and can yield temperature predictions within 5-10% of actual values, well before corrections become costly. It costs about $3,000-$5,000 to characterize a problem board through DNS, which is very reasonable considering the unmentionable alternative.
- Do performance verification -- If you've done your thermal design work well, then verification can actually be fun. The trouble spots you identified are the best place to begin. Power up the system and keep the extremes of your intended operating environment in mind. Blocking one of the fans to simulate a common failure condition does not hurt either. What you are looking for is a good thermal profile across the system. Remember that the science of predicting either the airflow or the thermal profile within a card rack is not exact; the flow is not static. When you have verified that your measurements are no more than 10 percent off from your predictions, you are done. Otherwise, it pays to look over your numerical analysis one more time, and be sure to increase the number of measurement points.
- Finally, if you have trouble with certain components, consider changing the flow characteristics, or pick a suitable heat sink. Custom heat sinks can cost a few dollars apiece, but they are often the only way to get power hungry components within the envelope. And don't be shy to ask for help; there are many good professionals who can help you solve the problem early, when it is cheapest to resolve.
Ben Ghamami is the vice president of corporate accounts at ATS. Ben has over 17 years of experience in product development, manufacturing and marketing. In addition to his current role, Ben has served in a variety of engineering and management positions at AT&T Bell Laboratories, Lucent Technologies and Kian Networks. Ben may be reached at firstname.lastname@example.org or 781-760-2800.
Advanced Thermal Solutions (ATS) offers complete thermal management solutions to the electronics industry. ATS offerings include design, testing, and verification services, instrumentation, heat sinks, and other cooling solutions.