Engineer grapples with consequences of vaguely-specified characteristics
One problem had been haunting my then-company for years. Among the many thousands of pieces of equipment deployed in the field, once in a while a card that was a major part of an optical network would reset itself, causing a service interruption. The card was hot-swappable, and the Hot Swap IC, which also functioned as an electronic circuit breaker, was supposed to trip at 150% of the card's maximum operational current.
The card held a lot of optics and electronics, but for this story it can be viewed as a board containing four identical laser drivers, each built as a DC-DC converter around a PWM (pulse-width modulator) IC. All event logs indicated that the resets were due to a circuit breaker trip, but no substantial current rise was recorded prior to the reset. The Hot Swap IC was immediately suspected to be the culprit and, despite the fact that the IC manufacturer had sworn that no other customer had ever encountered this problem, our team did indeed once find a defective IC.
After that, the case was closed without an explanation of how, among the thousands of ICs distributed globally, the defective ones repeatedly hit only our company. As it happened, my investigation of an unrelated issue on this card shed some light on that case. The “unrelated” issue, described previously in EDN1, was a blown capacitor: the same capacitor had exploded on three similar cards. The capacitor belonged to a cluster of six identical capacitors, all sitting in parallel on the same power plane. My first hypothesis was that an excessive ripple current had overheated the capacitor, so I decided to measure the ripple current through a capacitor expected to fail.
Figure 1. The current through (green) and the voltage across a capacitor similar to the failed ones.
The ripple current was ten times below the maximum but, not having a better idea and still clinging to the excessive-current theory, I decided to compare the ripples across all of the capacitors. So, using a multichannel scope, I measured the ripple voltage across capacitors that belonged to the other DC-DC converters. I triggered the scope on the suspected capacitor and got a perfect picture: the ripples on all channels were clearly visible. Unfortunately, the amplitudes did not differ much (Figure 2 shows the ripples for two capacitors; the yellow trace is the candidate to fail). Looking at it, I felt that something was very strange, but what? I could not immediately answer. I went home feeling that I’d witnessed something surreal.
Figure 2. Voltage ripples across two input capacitors; the scope is continuously triggered on channel 1.
Honestly, I woke up in the middle of the night realizing what was wrong: the picture was too perfect. The ripples on all channels were nicely aligned, which meant that all four PWM oscillators were running at the same frequency and were synchronized. But how could that be? Each PWM frequency was set by its own resistor and capacitor. Scratching what was left on my head, I looked into the PWM data sheet. The situation had become interesting: the data sheet stipulated an operational frequency of 100 kHz, but also mentioned that 500 kHz is “practically possible.” Needless to say, each PWM on the board was set (by its individual RC) to operate at 500 kHz.
The actual frequency was only ~400 kHz, a hint that the oscillators were close to their limits. The next day, I powered the board but saw only one static channel on the scope; the others were smeared out. The synchronization was gone! At that moment it became clear that the design, which looked okay on paper, was operating in a region where board parasitics, ground bounce, current loops, EMI, and who knows what else joined together and overrode each individual RC, creating a ghostly connection between the oscillators. Still, this was not enough to trip the circuit breaker, as these ripples were faster than the breaker's reaction time of ~20 µs. Each DC-DC converter powered a laser and was managed by its own control loop. Investigating further, I saw the following:
Figure 3. Control loop outputs were also synced!!
The control loops (the signals that set the DC-DC outputs) were slowly oscillating (~2.5 Hz) at a level that would not contribute much to current consumption. But the clearly visible steps would push the DC-DC converters into a transient response, with a current inrush slow enough (a few tens of µs) to trip the circuit breaker but too quick for a slow data logger to catch. I did not observe a reset that day, which was not surprising: it had been a rare event even across the vast amount of equipment operating in the field 24/7.
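The time-scale argument above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the 400 kHz ripple frequency and the ~20 µs breaker reaction time come from the measurements described here, while the 30 µs transient duration is an assumed stand-in for "a few tens of µs," and the rule that a disturbance must last at least as long as the reaction time is a simplification, not the Hot Swap IC's actual behavior.

```python
# Rough time-scale comparison; not a model of the real Hot Swap IC.
BREAKER_REACTION_S = 20e-6   # ~20 us breaker reaction time (from the text)
RIPPLE_FREQ_HZ = 400e3       # measured PWM frequency (from the text)
TRANSIENT_S = 30e-6          # ASSUMED: "a few tens of us" taken as 30 us


def can_trip(duration_s: float, reaction_s: float = BREAKER_REACTION_S) -> bool:
    """Simplified rule: an overcurrent must last at least the reaction time."""
    return duration_s >= reaction_s


ripple_period_s = 1.0 / RIPPLE_FREQ_HZ  # duration of one switching cycle

print(f"ripple period: {ripple_period_s * 1e6:.1f} us, "
      f"can trip: {can_trip(ripple_period_s)}")
print(f"transient:     {TRANSIENT_S * 1e6:.1f} us, "
      f"can trip: {can_trip(TRANSIENT_S)}")
```

Under these assumptions, a single switching cycle is roughly eight times shorter than the breaker's reaction time, while the loop-induced transient is longer than it.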
So why were the resets so rare? To create the worst case, it would not be enough for the control loops to sync; they would have to sync in the same direction. Figure 3 shows all four loops synced, but three go in one direction and one (yellow) goes the opposite way. Typical card operation did not require much current; the usual scenario in the field was below 40% of maximum. The bulk capacitance and its ESR could vary from board to board by tens of percent, which could increase the ripple amplitude on some cards.
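To get a feel for how much the "same direction" condition alone thins out the worst case, one can simply count step-direction combinations. The sketch below assumes, purely for illustration, that each of the four loops steps up or down independently and with equal likelihood, and that the worst case is all four stepping toward higher current. The real loops were coupled through the board, so the actual odds are unknown.

```python
from itertools import product

NUM_LOOPS = 4  # four laser-driver control loops (from the text)

# Every combination of step directions: +1 = current rising, -1 = falling.
# ASSUMPTION: directions are independent and equally likely.
combos = list(product((+1, -1), repeat=NUM_LOOPS))

# Assumed worst case for the breaker: all loops demand more current at once.
worst = [c for c in combos if all(d == +1 for d in c)]

# The pattern seen in Figure 3: three loops one way, one the other way.
three_vs_one = [c for c in combos if abs(sum(c)) == 2]

print(f"{len(worst)} of {len(combos)} combinations are worst case")
print(f"{len(three_vs_one)} of {len(combos)} are three-against-one")
```

Even in this toy count, the three-against-one pattern captured in Figure 3 is far more common than the fully aligned case, before the power level and ESR conditions are taken into account.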
So a few events had to come together: all the PWMs had to be synced, all four loops had to oscillate in unison, and the card had to be operating at higher power on a board with elevated ESR. By the time of this investigation, the cards, being obsolete, were gradually being pulled from the field and replaced. My only rewards were the satisfaction that the ghost who had been elusive for so many years was finally caught, and the lesson learned: for reliable performance, always stay away from vaguely specified characteristics and try hard to look beyond what is shown on paper.
Samuel Kerem works at the Applied Physics Lab of Johns Hopkins University.