Why never to put too much stock in a hardware engineer's claim that the hardware is functioning correctly
When I started working for General Electric in 1980, my first assignment was to modify the operating system on one of their flight-control computers to support some additional I/O functions. But first, I had to get the system to run reliably.
The flight-control computers were a proprietary design and custom manufactured. As a result, they were very expensive and spare parts were hard to obtain, which is to say I had one and only one system on which to perform my testing and debugging.
This particular unit would run reliably for a while and then get a Watchdog Timer interrupt at seemingly random times. After hanging a logic analyzer on the CPU, monitoring the Memory Address and Memory Data buses, I found that one location was mysteriously changing from a JMP .+1 (NOP) to a JMP 0 instruction. JMP 0 is a tight loop, so the system would hang until the Watchdog Timer timed out. Program memory was losing a single bit, but that was enough to cause the problem.
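To illustrate how a single lost bit can turn a harmless instruction into a hang, here is a minimal sketch. The story does not give the processor's real instruction encoding, so everything here is assumed for illustration: a 16-bit word, an opcode in the high byte (the value `0x4C` is made up), a relative offset in the low byte, and a reading of "JMP 0" as a jump with relative offset zero.

```python
# Hypothetical encoding, for illustration only: the real machine's
# instruction format is not given in the story.
JMP_OPCODE = 0x4C  # made-up opcode value


def encode_jmp(offset: int) -> int:
    """Pack a relative jump into a 16-bit word: opcode high, offset low."""
    return (JMP_OPCODE << 8) | (offset & 0xFF)


def differing_bits(a: int, b: int) -> list[int]:
    """Return the bit positions at which two 16-bit words differ."""
    diff = a ^ b
    return [i for i in range(16) if (diff >> i) & 1]


def next_pc(pc: int, word: int) -> int:
    """Apply the jump: the new program counter is pc plus the offset."""
    return pc + (word & 0xFF)


nop_word = encode_jmp(1)   # JMP .+1: falls through to the next instruction
hang_word = encode_jmp(0)  # offset 0: jumps to itself forever

print(differing_bits(nop_word, hang_word))
print(next_pc(100, nop_word), next_pc(100, hang_word))
```

Under these assumptions the two words differ in exactly one bit, and clearing that bit leaves the program counter stuck at the same address until something external, such as a watchdog, intervenes.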
The processor used a partitioned memory scheme, and the instruction for writing code memory was enabled only in supervisor mode. Yet, the mysterious change was occurring while the processor was in user mode. Software should not be able to cause the problem, so I suspected a hardware fault. I described the problem to the hardware engineer, and his immediate reaction was, “It’s a software problem.” He had not encountered a similar problem before.
One of the peculiarities of the setup was that the system ran on 75 V, 400 Hz power, which was only available in the lab. One inverter supplied power to the entire bench, which had several stations. I persisted in pursuing the idea of a hardware fault, suggesting that a glitch in the 400 Hz power could result in that one bit being reset. The memory was static RAM, but had a backup NiCad battery to retain memory in case of a power loss. The hardware engineer grudgingly tested the terminal voltage on the battery pack and declared it good. The problem was mine.
I resigned myself to monitoring the system and trying to think of ways that the location could be getting modified. I poked here and there, but never got any solid leads. There did not seem to be any discernible pattern to the times at which it would hang. It would run for 20 minutes and then halt; restart, another 20 minutes (or so), and halt. Finally, at my wits’ end, I simply sat back in my swivel chair and stared at the ceiling. The air conditioning compressor on the roof started up, and just then the system halted.
Hah! I went back to the hardware engineer and described my new insight. He replaced the supposedly good battery pack with a brand-new one and the problem went away.
Lessons Learned: I had assumed that there was no correlation between the timeout events, but had I taken careful notes and recorded the time at which each occurred, I might have noticed a pattern. The intervals would have been longer in the mornings when the air was cooler, and shorter in the hotter afternoons. Simply recording the time of timeout events over the course of a day would seem rather far from my assigned task of making software modifications, but in the end I spent more than a week trying to debug this hardware fault. It would have been time well spent.
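The kind of note-taking described above needs very little tooling. The following is a minimal sketch, not anything used at the time: it assumes each run is recorded as a (restart time, halt time) pair, and the function name, the noon cutoff, and the sample log are all invented for illustration.

```python
from datetime import datetime
from statistics import mean


def mean_run_minutes(runs):
    """Average run length in minutes, grouped by when the halt occurred.

    `runs` is a list of (restart_time, halt_time) datetime pairs.
    Halts before noon count as "morning", the rest as "afternoon".
    """
    buckets = {"morning": [], "afternoon": []}
    for restart, halt in runs:
        minutes = (halt - restart).total_seconds() / 60
        period = "morning" if halt.hour < 12 else "afternoon"
        buckets[period].append(minutes)
    # Skip periods with no recorded halts to avoid averaging an empty list.
    return {period: mean(v) for period, v in buckets.items() if v}


# Made-up sample log on an arbitrary date; these are NOT measurements
# from the system in the story.
day = (2000, 1, 1)
sample_runs = [
    (datetime(*day, 9, 0), datetime(*day, 9, 40)),
    (datetime(*day, 9, 45), datetime(*day, 10, 29)),
    (datetime(*day, 14, 0), datetime(*day, 14, 15)),
    (datetime(*day, 14, 20), datetime(*day, 14, 37)),
]

print(mean_run_minutes(sample_runs))
```

A large gap between the two averages would point toward something that varies over the day, such as how often the air conditioning cycles, rather than toward the software.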
Working only with a logic analyzer allowed me to isolate the manifestation of the fault but not its cause. If I had used an event trigger on the logic analyzer as an external trigger on an oscilloscope hooked to the power supply rails inside my computer, I would have seen the power glitch in black and green. I should have followed my hunch and tracked down the test equipment I needed to confirm or refute it.
I put too much stock in the hardware engineer’s claim that the hardware was functioning correctly. Check your assumptions (especially if they are generated by someone else)!
I learned early that transient errors are notoriously difficult to debug. And they are often an indication that a component is about to fail completely. If you can eliminate the problem by swapping in a known-good component, that’s good enough. In this case, the hardware engineer’s spare battery pack would have been a better diagnostic tool than his digital voltmeter.
Tom Hildebrandt has been a software professional for over 30 years and is currently employed by Microsoft. He has designed custom ICs and instruction sets, implemented C and C++ compiler components and developed other applications. He has three US patents.