This peculiar case of debugging happened way back, 30 years ago, when we were debugging the networking software for an 8080-based microcomputer system.
At that time, recompiling the code every now and then was not possible because of the time it took to assemble the whole code and create a new binary. So the normal practice was to keep a patch list, which we updated as each new bug was found and fixed. The patches had to be entered manually.
For a few days we observed that our software would run fine in the morning and start behaving erratically as the day progressed. As the test hardware had already been debugged and had not been changed for a couple of months, the only suspect was the software.
This was frustrating, as at the end of every day we were back at the same point where we had started.
The hardware guys were in a relaxed mood, as they had no work until we got past our debugging stage.
On one such idle afternoon, one of the hardware guys was looking curiously at the main board and trying to clear the dust off the ICs when he noticed something peculiar. In the memory bank there were ICs with different access times (some with 15 ns and some with 22 ns, if I remember correctly).
This speed difference worked fine in the morning, when the ambient temperature was around 20 °C, but as the day progressed and the ambient temperature reached around 40 °C, the mismatch in access speed would become apparent and create havoc with the software, causing a runaway condition.
You can imagine how furious we were with the hardware group when this bug was found.
You mean the hardware guys never tested their goodies in a hot environmental chamber or even used a hot-air-gun/hair dryer? Tsk Tsk!
A poor-man's environmental chamber is so easy to home-brew with a table saw. Build a plywood box with a hinged door large enough for the UUT, line it with styrofoam insulation covered with grounded tinfoil (for ESD), place an electric space heater and a fan inside and drill a hole to insert a glass thermometer. If you want to get fancy use a temperature probe and a feedback loop. For a small UUT a picnic cooler works well. For cold, remove the heater and blow the fan through a minnow-trap wire basket filled with dry ice nuggets. When thawing out you are guaranteed 100% humidity too.
I remember a debug session where the watchdog timer occasionally timed out. It was a long 1 second ripple counter chain. Using an analog scope we managed to see the fault occur and it looked like the watchdog timed out prematurely. It had been designed by a junior, so I counted up the number of flipflops. The junior had designed it to provide a period of 1 second - he forgot that the timeout occurs on the rising edge HALFWAY through the period.
It took 5 minutes with the knife and soldering iron to cut in another flipflop in the counter IC - problem fixed.
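The arithmetic behind that fix can be sketched in a few lines. This is only an illustration: the 32768 Hz clock and the stage counts are assumed values, not details of the original design, and it assumes the counter is cleared to zero on each kick with the timeout taken from the last stage's first rising edge.

```python
def ripple_counter_timing(clock_hz, stages):
    """Return (output_period_s, first_rising_edge_s) for the last stage.

    Assumes the chain is cleared to zero and counts up, so the last
    stage goes high after 2**(stages - 1) clocks: half its full period.
    """
    period_s = (2 ** stages) / clock_hz
    first_rise_s = (2 ** (stages - 1)) / clock_hz
    return period_s, first_rise_s


CLOCK_HZ = 32768  # hypothetical clock, chosen only to give round numbers

for stages in (15, 16):
    period_s, timeout_s = ripple_counter_timing(CLOCK_HZ, stages)
    print(f"{stages} stages: period {period_s} s, timeout {timeout_s} s")
```

With these assumed numbers, a chain sized for a 1 second period times out after half a second, and one extra flipflop doubles the timeout to the intended 1 second.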
Don't feel bad.
I had a problem with a 6301 processor that 'hiccuped' every 2 weeks and required a reset. As the H/W engineer, I was blamed for the problem. But no one, not even the lead engineers, could find the problem.
Eventually another H/W engineer did. Based on prior experience, he looked through the operating code to the CBIT (Continuous Built-In Test) section and found a bug that would manifest itself every two weeks. Part of our BIT requirements was to periodically test every RAM location. The code would save the contents of the location to be tested, disable the interrupts, test it, and then restore the location. The problem was that about once every two weeks, between the save of the location and the disabling of the interrupts, the location's content would be modified by an interrupt! The subsequent restore after testing that location overwrote those modified results with what had been saved prior to disabling the interrupts.
Once the code was changed to disable interrupts before saving the location, the problem disappeared.
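The ordering bug is easy to reproduce in a toy model. The sketch below is not the 6301 code. It is a small Python simulation in which every name (Machine, interrupt_window, exercise_cell and so on) is made up for illustration, and the "interrupt" is injected deterministically at the one spot where the race can occur.

```python
class Machine:
    """Toy model: a few RAM cells plus one pending interrupt handler."""

    def __init__(self):
        self.ram = [0] * 4
        self.interrupts_enabled = True
        self.pending_isr = None

    def interrupt_window(self):
        """A point between instructions where a pending interrupt may fire."""
        if self.interrupts_enabled and self.pending_isr is not None:
            isr, self.pending_isr = self.pending_isr, None
            isr(self)


def exercise_cell(m, addr):
    """Destructive pattern check of one RAM cell."""
    ok = True
    for pattern in (0x55, 0xAA):
        m.ram[addr] = pattern
        ok = ok and m.ram[addr] == pattern
    return ok


def buggy_ram_check(m, addr):
    saved = m.ram[addr]            # save first...
    m.interrupt_window()           # ...so an ISR can still change the cell here
    m.interrupts_enabled = False
    ok = exercise_cell(m, addr)
    m.ram[addr] = saved            # restores the stale value
    m.interrupts_enabled = True
    m.interrupt_window()
    return ok


def fixed_ram_check(m, addr):
    m.interrupts_enabled = False   # disable before saving
    saved = m.ram[addr]
    m.interrupt_window()           # nothing can fire here now
    ok = exercise_cell(m, addr)
    m.ram[addr] = saved
    m.interrupts_enabled = True
    m.interrupt_window()           # the pending ISR runs after the restore
    return ok


def isr_write(m):
    m.ram[2] = 99                  # the ISR updates the cell under test


def run(ram_check):
    m = Machine()
    m.ram[2] = 7
    m.pending_isr = isr_write
    ram_check(m, 2)
    return m.ram[2]
```

In this model run(buggy_ram_check) ends with the old value 7 in the cell, so the interrupt's update is lost, while run(fixed_ram_check) ends with 99.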
Needless to say, I was just as upset with the S/W guys as you were with the H/W guys for the problem they caused.
A very recent one for me initially looked like wayward software, then digital hardware, then analog hardware, and ultimately turned out to be a noise issue. I think noise problems are by far the worst to debug, especially when they create symptoms that strongly suggest a different root cause.
The strangest clue I ever saw, not related to the problem, was with a sine wave oscillator. It was part of a circuit that was being built into a hybrid. The oscillator circuit comprised two feedback loops, one for AGC and another containing the crystal to provide the oscillation. I was called in because, while it would oscillate, the waveform looked like anything but a sine wave. It twisted and turned and spiked, and repeated the same shape every period. Then the tech showed me something cute. He took his examination lamp and brought it near the open hybrid, and suddenly the waveform popped into a perfect sine wave and stayed there as he removed the lamp. It had nothing to do with the problem (which was caused by placing amplitude limiting on the positive peak while AGCing on the negative peak), but it was mind-blowing.
I've had two debugging problems that stumped me for a while, one caused by hardware and one by software.
The hardware problem was with a USB peripheral that kept failing the startup inrush current limit. We were using every bit of the 100 mA we were allowed to draw at startup, and the inrush current test kept failing. This was an 8052-based design with XDATA memory latched I/O. I finally figured out that there was an extra 40 mA of current being drawn whenever the processor was in reset. The Address/Data lines were floating while the processor was in reset. The TTL latches quite naturally didn't like floating inputs and consumed their maximum current (20 mA each). This problem only happened at power up and cleared as soon as the processor came out of reset, so it looked like an inrush current problem. I added pullup resistors to the Address/Data lines and the "Inrush" current problem was solved. We had designed literally hundreds of PCBs with external memory and I/O over the years and had never noticed this behavior before.
I had a software problem that intermittently locked up the microprocessor at power up. One clue was that the program had a very large number of initialized global variables. When I commented out some of these initialized variables the lockup problem went away; when I put them back the problem reappeared (intermittently).
This particular processor came out of reset with the watchdog timer reset enabled. The first line of code in my program turned off the watchdog timer, but the variable initialization code the compiler added took longer to run than the default watchdog timeout period. We were right on the edge of timeout each time the processor was reset. I modified the compiler startup code to disable the watchdog timer reset before the variable initialization code was called. Problem solved.
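A rough model of that startup ordering is sketched below. The timeout and the initialization times are invented numbers chosen to sit near the edge, and the step names are hypothetical. It only shows why moving the watchdog disable ahead of the variable initialization, or trimming the initialized variables, changes the outcome.

```python
WDT_TIMEOUT_MS = 16.0  # hypothetical default watchdog period


def survives_startup(steps, wdt_timeout_ms=WDT_TIMEOUT_MS):
    """Return True if startup reaches the watchdog disable before it fires.

    steps is an ordered list of (name, duration_ms) pairs.
    """
    elapsed_ms = 0.0
    for name, duration_ms in steps:
        if name == "disable_watchdog":
            return True            # once disabled, it can no longer reset us
        elapsed_ms += duration_ms
        if elapsed_ms >= wdt_timeout_ms:
            return False           # watchdog reset during startup
    return True                    # never disabled, but these steps finished in time


# Hypothetical timings for the three situations described above.
original = [("init_globals", 18.0), ("disable_watchdog", 0.0)]
fewer_globals = [("init_globals", 15.0), ("disable_watchdog", 0.0)]
patched = [("disable_watchdog", 0.0), ("init_globals", 18.0)]
```

With these assumed numbers, original fails while fewer_globals and patched both start up. A real part sitting right at the limit would fail only some of the time, which matches the intermittent lockup.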
A similar thing affects the MSP430 line of microcontrollers; there's a debate between those who argue that the default watchdog state can be ON, and those who point out that complicated init sequences can, under random circumstances, take longer than the default watchdog timeout, and so the default has to be off. FWIW, MSPGCC chose the latter option.
My biggest debug problem involved a system with about 60 PCBs, no system level schematic, multiple DSP processors and intermittent communication problems on strangely terminated buses (kind of like RS485).
Since the bus looked terrible, improving termination seemed like the proper attack. Nope, with "nice" looking bus signals the system didn't work.
I "patched" the system with a 20pF cap on a digital signal line. Can't say I fixed it. Why a 20pF cap on a digital signal improves the operation requires more room than I can write here.
If it had separate clock and data, possibly they were clocking on the 'wrong' edge. Clean signals would give maximum uncertainty, but delaying either the clock or data (with 20pF) gave it just enough timing margin to work. This can also occur with an SPI bus.
I even recall a datasheet that called up the wrong polarity for the clock phase bits, so most systems were running with almost the worst-case margins!
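The wrong-edge guess above can be put into numbers with a back-of-envelope sketch. Everything in it is assumed for illustration: the 200 ohm source resistance, the 100 ns clock period and the 5 ns setup and 1 ns hold requirements do not come from the system described. The added delay is approximated as ln(2)*R*C for a single-pole RC reaching the 50% threshold.

```python
import math


def rc_delay_ns(r_ohms, c_pf):
    """Approximate delay to the 50% threshold of a single-pole RC, in ns."""
    return math.log(2) * r_ohms * c_pf * 1e-3  # ohm * pF = 1e-3 ns


def margins_ns(data_skew_ns, period_ns, t_setup_ns, t_hold_ns):
    """Return (setup_margin, hold_margin) when data changes data_skew_ns
    after the sampling clock edge. A negative margin is a violation."""
    hold_margin = data_skew_ns - t_hold_ns
    setup_margin = (period_ns - data_skew_ns) - t_setup_ns
    return setup_margin, hold_margin


# Hypothetical receiver requirements and clock period.
PERIOD_NS, T_SETUP_NS, T_HOLD_NS = 100.0, 5.0, 1.0

clean = margins_ns(0.0, PERIOD_NS, T_SETUP_NS, T_HOLD_NS)
with_cap = margins_ns(rc_delay_ns(200, 20), PERIOD_NS, T_SETUP_NS, T_HOLD_NS)
print("clean edges:", clean)
print("with 20 pF on the data line:", with_cap)
```

Under these assumptions, data that changes exactly on the sampling edge violates hold, and a couple of nanoseconds of RC delay on the data line is enough to make both margins positive.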
Debugging gets tougher when the issue is reproduced frequently at the user's site but cannot be reproduced on the same failed module in the lab. Once the issue is reproduced in the lab and we can probe signals with oscilloscopes or logic analyzers, things get a bit easier from that point.
I have faced a couple of such situations, and in both cases the issue could not be reproduced until the equipment was taken to stressed environmental conditions: in one case it was low temperature, and in the other it was artificially induced noise pulses on the inputs.
A large portion of my career has been debugging, and it has been very interesting, and sometimes quite rewarding. There are two primary categories, which are "things that worked at one time" and "things that never worked". The approach is a bit different for each, in that for the things that never worked there is the additional possibility that they were never put together as designed, while things that worked at one time were mostly assembled correctly. The real key to efficient debugging is to understand how the system is supposed to work, then find out where it doesn't.
Ah, the intermittent.
I was the analog engineer who designed the power distribution, data acquisition, and telemetry portion of a seismic streamer cable at a previous company 25 years ago. I went out to a boat to solve a power problem. When my chopper landed I discovered that they had never successfully powered up the system. We got the system running after a couple of days at sea. Time to relax. Ha!
I was told that a certain module kept failing. It just disappeared occasionally and wouldn’t return any data. Fortunately I had designed some troubleshooting tools into the shipboard system. Sure enough, the data burst would disappear occasionally and then reappear. The Observer’s Log revealed that the offending module S/N was good ole’ #39.
I had been called down to the Manufacturing test floor in Houston a couple of times in the previous few months to figure out what was wrong with ole’ #39. Extreme frustration aids the memory. It was labeled “bad data” and Manufacturing could not find anything wrong with it. I never found anything wrong with it, even with temp cycling. Manufacturing had sent it back to the field twice.
My intuition said this was a potentially serious problem. I arranged for the shipboard system hardware guy and the module hardware guy to chopper out the next day and bring along half of the equipment used in developing the system; a high speed scope, logic analyzer, OTDR, adapters, and 40 lbs of drawings.
We were able to get the module to fail while it was on board the ship and opened up for probing. I put the module in a refrigerator for a while and then let it warm up on the back deck under power, with heat guns to keep it dry. It failed for a while and then started working again. The module logic designer found a subtle race condition in his command decoding logic and was able to fix it back in Houston.
I wasn’t the only one to miss being with his family on July 4th.
I recall finding that RCA4038B monostables would (correctly) not retrigger when a second input pulse arrived before the previous output had timed out, but the Motorola MC14538B would retrigger, extending the output pulse.
The Motorola datasheet blithely stated that it would not retrigger - but it did.
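The difference between the two behaviours is easy to show with a small model. This is a sketch of an idealized monostable only, not of either part's actual internals. The function name, the time units and the trigger times are all made up for illustration.

```python
def monostable_output(trigger_times, pulse_width, retriggerable):
    """Return the output pulses as a list of (start, end) times.

    A trigger that arrives while the output is still high either extends
    the pulse (retriggerable) or is ignored (non-retriggerable).
    """
    pulses = []
    for t in sorted(trigger_times):
        if pulses and t < pulses[-1][1]:
            if retriggerable:
                # Restart the timing interval from this trigger.
                pulses[-1] = (pulses[-1][0], t + pulse_width)
            # Otherwise the trigger is ignored until the pulse times out.
        else:
            pulses.append((t, t + pulse_width))
    return pulses


triggers = [0, 6]  # second trigger arrives before the first pulse ends
print(monostable_output(triggers, 10, retriggerable=False))
print(monostable_output(triggers, 10, retriggerable=True))
```

With a width of 10 and triggers at 0 and 6, the non-retriggering model gives one pulse from 0 to 10, while the retriggering one stretches it to 16, which is the kind of difference that breaks a circuit designed around the other behaviour.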
Trying to stop Procurement from intermittently buying MC14538Bs or clone equivalents was a recurring nightmare.
'The mark of the beginner is a circuit littered with monostables.' True, true, (I forget the source) but I learned fast not to trust datasheets.
That reminds me of the time I was debugging a circuit where the previous engineer used a mux as a latch. I traced the problem down to the part but when I probed a signal on the output it looked like it was working.
Turned out that the circuit only worked with a particular manufacturer's parts because there was a brief time when both inputs were connected to the output. The scope probe on the output signal added just enough capacitance to make it work with the other parts.
My biggest debugging headaches have always led back to pirate components.
I was working on a 16-year-old design of mine, a 20 watt, 29 MHz to 1000 MHz power amp.
This amp was 20 dB low on power, and after some head scratching by two good techs I was called in. Fortunately, having worked designing power transistors for a few years, I was able to see instantly that the problem was a pirate final transistor.
Further evaluation under a microscope showed one manufacturer's TMOS FET die placed in another manufacturer's package, with an incorrect ceramic cap on the entire mess. The ceramic cap gave it away: even though it fit, it was the wrong style, a split-center, gemini-style cap, whereas the correct part used a continuous ceramic cap.
The die was from one manufacturer's 2 watt device, placed in a 25 watt flange package from another manufacturer, and best of all it was also missing half of the gate bond wires.
Way back in the late '80s, when I worked down under at a facility manufacturing digital radios for telecommunications and defense, I was responsible for testing and certifying an RF module. For many months we had a problem where the modules would go into self-oscillation, causing many headaches with returned modules.
We suspected that the temperature rise inside the modules (they were aluminium boxes with tight covers) was the cause. Innumerable times the modules were put inside the ovens, and oscillations would start even with a rise of 1 degree.
On a day when I did not have much work, I thought of having a thorough look at one of the modules. The circuit board had wires running from it to the outside world through feed-through capacitors. The supplier had been changed once on suspicion of the feed-through caps, but the problem remained. Then I noticed that the wires connecting the board to the feed-throughs were stranded wire. I remembered that in amateur radio we would never use stranded wire in high-frequency circuits. I then knew where the problem was. Changing those wires to single 18 AWG wires and keeping them very short solved the problem, and it never came back during my long career with that company. Come to think of it now, I wonder why this was not detected at the design stage. My reward was being allowed to work with the design teams as well, and I got a lot of satisfaction from it.