Way back in the late '80s, when I worked down under at a facility manufacturing digital radios for telecommunications and defense customers, I was responsible for testing and certifying an RF module. For many months we had a problem where the modules would go into self-oscillation, causing many headaches with returned modules.
We suspected that the temperature rise inside the modules (they were aluminium boxes with tight covers) was the cause. Innumerable times the modules were put inside the ovens, and oscillations would start with a rise of even 1 degree.
On a day when I did not have much work I thought of having a thorough look at one of the modules. The cct board had wires running from it to the outside world through feedthrough capacitors. The supplier had been changed once on suspicion of the feedthrough caps, but the problem remained. Then I noticed that the wires connecting the board to the feedthroughs were stranded wire. I remembered that in amateur radio we would never use stranded wire in high-frequency ccts. I then knew where the problem was. Changing those wires to single 18 AWG wires and keeping them very short solved the problem, and it never came back during my long career with that company. Come to think of it now, I wonder why this was not detected at the design stage. My reward was being allowed to work with the design teams as well, which gave me a lot of satisfaction.
Don't feel bad.
I had a problem with a 6301 processor that 'hiccuped' every 2 weeks and required a reset. As the H/W engineer, I was blamed for the problem. But no one, not even the lead engineers, could find the problem.
Eventually another H/W engineer did. Based on prior experience, he looked through the operating code to the CBIT (Continuous Built-In Test) section and found a bug that would manifest itself every two weeks. Part of our BIT requirements was to periodically test every RAM location. The code would save the contents of the location to be tested, disable the interrupts, test it, and then restore the location. The problem was that about once every two weeks, between the save of the location and the disabling of the interrupts, the location's content would be modified by an interrupt! The subsequent restore after testing that location overwrote the modified contents with what had been saved prior to disabling the interrupts.
Once the code was changed to disable interrupts before saving the location, the problem disappeared.
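For anyone who wants to see that window in slow motion, here is a minimal sketch in Python rather than 6301 code. Everything in it is made up for illustration: the `SimulatedSystem` class, the `interrupt_hook` argument that stands in for an interrupt arriving at exactly the wrong moment, and the 0x55/0xAA test patterns. It only assumes what the story says: save, disable, test, restore in the buggy version, and disable first in the fixed one.

```python
class SimulatedSystem:
    """Toy model: a small RAM plus an interrupt-enable flag (all names hypothetical)."""

    def __init__(self, size=8):
        self.ram = [0] * size
        self.interrupts_enabled = True
        self.pending = []  # ISR writes held off while interrupts are disabled

    def raise_interrupt(self, address, value):
        # The simulated ISR stores a value; it runs now only if interrupts are enabled.
        if self.interrupts_enabled:
            self.ram[address] = value
        else:
            self.pending.append((address, value))

    def disable_interrupts(self):
        self.interrupts_enabled = False

    def enable_interrupts(self):
        # Re-enable, then let any held-off ISR writes run.
        self.interrupts_enabled = True
        for address, value in self.pending:
            self.ram[address] = value
        self.pending.clear()


def pattern_test(system, address):
    """Destructive test of one location with two assumed patterns."""
    for pattern in (0x55, 0xAA):
        system.ram[address] = pattern
        if system.ram[address] != pattern:
            return False
    return True


def ram_test_buggy(system, address, interrupt_hook=None):
    saved = system.ram[address]      # 1. save
    if interrupt_hook:
        interrupt_hook()             # the window: ISR runs and updates the location
    system.disable_interrupts()      # 2. disable (too late)
    ok = pattern_test(system, address)
    system.ram[address] = saved      # 3. restore the stale copy, losing the ISR's write
    system.enable_interrupts()
    return ok


def ram_test_fixed(system, address, interrupt_hook=None):
    system.disable_interrupts()      # 1. disable first
    saved = system.ram[address]      # 2. save
    if interrupt_hook:
        interrupt_hook()             # same moment, but the ISR is now held off
    ok = pattern_test(system, address)
    system.ram[address] = saved      # 3. restore
    system.enable_interrupts()       # the held-off ISR write lands after the restore
    return ok


def demo(ram_test):
    """Start with 1 in location 3, let an 'interrupt' write 2 during the test."""
    system = SimulatedSystem()
    system.ram[3] = 1
    ram_test(system, 3, lambda: system.raise_interrupt(3, 2))
    return system.ram[3]


print(demo(ram_test_buggy))  # 1: the interrupt's update was overwritten
print(demo(ram_test_fixed))  # 2: the interrupt's update survives
```

On the real hardware the interrupt only landed in that gap about once every two weeks; the hook just forces it to happen every time.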
Needless to say, I was just as upset with the S/W guys as you were with the H/W guys for the problem they caused.
That reminds me of the time I was debugging a circuit where the previous engineer used a mux as a latch. I traced the problem down to the part but when I probed a signal on the output it looked like it was working.
Turned out that the circuit only worked with a particular manufacturer's parts because there was a brief time when both inputs were connected to the output. The scope probe on the output signal added just enough capacitance to make it work with the other parts.
My biggest debugging headaches have always led back to pirate components.
I was working on a 16-year-old design of mine, which was a 20 watt, 29 MHz to 1000 MHz power amp.
This amp was 20 dB low on power, and after some head scratching by 2 good techs I was called in. Fortunately, having spent a few years designing power transistors, I was able to see instantly that the problem was a pirate final transistor.
Further evaluation under a microscope showed one manufacturer's TMOS-FET die placed in another manufacturer's package, with an incorrect ceramic cap on the entire mess. The ceramic cap gave it away: even though it fit, it was the wrong style, a gemini-style cap with a split center, whereas the correct part used a continuous ceramic cap.
The die was from one manufacturer's 2 watt device, placed in a 25 watt flange package from another manufacturer, and best of all it was also missing half of the gate bond wires.
I recall finding that RCA4038B monostables would (correctly) not retrigger when a second input pulse arrived before the previous output had timed out, but the Motorola MC14538B would retrigger, extending the output pulse.
The Motorola datasheet blithely stated that it would not retrigger - but it did.
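To make the difference concrete, here is a small Python sketch of the two behaviours. It models neither part's actual timing; `monostable_output`, the trigger times, and the pulse width are all made up for illustration.

```python
def monostable_output(trigger_times, pulse_width, retriggerable):
    """Return the output pulses as (start, end) pairs for a list of trigger times."""
    pulses = []
    for t in sorted(trigger_times):
        if pulses and t < pulses[-1][1]:
            # Trigger arrived while the output is still high.
            if retriggerable:
                start, _ = pulses[-1]
                pulses[-1] = (start, t + pulse_width)  # timing restarts, pulse stretches
            # Non-retriggerable: the trigger is ignored.
        else:
            pulses.append((t, t + pulse_width))
    return pulses


triggers = [0, 6, 30]  # the second trigger lands inside the first 10-unit pulse
print(monostable_output(triggers, 10, retriggerable=False))  # [(0, 10), (30, 40)]
print(monostable_output(triggers, 10, retriggerable=True))   # [(0, 16), (30, 40)]
```

Anything downstream that counts on a fixed pulse width sees 10 units from one part and 16 from the other, which is how a swap on the purchase order turns into a fault on the bench.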
Trying to stop Procurement from intermittently buying MC14538s or clone equivalents was a recurring nightmare.
'The mark of the beginner is a circuit littered with monostables.' True, true, (I forget the source) but I learned fast not to trust datasheets.
If it had separate clock and data, possibly they were clocking on the 'wrong' edge. Clean signals would give maximum uncertainty, but delaying either the clock or the data (with 20 pF) gave it just enough timing margin to work. This can also occur with an SPI bus.
I even recall a datasheet that called up the wrong polarity for the clock phase bits, so most systems were running with almost the worst-case margins!
Ah, the intermittent.
I was the analog engineer who designed the power distribution, data acquisition, and telemetry portion of a seismic streamer cable at a previous company 25 years ago. I went out to a boat to solve a power problem. When my chopper landed I discovered that they had never successfully powered up the system. We got the system running after a couple of days at sea. Time to relax. Ha!
I was told that a certain module kept failing. It just disappeared occasionally and wouldn’t return any data. Fortunately I had designed some troubleshooting tools into the shipboard system. Sure enough, the data burst would disappear occasionally and then reappear. The Observer’s Log revealed that the offending module S/N was good ole’ #39.
I had been called down to the Manufacturing test floor in Houston a couple of times in the previous few months to figure out what was wrong with ole’ #39. Extreme frustration aids the memory. It was labeled “bad data” and Manufacturing could not find anything wrong with it. I never found anything wrong with it, even with temp cycling. Manufacturing had sent it back to the field twice.
My intuition said this was a potentially serious problem. I arranged for the shipboard system hardware guy and the module hardware guy to chopper out the next day and bring along half of the equipment used in developing the system; a high speed scope, logic analyzer, OTDR, adapters, and 40 lbs of drawings.
We were able to get the module to fail while it was on board the ship and opened up for probing. I put the module in a refrigerator for a while and then let it warm up on the back deck under power, with heat guns to keep it dry. It failed for a while and then started working again. The module logic designer found a subtle race condition in his command decoding logic and was able to fix it back in Houston.
I wasn’t the only one to miss being with his family on July 4th.
A large portion of my career has been debugging, and it has been very interesting and sometimes quite rewarding. There are two primary categories: "things that worked at one time" and "things that never worked". The approach is a bit different for each: for the things that never worked there is the additional possibility that they were never put together as designed, while things that worked at one time were mostly assembled correctly. The real key to efficient debugging is to understand how the system is supposed to work, then find out where it doesn't.