Debugging gets tougher when an issue shows up frequently at the user's site but cannot be reproduced on the same failed module in the lab. Once the issue is reproduced in the lab and we can probe signals with oscilloscopes or logic analyzers, things get a bit easier from that point.
I have faced a couple of such situations, and in both cases the issue could not be reproduced until the equipment was taken to stressed environmental conditions: in one case it was low temperature, and in the other it was artificially induced noise pulses on the inputs.
My biggest debug problem involved a system with about 60 PCBs, no system-level schematic, multiple DSP processors and intermittent communication problems on a strangely terminated bus (kind of like RS485).
Since the bus looked terrible, improving termination seemed like the proper attack. Nope, with "nice" looking bus signals the system didn't work.
I "patched" the system with a 20pF cap on a digital signal line. Can't say I fixed it. Why a 20pF cap on a digital signal improves the operation requires more room than I can write here.
A similar thing affects the MSP430 line of microcontrollers; there's a debate between those who argue that the default watchdog state should be ON, and those who point out that complicated init sequences can, under random circumstances, take longer than the default watchdog timeout, and so the default has to be off. FWIW, MSPGCC chose the latter option.
I've had two debugging problems that stumped me for a while, one caused by hardware and one by software.
The hardware problem was with a USB peripheral that kept failing the startup inrush current limit. We were using every bit of the 100 mA we were allowed to draw at startup, and the inrush current test kept failing. This was an 8052-based design with XDATA memory latched I/O. I finally figured out that an extra 40 mA of current was being drawn whenever the processor was in reset. The Address/Data lines were floating while the processor was in reset. The TTL latches quite naturally didn't like floating inputs and consumed their maximum current (20 mA each). This problem only happened at power up and cleared as soon as the processor came out of reset, so it looked like an inrush current problem. I added pullup resistors to the Address/Data lines and the "Inrush" current problem was solved. We had designed literally hundreds of PCBs with external memory and I/O over the years and had never noticed this behavior before.
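The arithmetic behind that failure can be written out as a rough sketch. The 100 mA, 20 mA and 40 mA figures are from the story; the latch count is only inferred from them, and treating the normal draw as exactly 100 mA is an assumption.

```python
# Figures quoted in the story above.
STARTUP_LIMIT_MA = 100   # current the device was allowed to draw at startup
LATCH_FLOATING_MA = 20   # worst-case draw of one TTL latch with floating inputs
EXTRA_IN_RESET_MA = 40   # extra current seen while the processor was in reset

# Assumption: the design normally sat exactly at the limit
# ("every bit of the 100 mA").
normal_draw_ma = STARTUP_LIMIT_MA

# Inference, not stated in the story: 40 mA at 20 mA each implies two latches.
latches_affected = EXTRA_IN_RESET_MA // LATCH_FLOATING_MA

# Total draw while the Address/Data lines float, and how far over budget it is.
draw_in_reset_ma = normal_draw_ma + latches_affected * LATCH_FLOATING_MA
overshoot_ma = draw_in_reset_ma - STARTUP_LIMIT_MA

print(f"latches affected: {latches_affected}")
print(f"draw while in reset: {draw_in_reset_ma} mA ({overshoot_ma} mA over the limit)")
```

With no margin left in the budget, any extra draw during reset is enough to fail the test, which is why the pullups alone cleared it.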
I had a software problem that intermittently locked up the microprocessor at power up. One clue was that the program had a very large number of initialized global variables. When I commented out some of these initialized variables the lockup problem went away; when I put them back, the problem reappeared (intermittently).
This particular processor came out of reset with the watchdog timer reset enabled. The first line of code in my program turned off the watchdog timer, but the variable initialization code the compiler added took longer to run than the default watchdog timeout period. We were right on the edge of timeout each time the processor was reset. I modified the compiler startup code to disable the watchdog timer reset before the variable initialization code was called. Problem solved.
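A toy model makes the ordering issue easier to see. This is only a sketch: the function `boot`, the step names, the timeout of 32768 cycles and the cost of 6 cycles per initialized byte are all hypothetical numbers chosen for illustration, not figures from the processor in the story.

```python
WDT_TIMEOUT_CYCLES = 32768   # hypothetical default watchdog timeout
CYCLES_PER_BYTE = 6          # hypothetical cost of initializing one byte

def boot(steps, wdt_timeout_cycles=WDT_TIMEOUT_CYCLES):
    """Run startup steps in order. Return 'reset' if the watchdog expires
    while it is still enabled, otherwise 'running'."""
    elapsed = 0
    wdt_enabled = True  # as in the story: watchdog is on coming out of reset
    for name, cycles in steps:
        if wdt_enabled and elapsed + cycles >= wdt_timeout_cycles:
            return "reset"
        elapsed += cycles
        if name == "disable_watchdog":
            wdt_enabled = False
    return "running"

many_globals = 6000 * CYCLES_PER_BYTE    # 36000 cycles, past the timeout
fewer_globals = 5000 * CYCLES_PER_BYTE   # 30000 cycles, just under it

# Compiler startup code runs first, then the first line of main().
original = [("init_globals", many_globals), ("disable_watchdog", 4)]
# Same order with some initialized variables commented out.
trimmed = [("init_globals", fewer_globals), ("disable_watchdog", 4)]
# The fix: disable the watchdog before variable initialization.
fixed = [("disable_watchdog", 4), ("init_globals", many_globals)]

print(boot(original))  # reset
print(boot(trimmed))   # running
print(boot(fixed))     # running
```

The model is deterministic, so it does not reproduce the intermittent behavior; in the real system the init time sat right at the edge of the timeout, so small timing variations decided which side it landed on.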
The strangest clue I ever saw, one not related to the actual problem, was with a sine wave oscillator. It was part of a circuit that was being built into a hybrid. The oscillator circuit comprised two feedback loops: one for AGC, and another containing the crystal to sustain the oscillation. I was called in because, while it would oscillate, the waveform looked like anything but a sine wave. It twisted and turned and spiked, and repeated the same shape each period. Then the tech showed me something cute. He took his examination lamp and brought it near the open hybrid, and suddenly the waveform popped into a perfect sine wave and stayed there as he removed the lamp. It had nothing to do with the problem (which was caused by placing amplitude limiting on the positive peak while AGCing on the negative peak), but it was mind-blowing.
You mean the hardware guys never tested their goodies in a hot environmental chamber or even used a hot-air-gun/hair dryer? Tsk Tsk!
A poor man's environmental chamber is easy to home-brew with a table saw. Build a plywood box with a hinged door, large enough for the UUT, line it with styrofoam insulation covered with grounded tinfoil (for ESD), place an electric space heater and a fan inside, and drill a hole to insert a glass thermometer. If you want to get fancy, use a temperature probe and a feedback loop. For a small UUT a picnic cooler works well. For cold, remove the heater and blow the fan through a minnow-trap wire basket filled with dry ice nuggets. When thawing out you are guaranteed 100% humidity too.
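For the "fancy" version, the simplest feedback loop is probably an on/off thermostat with a little hysteresis so the heater relay doesn't chatter. A minimal sketch, assuming a probe that reports degrees C; the function name `heater_command` and the 1 degree band are made up for illustration.

```python
def heater_command(temp_c, setpoint_c, heater_on, hysteresis_c=1.0):
    """On/off (bang-bang) control with hysteresis.

    Returns True if the heater should be on. Inside the band around the
    setpoint the heater keeps its previous state, which avoids rapid
    switching when the temperature hovers near the target.
    """
    if temp_c <= setpoint_c - hysteresis_c:
        return True          # too cold: heat
    if temp_c >= setpoint_c + hysteresis_c:
        return False         # too hot: stop heating
    return heater_on         # inside the band: leave it alone

# Aiming for 40 C inside the box:
print(heater_command(35.0, 40.0, heater_on=False))  # True
print(heater_command(41.5, 40.0, heater_on=True))   # False
print(heater_command(40.3, 40.0, heater_on=True))   # True (unchanged)
```

Call it once per probe reading and feed the result back in as `heater_on` on the next call.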
I remember a debug session where the watchdog timer occasionally timed out. It was a long ripple counter chain meant to give a 1 second timeout. Using an analog scope we managed to see the fault occur, and it looked like the watchdog timed out prematurely. It had been designed by a junior, so I counted up the number of flipflops. The junior had designed it to provide a period of 1 second - he forgot that the timeout occurs on the rising edge HALFWAY through the period, so the watchdog actually fired after about half a second.
It took 5 minutes with the knife and soldering iron to cut in another flipflop in the counter IC - problem fixed.
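The off-by-one-stage mistake is easy to check by counting. The sketch below treats the ripple chain as an ideal binary up-counter (propagation delay ignored) and assumes, purely for illustration, a 32768 Hz input clock and a 15-stage original chain; neither number is given in the story.

```python
CLOCK_HZ = 32768  # hypothetical input clock, chosen so 2**15 clocks = 1 s

def clocks_until_last_stage_rises(stages):
    """Count input clocks from zero until the last stage's output first
    goes high. Models the chain as an ideal binary up-counter."""
    count = 0
    clocks = 0
    last_stage_mask = 1 << (stages - 1)
    while not (count & last_stage_mask):
        count = (count + 1) % (1 << stages)
        clocks += 1
    return clocks

for stages in (15, 16):
    period_s = (1 << stages) / CLOCK_HZ
    timeout_s = clocks_until_last_stage_rises(stages) / CLOCK_HZ
    print(f"{stages} stages: period {period_s} s, rising edge after {timeout_s} s")
```

With 15 stages the last output has a 1 second period but rises after 0.5 s; one more flipflop doubles both, putting the rising edge at the intended 1 second.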
A very recent one for me initially looked like wayward software, then digital hardware, then analog hardware, and ultimately turned out to be a noise issue. I think noise problems are by far the worst to debug, especially when they create symptoms that strongly suggest a different root cause.
This peculiar case of debugging happened way back, 30 years ago, when we were debugging the networking software for an 8080-based microcomputer system.
At that time, recompiling the code every now and then was not possible because of the time it took to assemble the whole code and create a new binary. So the normal practice was to keep a patch list, which we updated as each new bug was found and fixed. The patches had to be entered manually.
For a few days we observed that our software would run fine in the morning and start behaving erratically as the day progressed. As the test hardware had already been debugged and had not been changed for a couple of months, the only suspect was the software.
This was frustrating, as at the end of every day we were back at the same point where we had started.
The hardware guys were in a relaxed mood, as they had no work until we got past our debugging stage.
On one such idle afternoon, one of the hardware guys was just curiously looking at the main board and trying to clear the dust off the ICs when he noticed something peculiar. In the memory bank there were ICs with different access times (some with 15 ns and some with 22 ns, if I remember correctly).
This speed difference worked fine in the morning when the ambient temperature was around 20 deg C, but as the day progressed and the ambient temperature reached around 40 deg C, the mismatch in access speed would become apparent and create havoc with the software, causing a runaway condition.
You can imagine how furious we were with the hardware group when this bug was found.