Sorry, I've never seen a professor, program or ... Below is an abbreviated description of how I personally tackle the more or less frequent 'taskforce' jobs, with quite some success.
The very core of troubleshooting is root-cause analysis, which starts with a long list of 'is'/'is not' statements. In other words: list which effects are correlated with the 'misbehavior' and which are not. This stage of analysis does not necessarily require profound system knowledge, but it helps to know how "things" are 'normally' done (implemented, 'solved').
Having compiled a reasonable list of 'is'/'is not', proceed from correlation to causality analysis. That is: find out which physical or 'side' effects could cause the effects observed. This is the stage where profound knowledge of building blocks is at least helpful.
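As a minimal sketch of the 'is'/'is not' bookkeeping, the following Python tabulates, for each observed condition, how often the fault shows up with and without it. The observation records and the condition names (fan_on, usb_connected, temp_high) are invented for illustration; the only point is that conditions which split the failures cleanly are the ones worth carrying into the causality stage.

```python
from collections import defaultdict

# Hypothetical observations: each run records the conditions present
# and whether the misbehavior was seen. All names and values are invented.
observations = [
    ({"fan_on": True,  "usb_connected": True,  "temp_high": False}, True),
    ({"fan_on": True,  "usb_connected": False, "temp_high": False}, True),
    ({"fan_on": False, "usb_connected": True,  "temp_high": True},  False),
    ({"fan_on": False, "usb_connected": False, "temp_high": False}, False),
    ({"fan_on": True,  "usb_connected": True,  "temp_high": True},  True),
]


def is_is_not_table(runs):
    """For each condition, return (failure rate when present, failure rate when absent)."""
    # counts[name][present] holds [failures, total runs]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for conditions, failed in runs:
        for name, present in conditions.items():
            counts[name][present][0] += int(failed)
            counts[name][present][1] += 1
    table = {}
    for name, by_state in counts.items():
        rates = []
        for present in (True, False):
            failures, total = by_state[present]
            rates.append(failures / total if total else None)
        table[name] = tuple(rates)
    return table


def rank_suspects(table):
    """Sort conditions by how strongly their presence separates fail from pass."""
    def spread(item):
        with_cond, without_cond = item[1]
        if with_cond is None or without_cond is None:
            return 0.0
        return abs(with_cond - without_cond)
    return sorted(table.items(), key=spread, reverse=True)


if __name__ == "__main__":
    for name, (with_cond, without_cond) in rank_suspects(is_is_not_table(observations)):
        print(f"{name:15s} fails with: {with_cond}  fails without: {without_cond}")
```

A high ranking here is still only correlation. It tells you where to start probing, not what the cause is.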
The rest is about measuring, probing, testing, modifying, iterating, ... Whether short or long, following the path described above routinely leads to results.
'Experience' (your own, or learned from others' mistakes) helps to speed things up (or to stay on track), but there is always a first time.
A good start to learn general troubleshooting is to read "Debugging" by David J. Agans (ISBN 0-8144-7457-8).
For electronics, "Troubleshooting Analog Circuits" by Robert A. Pease (ISBN 0-7506-9184-0) covers analog circuit issues, but many "digital" problems are "analog" in cause.
In addition, some "questions" (paraphrased from Pease's book):
Did it ever work right? How do you know it's not working? When did it stop working? What else happened at the same time? Milligan's Law: "When you are taking data, if you see something funny, Record Amount of Funny."
Yes, I think there are some skills that can be taught. Dividing the problem down to find the subsystem that is causing the error is clearly a general tool.
There are also classes of errors that are easier to debug than others. Construction errors, for example. This is true for hardware or software.
The rarer and more exotic the problem, the more difficult the investigation. Divide and conquer works, as long as you know all the subsystems you are dividing by. How could you not? I've seen digital timing issues (think jitter) where engineers were dividing the problem into the different circuit blocks. The culprit was cross-coupling through the common power supply. In the engineers' minds, the different digital blocks were the subsystems; the power supply wasn't even considered.
Here's another one. Cosmic rays changing a memory value. Ultra-reliable systems deal with this issue.
Peer reviews can be very helpful. Someone has a thought or a previous experience and looks at a problem in a new way.
Many years ago, HP was experiencing reliability problems with an exotic semi fab. "Phantom" sodium doping was appearing in the devices, which eventually migrated and caused a failure. The equipment and process were checked and rechecked. Perfect. The wafer was watched all through the process. Perfect. How could this happen? Turned out the night crew used one of the ovens occasionally to warm up their pizza.
The PDP-11 had a multi-bus architecture between computers. Any computer could request the bus, and the one that requested it first got it. It worked well, but what happens if they request at the same time? At EXACTLY the same time. It's like clocking a flip-flop exactly when the data is changing. In fact, it is the same. There is an arbitration flip-flop somewhere that decides. The FF becomes perfectly balanced, a metastable state, like a broom balanced vertically. While in a metastable state, a digital device has some interesting analog properties. Both Q and Q-Bar can be high (or low) for some undefined period of time, much longer than quoted settling times. Depending on the logic that follows, you may create a state that shouldn't ever exist. In the PDP-11 case, I have forgotten the details of the failure, but it was catastrophic. It didn't destroy the equipment, but it crashed the bus. It was extremely rare, with two weeks between incidents.
This was exactly the case of the precision integrating A/D converter I witnessed. A FF sampled the value of a zero-crossing comparator as the integrator was near zero. It didn't matter whether the comparator was a 1 or a 0, the algorithm would work either way. But occasionally, once every few hours, the flip-flop sampled the comparator right as it was transitioning. The FF would momentarily hang. Q and Q-Bar would both go to 1 for a few nanoseconds, and then settle. Boolean logic dictates that a 0->1 transition only occurs if the state is changing from 0 to 1, but there were cases when there would be a 0 before and after the clock, but a brief momentary 1. Exactly the same as the PDP-11 arbitration problem. This caused two different logic paths afterwards to treat the comparator value differently.
Though it is not possible to bring the probability of a metastable issue in asynchronous circuits to 0, it is possible to bring it very close to 0 (by double clocking, for instance). It is also possible to inspect the logic that follows to make the consequences less severe.
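To put rough numbers on how close to 0 that probability can get, a commonly used first-order model for synchronizer failure is MTBF = exp(t_r / tau) / (T_w * f_clk * f_data), where t_r is the time allowed for the flip-flop to resolve, tau is its resolution time constant, and T_w is its metastability window. The sketch below applies that model. The device parameters are invented for illustration and are not taken from the PDP-11, the A/D converter described above, or any datasheet. It also assumes that a second synchronizing flip-flop buys roughly one extra clock period of resolution time.

```python
import math


def synchronizer_mtbf(t_resolve, tau, t_window, f_clk, f_data):
    """First-order MTBF model (seconds) for a flip-flop sampling an asynchronous signal.

    t_resolve: time available for a metastable output to settle (s)
    tau:       resolution time constant of the flip-flop (s)
    t_window:  metastability window around the clock edge (s)
    f_clk:     sampling clock frequency (Hz)
    f_data:    average rate of asynchronous input transitions (Hz)
    """
    return math.exp(t_resolve / tau) / (t_window * f_clk * f_data)


# Illustrative, assumed values only (not from any real part).
TAU = 0.5e-9        # 0.5 ns
T_WINDOW = 0.1e-9   # 0.1 ns
F_CLK = 50e6        # 50 MHz
F_DATA = 1e6        # 1 MHz
T_CLK = 1.0 / F_CLK

# Single flop: downstream logic leaves only part of the period for settling (assumed 5 ns).
single = synchronizer_mtbf(5e-9, TAU, T_WINDOW, F_CLK, F_DATA)

# Two flops in series: roughly one additional full clock period to settle.
double = synchronizer_mtbf(5e-9 + T_CLK, TAU, T_WINDOW, F_CLK, F_DATA)

if __name__ == "__main__":
    print(f"single flop MTBF: {single:.3g} s")
    print(f"double flop MTBF: {double:.3g} s")
    print(f"improvement factor: {double / single:.3g}")
```

In this model the improvement factor is exp(T_clk / tau): exponential in the extra settling time, which is why one more stage helps so much, yet always finite, which is why the failure probability never reaches exactly 0.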
Two people walking down a hallway face to face. Two cars entering an intersection at the same time. We've all seen it. There is a delay in the logic of deciding who goes first. Theoretically, it is possible to starve to death if placed equidistant between two pizzas. That one I have NOT witnessed.
There are also some tricks that probably seem obvious to most readers, such as dividing the problem temporally and physically.
So you try to trace forward to the point where abnormal behavior first appeared... or trace backward through time through all the points where the abnormal behavior existed, to try to find a point of injection.
Similarly, if you can cut away parts of the circuit as being OK, what is left should be the most likely source of the error.
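The idea of tracing to the point of injection can be sketched as a bisection over an ordered signal chain (or an ordered series of points in time). This is a minimal illustration, assuming that once the fault is injected it stays visible at every later point. The stage names and the probe function are hypothetical stand-ins for a real measurement.

```python
def first_bad_stage(stages, looks_ok):
    """Return the index of the first stage whose output looks wrong, or None.

    Assumes stages are ordered along the signal path and that once the
    fault is injected, every later stage also looks wrong.
    """
    lo, hi = 0, len(stages)          # search window is [lo, hi)
    while lo < hi:
        mid = (lo + hi) // 2
        if looks_ok(stages[mid]):
            lo = mid + 1             # fault is injected after mid
        else:
            hi = mid                 # mid is bad; fault is at mid or earlier
    return lo if lo < len(stages) else None


# Hypothetical signal chain and a fake probe for demonstration.
STAGES = ["sensor", "preamp", "filter", "adc", "fpga", "uart"]
FAULTY_FROM = "adc"


def probe(stage):
    """Stand-in for a real measurement: True if the stage output looks right."""
    return STAGES.index(stage) < STAGES.index(FAULTY_FROM)


if __name__ == "__main__":
    idx = first_bad_stage(STAGES, probe)
    print("first bad stage:", None if idx is None else STAGES[idx])
```

The search can only find a culprit that appears in the list being divided. A shared element that was never listed as a stage, such as a common power supply, will not be found this way.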
All good Sherlock Holmes stuff. "When you have eliminated the impossible, whatever remains, however improbable, must be the truth."
And always check your assumptions.
Very often the source of the problem that escapes detection for a long time is some fact or condition that was not deemed worth checking.
A primer in general systems theory might be helpful. Do engineering schools provide that at all? An understanding of fundamentals such as unintended consequences ought to be second nature to engineering graduates!
I couldn't agree more about the critical need for good troubleshooting skills in engineering, having launched several new content sections on EE Times and Design News like Engineering Investigations and Made by Monkeys, where engineers relate stories involving mysterious problems and how they were resolved. Though nothing takes the place of hands-on experience, I think that engineers can learn from the story-telling by other engineers on problems they solved. Detailing their thought process of what they tried and why and what worked and what didn't work is an important transfer of hard-earned experience.
"A student once asked me how I learned how to debug. I hadn't really thought about it before, but after considering the question a bit I told him it was probably the many detective novels I read as a teenager."
Or perhaps you were drawn to such material because you already possessed a problem-solving "gene" - a trait that I suspect may be common among engineers.
A student once asked me how I learned how to debug. I hadn't really thought about it before, but after considering the question a bit I told him it was probably the many detective novels I read as a teenager. I think the best genre for this is the "police procedural", such as the Martin Beck series by Maj Sjöwall and Per Wahlöö (e.g., The Laughing Policeman) and Georges Simenon's Maigret novels. A police procedural puts you in the right frame of mind for the slow, methodical process of tracking down a bug. An anomaly has occurred. First you try to reproduce the crime. Then you have to interview all the signals and/or variables that may know something about the crime. You have to assume that some or all of them are lying to you. You have to eliminate suspects until only one is left, or else get lucky and find a key clue or get an unexpected report from an informer. You have to look for Joseph P. McGillicuddy, Lt. Dan Muldoon's code name for the suspect they don't know about -- yet -- in The Naked City (Jules Dassin, 1948). Trying to hurry the process makes you miss things. You neglect to follow up an unpromising lead that ends up solving the puzzle.
Private detective novels are also good, especially the complex Ross Macdonald and Raymond Chandler novels that have so many characters that it's hard to keep them straight. Rex Stout's Nero Wolfe novels are excellent, because they also have the police procedural form, with Archie Goodwin and other operatives bringing the information to Nero Wolfe, who is the only one who can fit all the pieces together. Another favorite is John Dickson Carr, master of the "locked room" mystery.
When I have a really tough debugging problem and I can't seem to get there, I put on my hat -- an authentic English deerstalker such as is worn by Sherlock Holmes. It always puts me in the right frame of mind to reëxamine the evidence one more time and see what I missed before.
The worst thing you can do? Look at your design or code and say "it's got to work!" Obviously it doesn't, so assuming otherwise is not going to put you in the right frame of mind for debugging it.