Solving tough, mysterious problems is hard. And there are no textbook methods on how to do it.
When we honor the best engineers, we usually identify those with top-notch system architecture skills and design savvy, who can be creative and pragmatic at the same time. Let me add one more skill: troubleshooting.
Like an ace relief pitcher in the World Series, an excellent troubleshooter may be exactly what you need to wrap up a product design and get it to ship. All designs will eventually get an “A” -- the question is how long that will take.
As an R&D manager, I’ve experienced the following situation over and over. A complex product near shipment is not behaving as expected, and nobody has a clue why. You throw everything at it -- design reviews, physical inspections, code inspections, multiple prototypes, anything! The laws of science have somehow been violated. The electron gods are not pleased.
Management is breathing down your neck. The viability of the business seems to hinge on getting this particular product to ship. “How long until this is fixed?” they ask. You can’t answer, because you don’t know what is wrong, so you can’t fix it. Time to call in your ace.
Troubleshooting is a special set of skills. A big part of it is questioning assumptions. There are many ways for designs to fail -- are you clever enough to identify them? For the most complex problems, it takes a twisted mind to think outside the models and discover the fatal flaw. Once discovered, these flaws all look obvious.
As a teacher’s aide for electronics lab courses during college, I delighted in this. A student would design and construct a small circuit, either digital or analog, and it wouldn’t work. After some time, the exasperated student would declare that something was wrong. It has to work, but it doesn’t. There is no scientific reason for its behavior. The electron gods are not pleased.
Back to questioning calculus and poles and zeros. But invariably something simple was at fault. A power supply wasn’t connected to a part. A pin was miswired. A part was in backwards. The ground was missing. A relieved student would slap his forehead. Obvious now. Question your assumptions. These were simple construction errors. Those students learned something.
As systems get more complex and subtle, so do the problems. “Ideal models” are often the culprit. Real capacitors behave like networks, resistors aren’t perfectly linear, and grounds aren’t perfect grounds. Memory leaks, register overflows, timing errors, compiler bugs... the list continues. I’ve seen problems caused by ions in the water used to wash a PC board, digital circuits oscillating for no apparent reason, and intermittent faults with days between occurrences. The DEC PDP-11 had huge bus reliability issues due to a rare metastable state. I learned about the PDP-11’s history in college, and years later that knowledge was the spark that allowed me to find a metastability problem in a precision analog-to-digital converter that was holding up shipments.
As an R&D manager, faced with a tough problem with a critical product, I’d go to my ace. Like the manager at the World Series calling in his ace reliever, I’d want the best person on the toughest and most time-critical problems. No, that isn’t quite right. I’d throw all my aces at them. Solving tough, mysterious problems is hard. And there are no textbook methods on how to do this.
Which brings me to my question: How can we develop this expertise? College lab courses are a start, but is there anything else? What can industry do? It seems to just come from experience, but some people are much better than others.
So, I’m curious: Do you think we can accelerate the experience and insight curves in some way?