It has always been a nightmare for me whenever an "intermittent failure" happens. The worst one, which took me almost 48 hours to debug, turned out to be a counterfeit part! Now that was a very good challenge. Luckily, we are able to identify most of them prior to use nowadays.
The best approach to finding and curing intermittent problems is to have a very good understanding of the system, and then to understand which part of the failure is a result, as opposed to being the driver. So unless one has a history of fixing a certain system, with a good knowledge of what the problem usually is, it is good to work toward an understanding first.
It is certainly true that intermittent failures are a large challenge. In this case, if they had been able to detect that the failure did in fact have a specific driving event, it would have saved a lot of investigation time. Of course, if the very first step had been a check to see what had changed recently, then the time to find the problem would also have been reduced a lot. I have also had experiences where purchasing, or some other cost-reducing person, has made a change "that should not have any effect" on the function of a design. A capacitor that is completely adequate for power-rail bypassing is seldom acceptable as a timing circuit element.
Corollary #1 to Lesson #10: The person who introduced the cost reduction will receive a bonus & accolades. The engineer who originally designed the circuit, and who will have to "fix" it again, will receive a low performance rating for letting a bad design be released.
Corollary #2: Look for any changes that occurred before the failures started, whether H/W, S/W, or mechanical. For example, I've seen a change in box layout move a noise source near a sensitive circuit, which caused problems.
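To make Corollary #2 a bit more concrete, here is a minimal Python sketch of the "what changed before the failures started?" check. It assumes a change log is kept as dated entries; the log contents, the field names, the function name, and the 60-day window are all hypothetical, made up purely for illustration.

```python
from datetime import date, timedelta

def changes_before_failure(changes, first_failure, window_days=90):
    """Return changes dated within window_days before first_failure,
    most recent first, since the latest change is often the best suspect."""
    earliest = first_failure - timedelta(days=window_days)
    suspects = [c for c in changes if earliest <= c["date"] <= first_failure]
    return sorted(suspects, key=lambda c: c["date"], reverse=True)

# Hypothetical change log covering H/W, S/W, and mechanical changes.
change_log = [
    {"date": date(2023, 1, 10), "kind": "S/W", "note": "bootloader update"},
    {"date": date(2023, 5, 2), "kind": "mechanical", "note": "box layout revised"},
    {"date": date(2023, 5, 20), "kind": "H/W", "note": "capacitor second source"},
    {"date": date(2023, 6, 15), "kind": "S/W", "note": "logging change"},
]

# Assumed date of the first reported failure.
suspects = changes_before_failure(change_log, date(2023, 6, 1), window_days=60)
for c in suspects:
    print(c["date"].isoformat(), c["kind"], c["note"])
```

The window only narrows the list of suspects. A change made well before the first reported failure can still be the driver if the fault takes time, temperature, or a particular operating condition to show up.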
Lesson 1 is the red flag that gets me going. It is important to watch and listen to those who are [possibly] operators of a particular system. This is the area that will yield the most satisfying amount of real engineering follow-up data. One of the things that I have discovered is that many lower-level workers who are present to assist in the design, production, and/or implementation of a system are not as fully trained as you would like them to be. So the fact that "Lesson #1" is even on this list tells me that staff training must be intensified.
I would like to add Lesson 10, which could have eliminated the need for several of the earlier Lessons.
Lesson 10: When a product that was previously reliable starts failing after a cost reduction, look first at the components that were modified to reduce cost.
Inevitably, when a product is re-engineered to reduce cost, the level of verification and analysis is less than what was done during the initial development. This often comes back to bite you and your customers.
Thanks to Jit for sharing his nice experience. Sorry for the trouble he went through with the hotel room booking. I believe that, most of the time, debugging is much harder than designing from scratch, especially when the problem is not easily reproduced and the design was done by somebody else. I too have faced a similar situation, where a memory chip was replaced during a cost-out activity with a cheaper part that was less immune to noise. The problem did not show up consistently unless the temperature was brought down to a couple of degrees Celsius. But I was luckier than Jit: I had an oven and much more time.