It's midnight, you haven't found the bug yet, and in the morning the government program manager is visiting to see the prototype, but the logic isn't acting logically.
My first engineering job came while I was in college, working for a famous professor who was a leading error detection and correction theorist. He had a contract with a government agency to build five error correction processors. These single-board prototypes used a mix of complex LSI components and a few scattered logic gates. The processor was micro-coded in PROM and implemented Galois field arithmetic to detect and correct errors in communications channels. (That meant lots of shifting and XOR-ing to do finite field addition and subtraction, which was interesting stuff for an undergraduate engineer.)
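For readers who have not met Galois field arithmetic, here is a minimal Python sketch of the shift-and-XOR idea. The field size GF(2^4) and the polynomial x^4 + x + 1 are assumptions chosen for illustration; the story does not say which field the processors actually used, and the function names are hypothetical.

```python
# Hypothetical parameters: the article does not name the field or polynomial.
M = 4                # work in GF(2^4), a field with 16 elements
PRIM_POLY = 0b10011  # x^4 + x + 1, a primitive polynomial for GF(2^4)


def gf_add(a, b):
    """Add two field elements. In GF(2^m), addition and subtraction
    are the same operation: a bitwise XOR with no carries."""
    return a ^ b


def gf_mul(a, b):
    """Multiply two field elements by shift-and-XOR, reducing by
    PRIM_POLY whenever the shifted operand overflows M bits."""
    result = 0
    while b:
        if b & 1:           # this bit of b selects the current shift of a
            result ^= a
        b >>= 1
        a <<= 1             # multiply a by x
        if a & (1 << M):    # degree reached M: reduce modulo PRIM_POLY
            a ^= PRIM_POLY
    return result


if __name__ == "__main__":
    print(gf_add(0b0101, 0b0011))  # XOR of the two bit patterns
    print(gf_mul(0b0010, 0b1000))  # x * x^3 = x^4, reduced to x + 1
```

Because addition is a plain XOR, subtracting a value is the same as adding it, which is why hardware for this kind of arithmetic is mostly shift registers and XOR gates.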
We had four of the five prototypes working, and the government program manager was visiting the next day to review the operation of all the prototypes. It was already midnight, so I had 8 hours, if I worked all night, to get the last device working. It was making a maddening error. About 99% of the operations worked fine, but some data patterns failed, always on the same instructions, so I knew it wasn’t a noise, clocking or power issue. The trouble was that the errors wouldn’t show up until many cycles into the process. How could I figure out which operation started the error?
We were smart enough to design in signature analysis as a part of the final test process. Signature analysis breaks all loops in the design (for example, the address counter just steps thru the micro-instruction PROM) to create a tree of data patterns, one at each test node. A Cyclic Redundancy Check (CRC) circuit in a digital probe compresses the probed data pattern into a 16-bit result, so that on the hundred or so test points we had a unique digital signature for each probe location (except for the few all-zero or all-one patterns). By comparing the failing board to a known good board we could determine which circuit was failing. (Since the processing loop was now broken, the error didn’t propagate from one processing cycle to the next, so it would be easier to track down.)
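The compression step can be sketched in a few lines of Python. This is an illustration only: the polynomial (0x1021, the common CRC-16-CCITT one), the all-zero starting value, and the names `signature` and `failing_nodes` are assumptions, not details of the probe we actually used.

```python
def signature(bits, poly=0x1021):
    """Compress a stream of 0/1 samples from one test node into a
    16-bit signature using a serial CRC shift register.
    The polynomial is an assumed example, not the real probe's."""
    reg = 0
    for bit in bits:
        feedback = ((reg >> 15) & 1) ^ (bit & 1)
        reg = (reg << 1) & 0xFFFF
        if feedback:
            reg ^= poly
    return reg


def failing_nodes(good, bad):
    """Compare two boards node by node and return the names of the
    nodes whose signatures differ. Both arguments map a node name
    to the list of bits sampled there during the test."""
    return sorted(n for n in good if signature(good[n]) != signature(bad[n]))


if __name__ == "__main__":
    pattern = [1, 0, 1, 1, 0, 0, 1, 0] * 4
    good = {"U7.pin6": pattern, "U9.pin3": pattern}
    # Flip one sample on one node to mimic a single wrong output.
    bad = {"U7.pin6": pattern[:5] + [1] + pattern[6:], "U9.pin3": pattern}
    print(failing_nodes(good, bad))
```

A single wrong sample anywhere in the stream changes the 16-bit result, so one signature per node is enough to tell a good node from a bad one without storing the whole data pattern.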
After probing around on the board for several minutes I was able to track errors at the edge of the tree back to the ‘first’ failing node on the error-filled ‘limb’. The output was failing the signature test, but all the inputs were OK. The failing device was a four-input AND gate (remember 7400 TTL?). Figure 1, below, shows the failing device.
Figure 1: Failing Four-input AND Gate
Well, the solution seemed pretty simple: the gate must be bad. I pulled the part out of the socket and put a new one in. I reran the test, but the same error came up! The signature was the same as with the previous part. What was going on?
In order to look at the gate more closely I put the board in a single-step mode and looked at some data patterns on the AND gate using a logic analyzer. Since there were only 16 input combinations I was able to step thru several cycles and observe the inputs and outputs. Almost everything worked fine: the AND gate created a high when all inputs were high. But one combination was wrong. When all the inputs were low the output was high too! What was going on? There couldn’t be two devices that failed in just the same way, could there?
(Spoiler alert: think for a minute and see if you can figure out what was wrong...)
After considering for a bit I decided to get all the data, so I probed all the pins on the device, even power and ground, just to be complete. What do you think I found? The power pin on the device was unconnected! What did that mean? After thinking a bit about what I had learned in my TTL logic design class I figured it out. When at least one input to the gate was high, the device was drawing enough power thru that input to operate OK, and the output would be correct. If all inputs were low, however, the device would power off and the output would float high and be incorrect! (If I had used an oscilloscope earlier in the debugging process and treated the problem as a possible analog issue instead of a purely digital one I probably would have found the bug sooner, but at that time I was a ‘digital only’ designer and hadn’t crossed over to the ‘dark side’ yet.)
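The fault is easy to capture as a small truth-table model. The Python sketch below simply encodes the behavior described above (correct output whenever any input is high, output floating high when all inputs are low); it is not an electrical simulation, and the function names are made up for illustration.

```python
from itertools import product


def ideal_and4(a, b, c, d):
    """A correctly powered four-input AND gate."""
    return a & b & c & d


def unpowered_and4(a, b, c, d):
    """The gate with its power pin unconnected, as described in the story.
    Any high input supplies enough power for correct operation;
    with every input low the part is off and the output floats high."""
    if a | b | c | d:
        return a & b & c & d
    return 1


def mismatches():
    """List every input combination where the faulty gate disagrees
    with the ideal one."""
    return [bits for bits in product((0, 1), repeat=4)
            if ideal_and4(*bits) != unpowered_and4(*bits)]


if __name__ == "__main__":
    print(mismatches())
```

Only one of the 16 input combinations disagrees with a healthy gate, which fits with most operations passing and only certain data patterns failing.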
I connected power to the device and reran the test. It worked! It was 2AM so I put the now working board with the others, left a note on the door for my professor and went home. Plenty of time to get some sleep before the project leader showed up at 9AM. Unfortunately I was so excited about fixing that bug I couldn’t sleep.
Warren Miller has over 30 years of experience in the electronics industry working in product planning, applications, marketing and engineering. He has worked for AMD, Actel, Avnet, Marshal, MMI and Velogix. Warren is President of Wavefront Marketing, a technical marketing consulting company serving the semiconductor, IP and EDA markets.