Does five years as an engineer mean no more mistakes? Probably not.
I was asked to help out a co-worker, “Joe”, a less-experienced engineer, with an ESD problem. Highly appropriate; after all, I had five whole years of work under my belt.
The product was only a computer mouse. But we were selling tons of these to our best customer, one of the biggest computer makers in the world at the time. The production line was down for about a week when I joined the effort, so management was…let’s say, “perturbed”.
The mouse had shipped for months during the relatively wet spring. But the drier weather of summer led to more static events, and defective mice were coming back faster than we could ship out replacements.
The mice passed the basic static zap test, but we played with the grounding method a bit and, lo and behold, they were blowing up in the lab just like in the field.
The mouse included an ASIC designed by our customer, the computer maker. “Use this,” they told us (the “…or die” part was assumed). So the mouse was essentially a build-to-print job for our factory. Still, we were responsible for the design performance because…well, our customer assured us we were responsible, and Sales agreed.
So there we were, holding daily internal meetings on our progress, mice exploding right and left, ankle deep in returns…and no answers.
The first clue came quickly enough. Of course, the ASIC was failing, and we found a failed input pin. But the chip passed the version of MIL-STD-883 that was in use at the time, and that test was the gold standard for chip-level ESD robustness. So the theory was that our board layout was faulty. At least, that was Sales’ theory. (Sales was really quite helpful. Really.)
We (mostly Joe, to be honest) tried all kinds of things to reinforce the board. We knew that adding filters and caps was not dealing with the issue at the source—but the ASIC passed and re-passed the MIL-STD-883 test, and there was nothing obviously wrong with the schematic otherwise…
After the initial panic of the first week, another week went by…followed by more weeks, then months. I cannot describe the ongoing hell this high-priority, hugely expensive issue turned into. This thing simply would not die. We pretty much knew there was something wrong with the ASIC’s ESD performance, but had no proof. We were having conference calls on a weekly basis with the ASIC vendor, with no result.
Then, a savior appeared. Our QC manager, who was qualified by dint of previous experience as a dental hygienist (don’t ask), hired an outside consulting firm to analyze the ASIC. These experts decapped the package and took hi-res photos through a microscope.
Their images showed tiny craters all over the die, as if the chip had been sandblasted. Their verdict was that the ASIC vendor had “a process problem”.
As the consultants were explaining this, I was getting more and more worried—this did not really ring true. A major ASIC vendor is shipping product with visible scars on the top insulating layer, and had been for months? Even after we complained of problems, they didn’t find this? And even if this was correct, how was it tied to ESD failures?
Then, the consulting company’s chief engineer pointed dramatically at a scar near a large structure, and said, “You can see how close this damage is to that output transistor.” Output transistor? It was an alignment mark! Anyone who had worked with die at this level would immediately recognize the marks used to match up the layers in the semiconductor process. We were wasting our time with these consultants. Later I found they’d used a combination of sulfuric acid and water to prepare the chip, a combination that caused the scarring. (Water on top of H2SO4 is bad.)
I went back and called a friend, who had access to similar equipment at his job. After hours, I went over and we decapped the chip—the right way—and looked at it under his microscope.
Now the problem became obvious. Any CMOS gate tied to an input pin needs an ESD protection structure. The failing input did have such a structure, but it was on the wrong side of the input gate. The order should have been bond pad and ESD structure, then the CMOS input. Instead, they had the bond pad and CMOS input, then the ESD structure. Under the amazingly fast rise times of ESD events, the CMOS input had time to blow before the event could trigger the ESD structure downstream!
This was the equivalent of putting the airbag behind the driver, so that the driver’s body could protect the airbag from the crash.
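For anyone who likes to see the ordering rule spelled out, here is a minimal sketch in Python. It is only a toy: the element names and the `gate_is_protected` helper are made up for illustration, and it checks nothing but the order of elements along the signal path from the bond pad inward. It does not model rise times, trigger voltages, or anything else about the real silicon.

```python
from typing import Sequence

# Hypothetical labels for the elements on one input's signal path.
PAD = "bond_pad"
CLAMP = "esd_structure"
GATE = "cmos_input"


def gate_is_protected(path: Sequence[str]) -> bool:
    """Return True if an ESD structure is reached before the first CMOS input.

    `path` lists the elements in the order the ESD pulse meets them,
    starting at the bond pad.
    """
    for element in path:
        if element == CLAMP:
            return True   # the pulse meets the protection first
        if element == GATE:
            return False  # the pulse meets the bare gate first
    return True  # no CMOS input on this path, so nothing to protect


intended = [PAD, CLAMP, GATE]   # pad, ESD structure, then the input
as_built = [PAD, GATE, CLAMP]   # pad, input, then the ESD structure

print(gate_is_protected(intended))  # True
print(gate_is_protected(as_built))  # False
```

The second path is the one we saw under the microscope: the protection was there, just one position too late to do any good.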
That afternoon, Joe and I called our engineering contact at the ASIC vendor to tell him we’d found the problem. His response still makes my knuckles go white: “Oh. That. We have a fix in fab, samples should be ready in two weeks.” They had known about the problem for five months by that time. They knew before we started production. And they knew during every weekly conference call. Excuse me, I have to go hit something…
Anyway, we assembled the evidence for our customer, who by this time was furious about the millions of dollars in costs for the returns. We were able to show that the ASIC our customer designed was defective due to the mistake by the ASIC vendor.
So—of course!—we ended up paying for the recall!
If you’re early in your career, and this has you shaking your head about our profession, don’t sweat it. “Joe” and I have done well in our careers, and in fact he owns his own business. Solve the problem and move on. There will always be another one.