I was asked to help out a co-worker, “Joe”, a less-experienced engineer, with an ESD problem. Highly appropriate; after all, I had five whole years of work under my belt.
The product was only a computer mouse. But we were selling tons of these to our best customer, one of the biggest computer makers in the world at the time. The production line was down for about a week when I joined the effort, so management was…let’s say, “perturbed”.
The mouse had shipped for months during the relatively wet spring. But the drier weather of summer led to more static events, and defective mice were coming back faster than we could ship out replacements.
The mice passed the basic static zap test, but we played with the ground method a bit and, lo and behold, they were blowing up in the lab just like in the field.
The mouse included an ASIC designed by our customer, the computer maker. “Use this,” they told us (the “…or die” part was assumed.) So the mouse was essentially a build-to-print job for our factory. Still, we were responsible for the design performance because…well, our customer assured us we were responsible, and Sales agreed.
So there we were, holding daily internal meetings on our progress, mice exploding right and left, ankle deep in returns…and no answers.
The first clue came quickly enough. Of course, the ASIC was failing, and we found a failed input pin. But the chip passed the version of MIL-STD-883 that was in use at the time, and that test was the gold standard for chip-level ESD robustness. So the theory was that our board layout was faulty. At least, that was Sales’ theory. (Sales was really quite helpful. Really.)
We (mostly Joe, to be honest) tried all kinds of things to reinforce the board. We knew that adding filters and caps was not dealing with the issue at the source—but the ASIC passed and re-passed the MIL-STD-883 test, and there was nothing obviously wrong with the schematic otherwise…
After the initial panic of the first week, another week went by…followed by more weeks, then months. I cannot describe the ongoing hell this high-priority, hugely expensive issue turned into. This thing simply would not die. We pretty much knew there was something wrong with the ASIC’s ESD performance, but had no proof. We were having conference calls on a weekly basis with the ASIC vendor, with no result.
Then, a savior appeared. Our QC manager, who was qualified by dint of previous experience as a dental hygienist (don’t ask), hired an outside consulting firm to analyze the ASIC. These experts de-capped the package and took hi-res photos through a microscope.
Their images showed tiny craters all over the die, as if the chip had been sandblasted. Their verdict was that the ASIC vendor had “a process problem”.
As the consultants were explaining this, I was getting more and more worried—this did not really ring true. A major ASIC vendor is shipping product with visible scars on the top insulating layer, and had been for months? Even after we complained of problems, they didn’t find this? And even if this was correct, how was it tied to ESD failures?
Then, the consulting company’s chief engineer pointed dramatically at a scar near a large structure, and said, “You can see how close this damage is to that output transistor.” Output transistor? It was an alignment mark! Anyone who had worked with die at this level would immediately recognize the marks used to match up the layers in the semiconductor process. We were wasting our time with these consultants. Later I found they’d used a combination of sulfuric acid and water to prepare the chip, a combination that caused the scarring. (Water on top of H2SO4 is bad.)
I went back and called a friend, who had access to similar equipment at his job. After hours, I went over and we decapped the chip—the right way—and looked at it under his microscope.
Now the problem became obvious. Any CMOS gate tied to an input pin needs an ESD protection structure. The failing input did have such a structure, but it was on the wrong side of the input gate. The order should have been bond pad and ESD structure, then the CMOS input. Instead, they had the bond pad and CMOS input, then the ESD structure. Under the amazingly fast rise times of ESD events, the CMOS input had time to blow before the event could trigger the ESD structure downstream!
This was the equivalent of putting the airbag behind the driver, so that the driver’s body could protect the airbag from the crash.
That afternoon, Joe and I called our engineering contact at the ASIC vendor to tell him we’d found the problem. His response still makes my knuckles go white: “Oh. That. We have a fix in fab, samples should be ready in two weeks.” They had known about the problem for 5 months by that time. They knew before we started production. And they knew, during every weekly conference call. Excuse me, I have to go hit something…
Anyway, we assembled the evidence for our customer, who by this time was furious about the millions of dollars in costs for the returns. We were able to show that the ASIC our customer designed was defective due to the mistake by the ASIC vendor.
So—of course!—we ended up paying for the recall!
If you’re early in your career, and this has you shaking your head about our profession, don’t sweat. "Joe" and I have done well in our careers, and in fact he owns his own business. Solve the problem and move on—there will always be another one.
Could you explain the difference between the old 883C and 883D? I looked up MIL-STD-883C, 883E, and 883F, and they all say the rise time of the current waveform is less than 10 nanoseconds. I can't find the difference between them.
Absolutely. On a similar note, at one company I found out that the previous year's biggest issue wasn't fixed in the new model year. Why? Because Sales was able to hit quota anyway! This was actually a massive effort by Sales to use personal relationships, incentives, discounts, freebies--anything they could pull out of their bag of tricks to get customers to accept the previous year model with the issue. So the overseas group, noting that "sales wasn't impacted", didn't fix it. Just goes to show, things can work quite smoothly until humans get involved.
Hardware design guys are not without fault. I have a current design to build in which, due to "a small error" when the substrate was laid out, multiple capacitor insertions act as leapfrogging pads for bonding wires, and in one spot a wire bond runs over the top of the IC to reach the correct trace.
At another employer, several years ago, the same type of situation resulted in power jumping over ground on one of the devices. Things like that just don't make for a robust design, and when it was pointed out, sales and the design guy both responded with, "We'll fix it on the next rev. Work around it for now."
Good question. The old 883C spec required testing with a fairly slow rise time. Actual air discharge ESD events have very fast rise times.
With the ESD protection structure on the wrong side of the input structure, a slow event could still safely discharge through the ESD structure. A really fast event (fast rise time) required a large current flow to discharge its energy in a short time.
The poor input gate was stuck between the bonding pad and the ESD structure. So the gate saw high voltage on the trace (between the pad and the ESD structure). That was enough to rupture the gate dielectric.
Not long after we learned all of this, MIL-STD-883C was replaced with -883D, which specified a much faster rise time.
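For anyone who wants to put rough numbers on the rise-time argument above, here is a minimal sketch. It is not the analysis we did at the time. It assumes a simple first-order model in which the trace between the bond pad and the ESD structure behaves as a small series inductance plus resistance, and the discharge current ramps linearly to its peak. The function name, the 1.33 A peak current, the 5 nH and 5 ohm trace values, and both rise times are hypothetical numbers chosen only for illustration.

```python
def trace_voltage(peak_current_a, rise_time_s, inductance_h, resistance_ohm):
    """Estimate the extra voltage across the pad-to-clamp trace.

    Assumed first-order model (not from the original failure analysis):
    the current ramps linearly from 0 to peak_current_a in rise_time_s,
    so di/dt = peak / rise_time, and V = L * di/dt + I * R.
    """
    di_dt = peak_current_a / rise_time_s
    return inductance_h * di_dt + peak_current_a * resistance_ohm


if __name__ == "__main__":
    # All values below are hypothetical, for illustration only.
    peak_a = 1.33      # assumed peak discharge current, amps
    trace_l = 5e-9     # assumed trace inductance, henries
    trace_r = 5.0      # assumed trace resistance, ohms

    slow = trace_voltage(peak_a, 10e-9, trace_l, trace_r)   # slow rise time
    fast = trace_voltage(peak_a, 0.5e-9, trace_l, trace_r)  # fast rise time

    print(f"slow event: {slow:.1f} V of extra stress at the gate")
    print(f"fast event: {fast:.1f} V of extra stress at the gate")
```

The resistive term is the same for both events. Only the inductive term grows as the rise time shrinks, which is consistent with a part that passes a slow-rise-time test yet fails when hit with a fast real-world discharge.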