I was asked to help out a co-worker, “Joe”, a less-experienced engineer, with an ESD problem. Highly appropriate, after all, I had five whole years of work under my belt.
The product was only a computer mouse. But we were selling tons of these to our best customer, one of the biggest computer makers in the world at the time. The production line had been down for about a week when I joined the effort, so management was…let’s say, “perturbed”.
The mouse had shipped for months during the relatively wet spring. But the drier weather of summer led to more static events, and defective mice were coming back faster than we could ship out replacements.
The mice passed the basic static zap test, but we played with the ground method a bit and lo-and-behold, they were blowing up in the lab just like in the field.
The mouse included an ASIC designed by our customer, the computer maker. “Use this,” they told us (the “…or die” part was assumed.) So the mouse was essentially a build-to-print job for our factory. Still, we were responsible for the design performance because…well, our customer assured us we were responsible, and Sales agreed.
So there we were, holding daily internal meetings on our progress, mice exploding right and left, ankle deep in returns…and no answers.
The first clue came quickly enough. Of course, the ASIC was failing, and we found a failed input pin. But the chip passed the version of MIL-STD-883 that was in use at the time, and that test was the gold standard for chip-level ESD robustness. So the theory was that our board layout was faulty. At least, that was Sales’ theory. (Sales was really quite helpful. Really.)
We (mostly Joe, to be honest) tried all kinds of things to reinforce the board. We knew that adding filters and caps was not dealing with the issue at the source—but the ASIC passed and re-passed the MIL-STD-883 test, and there was nothing obviously wrong with the schematic otherwise…
After the initial panic of the first week, another week went by…followed by more weeks, then months. I cannot describe the ongoing hell this high-priority, hugely expensive issue turned into. This thing simply would not die. We pretty much knew there was something wrong with the ASIC’s ESD performance, but had no proof. We were having conference calls on a weekly basis with the ASIC vendor, with no result.
Then, a savior appeared. Our QC manager, who was qualified by dint of previous experience as a dental hygienist (don’t ask), hired an outside consulting firm to analyze the ASIC. These experts de-capped the package and took hi-res photos through a microscope.
Their images showed tiny craters all over the die, as if the chip had been sandblasted. Their verdict was that the ASIC vendor had “a process problem”.
As the consultants were explaining this, I was getting more and more worried. This did not really ring true. A major ASIC vendor was shipping product with visible scars on the top insulating layer, and had been for months? Even after we complained of problems, they hadn’t found this? And even if this was correct, how was it tied to ESD failures?
Then, the consulting company’s chief engineer pointed dramatically at a scar near a large structure, and said, “You can see how close this damage is to that output transistor.” Output transistor? It was an alignment mark! Anyone who had worked with die at this level would immediately recognize the marks used to match up the layers in the semiconductor process. We were wasting our time with these consultants. Later I found they’d used a combination of sulfuric acid and water to prepare the chip, a combination that caused the scarring. (Water on top of H2SO4 is bad.)
I went back and called a friend, who had access to similar equipment at his job. After hours, I went over and we decapped the chip—the right way—and looked at it under his microscope.
Now the problem became obvious. Any CMOS gate tied to an input pin needs an ESD protection structure. The failing input did have such a structure, but it was on the wrong side of the input gate. The order should have been bond pad and ESD structure, then the CMOS input. Instead, they had the bond pad and CMOS input, then the ESD structure. Under the amazingly fast rise times of ESD events, the CMOS input had time to blow before the event could trigger the ESD structure downstream!
This was the equivalent of putting the airbag behind the driver, so that the driver’s body could protect the airbag from the crash.
That afternoon, Joe and I called our engineering contact at the ASIC vendor to tell him we’d found the problem. His response still makes my knuckles go white: “Oh. That. We have a fix in fab, samples should be ready in two weeks.” They had known about the problem for 5 months by that time. They knew before we started production. And they knew, during every weekly conference call. Excuse me, I have to go hit something…
Anyway, we assembled the evidence for our customer, who by this time was furious about the millions of dollars in costs for the returns. We were able to show that the ASIC our customer designed was defective due to the mistake by the ASIC vendor.
So—of course!—we ended up paying for the recall!
If you’re early in your career, and this has you shaking your head about our profession, don’t sweat. "Joe" and I have done well in our careers, and in fact he owns his own business. Solve the problem and move on—there will always be another one.
Many years ago (at a company I am no longer with) the salesheads pressured engineering into a product modification, which they then sold to customers prior to a repeat of RF emissions testing to FCC Part 15. My own project had been suddenly canceled, and I got pulled into the mess when the subsequent RF emissions testing revealed that the modified product vastly exceeded allowable FCC radiated limits.
After I did much physical and mechanical redesign and got the product to pass FCC testing, the response from sales was "How do we explain to the customers why they need to make this change? Can't you fix the problem without having to make all these changes?"
To which I basically replied "That's YOUR problem. YOU decided to sell the modified product prior to emissions testing."
The fact is that while the customer is NOT always right, THEY ARE the ones with the money. I have shown a customer that the "build to print" they supplied would not work, and that customer was grateful, responding with a PO to make the design work. But not all sales weasels are as cooperative as they were at that employer.
A coworker came up with a complement to the 5 Why process: the "5 Who" process, used to determine "Who will pay?" First it got a laugh, and then we realized that, more often than not, it reflected reality.
I know you probably won't name the ASIC company, but the flippant response they gave you really irks me. If I hear that someone has had problems with, say, a certain appliance manufacturer or carmaker, that company will never get my business. The ASIC company was at fault and should have owned up to it right away. Anything less is reason to go somewhere else.
We're totally on the same page. My reaction was, "Never again." Their ASIC business closed, but they still make other kinds of chips. I don't flat-out refuse to consider their product, but on the other hand, if there is an alternative, I do sort of lean that way.
Good question. The old 883C spec required testing with a fairly slow rise time. Actual air discharge ESD events have very fast rise times.
With the ESD protection structure on the wrong side of the input structure, a slow event could still safely discharge through the ESD structure. A really fast event (fast rise time) required a large current flow to discharge its energy in a short time.
The poor input gate was stuck between the bonding pad and the ESD structure. So the gate saw high voltage on the trace (between the pad and the ESD structure). That was enough to rupture the gate dielectric.
Not long after we learned all of this, MIL-STD-883C was replaced with -883D, which had a much faster rise time.
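To put rough numbers on why the edge speed matters, here is a back-of-the-envelope sketch. It is not a model of the actual ASIC: the charge voltage, the series resistor, and especially the parasitic inductance of the path between the pad and the ESD structure are assumed values chosen only for illustration, and the function names are made up for this example. It treats the current edge as a linear ramp and estimates the extra voltage from V = L * dI/dt.

```python
# Illustrative only: every component value here is an assumption,
# not a measurement from the mouse or the ASIC in the story.

def peak_current(v_charge, r_series):
    """Peak discharge current (A) of a charged capacitor through a series resistor."""
    return v_charge / r_series


def inductive_voltage(i_peak, t_rise, l_path):
    """Rough V = L * dI/dt (volts), treating the current edge as a linear ramp."""
    return l_path * i_peak / t_rise


V_CHARGE = 2000.0  # volts on the charged body (assumed)
R_SERIES = 1500.0  # ohms, a commonly quoted human-body-model series resistor
L_PATH = 5e-9      # henries, assumed parasitic inductance between pad and clamp

i_peak = peak_current(V_CHARGE, R_SERIES)
for label, t_rise in (("slow edge, 10 ns", 10e-9), ("fast edge, 1 ns", 1e-9)):
    extra = inductive_voltage(i_peak, t_rise, L_PATH)
    print(f"{label}: about {extra:.2f} V extra across the path")
```

With the same current and the same path, an edge that is ten times faster produces ten times the extra voltage, and that voltage lands on whatever sits between the pad and the clamp. The sketch leaves out the clamp's own turn-on delay, which in a real part would make the fast case worse still.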
Could you show the difference between the old 883C and 883D? I searched MIL-STD-883C, 883E, and 883F, and they all say the rise time of the current waveform is less than 10 nanoseconds. I can't find the difference between them.
Hardware design guys are not without fault. I have a current design to build in which "a small error" made when the substrate was laid out has resulted in multiple capacitor insertions acting as leapfrogging pads for bonding wires and, in one part, a wire bond running over the top of the IC to the correct trace.
At another employer, several years ago, the same type of situation resulted in power jumping over ground on one of the devices. Things like that just don't make for a robust design, and when this was pointed out, sales and the design guy both responded with "We'll fix it on the next rev. Work around it for now."
Absolutely. On a similar note, at one company I found out that the previous year's biggest issue wasn't fixed in the new model year. Why? Because Sales was able to hit quota anyway! This was actually a massive effort by Sales to use personal relationships, incentives, discounts, freebies--anything they could pull out of their bag of tricks to get customers to accept the previous year model with the issue. So the overseas group, noting that "sales wasn't impacted", didn't fix it. Just goes to show, things can work quite smoothly until humans get involved.