SANTA CLARA, Calif. – Some more details of a design error in a companion chip to the Sandy Bridge processor – and the fix being implemented – have emerged in a conference call held by Intel with financial analysts to discuss the issue and the impact on Intel's revenues and margins.
The chip, known as Series 6 or Cougar Point, passed rigorous functional testing performed by both Intel and its OEMs but nonetheless there is a problem which can show up in a low percentage of chips, according to Steve Smith, vice president of PC client operation enabling, speaking on the call.
Smith said his best estimate was that a single-digit percentage of the chips, about 5 percent, have the potential to be affected over the typical 3-year life of a notebook computer. And the error would manifest itself with up to 4 of 6 serial-ATA channels being degraded in performance or failing altogether.
Systems including the chips only started shipping to consumers on Jan. 9 and there are no known reported failures in the field. Nonetheless Intel has suspended shipments of the chip while it brings up a corrected design and will provide replacements and support to affected parties.
"The root cause is a design oversight, if you will, and all we needed to do was make a metal change to configure that circuit back to a robust operating mode. And it's on one of the later layers of metal so we actually can utilize all the chipset pipeline that has been there and is there in the fab right now," said Smith.
Because the chipset is built in a relatively mature 65-nm process, Smith said there is confidence that the corrected chip can ramp up production quickly.
The testing Intel did was not that good. I would argue Intel should do a much better job on its chipsets in general, and that Intel management and marketing are now regretting their poor decision making. Intel's problem in this case is quite serious. Nobody is going to like potential data corruption on their hard drives. I am quite surprised Intel did not announce a full recall.
The problem was Intel took LVT cells in the clock tree for the 3G SATA controller and biased the substrate of them further for speed in revision B silicon. The revision A silicon did not have this issue. Another related issue is Intel should have had the Z68 chipset ready for the launch of Sandy Bridge. Much of this information is already in the public domain on sites such as Anandtech. Overall, I am fairly disappointed by Intel.
More than likely the problem was related to a lithography/etch margin issue at that metal level. Problems like these only happen when multiple factors such as focus, reflectivity, and planarization drift, causing notching or thinning of the metal trace width.
Design verification would not catch such a low-probability event. This margin is maintained/eliminated by good fab process tool controls. Still, they would tweak the mask to add as much margin as possible. Then, perhaps, add another design rule.
My experience with TI is frustrating me at the moment (Chipcon part). I believe I have found a reliability issue, but they are giving me the runaround, suggesting it is caused by silly things like bad joints when inspection, as well as the mode of the failure, clearly indicates otherwise.
TI is a very diverse company though, so I imagine the response would vary depending on the group you are dealing with.
Both of these statements can't be true!!!
Intel mentioned that after it had built over 100,000 chipsets it started to get some complaints from its customers about failures.
Intel expects that over 3 years of use it would see a failure rate of approximately 5-15% depending on usage model. Remember this problem isn't a functional issue but rather one of those nasty statistical issues, so by nature it should take time to show up in large numbers (at the same time there should still be some very isolated incidents of failure early on).
Thanks Tom Mariner,
"If they claim it ain't the silicon, I'm looking elsewhere". A new classic quote!
I assume the writer means that if the supplier doesn't admit there's a problem with the silicon, the customer should look elsewhere for a better, more honest, chip supplier.
Since it is only degradation, may not be e-migration. Anyone remember the "Fast Cadillac" reliability problem with a small percentage of Delco's first cruise control chips? Cause was a mask defect on a contact print mask.
There seems to be a grand tradition in the chip design world of fessing up to your boo boos. Possibly because in the future when you say it is not in your section of the IC, you will be believed.
Once found a problem in earlier layers of a TI DSP chip -- it seems as though no one had written software that used the entire chip at once in the three years it had been released. (If I don't give my company / customer the best the hardware will do, it leaves an opening for a competitor to them, and I don't let my customers lose!) They could have pointed the finger at me for a firmware glitch, but instead thanked me in front of my customer and put the fix into a wafer partially done to get the revised parts out in record time.
Class tells -- and in both the Intel and TI cases, it tells me that if they claim it ain't the silicon, I'm looking elsewhere.
Reliability issues are tough to catch unless there are significant design reviews. It can be easy for large teams to assume someone else has checked this or that. These issues can sneak up on and infect the best of teams. Electromigration and/or NBTI are my best guess as to what they are dealing with, but we may not know the details for a while. Those are tricky, and many of the tools will not adequately predict the outcome.
The previous instance when Intel had this kind of bug was in 1994 (the infamous FPU bug). I guess Intel has learned its lesson and didn't want to take a chance this time around. Hence they are taking the necessary steps rather than ignoring the bug.