SAN MATEO, Calif. - Networking equipment is growing increasingly susceptible to soft errors (nonrecoverable, temporary misfires that can play havoc with things like traffic destinations) as chip and systems designers pile on SRAM to boost performance. To keep the problem at bay, memory experts are urging designers to beef up their error correction and system reliability mechanisms.
The electronics industry has devised defenses against soft errors, but many say they expect the rate to worsen as memory makers continue to shrink line widths and scale down voltage. And SRAM makers could exacerbate the problem by packing more bits on a chip and cycling the memory core more quickly.
In a PC, soft errors are eclipsed by more common software bugs and may pass unnoticed in something like a graphical display. But networking equipment is much less forgiving.
Routing is the big problem, said Thomas Pawlowski, a senior fellow at memory manufacturer Micron Technology (Boise, Idaho). An uncorrected soft error "could send a packet to Cleveland that was supposed to go to Los Angeles," he said. "It's lost traffic."
Memory experts have witnessed the problem before in DRAMs. What makes the current situation more worrisome, they say, is that few chip or systems designers know that SRAMs are vulnerable to soft errors even as they pack in more of the memory bits, making it less likely they will design in adequate error correction or bit parity.
Russ Lange, vice president of IBM's technology group, said IBM provides a tutorial and database of soft error rate (SER) models for its SRAM customers, but often designers don't realize beforehand that they need to take precautionary steps. "It doesn't occur to many of our customers, to be honest," he said. "If your system doesn't anticipate it and you're exposed to this phenomenon it's going to be a problem."
Sun Microsystems learned the hard way several years ago, said David Yen, vice president and general manager of Sun's processor group and former head of its integrated-products group. Sun at times found itself at odds with server customers over problems that it only later learned were attributable to soft errors. "As a vendor we couldn't tell the customer the reason [initially] and everyone would get upset," he said. "It's been a lesson to us all. We have to look at components from the perspective that they're not 100 percent reliable."
Server makers have since made strong error correction part of their designs from the outset. But networking OEMs are just starting to notice the effects of soft errors, observers said. "The awareness has not been very great," said Micron Technology's Pawlowski. "I do technical seminars all over the planet and everywhere I go I always bring up SER."
Soft errors occur when charged particles penetrate a memory cell and cross a junction, creating an aberrant charge that changes the state of the bit. Among the most common sources of soft errors are alpha particles emitted by contaminants in memory chip packages or cosmic rays penetrating the earth's atmosphere.
In the 1970s, researchers began to notice that soft errors were occurring frequently in DRAMs, and traced many of them to the packaging. Since then, DRAM vendors have moved to higher-quality packaging materials and coatings over the die, making the problem much less severe than it was.
But the same phenomenon is starting to affect SRAMs. Unlike capacitor-based DRAMs, SRAMs are cross-coupled devices that have far less capacitance in each cell. The lower the capacitance, the greater the likelihood that an alpha particle or cosmic ray will upset a bit if it strikes the right place. And as SRAM makers reduce the voltage with every process generation, that cell capacitance continues to go down, making the cell vulnerable to more types of particles.
A particle that deposits as little as 10 femtocoulombs of charge is enough to change the state of an SRAM cell today. Ten years ago it would have taken about five times as much. "There are lots of particles that are swimming around that can upset a cell," IBM's Lange said.
Impossible to stop
As few as two or three atoms of uranium or thorium contaminating a package are enough to flip a bit. The alpha particles these contaminants emit usually have a range of only 25 nanometers, Lange said, and can often be shielded by placing a plastic coating over the die.
But cosmic rays are almost impossible to stop. "They'll go through 5 feet of concrete without any trouble," Lange said. "As they pass through they can separate junction current flow for 5 ps [and cause a bit to flip]."
As the SRAM cells get smaller, so does the area of the junctions, making it less likely that a charged particle will disrupt an individual junction. But the trend is to use more SRAM bits both in standalone memory devices and as embedded memory, to reduce memory access latencies. That tends to increase exposure to charged particles.
This is especially apparent in network hardware. "SRAM usage goes up in networking," said Narayan Purohit, vice president of the memory division at Mitsubishi Electric and Electronics USA (Sunnyvale, Calif.). "Typically, networking apps use an array of SRAMs. You can have a mass of 144 of them. Potentially, it's a lot more susceptible to SER, and that's where immunity efforts come into play."
Designers of next-generation networks should start considering SER, some caution. In the next six months, SRAM vendors are expected to begin shipping a new breed of high-density 18-Mbit quad-data-rate parts based on 0.15- and 0.13-micron design rules that will go into these systems. "As you shrink the node capacitance you have to figure out a way to give it some immunity to soft errors in design and process development," Mitsubishi's Purohit said.
As SRAM usage and density rise, so does the internal speed. Faster speeds could increase soft error rates because memory cells are especially prone to error during read and write cycles.
"Everyone wants higher and higher bandwidth and lower access times. There's always the pressure to cycle the core faster and faster," Pawlowski said. "[Users] want us to be four times bigger and four times faster."
But building in full-fledged error correction code isn't practical because it will hurt performance. "We're killing ourselves to get SRAMs to clock at 333 MHz and the outputs are coming in at 1.5 clocks. If we did error correction, that might go to two clocks," he said. "We don't want to make that choice for them."
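The latency trade-off Pawlowski describes can be put in nanoseconds with a short calculation. This is only a sketch: it assumes "1.5 clocks" and "two clocks" refer to output latency measured in cycles of the 333-MHz clock, and the function names are illustrative, not taken from any vendor's tools.

```python
def clock_period_ns(freq_hz: float) -> float:
    """Length of one clock cycle in nanoseconds."""
    return 1e9 / freq_hz


def latency_ns(freq_hz: float, clocks: float) -> float:
    """Latency in nanoseconds for a given number of clock cycles."""
    return clocks * clock_period_ns(freq_hz)


def relative_increase(before: float, after: float) -> float:
    """Fractional increase going from `before` to `after`."""
    return (after - before) / before


if __name__ == "__main__":
    freq = 333e6  # 333 MHz, the clock rate quoted in the article
    without_ecc = latency_ns(freq, 1.5)
    with_ecc = latency_ns(freq, 2.0)  # the "might go to two clocks" case
    print(f"without ECC: {without_ecc:.2f} ns")
    print(f"with ECC:    {with_ecc:.2f} ns")
    print(f"increase:    {relative_increase(without_ecc, with_ecc):.0%}")
```

Under those assumptions, the move from 1.5 to two clocks is roughly 4.5 ns versus 6.0 ns, a latency increase of about one-third.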
The soft error problem is also being felt by chip designers adding embedded SRAM to their high-performance processors or systems-on-chip, where SRAM can make up half the die area.
Virage Logic Corp. (Fremont, Calif.), which provides SRAM compilers to some of the leading networking-chip companies, has noticed the soft errors, particularly those caused by packaging, and is mapping a plan to lessen their effects. "It's starting to become an area of concern at 0.13 micron," said Vincent Ratford, vice president of marketing and business development. "In a cell phone someone might not notice the difference, but if it's a high-performance router and you're moving money around the world, you care a lot about that."
Some in the DRAM camp suggest that rising soft error rates in SRAMs argue for a switch to DRAM. Among them is Mosys Inc., whose one-transistor SRAM uses a multibank DRAM cell but touts SRAM speed. "The implication is that the bit lines are very short and that makes it less susceptible to soft error rates," said vice president and general manager Mark-Eric Jones. Jones said the failure-in-time (FIT) rate of Mosys' so-called 1T SRAM is below 1,000 and will stay that way down to 0.13 micron, while SRAMs are on track to hit 10,000 FITs at 0.15 micron. (One FIT represents one failure per 1 billion hours of device operation.)
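To give a feel for what these FIT figures mean in a system, the sketch below converts a per-device FIT rate into expected failures per year. It assumes failures are independent, so rates simply add across devices, and it borrows the 144-device array size that Mitsubishi's Purohit mentions; pairing that array with a 10,000-FIT part is an illustrative assumption, not a figure from any vendor.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years
FIT_HOURS = 1e9            # 1 FIT = one failure per billion device-hours


def expected_failures_per_year(fit_per_device: float, devices: int = 1) -> float:
    """Expected failures per year for `devices` parts, each rated at `fit_per_device`.

    Assumes independent failures, so the per-device rates add.
    """
    failures_per_hour = fit_per_device * devices / FIT_HOURS
    return failures_per_hour * HOURS_PER_YEAR


def mean_hours_between_failures(fit_per_device: float, devices: int = 1) -> float:
    """Average hours between failures across the whole array."""
    return FIT_HOURS / (fit_per_device * devices)


if __name__ == "__main__":
    for fit in (1_000, 10_000):
        per_year = expected_failures_per_year(fit, devices=144)
        days = mean_hours_between_failures(fit, devices=144) / 24
        print(f"{fit:>6} FIT x 144 devices: {per_year:.2f} failures/year, "
              f"one every {days:.0f} days on average")
```

On these assumptions, an array of 144 parts at 10,000 FIT would see about 12.6 soft errors a year, or one roughly every 29 days, while the same array at 1,000 FIT would see about 1.26 a year.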
IBM's Lange said the big difference in SER between DRAM and SRAM could persuade chip designers to use more DRAM. "It's now true that DRAMs are much more immune," he said. "We're finding apps where DRAM is taking over what SRAM used to do."
As an alternative to standalone SRAM, Fujitsu and Toshiba offer fast-cycle RAM, a DRAM that boasts low latency and an FIT rate of less than 1,000, Fujitsu said.
Pawlowski said designers should be wary of SRAMs with an FIT rating of 10,000 or more. "If you're in the low thousands, [customers say] that's fine," he said.
Though SRAM vendors say they will provide their FIT rate to customers who ask, few if any disclose their soft error rates openly on product data sheets. Vendors fear customers will hold them accountable for errors that are inherently so unpredictable. "People are going to be worried about liability. Proving a statistic is hard to do," Pawlowski said.
Indeed, soft errors, which occur randomly and cause no permanent damage to the memory device, are tough to track and test. An SRAM vendor may generate an FIT statistic from tests based on accelerated radiation sources, but a host of external variables must be considered. IBM has shown, for example, that at an altitude of 10,000 feet the SER can be 14 times higher than at sea level because of the greater exposure to cosmic rays.
No standard test
"There's no standardized test that I'm aware of which says, this is how you test for such a transient phenomenon under these radiation, altitude and packaging conditions," said IBM's Lange. "It could take two months and 2,000 parts in a special test fixture that you have to set up."
Mosys' Jones, however, challenged the memory vendors' policy of keeping error-rate information under wraps. "We all issue data sheets with lots of numbers. If there was more published on this it would help improve awareness," he said.
Several large memory makers say they are working to make their devices less susceptible to soft errors, or at least to keep the rate from ballooning out of control. Micron would not disclose its plans, but Pawlowski said the rate is improving even as supply voltages move from 3.3 to 1.8 volts.
Mitsubishi said it has moved to a stacked memory cell that triples the capacitance of the storage node, adjusted the p-well and switched to a low-alpha packaging material. "We're trying not just to maintain but to get better," Purohit said.
Still, observers said, there's no substitute for system designers building in error correction code or bit parity from the start. IBM, for one, has built systems with processors running in parallel that cross-check for errors. "There are degrees of system-level solutions that can and need to be invoked," Lange said.
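The difference between bit parity and error correction code can be shown with a small sketch. A single parity bit only detects a flipped bit; a Hamming code can also locate and repair it. The example uses a textbook Hamming(7,4) code on 4-bit values purely for illustration. It is not any vendor's scheme, real memory systems generally use much wider codes, and all function names here are hypothetical.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if `word` has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


def parity_ok(word: int, stored_parity: int) -> bool:
    """Detect (but not locate) a single flipped bit in `word`."""
    return parity_bit(word) == stored_parity


def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits as a 7-bit codeword; bit i holds position i + 1.

    Positions 1, 2 and 4 are parity bits; positions 3, 5, 6, 7 are data.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))


def hamming74_decode(codeword: int):
    """Return (nibble, error_position); position 0 means no error was seen.

    Corrects any single flipped bit. Two flipped bits would be miscorrected.
    """
    bits = [(codeword >> i) & 1 for i in range(7)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos  # XOR of set-bit positions is 0 for a valid codeword
    if syndrome:
        bits[syndrome - 1] ^= 1  # the syndrome names the flipped position
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, syndrome


if __name__ == "__main__":
    stored = hamming74_encode(0b1011)
    upset = stored ^ (1 << 4)  # simulate a soft error at position 5
    value, position = hamming74_decode(upset)
    print(f"recovered {value:04b}, corrected position {position}")
```

With parity alone, the system learns only that the word is bad and must refetch or discard it. With the correcting code, the read returns the original value and can report which bit was upset.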
If not, systems designers could find themselves at the mercy of the laws of physics. SRAM makers can try to limit soft errors, but they can't stop them from occurring, especially as more SRAM bits are loaded into systems and chips. "The truth is, I can't imagine anything we can do to actually have significant improvement in the situation," said Micron's Pawlowski. "I see it at best holding its own or getting worse."