LOS ANGELES - Desktop and notebook computers may need to adopt error-correcting code (ECC) memory to combat rising system crashes from single-bit memory errors, according to a confidential white paper written by Microsoft Corp. The software giant raised the issue in a panel discussion on memory at the Windows Hardware Engineering Conference here, although it admits its data on system failures is still inconclusive.
For about four years Microsoft has been collecting data through its Online Crash Analysis (OCA) tool, which reports system crashes to a Microsoft Web site. About 18 months ago it began sharing OCA data and the white paper with systems and chip makers. According to one source, the report said single-bit errors in DRAM are now among the top ten causes of system failures.
Microsoft admits the data is still inconclusive because OCA does not provide enough detail about the types of systems that crash and the memory they use. As it tries to improve the tool, Microsoft is asking OEMs to help provide more data and to consider ECC memory in desktops and notebooks.
Today ECC memory is widely used in PC servers. But so far desktop, notebook and many chip makers have resisted the move because it would add costs in the form of extra DRAM chips on a module and upgraded memory controllers in chip sets.
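To make the mechanism concrete, here is a minimal sketch of how an error-correcting code repairs a single flipped bit. It uses the textbook Hamming(7,4) code, chosen only because it is small enough to read; ECC memory modules typically protect 64 data bits with 8 check bits, which is why they carry extra DRAM chips. The function names `encode` and `decode` are illustrative and do not come from any memory controller or from Microsoft's paper.

```python
def encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword (list of 0/1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 unused; positions 1..7
    code[3], code[5], code[6], code[7] = d
    # Each parity bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def decode(bits):
    """Return (data nibble, syndrome); syndrome is 0 if no error was seen,
    otherwise the 1-based position of the bit that was corrected."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of positions of all set bits
    if syndrome:
        code[syndrome] ^= 1             # flip the faulty bit back
    d = [code[3], code[5], code[6], code[7]]
    return sum(b << i for i, b in enumerate(d)), syndrome


# Simulate a soft error: one stored bit flips, then is corrected on read.
word = encode(0b1011)
word[4] ^= 1                            # corrupt position 5 (index 4)
value, fixed_at = decode(word)
print(value == 0b1011, fixed_at)
```

Without the three parity bits, the flipped bit would be read back as valid data. That silent corruption is the failure mode the panel discussed.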
Some system makers in the audience at the WinHEC panel expressed support for a move to ECC, but DRAM makers on the panel were still skeptical.
"I think the problem is significant," said Jeff Galloway, an engineer in Hewlett-Packard's x86 server group. Microsoft has shown him data on HP crashes that appeared to come from single-bit DRAM errors and were all on systems not running a Windows Server operating system, he added.
"The industry needs to do something about this," Galloway said. "Microsoft got ECC into servers by requiring it for a Windows Server logo, and I think they should do the same thing for desktop and notebooks now," he added.
"This kind of forum is one way we can engage OEMs in what we should do going forward," said Son VoBa, a principal program manager in Microsoft's Windows Server group who led the panel discussion. "ECC may be only one way to address the problem," he added.
The single-bit errors are typically traced to the effects of neutron radiation, so-called cosmic rays, bombarding individual capacitors in a DRAM and changing their charge state. DRAM makers say that effect has actually been diminishing over time and the errors could have come from a variety of sources including chip sets.
"We have seen reductions [in soft error rates] with each of the last several process technology generations," said Dean Klein, vice president of market development for Micron.
DRAM makers, including Samsung and Qimonda, also note that SDRAM and DDR1 memories provided ECC capabilities that notebooks and desktops did not use. Thus when the standard was set for today's DDR2 memories, engineers eliminated ECC to save costs associated with the unused feature.
One memory maker suggested a better approach would be to create a retry facility in the DDR4 interface standard now in the works. A Samsung spokesman said the DDR4 group is in the early stages of discussing a feature for monitoring the memory I/O interface.
Peter Glaskowsky, an analyst with Envisioneering (Seaford, NY), said Microsoft pushed for adopting ECC to combat soft errors in the mid-1990s, but OEMs resisted. They refused to take on the costs of the shift, making the case that more crashes were caused by Windows failures than by DRAM soft errors.
Now that the Windows operating system is becoming more stable, it makes sense that the company would re-open the issue. However, it is unclear whether the soft errors have become significant enough to convince OEMs to change this time, he added.