United Business Media EE Times


Hot-plug memory curtails server downtime

EE Times


Server memory footprints have increased steadily, now reaching capacities of 64 Gbytes in a single server, and entire databases can be run out of memory, providing near-instantaneous response times. But memory, like other storage devices, is inherently susceptible to data errors, so systems that deploy larger memories must also maintain the reliability of that memory.

Understanding the downtime that a memory failure can cause, our engineers developed a suite of memory protection architectures. One of our objectives in our next-generation server architecture and development efforts was to innovate in the main memory subsystem to ensure reliability, serviceability and scalability. To that end, we have looked to disk drive technology for inspiration, creating in silicon-based DRAM the equivalent of the following functions: online spare memory, hot-plug mirrored memory and hot-plug RAID memory.

Errors occurring in memory are commonly divided into two classes: single-bit and multibit. As the names imply, single-bit errors are those that affect only a single data bit within a data word; multibit errors affect more than one bit of data. This distinction is important because single-bit errors are much more prevalent than multibit ones, and because single-bit errors are easily corrected by common memory error-detection and -correction algorithms.

Large memory subsystems, such as main memory in a server, are commonly protected from memory errors by use of an error detection and correction protocol. Parity, commonly used in desktop computers and some entry-level servers, can detect some errors but has no correction capability. Eight-bit error-correction-code, or ECC, algorithms, which are commonly employed in server main memory subsystems, provide not only detection, but also correction of the most common error conditions: single-bit errors. When an error occurs that cannot be fixed by the memory subsystem — such as a multibit error — the server will experience an uncorrectable data error and the operating system will crash.
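The split between correctable single-bit errors and uncorrectable multibit errors can be illustrated with a toy code. The sketch below is an extended Hamming code over 4 data bits, not the 8-bit ECC applied to server data words, and the function names `encode` and `decode` are illustrative assumptions. It shows the same behavior on a small scale: any single flipped bit is located and corrected, while any two flipped bits are detected but cannot be repaired.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit codeword (extended Hamming, SECDED)."""
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3] = (nibble >> 0) & 1         # data bits sit at positions 3, 5, 6, 7
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]   # check bit for positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # check bit for positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # check bit for positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes total parity of all 8 bits even
    return sum(bit << i for i, bit in enumerate(code))


def decode(word: int):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions holding a 1
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        code[syndrome] ^= 1             # single-bit error: syndrome is its position
        status = "corrected"
    else:
        return None, "uncorrectable"    # even parity, nonzero syndrome: two bits flipped
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

In a server, the "uncorrectable" outcome is the case that brings the operating system down.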

But the number of memory errors increases with memory capacity. A percentage of these errors will be multibit errors that ECC cannot correct, so the potential for a failure in ECC systems also increases with memory capacity. Small memory subsystems, common in a desktop PC, have very low failure rates and may employ parity. A 1-Gbyte memory subsystem, common in a server today, utilizing an ECC algorithm with single-bit error correction, would have an annualized failure rate of about three percent. However, a complex server system with a 16-Gbyte memory subsystem would have an annualized failure rate of nearly 50 percent. Clearly, utilizing a standard 8-bit ECC code for memory protection in a server that has several gigabytes of main memory does not provide the reliability desired in a server running a critical application.
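The two figures quoted above are consistent with a failure rate that grows roughly in proportion to capacity. The short sketch below is an assumption about how such numbers scale, not the model used to produce them. It compares simple linear scaling with a model in which each gigabyte fails independently at the 1-Gbyte rate.

```python
def afr_linear(afr_per_gbyte: float, gbytes: int) -> float:
    """Annualized failure rate if risk simply adds up per Gbyte (capped at 100%)."""
    return min(1.0, afr_per_gbyte * gbytes)


def afr_independent(afr_per_gbyte: float, gbytes: int) -> float:
    """Annualized failure rate if each Gbyte fails independently of the others."""
    return 1.0 - (1.0 - afr_per_gbyte) ** gbytes


if __name__ == "__main__":
    for size in (1, 4, 16):
        print(size, round(afr_linear(0.03, size), 3), round(afr_independent(0.03, size), 3))
```

Starting from three percent at 1 Gbyte, linear scaling gives 48 percent at 16 Gbytes, close to the figure cited, while the independent-failure model gives roughly 39 percent. Either way, the risk at large capacities is far too high for a critical application.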

Online spare memory allows one bank of memory to be designated as a spare while the remaining banks serve as system memory. If a DIMM within the system memory begins to show a rising number of single-bit errors, which is typical when memory is about to fail, the system fails over to the online spare bank, avoiding unscheduled downtime caused by a memory failure. The failover is performed by software, which copies the data out of the bank containing the faulty memory device into the online spare bank. This maintains server availability and memory reliability without service intervention. The DIMM that exceeded the error threshold can be replaced at the customer's convenience during a scheduled shutdown, saving the expense of network downtime and rushed service.
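The failover logic can be sketched in a few lines. Everything named here is hypothetical (`Bank`, `OnlineSpareController`, the threshold of 10 corrected errors); the article does not give the actual threshold or software interface. The point is only the sequence: count corrected errors per bank, and once a bank crosses the threshold, copy its contents to the spare and retire it.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)                    # compare banks by identity, not contents
class Bank:
    name: str
    data: dict = field(default_factory=dict)   # address -> stored value
    corrected_errors: int = 0


class OnlineSpareController:
    def __init__(self, active, spare, threshold=10):
        self.active = list(active)      # banks currently used as system memory
        self.spare = spare              # bank held in reserve, or None once used
        self.threshold = threshold      # assumed value, for illustration only
        self.retired = []               # banks awaiting replacement

    def record_corrected_error(self, bank):
        """Called each time ECC corrects a single-bit error in `bank`."""
        bank.corrected_errors += 1
        crossed = bank.corrected_errors >= self.threshold
        if crossed and self.spare is not None and bank in self.active:
            self._fail_over(bank)

    def _fail_over(self, bank):
        spare, self.spare = self.spare, None
        spare.data = dict(bank.data)                    # copy contents to the spare
        self.active[self.active.index(bank)] = spare    # spare takes the bank's place
        self.retired.append(bank)                       # replace at next scheduled shutdown
```

Note that this sketch holds a single spare, so a second bank crossing the threshold stays in service until the first is replaced.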

Compaq's engineers are working to make it possible to use hot-plug mirrored memory in the company's 500 series ProLiant servers. This will permit customers to hot-plug defective memory out of a server and replace it with good memory — with no downtime. Future enhancements to this technology will make it possible for customers to actually hot-add memory as their business grows without taking the server down.

When a server is in hot-plug mirrored memory mode, data is written to two groups of industry-standard DIMMs. Data is read from one group of DIMMs while the other group contains a mirrored copy of the data. If a read error is encountered in a DIMM, or if the DIMM reaches a pre-failure warranty condition, the data is read from its mirrored DIMM. This allows the customer to hot-replace the failed DIMM without shutting down the server, thus improving availability.
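A minimal model of mirrored operation, with hypothetical class names and a simulated failure flag standing in for real hardware error reporting, might look like this: every write goes to both groups, reads come from the first group and fall back to the mirror on an error, and a hot-replaced group is refilled from the survivor.

```python
class DimmGroup:
    def __init__(self):
        self.cells = {}                 # address -> stored value
        self.failed = False             # simulated uncorrectable-error condition

    def read(self, addr):
        if self.failed:
            raise IOError("read error in DIMM group")
        return self.cells[addr]


class MirroredMemory:
    def __init__(self):
        self.groups = [DimmGroup(), DimmGroup()]

    def write(self, addr, value):
        for group in self.groups:       # every write lands in both groups
            if not group.failed:
                group.cells[addr] = value

    def read(self, addr):
        try:
            return self.groups[0].read(addr)    # normal path: read one group
        except IOError:
            return self.groups[1].read(addr)    # fall back to the mirrored copy

    def hot_replace(self, index):
        """Swap in a fresh group and rebuild it from the surviving mirror."""
        survivor = self.groups[1 - index]
        fresh = DimmGroup()
        fresh.cells = dict(survivor.cells)
        self.groups[index] = fresh
```

The server keeps running throughout, at the cost of doubling the installed memory for a given usable capacity.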

Hot-plug RAID memory, or HPRM, was developed for the industry-standard 700 series of eight-way servers, for which we develop the chip sets in-house. Here, RAID stands for redundant array of industry-standard DIMMs. As the name implies, there are several parallels between this technology and the RAID commonly used in disk-drive technology, where the acronym stands for redundant array of independent disks.

In our application, memory is configured in a 4+1 redundant array. A bank of memory is distributed equally across five memory cartridges; one cartridge is always designated as parity while the other four contain raw ECC-protected data. As a result, any one of the five memory cartridges can be removed to replace a bad DIMM or to add DIMMs while the system remains fully operational. Because one parity cartridge protects four data cartridges, hot-plug redundant-DIMM memory provides this level of memory protection with only a 25 percent increase in memory costs.

HPRM uses a redundant-DIMM parity engine in the host controller to generate redundant parity information for the data word held across the four data cartridges. If an error is found in any one of the four devices, its contents are replaced with data rebuilt from the redundant-DIMM information. Obviously, if multiple devices fail on the same transaction, the redundant memory logic will be defeated. But this is similar to other redundant subsystems, such as power supplies: two independent devices are extremely unlikely to fail at the same time. While the redundant-DIMM logic provides redundancy, it does not detect error conditions. The memory subsystem depends on ECC logic to detect errors. So in our implementation, any error that can be detected by ECC can be corrected by redundant DIMM. This combination provides an extremely reliable memory subsystem, even with the large memory capacities anticipated in next-generation servers.
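The article does not spell out the parity function, but if it is the bytewise XOR familiar from disk RAID (an assumption), the 4+1 scheme reduces to the sketch below. The helper names are illustrative. ECC plays the role of telling the controller which cartridge returned bad data; XOR over the other four words then rebuilds the missing one.

```python
from functools import reduce


def xor_all(words):
    """XOR a sequence of integer words together."""
    return reduce(lambda a, b: a ^ b, words, 0)


def write_stripe(data_words):
    """Lay out four data words plus one parity word, one per cartridge."""
    if len(data_words) != 4:
        raise ValueError("a 4+1 array takes exactly four data words")
    return list(data_words) + [xor_all(data_words)]


def reconstruct(stripe, bad_index):
    """Rebuild the word from cartridge `bad_index` using the other four."""
    survivors = [w for i, w in enumerate(stripe) if i != bad_index]
    return xor_all(survivors)


if __name__ == "__main__":
    stripe = write_stripe([0xDEADBEEF, 0x01234567, 0x89ABCDEF, 0x0F0F0F0F])
    # Suppose ECC flags cartridge 2 as bad (or it has been pulled for service).
    assert reconstruct(stripe, 2) == 0x89ABCDEF
```

The same identity explains both limits described above: the XOR can rebuild any one cartridge, including the parity cartridge itself, but it cannot say which cartridge is wrong, and it cannot recover two at once.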

HPRM brings hot-plug to the memory subsystem. When a memory device is determined to be bad, generally by exceeding an error threshold defined by the health driver, the operator is informed by whatever means the software provides, such as a page or e-mail. In addition, LEDs on the server's front panel indicate the source of the error. LEDs, locking switches and an audible alarm then guide the operator through the task of removing the memory cartridge with the failed DIMM, replacing the correct device and restoring the system to a fully redundant configuration.

Also, the hot-plug capabilities of the memory subsystem brought another benefit: the ability to hot-add memory. In other words, the amount of main memory in the system can be increased through a series of hot-plug operations. Each memory cartridge is independently removed, memory is added and the cartridge is restored. Once each memory cartridge is upgraded with like-sized additional memory, a message is sent to the operating system indicating that more memory is now available. Compaq has worked closely with all major operating system vendors to add this support to their operating system kernels.

See related chart






All materials on this site Copyright © 2009 TechInsights, a Division of United Business Media LLC All rights reserved.