Onyx gets its NV memory
PCM devices from Micron were the stars of a paper presented by members of the Department of Computer Science and Engineering at the University of California, San Diego at the recent Hot Storage conference.
They described the results of testing a PCM-based solid state drive (SSD) called Onyx. This SSD uses what are described as Micron’s “first generation” P8P (90nm) 16MB PCM devices. Onyx has a capacity of 10GB organized in 8 banks of 1.25GB, connected to a host system by a PCIe bus. Of that capacity, 8GB is allocated to data storage and 2GB to error correction. Figure 1 is a schematic of the high-level architecture of Onyx.
Some concerns were raised that Onyx may not have been fully populated, since the system requires some 640 PCMs, each with 16MB capacity. We have now had assurances from Adrian Caulfield, one of the authors of the paper, that the system was fully populated, with all 16 of its DIMMs each carrying 40 PCMs. It represents the largest collection of PCMs that has been subjected to the rigors of assembly and shown to the public in a system; this is a significant PCM milestone.
Onyx as a prototype system is based on the design of Moneta, an SSD that was designed in anticipation that, at some time, some type of non-volatile memory would become available. Moneta uses DRAM in place of PCM. Onyx now uses PCM in place of the DRAM, but it retains the highly optimized software stack of Moneta to minimize latency and maximize concurrency.
In essence, the Onyx architecture employs eight memory controllers, each controlling 1GB of memory and linked on a 4GB/s ring that communicates with the “brain” of the system, which interfaces with the PCIe bus. The prototype system employs four ring-connected FPGAs, with four DIMMs to each FPGA. The system clock frequency is 250MHz. Each DIMM has 40 of Micron’s 16MB P8P PCM devices. The DIMMs fit into a standard DIMM slot.
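The population figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check: the variable names are made up for illustration, and it assumes binary units (1GB = 1024MB), which the article does not state.

```python
# Figures quoted in the article; names are illustrative, not from the paper.
DEVICE_MB = 16          # capacity of one Micron P8P PCM device
DEVICES_PER_DIMM = 40
DIMMS_PER_FPGA = 4
FPGAS = 4
CONTROLLERS = 8
DATA_GB = 8             # capacity allocated to data storage
ECC_GB = 2              # capacity allocated to error correction

dimms = FPGAS * DIMMS_PER_FPGA                    # DIMMs in the system
devices = dimms * DEVICES_PER_DIMM                # PCM devices in the system
raw_gb = devices * DEVICE_MB / 1024               # assumes 1GB = 1024MB
bank_gb = raw_gb / CONTROLLERS                    # raw capacity per controller
data_per_controller_gb = DATA_GB / CONTROLLERS    # data capacity per controller
dimms_per_controller = dimms // CONTROLLERS       # DIMMs behind each controller

print(dimms, devices, raw_gb, bank_gb, data_per_controller_gb, dimms_per_controller)
```

On these numbers each memory controller sits in front of a pair of DIMMs, which is consistent with the article quoting bandwidth per PCM DIMM pair.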
Some of the techniques for dealing with PCM design challenges, “its own idiosyncrasies,” are worth commenting on. The first is the use of a “large capacitor” to assure that PCM does not breach the fundamental definition of an NV memory, i.e. it does not lose data in the event of a mains failure. The use of the large capacitor is not quite as bad as it might at first appear. The PCM controller is able to provide two indications of the status of a write to PCM. One is called “late completion,” indicating that the write is complete. The other, called “early completion,” is provided when all the data is in the PCM buffers. Early completion allows Onyx to hide most of the write latency, but it is vulnerable to power failure. In the event of a mains failure, the large capacitor holds enough energy to complete the write operation. The position is defended on the basis that flash can achieve this. It is claimed that the use of early completion provides a peak bandwidth per PCM DIMM pair of 156MB/s for reads and 47.1MB/s for writes.
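The difference between the two completion signals can be illustrated with a toy model. This is a sketch under stated assumptions, not the Onyx controller: the class, its methods and the single in-memory buffer are all hypothetical, and the capacitor is reduced to a yes/no flag.

```python
from collections import deque


class PcmControllerModel:
    """Toy model of early versus late write completion (hypothetical names)."""

    def __init__(self):
        self.buffer = deque()   # writes accepted but not yet programmed into PCM
        self.pcm = {}           # contents actually committed to PCM

    def write(self, addr, data):
        # Early completion: signalled as soon as the data sits in the buffers.
        self.buffer.append((addr, data))
        return "early"

    def drain_one(self):
        # Late completion: signalled once the data is programmed into PCM.
        addr, data = self.buffer.popleft()
        self.pcm[addr] = data
        return "late"

    def power_fail(self, capacitor_present):
        # Assumption: with the capacitor, every buffered write can finish;
        # without it, anything still in the buffers is lost.
        if capacitor_present:
            while self.buffer:
                self.drain_one()
        else:
            self.buffer.clear()


ctrl = PcmControllerModel()
status = ctrl.write(0x10, b"payload")     # the host sees completion here
ctrl.power_fail(capacitor_present=True)   # the buffered write still reaches PCM
```

In this model the host can move on after the early signal, which is where the hidden write latency comes from, while the capacitor is what keeps a write that has been acknowledged from being lost.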
The next PCM “idiosyncrasy” with which the UC San Diego team had to deal is PCM wear-out. They cited discussions with the PCM manufacturer explaining the difference between the lifetime of a PCM and that of flash. Simply put, the PCM lifetime of 1 million cycles is an estimate of the number of program cycles per cell before the first bit error occurs in a large population of devices (no population number provided) without error correction. For flash, by contrast, lifetime is the number of program/erase cycles before the error-correcting scheme can no longer handle the problems.
To deal with the write lifetime and wear-out problem, Onyx employs what is claimed to be the first real-system implementation of a “start-gap” wear-leveling scheme to avoid uneven PCM wear-out. In operation, it slowly rotates the mapping between 4KB rows of PCM and their storage addresses. If the storage address of row x is n, after some interval it will become n+1, and so on. This does mean that, periodically, memory content must be rewritten. The start-gap interval used was 128. The paper introduces a new term into the memory lexicon, the “line vulnerability factor,” defined as the number of writes to an address before it is rewritten by start-gap. In a system, the trade-off is vulnerability against the extra overhead for access and writing.
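The rotation described above can be sketched in a few lines. What follows is a minimal model of the general start-gap idea (one spare row, a start register and a gap register), not the Onyx implementation; the class and method names are hypothetical, rows are stored in a Python list, and the gap is assumed to move once every gap_interval writes.

```python
class StartGap:
    """Minimal start-gap wear-leveling model (illustrative names only)."""

    def __init__(self, num_rows, gap_interval=128):
        self.n = num_rows
        self.interval = gap_interval      # writes between gap movements
        self.start = 0                    # rotation offset
        self.gap = num_rows               # index of the currently unused row
        self.rows = [None] * (num_rows + 1)   # one spare physical row
        self.writes_since_move = 0

    def physical(self, logical):
        # Rotate by start, then skip over the gap row.
        pa = (logical + self.start) % self.n
        return pa + 1 if pa >= self.gap else pa

    def read(self, logical):
        return self.rows[self.physical(logical)]

    def write(self, logical, data):
        self.rows[self.physical(logical)] = data
        self.writes_since_move += 1
        if self.writes_since_move >= self.interval:
            self.writes_since_move = 0
            self._move_gap()

    def _move_gap(self):
        if self.gap == 0:
            # Wrap around: the last row moves into row 0 and start advances.
            self.rows[0] = self.rows[self.n]
            self.rows[self.n] = None
            self.gap = self.n
            self.start = (self.start + 1) % self.n
        else:
            # Copy the row just below the gap into the gap (the extra rewrite).
            self.rows[self.gap] = self.rows[self.gap - 1]
            self.rows[self.gap - 1] = None
            self.gap -= 1


sg = StartGap(num_rows=8, gap_interval=4)
for i in range(8):
    sg.write(i, f"row-{i}")
```

Each gap movement costs one extra row copy, which is the periodic rewrite mentioned above; a smaller interval spreads wear faster at the price of more of those copies.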
Testing of Onyx with standard benchmarks against other systems, the Fusion-io ioDrive and Moneta, showed that Onyx with PCM could outperform the ioDrive for small writes and reads. With early write completion, Onyx write performance improves for both large and small requests. For small 512B requests, it is claimed Onyx can sustain 478K IOPS compared to the ioDrive’s 90K IOPS. The 4KB random read time for Onyx is 38µs, while a 4KB write requires 179µs.
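To put the quoted benchmark figures on a common footing, the short calculation below converts them to bandwidth and ratios. It is a back-of-envelope sketch: the helper names are invented here, decimal megabytes are assumed, and the single-stream figure assumes only one request outstanding at a time, which is not how the benchmarks were necessarily run.

```python
# Figures quoted in the article; helper names are illustrative.
ONYX_IOPS = 478_000
IODRIVE_IOPS = 90_000
SMALL_REQUEST_BYTES = 512
READ_4KB_US = 38
WRITE_4KB_US = 179


def bandwidth_mb_per_s(iops, request_bytes):
    """Bandwidth implied by an IOPS figure, in decimal MB/s."""
    return iops * request_bytes / 1e6


def single_stream_mb_per_s(request_bytes, latency_us):
    """Bandwidth of one request at a time; bytes per microsecond equals MB/s."""
    return request_bytes / latency_us


small_request_bw = bandwidth_mb_per_s(ONYX_IOPS, SMALL_REQUEST_BYTES)
iops_ratio = ONYX_IOPS / IODRIVE_IOPS
write_to_read_latency = WRITE_4KB_US / READ_4KB_US
read_stream_bw = single_stream_mb_per_s(4096, READ_4KB_US)

print(f"{small_request_bw:.1f} MB/s, {iops_ratio:.1f}x, "
      f"{write_to_read_latency:.1f}x, {read_stream_bw:.1f} MB/s")
```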
Beyond several design challenges that remain for the future, the real problem is again PCM scaling. Admitting that scaling is key to the future, the design team states that, “assuming PCM scaling projections hold,” PCM storage arrays will be competitive with flash. They then let us into a secret with respect to Micron’s PCM plans. They state “next generation PCM devices will sustain up to 3.76MB/s,” up from 0.5MB/s. Perhaps this is Micron’s promised 1 or 2 G-bit PCM?
One of your contacts and I have been discussing abandoning the silicon substrate and using CVD or AVD to build the PCM on an alternate substrate. My suggestion is that the materials folks design a chalcogenide substrate to meet the necessary specs. If that were done, couldn't one of the specs for the chalcogenide substrate be that it allow for a laser driven modification of the state of the cells AFTER fabrication? Getting even farther out in la-la land, the production could be changed to a continuous ribbon of chips just one chip wide.
Now you can have the moderator boot both me and Volatile off the comments area so you can get serious with the real engineers and researchers :-)
Should that be done, I may as well go out expressing my unsupportable opinion that the optimum chalcogenide for the PCM memory material includes arsenic and is doped with terbium.
A number of people have contacted me with respect to the energy aware paper. If MLC PCM are going to be part of the solution to PCM scaling problems, then how do we decide which of the levels 00, 01, 10 or 11 in a four level cell are the best for energy awareness? The problem is further complicated because a number of the MLC techniques require an initial reset pulse before the write/erase steps to get to the target level are started. Even if that is not the case, set and reset pulses will be necessary to achieve the target level.
The other problem I see, even for two level cells, is that PCM cells are manufactured in the crystallized state. That is, all sectors of the memory are in the most undesirable state as far as energy awareness is concerned. If I initially write to a few sectors, these will also be the best sectors as far as energy awareness is concerned and might get written to destruction.
rbtbob: Naah, you misunderstood the "read-before-write" paper. The paper clearly shows that PCM has a higher energy consumption per bit than even DRAM, plus it does nothing to address the fundamental problem of current density, migration, and all those other issues that Mr. Neale has already covered.
A paper from George Washington University makes claims of considerable improvement in PCM energy consumption.
In this paper, we investigate new techniques that would perform writes to PCM with energy awareness. Our results show that we can minimize the write energy consumption by up to 8.1% by simply converting PCM native writes to read-before-write, and up to an additional 22.9% via intelligent out-of-position
Seems the researchers are going to tweak PCM into viability.
rbtbob: Quotes from the cited patent: “when more than a certain amount of current passes through a phase change memory device in a reset state, its resistance and threshold voltage may change” ... “Without being limited to theory, it may be that the reason for these disturb problems is due to the presence of crystal nuclei within the amorphous state. These crystal nuclei are the sites for the growth of the crystalline phase from the amorphous phase”
It is not a matter of “may” and “maybe”; there is no doubt that the threshold voltage changes and recovers with a time constant that exceeds the thermal time constant, however small the current and pulse width. This is initially an annealing process, the closing of dangling bonds. If and how this relates to nucleating sites is not clear. This change in threshold voltage and structure limits read access time and is more likely than not the reason why it has so far been impossible to make stable oscillators using the “S” shaped negative resistance of a threshold switch. I think the fact that the post-switching threshold recovery starts from zero as one continuous process might also teach something about the post-switching internal temperature. On nucleating sites, most structures, of necessity, use crystallized chalcogenide as one electrode structure, itself a massive nucleating site!
The link below is to a new patent assigned to Ovonyx and it lists Semyon Savransky as one of the inventors:
Patent 7,990,761 August 2, 2011
Immunity of phase change material to disturb in the amorphous phase
Mr. Neale is baaack! After falling for Mr. Savransky's and Ms. Kuzum's pseudo research (that's putting it too mildly), Mr. Neale obliterates IBM's "claims:" "How any of the material discussed above can lead anyone to be able to predict that PCM will be available for use in servers in the period 2014 to 2016 is beyond this writer." And, of course, now we have Samsung finally admitting (by buying Grandis) that PCM is a dead end.
I found the Onyx system details particularly illuminating. Of course, a simple picture would have shown those huge heatsinks that make any PCM-based storage device impractical (compare to the decent NAND-based devices such as Fusion-io's). And, of course, the Onyx creators know very well that no enterprise application writes just one sector - in the real world, Onyx simply cannot compete with sparse NAND-based storage. But, hey, if anyone thinks that people will pay over $4,000 for a 10GB underperforming heater - here is the newsflash - 16GB DRAM + 16GB NAND and a battery, or a "huge capacitor" (for about $400).