It's interesting to look at the status of alternatives to DRAM among memory architectures. Up to this point they weren't getting much attention, but now that they have been whittled down to just a few, those alternatives are gaining visibility and credibility, as it is hard to deny what has been proven over the course of time.
It would be interesting to know in quantitative terms what "in low volume production" means, and even more intriguing to know what "it has its place, it's its own thing" actually means in light of the following.
By late December 2013 both the 128Mb and the 1Gbit MCP had been quietly removed from the Micron product list on their web site. It was reported elsewhere* that Micron had indicated that their earlier generations of phase change memory were no longer available for new designs or for those wishing to evaluate the technology, and that the focus for PCM had moved to developing a new PCM process in order to lower bit costs and power while at the same time improving performance. What, then, is the PCM device type that is in low volume production? Why would low volume production be maintained for devices that are no longer available to potential customers and that have the bit cost, power and performance limitations indicated? If PCM is not suitable for NAND or DRAM replacement, for what then is it suited?
Micron also have a paper co-authored with Sony at ISSCC 2014 that reports a 16Gbit ReRAM based on a 27nm process. One wonders why that did not get a mention along with STT/MRAM as one of the whittled-down list of emerging memory types with future potential on which Micron are working.
Lots of hush-hush about memory controller ownership in HMC. Intel of course wants to put all the ownership in its CPU, as would anyone who integrates an on-chip memory controller into the main processing unit. It's a big factor in chip design strategy. Designing with an HMC-based controller is actually a big risk.
It seems that everyone is ignoring the fact that the memory cube will have significantly higher latency than DDR4. A read-modify-write (RMW) will stall the CPU for eons. This means that it cannot be used by a CPU as the main memory attached to the cache. It essentially brings in a new tier to the memory hierarchy. It seems like a great idea that will bring much higher overall memory bandwidth, but the critical latency to the CPU is not solved.
Maybe the local DRAM will become a 4th level cache. Maybe someday the DRAM will be displaced by MRAM. In any case, I cannot see the DDR interface being simply replaced with a bunch of serial links.
It seems like the first niche for the memory cube would be in comm, where latency is not as big a deal and throughput is king... You could make an amazing switch with such a device.
Even if the memory cube is directly attached to the CPU (which is a very bad idea from a manufacturing yield perspective), the latency will be higher. To access a DRAM, you need to provide the row and column addresses and a few nanoseconds later a cache line is available. To use a serial interface, you need to create a command packet that says "read starting at this address and give me so many bytes". That command packet then needs to be serialized and sent to the memory cube controller. There it has to be de-serialized and interpreted. If the command is not for that memory cube, it has to be passed along the chain to another cube. If it IS for that memory cube, the DRAM has to be read (same row/column read cycle, but at a higher frequency). The data needs to be read into a buffer, then a response packet needs to be generated, serialized, and finally sent to the CPU. Whichever CPU thread was trying to do the read has had to twiddle its proverbial thumbs this whole time while waiting for a cache fill to complete. This takes a few nanoseconds with DDR and will take 10s or 100s of nanoseconds with a memory cube.
That should drag just about any high performance CPU to its knees. If the idea is good enough, the CPU makers might be willing to reinvent the whole multi-thread, cache, and memory management infrastructure, but I kind of doubt it :-).
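To put the steps above into something you can play with, here is a rough Python sketch of the latency budget for one read over a chained serial link. Every stage time in it is a made-up placeholder, not a figure from any HMC or DDR datasheet, and the names `SerialLinkStages` and `read_latency_ns` are hypothetical. It also assumes the response travels back through the same intermediate cubes as the request.

```python
from dataclasses import dataclass


@dataclass
class SerialLinkStages:
    """Per-stage delays in nanoseconds. All defaults are illustrative assumptions."""
    build_packet_ns: float = 2.0     # CPU side builds the read command packet
    serialize_ns: float = 4.0        # packet is serialized onto the link
    deserialize_ns: float = 4.0      # cube controller de-serializes it
    decode_ns: float = 2.0           # controller interprets the command
    forward_hop_ns: float = 10.0     # cost of passing a packet on to the next cube
    dram_access_ns: float = 15.0     # row/column read inside the target cube
    buffer_ns: float = 2.0           # data is read into a buffer
    build_response_ns: float = 2.0   # response packet is generated


def read_latency_ns(stages: SerialLinkStages, hops_to_target: int = 0) -> float:
    """Sum the stages of one read; hops_to_target counts cubes passed through."""
    if hops_to_target < 0:
        raise ValueError("hops_to_target must be non-negative")
    request = (stages.build_packet_ns + stages.serialize_ns
               + stages.deserialize_ns + stages.decode_ns)
    # Assumption: each intermediate cube adds the same delay in both directions.
    forwarding = hops_to_target * stages.forward_hop_ns
    response = stages.buffer_ns + stages.build_response_ns + stages.serialize_ns
    return request + forwarding + stages.dram_access_ns + response + forwarding


if __name__ == "__main__":
    stages = SerialLinkStages()
    for hops in (0, 1, 2):
        print(f"{hops} hop(s): {read_latency_ns(stages, hops):.1f} ns")
```

With these placeholder numbers the directly attached cube comes out at 35 ns and a cube two hops down the chain at 75 ns. The absolute values mean nothing; the point is that every extra stage and every extra hop lands directly on the thread waiting for its cache fill.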
Like I hinted in my earlier post, this may make a great main memory as long as there is a very large low latency RAM between it and the CPU (4th level cache) - and the cache hit rate of the 4th level cache is VERY high...
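A quick way to see why the hit rate has to be VERY high is the usual average-access-time estimate: every access pays the cache latency, and misses additionally pay the trip to the backing memory. The sketch below uses hypothetical function names and purely illustrative numbers (10 ns for the 4th level cache, 100 ns for the memory cube behind it); neither figure comes from a real part.

```python
def average_access_ns(l4_hit_ns: float, backing_ns: float, hit_rate: float) -> float:
    """Average access time: cache latency plus the miss fraction times the backing latency."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return l4_hit_ns + (1.0 - hit_rate) * backing_ns


def required_hit_rate(l4_hit_ns: float, backing_ns: float, target_ns: float) -> float:
    """Smallest hit rate that keeps the average access time at or below target_ns."""
    if target_ns < l4_hit_ns:
        raise ValueError("target cannot be faster than the cache itself")
    return max(0.0, 1.0 - (target_ns - l4_hit_ns) / backing_ns)


if __name__ == "__main__":
    # Illustrative assumptions only: 10 ns L4 cache, 100 ns backing memory.
    for rate in (0.90, 0.99, 0.999):
        print(f"hit rate {rate:.3f}: {average_access_ns(10.0, 100.0, rate):.2f} ns average")
    print(f"hit rate needed for a 12 ns average: {required_hit_rate(10.0, 100.0, 12.0):.3f}")
```

Under those assumed numbers, a 90% hit rate doubles the average access time relative to the cache alone, while staying within 2 ns of the cache latency takes a hit rate of about 98%.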
Hmmm. Now you have me thinking about this with a new perspective. First of all, the FPGA based systems can definitely take advantage of this. I've designed a DDR interface for an FPGA and it is not only a pain in the butt, it also wastes the bandwidth capability of the DRAM. By using the HMC, very few pins are needed and the latency is not a problem. Fan-out to logic that can inhale the data at full bandwidth could be a problem but it is easily solved with wide internal buses. Then the memory can be shared amongst all of the hardware accelerators and embedded processors...
Hello Xilinx and Altera - can you please build me a big FPGA in a smaller package? With PCIe and HMC, I don't need all of those pins!
My other thought about an application of the HMC is for an array of small low-power, lower frequency processors (remember the transputer?). When scaled out, this could provide a lot more compute power per sq in than the monster heater CPUs we use today.
OK - maybe I'm not as skeptical now. Even though it is still a bad fit for conventional CPUs, it might be a good fit for compute intensive workloads that can be parallelized.
I still think that a comm application with built-in packet inspection/routing/etc. would be a great place to start. The array of lightweight processors or FPGAs might even be the right infrastructure for this.
In real life systems arch, I think every system deserves its own dedicated architecture.
As an engineer, I'd love to do it right from the bottom-up. The reality is that drastic changes aren't possible. Look at how long it took us to get multi-threaded CPUs fully supported. First, the CPU guys had to implement it. It took a long time after that before the compiler, OS, and application folks figured out how to take advantage of it. This is one reason that the transputer never really got out of academia - nobody knew how to program it. Maybe now with GPGPU architectures being embraced by the HPC folks, the time of the transputer has come - provided that somebody takes the time to generate a robust library of commonly used functions.
But I would prefer to junk PCIe, which I frankly think is an abomination as an interconnect!
Junking PCIe has the same problem as I cited above - it is everywhere, and people know how to use it. Having said that, I would love it if I didn't have to pay certain IP vendors a small fortune to use their PCIe cores.
"Micron's process technology experts have expressed 'wild disagreement' about when a DRAM replacement will be needed. 'The earliest points to 2015, and the latest points to far enough out you could call it never.'"
Seems inside Micron there are those who want DRAM forever, those who want MRAM, those who want PCM, those who want RRAM, those who want Flash...
Good for R&D to thrive, but bad for immediate product development.