2GB would make an interesting LLC (Last Level Cache). You could use a large cache line size (512 bytes to 4KB) to optimize transfers between DDR and HMC. I suspect you would still see most of the power savings, as the DDR memory lines would be idle most of the time.
Another application could be as a cache on a rotating-media disk drive; the HMC could be glued right on top of the disk controller.
I take it that HMC is lower in power, which would reduce opex. However, the cost of memory would be higher. Is there any data on how much power, say, 16GB of HMC would burn vs. DDR4? I don't think the acquisition cost of HMC could come close to commodity DDR4 memory. Also, the more exotic (for now) manufacturing techniques like TSVs are bound to hurt yield, further increasing cost.
I believe the cost of HBM will get lower, but to get there it will need significant volumes and improvements in manufacturing. This can only happen if the large CPU vendor (Intel) gets on board. Without that, this cannot replace DDR4.
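To put rough numbers on the line-size idea, here is a quick sketch of how many lines a 2GB cache would have to track at each size. It assumes 2GB means 2 GiB and power-of-two line sizes; the 64-byte row is only there as a familiar CPU cache line size for comparison and is not from the comment above.

```python
# Back-of-envelope: lines a 2 GiB cache must track at various line sizes.
# Assumption: "2GB" is taken as 2 GiB; the 64 B entry is a comparison point only.
CACHE_BYTES = 2 * 1024**3


def lines_in_cache(line_bytes, cache_bytes=CACHE_BYTES):
    """Number of cache lines for a given line size in bytes."""
    return cache_bytes // line_bytes


for line_bytes in (64, 512, 1024, 2048, 4096):
    count = lines_in_cache(line_bytes)
    print(f"{line_bytes:>5} B lines -> {count:>12,} lines to track")
```

Larger lines mean fewer tags to keep, at the price of moving more data per miss.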
As TanjB pointed out, the bandwidth/GB just doesn't make sense with so little memory in a fabric. Why would I want to put 32GB on a server using HMC when I can get the same amount on a single DIMM - at lower cost and less physical space?
Until they actually get more GB/HMC, this looks like a great product for high speed switches and high performance FPGA-attached hardware accelerators, but not servers.
Look at Dell's and others' rack servers. You can drop 512+GB of memory in them today, and many do.
Even though you can chain them together, you are limited to 8 HMC parts per channel. So, the CPU will need multiple channels to support a large memory server. That's no problem - they already support multiple channels with far more signals required for DDR4 than for HMC.
The real problem is the size of the HMC. They are huge (31mmx31mm)! You can't cram enough of those on a motherboard or DIMM to get a server with 1.5TB of DRAM like you can with the 64GB DDR4 DIMMs that will be available later this year.
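For a sense of scale, here is a rough calculation using the figures quoted in this thread: 2GB per cube, 8 cubes per channel, a 31mm x 31mm package, and 64GB DIMMs. It assumes 1.5TB means 1536GB and counts package footprint only, ignoring routing, spacing and power delivery, so treat the area as a lower bound.

```python
import math

# Figures taken from the thread; 1.5TB is assumed to mean 1536 GB.
TARGET_GB = 1536
CUBE_GB = 2
CUBES_PER_CHANNEL = 8
CUBE_SIDE_MM = 31
DIMM_GB = 64


def cubes_needed(target_gb=TARGET_GB, cube_gb=CUBE_GB):
    """Cubes required to reach the target capacity."""
    return math.ceil(target_gb / cube_gb)


def channels_needed(cubes, per_channel=CUBES_PER_CHANNEL):
    """Channels required if each channel chains at most per_channel cubes."""
    return math.ceil(cubes / per_channel)


def footprint_cm2(cubes, side_mm=CUBE_SIDE_MM):
    """Total package footprint in square centimetres (packages only)."""
    return cubes * side_mm * side_mm / 100


def dimms_needed(target_gb=TARGET_GB, dimm_gb=DIMM_GB):
    """DIMMs required to reach the same capacity."""
    return math.ceil(target_gb / dimm_gb)


cubes = cubes_needed()
print(f"HMC cubes:    {cubes}")
print(f"HMC channels: {channels_needed(cubes)}")
print(f"HMC area:     {footprint_cm2(cubes):.0f} cm^2")
print(f"DDR4 DIMMs:   {dimms_needed()}")
```

Under these assumptions that is hundreds of cubes and dozens of channels against a couple dozen DIMMs.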
The point is that the chip stack is overkill for large memory spaces. 100GB/s of bandwidth per 2GB cube is way overkill for building a server with large memory - with 50 modules like that, what host chip will have the interconnect, or even need it? And you can't get much bigger cubes, because the limit is DRAM chip capacity multiplied by the number of TSV layers possible. So, the whole thing looks optimized for small-memory scenarios.
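The mismatch is easy to see with the numbers in that comment (2GB and 100GB/s per cube, 50 cubes). This is a simple sketch that assumes bandwidth adds up linearly across cubes, which a real host interconnect may not allow.

```python
# Per-cube figures as quoted in the thread.
CUBE_GB = 2
CUBE_BW_GBPS = 100


def aggregate(n_cubes, cube_gb=CUBE_GB, cube_bw=CUBE_BW_GBPS):
    """Return (total capacity in GB, total bandwidth in GB/s), assuming linear scaling."""
    return n_cubes * cube_gb, n_cubes * cube_bw


capacity_gb, bandwidth_gbps = aggregate(50)
print(f"capacity:  {capacity_gb} GB")
print(f"bandwidth: {bandwidth_gbps} GB/s")
print(f"ratio:     {bandwidth_gbps / capacity_gb:.0f} GB/s per GB")
```

Fifty cubes give only 100GB of memory but would nominally need the host to absorb about 5TB/s.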
Where is the server equivalent, or is this simply not coming to a server any time soon?