Internal disagreement?
resistion   2/5/2014 11:18:29 AM
"Micron's process technology experts have expressed "wild disagreement" about when a DRAM replacement will be needed. "The earliest points to 2015, and the latest points to far enough out you could call it never."

Seems inside Micron there are those who want DRAM forever, those who want MRAM, those who want PCM, those who want RRAM, those who want Flash...

Good for R&D to thrive, but bad for immediate product development.

Re: HMC's DRAM Controller
GSMD   2/4/2014 7:54:27 PM
This was the very reason I shifted to academia after a long stint in industry. Tough to refuse an offer when you get to design a whole family of CPUs from scratch, with no concessions to backward compatibility, and a companion microkernel to go along with it!

Let me know if you want to try out our CPU cores; the low-end cores should be available in three months or so. We support the Xilinx tool flow for FPGAs.

In the same vein, we junked PCIe quite simply because it does not naturally support peer-to-peer (non-transparent bridging is a pain) and has no support for DSM. We use DSM to build a MESIF-based cache-coherent chip-to-chip interconnect. I used to run the Asian operations of a company that sold PCIe, SRIO, and HT cores, so I know the game well. I agree that these cores are ridiculously priced. Hopefully our open-source RTL will change the industry a bit and make technical merit the determinant of a standard's success, rather than the marketing muscle of its backer.

The other team that is lucky enough to do everything from scratch is the BAE/UPenn/Harvard team doing the Crash SAFE program; see www.crash-safe.org. They are building a secure CPU, an OS, two new languages, and an app framework from scratch.

Re: HMC's DRAM Controller
DougInRB   2/4/2014 1:08:05 PM
"In real life systems arch, I think every system deserves its own dedicated architecture."

As an engineer, I'd love to do it right from the bottom-up.  The reality is that drastic changes aren't possible.  Look at how long it took us to get multi-threaded CPUs fully supported.  First, the CPU guys had to implement it.  It took a long time after that before the compiler, OS, and application folks figured out how to take advantage of it.  This is one reason that the transputer never really got out of academia - nobody knew how to program it.  Maybe now with GPGPU architectures being embraced by the HPC folks, the time of the transputer has come - provided that somebody takes the time to generate a robust library of commonly used functions.

"But I would prefer to junk PCIe which I frankly think is an abomination as an interconnect!"

Junking PCIe has the same problem as I cited above - it is everywhere, and people know how to use it.  Having said that, I would love it if I didn't have to pay certain IP vendors a small fortune to use their PCIe cores.

Re: HMC's DRAM Controller
GSMD   2/4/2014 11:49:42 AM
These discussions are fun, aren't they? I like the different perspectives I get from them.

My focus area in using HMCs is pretty much what you are talking about. I used to be the kernel guy at an RDBMS company, and we had tens of thousands of threads running simultaneously. For such a workload, a sea of processors was a great fit. We tried using an IBM SP/2, but it was a pain to use. There was a transputer-based system called the Meiko Computing Surface (you could change the fabric topology dynamically), but the transputer was too lightweight. Loved the transputer, though. Our ISA will use transputer-style messaging instructions to send messages over Serial RapidIO (sendmg coreid, data). Best homage I can think of!
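
As a hedged sketch of what such a messaging primitive could look like from software, here is a minimal C illustration. The message type, the fabric_send() stub, the payload size, and the core ID are all invented for this example; in the ISA described above the send would be a single instruction rather than a function call.

/* Hypothetical illustration of a transputer-style send primitive that
 * addresses a peer core by ID and pushes a small payload over the fabric
 * (Serial RapidIO in the design described above). Names are invented. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FABRIC_MAX_PAYLOAD 64

typedef struct {
    uint16_t dest_core;                    /* fabric endpoint / core ID */
    uint16_t length;                       /* payload length in bytes   */
    uint8_t  payload[FABRIC_MAX_PAYLOAD];  /* one message worth of data */
} fabric_msg_t;

/* Stand-in for the hardware messaging instruction (or its driver shim);
 * here it just reports what would be sent. */
static void fabric_send(const fabric_msg_t *msg)
{
    printf("send %u bytes to core %u\n",
           (unsigned)msg->length, (unsigned)msg->dest_core);
}

/* Send a 64-bit key to a worker core, transputer style. */
static void notify_worker(uint16_t worker_core, uint64_t key)
{
    fabric_msg_t msg = { .dest_core = worker_core,
                         .length    = (uint16_t)sizeof key };
    memcpy(msg.payload, &key, sizeof key);
    fabric_send(&msg);
}

int main(void)
{
    notify_worker(7, 0xdeadbeefULL);   /* hypothetical core ID and payload */
    return 0;
}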

FPGAs can also make use of HMCs the way you suggested. Once Altera sends me a sample, I plan to try a sea-of-cores design as a master's project next year.

I was just reviewing an SMT-based experimental processor design from one of my master's students. Eight simultaneous threads all stressing the fetch unit; multiply this by 64 (it is a 64-core system, each core about as heavy as a Cortex-A7) and you get a sea-of-cores system that can really use an HMC.

I agree with your analysis. It is no one's case that HMC is a universal panacea, but in highly parallel applications it may be a great fit in spite of the higher latency. For RDBMS-type apps, I am planning a dedicated server processor with HMC and specialized functional units dedicated to RDBMS sub-system processing.


In real life systems arch, I think every system deserves its own dedicated architecture. 

I am also asking our FPGA contacts to give us SERDES-only parts; that would be a great part for our CPU prototyping. But I would prefer to junk PCIe which I frankly think is an abomination as an interconnect!

Re: HMC's DRAM Controller
DougInRB   2/4/2014 11:14:07 AM
Hmmm.  Now you have me thinking about this with a new perspective.  First of all, the FPGA based systems can definitely take advantage of this.  I've designed a DDR interface for an FPGA and it is not only a pain in the butt, it also wastes the bandwidth capability of the DRAM.  By using the HMC, very few pins are needed and the latency is not a problem.  Fan-out to logic that can inhale the data at full bandwidth could be a problem but it is easily solved with wide internal buses.  Then the memory can be shared amongst all of the hardware accelerators and embedded processors...

Hello Xilinx and Altera - can you please build me a big FPGA in a smaller package?  With PCIe and HMC, I don't need all of those pins!

My other thought about an application of the HMC is for an array of small low-power, lower frequency processors (remember the transputer?).  When scaled out, this could provide a lot more compute power per sq in than the monster heater CPUs we use today.

OK - maybe I'm not as skeptical now.  Even though it is still a bad fit for conventional CPUs, it might be a good fit for compute intensive workloads that can be parallelized.

I still think that a comm application with built-in packet inspection/routing/etc. would be a great place to start. The array of lightweight processors or FPGAs might even be the right infrastructure for this.

Re: HMC-CPU connection
resistion   2/4/2014 3:26:17 AM
So no takers for HMC on CPU? The DRAM-CPU communication was supposed to be the main beneficiary of going to TSV technology.

Re: HMC-CPU connection
GSMD   2/3/2014 10:06:04 PM
The issue of serial link latency is an interesting topic. At first glance it can appear to be pretty high, but actual experiments may show that the latency can be brought down if the protocol is simple.

We are implementing a Serial RapidIO 3.0 IP. If you are curious, the source is at bitbucket.org/casl (IIT Madras Computer Architecture and Systems Lab). The source published so far covers the logical and transport layers. When synthesized in Synopsys DC targeted at a 65 nm library (we are still waiting for our 28 nm FD-SOI library), we are getting single-cycle processing for 64-bit packets and 1-1.3 cycles for 128-bit packets, at 2 GHz for the 64-bit packets. We are still coding the physical layer, and the SERDES will be a standard part.

So the best-case latency at 65 nm is 500 ps for the logical and transport layers. I have no idea what the physical layer with SERDES will be, but hopefully it will be below 10 ns (HT was lower, PCIe is higher). Interlaken IPs are claiming 13.5 ns without the SERDES but with all other layers included; Interlaken is closer to HMC in terms of traffic type than PCIe. Inphi is claiming sub-15 ns for its 10/28G SERDES, though I am not sure if this includes the PCS. Also, since a lot of protocols need to be supported, a simpler SERDES could shave off a couple of ns.

The worst-case scenario is still looking like 17 ns, maybe 15 ns at 20 nm at 3 GHz.

If the net latency is 10 ns then HMC does not look too bad. You save a ns or two by having an integrated controller for all the banks.
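
To make the arithmetic above concrete, here is a tiny C sketch that just adds up the figures quoted in this post; every value is an estimate or an assumption, not a measurement.

/* Back-of-envelope link-latency budget from the numbers above. */
#include <stdio.h>

int main(void)
{
    double logical_transport_ns = 0.5;   /* ~1 cycle at 2 GHz, 64-bit packet */
    double phy_serdes_ns        = 10.0;  /* hoped-for PHY + SERDES bound     */
    double one_way_ns = logical_transport_ns + phy_serdes_ns;

    printf("one-way link latency : %5.1f ns\n", one_way_ns);
    printf("request + response   : %5.1f ns\n", 2.0 * one_way_ns);
    printf("quoted worst case    : %5.1f ns\n", 17.0);
    return 0;
}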

There is a master's thesis from UC Berkeley on silicon photonic optical interconnects that has some data on latency:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.182.3362&rep=rep1&type=pdf

Please take the following into consideration:

1. The HMC protocol will be simpler compared to I/O protocols.

2. A less robust ECC will probably suffice, since distances on the PCB will be minimal.

3. We plan to link our HMC interface directly to the cache logic to see if we can reduce latency. This can be done with HMC since the protocol is simpler, so there is no DDR controller going through AXI. Please also take controller access latency into consideration when you look at conventional DDR.

To sum up, the jury is still out on latency but it does not seem to be a deal breaker. I could be wrong, but I hope not. We will know in about 4-6 months since we will be able to synthesize our entire IP minus the SERDES.

Re: HMC's DRAM Controller
GSMD   2/3/2014 8:55:58 PM
Actually, if you look through the HMC architecture, the shared logic across multiple DRAM dies actually helps matters a bit. I have not studied this in detail, but functions like striping can be handled transparently. In these scenarios HMCs should improve latency. Micron itself claims lower latency; as per the HMC web page, "Reduced Latency – With vastly more responders built into HMC, we expect lower queue delays and higher bank availability, which will provide a substantial system latency reduction". The key here is lots of banks and lots of CPUs. If you just consider one core making an access to one bank, obviously latency will be higher. But in a real-life scenario where you have a 16-core monster with quad issue per core and hyper-aggressive prefetch with limited spatial locality, all standard latency models go haywire, and an HMC will help.
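
As a hedged illustration of that "more responders" point, here is a toy C model: spread R outstanding requests uniformly over N independent banks and the per-bank queue shrinks as N grows. The service time, request count, and bank counts are made-up numbers for illustration, not Micron data.

/* Toy model: average queue depth per bank is roughly R/N, so queueing
 * delay shrinks as the bank count grows. Illustrative numbers only. */
#include <stdio.h>

int main(void)
{
    const double t_service_ns = 30.0;          /* assumed bank busy time        */
    const int outstanding     = 64;            /* e.g. 16 cores x 4 misses each */
    const int banks[]         = {8, 32, 256};  /* DIMM-ish vs HMC-ish counts    */
    const int n = (int)(sizeof banks / sizeof banks[0]);

    for (int i = 0; i < n; i++) {
        double depth = (double)outstanding / banks[i];
        printf("%4d banks: ~%4.1f requests deep, ~%5.1f ns average queue delay\n",
               banks[i], depth, depth * t_service_ns);
    }
    return 0;
}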

Besides, as someone else pointed out, L4 caches have started appearing, and they will absorb a lot of the latency hit. IBM, as usual, is first out of the gate; Intel is presumably next.

But before you start wringing your hands over latency, please consider all computing scenarios; latency problems manifest differently in each case. In situations like shared-nothing databases or Hadoop, a lot of traffic will go over QPI (assuming an Intel box), which has higher latency than HMC. Besides, with HMC there is zero copy. So in these scenarios, HMC can be a big win. I plan to test Postgres with HMC to see if this pans out.

Basically, the question is whether remote DRAM access over QPI is better than local DRAM access over HMC.

Putting memory in the same package as the CPU is a bad idea; heat and packaging will create nasty issues. Besides, the whole idea is that the HMC can act as a fabric linking multiple CPUs together, avoiding the use of QPI-like interconnects. If we can embed some security/MMU logic in the HMC, then secure shared memory can be achieved. This would help avoid the use of a separate interconnect for low socket counts (2-4). That is my theory; let us see how it holds up in an actual implementation.

So what I am proposing is an adaptive system fabric that combines an HMC fabric with an I/O fabric like RapidIO. The CPU can transparently switch from the memory fabric to the I/O fabric depending on datapath availability, projected latency, and congestion. Ideally I want to share a low-level protocol between the two, in which case part of the HMC controller itself acts as an I/O fabric switch: a truly universal fabric. Latency can be an issue and has to be sorted out, but hey, that is why this is called research!
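
For what it is worth, here is a minimal, purely hypothetical C sketch of that per-transfer path selection. The struct, the occupancy-based latency projection, and the numbers are all invented to illustrate the idea, not part of any real HMC or RapidIO controller.

/* Hypothetical per-transfer path selection: pick the memory fabric (HMC
 * links) or the I/O fabric (e.g. RapidIO) by projected latency. */
#include <stdbool.h>
#include <stdio.h>

enum fabric { FABRIC_HMC, FABRIC_IO };

struct path_state {
    double base_latency_ns;   /* idle one-way latency of this path */
    double queue_occupancy;   /* 0.0 (empty) .. 1.0 (saturated)    */
    bool   available;         /* datapath currently usable?        */
};

/* crude projected latency: idle latency inflated by congestion */
static double projected_ns(const struct path_state *p)
{
    double occ = p->queue_occupancy > 0.95 ? 0.95 : p->queue_occupancy;
    return p->base_latency_ns / (1.0 - occ);
}

static enum fabric select_fabric(const struct path_state *hmc,
                                 const struct path_state *io)
{
    if (!hmc->available) return FABRIC_IO;
    if (!io->available)  return FABRIC_HMC;
    return projected_ns(hmc) <= projected_ns(io) ? FABRIC_HMC : FABRIC_IO;
}

int main(void)
{
    struct path_state hmc = { 15.0, 0.70, true };   /* made-up numbers */
    struct path_state io  = { 25.0, 0.10, true };
    printf("chosen path: %s\n",
           select_fabric(&hmc, &io) == FABRIC_HMC ? "HMC fabric" : "I/O fabric");
    return 0;
}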

Seriously, why can't EE Times have a bigger webinar/article on this issue? This is something that will change system architecture significantly, especially when these links become silicon photonic. Those, by the way, are really cool, especially the Altera FPGA part that has the optics integrated right on the FPGA. Samples are a pain to come by, though. But it solves the major headache of routing 28G or 56G traces on a PCB.

Re: HMC-CPU connection
DougInRB   2/3/2014 7:18:46 PM
Even if the memory cube is directly attached to the CPU (which is a very bad idea from a manufacturing-yield perspective), the latency will be higher.  To access a DRAM, you provide the row and column addresses and a few nanoseconds later a cache line is available.  To use a serial interface, you need to create a command packet that says "read starting at this address and give me so many bytes."  That command packet then needs to be serialized and sent to the memory cube controller, where it has to be de-serialized and interpreted.  If the command is not for that memory cube, it has to be passed along the chain to another cube.  If it IS for that memory cube, the DRAM has to be read (same row/column read cycle, but at a higher frequency).  The data needs to be read into a buffer, then a response packet needs to be generated, serialized, and finally sent back to the CPU.  Whichever CPU thread was trying to do the read has had to twiddle its proverbial thumbs this whole time while waiting for its cache fill to complete.  This takes a few nanoseconds with DDR and will take tens or hundreds of nanoseconds with a memory cube.
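
A hedged way to see how those steps add up is simply to tally them; the C sketch below does that with placeholder numbers chosen only to show the shape of the total, not taken from any datasheet.

/* Illustrative tally of the serialized-read steps listed above.
 * Every latency value here is a placeholder, not a measurement. */
#include <stdio.h>

int main(void)
{
    struct { const char *step; double ns; } steps[] = {
        { "build read-request packet",        1.0 },
        { "serialize + transmit request",     5.0 },
        { "deserialize + decode at the cube", 5.0 },
        { "DRAM row/column access",          30.0 },
        { "buffer + build response packet",   2.0 },
        { "serialize + transmit response",    5.0 },
        { "deserialize at the CPU",           5.0 },
    };
    double total = 0.0;
    for (unsigned i = 0; i < sizeof steps / sizeof steps[0]; i++) {
        total += steps[i].ns;
        printf("%-34s +%5.1f ns (running total %6.1f ns)\n",
               steps[i].step, steps[i].ns, total);
    }
    return 0;
}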

That should drag just about any high performance CPU to its knees.  If the idea is good enough, the CPU makers might be willing to reinvent the whole multi-thread, cache, and memory management infrastructure, but I kind of doubt it :-).

Like I hinted in my earlier post, this may make a great main memory as long as there is a very large low latency RAM between it and the CPU (4th level cache) - and the cache hit rate of the 4th level cache is VERY high... 

Re: HMC-CPU connection
resistion   2/3/2014 6:26:03 PM
Good point. I guess it's supposed to be on top of CPU with TSV connection. This would also require CPU maker buy-in.
