It is hard to comment on A3Cube's latency claims, since their implementation is secret.
There's one latency and throughput number that I don't see being mentioned here. If you are transmitting small (<64 byte) frames, then the packet overhead will add significantly to the latency and eat into the throughput, and the fact is that PCIe packets have a bigger overhead than RapidIO packets. For small packets, there's no question which one is faster.
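To put rough numbers on that, here is a quick payload-efficiency calculation. The per-packet overhead figures are assumptions for illustration only, not datasheet values: I'm assuming roughly 24 bytes for a PCIe TLP (framing + header + LCRC) and roughly 16 bytes for an SRIO packet.

```python
# Illustrative payload efficiency for small frames.
# Overhead figures are ASSUMED round numbers, not datasheet values:
#   PCIe TLP (framing + header + LCRC) ~ 24 bytes
#   SRIO packet (header + CRC)         ~ 16 bytes
def efficiency(payload, overhead):
    """Fraction of link bandwidth carrying actual payload."""
    return payload / (payload + overhead)

for payload in (16, 64, 256):
    pcie = efficiency(payload, 24)
    srio = efficiency(payload, 16)
    print(f"{payload:4d} B payload: PCIe {pcie:.0%}, SRIO {srio:.0%}")
```

At 64-byte payloads, even a handful of extra header and CRC bytes per packet costs several percent of usable bandwidth, and the gap widens as frames shrink.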
My post on PCIe vs SRIO (or others) was not comparing the business side to the technical – I was explaining that PCIe was technically the best solution based on key metrics. The fact is that PCIe provides a universal connection – this is not really open to reasonable debate – and that this offers advantages – technical advantages – that no other interconnect can offer.
PCIe provides the largest universe of components with direct connection, which offers a clear advantage in power, cost, and latency, and this leads to a potential performance edge as well. And it allows the use of most devices – since almost all of them have PCIe as a connection – so that you can build up a system that is tuned to your needs with off-the-shelf components. These are not business advantages, but technical advantages. And it is these technical advantages that lead to the business success, not the other way around.
I would encourage people to go to the page that I noted before (www.plxtech.com/expressfabric) to see that the PCIe solution that I was referencing goes beyond a simple PCIe network. It offers both DMA and RDMA, and in each case they are compatible with the vast number of applications that have been written for Ethernet and InfiniBand. You benefit from the advantages I mentioned, and you can use the same applications.
And it allows sharing of I/O devices among multiple hosts with existing devices and drivers. These features can all be had with a single, converged fabric.
1. I have made it abundantly clear that I am giving only technical opinions and not commenting on how the market decides the winners. I am still annoyed that VHS won over Beta ! Or that CP/M lost to DOS. Or that ATM lost to Ethernet. Nearly 30 years in the industry has taught me not to make predictions.
2. My mandate is simple. The Govt. of India has tasked me and my team to come up with a homegrown processor architecture (ranging from Cortex-M3 level to Intel Knights Landing level HPC processors, with all the variants in between, for a total of 6 families). For the HPC variants, I was also asked to come up with the fastest interconnect possible. Market considerations and any notion of compatibility were to be completely ignored.
And an open source Micro-kernel was also mandated to run on these processors (Linux and Android would be virtualized).
Wrt the interconnect, I have the luxury of co-designing the CPU (including the MMU), the interconnect and the OS. So no HBA penalty applies. In fact my goal is to get from a CPU register to the SERDES in 3-5 cycles (for non-cached data). That is pretty much the theoretical limit and hence helps me reach my goal of the lowest latency interconnect around.
So we are talking less than 10ns latency from CPU to CPU for the digital logic. SERDES will still exact a 24-30ns penalty on top of that. I am hoping optical will reduce this but you would have to go to silicon photonics for that.
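A back-of-the-envelope budget from those figures: the 3-5 cycle register-to-SERDES goal and the 24-30ns SERDES penalty are from above, while the 1 GHz core clock is purely my assumption for the arithmetic.

```python
# Rough CPU-to-CPU latency budget from the figures quoted above.
# The clock frequency is an ASSUMPTION (1 GHz => 1 ns per cycle).
CYCLE_NS = 1.0                    # assumed 1 GHz core clock

reg_to_serdes = 5 * CYCLE_NS      # worst case of the 3-5 cycle goal, per endpoint
serdes_pair = 30.0                # upper end of the quoted 24-30 ns SERDES penalty

# Digital logic at both endpoints, plus the SERDES penalty.
total = 2 * reg_to_serdes + serdes_pair
print(f"CPU-to-CPU budget: {total:.0f} ns")
```

Under these assumptions the digital logic at both ends stays inside the <10ns claim, and the SERDES dominates the total, which is exactly why optical looks attractive.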
Ultimately memory and I/O fabrics have to merge to a degree. See my proposal at
This is a similar approach to the company mentioned in this article, except that the use of HMC allows a cleaner architecture. This has ramifications beyond the interconnect, since you can have VM sharing across processors, not just shared memory. Building a proto as soon as I can convince Micron to give me samples !
If you notice, this also potentially bypasses all SW stacks, since in very low latency message passing, especially when you are sending cache lines, you cannot have SW in the way. But even in cases where you want OS-mediated messaging, our MK OS will stay out of the datapath. I used a similar approach when building clusters at Sybase back in the 90s.
3. Based on these requirements I did an eval of IB, PCIe and SRIO, since I saw no point in inventing a protocol from scratch. For point-to-point links, PCIe was OK, but it was very difficult to overlay fabric semantics on it. I had discussions with PLX and IDT on this. We even started building our own non-std PCIe variant. But after 6 months, I dumped PCIe and settled on SRIO, with the option of including some IB features down the line.
I am a non-commercial entity simply tasked with building the fastest interconnect. If somebody can show me how to build a mesh or a torus using PCIe to connect 256-1024 CPU boards with simpler SW, less silicon and lower latency than IB or SRIO, I would be glad to switch to PCIe.
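For scale, here is a quick sketch of worst-case hop counts for 3D torus configurations in that board range. The specific dimensions are my illustrative choices, not a proposed layout.

```python
# Worst-case hop counts for a wrap-around 3D torus, one board per node.
# The dimensions below are ILLUSTRATIVE choices covering 256-1024 boards.
def torus_diameter(dims):
    """Max hop count in a torus: sum of floor(d / 2) over each dimension,
    since wrap-around links halve the longest path in every dimension."""
    return sum(d // 2 for d in dims)

for dims in ((8, 8, 4), (8, 8, 8), (8, 8, 16)):
    nodes = dims[0] * dims[1] * dims[2]
    print(f"{nodes:5d} boards {dims}: worst-case {torus_diameter(dims)} hops")
```

Every hop multiplies the per-switch latency, which is why switch latency and header parsing cost matter so much more at this scale than on a point-to-point link.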
4. I agree with the shared memory approach advocated by the folks who are the subject of this article. My point is that I am concerned with HW latency while they are concerned with SW latency, and that ultimately it is the HW latency that is the determinant.
5. RIONet is a highly sub-optimal implementation. Even with shipping SRIO parts, see Concurrent's SRIO driver paper to see how lower latency drivers can be built.
6. Even given the constraints of going through PCIe for the HBA interface, there are ways to get around the extra PCIe latency. Build an intelligent SRIO adaptor that runs the message passing part of your kernel (MPI, TCP/IP or whatever). The CPU inside the SRIO controller almost acts as a co-processor, and I/O from your app to the SRIO controller will be zero copy if you use the same shared memory approach mentioned by the company that is building the new interconnect. You can use any of Freescale's QorIQ parts to build such an intelligent NIC. If you have a Microkernel OS (unlike an ancient dinosaur like UNIX or Linux !) this is a trivial exercise and is in fact the way a network sub-system should be architected. I/O and message passing simply do not belong in a general purpose CPU, in spite of Intel's protestations to the contrary.
7. We are also building an all-optical switch with optical digital logic (simple boolean logic only). The goal is to have a header simple enough that it can be parsed by this primitive optical compute engine and routed to the appropriate egress port. This is obviously still a while away, but with this I can get switch latency down to the 20-30 ns range from the current 100ns. Protocols with simpler headers are obviously a better fit.
Look, I have been building bus based systems since the early 80s (Z80 based systems) and started building clusters in the late 80s when the US embargoed India. My heartfelt gratitude to the State Dept. of the US, since without the embargo a Govt. lab in India would have just bought a Cray instead of asking my company to build a cluster. My point being that all my opinions are based on data from experiments that I have done. If someone presents evidence to the contrary, I can change my mind with ZERO latency ! But only hard numbers please.
ARM server vendors will use PCIe for I/O. But I think the SRIO folks are talking to ARM vendors to consider using it as a CPU to CPU interconnect, something that PCIe cannot be used for. I have also talked to ARM but they seem to have no opinion one way or another.
We use a MESIF protocol on top of the SRIO global shared memory interface to get a cache-coherent CPU to CPU interface. Home snooping for low processor counts and directory-based for higher socket counts. Intel found out that source snooping was no longer needed and so I dumped it too. Who am I to argue with their research ! I think the sweet spot is 8 sockets, but I have another 6-9 months of simulation runs to go before I can say that with confidence. But code for a basic proto should be ready by May.
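As a rough sketch of the read-side behavior, here is generic MESIF forwarding in miniature. This is an illustration of the textbook protocol, not our actual RTL or simulation code; the function name and simplifications are mine.

```python
# Minimal sketch of MESIF remote-read handling (ILLUSTRATIVE only).
# The key MESIF idea: exactly one shared copy holds Forward (F) and is
# responsible for supplying the line, so plain Shared (S) copies stay silent.
from enum import Enum

class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"
    F = "Forward"

def remote_read(holder_state):
    """New (holder, requester) states after a remote read reaches this holder."""
    if holder_state in (State.M, State.E, State.F):
        # The responsible copy supplies the line and demotes itself to S;
        # the requester takes over forwarding duty in F.
        return State.S, State.F
    if holder_state == State.S:
        # Plain S copies do not respond; memory (or the directory) supplies
        # the line, and the requester still ends up holding F.
        return State.S, State.F
    # Invalid: this peer has nothing to supply.
    return State.I, State.I

print(remote_read(State.F))
```

The point of F over plain MESI is that a cache, not memory, answers shared reads without every S copy racing to respond, which is what makes home snooping cheap at low socket counts.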
If someone wants to look at the code or the design, let me know and I can send a private copy. It will come on our bitbucket site after we validate it.
I can think of two ARM SoCs with PCIe off the top of my head: Freescale i.MX6 and Xilinx Zynq. There are probably others, but most ARM SoCs are designed for mobile devices so they don't need high-speed wired connectivity. As more ARMs get designed for servers, I bet you'll see plenty of PCIe.
PCIe is quite common in the PowerPC space, such as Freescale PowerQUICC 3 and QorIQ, and AMCC (now Applied Micro).