PCIe as a fabric and an interconnect is a lost cause! I wish efforts to make it one would stop. It is a good interconnect for what it was intended to do: connect master devices to I/O subsystems. Of course, its technical deficiencies will not stand in the way of its becoming a success, since marketing trumps technology any day!
Full disclosure: I am a member of the RapidIO trade association but have no commercial interests since I work for a University. Our SRIO controller is also open source.
Hence I would recommend SRIO, which works today — you can buy SRIO HBAs for PCIe slots right now. Of course, going through PCIe adds latency, but only Freescale and TI CPUs have native SRIO controllers. IB offers similar performance and latency but is fairly expensive, since vendors are few.
And there is no push to make it a CPU to CPU interconnect.
1. SRIO 3.0, the current standard, specifies lane speeds of 10 Gbit/s and 25 Gbit/s and a maximum of 16 lanes per port. You should see parts this year. The lower speed is faster than PCIe and the same speed as 10GbE. SRIO is basically tracking Ethernet SERDES standards and hence is assured of market-competitive lane speeds; the higher speed is competitive with other standards. So I think that takes care of your obsolete-line-speed comment.
The encoding is also 64b/67b, which makes it more reliable than PCIe or 10GbE on longer PCB traces.
Speeds higher than 25G are problematic on PCBs, and it is not clear how the industry is going to proceed. We are going optical beyond 25G for longer lengths but may continue to use electrical up to 50G speeds for ultra-short lengths.
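To put numbers on the encoding point above, here is a quick sketch comparing the line-coding efficiencies involved (the scheme names and ratios are the standard ones; the side-by-side comparison is mine):

```python
# Line-coding efficiency for the common SERDES encodings.
# 64b/67b (SRIO Gen3, Interlaken) trades ~1.5% extra overhead
# vs 64b/66b for guaranteed DC balance and run-length limits,
# which matters on long PCB traces.
encodings = {
    "8b/10b   (PCIe Gen1/2, sRIO Gen1/2)": 8 / 10,
    "64b/66b  (10GbE)": 64 / 66,
    "64b/67b  (SRIO Gen3)": 64 / 67,
    "128b/130b (PCIe Gen3)": 128 / 130,
}

for name, eff in encodings.items():
    # Payload bit rate on a 10.3125 Gbaud lane with this encoding
    print(f"{name}: {eff:.4f} efficiency, "
          f"{10.3125 * eff:.3f} Gbit/s on a 10.3125 Gbaud lane")
```

Note how 64b/66b on a 10.3125 Gbaud lane lands exactly on 10 Gbit/s of payload, which is where the 10GbE lane rate comes from.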
2. All interconnects other than GigE have sourcing issues. PCIe switches are also available only from IDT and PLX, and IDT nearly acquired PLX last year. So I am not sure what your point is, since it is a problem for all interconnects, not just SRIO.
I had also pointed out that my institution, IIT-Madras, is releasing a commercial-grade SRIO 3.0 IP (10G first and then 25G) under a BSD license. We will also jointly release dev kits with Xilinx using Xilinx 10G/25G SERDES, so that users have a ready-to-use FPGA platform. The open source kit will contain all digital portions, including the PCS/PMA components. It will take some time, but the non-PHY components are already online (bitbucket.org/casl). I would like to release the SERDES too, but that involves coordinating with foundries, so we will use third-party SERDES for now.
The kit will also include a complete verification IP, again completely free under a BSD license. More than 20 man-years of effort will go into this.
Commercial entities have already started evaluating this IP.
That should take care of any second sourcing concern since no other interconnect will have this wide an availability.
In any case, you are wrong about IDT being the only IP source. Xilinx, Altera, Praesum, Mobiveil and possibly others provide FPGA and silicon IP.
"The 10xN specification, backward compatible with RapidIO Gen1 and Gen2 systems, supports 10.3125 Gbaud per serial lane"
The 25G story is:
"RapidIO specifications are under development to support 25...."
Other interconnects are shipping 25 Gbaud now, and yes, IP has been available for years, but never full cores for RapidIO (just thin PHY layers); complete cores for interconnects like PCIe and Ethernet are available for FPGAs.
You may have access to information about roadmaps for Gen 3 RapidIO parts, but the reality is that InfiniBand, PCIe and Ethernet are shipping these speeds, and have been for some time, in volume.
In niche applications second sourcing may not be an issue, but in volume it is.
Freescale and TI support both PCIe and Ethernet (and older Gen2 sRIO).
Per the RapidIO product showcase, the last Freescale product with sRIO was in 2008.
1. The site is misleading. The spec as published today supports 10G and 25G. Some of the features optional for 10G (mainly error-correction related) are mandatory for 25G. Since I am implementing a 25G solution today using Xilinx SERDES, I am pretty certain the spec supports 25G. The specs are free online; take a look. You will see a PHY only for 10G, because the 25G SERDES for Ethernet is not final. Once that is final, SRIO will specify it too. But you can implement preliminary 802.3bm if you want to go 4x25G optical now, which is what I am doing, using zQSFP modules (Intel MXC is the other option).
2. I do not know what you are referring to as shipping high-speed interconnects.
Ethernet is only 10G per lane now. 40G is 4x10, 100G is 10x10, with a proposed 4x25.
SRIO uses the same SERDES technology as Ethernet, so by definition it will track Ethernet in terms of speeds.
PCIe is only 8G today, and the proposed higher-speed standard is not ready. How can you claim PCIe is shipping at speeds greater than 8G?
Only InfiniBand in the interconnect space is faster (not including FC, since it is irrelevant in this space).
Among other responsibilities, I am part of the official interconnect standards effort in India, so you can rest assured I track these on a daily basis! I also used to sell PCIe, SRIO and HT IP for a decade.
To sum up
Ethernet is currently spec'ed at only 10G per lane.
PCIe is spec'ed at only 8G per lane.
InfiniBand EDR (the only finalized variant of InfiniBand) is spec'ed at 25G, same as SRIO. HDR IB will ship only in 2017.
If you think otherwise, please show me shipping Ethernet and PCIe parts with speeds greater than 10G per lane, or IB at more than 25G per lane.
So of the lot, only SRIO and IB are spec'ed at 25G. Granted, SRIO is trailing IB by one year, but that hardly makes it an antique interconnect.
By the way, there is nothing in the SRIO standard that limits it to 25G. The changes will come mainly in the PHY, since error correction becomes a major issue. As you can see from the spec, the encoding is conservatively spec'ed at 64b/67b, since 10GbE had problems with 64b/66b. Interlaken is similarly conservative.
I had actually described the latency figures in detail in another post. But the bottom line is that SRIO's 100 ns latency is the best today. Nothing magical in that: keep the protocol simple and latency will be low. The KISS principle is applied well in SRIO.
PCIe is slightly worse. Having implemented both IPs, there frankly is not that much theoretical latency difference, but SRIO, being lighter, will have less latency. Most of the latency actually gets swallowed in the PHY. Having said that, switch latencies for PCIe are higher than I would have thought. Just see the public datasheets of PCIe switches from IDT and PLX and SRIO switches from IDT. I am not making it up.
Let me go through the latencies of our published IP: 1 cycle for the logical and transport layers. That works out to 0.5 ns at 2 GHz. Maybe 1 GHz is more typical, so 1 ns. The rest is the digital + analog PHY.
Public SERDES figures are in the 15 ns range. The PCS/PMA layer seems to be sub-5 ns, but we have not finished coding yet. CRC itself seems to be 3 cycles. So 20-30 ns is a good target to aim for if attached directly to the bus. PCIe will be higher.
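Adding up the pieces above as a back-of-the-envelope budget (the cycle counts and the 5 ns / 15 ns figures are the estimates from this post, not measured silicon):

```python
# Endpoint latency budget for the SRIO IP described above,
# assuming a 1 GHz core clock (1 ns per cycle).
CLOCK_GHZ = 1.0
ns_per_cycle = 1.0 / CLOCK_GHZ

budget = {
    "logical + transport layer (1 cycle)": 1 * ns_per_cycle,
    "CRC (3 cycles)": 3 * ns_per_cycle,
    "PCS/PMA (estimate, not final)": 5.0,
    "SERDES (public figures)": 15.0,
}

total = sum(budget.values())
for stage, ns in budget.items():
    print(f"{stage:40s} {ns:5.1f} ns")
print(f"{'total (one direction)':40s} {total:5.1f} ns")
```

With these numbers the total lands at 24 ns, squarely inside the 20-30 ns target quoted above; the SERDES alone dominates the budget, which is the point about latency getting swallowed in the PHY.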
It is amazing that there is not a single detailed technical analysis at this level comparing PCIe, Ethernet, SRIO, IB, FC, Interlaken and QPI. If there were, we would not be having these discussions. Freescale, by the way, bought the SRIO IP from my previous company, and Cray got HT from us too. So I have had to do these analyses for a while now. I do wish these discussions would come to this level, so that we can sort out issues technically instead of having to rely on marketing FUD from vendors and trade bodies.
I basically got pulled into all this for two reasons: I occasionally teach computer architecture at the master's level at one of India's premier tech universities, and I had to select a standard interconnect for the India processor project and the supercomputer project. We settled on PCIe first, but it simply did not cut it technically. So after 6 months we decided to switch to SRIO.
After all the analysis, what I realized was that all the standards are pretty much at the same speed, since they all have to use the same SERDES, which basically means 10, 14, 25/28, 32 and 50/56 Gbaud. Latency varies depending on the protocol. Ethernet is the worst, obviously; QPI is probably the best. I am trying to match QPI in our cache-coherent interconnect, which is built over SRIO's GSM functionality.
But as I pointed out, the issue with PCIe is its fundamentally flawed architectural model. Like our reptilian brain stem, it cannot get rid of its PCI ancestry! The attendant flaws are fixed in its genes. Remember Intel's aborted switch fabric standard over PCIe, ASI? We started on an IP for that too before quickly coming to the conclusion that it was a non-starter.
PCIe switch fabrics are like experiments with fully socialist governments. Every now and then somebody thinks it is a good idea and has a go at it, then, quickly realizing the futility of the exercise, gives it up. Then someone else comes along a few years later and tries again! Unless you are a glutton for punishment, why on earth would you try doing a fabric using non-transparent bridging?
Now, the latency claims you are talking about are, I think, purely in the SW/driver domain; technically they have nothing to do with the standard. But if all the standards implemented optimal drivers, then the standard's silicon latency would again be the determining factor. It is to alleviate this that, in our experimental processors, we are linking the SRIO endpoint directly to the processor core (exactly the way the Transputer did it eons ago; funny how things never change). So an SRIO message is just a single instruction of overhead; the message will appear either in a special buffer or in the cache of the remote CPU. That is the way to build a fast interconnect: you can bypass all the cache and MMU nonsense.
1. I would hesitate to answer your question without knowing more about A3Cube's claims. I am not sure if the latency is lower because of a better SW stack or an optimized silicon datapath.
I guess they feel there is a good PCIe interconnect market and that the best PCIe implementation will get them business. This is not bad reasoning, since to use IB or SRIO in an x86 system you add the extra PCIe latency anyway, so the latency advantages of SRIO or IB get wiped out. So even if you are merely on par with SRIO or IB, there probably is good business.
2. It is also not clear what portion of the packet flow the 100 ns latency refers to: host controller or switch? Tough to analyse given the paucity of data.
But as I pointed out, latency is not the only issue when doing interconnects; the interconnect should support peer-to-peer topologies. PCIe does not natively do so, and hence you pay the penalty in switch latency and silicon cost.
Whoever wrote the article got the latency number wrong; we have a memory-to-memory latency of about 750 nanoseconds (including software latency).
The goal of the system is to use PCIe's memory-mapping capability across the fabric to avoid any protocol encapsulation (like Ethernet), to keep latency as low as possible, and to bypass the operating system kernel stack.
In an x86 server there is no native RapidIO interface, so a RapidIO bridge must also plug into a PCIe slot (the same goes for IB), so in an x86 server the minimum latency you can experience is the root complex latency. That is the reason we extend the memory mapping of PCIe.
RONNIEE Express is a low-latency interconnect, or data plane, that uses shared memory mapping for communication between nodes, extending the memory mapping used by PCIe.
RONNIEE Express implements support for a low-latency 3D torus topology with flow control and traffic congestion management.
The most interesting thing about this interconnect is that it permits implementing a TCP socket in memory that uses the memory mapping for communication, bypassing the kernel, so unmodified TCP applications can run with 1-2 microseconds of latency.
The latency of our memory-mapped socket is 10 times less than RapidIO's RIONET, so our memory approach is really powerful. And it is not limited to PCIe; it opens a new way to use distributed memory for communication.
To understand this better, watch this video: http://www.youtube.com/watch?v=YIGKks78Cq8
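The shared-memory-socket idea above can be illustrated with a minimal single-host sketch: two views of the same mapped region exchange a message with plain loads and stores, with no socket or kernel network stack in the data path. (Illustrative only; a real fabric maps remote memory over PCIe, not an anonymous local mapping, and the flag/length framing here is my own toy convention.)

```python
# Toy model of memory-mapped communication: writer and reader
# share one mapped region and exchange a length-prefixed message.
import mmap

SIZE = 4096
region = mmap.mmap(-1, SIZE)          # anonymous shared region

# "Sender": write the payload, then the length, then a ready flag.
payload = b"hello over shared memory"
region[4:4 + len(payload)] = payload
region[1:4] = len(payload).to_bytes(3, "little")
region[0:1] = b"\x01"                 # flag set last, as a crude barrier

# "Receiver": check the flag, then read the payload directly.
assert region[0] == 1
length = int.from_bytes(region[1:4], "little")
msg = bytes(region[4:4 + length])
print(msg.decode())                   # -> hello over shared memory
```

In a real kernel-bypass design the receiver would poll (or be interrupted on) the flag from another process or node, but the data path is the same: loads and stores against mapped memory.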
The discussion in this thread seems to have migrated off topic, since the original question pertained to the viability of a PCIe fabric. But since this has turned into a discussion of the merits of SRIO, it makes sense to compare PCIe and SRIO for use as a general-purpose fabric. And that comparison comes out significantly in favor of PCIe.
For purposes of full disclosure here, I am an advocate of using PCIe as a rack-level fabric, and a PLX employee.
The major advantage of PCIe as a fabric is that almost every device -- CPUs, NICs, HBAs, HCAs, FPGAs, you name it -- has a PCIe port as at least one of its connection points, meaning you can eliminate all of the bridging devices necessary in other fabric types. This reduces power and cost, since you are eliminating a lot of bridges, and for that same reason it also reduces latency, so it seems obvious that this is the right approach.
There seems to be a series of assumptions here about how PCIe can be used as a fabric, and much of it is outdated. A PCIe-based fabric that connects directly to existing devices can be constructed, and it doesn't need to use non-transparency (which seems to be an implicit assumption in this thread). PLX is doing just that, and you can go to www.plxtech.com/expressfabric for a more in-depth explanation.
SRIO has the same drawbacks as Ethernet and InfiniBand in regards to needing additional bridges in most cases, since the devices that have a direct SRIO interface can be easily counted – and that's not many. And SRIO doesn't even have the advantage of the nearly universal usage that Ethernet has, or the incumbency for HPCs that InfiniBand has. So it has all of the disadvantages of the other alternatives, and none of their advantages.
This has nothing to do with the technical merits of SRIO at the SerDes or protocol level. It is a well-defined, high-performance interconnect. But it lost out to PCIe as the general-purpose connection back when it mattered, and the virtuous cycle that developed for PCIe made this a non-competition. You can argue about the speeds and feeds as much as you want, but as none other than the philosopher/actor Bill Murray once said, "It just doesn't matter."
What about the impact of ARM? How many ARM MPUs have direct PCIe interfaces? What happens if ARM vendors adopt SRIO? (Probably not likely, but could happen).
PCIe has achieved its success primarily because of Intel's dominance. If Intel loses its dominance (or, arguably, has already lost it outside of the traditional PC desktop/laptop world), then we could see changes.
I can think of two ARM SoCs with PCIe off the top of my head: Freescale i.MX6 and Xilinx Zynq. There are probably others, but most ARM SoCs are designed for mobile devices so they don't need high-speed wired connectivity. As more ARMs get designed for servers, I bet you'll see plenty of PCIe.
PCIe is quite common in the PowerPC space, such as Freescale PowerQUICC 3 and QorIQ, and AMCC (now Applied Micro).
ARM server vendors will use PCIe for I/O. But I think the SRIO folks are talking to ARM vendors to consider using it as a CPU to CPU interconnect, something that PCIe cannot be used for. I have also talked to ARM but they seem to have no opinion one way or another.
We use a MESIF protocol on top of SRIO's global shared memory interface to get a cache-coherent CPU-to-CPU interface: home snooping for low processor counts and directory-based for higher socket counts. Intel found that source snooping was no longer needed, so I dumped it too. Who am I to argue with their research! I think the sweet spot is 8 sockets, but I have another 6-9 months of simulation runs to go before I can say that with confidence. Code for a basic prototype should be ready by May, though.
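For readers unfamiliar with MESIF, here is a toy sketch of the state transitions involved, from the point of view of one cache (my own simplification; the real protocol also covers writebacks, invalidations, and the home-agent/directory flow):

```python
# Toy MESIF transitions for a single cache line. F (Forward) is
# the MESIF addition over plain MESI: exactly one sharer holds F
# and answers read requests, so clean shared data is forwarded
# cache-to-cache instead of always coming from memory.
MESIF = {
    # (state, event) -> next state
    ("I", "local_read_miss_unshared"): "E",  # exclusive clean copy
    ("I", "local_read_miss_shared"):   "F",  # we become the forwarder
    ("E", "local_write"):              "M",  # silent upgrade, no bus op
    ("F", "remote_read"):              "S",  # requester takes F, we keep S
    ("M", "remote_read"):              "S",  # supply data, line now shared
    ("S", "remote_write"):             "I",  # invalidated by another writer
}

def next_state(state, event):
    return MESIF[(state, event)]

# A line read exclusively, then written, then snooped by a peer:
s = next_state("I", "local_read_miss_unshared")   # E
s = next_state(s, "local_write")                  # M
s = next_state(s, "remote_read")                  # S
print(s)  # -> S
```

The home-snooping vs directory split mentioned above sits below this table: it changes who answers the "remote_read" / "remote_write" events, not the per-line states themselves.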
If someone wants to look at the code or the design, let me know and I can send a private copy. It will come on our bitbucket site after we validate it.
1. I have made it abundantly clear that I am giving only technical opinions and not commenting on how the market decides the winners. I am still annoyed that VHS won over Beta! Or that CP/M lost to DOS. Or that ATM lost to Ethernet. Nearly 30 years in the industry has taught me not to make predictions.
2. My mandate is simple. The Govt. of India has tasked me and my team with coming up with a homegrown processor architecture (ranging from Cortex-M3-class parts to Intel Knights Landing-class HPC processors, with all the variants in between, for a total of 6 families). For the HPC variants, I was also asked to come up with the fastest interconnect possible. Market considerations and any notion of compatibility were to be completely ignored.
An open source microkernel was also mandated to run on these processors (Linux and Android would be virtualized).
With respect to the interconnect, I have the luxury of co-designing the CPU (including the MMU), the interconnect and the OS, so no HBA penalty applies. In fact my goal is to get from a CPU register to the SERDES in 3-5 cycles (for non-cached data). That is pretty much the theoretical limit and hence helps me reach my goal of the lowest-latency interconnect around.
So we are talking less than 10 ns latency from CPU to CPU for the digital logic. The SERDES will still extract a 24-30 ns penalty on top of that. I am hoping optics will reduce this, but you would have to go to silicon photonics for that.
Ultimately memory and I/O fabrics have to merge to a degree. See my proposal at
This is a similar approach to the company mentioned in this article, except that the use of HMC allows a cleaner architecture. This has ramifications beyond the interconnect, since you can have VM sharing across processors, not just shared memory. I am building a prototype as soon as I can convince Micron to give me samples!
If you notice, this also potentially bypasses all SW stacks, since in very low latency message passing, especially when you are sending cache lines, you cannot have SW in the way. But even in cases where you want OS-mediated messaging, our MK OS will stay out of the datapath. I used a similar approach when building clusters at Sybase back in the 90s.
3. Based on these requirements, I did an evaluation of IB, PCIe and SRIO, since I saw no point in inventing a protocol from scratch. For point-to-point links PCIe was OK, but it was very difficult to overlay fabric semantics on it. I had discussions with PLX and IDT on this. We even started building our own non-standard PCIe variant. But after 6 months I dumped PCIe and settled on SRIO, with the option of including some IB features down the line.
I am a non-commercial entity simply tasked with building the fastest interconnect. If somebody can show me how to build a mesh or a torus using PCIe to connect 256-1024 CPU boards with simpler SW, less silicon and lower latency than IB or SRIO, I would be glad to switch to PCIe.
4. I agree with the shared memory approach advocated by the folks who are the subject of this article. My point is that I am concerned with HW latency while they are concerned with SW latency, and that finally it is the HW latency that is the determinant.
5. RIONet is a highly sub-optimal implementation. Even with shipping SRIO parts, see Concurrent's SRIO driver paper for how lower-latency drivers can be built.
6. Even given the constraint of going through PCIe for the HBA interface, there are ways to get around the extra PCIe latency. Build an intelligent SRIO adaptor that runs the message-passing part of your kernel (MPI, TCP/IP or whatever). The CPU inside the SRIO controller then acts almost as a co-processor, and I/O from your app to the SRIO controller will be zero-copy if you use the same shared memory approach mentioned by the company that is building the new interconnect. You can use any of Freescale's QorIQ parts to build such an intelligent NIC. If you have a microkernel OS (unlike an ancient dinosaur like UNIX or Linux!) this is a trivial exercise and is in fact the way a network subsystem should be architected. I/O and message passing simply do not belong in a general-purpose CPU, in spite of Intel's protestations to the contrary.
7. We are also building an all-optical switch with optical digital logic (simple Boolean logic only). The goal is to have a header simple enough to be parsed by this primitive optical compute engine and routed to the appropriate egress port. This is obviously still a while away, but with it I can get switch latency down to the 20-30 ns range from the current 100 ns. Protocols with simpler headers are obviously a better fit.
Look, I have been building bus-based systems since the early 80s (Z80-based systems) and started building clusters in the late 80s when the US embargoed India. My heartfelt gratitude to the US State Dept., since without the embargo a Govt. lab in India would have just bought a Cray instead of asking my company to build a cluster. My point being that all my opinions are based on data from experiments I have done. If someone presents evidence to the contrary, I can change my mind with ZERO latency! But hard numbers only, please.
My post on PCIe vs SRIO (or others) was not comparing the business side to the technical – I was explaining that PCIe was technically the best solution based on key metrics. The fact is that PCIe provides a universal connection – this is not really open to reasonable debate – and that this offers advantages – technical advantages – that no other interconnect can offer.
PCIe provides the largest universe of components with direct connection, which offers a clear advantage in power, cost, and latency, and this leads to a potential performance edge as well. And it allows the use of most devices – since they almost all have PCIe as a connection – so that you can build up a system that is tuned to your need with off-the-shelf components. These are not business advantages, but technical advantages. And it is these technical advantages that lead to the business success, not the other way around.
I would encourage people to go to the page that I noted before (www.plxtech.com/expressfabric) to see that the PCIe solution that I was referencing goes beyond a simple PCIe network. It offers both DMA and RDMA, and in each case they are compatible with the vast number of applications that have been written for Ethernet and InfiniBand. You benefit from the advantages I mentioned, and you can use the same applications.
And it allows sharing of I/O devices among multiple hosts with existing devices and drivers. These features can all be had with a single, converged fabric.
It is hard to comment on A3Cube's latency claims, since their implementation is secret.
There's one latency and throughput factor that I don't see mentioned here. If you are transmitting small (<64-byte) frames, then the packet overhead adds significantly to latency and cuts into throughput, and the fact is that PCIe packets have a bigger overhead than RapidIO packets. For small packets, there's no question which one is faster.
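The small-packet effect is easy to quantify. A quick sketch, using illustrative per-packet overheads (the 24-byte and 16-byte figures below are assumed round numbers for the sake of the comparison, not values taken from either spec; exact overhead varies with packet format):

```python
# Wire efficiency (payload / total bytes on the wire) for small
# payloads under two assumed per-packet overheads: ~24 B for a
# PCIe TLP with DLL/framing overhead vs ~16 B for a RapidIO
# packet. The absolute numbers are illustrative; the point is
# how fixed overhead dominates as payloads shrink.
def efficiency(payload_bytes, overhead_bytes):
    return payload_bytes / (payload_bytes + overhead_bytes)

for payload in (16, 64, 256, 1024):
    pcie = efficiency(payload, 24)    # assumed PCIe overhead
    srio = efficiency(payload, 16)    # assumed RapidIO overhead
    print(f"{payload:5d} B payload: PCIe {pcie:.1%}, SRIO {srio:.1%}")
```

At 64-byte payloads the gap is several percentage points of wire efficiency under these assumptions, while at 1 KiB payloads the two converge; the overhead difference matters mostly in the small-message regime the post is talking about.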