GSMD, from what I see the battle is over. Your points may have some purity, but the IP blocks are on the chips today and they run well. The server-class SoCs will generally have a minimum of 10GbE on them, some much more. And if you want to build a Moonshot-style chassis, why not just directly connect each server to a switch chip along the backplane? Broadcom chips are cheap, and you can run XAUI traces a meter. So while there might in theory have been some physical advantage, in fact it has been trumped by silicon.
My physical speed-of-light latency within the data center is up to 10 us, so sub-us makes little difference. You can see at Chelsio they document 2 us as routine for TCP, and Mellanox will give you somewhat lower for RDMA. All of that runs over the Ethernet hardware, which is simply a given.
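To put rough numbers on the speed-of-light point, here is a back-of-the-envelope sketch. It assumes light in fiber travels at about two thirds of its vacuum speed, and the path lengths are hypothetical examples, not figures from this thread.

```python
# Rough one-way propagation delay over optical fiber.
# ASSUMPTION: velocity factor of ~0.67 (typical for glass fiber).
# ASSUMPTION: the path lengths below are illustrative only.
C_VACUUM_M_PER_S = 299_792_458
VELOCITY_FACTOR = 0.67


def propagation_delay_us(length_m, velocity_factor=VELOCITY_FACTOR):
    """Return the one-way propagation delay in microseconds."""
    return length_m / (C_VACUUM_M_PER_S * velocity_factor) * 1e6


if __name__ == "__main__":
    for length_m in (10, 100, 1000, 2000):
        delay = propagation_delay_us(length_m)
        print(f"{length_m:>5} m: {delay:6.2f} us one way")
```

Under these assumptions a path of about 2 km comes out close to 10 us one way, which is the scale at which sub-microsecond switch differences stop mattering much.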
RDMA is interesting for tying one set of machines together to be a supercomputer, what Moonshot could be if it did not have a wimpy network.
The overheads apps have above L3 are pretty much of their own making. If you want IP to run fast, you can organize your software stack to do that.
At this point it just feels academic. The apps assume IP because they need to access everything inside and outside the data center, and IP is the lingua franca. The delay is mostly a speed-of-light problem; the switches are down around 10% of the delay. The chips come with Ethernet on board, and it would cost extra and add a couple of years of delay to ask the vendors to give us something different. If it was not working, that would be an opportunity for replacement. But it is working well, and accelerating data rates (a 25GE rate on a single link is a 10x jump over today's norm) keep it low on the list of problems needing to change. That has always been Ethernet's success: evolve ahead of the needs.
Rick, IB anecdotally tops out around 50 to 100 servers. Are you aware of anything bigger? The protocols seem to revolve around a single shared switch connecting all servers, which is a serious limit to scale.
SRIO is pretty much a point to point link at the moment. Simple, but there are reasons why protocols for large networks are not that simple. I'd be fascinated to read any case studies of large networks using it.
Ethernet with VL2 or similar Clos switching networks in practice supports data center clusters with more than 100k servers each with at least a 10 GbE link. Google is thought to have been like that for a while, Amazon and Microsoft too. It is the baseline for a modern DC.
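As a rough illustration of how a Clos design reaches that scale, here is a sketch using the textbook three-tier fat-tree sizing, where k-port switches support k^3/4 hosts. This is an assumption for illustration only: it is not the exact VL2 topology, and the port counts are hypothetical.

```python
# Sizing sketch for a three-tier k-ary fat tree (a folded Clos network).
# ASSUMPTION: classic fat-tree formula, hosts = k**3 / 4, built from
# identical k-port switches. VL2 itself uses a different arrangement.


def fat_tree_hosts(k):
    """Hosts supported by a three-tier fat tree of k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be an even number >= 2")
    return k ** 3 // 4


def min_ports_for(hosts):
    """Smallest even switch port count whose fat tree holds `hosts`."""
    k = 2
    while fat_tree_hosts(k) < hosts:
        k += 2
    return k


if __name__ == "__main__":
    for k in (24, 48, 64, 96):
        print(f"{k}-port switches: {fat_tree_hosts(k):,} hosts")
    print("ports needed for 100,000 hosts:", min_ports_for(100_000))
```

The point of the sketch is only that host count grows with the cube of the switch port count, so commodity switches with a modest number of ports are enough for pools of around 100k servers.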
GSMD, it is not clear what about SRIO is so interesting. At the physical link level (which is what the original article covers) it seems to use the same interconnects as Ethernet (or as you say, anything else). The existing SRIO protocols seem to use XON/XOFF pacing, which is archaic and not scalable in large flat networks. Without a better and proven answer for pacing and congestion control, claims to be more efficient than Ethernet at data center scale will meet justified skepticism. I'm not sure why you think Ethernet is slow: a VL2 fabric with commodity switches will connect any to any in a 100k pool of computers with about a half microsecond transit time. RDMA is problematic at that scale since RDMA is a paired connection with dedicated buffers at each end, not ideal for cloud-scale data centers where your services may randomly map to tens of thousands of servers. Ethernet tends to be cheaper than alternatives that have less competition. Failure rates in practice are very low, and in a cloud data center redundancy is managed through server redundancy, not component redundancy.
For SRIO to be interesting it needs to drive costs further down than existing commodity 10/40G Ethernet, and somehow offer
For an older but still interesting discussion of VL2 networks see:
Mark, IB does not scale to the data center. It is nice for building a supercomputer in a rack with RDMA and a few dozen number crunchers, but not designed to connect 100,000 or more servers in a data center. That is where the interest in 25G serdes based links comes into play. Maximum data in minimum counts of connectors and fibers.
Mellanox is playing with Ethernet using RoCE to try to get some of the IB benefits on an Ethernet fabric, but it is unclear if the PFC mechanism used to replace the tokenized pacing in IB will really work with an interesting number of machines and the short bursts of randomly sourced traffic which characterize the DC.
I always find it strange how the eth world bumbles along, apparently oblivious to the IB world. In spite of IB being almost sole-sourced by Mellanox, and that company appearing in this effort.
56Gb IB (which is admittedly 4-lane, thus 14Gb/lane) has been around for years, and is pretty much the entry level in the computational datacenter. And it's copper. So this whole thing is puzzling on two counts: if the demand is there, why are these eth/optical efforts lagging, and why are they insisting on optical? Power can't possibly be the issue, since we're not talking about high-density applications (even a bundle of 25 Gb links coming from a rack is never going to compare to the power dissipated by the *compute* contents of the rack, or even the disks). Cable *length* instead?
Thanks Nicholas for cutting to the heart of the issue--the rise of the 25G serial link after so many years of hard work.
It's easy to predict that quite a big wave of products will ride this technology.
So what are engineers (serdes experts) turning their energies to next? I know there was a 100G serial workshop sponsored by the Ethernet Alliance this month, but methinks that's pretty far future stuff, yes?
At first glance it might not seem obvious to readers why you would have a 40G standard and a 50G standard as they are so similar in speed, but this all needs to be seen through the lens of the SERDES transceiver rates on the chips.
When a 10G SERDES was the fastest transceiver available, it made perfect sense to use bundles of fibres each of which terminated at a 10G transceiver, and so we ended up with standards based on 1, 4 and 10 fibres aggregating to bandwidths of 10G, 40G and 100G respectively.
Now, chips are available with 25G SERDES transceivers, so companies are right to revisit the standards and update them to be based on multiples of 25G. Hence bundles of 1, 2 and 4 fibres aggregating to bandwidths of 25G, 50G and 100G Ethernet respectively.
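The lane arithmetic behind those two families of rates can be sketched as below. This is a minimal illustration using nominal rates only; it ignores line-encoding overhead, and the names are made up for the example.

```python
# Aggregate Ethernet rate as (SERDES lane rate) x (number of lanes).
# ASSUMPTION: nominal rates, ignoring encoding overhead on the wire.
LANE_BUNDLES = {
    10: (1, 4, 10),  # 10G SERDES generation: 10G, 40G, 100G
    25: (1, 2, 4),   # 25G SERDES generation: 25G, 50G, 100G
}


def aggregate_gbps(serdes_gbps, lanes):
    """Nominal aggregate bandwidth for a bundle of identical lanes."""
    return serdes_gbps * lanes


def lanes_needed(target_gbps, serdes_gbps):
    """Lanes required to reach a target rate (ceiling division)."""
    return -(-target_gbps // serdes_gbps)


if __name__ == "__main__":
    for serdes, bundles in LANE_BUNDLES.items():
        rates = [aggregate_gbps(serdes, n) for n in bundles]
        print(f"{serdes}G SERDES, lanes {bundles}: {rates} Gb/s")
    print("lanes for 100G at 10G:", lanes_needed(100, 10))
    print("lanes for 100G at 25G:", lanes_needed(100, 25))
```

Seen this way, 40G and 50G are not competing speeds so much as the natural small bundles of two different SERDES generations, and 100G drops from ten lanes to four.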