25Gbps is where the transducers are landing for simple, single-fiber capacity. 10Gbps underutilizes the hardware, and 40Gbps does not look feasible without 2 or more fibers; it is an implausible step up for transducers any time soon.
In a data center with millions of connectors it makes sense to aim at the optimal feasible point which seems to be 25G for the next few years. It is a good idea to get all the parties interested in buying and selling at this performance point to agree on compatibility.
At first glance it might not seem obvious to readers why you would have a 40G standard and a 50G standard when they are so similar in speed, but this all needs to be seen through the lens of the SERDES transceiver rates on the chips.
When a 10G SERDES was the fastest transceiver available, it made perfect sense to use bundles of fibres, each of which terminated at a 10G transceiver, and so we ended up with standards based on 1, 4 and 10 fibres aggregating to bandwidths of 10G, 40G and 100G respectively.
Now chips are available with 25G SERDES transceivers, so companies are right to revisit the standards and update them to be based on multiples of 25G. Hence bundles of 1, 2 and 4 fibres aggregating to bandwidths of 25G, 50G and 100G Ethernet respectively.
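The lane arithmetic above can be sketched in a couple of lines (Python purely for illustration):

```python
# Aggregate link bandwidth = SERDES lane rate x number of fibres (lanes).
def aggregate_gbps(lane_gbps, lanes):
    return lane_gbps * lanes

# 10G-SERDES generation: 1, 4 and 10 fibres
print([aggregate_gbps(10, n) for n in (1, 4, 10)])   # [10, 40, 100]
# 25G-SERDES generation: 1, 2 and 4 fibres
print([aggregate_gbps(25, n) for n in (1, 2, 4)])    # [25, 50, 100]
```

Same 100G endpoint, but the 25G generation gets there with a quarter of the fibres and transceivers.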
Thanks Nicholas for cutting to the heart of the issue: the rise of the 25G serial link after so many years of hard work.
It's easy to predict that quite a big wave of products will ride this technology.
So what are engineers (SERDES experts) turning their energies to next? I know there was a 100G serial workshop sponsored by the Ethernet Alliance this month, but methinks that's pretty far-future stuff, yes?
I always find it strange how the Ethernet world bumbles along, apparently oblivious to the IB world, in spite of IB being almost sole-sourced by Mellanox and that company appearing in this effort.
56Gb IB (which is admittedly 4-lane, thus 14Gb/lane) has been around for years and is pretty much the entry level in the computational datacenter. And it's copper. So this whole thing is puzzling on two counts: if the demand is there, why are these Ethernet/optical efforts lagging, and why are they insisting on optical? Power can't possibly be the issue, since we're not talking about high-density applications (even a bundle of 25Gb links coming from a rack is never going to compare to the power dissipated by the *compute* contents of the rack, or even the disks). Is cable *length* the issue instead?
Mark, IB does not scale to the data center. It is nice for building a supercomputer in a rack with RDMA and a few dozen number crunchers, but not designed to connect 100,000 or more servers in a data center. That is where the interest in 25G serdes based links comes into play. Maximum data in minimum counts of connectors and fibers.
Mellanox is playing with Ethernet using RoCE to try to get some of the IB benefits on an Ethernet fabric, but it is unclear if the PFC mechanism used to replace the tokenized pacing in IB will really work with an interesting number of machines and the short bursts of randomly sourced traffic which characterize the DC.
1. Basic IB, I think, uses 16-bit addressing, but there is extended addressing support available. Cost is an issue with IB, but where performance matters, IB is used in storage and RDBMS interconnects.
2. SRIO does not even have the addressing limitations, and the new 10xN spec allows 25km links. SRIO did not have a standard SW ecosystem for large clusters, but that is being remedied. Once that spec is out, I suspect you will see SRIO adoption increase in data centers. I am a member of a couple of SRIO WGs, so I am necessarily biased!
3. Ethernet is surviving in the data center only because of legacy reasons. It is a horrible anachronism in this day and age of fast, low-latency interconnects. In fact I question the very need for networking in a data center. IP packets are a horribly inefficient way to communicate in a closed data center; RDMA with proper capability-based security at the OS level is vastly more efficient. Just imagine the wasted bandwidth due to the IP stack and the processing overhead. And when you start to deploy large storage networks like NVMe or my own lightstor, Ethernet is not even an option that can be considered. Does someone seriously think that when I connect two CPU cards or two boxes in a rack at 100G per link (backplane lane or fiber), I should have to give up as much as 30% of capacity to protocol overheads?
4. And when you go to extra-large clusters of CC-NUMA machines, Ethernet is even more of a killer, since packets tend to be of 64KB size.
Fundamentally, the computing model in a data center is changing, and Ethernet frankly does not have a place in it. But there are dyed-in-the-wool diehards who cannot conceptualize a non-Ethernet world, and we are paying the price for it!
Forget the non-technical arguments for a while and let anyone prove that Ethernet is better in any respect:
- usable bandwidth (protocol efficiency)
- cost per 10G port
- energy per 10G port
- error resiliency at HW level
- cost of cables and connectors (a washout, since all use the same)
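On the usable-bandwidth point, a rough back-of-envelope helps. This sketch assumes standard TCP/IPv4 over Ethernet framing with no header options, and counts preamble and inter-frame gap as wire overhead:

```python
# Per-frame wire overhead for TCP/IPv4 over Ethernet (standard sizes assumed).
PREAMBLE, ETH_HDR, FCS, IFG = 8, 14, 4, 12   # bytes surrounding each frame
IP_HDR, TCP_HDR = 20, 20                     # headers without options

def efficiency(payload_bytes):
    wire = payload_bytes + TCP_HDR + IP_HDR + ETH_HDR + FCS + PREAMBLE + IFG
    return payload_bytes / wire

for p in (64, 512, 1460):
    print(f"{p:5d}-byte payload: {efficiency(p):.1%} of line rate is goodput")
```

So a ~30% (or worse) loss to protocol overhead is real for smallish messages (a 64-byte payload uses only about 45% of the line rate), while full-MTU transfers are closer to 95% efficient. Which regime dominates depends on the traffic mix.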
GSMD, it's not clear what about SRIO is so interesting. At the physical link level (which is what the original article is about) it seems to use the same interconnects as Ethernet (or, as you say, anything else). The existing SRIO protocols seem to use XON/XOFF pacing, which is archaic and not scalable in large flat networks. Without a better and proven answer to the pacing/congestion problem, claims to be more efficient than Ethernet at data center scale will meet justified skepticism. I'm not sure why you think Ethernet is slow: a VL2 fabric with commodity switches will connect any to any in a 100k pool of computers with about a half-microsecond transit time. RDMA is problematic at that scale, since RDMA is a paired connection with a dedicated buffer at each end, not ideal for cloud-scale data centers where your services may randomly map to tens of thousands of servers. Costs of Ethernet tend to be cheaper than alternatives with less competition. Failure rates in practice are very low, and in a cloud data center redundancy is managed through server redundancy, not component redundancy.
For SRIO to be interesting it needs to drive costs further down than existing commodity 10/40G Ethernet, and somehow offer
For an older but still interesting discussion of VL2 networks see:
1. The whole idea of using existing physical-layer interconnects is to reuse existing IP and not reinvent the wheel. But the SRIO PHY, like Interlaken, is better suited to large backplanes. The SRIO PHY layer also has useful stuff like in-band control symbols, which really help with protocol efficiency.
2. XON/XOFF was what SRIO started with; other pacing/congestion protocol support is very much available, and the basic protocol provides the primitives necessary for it. The upcoming switches and the SRIO fabric APIs will address these issues more explicitly. Since the DC was never an SRIO focus area, this just was not given prominence. The more complex protocol support was done by various OEMs and never standardized. Part of the problem was that the SRIO consortium willingly restricted itself to the protocol definition and did not venture into systems issues. That is changing now. The basic protocol provides excellent flow control, classes of service and negotiation primitives.
3. You are talking Ethernet at the L2 level; I am talking end-to-end app latency at sub-microsecond. We did not dump Ethernet and PCIe on a whim; they simply did not pass technical muster. I had high hopes for PCIe, but it turns out using it as a fabric is a mug's game. I was earlier part of the Intel ASI effort, so I should have known better than to give PCIe a second chance. A year's worth of benchmarking validated our choices conclusively. For an apples-to-apples comparison, you should compare Ethernet with reliable UDP against SRIO messaging. This will give you an idea of how inadequate Ethernet is. When Ethernet plus UDP/IP demonstrates sub-microsecond latency with SRIO-level protocol efficiency, I will gladly drop SRIO!
4. Nobody is talking about using RDMA at large cluster level. Base SRIO provides messaging, and a very robust, end-to-end one at that; RDMA and DSM are services on top of that. For most data center apps which use TCP/IP, messaging is the way to go. What I am advocating is getting rid of not just Ethernet but the whole IP stack for a large class of applications. TCP/IP and sockets simply do not belong in modern server-class applications. Database engineers and app server engineers spend a lot of time and effort on connection pooling and similar workarounds for the fallacy that is the IP stack.
I spent the best part of 3 decades on such problems; I should know! As any server architect will tell you, sockets programming and TCP/IP connections are a terrible way to talk to a server. This is not Ethernet's fault, but Ethernet falls victim to the excessive layering of the OSI model. That is truly where SRIO shines: I need the HW interconnect to provide reliable, routed messages.
5. And when you go on to 100+ core CPUs and HP Moonshot-type servers with 40+ sockets per chassis, how exactly is a TCP/IP network with Ethernet supposed to scale? Unnecessary amounts of energy get spent running large numbers of IP stack instances, and for smaller cores the IP stack is a burden. You then create an unwanted industry of TCP/IP accelerators. How does the Ethernet community propose to do low-cost, low-latency core-to-core communication using Ethernet and IP? I am not even talking about the various network virtualization and interrupt handling issues. Please realize that at the app level, and these days at the language level too, the paradigm is messaging, not networking. So the need is for the lower-level interconnects to align semantically with the needs of the upper SW layers. I spent 2-3 unfruitful years on making Ethernet more amenable to such requirements. It was a losing battle.
6. And last but not least, even today at 10G levels, SRIO silicon is just plain cheaper. So cost per port is lower, and energy per port is naturally lower too.
7. Displacing an entrenched incumbent is not easy, but the discussion here is on purely technical terms, not about market forces. If you look at where various interconnects are going, nobody mentions Ethernet and storage in the same phrase. PCIe seems to be the flavor of the day, but since it is not a fabric, I think SRIO has a better chance in the longer run. The same goes for RDBMS clusters, app server clusters, big data clusters - I could go on. I agree that moving away from classic networking requires a major paradigm shift, and that is a slow process. But the difficulty of the transition does not make Ethernet a better technical choice!
GSMD, from what I see the battle is over. Your points may have some purity, but the IP blocks are on the chips today and they run well. Server-class SoCs will generally have a minimum of 10GbE on them, some much more. And if you want to build a Moonshot-style chassis, why not just directly connect each server to a switch chip along the backplane? Broadcom chips are cheap, and you can run XAUI traces a meter. So while there might in theory have been some physical advantage, in fact it has been trumped by silicon.
The physical speed-of-light latency within the data center is up to 10 us, so sub-us makes little difference. You can see that Chelsio documents 2 us as routine for TCP, and Mellanox will give you somewhat lower for RDMA, running over the Ethernet hardware which is simply a given.
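For scale, the propagation numbers work out roughly like this (assuming signal speed in fibre of about 2e8 m/s, i.e. ~5 ns per metre):

```python
# Propagation delay in fibre at ~5 ns/m (light travels at roughly 2e8 m/s in glass).
NS_PER_METRE = 5

def propagation_us(metres):
    return metres * NS_PER_METRE / 1000

print(propagation_us(100))    # 0.5 us just crossing a row of racks
print(propagation_us(2000))   # 10.0 us over a 2 km cable path
```

At data center cable lengths the fibre itself costs microseconds, so shaving sub-microsecond amounts off the protocol stack disappears into the propagation budget.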
RDMA is interesting for tying one set of machines together to be a supercomputer, which is what Moonshot could be if it did not have a wimpy network.
The overheads apps have above L3 are pretty much of their own making. If you want IP to run fast, you can organize your software stack to do that.
At this point it just feels academic. The apps assume IP because they need to access everything inside and outside the data center, and IP is the lingua franca. The delay is mostly a problem of the speed of light; the switches are down around 10% of the delay. The chips come with Ethernet on board, and it would cost extra and a couple of years' delay to ask the vendors to give us something different. If it was not working, that would be an opportunity for replacement. But it is working well, and accelerating data rates (a 25GE rate on a single link is a 10x jump over today's norm) keep it low on the list of problems needing to change. That has always been Ethernet's success: evolve ahead of the needs.
Rick, IB anecdotally tops out around 50 to 100 servers. Are you aware of anything bigger? The protocols seem to revolve around a single shared switch connecting all servers, which is a serious limit to scale.
SRIO is pretty much a point to point link at the moment. Simple, but there are reasons why protocols for large networks are not that simple. I'd be fascinated to read any case studies of large networks using it.
Ethernet with VL2 or similar Clos switching networks in practice supports data center clusters with more than 100k servers, each with at least a 10 GbE link. Google is thought to have been like that for a while, Amazon and Microsoft too. It is the baseline for a modern DC.
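The scaling math behind that claim is easy to check with the classic fat-tree formula (VL2 uses a somewhat different Clos arrangement, but it scales similarly):

```python
# Host capacity of a 3-tier fat-tree built from k-port switches: k^3 / 4.
# (Standard fat-tree result; VL2's Clos layout differs in detail but not in scale.)
def fat_tree_hosts(k):
    return k ** 3 // 4

for k in (48, 64, 96):
    print(f"{k}-port switches -> {fat_tree_hosts(k):,} hosts at full bisection")
```

So commodity 96-port switches are already enough for a 200k+ host fabric with full bisection bandwidth, which is why Clos topologies became the DC baseline.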
Now it looks like Ethernet is tracking 25G SRIO! The most logical upgrade path is always to track commodity SERDES and also to support single-lane configurations. For some reason the IEEE Ethernet group could not garner enough votes for subsets of the 100G spec's 4x25G variant.
Since the RapidIO consortium is advocating replacing Ethernet with SRIO in datacenters, this makes an apples-to-apples comparison easier. While SRIO supports 16 lanes, the common configurations at 25G will be 1-4 lanes.
Not sure why IB went to the intermediate 14G lane before going to the 26G lane. 25/28G is getting standardized, and it would have made sense to wait for it. I guess they wanted to keep the tag of the fastest link to gain market share. Not sure it is worth it, though, since using non-standard (relatively speaking) strategies pushes up cost, and IB already has a cost issue.
PCIe has enough volume so it can afford to go down its own path for lane speeds. Also they have to cater to low cost designs and hence cannot easily go 25G.
But it would be nice if all protocols standardized on lane speeds, connectors and cabling and differentiated only at the protocol level. I do not think any of the protocols use lane speed or encoding differences as a marketing ploy. I guess that is too much to ask, but we are kind of getting there.
Ethernet, HMC and SRIO are at 25G per lane; IB is at 26G. I forget where Interlaken is going. The only holdout is PCIe. Lane encoding standardization would also help in creating common SERDES parts. Eth is at 64/66b and PCIe is at 128/130b, but I prefer the Interlaken/SRIO 64/67b since it limits 1/0 disparity and helps maintain DC balance. This is crucial in making line interface design simpler and keeping costs lower. It will definitely make the lives of the FPGA makers simpler; currently, SERDES configuration in FPGAs is a trifle complex!
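The coding overheads of the schemes mentioned work out as follows (overhead = extra framing bits per coded block):

```python
# Line-coding overhead = (coded bits - payload bits) / coded bits per block.
codes = {
    "64/66b  (Ethernet)":        (64, 66),
    "128/130b (PCIe Gen3)":      (128, 130),
    "64/67b  (Interlaken/SRIO)": (64, 67),
}
for name, (payload, coded) in codes.items():
    print(f"{name}: {(coded - payload) / coded:.2%} overhead")
```

So 64/67b pays roughly an extra 1.5 points of bandwidth over 64/66b for its disparity-limiting bit; whether the simpler DC-balanced line interface is worth that is exactly the trade-off being argued above.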