News & Analysis
Cloud Computing--Packet Buffering for Data Center Switches
Ori Aruj, Dune Networks
11/16/2008 7:29 PM EST
Data centers consist of hundreds to many thousands of servers and several storage machines co-located within a single room or a building. The servers are interconnected using an Ethernet network. The servers connect to storage machines via a separated fiber-channel network. Traffic in the Ethernet network is typically carried via TCP. A data center presents a unique environment for TCP traffic. The rate of the links is high (1GE, 10GE, and soon 40/100GE), while server-to-server latency is low (approximately 100 microseconds or less). The network is large, thus is built from up to 3 tiers (access, distribution and core) where connection between switches at the different tiers may be oversubscribed.
Several papers studied the packet buffering size required for TCP traffic in a switch (switch and router will be used interchangeably). Unfortunately, the main focus of many of these papers is the wide area network. None of the papers discussed the data center network specifically. This article surveys known results and applies them specifically to the data center environment.
TCP Algorithm
TCP's algorithm and the behavior it produces outside of the scope of this article. The following is a summary of the most relevant points:
A TCP algorithm behaves best in a state called "steady state." In this state the TCP sender runs with a full window of packets in transit. A new packet isn't put into the network by the TCP sender until an old packet leaves. The TCP sender knows when a packet leaves the network by receiving an ACK packet from the TCP receiver.
A system running in steady state is self-clocked. Clock ticks are the ACKS messages received from the TCP receiver. If the latency via the network gets higher (i.e., due to a backlogged queue) the ACK messages for these packets arrives with delay, thus slowing down the rate at which the TCP sender puts new packet into the network. However, in order for all TCP senders to run in steady state, the network should buffer all the packets that are possibly in transit.
At the TCP flow establishment, a TCP sender receives a maximum window parameter (W_max) from the TCP receiver. The W_max is used as the limit that the actual window used for transmit can grow to. W_max size should be large enough to enable the TCP sender to continuously transmit packets as long as the network is not congested. However, it should not be too large, to prevent the TCP sender from overloading a congested network.
The rate a TCP flow uses at any specific time is roughly equal to the present window size used by the TCP sender divided by the round-trip propagation delay (RTT). For example, assume a TCP sender in a DC network, with network RTT of 100us and the server link of 10Gbps. For a TCP sender to utilize the 10GE link its window has to be approximately 128KB (100us/10Gbps).
Since network congestion changes over time, TCP uses an adaptive algorithm to pick a window size ranging from 0 to W_max. After every second window of data that it sends, TCP increases its window size by one packet. The TCP flow decreases the window size by two mechanisms:
- If TCP sender does not receive ACK for specific segment it infers that the network has dropped a packet due to congestion and reduces the window by 50%
- Timeout--If it fails to transmit a segment within a time out period, it reset W =1, and backs-off for a RTO (Retransmit Timeout), with each successive transmit failure the RTO is doubled.
The intended purpose is for the TCP window to oscillate around a value that gives it a fair share of the network bandwidth. For each window oscillation, the packet loss is retransmitted. This reduces the goodput of the link and the network. A rough approximation for the packet loss was calculated in [MSMO] P(loss probability) = 1/W ^2 where W is calculated in number of packets. For example, for an average window size of 10 packets (this can result for window oscillating between 7 to 13), the packet loss is approximately 1%. For another extreme example, if the window oscillates between window-size of one to two packets, the retransmit happens for every packet sent.
The intended purpose of the Time Out mechanism is to maintain in transmission only a subset of the TCP flows that operate in steady-state (or close to it) mode, while deferring the transmission of the rest of the flows, thus retaining good goodput.
Previous Results


Today, the line card buffer in a switch is sized based on a rule-of-thumb commonly attributed to a 1994 paper by Villamizar and Song [VS]. Using experimental measurements of at most eight TCP flows on a 40 Mb/s link, they concluded that because of the dynamics of TCP's congestion control algorithms, a router needs an amount of buffering equal to the average round-trip time of a flow that passes through the router, multiplied by the capacity of the router's network interfaces. This is the well-known B = RTT X C rule.
The goal of Villamizar and Song in [VS] is to make sure that the output link (Figure 1 bottlenecked link) is fully utilized at all times. That is, the link sends packets 100% of the time after the TCP sender noticed that its packets are dropped. This is equivalent to making sure its buffer never goes empty when the TCP sender finds out that its packet is dropped.
Using this rule of thumb, a 10Gb/s router for the WAN line-card needs approximately 250ms X 10Gb/s = 2.5Gbits of buffers in order to keep its output links (which can be congested) 100% utilized.
Appenzeller, Keslassy and McKeown in [AKM] argue that the rule-of-thumb (B = RTT X C) is incorrect for backbone routers. This is because of the large number of TCP flows multiplexed together on a single backbone link (Figure 2). Using theory, simulation and experiments on a network of real routers, they show that a link with n unsynchronized flows requires no more than B = (RTT x C)/sqrt(n). As an example, a 2.5Gb/s link carrying 100 flows could reduce its buffers by 90% with negligible difference in throughput.
The intuition behind [AKM] is obvious. At the presence of n unsynchronized flows, if few of them back-off, the rest send packets towards the congested buffer. This reduces the requirement of the router buffer (Figure 3).

Although this result looks promising we should notice that the goal of both [VS] and [AKM] is to optimize the utilization (throughput) of the congested link. However, the more relevant properties of the switch behavior are goodput and fairness.
The "goodput" of the network can be referred as the amount of packets exchanged between the applications in the network. This obviously does not include TCP re-transmitted packets, which are considered as part of the throughput calculation in the papers above.
"Fairnes"" is the ability of the switch to allocate bandwidth to several competing flows that need access to the network. High-throughput and high-goodput can be trivially obtained by allocating the buffer memory to only few flows, and suppressing the rest of the flows into back-off timeout. This, of course is unacceptable, for the applications that require low-latency, or rely on multiple processing threads that run in parallel.
Goodput and fairness was discussed in [Morris]. Morris shows that packet loss has a substantial and adverse impact on fairness, as the TCP traffic control mechanism is inadequate. When the number of active flows exceeds the switch buffer as measured in packets, the TCP flows population splits into two groups, those that transmit data at W>4, and those that are at ever increasing timeout periods.
The mechanism that may affect the interchange between the two populations is Random Early Discard (RED) that randomly drops packets at the buffers. RED is inherently fair, and in the long run all flows would eventually get to transmit. However, the turnaround between the transmitting and backed-off flows is a random process with large variance that does not converge in a time scale of an application (e.g., 10 second). Thus, it is impossible to guarantee fairness.
Sizing the Packet Buffer in Data Center Networks
Assume a data center with 10,000 connected servers. The RTT in the network is 100us. Each server is connected using 10GE link; inter-switch links may have higher rates. Each server has on average 100 active TCP flows. Although in many cases the number of active flows is much higher, we used here 100 flows as an example to simplify the calculations. Packets exchanged include high percentage of jumbo packets.
For the purpose of estimating the overall packet buffer required in the network we model the network as in Figure 2.
A common oversight described above is to analyze the buffer required for the link utilization (throughput) only. To show how dangerous this can be we repeat this calculation. Using the rule of thumb [VS] one gets RTT X C = 100us X 10Gbps = 1Mb (128KB) as the overall buffer in the model. Using [AKM] one can even get 3Kb (400B).
Using 128KB (or 400B) buffer in the network means that even if each of the active flows has a single packet in transit this packet is dropped by the network buffer. Obviously, the link utilization is 100% but most of the packets sent over the link are retransmitted packets. That is, the goodput of the network is close to 0. Alternatively, most of the flows are in back-off state and only few flows are active. The simulation results below prove switching between these active and non-active flows does not happen in a reasonable time, creating an acute fairness issue.
In order to analyze the goodput and fairness of the network we use Morris [Morris] results. That is the buffer size of the network has to be proportional to the number of active flows. Assuming we would like to enable all flows to transmit with 1% packet loss (99% goodput) each flow has to have an average window of 10 packets (using the rough estimation of P=1/W ^2 (W in packets)). Assuming jumbo packets, this translates into 1,000,000 (flows) x 10 (packets) x 10KB (per packet) = 100GB of total network buffer.
This buffer is distributed among the network switches and implied hundreds of megabytes packet memory per switch. The next section presents the network behavior in cases where the packet memory used by the switch is too small.
Simulation
We simulated the top-of-rack switch. This is an interesting point to simulate as it is typically oversubscribed. The switch is connected to 40 hosts via 10Gb/s links and to the aggregation switch via an uplink of 40Gb/s. Each host has 10 TCP flows (We used very small number flows per server), for a total of 400 flows for the top-of-rack switch. The switch is oversubscribed at 1:10 ratio. All TCP flows transmit 9KB packets.

We investigate two types of switching devices, a switch with 24Mb buffer memory (on chip SRAM), and a switch with 2Gb (external DRAM).
The simulation was done on NS2 network simulator. The simulation was carried out for 10 seconds, and measured the bandwidth allocation between the flows. The results are presented in Figure 5.

Figure 5 presents results for a switch with either 4MB or 256MB packet memory. In the case of 256MB, each one of the 400 competing flows gets its fair share of the uplink bandwidth (100Mbps). Variance is small.
In the case of a switch with 4MB packet memory, the system is extremely un-fair. That is, 25% (!) of the flows not getting any bandwidth at all. The rest of the flows are allocated rates from 20Mbps to 320Mbps almost linearly. That is, the highest rate to smallest rate active flows ratio, is X16. Simulation results with higher number of flows per server present even worse results.
Summary
When sizing the amount of packet buffer required in a network, a common error is to focus on throughput of the network. We argue that this focus is misleading. Insufficient memory can lead inherent inability to maintain any notion of fairness between the TCP flows, and may cause the application running on the switch large and unpredictable and delays.
In order for a TCP flow to behave well it needs few packets to be in transit. For a network where tens of thousands of active flows can coexist, this translates into GBs of packet buffering.
The location of the packet buffer can shift in the network depending on the network design. In the general case, we believe that each layer of the network has to buffer the packets in transit of all the active TCP flows. As each layer has different number of switches the amount of packet memory per switch type varies based on its layer. A top of the rack access switch would require hundreds of megabyte of packet memory. A line card of an aggregation and core switch would require again hundreds of megabyte of packet memory.
References:
- [VS] "High performance TCP in ansnet," C. Villamizar and C. Song. ACM Computer Communications Review 1994
- [MSMO] "The macroscopic behavior of the TCP congestion avoidance algorithm" Mathis, Semke, Mahdavi & Ott in Computer Communication Review, 27(3), July 1997
- [JK] "Congestion Avoidance and Control," Van Jacobson' Lawrence Berkeley Laboratory Michael J. Karels University of California at Berkeley November, 1988
- [AIM] "Sizing Router Buffers," Guido Appenzeller, Isaac Keslassy and Nick McKeown, ACM SIGCOMM '04
- [Morris] "TCP behavior with many flows," R. Morris. In Proceedings of the IEEE International Conference on Network Protocols, Atlanta, Georgia, October 1997
About the Author
Ori Aruj is VP Marketing at Dune Networks. He previously served as Director of Product Management and Marketing at Charlotte's Web Networks, a privately held company focused on developing high-end core equipment for next generation optical networks. At Charlotte's Web, Ori led Product Management, as well as worldwide pre-sale and marketing activities. He has led Engineering projects at both Charlotte's Web and Intel Corporation. Ori graduated his BSc in Computer Science summa cum laude from the Technion (Israel Institute of Technology), and earned his MBA from INSEAD (Fontainebleau, France).



