Design Article

Cloud Computing--Packet Buffering for Data Center Switches

Ori Aruj, Dune Networks

11/16/2008 7:29 PM EST

The data center market is growing large enough to justify its own set of requirements. These are implemented in a new breed of switching products expected beginning in 2008. The requirements emerged from a cross-pollination of Enterprise and HPC switches, and include high throughput, reduced latency, dense and low-cost 10GE ports, and high-speed 40GE/100GE interfaces.

Data centers consist of hundreds to many thousands of servers and several storage machines co-located within a single room or building. The servers are interconnected using an Ethernet network and connect to storage machines via a separate Fibre Channel network. Traffic in the Ethernet network is typically carried via TCP. A data center presents a unique environment for TCP traffic. The rate of the links is high (1GE, 10GE, and soon 40/100GE), while server-to-server latency is low (approximately 100 microseconds or less). The network is large, and is therefore built from up to three tiers (access, distribution and core), where connections between switches at different tiers may be oversubscribed.

Several papers have studied the packet buffer size required for TCP traffic in a switch (switch and router will be used interchangeably). Unfortunately, the main focus of many of these papers is the wide area network, and none of them discusses the data center network specifically. This article surveys known results and applies them specifically to the data center environment.

TCP Algorithm
A full treatment of TCP's algorithm and the behavior it produces is outside the scope of this article. The following is a summary of the most relevant points:

A TCP algorithm behaves best in a state called "steady state." In this state the TCP sender runs with a full window of packets in transit. A new packet isn't put into the network by the TCP sender until an old packet leaves. The TCP sender knows when a packet leaves the network by receiving an ACK packet from the TCP receiver.

A system running in steady state is self-clocked. The clock ticks are the ACK messages received from the TCP receiver. If the latency through the network gets higher (e.g., due to a backlogged queue), the ACK messages for these packets arrive with a delay, thus slowing down the rate at which the TCP sender puts new packets into the network. However, in order for all TCP senders to run in steady state, the network should buffer all the packets that are possibly in transit.

When a TCP flow is established, the TCP sender receives a maximum window parameter (W_max) from the TCP receiver. W_max is the limit up to which the window actually used for transmission can grow. W_max should be large enough to enable the TCP sender to transmit packets continuously as long as the network is not congested. However, it should not be too large, to prevent the TCP sender from overloading a congested network.

The rate a TCP flow uses at any specific time is roughly equal to the present window size used by the TCP sender divided by the round-trip time (RTT). For example, assume a TCP sender in a data center network with a network RTT of 100us and a server link of 10Gbps. For the TCP sender to fully utilize the 10GE link, its window has to be approximately 125KB (100us x 10Gbps = 1Mbit).
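The relation between window, RTT and rate can be checked with a few lines of Python. This is only a sketch of the arithmetic in the paragraph above; the function names are illustrative and the 1500-byte packet size is an assumption, not a figure from the article.

```python
def window_bytes(link_bps: float, rtt_s: float) -> float:
    """Window (in bytes) needed to keep a link busy for one full RTT."""
    return link_bps * rtt_s / 8


def flow_rate_bps(window_in_bytes: float, rtt_s: float) -> float:
    """Approximate flow rate: window divided by round-trip time."""
    return window_in_bytes * 8 / rtt_s


# Figures from the example: 10 Gb/s server link, 100 us RTT.
w = window_bytes(10e9, 100e-6)

# Assumed packet size of 1500 bytes, used only to express the window in packets.
print(f"window: {w:.0f} bytes, about {w / 1500:.0f} packets")
print(f"rate with that window: {flow_rate_bps(w, 100e-6) / 1e9:.1f} Gb/s")
```

Halving the window at the same RTT halves the achievable rate, which is why the window adaptation described next translates directly into throughput.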

Since network congestion changes over time, TCP uses an adaptive algorithm to pick a window size ranging from 0 to W_max. After every second window of data that it sends, TCP increases its window size by one packet. The TCP flow decreases the window size by two mechanisms:

  • If the TCP sender does not receive an ACK for a specific segment, it infers that the network has dropped a packet due to congestion and reduces the window by 50%.
  • Timeout: if the sender fails to transmit a segment within a timeout period, it resets the window to W = 1 and backs off for one RTO (Retransmit Timeout); with each successive transmit failure the RTO is doubled.
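The window adaptation described above can be sketched as a small state machine. This is a simplified illustration, not a real TCP implementation: the class and function names are hypothetical, the floor of one packet after a loss is an assumption, and the caller decides how often the additive increase is applied.

```python
from dataclasses import dataclass


@dataclass
class WindowState:
    window: float  # current window, in packets
    w_max: float   # limit received from the TCP receiver, in packets
    rto: float     # current retransmit timeout, in seconds


def on_increase(s: WindowState) -> None:
    """Additive increase: grow the window by one packet, capped at w_max."""
    s.window = min(s.window + 1.0, s.w_max)


def on_loss(s: WindowState) -> None:
    """Missing ACK taken as a congestion drop: reduce the window by 50%."""
    # Assumption: never shrink below one packet.
    s.window = max(s.window / 2.0, 1.0)


def on_timeout(s: WindowState) -> None:
    """Timeout: restart from one packet and double the RTO."""
    s.window = 1.0
    s.rto *= 2.0


# Hypothetical starting values, chosen only for illustration.
s = WindowState(window=8.0, w_max=16.0, rto=0.2)
on_increase(s)   # window grows from 8 to 9 packets
on_loss(s)       # window halves to 4.5 packets
on_timeout(s)    # window resets to 1 packet, RTO doubles to 0.4 s
print(s)
```

Keeping the three events as separate functions makes the asymmetry visible: growth is slow and linear, while both reduction paths cut the window sharply.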

The intended purpose is for the TCP window to oscillate around a value that gives the flow a fair share of the network bandwidth. In each window oscillation, the lost packets are retransmitted, which reduces the goodput of the link and the network. A rough approximation of the packet loss probability was calculated in [MSMO]: P(loss) = 1/W^2, where W is the window size in packets. For example, for an average window size of 10 packets (which can result from a window oscillating between 7 and 13), the packet loss is approximately 1%. As another, extreme example, if the window oscillates between one and two packets, a retransmit happens for nearly every packet sent.
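The rough loss approximation quoted from [MSMO] is easy to tabulate. The sketch below simply evaluates 1/W^2 and its inverse; the function names are illustrative, and clamping the result to 1 for very small windows is an assumption added so the value remains a valid probability.

```python
def loss_probability(avg_window_packets: float) -> float:
    """Rough loss probability for an average window of W packets: 1/W^2."""
    # Assumption: clamp to 1.0, since a probability cannot exceed 1.
    return min(1.0, 1.0 / avg_window_packets ** 2)


def window_for_loss(p: float) -> float:
    """Inverse relation: average window (packets) sustained at loss rate p."""
    return p ** -0.5


for w in (1, 2, 10, 100):
    print(f"W = {w:>3} packets -> loss of about {loss_probability(w):.2%}")
```

Read in the other direction, the same relation says that a flow held to a small window by a congested link pays a disproportionately high retransmission cost.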

The intended purpose of the timeout mechanism is to keep only a subset of the TCP flows transmitting, so that those flows operate in steady state (or close to it), while deferring the transmission of the rest of the flows, thus preserving goodput.

Previous Results



Today, the line card buffer in a switch is sized based on a rule-of-thumb commonly attributed to a 1994 paper by Villamizar and Song [VS]. Using experimental measurements of at most eight TCP flows on a 40 Mb/s link, they concluded that because of the dynamics of TCP's congestion control algorithms, a router needs an amount of buffering equal to the average round-trip time of a flow that passes through the router, multiplied by the capacity of the router's network interfaces. This is the well-known B = RTT X C rule.

The goal of Villamizar and Song in [VS] was to make sure that the output link (the bottleneck link in Figure 1) is fully utilized at all times. That is, the link keeps sending packets 100% of the time even after the TCP sender notices that its packets were dropped. This is equivalent to making sure the buffer never goes empty while the TCP sender reacts to the drop.

Using this rule of thumb, a 10Gb/s line card in a WAN router needs approximately 250ms X 10Gb/s = 2.5Gbits of buffering in order to keep its output links (which can be congested) 100% utilized.
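The rule of thumb is a single multiplication, sketched below with an illustrative function name. The first call reproduces the WAN figure above; the second plugs in the 100us data center RTT mentioned earlier in this article, purely as an illustration of how strongly the result depends on RTT.

```python
def rule_of_thumb_buffer_bits(rtt_s: float, capacity_bps: float) -> float:
    """Buffer size in bits according to the B = RTT x C rule."""
    return rtt_s * capacity_bps


wan = rule_of_thumb_buffer_bits(250e-3, 10e9)   # WAN example from the text
dc = rule_of_thumb_buffer_bits(100e-6, 10e9)    # data center RTT, for contrast

print(f"WAN line card:    {wan / 1e9:.1f} Gbit ({wan / 8 / 1e6:.1f} MB)")
print(f"Data center port: {dc / 1e6:.1f} Mbit ({dc / 8 / 1e3:.0f} KB)")
```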

Appenzeller, Keslassy and McKeown in [AKM] argue that the rule-of-thumb (B = RTT X C) is incorrect for backbone routers. This is because of the large number of TCP flows multiplexed together on a single backbone link (Figure 2). Using theory, simulation and experiments on a network of real routers, they show that a link with n unsynchronized flows requires no more than B = (RTT x C)/sqrt(n). As an example, a 2.5Gb/s link carrying 100 flows could reduce its buffers by 90% with negligible difference in throughput.
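The effect of the sqrt(n) term can be seen by evaluating the [AKM] bound for a few flow counts. In this sketch the function names are illustrative, and the 250ms RTT is borrowed from the WAN example as an assumption, since the text does not give an RTT for the 2.5Gb/s link.

```python
import math


def small_buffer_bits(rtt_s: float, capacity_bps: float, n_flows: int) -> float:
    """Upper bound on buffering for n unsynchronized flows: (RTT x C)/sqrt(n)."""
    return rtt_s * capacity_bps / math.sqrt(n_flows)


def buffer_reduction(n_flows: int) -> float:
    """Fraction of the RTT x C buffer saved when n flows share the link."""
    return 1.0 - 1.0 / math.sqrt(n_flows)


RTT_S = 250e-3   # assumed RTT, not stated for this example in the text
LINK_BPS = 2.5e9

for n in (1, 100, 10_000):
    b = small_buffer_bits(RTT_S, LINK_BPS, n)
    print(f"n = {n:>6}: {b / 1e6:8.2f} Mbit, reduction {buffer_reduction(n):.0%}")
```

With a single flow the bound collapses to the original rule of thumb, so the saving comes entirely from multiplexing many unsynchronized flows.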

