Network switch device equipment balances performance, cost and power in the cloud
Sujal Das, Broadcom Corp.
8/2/2012 1:39 PM EDT
Cloud and Web 2.0 applications deployed in private and public cloud environments are significantly influencing network infrastructure design due to their increasing scale and performance requirements. Data centers must be purpose-built to handle current and future workloads – evolving rapidly and driven by high volumes of end users, application types, cluster nodes, and overall data movement in the cloud. A primary design challenge in this networking landscape is to select and deploy intelligent network switches that robustly scale the performance of applications, and achieve this goal cost-effectively. Ethernet switches must be architected at the silicon level to ensure that cloud network requirements can be implemented comprehensively, economically and in volume scale.
The design of a switch device’s memory management unit (MMU), including its packet buffering resources, is a key element in meeting network design challenges. The MMU directly impacts both performance and cost of network switching equipment; most importantly, its performance is closely tied to the switch’s ability to transfer data at line rate and handle congestion without dropping packets under varied and adverse traffic conditions. The MMU must be designed with a holistic approach, enabling cost-effective yet robust data center switches that can absorb the traffic bursts of network intensive workloads and ensure deterministic performance.
Burst Behavior in Popular Cloud Applications
In simple terms, the functions of a network switch include receiving packets on an ingress port, applying specific policies implemented by the network operator, identifying the destination port(s), and sending the packet out through the egress port(s). When application-induced traffic bursts create an imbalance between incoming and outgoing packet rates to a given port, packets must be queued in the switch packet buffer. The allocation and availability of the switch’s packet buffer resources to its ports – set not only by the size of the buffer memory but also by the MMU architecture choice – determine the burst absorption capabilities of the network switch. This in turn dramatically affects the performance of distributed computing applications over a cloud network.
“Bursty” traffic patterns are prevalent in cloud data centers that have high levels of peak utilization, and workloads that are typically varied and non-uniform in nature. Examples of these diverse workloads include use of MapReduce and distributed file systems in Big Data analytics, distributed caching related to high performance transaction processing, streaming media services, and many other demanding, high bandwidth computing processes. Consider traffic characteristics in the context of Big Data workloads such as Hadoop/MapReduce, which are becoming increasingly prominent in large-scale data centers. Hadoop Distributed File System (HDFS) operations, such as input file loading and result file writing, give rise to network burstiness because a large amount of data is replicated across cluster nodes in a very short time span.
When application traffic exceeds the burst absorption capability in the access layer of a cloud network, TCP (Transmission Control Protocol) incast can become a problem. In this scenario, a parent server sends a barrier-synchronized request for data to many child nodes in a cluster. When multiple child nodes respond synchronously to the single parent – either because they take the same time to complete the operation, or return partial results within a parent-specified time limit – significant congestion occurs at the network switch port to which the parent server is connected. If the switch’s egress port to the parent server lacks adequate burst absorption capability, packets overrun their buffer allocation and get dropped, causing the TCP back-off algorithm to kick in. If excessive frame loss occurs in the network, the result can be TCP collapse: many flows reduce their sending rates simultaneously, the link becomes underutilized, and inadequate switch buffering causes a catastrophic loss of throughput.
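The incast scenario can be illustrated with a rough fluid model. This is only a sketch under simplifying assumptions that are not from the article: every link runs at the same speed, all child nodes start sending at the same instant, and TCP windowing is ignored. The function name and the example sizes are hypothetical.

```python
# Rough fluid model of TCP incast at the parent's egress port.
# Assumptions (illustrative, not from the article): equal link speeds,
# perfectly synchronized senders, no TCP windowing or pacing.

def incast_drop_bytes(senders: int, response_bytes: int, buffer_bytes: int) -> int:
    """Bytes dropped when `senders` nodes each send `response_bytes` at once."""
    # While all senders transmit at line rate, the egress port drains the
    # equivalent of one sender's traffic; the remainder must be buffered.
    peak_backlog = (senders - 1) * response_bytes
    return max(0, peak_backlog - buffer_bytes)

BUFFER = 1024 * 1024      # hypothetical 1 MiB available to the egress port
RESPONSE = 64 * 1024      # hypothetical 64 KiB response per child node

for n in (8, 32):
    print(f"{n} senders -> {incast_drop_bytes(n, RESPONSE, BUFFER)} bytes dropped")
```

Under these assumptions, 8 senders fit within the buffer, while 32 senders overrun it by roughly the buffer's own size, which is the point at which TCP back-off would begin.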
Although congestion management mechanisms may mitigate the occurrence of TCP collapse, these protocols are not widely deployed in today’s Web 2.0 and other cloud-based networks. To be effective, these mechanisms require complex end-to-end deployment among all cluster nodes, via hardware upgrade and/or software modification. Because congestion management algorithms depend on feedback loops across the network, they cannot prevent incast scenarios caused by short-lived traffic flows or microbursts, which can overrun a buffer before any feedback takes effect. For network designers, this means switch buffers must still be appropriately sized in order to account for the round-trip times required for congestion signaling.
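As a back-of-the-envelope illustration of sizing for congestion signaling round-trip time, the sketch below computes how many bytes can arrive at line rate before a sender could react. The 10 Gbps link speed and 100 microsecond round-trip time are assumed values chosen for illustration, not figures from the article.

```python
# Bytes that arrive at line rate during one congestion-signaling round trip.
# The link speed and RTT used below are illustrative assumptions.

def bytes_in_flight(link_gbps: float, rtt_us: float) -> float:
    """Data (in bytes) a port must absorb before senders can slow down."""
    bits = link_gbps * rtt_us * 1000  # Gbit/s * microseconds = kilobits
    return bits / 8

print(bytes_in_flight(10, 100))  # one 10 Gbps sender, 100 us round trip
```

The result scales linearly with both the link speed and the number of senders converging on a port, which is why feedback-based schemes do not remove the need for buffering.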
The Need for Integrated Switch Buffers
Overdesigning buffer capacity at each network node would certainly reduce the probability of congestion at any given egress port. However, this is not realistic or viable given the critical cost and power factors constraining today’s data centers. The reality is that cloud data centers will only scale out as fast as the effective per-port cost and power consumption of the network infrastructure allow – key factors that are driven by the level of silicon integration inside the equipment.
Traditionally, switch MMU designs have enabled high burst absorption through the use of large, external packet buffer memories. That has changed, however, because of significant increases in switching bandwidth requirements – particularly in the server access layer of the data center – and the need to contain the cost and power of such designs. Today, highly integrated devices with on-chip buffering have largely replaced traditional fixed switch designs that use distributed chipsets and external packet buffers. For example, through the advent of new, innovative MMU designs, Broadcom’s Smart-Buffer solutions deliver high performance using cost-effective, integrated packet buffering; Smart-Buffer switches maximize burst absorption capability through full resource sharing and dynamic port allocation schemes.
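To see why resource sharing and dynamic allocation help with bursts, the sketch below compares a statically partitioned buffer with a generic dynamic-threshold scheme, in which a queue may keep growing while its length stays below a multiple (alpha) of the remaining free buffer. This is a textbook-style illustration, not Broadcom’s Smart-Buffer algorithm, which the article does not detail. The cell counts, port count, and alpha value are hypothetical, and the model assumes a single congested port with no draining during the burst.

```python
# Static partitioning vs. a generic dynamic-threshold shared buffer.
# Illustrative only: not the Smart-Buffer algorithm. Assumes one congested
# port and no draining while the burst arrives.

def absorbed_static(total_cells: int, num_ports: int, burst_cells: int) -> int:
    """Each port owns a fixed, equal slice of the buffer."""
    per_port = total_cells // num_ports
    return min(burst_cells, per_port)

def absorbed_dynamic(total_cells: int, burst_cells: int, alpha: float = 1.0) -> int:
    """Admit a cell while queue length < alpha * free buffer space."""
    queue = 0
    for _ in range(burst_cells):
        free = total_cells - queue
        if queue < alpha * free:
            queue += 1
        else:
            break  # the threshold only tightens, so the rest is dropped
    return queue

TOTAL, PORTS, BURST = 8192, 64, 6000  # hypothetical sizes, in buffer cells
print("static :", absorbed_static(TOTAL, PORTS, BURST))
print("dynamic:", absorbed_dynamic(TOTAL, BURST))
```

With these numbers the static scheme absorbs only its 128-cell slice, while the shared scheme lets the one busy port use about half of the pool and still leaves the other half for other ports.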
Switch Packet Buffer Performance and Cost Tradeoffs
As servers transition from GbE to 10GbE network interfaces, the packet processing bandwidth currently deployed in a fully integrated top-of-rack switch device ranges from 480 to 640 Gigabits per second (Gbps). Assuming a single, in-order processing pipeline in the switch device core, this processing bandwidth amounts to a “packet time” as short as one nanosecond. In this scenario, each pipeline step or memory access required to resolve a packet (such as L2/L3 forwarding lookups, buffer admission control, credit accounting, and traffic management decisions) must be completed within a single nanosecond in order to maintain wire rate performance. This sharp increase in aggregate switching throughput per access switch system has important implications for switch silicon architectures.
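The “packet time” figure can be checked with a little arithmetic. The sketch below assumes the worst case is back-to-back minimum-size Ethernet frames (64 bytes plus 8 bytes of preamble and a 12-byte inter-frame gap); that framing assumption is standard Ethernet but is not stated in the article.

```python
# Per-packet time budget for a single in-order pipeline.
# Assumption: worst case is back-to-back minimum-size Ethernet frames,
# i.e. 64-byte frame + 8-byte preamble + 12-byte inter-frame gap.
MIN_FRAME_BYTES = 64
OVERHEAD_BYTES = 8 + 12

def packet_time_ns(aggregate_gbps: float) -> float:
    """Time available per minimum-size packet, in nanoseconds."""
    bits_on_wire = (MIN_FRAME_BYTES + OVERHEAD_BYTES) * 8
    return bits_on_wire / aggregate_gbps  # bits / (Gbit/s) = nanoseconds

for gbps in (480, 640):
    print(f"{gbps} Gbps -> {packet_time_ns(gbps):.2f} ns per packet")
```

Under that assumption the budget is about 1.4 ns at 480 Gbps and about 1.05 ns at 640 Gbps, consistent with the roughly one-nanosecond figure quoted above.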
Increased bandwidth and port densities translate into large on-chip memories and complex combinational logic stages that must run at very high (Gigahertz) speeds. Using external packet buffer memories to maximize burst absorption capability places a ceiling on performance, because external memory access rates fall well below the single-chip switching throughputs demanded of today’s top-of-rack switches. At the same time, integrating very large packet buffers on a single switch chip operating at such elevated performance levels is prohibitive from a cost and power perspective. The switching chipset would have to be split up into multiple devices at lower individual throughput, maximizing raw packet buffer size by using integrated or external memories to support each chip’s specific ingress and egress port buffer allocation needs. The impact of such a multi-chip topology is increased system cost, power consumption, and board complexity – factors that are unfavorable and often prohibitively expensive for a cloud access layer deployment. Instead, the optimal solution lies in a fully integrated packet buffering architecture, designed with the size and sophistication needed to deliver excellent burst absorption.

