Design low latency iWARP network systems
Gary Lee and David Fair, Intel Corporation
6/20/2012 5:00 PM EDT
The demands placed on data center networks are changing as the amount of server-to-server (east-west) traffic grows and as new data center services emerge that require or can benefit from low-latency Remote Direct Memory Access (RDMA)-enhanced Ethernet. Some of the changes in data centers that are driving this trend include:
- Web applications that can spawn hundreds of server-to-server workflows, with each workflow requiring a consistently rapid, customized response for each unique client.
- Financial trading, where low-latency server-to-server transactions are a requirement. Other applications such as Hadoop* and Memcached* require similar acceleration.
- New storage protocols are emerging, such as Microsoft Windows* 8 SMB Direct* 2.2, that can exploit an RDMA-enabled network for significantly accelerated storage performance.
- Virtual machine migration can be dramatically accelerated in a low-latency, RDMA-capable data center.
These changes are driving network OEMs and designers to reconsider Internet Wide-Area RDMA Protocol (iWARP) product offerings, because iWARP provides the combination of processor offload and low latency needed to handle the growing east-west traffic in the data center.
Network Adapters Using iWARP
Virtualized servers result in multiple server CPU cores feeding a single network link, so high bandwidth is required. That is driving 10GbE server ports toward becoming the de facto standard for network adapters. Using this bandwidth intelligently, however, means the switch designer must be aware of the features of sophisticated network adapters that can segregate and flow-control traffic to maintain performance for the applications that need it.
In addition to providing high bandwidth, these adapters need to move data efficiently between servers using protocols employing RDMA. RDMA eliminates the need to copy data from receive buffer memory to server memory, which improves overall application performance.
iWARP is the leading implementation of RDMA over Ethernet technology in these high-performance data center applications. iWARP runs over TCP/IP and delivers lower latency and higher performance than conventional Ethernet adapters, while retaining TCP/IP benefits such as routability and guaranteed delivery. Adapters with iWARP implement three performance features that are key to today’s data center networks:
- Kernel-Bypass: Applications interface directly to the Ethernet adapter, removing the latency of the OS and the expensive CPU context switches between kernel space and user space.
- Direct Data Placement: The data is written directly into user space, eliminating wasteful intermediate buffer copies, which reduces processing latency and memory bandwidth consumption.
- Transport Acceleration: The TCP/IP and iWARP protocols are processed in silicon rather than in host software stacks, freeing up valuable CPU cycles for application compute processing.
10GbE Switch Silicon
The heart of the data center network is the switch, and the silicon inside it must provide a true output-queued, shared-memory architecture to deliver the performance an iWARP-based network design needs. Memory access bandwidth has long been a problem for switch chip architects in these applications.
When using traditional crossbar memory designs, there is insufficient on-chip bandwidth to allow every input port to write into the same output queue simultaneously. To get around this blocking issue, chip architects may include virtual output queues (VOQs) at every switch input, an approach known as a combined input/output queued (CIOQ) architecture.
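To make the bandwidth problem concrete, the following Python sketch estimates the worst-case memory bandwidth involved. The 24-port count and the function names are illustrative assumptions, not figures from any specific switch chip; only the 10GbE port rate comes from the discussion above.

```python
def output_queue_write_gbps(num_ports: int, port_rate_gbps: float) -> float:
    """Worst-case write bandwidth into ONE output queue.

    Assumes every input port sends to the same output at full line rate.
    """
    return num_ports * port_rate_gbps


def shared_memory_gbps(num_ports: int, port_rate_gbps: float) -> float:
    """Aggregate bandwidth of a shared packet memory.

    Assumes all ports write and all ports read at full line rate at once.
    """
    return 2 * num_ports * port_rate_gbps


if __name__ == "__main__":
    ports, rate = 24, 10.0  # hypothetical 24-port 10GbE switch
    print(output_queue_write_gbps(ports, rate))  # 240.0
    print(shared_memory_gbps(ports, rate))       # 480.0
```

Under these assumptions, a single output queue may need to absorb 240 Gb/s of writes, which is the access rate that a design without enough on-chip memory bandwidth cannot sustain.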

Virtual output queues provide, at each ingress port, a separate queue for each switch output (egress) port. If a particular egress queue is temporarily blocked, the matching ingress queue is flow controlled, but packets destined for other egress ports can bypass the blocked queue and reach their non-blocked egress ports. For an N-port switch, however, this means N*N input queues and associated schedulers, which adds significant complexity. It also adds to packet latency, since each packet must be queued twice on its way through the switch. Because of the complexity of VOQs and their schedulers, many switch designs reduce that complexity at the expense of some level of internal blocking, which further adds to latency.
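The bypass behavior of virtual output queues can be illustrated with a small Python model of one ingress port. This is a simplified sketch, not a description of any real scheduler: packets are represented only by their destination egress port number, the single-FIFO ingress queue is added here purely as a baseline for comparison, and all function names are hypothetical.

```python
from collections import deque


def drain_single_fifo(packets, blocked):
    """Baseline: one FIFO per ingress port.

    Forwards packets in arrival order and stalls as soon as the packet
    at the head of the queue targets a blocked egress port.
    """
    queue = deque(packets)
    sent = []
    while queue and queue[0] not in blocked:
        sent.append(queue.popleft())
    return sent


def drain_voqs(packets, blocked, num_ports):
    """Virtual output queues: one queue per egress port at this ingress port.

    Only the queue matching a blocked egress port is held back; the
    other queues are drained in port order.
    """
    voqs = [deque() for _ in range(num_ports)]
    for egress in packets:
        voqs[egress].append(egress)
    sent = []
    for egress, queue in enumerate(voqs):
        if egress in blocked:
            continue  # this VOQ is flow controlled
        while queue:
            sent.append(queue.popleft())
    return sent


def voq_count(num_ports):
    """Total ingress queues in an N-port switch using VOQs (N*N)."""
    return num_ports * num_ports


if __name__ == "__main__":
    arrivals = [0, 1, 2, 1, 3]  # egress port of each arriving packet
    print(drain_single_fifo(arrivals, blocked={0}))        # []
    print(drain_voqs(arrivals, blocked={0}, num_ports=4))  # [1, 1, 2, 3]
    print(voq_count(4))                                    # 16
```

With egress port 0 blocked, the single FIFO forwards nothing because the first packet is stuck at the head of the queue, while the VOQ model still delivers the four packets bound for ports 1, 2, and 3. The cost is visible in `voq_count`: even this 4-port toy switch needs 16 ingress queues.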
The output-queued architecture, however, provides full bandwidth access to every output queue from every input port, so no blocking occurs within the switch.