Network processors are emerging as a versatile tool that will decouple the functions implemented in a switching system from the hardware that implements them. A key assumption in this model is that hardware capable of supporting those functions will also exist.
Although first-generation network processors are appearing that can each process up to 2.5 Gbits/second of data, system requirements already greatly exceed this capability. As demand for system throughput rises even higher and the number of ports increases, it becomes obvious that network processors are just one aspect of the system design. The switch fabric is the means through which these devices must be interconnected.
While the crossbar chip is the center of today's high-performance switching systems, a switch fabric is much more than just a crossbar chip. The crossbar is an element within the switch fabric; a fabric needs a layer of intelligence to configure the switching element, manage the data flow and provide a consistent interface to the network processors. The switch fabric provides system designers with an integrated solution, a black box that safely routes data to the desired destination in the expected time frame.
It is this kind of switch fabric concept that spawned the Common Switch Interface (CSIX), developed by Power X Ltd. with network processor partner XaQti Corp. (now a subsidiary of Vitesse Semiconductor), which is now an open industry standard. CSIX defines the interface between the network processor and the switch fabric, hiding the complexity of the interconnect function while enabling system designers to choose network processors and fabrics that best implement the performance objectives of the system.
Bandwidth and number of ports are not the only aspects a designer considers when selecting a switch fabric. Other aspects of data transport that must be considered include fairness, efficiency, quality of service (QoS) and the ability to switch different protocols. Importantly, today's converging voice and data markets demand systems that efficiently handle both circuit-switched and packetized data. Line rates of OC48 (2.5 Gbits/s) and OC192 (10 Gbits/s) are available today. Within a system's lifetime, it must also be able to scale to support OC768 (40 Gbits/s).
What kind of switch fabrics can match these design requirements, providing scalable bandwidth and high port count plus sophisticated, multiprotocol support in a physical package that can be easily implemented with today's network processors? While evaluating switch fabrics, three key aspects should be considered: the interconnect architecture, physical layer concerns and functionality.
Within the range of interconnect architectures (shared bus, shared memory and the crossbar), both the shared bus and shared memory architectures bear physical and functional constraints that limit scalability. Shared bus architectures scale through wider and faster buses but suffer from crosstalk, reflections, signal skew and contention latency, thus limiting throughput to a few gigabits/s. Shared memory architectures, while effective to speeds around 20 Gbits/s, are limited by bus width (which is related to cell size) and by realistically achievable memory cycle times for such wide buses. The crossbar is recognized as a switching architecture that can scale into the hundreds of gigabits/s and even multiple terabits/s. It is the element of choice in today's high-performance systems and switch fabrics.
In designing a system, matters such as pin count, signal integrity, signal skew, power dissipation and choice of connector must be considered holistically to achieve a functioning system. Whether the crossbar interface is parallel or serial affects many of these aspects.
For example, consider a parallel 32-port crossbar designed to support OC48 speeds per port. Because of inefficiencies in the crossbar, the fabric bandwidth might need to be as much as twice the aggregate line bandwidth. At a clock speed of 250 MHz, each port interface (with parity) would require more than 20 pins in each direction. At 32 ports, that amounts to more than 1,300 signal pins and more than 20 W just to drive the I/Os. For signal integrity across a backplane, the number of ground pins would more than double the connector pin count. At 100 pins per inch, a 30-inch connector clearly renders the parallel approach impractical.
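The parallel-crossbar arithmetic above can be checked with a back-of-the-envelope sketch. The figures below are the article's illustrative assumptions (OC48 ports, 2x fabric overspeed, 250-MHz clock), not vendor data:

```python
# Pin-count estimate for the parallel 32-port crossbar example.
# All figures are illustrative assumptions from the text, not vendor data.

PORTS = 32
LINE_RATE_GBPS = 2.5        # OC48 per port
OVERSPEED = 2.0             # fabric bandwidth ~2x the aggregate line bandwidth
CLOCK_MHZ = 250

# Data pins per direction: (line rate * overspeed) / clock, plus one parity pin.
data_bits = (LINE_RATE_GBPS * 1000 * OVERSPEED) / CLOCK_MHZ   # 20 bits
pins_per_port_per_dir = int(data_bits) + 1                    # 21 with parity

signal_pins = PORTS * 2 * pins_per_port_per_dir               # both directions
print(signal_pins)           # 1344, i.e. "more than 1,300 signal pins"

# Grounds roughly double the connector pin count for signal integrity.
connector_pins = signal_pins * 2
print(connector_pins / 100)  # ~27 inches at 100 pins/inch, approaching the
                             # 30-inch figure once power pins are added
```

The exact connector length depends on how many power pins are added on top of signals and grounds; even the optimistic figure here is clearly impractical for a backplane connector.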
One solution would be to serialize the crossbar and the backplane. At 2.5 Gbits/s, typical for a high-speed serial transceiver, the crossbar would require 64 transceivers (assuming 2x overspeed), with each port effectively split across two serial links. The total number of signal pins drops to 256 (two differential pairs per transceiver), although several additional package pins may be needed for power and ground. This example demonstrates that high-speed serial links may provide a more manageable pin count for chip packages; however, the main benefit of switched serial backplanes is the reduction in size of the backplane connector.
Despite this apparent good news, transceivers at 2.5 Gbits/s are known to consume around 600 mW each, or close to 40 W in a 64-transceiver device. At OC192 line rates, the power and pin problems worsen, since there is little expectation that serial links much faster than 2.5 Gbits/s will propagate over any reasonable distance in a traditional backplane environment. In selecting a high-performance switched serial fabric, the benefits accrue when low-power serial transceivers that can drive lengthy backplane traces at low bit-error rates can be embedded in the crossbar chip.
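The serial trade-off can be sketched the same way, again using the article's illustrative figures (2.5-Gbit/s serdes, 2x overspeed, 600 mW per transceiver):

```python
# Serializing the same 32-port OC48 fabric (illustrative figures from the text).

PORTS = 32
OVERSPEED = 2.0              # 2x overspeed: each port spans two serial links

transceivers = int(PORTS * OVERSPEED)    # 64 transceivers at 2.5 Gbits/s
signal_pins = transceivers * 4           # two differential pairs each
print(transceivers, signal_pins)         # 64 transceivers, 256 signal pins

# The catch: at ~600 mW per 2.5-Gbit/s transceiver the crossbar dissipates
power_w = transceivers * 0.6
print(power_w)                           # 38.4 W, "close to 40 W"
```

The pin count falls by a factor of five versus the parallel design, but the transceiver power budget is why the article argues for low-power serdes embedded in the crossbar chip itself.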
Emerging terabit switches use multistage crossbar topologies borrowed from the world of supercomputing. Though such proprietary systems may work, more elegant merchant switch fabrics are becoming available.
The three-stage Clos is a common multistage crossbar topology in communications applications. The first stage may be considered the inputs and the third stage the outputs, while the center stage provides an expansion function to minimize the effects of blocking. In a three-stage Clos, the number of ports on the crossbar chips defines the upper limit on the number of chips in each stage. The requirement for the central expansion stage effectively reduces the number of usable ports in the input and output stages to (n/2)+1, where n is the number of ports on the crossbar chip.
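Taking the article's figures at face value (at most n chips per stage, and (n/2)+1 usable user ports per edge chip after the expansion stage takes its share), the maximum user-port count of a three-stage Clos can be estimated as follows; the helper name is my own:

```python
def clos_user_ports(n):
    """Rough upper bound on user ports of a three-stage Clos built from
    n-port crossbar chips, using the article's stated figures: at most n
    chips per stage, and (n/2)+1 usable user ports per edge-stage chip."""
    chips_per_stage = n
    usable_per_chip = n // 2 + 1
    return chips_per_stage * usable_per_chip

print(clos_user_ports(32))   # 544 user ports from 32-port crossbar chips
```

Note how roughly half of every edge chip's ports are consumed by links into the center stage rather than by users; this is the poor port utilization the next paragraphs criticize.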
Multistage fabrics consist of several crossbar stages, which increases chip count and complexity; single-stage fabrics simplify the fabric design and require fewer chips while providing the same performance. With a multistage topology, scaling increases the number of ports but not the bandwidth per port. To scale the bandwidth from 2.5 Gbits/s per port to 10 Gbits/s per port, either the physical port speed must quadruple or four ports must be aggregated, which reduces the number of user ports in the fabric to one-quarter of the calculated maximum.
Besides poor port utilization and speed scalability, buffered multistage topologies also introduce delay and blocking because data is stored and forwarded at each stage. With the sheer number of components and increased complexity in multistage systems, there is an increased probability of failures and repairs.
A single-stage topology does not share the limitations inherent in multistage fabrics. In a single-stage topology, crossbar chips are logically stacked on top of each other, so that the aggregate bandwidth per port increases as more chips are added. The number of crossbar chips in the stack can be no more than w, where w is the width of the data cell; in this respect the approach is similar to shared memory.
With per-port bandwidth scaled to 40 Gbits/s (OC768), a single-stage 64-port crossbar provides more than 2.5 Tbits/s of nonblocking, full-duplex user bandwidth. By multiplexing several lower-speed user ports onto a single high-speed crossbar port, a single-stage system with 64 crossbar ports at 40 Gbits/s can support 256 user ports at 10 Gbits/s.
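The single-stage scaling numbers above follow directly from the port count and rates quoted in the text:

```python
# Single-stage fabric sizing, using the figures quoted in the text.

CROSSBAR_PORTS = 64
PORT_RATE_GBPS = 40          # OC768 per crossbar port

fabric_bw_tbps = CROSSBAR_PORTS * PORT_RATE_GBPS / 1000
print(fabric_bw_tbps)        # 2.56 Tbits/s of full-duplex user bandwidth

# Multiplex four OC192 (10-Gbit/s) user ports onto each 40-Gbit/s crossbar port:
USER_RATE_GBPS = 10
user_ports = CROSSBAR_PORTS * (PORT_RATE_GBPS // USER_RATE_GBPS)
print(user_ports)            # 256 user ports at 10 Gbits/s
```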
With the single-stage fabric, multistage delay is eliminated. In addition, one of the greatest advantages of this type of fabric is the reduced number of components required to achieve the same performance. While multistage topologies can theoretically scale larger than the single-stage topology, they introduce additional design problems and risks not inherent in single-stage fabrics. A practical single-stage fabric can provide the same design throughput with minimal delay while using far fewer chips.
Theoretically, crossbar switches are nonblocking. However, with a random set of cells arriving at the ingress, there may be contention for the same egress port during any given switch arbitration cycle. Because of contention at the head of the line, cells behind the head cell cannot be serviced even though their egress ports might be available. The result is a reduction in effective switch capacity.
Virtual output queues can eliminate head-of-line blocking. As the name suggests, a queue for each output port is maintained at the ingress. A small number of memory locations (m) per egress port at each input allows each head-of-line cell to be moved into its virtual output queue so that subsequent cells can be serviced. The arbiter can now choose from a rich set of ingress queues per egress port.
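A toy example makes the difference concrete. This is a minimal sketch of one arbitration cycle, not a real scheduler; the cell names and the "busy port" set are invented for illustration:

```python
from collections import deque

# Each tuple is (cell_id, egress_port); assume egress port 0 is busy this cycle.
arrivals = deque([("a", 0), ("b", 1), ("c", 2)])
busy = {0}

# Single FIFO: cell "a" at the head is blocked, so "b" and "c" wait behind it
# even though egress ports 1 and 2 are free.
fifo_served = [] if arrivals[0][1] in busy else [arrivals[0][0]]

# Virtual output queues: one queue per egress port, so the arbiter can still
# serve "b" and "c" this cycle despite the blocked head cell.
voqs = {}
for cell, port in arrivals:
    voqs.setdefault(port, deque()).append(cell)
voq_served = [q[0] for port, q in voqs.items() if port not in busy]

print(fifo_served)   # []
print(voq_served)    # ['b', 'c']
```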
The memory requirements to accommodate virtual queues differ greatly between multistage and single-stage fabric topologies. A multistage topology needs n x n x m cells of storage per crossbar chip; a single-stage fabric needs n x n x m cells of storage for the entire fabric, where n is the number of ports per crossbar chip and m is the queue depth per egress port.
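The per-chip versus per-fabric distinction dominates the total. The helper names and the example figures (32-port chips, four cells per queue, a 96-chip multistage build) are assumptions for illustration, not figures from the article:

```python
def voq_cells_multistage(n, m, chips):
    # Each crossbar chip in a buffered multistage fabric keeps its own
    # virtual output queues: n inputs x n outputs x m cells per chip.
    return n * n * m * chips

def voq_cells_single_stage(n, m):
    # A single-stage fabric holds one set of queues for the whole fabric.
    return n * n * m

# Illustrative: 32-port chips, 4 cells per queue, 3 stages of 32 chips each.
n, m = 32, 4
print(voq_cells_single_stage(n, m))      # 4096 cells for the whole fabric
print(voq_cells_multistage(n, m, 96))    # 393216 cells, ~100x more
```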
In a multistage topology, memory requirements for the queues expand significantly and the arbiter's view is limited. In a single-stage topology, less memory is required and the arbiter sees the entire fabric.
Queuing cells in a multistage topology introduces complexity not seen in single-stage fabrics. In a multistage fabric, cells arriving at an ingress port in the first stage and destined for the same egress port at the third stage can take different routes through the second stage, resulting in out-of-order data at the egress. This requires a reorder buffer in the design, introducing nondeterministic latency, which translates into jitter, an unwanted component in today's time-sensitive data streams.
A single-stage fabric contains no second stage; such fabrics are characterized by low, deterministic delay, making them well suited to circuit-switching applications.
While CSIX-compliant network processors are emerging as the new tool to allow designers to build better, faster systems for less cost and in less time, off-the-shelf switch fabrics will be the workhorse that allows new switch designs to reach into the hundreds of gigabits-per-second and terabit-per-second speeds.
Evaluating switch fabrics means looking beyond bandwidth and number of ports to the interconnect architecture, physical-layer problems, functionality and interoperability with CSIX-compliant network processors. Synchronous serial switch fabrics with scalable, single-stage crossbar elements provide the designer with the means for building switching systems for tomorrow's demanding marketplace.
See related chart