United Business Media EE Times


Search

HOMEMARKET INTELLIGENCE UNITFORUMSDESIGNNEW PRODUCTSCAREERSBLOGSCONTACTEVENTSSIGN UP!RSSMost Popular contentTrusted Sources

 


Load-balancing behind switch fabrics








EE Times


popular method for building terabit switches is to use parallel switch elements with distributed control. Two critical issues for using this type of architecture are the load-balancing algorithm and the design of the queueing structure. We have introduced a load-balancing algorithm that runs in a distributed fashion and dynamically balances workload among the parallel switch elements by equalizing the length of request-only virtual output queues in each of the switch elements.

The load-balancing algorithm introduces little communication overhead and does not introduce any throughput degradation. As a result, it enables nonblocking switching without the need for internal speed-up. The load-balancing algorithm can perform one load-balancing action in less than 10 ns, which is suitable for OC-768 line rates of 40 Gbits/sec. The memory bandwidth requirement in the architecture is independent of the number of ports in the system.

In the current generation of switch fabric designs, single stage switch architectures are widely adopted, due to their simplicity and efficiency. Single stage switches can be classified into two types: architectures with centralized control or architectures with distributed control. In the type with centralized control, there is only one switch domain, which may consist of several switch slices. A centralized scheduler controls the connection configuration between the ingress ports and egress ports for all the switch slices. On the other hand, the switch architecture with distributed control consists of parallel switching domains, each having an independent scheduler. A load-balancing algorithm distributes each arriving cell to a certain switch component, where the local scheduler determines when to switch the cell to its egress port.

Figure 1 shows an example of the centralized switch architecture, where IP denotes the ingress processor, EP denotes the egress processor, SE denotes the switch element and CS denotes the centralized scheduler.

The architectures with centralized scheduling control can choose from one of the following three queueing structures: shared-memory-based output-queued structure, crossbar-based input-queued structure or a combined input-output-queued structure. Because the output-queued switch architecture requires high memory bandwidth, it does not scale well. Thus, input-queued and combined input-output-queued crossbars are better choices for switch architectures with centralized scheduling.

The advantage of the centralized switch architecture is the simplicity of its scheduling scheme, since only one central scheduler is needed. The disadvantage of this type of architecture is its limited scalability. This raises the following issues:

  • The centralized scheduler needs to make a scheduling decision in every cell time. When the line speed is increased, the cell time is shortened such that it is hard for the scheduler to finish the scheduling task. For example, when the link speed is 40 Gbits/s, the cell time to transmit an ATM cell is about 10 ns. At a clock rate of 400 MHz, the scheduler needs to generate one scheduling result in every four clock cycles, which is extremely difficult, if not impossible.
  • All switch slices need to have synchronous configurations according to the control of the centralized scheduler. When the cell time is shortened by a higher line speed, it is hard to achieve synchronization among the switch slices.
  • The centralized scheduler can become a bottleneck, such that the scalability is limited. Further, the centralized scheduler implies a single failure point, which shuts down the entire switch fabric in case of malfunction.

One example of the alternate switch architecture, that is distributed switch architecture, is shown in Figure 2, where cells arriving at the N ingress ports with link data rate R are first processed by the N ingress processors. There are P parallel switch elements, each of which is an N x N switch. Cells are distributed to the switch elements according to a certain load-balancing scheme, and are aggregated by the egress processors and forwarded to the egress ports at link data rate R. In Figure 2, each block can be implemented by an ASIC chip.

The switch architecture with distributed scheduling control has the following advantages:

  • Each switch element works independently and makes its own decisions. There is no need for high-speed synchronizations among the switch elements, which simplifies the switch system design.
  • The architecture provides built-in fault-tolerance, since the load-balancing distributors can easily bypass a malfunctioning switch by not distributing cells to it. Thus, a single point of failure is avoided.
  • If the overhead of the load-balancing algorithm is small and the throughput degradation is prevented, then the switch elements and their I/O links only need a speed of 1/P of the line rate R in order to achieve nonblocking transmission. Since the current state-of-the-art serial I/O technology provides a switching capacity of only several hundreds of Gbits/s for a single switch element chip, it is beneficial for a terabit switch fabric to decrease the capacity of each switch element.
  • Since each switch element only needs a speed of R/P, the cell time in the distributed architecture with P switch elements is P times larger than that in the centralized architecture, which simplifies the implementation of the scheduler and enables the choice of a more intelligent scheduling algorithm. For example, the cell time of an ATM cell with 40-Gbit/s OC-768 lines is about 10 ns in the centralized control architecture. However, with eight parallel switch elements in the distributed architecture, the cell time is 80 ns, which enables the use of sophisticated scheduling algorithms.

The above analysis shows that the switch architecture with distributed control is more suitable for the implementation of a terabit switch than the architecture with centralized control. In order to satisfy the high-speed requirement of a terabit switch, the load-balancing algorithm and queueing structure design must be simple and efficient. For example, if a load-balancing algorithm needs to retrieve status information from other components, then extra communication overhead is introduced. This might lead to hundreds or even thousands more high-speed signals running at multi-GHz range. Further, the load-balancing algorithm is required to evenly distribute cells to all the switch elements such that the workloads of all the switch elements are balanced and throughput degradation is prevented. In addition, it is essential for the memory bandwidth requirement to be independent of the number of switch fabric ports. Otherwise, the architecture will not be able to scale to higher switch capacities.

Some work has been done that adopts the distributed switch architecture. For example, the parallel ATOM architecture, the threshold-based load-balancing with parallel output buffers and the fork-join architecture use output queues in the switch elements. A major problem with output-queued switch elements is that the bandwidth requirement is proportional to port number N, which implies that the total switch capacity is limited by the memory bandwidth in each switch element.

Our queueing structure uses parallel nonbuffered input-queued crossbars with request-only virtual output queues. The load-balancing algorithm runs in distributed fashion and dynamically balances the workload among the parallel switch elements by equalizing the length of request-only virtual output queues in each of the switch elements.

Figure 3 shows an example of the proposed queueing structure of an N x N system with P switch elements, where N=2 and P=2. Let IP[i] denote the ith ingress processor, EP[j] denote the jth egress processor and SE[p] for the pth switch element.

In each ingress processor, cells are stored in the cell virtual output queues (CVOQ) according to the egress port of the cells. Specifically, we use CVOQ[i][j] to denote the FIFO for cells that are originated from the ith ingress port and destined to the jth egress port. CVOQ[i][j] resides in the ingress processor IP[i], where a load-balancing distributor (LBD) allocates every cell in the CVOQ to a certain switch element based on the workload of all the switch elements.

In our queueing structure, cells are switched to the egress port in two steps, i.e., the request-grant phase and the cell-transfer phase. In the request-grant phase, the LBD sends a request for a cell to one of the parallel switch elements. Cell requests are stored in the request virtual output queues (RVOQ) in the switch elements. We use RVOQ[i][p][j][t] to denote the tth request entry in the request FIFO RVOQ[i][p][j] that resides in the pth switch element and stores the request from the ith ingress port to the jth egress port. RVOQ[i][p][j] is 1-bit wide with fixed length T. There are N2 RVOQs in each SE[p], which generate N2 request signals to the local crossbar scheduler. The scheduler arbitrates and grants up to N requests in a single cell time, such that it is guaranteed that no two granted requests come from the same ingress processor or go to the same egress processor. Once the local scheduler of each switch element schedules a set of requests, the grants of the scheduled requests are sent back to their originating ingress processors. In the cell-transfer phase, the granted cells are transmitted via the corresponding switch elements to the egress processors.

Our load-balancing algorithm requires the length of each RVOQ to make a decision. In order to prevent extra communication overhead caused by retrieving the RVOQ lengths from the switch elements, we add a set of tracking virtual output queues (TVOQ) in the ingress processors, which mirror the RVOQs in the switch elements. TVOQ[i][p][j] represents the FIFO that tracks the pending cell requests that are sent from IP[i] to SE[p] and are destined for EP[j].

Both TVOQ[i][p][j] and RVOQ[i][p][j] are updated when a cell request is transmitted from IP[i] to SE[p], and when a cell grant is transmitted from SE[p] to IP[i], with a synchronization delay equal to the propagation time between an ingress processor and a switch element.

Once a cell is transmitted from the switch element to the destination egress processor, it is collected by the cell aggregator in the egress processor, and put into the output queue of the egress port. Since multiple cells in a CVOQ may be transmitted in parallel by the distributed switch elements in one cell time, the cell aggregator needs to restore the order of the cells. However, since cells are sent directly from the ingress processors to the egress processors without being buffered in the switch elements, the upper bound for the number of out-of-order cells arriving at a cell aggregator at any given time is equal to P. Therefore, the reordering scheme is simple, and the buffer space in the cell aggregators is limited. Further, the upper bound P implies that we only need log2(P) bits in the cell header as the reordering label. Thus, our reordering scheme introduces little communication overhead.

The advantages of our approach are obvious. There is no re-transmission-related cost, and since RVOQs accumulate ungranted requests, crossbar schedulers in the switch elements have a more accurate picture of the CVOQ status and thus can achieve more efficient schedules. In contrast, the schedulers in the concurrent dispatching algorithm can only see one request from each ingress processor per cell time.

The mirroring between TVOQ and RVOQ enables per-port-per-class-based lossless backpressure protocol without introducing communication overhead from the SEs to the IPs. This is in contrast to some other architectures that require dedicated backpressure networks. Further, the parallel switch elements with distributed load-balancing and crossbar schedulers of our architecture enable the support of built-in fault-tolerance functionality in the switch fabric.

The load-balancing algorithm is described in Figure 4 in pseudo-C language. The first dimension of the variables indicates the ingress processor to which they belong.

It takes three steps to make a load-balancing action. In the first step, the load-balancing distributor uses a round-robin scheduler to pick a CVOQ with a nonzero value in CVOQ_REQ. In the second step, the algorithm finds the minimum TVOQ length among the P switch elements for the selected CVOQ. In the third step, the algorithm randomly picks a switch element with the minimum TVOQ length and forwards a cell request to it.

The algorithm can be implemented in simple hardware. It has been implemented in 0.18-um CMOS process in a 320-Gbit/s switch fabric consisting of eight switch elements and having 64 CVOQs per ingress processor. The algorithm performs a load-balancing schedule in less than 10 ns. Due to space limitations, we will not detail the implementation of the algorithm.

Figure 5 shows the simulated performance of the proposed architecture in a 16 x 16 switch fabric with four switch elements. The incoming traffic has an average bursty length of 16 cells with destinations uniformly distributed across all egress ports. The traffic iterates between on and off periods with the length of each period geometrically distributed. Our architecture can be viewed as a P-server queueing system with each server running at 1/P of the line rate. As a comparison, Figure 5 also shows the results of the input-queued crossbar architecture with centralized control. It can be viewed as a single server queueing system with the server running at the line rate, P times faster than ours.

As expected, our P-server system can sustain the same throughput as the single-server system but with slightly larger average cell delays. However, the distributed control and dynamic load-balancing allow our architecture to scale to much larger switch capacities with much easier implementations than the centralized control architecture.











  Free Subscription to EE Times
First Name Last Name
Company Name Title
Email address
  Click here for your Free Subscription to EETimes Europe
 
CAREER CENTER
Ready to take that job and shove it?
SEARCH JOBS
SPONSOR

RECENT JOB POSTINGS
CAREER NEWS
With Acquisition Delayed, Sun Cutting 3,000 Jobs
With its proposed acquisition by Oracle being delayed by regulators, Sun plans to cut 3,000 jobs across several regions over the next 12 months.

For more great jobs, career related news, features and services, please visit EETimes' Career Center.


All White Papers »   

 
Education and
Learning


Learn Now:












Home | About | Editorial Calendar | Feedback | Subscriptions | Newsletter | Media Kit | Contact | Reprints|  RSS|   Digital|  Mobile
Network Websites
International
Network Features




All materials on this site Copyright © 2009 TechInsights, a Division of United Business Media LLC All rights reserved.
Privacy Statement | Terms of Service | About