With more packets flowing through today's networking boxes, scheduling is becoming a bigger concern for designers. To solve this problem, many member of the design community have turned to the weighted round-robin (WRR) scheduling mechanism.
There are quite a few WRR implementations on the market today. But as rate disparity increases, the complexity associated with these WRR algorithms increase drastically while speed of operation goes down.
To solve this problem, this two-part paper proposes a heap-based architecture that can perform true WRR based selection at very high speeds for large number of queues with very high disparity in weights. In Part 1 of this paper, we'll provide an overview of the typical WRR architecture and discuss the proposed implementation. In Part 2, we'll further the discussion by looking at some limitations with the architecture as well as solutions to these limitations. We'll also examine jitter and fairness performance as well as how this architecture stacked up against another WRR method.
Network speeds are increasing at a dramatic pace. In order to provide high quality-of-service (QoS) in today's high-speed networks, different priorities are assigned to different queues. A selection policy has to be devised to select the queues according to their priority. Also, the selection policy needs to ensure that there is low jitter at the output of each of the queue.
The hardest and most interesting scheduling algorithms rely on one common computational primitive: priority queues.4,5 As the priority disparity of the queues increase, the computational complexity of the selection policy increases exponentially, even if there is small number of queues. For example if the priority is 10-bit wide or maximum priority disparity is 1024, then the selection policy has a tough time to ensure high speed and low jitter even if there are just 64 queues. In addition, if the rate at which queues become inactive and reactivates is very high, then the scheduling algorithm again has to update its list of active queues and select queues accordingly.
Many applications can't tolerate any delay when certain queues are activated while few queues are deactivated. Furthermore, in such situations, ensuring that each and every queue gets its fair share becomes even more cumbersome. Several architectures have been proposed in literature to cope with these problems and high computational overhead.
One of the selection policies that approach towards the ideal model for priority queues is the weighted round-robin (WRR) selection policy.11,12 WRR ensures fair selection interval among all active queues with minimal jitter. The smoothest operation of WRR scheduler is achieved when urgency counters are kept for each of the queue. Each urgency counter precisely reflects the next selection order and the history of actual selection. For performing a selection, the urgency counter of all the active queues is incremented by their respective weights. Then the active queue having the highest value of the urgency counter is selected.
After the selection, the urgency counter of the selected queue is decremented by the aggregate weight of all the active queues. The urgency counters of inactive queues are kept unchanged. This is the ideal implementation of the WRR selection policy.
As the rate disparity among the queues increases, the width of the "weight of queue" parameter also increases. For example, to support a maximum discrepancy of 1000 among queue rates, queue weights must be 10-bit wide. Now if there are 64 such queues (6-bit queue ID), then the urgency counter should be at least 17-bit wide. We will explain the selection of urgency counter width in subsequent sections.
For each selection, WRR scheduler should be able to sort the entire sixty-four 17-bit urgency counters and select the maximum urgency counter. In addition, the active status of queues has to be considered while selecting the maximum urgency counter.
For OC-768 applications, if we consider the average packet size to be 40-bytes, the WRR scheduler must select one queue in every clock running at 133 MHz. Traditional WRR architectures takes multiple clocks to perform first selection. After that the pipelined architecture ensures one selection in each clock. But if the queues are rapidly becoming active or inactive, the throughput of such architecture degrades considerably. Apart from this, these architectures works on stale status of the queue, therefore the WRR selection order, although fair results in very high jitter. True WRR should give fair selection order with little jitter.
In this paper, we have proposed a heap-based WRR selection architecture that is very fast and yields very high fairness and little jitter for much practical traffic patterns. Though, the proposed WRR selection policy has already been proposed in abstraction for many software applications, we have modified it for an optimized hardware implementation. We have seen that, the jitter at the output of any queue is acceptable and it is almost fair for a range of possible combination of queue rates and activity. Apart from that we have observed that the architecture can run at very high clock frequencies with selections in every clock. In modeling, we have observed that, for 64-byte average packet size, it can deliver bandwidths as high as OC-3072.
Urgency Counter-Based WRR
Now that we've provided an overview of how WRR works, let's dive into more specifics on the urgency counters. Since the urgency counters are implemented in hardware, we also explain, how to select the counter widths for different applications. From now onwards, we will refer to "counter-based WRR selection policy" simply as WRR selection policy.
One of the fairest WRR selections is achieved by using urgency counters for each of the queues.1,18. The selections of the queues are based on the urgency counters values and after each selection the urgency counters are updated. Each selection needs lot of pre and post processing of each of the urgency counters. Preprocessing involves sorting the entire set of urgency counters and is used to select queues. Post processing involves incrementing or decrementing the counters depending on the activity status of queues and their weights (priority). Assuming that Ci is the urgency counter value of queue i and Wi is its weight, the working of WRR is described briefly in following steps:
- Choose all the active queues.
- For all active queues, Ci (active) = Ci (active) + Wi.
- Select maximum active Ci, namely Ci (active)max.
- Ci (active)max = Ci (active)max - ΣWi (active). Also, this is the currently selected queue.
In a conceptual implementation, the urgency counters can be unbounded integers or natural numbers. However, we will prove that the urgency counters are always bounded by an upper and lower limit. In addition, if implemented in hardware, the urgency counters can't be unbounded and the upper and lower limits have to be used to select the size of urgency counters.
Therefore, a WRR policy implemented in hardware requires the selection of appropriate width of the urgency counters. The width to be selected or the upper and lower bounds depend on two parameters namely the number of queues and their maximum rate disparity. We will assume that the number of queues is a power of two (like 8, 16, 32, etc) and hence it will be represented as:
No of queues = N = Log2M.
The rate disparity is also represented as power of two like following:
Rate Disparity = R = Log2W
We will later see why these two parameters have been expressed as power of 2.
Now, we will first prove the following axioms before arriving at the appropriate width of urgency counters.
Axiom 1: The ΣCi is constant and equals the initial sum.
Proof: In each iteration, the total increment of Ci's is ΣWi (active) [described in step 2] and it is also the total decrement [described in step 4]. Thus total ΣCi remains constant after any iteration.
Axiom 2: The lower bound of Ci is -ΣWi (assuming that Ci's were initialized at 0).
Proof: Assume that Ci is less than -ΣWi. This is possible only when, Ci was negative and got selected while all other queues were active. According to axiom 1, Σ Ci must be 0 at any point of time. Therefore, if Ci is negative, there must exist at least one Cj that is positive. Thus the selection of Ci, which is negative, is not possible.
Axiom 3: The upper bound of Ci is +ΣWi (assuming that Ci's were initialized at 0).
Proof: Until now, we haven't been able to prove this mathematically, that covers all the cases. However, logically thinking one can easily see that a counter can never exceed +ΣWi and will not get a chance to get selected. And as soon as a counter is selected, its urgency counter is decremented. We have seen in simulation and in our past experience that urgency counter hardly even reaches +ΣWi. However, we are still working a mathematical proof and will soon come up with a general proof.
Axiom 4: Every queue i get selected at least once in (2 * ΣWi / Wi) iterations.
Proof: Assuming that a queue i hasn't been selected in (2 * ΣWi / Wi) iterations, Ci will be incremented by more than Wi * (2 * ΣWi / Wi) = 2 * ΣWi. This will violate the maximum possible variation of Ci from -ΣWi to +ΣWi.
The above axioms at times are proven logically without taking into account all possible cases. However, they always stand true, as we have seen in the modeling, simulation, and also by logical reasoning.
Urgency Counter Width
Urgency counters should not underflow (or become negative) if they are implemented in hardware and an unsigned arithmetic operation is performed. As explained in above axioms, urgency counters are incremented or decremented by at most ΣWi. Now, for any maximum possible rate disparity, ΣWi is bounded by N*R. Thus, the maximum possible ΣWi is 2(M + W). Thus, if the urgency counters are chosen to be (M+W+1) bit wide and are initialized at half the maximum value ("100000..."), it will be ensured that none of the urgency counters ever underflows or overflow. Thus, the urgency counters are chosen to be (M+N+1) bit wide.
This also explains why the rate disparity and number of queues are taken in terms of power of two. The width of urgency counters change only at the boundary of W and M in above equations. For example, if the number of queues decreases from 64 to 40, there will not be any impact on the width of the urgency counters. Only the total number of urgency counters will decrease. The width will go down only when the number of queues goes down to 32 or M goes down from 6 to 5. Thus, for rate disparity and number of queues less than the power 2 boundary, the complexity of the algorithm remains same until the next power 2 boundary.
Proposed WRR implementation
As described before, WRR is performed in 4 steps. Of all the steps, step 3 of selecting the Max Ci (active) carries the highest arithmetic complexity.
In the proposed WRR queuing scheme, the entire list of Ci (active) needs to be sorted and the Ci (active)max has to be selected. In our scheme, we have implemented a hardware based heap sorting algorithm to select Ci (active)max. Each of the urgency counter Ci are placed at the nodes of a binary tree.
Like the conventional heap sorting algorithm, each of the nodes are compared against the neighboring nodes in adjacent layers. After each comparison, the nodes are exchanged if the lower node's Ci greater than the upper node's Ci. Figure 1 depicts the tree data structure in which the urgency counters Ci are arranged.
Figure 1: Diagram of the heap data structure.
In addition to the urgency counter Ci's, the respective queue ID i and its weight Wi are also kept at the nodes. This is done because Ci's has to be qualified with their queue IDs and weights. Otherwise, after a node swapping, the Ci's will lose its association with the queue ID and weight. Thus each node consists of storage elements (like flip flops) that keep the queue ID, its urgency counter and its weight.
A node swapping is the swapping of entire contents of the flip-flops at that node. Thus the urgency counters move through the binary tree with its queue ID and weight. During configuration, the queue IDs are assigned to each of the node sequentially and the priority Wi's are assigned according to the configuration. As described above the Ci's at each of the nodes are initialized at (100000....).
In the heap data structure, the number of layers, branches and nodes are function of each other. Lets redefine the naming conventions of the binary tree:
No of nodes (queues) = N = Log2M.
Number of layers = L = Log2(N+1)
Number of branches = B = N-1.
Rate disparity = R = Log2W.
Between each of the adjacent nodes of different layer (or at every branch), an unsigned arithmetic comparator is placed. The arithmetic comparator's result is used to determine if the nodes are required to be swapped or not. If the upper node's Ci is found smaller than lower node's Ci and lower node is active, the nodes are swapped (i, Ci and Wi).
The activity of a queue is also considered in the swapping decision. If the upper node is found inactive, it is sent down irrespective of the comparison result. Thus, the active node having highest Ci reaches at the top after a maximum of Nlog2N arithmetic comparisons and swapping. This entire operation gets completed in L time intervals only, because there are B comparators operating in parallel. Therefore the total swapping time is reduced by a factor of B. Thus the entire Ci's are sorted in L intervals, complying with the complexity of parallel heap sort, O(log2N).
All the inactive nodes are placed at the bottom layers. After L intervals, the WRR selection starts and node at the top is picked up. Then again, the swapping occurs because Ci of the selected node is decremented. The nodes at layer 2 are ready for swapping and reaches at the top in just one interval. Thus after an initial latency of L time intervals, the selection occurs seamlessly, one in each time interval.
There are some limitations in the seemingly straightforward comparison and swapping of nodes. One of them is due to the clashes that might occur while making any node swapping decisions. Another limitation may arise due to the complexity involved in the computation of ΣWi (active) to be used in step 4.
Each of these two limitations will be explained and addressed in Part 2 of this article. Part 2 will also look at fairness and jitter while also comparing the proposed architecture against another common WRR scheme. To view Part 2, click here.
- A. Ioannou, "An ASIC Core for Pipelined Heap Management to Support Scheduling in High Speed Networks", Master of Science Thesis, University of Crete, Greece; Technical Report FORTH-ICS/TR-278, Institute of Computer Science, FORTH, Heraklio, Crete, Greece, November 2000.
- Aggelos Ioannou and Manolis Katevenisy, "Pipelined Heap (Priority Queue) Management for Advanced Scheduling in High-Speed Networks," Institute of Computer Science (ICS), Foundation for Research & Technology -- Hellas (FORTH).
- I. Cidon, A. Kharnisy, and M. Sidi, "Dispersed messages in discrete-time queues: Delay, jitter and threshold crossing," IEEE INFOCOM, Toronto, Canada, June 1994.
- V. Kumar, T. Lakshman, D. Stiliadis: "Beyond best effort: Router architectures for the differentiated services of tomorrow's internet," IEEE Communications Magazine, May 1998, pp. 152-164.
- D. Jones: "An empirical comparison of priority-queue and event-set implementations," Communication of the ACM, vol. 29, no. 4, Apr. 1986, pp. 300-311.
- D. Ferrari, "Delay jitter control scheme for packet-switching internet works," Computer Communications, 15(6): 367-373, July/August 1992.
- F. Guillemin and J. W. Roberts, "Jitter and bandwidth enforcement," IEEE GLOBECOM, pages 261--265 (9.3), Phoenix, Arizona, December 1991.
- F. Guillemin, P. Boyer, A. Dupuis, and L. Romoeuf, "Peak rate enforcement in ATM networks. IEEE INFOCOM," pages 753-- 758, Florence, Italy, May 1992.
- R. Landry and I. Stavrakakis, "Traffic shaping of a tagged stream in an ATM network: Approximate end-to-end analysis," IEEE INFOCOM, Boston, Massachusetts, April 1995.
- S. Q. Li and C. Fulton, "Delay jitter first-order statistics of general traffic on high-speed networks," IEEE ICC, Dallas, Texas, June 1996.
- Manolis Katevenis, Stefanos Sidiropoulos, and Costas Courcoubetis, "Weighted Round-Robin Cell Multiplexing in a General-Purpose ATM Switch Chip," IEEE Journal on Selected Areas in Communication, vol. 9, No. 8, Oct. 1991, pp. 1265-1279.
- J. C. Bennett and H. Zhang, "WF2Q: worst-case fair weighted fair queueing," in Proc. IEEE INFOCOM '96, San Francisco, CA, Mar. 1996, pp. 120-128.
- Y. Tamir and G. Frazier, "High performance multi-queue buffers for VLSI communication switches," Proceeding 15th Ann. Symposium, Computer Architecture, June 1988, pp 343-354.
- Aristides Nikologiannis and Manolis Katevenis, "Efficient per-flow queuing in DRAM at OC-192 line rate using out-of-order execution techniques," proceedings of the IEEE International Conference on Communications (ICC'2001). Helsinki, Finland, June 2001, pp. 2048-2052.
- J. W. Roberts and F. Guillemin, "Jitter in ATM networks and its impact on peak rate enforcement," Performance Evaluation, 16:35-48, 1992.
- M. Katevenis, D. Serpanos, E. Markatos: "Multi-Queue Management and Scheduling for Improved QoS in Communication Networks'," Proceedings of EMMSEC'97 (European Multimedia Microprocessor Systems and Electronic Commerce Conference), Florence, Italy, Nov. 1997, pp. 906-913.
- Manolis Katevenis, Evangelos Markatos, and Ioannis Mavroidis, "Weighted Round-Robin Scheduler using Per-Class Urgency Counters," Computer Architecture and VLSI Systems Division, Institute of Computer Science (ICS), FORTH.
- A. Demers, S. Keshav, and S. Shenker, "Analysis and simulation of a fair queueing algorithm," Internetworking: Research and Experience, vol. 1, no. 1, pp. 3-26, 1990.
About the Authors
Sailesh Kumar is a senior design engineer at Paxonet Communications (India) Ltd. Sailesh can be reached at firstname.lastname@example.org.
Prosun Chatterjee is a team leader at Paxonet Communications. Prosun can be reached at email@example.com.