EEE/ACM TRANSACTIONS ON NETWORKING. VOL 7. NO. 2. APRIL 1999 The isLIP Scheduling algorithm for Input-Queued Switches Nick McKeown, Senior Member, IEEE Abstract-An increasing number of high performance inter- is nonblocking, it allows multiple cells to be transferred across networking protocol routers, LAN and asynchronous transfer the fabric simultaneously, alleviating the congestion found on rossbar switch. Most offer a switched backplane based on a a conventional shared backplane. In this paper, we describe an hold packets waiting to traverse the switching fabric. It is algorithm that is designed to configure a crossbar switch using well known that if simple first in first out(FIFO) input queues a single-chip centralized scheduler. The algorithm presented are used to hold packets then, even under benign conditions, here attempts to achieve high throughput for best-effort unicast head-of-line (HOL) blocking limits the achievable bandwidth traffic, and is designed to be simple to implement in hardware to approximately 58.6% of the maximum. HOL blocking can Our work was motivated by the design of two such systems described in this paper. A scheduling algorithm is used to con- the Cisco 12 000 GSR, a 50-Gb/s IP router, and the Tiny Tera figure the crossbar switch, deciding the order in which packets a 0.5-Tb/ MPLS switch [7] will be served. Recent results have shown that with a suitable Before using a crossbar switch as a switching fabric. it is scheduling algorithm, 100% throughput can be achieved. In important to consider some of the potential drawbacks, we An iterative round-robin algorithm. i sliP can achieve 100% consider three here. First, the implementation complexity of an are presented, along with modified versions for prioritized traffic. Fortunately, the majority of high-performance switches and Simulation results are presented to indicate the performance routers today have only a relatively small number of ports of iSLIP under benign and bursty traffie conditions. Prototyp and commercial implementations of iSLIP exist in systems with (usually between 8 and 32). This is because the highest aggregate bandwidths ranging from 50 to 500 Gb/s. When the performance devices are used at aggregation points where port traffic is nonuniform, isLiP quickly adapts to a fair scheduling density is low.Our work is, therefore, focussed on systems policy that is guaranteed never to starve an input queue. Finally, with low port density. A second potential drawback of crossbar we describe the implementation complexity of isLIP. Based on switches is that they make it difficult to provide guaranteed hedulers have been built supporting up to 32 ports, nilele-chip qualities of service. This is because cells arriving to the switch a two-dimensional(2-D) array of priority encoders, sing must contend for access to the fabric with cells at both the input and the output. The time at which they leave the input Index Termns-ATM switch, crossbar switch, inpul-queueing, queues and enter the crossbar switching fabric is dependent on router, scheduling. other traffic in the system, making it difficult to control when a cell will depart. There are two common ways to mitigate L INTRODUCTION this problem. One is to schedule the transfer of cells from N AN ATTEMPT to take advantage of the cell-switching inputs to outputs in a similar manner to that used in a time- hag a pacity of the asynchronous transfer mode(ATM), there slot interchanger, providing peak bandwidth allocation for recently been a merging of ATM switches and Inter- reserved flows. This method has been implemented in at least net Protocol(IP) routers [29[32]. This idea is already two commercial switches and routers. The second approach Deing carried one step further, with cell switches forming is to employ "speedup "in which the core of the switch the core, or backplane, of high-performance IP routers [26), runs faster than the connected lines. Simulation and analytical 31](61, [4]. Each of these high-speed switches and routers results indicate that with a small speedup, a switch will deliver is built around a crossbar switch that is configured using a cells quickly to their outgoing port, apparently independent of centralized scheduler, and each uses a fixed-size cell as a contending traffic [271,[371-[41]. While these techniques are transfer unit. Variable-length packets are segmented as they of growing importance, we restrict our focus in this paper to arrive, transferred across the central switching fabric, and the efficient and fast scheduling of best-effort traffic then reassembled again into packets before they depart. A crossbar switch is used because it is simple to implement and I Some people believe that this situation will change in the future, and that switches and routers with Manuscript received November 19 r even thousands of port se systems become real, then cross proved by IEEE/ACM TRANSACTIONS not be suitable g, Stanford However, the techniques described here will be suitable for a few years University, Stanford, CA 94305-9030 ford. edu) Publisher Item Identifier S 1063-6692(99)03593.1 Gigaswitch/ATM (2)and the Cisco Systems LS2020 ATM Switc 1063-6692/99s1000@1999IEEE
188 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 7, NO. 2, APRIL 1999 The iSLIP Scheduling Algorithm for Input-Queued Switches Nick McKeown, Senior Member, IEEE Abstract—An increasing number of high performance internetworking protocol routers, LAN and asynchronous transfer mode (ATM) switches use a switched backplane based on a crossbar switch. Most often, these systems use input queues to hold packets waiting to traverse the switching fabric. It is well known that if simple first in first out (FIFO) input queues are used to hold packets then, even under benign conditions, head-of-line (HOL) blocking limits the achievable bandwidth to approximately 58.6% of the maximum. HOL blocking can be overcome by the use of virtual output queueing, which is described in this paper. A scheduling algorithm is used to con- figure the crossbar switch, deciding the order in which packets will be served. Recent results have shown that with a suitable scheduling algorithm, 100% throughput can be achieved. In this paper, we present a scheduling algorithm called iSLIP. An iterative, round-robin algorithm, iSLIP can achieve 100% throughput for uniform traffic, yet is simple to implement in hardware. Iterative and noniterative versions of the algorithms are presented, along with modified versions for prioritized traffic. Simulation results are presented to indicate the performance of iSLIP under benign and bursty traffic conditions. Prototype and commercial implementations of iSLIP exist in systems with aggregate bandwidths ranging from 50 to 500 Gb/s. When the traffic is nonuniform, iSLIP quickly adapts to a fair scheduling policy that is guaranteed never to starve an input queue. Finally, we describe the implementation complexity of iSLIP. Based on a two-dimensional (2-D) array of priority encoders, single-chip schedulers have been built supporting up to 32 ports, and making approximately 100 million scheduling decisions per second. Index Terms— ATM switch, crossbar switch, input-queueing, IP router, scheduling. I. INTRODUCTION I N AN ATTEMPT to take advantage of the cell-switching capacity of the asynchronous transfer mode (ATM), there has recently been a merging of ATM switches and Internet Protocol (IP) routers [29], [32]. This idea is already being carried one step further, with cell switches forming the core, or backplane, of high-performance IP routers [26], [31], [6], [4]. Each of these high-speed switches and routers is built around a crossbar switch that is configured using a centralized scheduler, and each uses a fixed-size cell as a transfer unit. Variable-length packets are segmented as they arrive, transferred across the central switching fabric, and then reassembled again into packets before they depart. A crossbar switch is used because it is simple to implement and Manuscript received November 19, 1996; revised February 9, 1998; approved by IEEE/ACM TRANSACTIONS ON NETWORKING Editor H. J. Chao. The author is with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305-9030 USA (e-mail: nickm@stanford.edu). Publisher Item Identifier S 1063-6692(99)03593-1. is nonblocking; it allows multiple cells to be transferred across the fabric simultaneously, alleviating the congestion found on a conventional shared backplane. In this paper, we describe an algorithm that is designed to configure a crossbar switch using a single-chip centralized scheduler. The algorithm presented here attempts to achieve high throughput for best-effort unicast traffic, and is designed to be simple to implement in hardware. Our work was motivated by the design of two such systems: the Cisco 12 000 GSR, a 50-Gb/s IP router, and the Tiny Tera: a 0.5-Tb/s MPLS switch [7]. Before using a crossbar switch as a switching fabric, it is important to consider some of the potential drawbacks; we consider three here. First, the implementation complexity of an -port crossbar switch increases with making crossbars impractical for systems with a very large number of ports. Fortunately, the majority of high-performance switches and routers today have only a relatively small number of ports (usually between 8 and 32). This is because the highest performance devices are used at aggregation points where port density is low.1 Our work is, therefore, focussed on systems with low port density. A second potential drawback of crossbar switches is that they make it difficult to provide guaranteed qualities of service. This is because cells arriving to the switch must contend for access to the fabric with cells at both the input and the output. The time at which they leave the input queues and enter the crossbar switching fabric is dependent on other traffic in the system, making it difficult to control when a cell will depart. There are two common ways to mitigate this problem. One is to schedule the transfer of cells from inputs to outputs in a similar manner to that used in a timeslot interchanger, providing peak bandwidth allocation for reserved flows. This method has been implemented in at least two commercial switches and routers.2 The second approach is to employ “speedup,” in which the core of the switch runs faster than the connected lines. Simulation and analytical results indicate that with a small speedup, a switch will deliver cells quickly to their outgoing port, apparently independent of contending traffic [27], [37]–[41]. While these techniques are of growing importance, we restrict our focus in this paper to the efficient and fast scheduling of best-effort traffic. 1Some people believe that this situation will change in the future, and that switches and routers with large aggregate bandwidths will support hundreds or even thousands of ports. If these systems become real, then crossbar switches—and the techniques that follow in this paper—may not be suitable. However, the techniques described here will be suitable for a few years hence. 2A peak-rate allocation method was supported by the DEC AN2 Gigaswitch/ATM [2] and the Cisco Systems LS2020 ATM Switch. 1063–6692/99$10.00 1999 IEEE
MCKEOWN: THE ISLIP SCHEDULING ALGORITHM FOR INPUT-QUEUED SWITCHES Input 1 bipartite matching algorithms [35], which have a running-time complexity of O(Nlog N). Matching, M Output I D,(t) A. Maximum Size Matching Q Most scheduling algorithms described previously are heuris- tic algorithms that approximate a maximum size'matching [],[2][5],[81[18][301[36]. These algorithms attempt to maximize the number of connections made in each cell Output n time, and hence, maximize the instantaneous allocation of bandwidth. The maximum size matching for a bipartite gra can be found by solving an equivalent network flow problem [35]; we call the algorithm that does this maxsize. There exist many maximum-size bipartite matching algorithms, and the no/2)time is eliminated by using a separate queue for each output at each input [12]4 The problem with this algorithm is that although it is guaranteed to find a maximum match, for our application it a third potential drawback of crossbar switches is that they is too complex to implement in hardware and takes too long (usually ) employ input queues. When a cell arrives, it is placed to complete in an input queue where it waits its turn to be transferred across One question worth asking is"Does the maxsize algorithm the crossbar fabric. There is a popular perception that input- maximize the throughput of an input-queued switch? "The queued switches suffer from inherenty low performance due answer is no, maxsize can cause some queues to be starved of to head-of-line(HOL)blocking. HOL blocking arises when service indefinitely. Furthermore, when the traffic is nonuni- the input buffer is arranged as a single first in first out(FIFO) form, maxsize cannot sustain very high throughput [25]. This is queue: a cell destined to an output that is free may be held because it does not consider the backlog of cells in the VOQs up in line behind a cell that is waiting for an output that is or the time that cells have been waiting in line to be served. busy. Even with benign traffic, it is well known that HOL can For practical high-performance systems, we desire algo- limit thoughput to just 2-V2 N 58.6%[16]. Many techniques rithms with the following properties have been suggested for reducing HOL blocking, for example High Throughput: An algorithm that keeps the backlog >1[8],[131,[17]. Although these schemes can improve low in the VOQs; ideally, the algorithm will sustain ar throughput, they are sensitive to traffic arrival patterns and offered load up to 100% on each input and output may perform no better than regular FIFo queueing when the Starvation Free: The algorithm should not allow a traffic is bursty. But HoL blocking can be eliminated by usin nonempty voQ to remain unserved indefinitely a simple buffering strategy at each input port. Rather than Fast: To achieve the highest bandwidth switch, it is im- portant that the scheduling algorithm does not become the maintain a single FIFO queue for all cells, each input maintains performance bottleneck; the algorithm should therefore a separate queue for each output as shown in Fig. 1. This find a match as quickly as possible scheme is called virtual output queueing(VOQ)and was first Simple to Implement: If the algorithm is to be introduced by Tamir et al. in [34]. HOL blocking is eliminated because cells only queue behind cells that are destined to in practice, it must be implemented in special-purp the same output; no cell can be held up by a cell ahead of hardware, preferably within a single chip it that is destined to a different output. When VOQs are The islIP algorithm presented in this paper is designed Ised, it has been shown possible to increase the throughput to meet these goals, and is currently implemented in a 16 of an input-queued switch from 58.6% to 100% for both commercial IP router with an aggregate bandwidth of 50 uniform and nonuniform traffic [25],[28]. Crossbar switches s [6], and a 32-port prototype switch with an aggregate that use VOQ's have been employed in a number of studies bandwidth of 0.5 Tb/s [26]. isLIP is based on the parallel [1[14],[191[23],[34], research prototypes [26],[31], [331, Iterative matching algorithm(PIM)[2], and so to understand its operation, we start by describing PIM. Then, in Section II and commercial products [2], [6]. For the rest of this paper, we describe iSLIP and its performance. We then consider When we use a crossbar switch, we require a scheduling some small modifications to 2SLIP for various applications, agorithm that configures the fabric during each cell time and and finally consider its implementation complexity decides which inputs will be connected to which outputs; this determines which of the N2 VOQs are served in each cell B. Parallel Iterative Matching time. At the beginning of each cell time, a scheduler examines PIM was developed by DEC Systems Research Center for the contents of the N input queues and determines a conflict the 16-port, 16 Gb/s AN2 switch [2].5 Because it forms the free match M between inputs and outputs. This is equivalent 3In some literature, the maximum sice matching is called the maximum to finding a bipartite matching on a graph with N vertices cardinality matching or just the maximum bipartite matching 2],[25],[35]. For example, the algorithms described in [25] 4 This algorithm is equivalent to Dinic's algorithm [9] and [28] that achieve 100% throughput, use maximum weight ' This switch was commercialized as the Gigaswitch/ATM
MCKEOWN: THE iSLIP SCHEDULING ALGORITHM FOR INPUT-QUEUED SWITCHES 189 Fig. 1. An input-queued switch with VOQ. Note that head of line blocking is eliminated by using a separate queue for each output at each input. A third potential drawback of crossbar switches is that they (usually) employ input queues. When a cell arrives, it is placed in an input queue where it waits its turn to be transferred across the crossbar fabric. There is a popular perception that inputqueued switches suffer from inherently low performance due to head-of-line (HOL) blocking. HOL blocking arises when the input buffer is arranged as a single first in first out (FIFO) queue: a cell destined to an output that is free may be held up in line behind a cell that is waiting for an output that is busy. Even with benign traffic, it is well known that HOL can limit thoughput to just [16]. Many techniques have been suggested for reducing HOL blocking, for example by considering the first cells in the FIFO queue, where [8], [13], [17]. Although these schemes can improve throughput, they are sensitive to traffic arrival patterns and may perform no better than regular FIFO queueing when the traffic is bursty. But HOL blocking can be eliminated by using a simple buffering strategy at each input port. Rather than maintain a single FIFO queue for all cells, each input maintains a separate queue for each output as shown in Fig. 1. This scheme is called virtual output queueing (VOQ) and was first introduced by Tamir et al. in [34]. HOL blocking is eliminated because cells only queue behind cells that are destined to the same output; no cell can be held up by a cell ahead of it that is destined to a different output. When VOQ’s are used, it has been shown possible to increase the throughput of an input-queued switch from 58.6% to 100% for both uniform and nonuniform traffic [25], [28]. Crossbar switches that use VOQ’s have been employed in a number of studies [1], [14], [19], [23], [34], research prototypes [26], [31], [33], and commercial products [2], [6]. For the rest of this paper, we will be considering crossbar switches that use VOQ’s. When we use a crossbar switch, we require a scheduling algorithm that configures the fabric during each cell time and decides which inputs will be connected to which outputs; this determines which of the VOQ’s are served in each cell time. At the beginning of each cell time, a scheduler examines the contents of the input queues and determines a conflictfree match between inputs and outputs. This is equivalent to finding a bipartite matching on a graph with vertices [2], [25], [35]. For example, the algorithms described in [25] and [28] that achieve 100% throughput, use maximum weight bipartite matching algorithms [35], which have a running-time complexity of A. Maximum Size Matching Most scheduling algorithms described previously are heuristic algorithms that approximate a maximum size3 matching [1], [2], [5], [8], [18], [30], [36]. These algorithms attempt to maximize the number of connections made in each cell time, and hence, maximize the instantaneous allocation of bandwidth. The maximum size matching for a bipartite graph can be found by solving an equivalent network flow problem [35]; we call the algorithm that does this maxsize. There exist many maximum-size bipartite matching algorithms, and the most efficient currently known converges in time [12].4 The problem with this algorithm is that although it is guaranteed to find a maximum match, for our application it is too complex to implement in hardware and takes too long to complete. One question worth asking is “Does the maxsize algorithm maximize the throughput of an input-queued switch?” The answer is no; maxsize can cause some queues to be starved of service indefinitely. Furthermore, when the traffic is nonuniform, maxsize cannot sustain very high throughput [25]. This is because it does not consider the backlog of cells in the VOQ’s, or the time that cells have been waiting in line to be served. For practical high-performance systems, we desire algorithms with the following properties. • High Throughput: An algorithm that keeps the backlog low in the VOQ’s; ideally, the algorithm will sustain an offered load up to 100% on each input and output. • Starvation Free: The algorithm should not allow a nonempty VOQ to remain unserved indefinitely. • Fast: To achieve the highest bandwidth switch, it is important that the scheduling algorithm does not become the performance bottleneck; the algorithm should therefore find a match as quickly as possible. • Simple to Implement: If the algorithm is to be fast in practice, it must be implemented in special-purpose hardware, preferably within a single chip. The algorithm presented in this paper is designed to meet these goals, and is currently implemented in a 16- port commercial IP router with an aggregate bandwidth of 50 Gb/s [6], and a 32-port prototype switch with an aggregate bandwidth of 0.5 Tb/s [26]. is based on the parallel iterative matching algorithm (PIM) [2], and so to understand its operation, we start by describing PIM. Then, in Section II, we describe and its performance. We then consider some small modifications to for various applications, and finally consider its implementation complexity. B. Parallel Iterative Matching PIM was developed by DEC Systems Research Center for the 16-port, 16 Gb/s AN2 switch [2].5 Because it forms the 3 In some literature, the maximum size matching is called the maximum cardinality matching or just the maximum bipartite matching. 4This algorithm is equivalent to Dinic’s algorithm [9]. 5This switch was commercialized as the Gigaswitch/ATM
EEE/ACM TRANSACTIONS ON NETWORKING. VOL 7. NO. 2. APRIL 1999 L,2)=4 pk=2 requests μ1, L4,4)=3 Fig 3. Example of unfairness for PIM under heavy oversubscribed load with more than one iterations. Because of the random and independent selection by the arbiters, output I will grant to each input with probability 112, put I will only accept output I a quarter of the time. This leads to different =2 grants that no input queue is starved of service. Third, it means that no memory or state is used to keep track of how recently a Fig. 2. An example of the three steps that make up one iteration of the PIm connection was made in the past. At the beginning of each cell heduling algorithm[2]. In this example, the first iteration does not match time, the match begins over, independently of the matches that put 4 to output 4, even though it does not conflict with other connections. were made in pre enVious c ell times. Not only does this simplify his connection would be made in the second iteration (a) Step 1: Request our understanding of the algorithm, but it also makes analy ere as a graph with all weights w;i=1.(b)Step 2: Grant. Each output of the performance straightforward; there is no time-varying lects an input uniformly among those that requested it. In this example, state to consider, except for the occupancy of the input queues (e) step 3: Accept. Each input selects an output uniformly among those that Using randomness comes with its problems, however. First, ranted to it. In this example, outputs 2 and 4 both granted to input 3. Input it is difficult and expensive to implement at high speed; each chose to accept output arbiter must make a random selecti a time-varying set. Second, when the switch is oversubscribed basis of the isLIP algorithm described later. we will describe PIM can lead to unfairness between connections. An extreme the scheme in detail and consider some of its performance example of unfairness for a 2 x 2 switch when the inputs are characteristics oversubscribed is shown in Fig 3. We will see examples later PIM uses randomness to avoid starvation and reduce the for which PIM and some other algorithms are unfair when number of iterations needed to converge on a maximal-sized no input or output is oversubscribed Finally, PiM does not match. A maximal-sized match(a type of on-line match) is perform well for a single iteration; it limits the throughput one that adds connections incrementally, without removing to approximately 63%o, only slightly higher than for a FIFO maximal match is smaller than a maximum-sized match, but is remain ungranted is(N-1/N),hence as N increases,the nuch simpler to implement. PIM attempts to quickly converge throughput tends to 1-(1/e)63%. Although the algorithm on a conflict-free maximal match in multiple iterations, where will often converge to a good match after several iterations each iteration consists of three steps. All inputs and outputs the time to converge may affect the rate at which the switch can operate. We would prefer an algorithm that performs well are initially unmatched and only those inputs and outputs not with just a single iteration matched at the end of one iteration are eligible for matching in he next. The three steps of each iteration operate in parallel on each output and input and are shown in Fig. 2. The steps are IL. THE ZSLIP ALGORITHM WITH A SINGLE ITERATION Step 1: Request. Each unmatched input sends a request to In this section we describe and evaluate the iSLIP algorithm every output for which it has a queued cell This section concentrates on the behavior of i sliP wit rant. If an unmatched outp a single iteration per cell time. Later, we will consider quests, it grants to one by randomly selecting a with multiple iterations uniformly over all requests The isLIP algorithm uses rotating priority (round-robin) 6. Step 3: Accept. If an input receives a grant, it accepts one arbitration to schedule each active input and output in turn selecting an output randomly among those that granted to The main characteristic of iSLIP is its simplicity; it is readily implemented in hardware and can operate at high speed. We By considering only unmatched inputs and outputs, each find that the performance of iSLIP for uniform traffic is iteration only considers connections not made by earlier iter- high, for uniform independent identically distributed (i.i.d. ation Bernoulli arrivals, iSLIP with a single iteration can achieve Note that the independent output arbiters randomly select 100% throughput. This is the result of a phenomenon that we a request among contending requests. This has three effects: encounter repeatedly; the arbiters in iSLIP have a tendency to first,the authors in [2] show that each iteration will match desynchronize with respect to one another or eliminate, on average, at least 3 /4 of the remaining pos sible connections and thus the algorithm will converge to a A. Basic Round-Robin Matching Algorithm maximal match, on average, in ((log M ons. Second, iSLIP is a variation of simple basic round-robin matching ensures that all requests will eventually be granted, ensuring algorithm (RRM). RRM is perhaps the simplest and most
190 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 7, NO. 2, APRIL 1999 (a) (b) (c) Fig. 2. An example of the three steps that make up one iteration of the PIM scheduling algorithm [2]. In this example, the first iteration does not match input 4 to output 4, even though it does not conflict with other connections. This connection would be made in the second iteration. (a) Step 1: Request. Each input makes a request to each output for which it has a cell. This is shown here as a graph with all weights wij = 1: (b) Step 2: Grant. Each output selects an input uniformly among those that requested it. In this example, inputs 1 and 3 both requested output 2. Output 2 chose to grant to input 3. (c) Step 3: Accept. Each input selects an output uniformly among those that granted to it. In this example, outputs 2 and 4 both granted to input 3. Input 3 chose to accept output 2. basis of the algorithm described later, we will describe the scheme in detail and consider some of its performance characteristics. PIM uses randomness to avoid starvation and reduce the number of iterations needed to converge on a maximal-sized match. A maximal-sized match (a type of on-line match) is one that adds connections incrementally, without removing connections made earlier in the matching process. In general, a maximal match is smaller than a maximum-sized match, but is much simpler to implement. PIM attempts to quickly converge on a conflict-free maximal match in multiple iterations, where each iteration consists of three steps. All inputs and outputs are initially unmatched and only those inputs and outputs not matched at the end of one iteration are eligible for matching in the next. The three steps of each iteration operate in parallel on each output and input and are shown in Fig. 2. The steps are: Step 1: Request. Each unmatched input sends a request to every output for which it has a queued cell. Step 2: Grant. If an unmatched output receives any requests, it grants to one by randomly selecting a request uniformly over all requests. Step 3: Accept. If an input receives a grant, it accepts one by selecting an output randomly among those that granted to this output. By considering only unmatched inputs and outputs, each iteration only considers connections not made by earlier iterations. Note that the independent output arbiters randomly select a request among contending requests. This has three effects: first, the authors in [2] show that each iteration will match or eliminate, on average, at least of the remaining possible connections, and thus, the algorithm will converge to a maximal match, on average, in iterations. Second, it ensures that all requests will eventually be granted, ensuring Fig. 3. Example of unfairness for PIM under heavy oversubscribed load with more than one iterations. Because of the random and independent selection by the arbiters, output 1 will grant to each input with probability 1/2, yet input 1 will only accept output 1 a quarter of the time. This leads to different rates at each output. that no input queue is starved of service. Third, it means that no memory or state is used to keep track of how recently a connection was made in the past. At the beginning of each cell time, the match begins over, independently of the matches that were made in previous cell times. Not only does this simplify our understanding of the algorithm, but it also makes analysis of the performance straightforward; there is no time-varying state to consider, except for the occupancy of the input queues. Using randomness comes with its problems, however. First, it is difficult and expensive to implement at high speed; each arbiter must make a random selection among the members of a time-varying set. Second, when the switch is oversubscribed, PIM can lead to unfairness between connections. An extreme example of unfairness for a 2 2 switch when the inputs are oversubscribed is shown in Fig. 3. We will see examples later for which PIM and some other algorithms are unfair when no input or output is oversubscribed. Finally, PIM does not perform well for a single iteration; it limits the throughput to approximately 63%, only slightly higher than for a FIFO switch. This is because the probability that an input will remain ungranted is hence as increases, the throughput tends to Although the algorithm will often converge to a good match after several iterations, the time to converge may affect the rate at which the switch can operate. We would prefer an algorithm that performs well with just a single iteration. II. THE SLIP ALGORITHM WITH A SINGLE ITERATION In this section we describe and evaluate the SLIP algorithm. This section concentrates on the behavior of SLIP with just a single iteration per cell time. Later, we will consider SLIP with multiple iterations. The SLIP algorithm uses rotating priority (“round-robin”) arbitration to schedule each active input and output in turn. The main characteristic of SLIP is its simplicity; it is readily implemented in hardware and can operate at high speed. We find that the performance of SLIP for uniform traffic is high; for uniform independent identically distributed (i.i.d.) Bernoulli arrivals, SLIP with a single iteration can achieve 100% throughput. This is the result of a phenomenon that we encounter repeatedly; the arbiters in SLIP have a tendency to desynchronize with respect to one another. A. Basic Round-Robin Matching Algorithm SLIP is a variation of simple basic round-robin matching algorithm (RRM). RRM is perhaps the simplest and most
MCKEOWN: THE ISLIP SCHEDULING ALGORITHM FOR INPUT-QUEUED SWITCHES le+03- 2)= I REM 4 SLI L44)=3 Fig. 4. Example of the three steps of the RRM matching algorithm. (a) Step 1: Request. Each input makes a request to each output for which it has a ce Step 2: Grant. Each output selects the next requesting input at or after the ointer in the round-robin schedule. Arbiters are shown here for outputs 2 and Inputs I and 3 both requested output 2. Since 92= 1. output 2 grants to input 1. g2 and g4 are updated to favor the input after the one that is granted ) Step 3: Accept. Each input selects at most one output. The arbiter for input I is shown. Since a1 =1, input I accepts output 1. a1 is updated to point to output 2.(c)When the arbitration is completed, a matching of size two has een found Note that this is less than the maximum sized matching of three. Offered Load(%0) Bernoulli ith destinations uniformly distributed over all output bvious form of iterative round-robin scheduling algorithms, Results ng simulation for a 16x 16 switch. The graph shows the comprising a 2-D array of round-robin arbiters, cells areaverage ell, measured in cell times, between arriving at the input scheduled by round-robin arbiters at each output, and at each butters and departing from the switch. input. As we shall see, RRM does not perform well, but it helps us to understand how isLIP performs, so we start here λ,,=1 μ1.1=μ1,2=0.25 with a description of RRM. rrM potentially overcomes two problems in PIM: complexity and unfairness. Implemented as priority encoders, the round-robin arbiters are much simpler and can perform faster than random arbiters. The rotating priority aids the algorithm in assigning bandwidth equally nd more fairly among requesting connections. The RRM vy load In the ex algorithm, like PIM, consists of three steps. As shown in of Fig. 7, synchronization of output arbiters leads to a throughout of just Fig 4, for an N X N switch, each round-robin schedule contains N ordered elements. The three steps of arbitration Bernoull arrivals. For an offered load of just 63%RRM becomes unstable 6 Step 1: Request. Each input sends a request to every output The reason for the poor performance of RRM lies in the for which it has a queued cell rules for updating the pointers at the output arbiters. We Step 2: Grant. If an output receives any requests, it chooses illustrate this with an example, shown in Fig. 6. Both inputs the one that appears next in a fixed, roundrobin schedule 1 and 2 are under heavy load and receive a new cell for eaching from the highest priority element. The output notifies both outputs during every cell time. But because the output input whether or not its request was granted. The pointer schedulers move in lock-step, only one input is served durr gi to the highest priority element of the round-robin schedule cemented(modulo N) to one location beyond the each cell time. The sequence of requests, grants, and accepts for four consecutive cell times are shown in Fig. 7. Note that the grant pointers change in lock-step: in cell time l, both Step 3: Accept. If an input receives a grant, it accepts the point to input 1, and during cell time 2, both point to input one that appears next in a fixed, round-robin schedule starting 2, etc. This synchronization phenomenon leads to a maximum from the highest priority element. The pointer a; to the highest throughput of just 50% for this traffic pattern ority element of the round-robin schedule is incremented lulo N)to one location beyond the accepted output Synchronization of the grant pointers also limits perfor mance with random arrival patterns. Fig. 8 shows the number of synchronized output arbiters as a function of offered load B. Performance of RRM for Bernoulli Arrivals The graph plots the number of nonunique gis, i.e., the number As an introduction to the performance of the rRM algo- of output arbiters that clash with another arbiter. Under low rithm, Fig. 5 shows the average delay as a function of offered 6 The probability that an inm wil nds to l-(1/e)863%1/N). main ungranted is( load for uniform independent and identically distributed (i.i.d. hence as N increases, the throughput
MCKEOWN: THE iSLIP SCHEDULING ALGORITHM FOR INPUT-QUEUED SWITCHES 191 (a) (b) (c) Fig. 4. Example of the three steps of the RRM matching algorithm. (a) Step 1: Request. Each input makes a request to each output for which it has a cell. Step 2: Grant. Each output selects the next requesting input at or after the pointer in the round-robin schedule. Arbiters are shown here for outputs 2 and 4. Inputs 1 and 3 both requested output 2. Since g2 = 1; output 2 grants to input 1. g2 and g4 are updated to favor the input after the one that is granted. (b) Step 3: Accept. Each input selects at most one output. The arbiter for input 1 is shown. Since a1 = 1; input 1 accepts output 1. a1 is updated to point to output 2. (c) When the arbitration is completed, a matching of size two has been found. Note that this is less than the maximum sized matching of three. obvious form of iterative round-robin scheduling algorithms, comprising a 2-D array of round-robin arbiters; cells are scheduled by round-robin arbiters at each output, and at each input. As we shall see, RRM does not perform well, but it helps us to understand how SLIP performs, so we start here with a description of RRM. RRM potentially overcomes two problems in PIM: complexity and unfairness. Implemented as priority encoders, the round-robin arbiters are much simpler and can perform faster than random arbiters. The rotating priority aids the algorithm in assigning bandwidth equally and more fairly among requesting connections. The RRM algorithm, like PIM, consists of three steps. As shown in Fig. 4, for an switch, each round-robin schedule contains ordered elements. The three steps of arbitration are: Step 1: Request. Each input sends a request to every output for which it has a queued cell. Step 2: Grant. If an output receives any requests, it chooses the one that appears next in a fixed, roundrobin schedule starting from the highest priority element. The output notifies each input whether or not its request was granted. The pointer to the highest priority element of the round-robin schedule is incremented (modulo to one location beyond the granted input. Step 3: Accept. If an input receives a grant, it accepts the one that appears next in a fixed, round-robin schedule starting from the highest priority element. The pointer to the highest priority element of the round-robin schedule is incremented (modulo to one location beyond the accepted output. B. Performance of RRM for Bernoulli Arrivals As an introduction to the performance of the RRM algorithm, Fig. 5 shows the average delay as a function of offered load for uniform independent and identically distributed (i.i.d.) Fig. 5. Performance of RRM and iSLIP compared with PIM for i.i.d. Bernoulli arrivals with destinations uniformly distributed over all outputs. Results obtained using simulation for a 16 16 switch. The graph shows the average delay per cell, measured in cell times, between arriving at the input buffers and departing from the switch. Fig. 6. 2 2 switch with RRM algorithm under heavy load. In the example of Fig. 7, synchronization of output arbiters leads to a throughout of just 50%. Bernoulli arrivals. For an offered load of just 63% RRM becomes unstable.6 The reason for the poor performance of RRM lies in the rules for updating the pointers at the output arbiters. We illustrate this with an example, shown in Fig. 6. Both inputs 1 and 2 are under heavy load and receive a new cell for both outputs during every cell time. But because the output schedulers move in lock-step, only one input is served during each cell time. The sequence of requests, grants, and accepts for four consecutive cell times are shown in Fig. 7. Note that the grant pointers change in lock-step: in cell time 1, both point to input 1, and during cell time 2, both point to input 2, etc. This synchronization phenomenon leads to a maximum throughput of just 50% for this traffic pattern. Synchronization of the grant pointers also limits performance with random arrival patterns. Fig. 8 shows the number of synchronized output arbiters as a function of offered load. The graph plots the number of nonunique ’s, i.e., the number of output arbiters that clash with another arbiter. Under low 6The probability that an input will remain ungranted is (N 1=N)N ; hence as N increases, the throughput tends to 1 (1=e) 63%:
EEE/ACM TRANSACTIONS ON NETWORKING. VOL 7. NO. 2. APRIL 1999 arbiters with the same highest priority value is 9.9. This agrees well with the simulation result for RRM in Fig 8. As the offered load increases, synchronized output arbiters tend move in lockstep and the degree of synchronization changes only slightly means:(, Pu I requests outputs I and 2 III. THE iSLIP ALGORITHM means: Output I grants to input The isLIP algorithm improves upon RRM by reducing the Output 2 grants to input 1 synchronization of the output arbiters. iSLIP achieves this by not moving the grant pointers unless the grant is accepted eans: [ Input I accepts output 2 SLIP is identical to RRM except for a condition placed on updating the grant p cH1间·1(8R4-c Step 2: Grant. If an output receives any requests, it choose the one that appears next in a fixed round-robin schedule starting from the highest priority element. The output notifies each input whether or not its request was granted. The pointer Cell 3: 4 iAi 3: to the highest priority element of the round-robin schedule is incremented (modulo M)to one location beyond the granted input if, and only if the grant is accepted in Step 3 Cell 4: A This small change to the algorithm leads to the following properties of iSLIP with one iteration Fig. 7. Illustration of low throughput for RRM caused by onization Property 1: Lowest priority is given to the most recently output arbiters. Note that rs】] stay synchron ading to a made connection This is because when the arbiters move their aximum throughput of just pointers, the most recently granted(accepted) input(output) becomes the lowest priority at that output (input). If input i successfully connects to output j, both ai and gj are updated and the connection from input 2 to output J becomes the lowest priority connection in the next cell time Property 2: No connection is starved. This is because an input will continue to request an output until it is successful The output will serve at most N-l other inputs first, waiting at most N cell times to be accepted by each input. Therefore a requesting input is always served in less than N cell times Property 3: Under heavy load, all queues with a common utput have the same throughput. This is a consequence of Property 2: the output pointer moves to each requesting input in a fixed order, thus providing each with the same throughput But most importantly, this small change prevents the output arbiters from moving in lock-step leading to a large improve- ment in performance IV. SIMULATED PERFORMANCE OF iSLIP A. With Benign Bernoulli arrivals Fig. 5 shows the performance improvement of isLIP over RRM. Under low load, iSLIP's performance is almost identical to RRM and FIFO; arriving cells usually find empty input Offered Load (%) queues, and on average there are only a small number of inputs Fig8. Synchronization of output arbiters for RRM and iSLIP for iid. requesting a given output. As the load increases, the numbe ernoulli arrivals with destinations Results obtained using simulation for a 16x16 switch. distributed over all outputs. of synchronized arbiters decreases(see Fig. 8), leading to a large-sized match. In other words, as the load increases, we can expect the pointers to move away from offered load, cells arriving for output j will find gj in a more likely that a large match will be found quickly in the random position, equally likely to grant to any input. The next cell time. In fact, under uniform 100% offered load, the robability that gi+ gk for all k+ j is(N-1/N)-, iSLIP arbiters adapt to a time- division multiplexing scheme, which for N=16 implies that the expected number of providing a perfect match and 100% throughput. Fig.9 is an
192 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 7, NO. 2, APRIL 1999 Fig. 7. Illustration of low throughput for RRM caused by synchronization of output arbiters. Note that pointers [gi] stay synchronized, leading to a maximum throughput of just 50%. Fig. 8. Synchronization of output arbiters for RRM and iSLIP for i.i.d. Bernoulli arrivals with destinations uniformly distributed over all outputs. Results obtained using simulation for a 16 16 switch. offered load, cells arriving for output will find in a random position, equally likely to grant to any input. The probability that for all is which for implies that the expected number of arbiters with the same highest priority value is 9.9. This agrees well with the simulation result for RRM in Fig. 8. As the offered load increases, synchronized output arbiters tend to move in lockstep and the degree of synchronization changes only slightly. III. THE SLIP ALGORITHM The algorithm improves upon RRM by reducing the synchronization of the output arbiters. achieves this by not moving the grant pointers unless the grant is accepted. is identical to RRM except for a condition placed on updating the grant pointers. The Grant step of RRM is changed to: Step 2: Grant. If an output receives any requests, it chooses the one that appears next in a fixed round-robin schedule, starting from the highest priority element. The output notifies each input whether or not its request was granted. The pointer to the highest priority element of the round-robin schedule is incremented (modulo to one location beyond the granted input if, and only if, the grant is accepted in Step 3. This small change to the algorithm leads to the following properties of with one iteration: Property 1: Lowest priority is given to the most recently made connection. This is because when the arbiters move their pointers, the most recently granted (accepted) input (output) becomes the lowest priority at that output (input). If input successfully connects to output both and are updated and the connection from input to output becomes the lowest priority connection in the next cell time. Property 2: No connection is starved. This is because an input will continue to request an output until it is successful. The output will serve at most other inputs first, waiting at most cell times to be accepted by each input. Therefore, a requesting input is always served in less than cell times. Property 3: Under heavy load, all queues with a common output have the same throughput. This is a consequence of Property 2: the output pointer moves to each requesting input in a fixed order, thus providing each with the same throughput. But most importantly, this small change prevents the output arbiters from moving in lock-step leading to a large improvement in performance. IV. SIMULATED PERFORMANCE OF SLIP A. With Benign Bernoulli Arrivals Fig. 5 shows the performance improvement of SLIP over RRM. Under low load, SLIP’s performance is almost identical to RRM and FIFO; arriving cells usually find empty input queues, and on average there are only a small number of inputs requesting a given output. As the load increases, the number of synchronized arbiters decreases (see Fig. 8), leading to a large-sized match. In other words, as the load increases, we can expect the pointers to move away from each, making it more likely that a large match will be found quickly in the next cell time. In fact, under uniform 100% offered load, the SLIP arbiters adapt to a time-division multiplexing scheme, providing a perfect match and 100% throughput. Fig. 9 is an
MCKEOWN: THE ISLIP SCHEDULING ALGORITHM FOR INPUT-QUEUED SWITCHES +1-1R4-A erroll Ic+04 Cell 3 cl4:8 Fig. 9. Illustration of 100% throughput for iSLIP caused by desynchronize tion of output arbiters. Note that pointers [gi] become desynchronized at the end of Cell I and stay desynchronized, leading to an altemating cycle of 2 cell times and a maximum throughput of 100% example for a 2 x2 switch showing how, under heavy traffic the arbiters adapt to an efficient time-division multiplexing schedule B. With Bursty Arrivals Real network traffic is highly correlated from cell and so in practice, cells tend to arrive in bursts, correspon perhaps to a packet that has been segmented or to a packe video frame. Many ways of modeling bursts in network traffic have been proposed [11 ,[15][3],[22]. Leland et al. [32] Bernoulli arrivals. al cells within a burst are sent to Markov-modulated to the same output. have demonstrated that measured network traffic is bursty at very level making it important to understand the performance 1 of switches in the presence of bursty traffic We illustrate the effect of burstiness on isLIP using an on-off arrival process modulated by a two-state Markov chain Sinc=l6 The source alternately produces a burst of full cells(all with the same destination) followed by an idle period of empty cells The bursts and idle periods contain a geometrically distributed number of cells. Fig 10 shows the performance of iSLIP 9 under this arrival process for a 16X 16 switch, comparing it 3 with the performance under uniform i.i.d. Bernoulli arrivals The burst length indicated in the graph represents the average A length of each busy period. As we would expect, the increased burst size leads to a higher queueing delay. In fact, the average latency is proportional to the expected burst length. With bursty arrivals, the performance of an input-queued switch ecomes more and more like an output-queued switch under the save arrival conditions [9]. This similarity indicates that he performance for bursty traffic is not heavily influenced by the queueing policy or service discipline. Burstiness tends to concentrate the conficts on outputs rather than inputs; each burst contains cells destined for the same output, and each input will be dominated by a single burst at a time, reducing put contention. As a result, the performance becomes limited Offered Load(%o) by output contention, which is present in both input and output Fig. 11. The performance of iSLIP as function of switch size Uniform i.i.d. Bernoulli arrival C. As a Function of Switch Size However, the performance degrades differently under low Fig. Il shows the average latency imposed by a iSLIP and heavy loads. For a fixed low offered load, the queueing de- scheduler as a function of offered load for switches with 4, lay converges to a constant value. However, for a fixed heavy 8, 16, and 32 ports. As we might expect, the performance offered load, the increase in queueing delay is proportional to degrades with the number of ports N. The reason for these different characteristics under low and
MCKEOWN: THE iSLIP SCHEDULING ALGORITHM FOR INPUT-QUEUED SWITCHES 193 Fig. 9. Illustration of 100% throughput for iSLIP caused by desynchronization of output arbiters. Note that pointers [gi] become desynchronized at the end of Cell 1 and stay desynchronized, leading to an alternating cycle of 2 cell times and a maximum throughput of 100%. example for a 2 2 switch showing how, under heavy traffic, the arbiters adapt to an efficient time-division multiplexing schedule. B. With Bursty Arrivals Real network traffic is highly correlated from cell to cell and so in practice, cells tend to arrive in bursts, corresponding perhaps to a packet that has been segmented or to a packetized video frame. Many ways of modeling bursts in network traffic have been proposed [11], [15], [3], [22]. Leland et al. [32] have demonstrated that measured network traffic is bursty at every level making it important to understand the performance of switches in the presence of bursty traffic. We illustrate the effect of burstiness on using an on–off arrival process modulated by a two-state Markov chain. The source alternately produces a burst of full cells (all with the same destination) followed by an idle period of empty cells. The bursts and idle periods contain a geometrically distributed number of cells. Fig. 10 shows the performance of under this arrival process for a 16 16 switch, comparing it with the performance under uniform i.i.d. Bernoulli arrivals. The burst length indicated in the graph represents the average length of each busy period. As we would expect, the increased burst size leads to a higher queueing delay. In fact, the average latency is proportional to the expected burst length. With bursty arrivals, the performance of an input-queued switch becomes more and more like an output-queued switch under the save arrival conditions [9]. This similarity indicates that the performance for bursty traffic is not heavily influenced by the queueing policy or service discipline. Burstiness tends to concentrate the conflicts on outputs rather than inputs; each burst contains cells destined for the same output, and each input will be dominated by a single burst at a time, reducing input contention. As a result, the performance becomes limited by output contention, which is present in both input and output queued switches. C. As a Function of Switch Size Fig. 11 shows the average latency imposed by a scheduler as a function of offered load for switches with 4, 8, 16, and 32 ports. As we might expect, the performance degrades with the number of ports. Fig. 10. The performance of iSLIP under two-state Markov-modulated Bernoulli arrivals. All cells within a burst are sent to the same output. Destinations of bursts are uniformly distributed over all outputs. Fig. 11. The performance of iSLIP as function of switch size. Uniform i.i.d. Bernoulli arrivals. However, the performance degrades differently under low and heavy loads. For a fixed low offered load, the queueing delay converges to a constant value. However, for a fixed heavy offered load, the increase in queueing delay is proportional to The reason for these different characteristics under low and
EEE/ACM TRANSACTIONS ON NETWORKING. VOL 7. NO. 2. APRIL 1999 heavy load lies once again in the degree of synchronization of the arbiters. Under low load. arriving cells find the arbiters in Burstein random positions and isLIP performs in a similar manner to Bu-32 the single iteration version of PIM. The probability that the cell scheduled to be transmitted immediately is proportional to he probability that no other cell is waiting to be routed to the same output. Ignoring the(small) queueing delay under low offered load, the number of contending cells for each output is approximately X(1-(N-1/N)A-), which converges with increasing N to A(1-(1/e). Hence, for constant small A the queueing delay converges to a constant as N increases Under heavy load, the algorithm serves each FIFO once every N cycles, and the queues will behave similarly to an M/D/1 queue with arrival rates A/N and deterministic service time N cell times. For an M/G/1 queue with random service times S, arrival rate A, and service rate u, the queueing delay is given by 入E(S2) (1) for the isLiP switch under a heavy load of bernoulli arrivals, the delay will be approximately Fig. 12. Average burst length at switch output as a function of offered load. odulated by a two-state DTMC. Results are for a 16x 16 switch using the isliP scheduling algorithm which is proportional to M bursts are interleaved at the output. In fact, if the offered load D Burstiness reduction exceeds approximately 70%, the average burst length drops to Intuitively, if a switch decreases the average burst length exactly one cell. This indicates that the output arbiters have become desynchronized and are operating as time-division of traffic that it forwards, then we can expect it to improve multiplexers, serving each input in turn the performance of its downstream neighbor. We can expect any scheduling policy that uses round-robin arbiters to be burst-reducing this is also the case for isLIP isLIP is a deterministic algorithm serving each connection V. ANALYSIS OF ESLIP PERFORMANCE in strict rotation. We therefore expect that bursts of cells at In general, it is difficult to accurately analyze the per different inputs contend ing for the ame output will become formance of a isLIP switch, even for the simplest traffic interleaved and the burstiness will be reduced. This is indeed models. Under uniform load and either very low or very high the case, as is shown in Fig. 12. The graph shows the average offered load, we can readily approximate and understand the burst length at the switch output as a function of offered way in which iSLIP operates. When arrivals are infrequent, bad. Arrivals are on-off processes modulated by a two-state we can assume that the arbiters act independently and that Markov chain with average burst lengths of 16. 32. and 64 arriving cells are successfully scheduled with very low delay At the other extreme, when the switch becomes uniformly Our results indicate that iSLIP reduces the average burst backlogged, we can see that desynchronization will lead the length, and will tend to be more burst-reducing as the offered arbiters to find an efficient time division multiplexing scheme load increases. This is because the probability of switching and operate without contention. But when the trafic is nonuni between multiple connections increases as the utilization in- form, or when the offered load is at neither extreme, the creases. When the offered load is low, arriving bursts do not interaction between the arbiters becomes difficult to describe encounter output contention and the burst of cells is passed The problem lies in the evolution and interdependence of the unmodified. As the load increases. the contention increases and state of each arbiter and their dependence on arriving traffic 7 Note that the convergence is quite fast, and holds approximately even for mall N. For example, 1-I(N-1)/N]N-I equals 0.6073 when 8, A. Comergence to Time-Division Multiplexing and 0.6202 when N=16 and 0. 63 when N is infinite Under Heavy load definitions of burstiness, for example the coefficient of variation [36], burstiness curves andwidth (21. In this section, ne sameourst length [10], or effective Under heavy load, isLIP will behave similarly to an M/D/1 burst length. We define a burst of queue with arrival rates A/N and deterministic service time cells at the output of a switch as the number of consecutive cells that entered cell times. So, under a heavy load of Bernoulli arrivals,the the switch at the same input delay will be approximated by(2)
194 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 7, NO. 2, APRIL 1999 heavy load lies once again in the degree of synchronization of the arbiters. Under low load, arriving cells find the arbiters in random positions and performs in a similar manner to the single iteration version of PIM. The probability that the cell is scheduled to be transmitted immediately is proportional to the probability that no other cell is waiting to be routed to the same output. Ignoring the (small) queueing delay under low offered load, the number of contending cells for each output is approximately which converges with increasing to 7 Hence, for constant small the queueing delay converges to a constant as increases. Under heavy load, the algorithm serves each FIFO once every cycles, and the queues will behave similarly to an M/D/1 queue with arrival rates and deterministic service time cell times. For an M/G/1 queue with random service times arrival rate and service rate the queueing delay is given by (1) So, for the switch under a heavy load of Bernoulli arrivals, the delay will be approximately (2) which is proportional to D. Burstiness Reduction Intuitively, if a switch decreases the average burst length of traffic that it forwards, then we can expect it to improve the performance of its downstream neighbor. We can expect any scheduling policy that uses round-robin arbiters to be burst-reducing8 this is also the case for is a deterministic algorithm serving each connection in strict rotation. We therefore expect that bursts of cells at different inputs contending for the same output will become interleaved and the burstiness will be reduced. This is indeed the case, as is shown in Fig. 12. The graph shows the average burst length at the switch output as a function of offered load. Arrivals are on–off processes modulated by a two-state Markov chain with average burst lengths of 16, 32, and 64 cells. Our results indicate that reduces the average burst length, and will tend to be more burst-reducing as the offered load increases. This is because the probability of switching between multiple connections increases as the utilization increases. When the offered load is low, arriving bursts do not encounter output contention and the burst of cells is passed unmodified. As the load increases, the contention increases and 7Note that the convergence is quite fast, and holds approximately even for small N: For example, 1 [(N 1)=N] N1 equals 0.6073 when N = 8, and 0.6202 when N = 16 and 0:63 when N is infinite. 8There are many definitions of burstiness, for example the coefficient of variation [36], burstiness curves [20], maximum burst length [10], or effective bandwidth [21]. In this section, we use the same measure of burstiness that we use when generating traffic: the average burst length. We define a burst of cells at the output of a switch as the number of consecutive cells that entered the switch at the same input. Fig. 12. Average burst length at switch output as a function of offered load. The arrivals are on–off processes modulated by a two-state DTMC. Results are for a 16 16 switch using the iSLIP scheduling algorithm. bursts are interleaved at the output. In fact, if the offered load exceeds approximately 70%, the average burst length drops to exactly one cell. This indicates that the output arbiters have become desynchronized and are operating as time-division multiplexers, serving each input in turn. V. ANALYSIS OF SLIP PERFORMANCE In general, it is difficult to accurately analyze the performance of a switch, even for the simplest traffic models. Under uniform load and either very low or very high offered load, we can readily approximate and understand the way in which operates. When arrivals are infrequent, we can assume that the arbiters act independently and that arriving cells are successfully scheduled with very low delay. At the other extreme, when the switch becomes uniformly backlogged, we can see that desynchronization will lead the arbiters to find an efficient time division multiplexing scheme and operate without contention. But when the traffic is nonuniform, or when the offered load is at neither extreme, the interaction between the arbiters becomes difficult to describe. The problem lies in the evolution and interdependence of the state of each arbiter and their dependence on arriving traffic. A. Convergence to Time-Division Multiplexing Under Heavy Load Under heavy load, will behave similarly to an M/D/1 queue with arrival rates and deterministic service time cell times. So, under a heavy load of Bernoulli arrivals, the delay will be approximated by (2).
MCKEOWN: THE ISLIP SCHEDULING ALGORITHM FOR INPUT-QUEUED SWITCHES bernoulli_lid_uniform train科 3E∽3 z Offered Load (%) of average latency for the iSLIP algorithm an Offered Load(%)80 M/D/I the iSliP algorithm. ar Fig. 14. Comparison of analytical approximation and simulation results for the average number of sy To see how close isLIP approximates to time-division modulated by a two-state I multiplexing under heavy load, Fig. 13 compares the average cells. The analytical approximation is shown in (3) latency for both iSLIP and an M/D/1 queue (2). Above an offered load of approximately 70%, iSLIP behaves very where similarly to the m/D/I queue, but with a higher latency. This N number of ports is because the service policy is not constant; when a queue A arrival rate averaged over all inputs changes between empty and nonempty, the scheduler must adapt to the new set of queues that require service. Thi daptation takes place over many cell times while the arbiters le have found that this approximation is quite accurate desynchronize again. During this time, the throughput will be over a wide range of uniform workloads. Fig. 14 compares worse than for the MD/I queue and the queue length will the approximation in(3)with simulation results for both i.i.d ncrease. This in turn will lead to an increased latency Bernoulli arrivals and for an on-off arrival process modulated by a two-state Markov-chain B. Desynchronization of Arbiters ye have argued that the performance of isLIP is dictated vl. the iSLIP ALGORITHM WITH MULTIPLE ITERATIONS by the degree of synchronization of the output schedulers. In this section, we present a simple model of synchronization for Until now, we have only considered the operation of SLIP a stationary and sustainable uniform arrival process with a single iteration. We now examine how the algorithm must change when multiple iterations are performed In[24, Appendix 1], we find an approximation for E[S(t)], With more than one iteration, the iterative iSLIP algorithm the expected number of synchronized output schedulers at time t. The approximation is based on two assumptions improves the size of the match; each iteration attempts to add inputs that are unmatched at time t are uniformly dis. connections not made by earlier iterations. Not surprisingly, tributed over all inputs, we find that the performance improves as we increase the number of iterations(up to about log N, for an NXN switch) 2)the number of unmatched inputs at time t has zero Once again, we shall see that desynchronization of the outp ariano arbiters plays an important role in achieving low latency This leads to the approximation When multiple iterations are used, it is necessary to modify ES(切]≈N-N(N-1)4 the isLIP algorithm. The three steps of each iteration operate 22A/AN- in parallel on each output and input and are as follows Step 1: Request. Each unmatched input sends a request (3)every output for which it has a queued cell
MCKEOWN: THE iSLIP SCHEDULING ALGORITHM FOR INPUT-QUEUED SWITCHES 195 Fig. 13. Comparison of average latency for the iSLIP algorithm and an M/D/1 queue. The switch is 16 16 and, for the iSLIP algorithm, arrivals are uniform i.i.d. Bernoulli arrivals. To see how close approximates to time-division multiplexing under heavy load, Fig. 13 compares the average latency for both and an M/D/1 queue (2). Above an offered load of approximately 70%, behaves very similarly to the M/D/1 queue, but with a higher latency. This is because the service policy is not constant; when a queue changes between empty and nonempty, the scheduler must adapt to the new set of queues that require service. This adaptation takes place over many cell times while the arbiters desynchronize again. During this time, the throughput will be worse than for the M/D/1 queue and the queue length will increase. This in turn will lead to an increased latency. B. Desynchronization of Arbiters We have argued that the performance of is dictated by the degree of synchronization of the output schedulers. In this section, we present a simple model of synchronization for a stationary and sustainable uniform arrival process. In [24, Appendix 1], we find an approximation for the expected number of synchronized output schedulers at time The approximation is based on two assumptions: 1) inputs that are unmatched at time are uniformly distributed over all inputs; 2) the number of unmatched inputs at time has zero variance. This leads to the approximation (3) Fig. 14. Comparison of analytical approximation and simulation results for the average number of synchronized output schedulers. Simulation results are for a 16 16 switch with i.i.d. Bernoulli arrivals and an on–off process modulated by a two-state Markov chain with an average burst length of 64 cells. The analytical approximation is shown in (3). where number of ports; arrival rate averaged over all inputs; . We have found that this approximation is quite accurate over a wide range of uniform workloads. Fig. 14 compares the approximation in (3) with simulation results for both i.i.d. Bernoulli arrivals and for an on–off arrival process modulated by a two-state Markov-chain. VI. THE SLIP ALGORITHM WITH MULTIPLE ITERATIONS Until now, we have only considered the operation of with a single iteration. We now examine how the algorithm must change when multiple iterations are performed. With more than one iteration, the iterative algorithm improves the size of the match; each iteration attempts to add connections not made by earlier iterations. Not surprisingly, we find that the performance improves as we increase the number of iterations (up to about for an switch). Once again, we shall see that desynchronization of the output arbiters plays an important role in achieving low latency. When multiple iterations are used, it is necessary to modify the algorithm. The three steps of each iteration operate in parallel on each output and input and are as follows: Step 1: Request. Each unmatched input sends a request to every output for which it has a queued cell.
EEE/ACM TRANSACTIONS ON NETWORKING. VOL 7. NO. 2. APRIL 1999 t to input I with highest priority until it is successful With multiple iterations, the isLIP algorithm has the fol- lowing properties Property 1: Connections matched in the first iteration be- come the lowest priority in the next cell time Property 2: No connection is starved. Because pointers are Cell 1, Iteration 1: g R not updated after the first iteration, an output will continue to grant to the highest priority requesting input until it is successful Cell I Iteration 2. G阳→A t Property 3: For zSLIP with more than one iteration,and nder heavy load, queues with a common output may each have a different throughput. repeats every three cell times Property 4: The algorithm will converge in at most N iterations. Each iteration will schedule zero. one. or more connections. If zero connections are scheduled in an iteration c2. Iteration2G→A then the algorithm has converged; no more connections can be added with more iterations. Therefore, the slowest convergence will occur if exactly one connection is scheduled in each iteration. At most N connections can be scheduled(one every iteration. to every input and one to every output), which means the an offered algorithm will converge in at most N iterations load of I two cell til peats atter Property 5: The algorithm will not necessarily converge to 2 has not a maximum sized match. At best, it will find a maximal match the largest size match without removing connections made in Step 2: Grant. If an unmatched output receives any re- earlier iteration quests, it chooses the one that appears next in a fixed, round- robin schedule starting from the highest priority element. The VIL. SIMULATED PERFORMANCE OF ITERATIVE SLIP output notifies each input whether or not its request was anted. The pointer g: to the highest priority element of A. How Many Iterations? e round-robin schedule is incremented(modulo N) to one location beyond the granted input if and only if the grant is When implementing iSLIP with multiple iterations, we accepted in Step 3 of the first iteration need to decide how many iterations to perform during each cell time. Ideally, from Property 4 above, we would like to perform Step 3: Accept. If an unmatched input receives a grant, N iterations. However, in practice there may be insufficient schedule starting from the highest priority element. The pointer time for N iterations, and so we need to consider the penalty a to the highest priority element of the round-robin schedule is of performing only i iterations, where i< N. In fact, because incremented(modulo n)to one location beyond the accepted output converge in fewer than N iterations. An interesting example of this is shown in Fig. 16. In the first cell time, the algorithm A. Updating Pointers takes N iterations to converge, but thereafter converges in or less iteration each cell time. After N cell times the arbiters Note that pointers gi and ai are only updated for found in the first iteration Connections made in subse matches have become totally desynchronized and the e algorithm will iterations do not cause the pointers to be updated. This is to How many iterations should we use? It clearly does not avoid starvation. To understand how starvation can occur, we always take N. One option is to always run the algorithm to efer to the example of a 3 x3 switch with five active and completion, resulting in a scheduling time that varies from cell heavily loaded connections, shown in Fig. 15. The switch is to cell. In some applications this may be acceptable. In others, scheduled using two iterations of the isLIP algorithm, except such as in an AtM switch. it is desirable to maintain a fixed in this case, the pointers are updated after both iterations. The scheduling time and to try and fit as many iterations into that gure shows the sequence of decisions by the grant and accept time as possible arbiters; for this traffic pattern, they form a repetitive cycle in Under simulation, we have found that for an NX N switch which the highlighted connection from input I to output 2 is it takes about log, N iterations for iSLIP to converge. This never served. Each time the round-robin arbiter at output 2 is similar to the results obtained for PIM in [2] hich the grants to input l, input 1 chooses to accept output I instead Starvation is eliminated if the pointers are not updated after authors prove that the first iteration. In the example, output 2 would continue to E[≤loN+(4/3)
196 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 7, NO. 2, APRIL 1999 Fig. 15. Example of starvation, if pointers are updated after every iteration. The 3 3 switch is heavily loaded, i.e., all active connections have an offered load of 1 cell per cell time. The sequence of grants and accepts repeats after two cell times, even though the (highlighted) connection from input 1 to output 2 has not been made. Hence, this connection will be starved indefinitely. Step 2: Grant. If an unmatched output receives any requests, it chooses the one that appears next in a fixed, roundrobin schedule starting from the highest priority element. The output notifies each input whether or not its request was granted. The pointer to the highest priority element of the round-robin schedule is incremented (modulo to one location beyond the granted input if and only if the grant is accepted in Step 3 of the first iteration. Step 3: Accept. If an unmatched input receives a grant, it accepts the one that appears next in a fixed round-robin schedule starting from the highest priority element. The pointer to the highest priority element of the round-robin schedule is incremented ( to one location beyond the accepted output. A. Updating Pointers Note that pointers and are only updated for matches found in the first iteration. Connections made in subsequent iterations do not cause the pointers to be updated. This is to avoid starvation. To understand how starvation can occur, we refer to the example of a 3 3 switch with five active and heavily loaded connections, shown in Fig. 15. The switch is scheduled using two iterations of the algorithm, except in this case, the pointers are updated after both iterations. The figure shows the sequence of decisions by the grant and accept arbiters; for this traffic pattern, they form a repetitive cycle in which the highlighted connection from input 1 to output 2 is never served. Each time the round-robin arbiter at output 2 grants to input 1, input 1 chooses to accept output 1 instead. Starvation is eliminated if the pointers are not updated after the first iteration. In the example, output 2 would continue to grant to input 1 with highest priority until it is successful. B. Properties With multiple iterations, the algorithm has the following properties: Property 1: Connections matched in the first iteration become the lowest priority in the next cell time. Property 2: No connection is starved. Because pointers are not updated after the first iteration, an output will continue to grant to the highest priority requesting input until it is successful. Property 3: For with more than one iteration, and under heavy load, queues with a common output may each have a different throughput. repeats every three cell times. Property 4: The algorithm will converge in at most iterations. Each iteration will schedule zero, one, or more connections. If zero connections are scheduled in an iteration, then the algorithm has converged; no more connections can be added with more iterations. Therefore, the slowest convergence will occur if exactly one connection is scheduled in each iteration. At most connections can be scheduled (one to every input and one to every output), which means the algorithm will converge in at most iterations. Property 5: The algorithm will not necessarily converge to a maximum sized match. At best, it will find a maximal match: the largest size match without removing connections made in earlier iterations. VII. SIMULATED PERFORMANCE OF ITERATIVE SLIP A. How Many Iterations? When implementing with multiple iterations, we need to decide how many iterations to perform during each cell time. Ideally, from Property 4 above, we would like to perform iterations. However, in practice there may be insufficient time for iterations, and so we need to consider the penalty of performing only iterations, where In fact, because of the desynchronization of the arbiters, will usually converge in fewer than iterations. An interesting example of this is shown in Fig. 16. In the first cell time, the algorithm takes iterations to converge, but thereafter converges in one less iteration each cell time. After cell times, the arbiters have become totally desynchronized and the algorithm will converge in a single iteration. How many iterations should we use? It clearly does not always take One option is to always run the algorithm to completion, resulting in a scheduling time that varies from cell to cell. In some applications this may be acceptable. In others, such as in an ATM switch, it is desirable to maintain a fixed scheduling time and to try and fit as many iterations into that time as possible. Under simulation, we have found that for an switch it takes about iterations for to converge. This is similar to the results obtained for PIM in [2], in which the authors prove that (4)
MCKEOWN: THE ISLIP SCHEDULING ALGORITHM FOR INPUT-QUEUED SWITCHES Cell 1. Iteration 1: Rh…则G→AA Ouput A Cell 2. iteration 1 R G A er g. 17. Performance of iSLIP for 1, 2, and 4 iterations compared with FI R…→G A d output queueing for i.i.d. Bernoulli arrivals with destinations uniformly switch. The graph shows the average delay per cell, measured in cell times between arriving at the input buffers and departing from the switch. avily loaded N XN switch. All inp duration of the example. In the first cel After N cell times. all arbiters are d 二吧 match is found in a single iteration. where I is the number of iterations that pim takes to converge For all the stationary arrival proc es we have tried E[≤lg2 for iSLIP. However, we have not been able to prove that this relation holds in general B. With Benign Bernoulli Arrivals To illustrate the improvement in performance of iSLIP when the number of iterations is increased, Fig. 17 shows the average queueing delay for one, two, and four iterations nder uniform i.1.d. Bernoulli arrivals. We find that multiple iterations of isLIP significantly increase the size of the match and, therefore, reduce the queueing delay. In fact, isLIPcan achieve 100% throughput for one or more iteration with uni form i i.d. Bernoulli arrivals. Intuitively, the size of the match increases with the number of iterations. each new iteration potentially adds connections not made by earlier iterations This is illustrated in Fig. 18, which compares the size of iSLIP Comparison of the match size for iSLIP with the size of a maximum matching with the size of the maximum matching for the same atch for the same set of requests. Results are for a 16 x 16 switch instantaneous queue occupancies. Under low offered load, the IP arbiters move randomly and the ratio of the match size to the maximum match size decreases with increased offered with the number of iterations indicating that the matching gets load. But when the load exceeds approximately 65%, the ratio closer to the maximum-sized match, but only up to a point begins to increase linearly. As expected, the ratio increases For a switch under this traffic load, increasing the number
MCKEOWN: THE iSLIP SCHEDULING ALGORITHM FOR INPUT-QUEUED SWITCHES 197 Fig. 16. Example of the number of iterations required to converge for a heavily loaded N N switch. All input queues remain nonempty for the duration of the example. In the first cell time, the arbiters are all synchronized. During each cell time, one more arbiter is desynchronized from the others. After N cell times, all arbiters are desynchronized and a maximum sized match is found in a single iteration. where is the number of iterations that PIM takes to converge. For all the stationary arrival processes we have tried for However, we have not been able to prove that this relation holds in general. B. With Benign Bernoulli Arrivals To illustrate the improvement in performance of when the number of iterations is increased, Fig. 17 shows the average queueing delay for one, two, and four iterations under uniform i.i.d. Bernoulli arrivals. We find that multiple iterations of significantly increase the size of the match and, therefore, reduce the queueing delay. In fact, can achieve 100% throughput for one or more iteration with uniform i.i.d. Bernoulli arrivals. Intuitively, the size of the match increases with the number of iterations; each new iteration potentially adds connections not made by earlier iterations. This is illustrated in Fig. 18, which compares the size of matching with the size of the maximum matching for the same instantaneous queue occupancies. Under low offered load, the arbiters move randomly and the ratio of the match size to the maximum match size decreases with increased offered load. But when the load exceeds approximately 65%, the ratio begins to increase linearly. As expected, the ratio increases Fig. 17. Performance of iSLIP for 1, 2, and 4 iterations compared with FIFO and output queueing for i.i.d. Bernoulli arrivals with destinations uniformly distributed over all outputs. Results obtained using simulation for a 16 16 switch. The graph shows the average delay per cell, measured in cell times, between arriving at the input buffers and departing from the switch. Fig. 18. Comparison of the match size for iSLIP with the size of a maximum sized match for the same set of requests. Results are for a 16 16 switch and uniform i.i.d. Bernoulli arrivals. with the number of iterations indicating that the matching gets closer to the maximum-sized match, but only up to a point. For a switch under this traffic load, increasing the number