Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocations in High Speed Networks*

Ion Stoica (CMU, istoica@cs.cmu.edu), Scott Shenker (Xerox PARC, shenker@parc.xerox.com), Hui Zhang (CMU, hzhang@cs.cmu.edu)

Abstract

Router mechanisms designed to achieve fair bandwidth allocations, like Fair Queueing, have many desirable properties for congestion control in the Internet. However, such mechanisms usually need to maintain state, manage buffers, and/or perform packet scheduling on a per flow basis, and this complexity may prevent them from being cost-effectively implemented and widely deployed. In this paper, we propose an architecture that significantly reduces this implementation complexity yet still achieves approximately fair bandwidth allocations. We apply this approach to an island of routers - that is, a contiguous region of the network - and we distinguish between edge routers and core routers. Edge routers maintain per flow state; they estimate the incoming rate of each flow and insert a label into each packet header based on this estimate. Core routers maintain no per flow state; they use FIFO packet scheduling augmented by a probabilistic dropping algorithm that uses the packet labels and an estimate of the aggregate traffic at the router. We call the scheme Core-Stateless Fair Queueing. We present simulations and analysis on the performance of this approach, and discuss an alternate approach.

1 Introduction

A central tenet of the Internet architecture is that congestion control is achieved mainly through end-host algorithms. However, starting with Nagle [16], many researchers observed that such end-to-end congestion control solutions are greatly improved when routers have mechanisms that allocate bandwidth in a fair manner. Fair bandwidth allocation protects well-behaved flows from ill-behaved ones, and allows a diverse set of end-to-end congestion control policies to co-exist in the network [7].
As we discuss in Section 4, some maintain that fair bandwidth allocation¹ plays a necessary, not just beneficial, role in congestion control [7, 19]. Until now, fair allocations were typically achieved by using per-flow queueing mechanisms - such as Fair Queueing [7, 18] and its many variants [2, 10, 20] - or per-flow dropping mechanisms such as Flow Random Early Drop (FRED) [14]. These mechanisms are more complex to implement than traditional FIFO queueing with drop-tail, which is the most widely implemented and deployed mechanism in routers today. In particular, fair allocation mechanisms inherently require the routers to maintain state and perform operations on a per flow basis. For each packet that arrives at the router, the router needs to classify the packet into a flow, update per flow state variables, and perform certain operations based on the per flow state. The operations can be as simple as deciding whether to drop or queue the packet (e.g., FRED), or as complex as manipulation of priority queues (e.g., Fair Queueing). While a number of techniques have been proposed to reduce the complexity of the per packet operations [1, 20, 21], and commercial implementations are available in some intermediate class routers, it is still unclear whether these algorithms can be cost-effectively implemented in high-speed backbone routers because all these algorithms still require packet classification and per flow state management.

*This research was sponsored by DARPA under contract numbers N66001-96-C-8528, E30602-97-2-0287, and DABT63-94-C-0073, and by a NSF Career Award under grant number NCR-9624979. Additional support was provided by Intel Corp., MCI, and Sun Microsystems. Views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DARPA, NSF, Intel, MCI, Sun, or the U.S. government.

© 1998 ACM 1-58113-003-1/98/0008...$5.00
In this paper we start with the assumption that (1) fair allocation mechanisms play an important, perhaps even necessary, role in congestion control, and (2) the complexity of existing fair allocation mechanisms is a substantial hindrance to their adoption. Both of these points are debatable; developments in router technology may make such algorithms rather inexpensive to implement, and there may be solutions to congestion control that do not require fair allocation (we discuss this point more fully in Section 4). By using these two assumptions as our starting points we are not claiming that they are true, but rather are only looking at the implications if indeed they were true. If one starts with these assumptions then overcoming the complexity problem in achieving fair allocation becomes a vitally important problem. To this end, we propose and examine an architecture and a set of algorithms that allocate bandwidth in an approximately fair manner while allowing the routers on high-speed links to use FIFO queueing and maintain no per-flow state.

¹We use the max-min definition of fairness [12] which, while not the only possible candidate for fairness, is certainly a reasonable one and, moreover, can be implemented with only local information.
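To make the max-min notion from the footnote concrete, here is a small worked sketch. The demands and capacity below are hypothetical, and the function name is ours, not the paper's: on a single link, each flow gets either its full demand (if modest) or an equal share of what remains.

```python
def max_min_share(demands, capacity):
    """Compute max-min fair allocations for one shared link (illustrative sketch).

    Repeatedly satisfy the smallest remaining demand if it fits under an
    equal split of the leftover capacity; otherwise split what is left
    equally among the still-unsatisfied (bottlenecked) flows.
    """
    alloc = [0.0] * len(demands)
    remaining = sorted(range(len(demands)), key=lambda i: demands[i])
    cap = float(capacity)
    while remaining:
        share = cap / len(remaining)
        i = remaining[0]
        if demands[i] <= share:
            alloc[i] = float(demands[i])   # unbottlenecked flow: gets its demand
            cap -= demands[i]
            remaining.pop(0)
        else:
            for j in remaining:            # all remaining flows are bottlenecked
                alloc[j] = share
            remaining = []
    return alloc

# Capacity 10, demands [2, 4, 8]: the two modest flows get their demand,
# the aggressive flow is capped at the leftover equal share.
print(max_min_share([2, 4, 8], 10))   # -> [2.0, 4.0, 4.0]
```

Note that the bottlenecked flows all end with the same allocation, which is exactly the "same output rate" property the paper uses to characterize max-min fairness.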
In this approach, we identify an island of routers² and distinguish between the edge and the core of the island. Edge routers compute per-flow rate estimates and label the packets passing through them by inserting these estimates into each packet header. Core routers use FIFO queueing and keep no per-flow state. They employ a probabilistic dropping algorithm that uses the information in the packet labels along with the router's own measurement of the aggregate traffic. The bandwidth allocations within this island of routers are approximately fair. Thus, if this approach were adopted within the high speed interiors of ISPs, and fair allocation mechanisms were adopted for the slower links outside of these high-speed interiors, then approximately fair allocations could be achieved everywhere. However, this approach, like Fair Queueing [7] or RED [9], still provides benefit if adopted in an incremental fashion, although the incremental adoption must be done on an island-by-island basis, not on a router-by-router basis. We call this approach Core-Stateless Fair Queueing (CSFQ) since the core routers keep no per-flow state but instead use the state that is carried in the packet labels.³ We describe the details of this approach - such as the rate estimation algorithm and the packet dropping algorithm - in Section 2. Such a scheme cannot hope to achieve the nearly-perfect levels of fairness obtained by Fair Queueing and other sophisticated and stateful queueing algorithms. However, our interest is not in perfection, but only in obtaining reasonable approximations to the fair bandwidth allocations. We derive a worst-case bound for the performance of this algorithm in an idealized setting. This bound is presented in Section 2. This worst-case analysis does not give an adequate guide to the typical functioning of CSFQ.
In Section 3 we present results from simulation experiments to illustrate the performance of our approach and to compare it to several other schemes: DRR (a variant of Fair Queueing), FRED, RED, and FIFO. We also discuss, therein, the relative mechanistic complexities of these approaches. The first 3 sections of the paper are narrowly focused on the details of the mechanism and its performance (both absolute and relative), with the need for such a mechanism taken for granted. In Section 4 we return to the basic question of why fair allocations are relevant to congestion control. Allocating bandwidth fairly is one way to address what we call the unfriendly flow problem; we also discuss an alternate approach to addressing this problem, the identification approach as described in [8]. We conclude with a summary in Section 5. A longer version of this paper, containing proofs of the theoretical results as well as more complete pseudocode, can be found at http://www.cs.cmu.edu/~istoica/csfq.

2 Core-Stateless Fair Queueing (CSFQ)

In this section, we propose an architecture that approximates the service provided by an island of Fair Queueing routers, but has a much lower complexity in the core routers. The architecture has two key aspects. First, to avoid maintaining per flow state at each router, we use a distributed algorithm in which only edge routers maintain per flow state, while core (non-edge) routers do not maintain per flow state but instead utilize the per-flow information carried via a label in each packet's header.

²By island we mean a contiguous portion of the network, with well-defined interior and edges.
³Obviously these core routers keep some state, but none of it is per-flow state, so when we say "stateless" we are referring to the absence of per-flow state.
This label contains an estimate of the flow's rate; it is initialized by the edge router based on per-flow information, and then updated at each router along the path based only on aggregate information at that router. Second, to avoid per flow buffering and scheduling, as required by Fair Queueing, we use FIFO queueing with probabilistic dropping on input. The probability of dropping a packet as it arrives to the queue is a function of the rate estimate carried in the label and of the fair share rate at that router, which is estimated based on measurements of the aggregate traffic. Thus, our approach avoids both the need to maintain per-flow state and the need to use complicated packet scheduling and buffering algorithms at core routers. To give a better intuition about how this works, we first present the idealized bit-by-bit or fluid version of the probabilistic dropping algorithm, and then extend the algorithm to a practical packet-by-packet version.

2.1 Fluid Model Algorithm

We first consider a bufferless fluid model of a router with output link speed C, where the flows are modelled as a continuous stream of bits. We assume each flow's arrival rate r_i(t) is known precisely. Max-min fair bandwidth allocations are characterized by the fact that all flows that are bottlenecked (i.e., have bits dropped) by this router have the same output rate. We call this rate the fair share rate of the server; let α(t) be the fair share rate at time t. In general, if max-min bandwidth allocations are achieved, each flow i receives service at a rate given by min(r_i(t), α(t)). Let A(t) denote the total arrival rate: A(t) = Σ_i r_i(t). If A(t) > C then the fair share α(t) is the unique solution to

    C = Σ_i min(r_i(t), α(t)).    (1)

If A(t) ≤ C then no bits are dropped and we will, by convention, set α(t) = max_i r_i(t). If r_i(t) ≤ α(t), i.e., flow i sends no more than its fair share rate, all of its traffic will be forwarded. If r_i(t) > α(t), then a fraction (r_i(t) - α(t))/r_i(t) of its bits will be dropped, so it will have an output rate of exactly α(t).
This suggests a very simple probabilistic forwarding algorithm that achieves fair allocation of bandwidth: each incoming bit of flow i is dropped with the probability

    max(0, 1 - α(t)/r_i(t)).    (2)

When these dropping probabilities are used, the arrival rate of flow i at the next hop is given by min(r_i(t), α(t)).

2.2 Packet Algorithm

The above algorithm is defined for a bufferless fluid system in which the arrival rates are known exactly. Our task now is to extend this approach to the situation in real routers where transmission is packetized, there is substantial buffering, and the arrival rates are not known.
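The fluid-model computation can be sketched in a few lines: solve Eq. (1) for α by bisection (the left-hand side is nondecreasing in α), then apply the dropping probability of Eq. (2). The arrival rates and capacity below are hypothetical, chosen only to illustrate the mechanism.

```python
def fair_rate(rates, capacity):
    """Solve C = sum_i min(r_i, alpha) for alpha (Eq. (1)) by bisection.

    By the paper's convention, an uncongested link gets alpha = max(rates).
    """
    if sum(rates) <= capacity:
        return max(rates)
    lo, hi = 0.0, max(rates)
    for _ in range(60):  # sum_i min(r_i, alpha) is nondecreasing in alpha
        mid = (lo + hi) / 2.0
        if sum(min(r, mid) for r in rates) < capacity:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def drop_prob(rate, alpha):
    """Per-bit dropping probability of Eq. (2): max(0, 1 - alpha/r_i)."""
    return max(0.0, 1.0 - alpha / rate)

# Hypothetical rates on a link of capacity 6: the fair share is 3, so the
# output rates min(r_i, alpha) are [1, 2, 3]; only the 5-rate flow is cut.
rates, C = [1.0, 2.0, 5.0], 6.0
alpha = fair_rate(rates, C)
print(round(alpha, 6))                                  # -> 3.0
print([round(min(r, alpha), 6) for r in rates])         # -> [1.0, 2.0, 3.0]
print([round(drop_prob(r, alpha), 6) for r in rates])   # -> [0.0, 0.0, 0.4]
```

As the example shows, flows below the fair share are untouched while the bottlenecked flow is dropped down to exactly α, matching the max-min characterization above.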
Figure 1: The architecture of the output port of an edge router, and a core router, respectively.

We still employ a drop-on-input scheme, except that now we drop packets rather than bits. Because the rate estimation (described below) incorporates the packet size, the dropping probability is independent of the packet size and depends only, as above, on the rate r_i(t) and fair share rate α(t). We are left with two remaining challenges: estimating the rates r_i(t) and the fair share α(t). We address these two issues in turn in the next two subsections, and then discuss the rewriting of the labels. Pseudocode reflecting this algorithm is described in Figure 2. We should note, however, that the main point of our paper is the overall architecture and that the detailed algorithm presented below represents only an initial prototype. While it serves adequately as a proof-of-concept of our architecture, we fully expect that the details of this design will continue to evolve.

2.2.1 Computation of Flow Arrival Rate

Recall that in our architecture, the rates r_i(t) are estimated at the edge routers and then these rates are inserted into the packet labels. At each edge router, we use exponential averaging to estimate the rate of a flow. Let t_i^k and l_i^k be the arrival time and length of the kth packet of flow i. The estimated rate of flow i, r_i, is updated every time a new packet is received:

    r_i^new = (1 - e^(-T_i^k/K)) * l_i^k / T_i^k + e^(-T_i^k/K) * r_i^old,    (3)

where T_i^k = t_i^k - t_i^(k-1) and K is a constant. We discuss the rationale for using the form e^(-T_i^k/K) for the exponential weight in Section 2.7. In the longer version of this paper [22] we show that, under a wide range of conditions, this estimation algorithm converges.

2.2.2 Link Fair Rate Estimation

In this section, we present an estimation algorithm for α(t). To give intuition, consider again the fluid model in Section 2.1 where the arrival rates are known exactly, and assume the system performs the probabilistic dropping algorithm according to Eq. (2).
Then, the rate with which the algorithm accepts packets is a function of the current estimate of the fair share rate, which we denote by α̂(t). Letting F(α̂(t)) denote this acceptance rate, we have

    F(α̂(t)) = Σ_i min(r_i(t), α̂(t)).    (4)

Note that F(·) is a continuous, nondecreasing, concave, and piecewise-linear function of α̂. If the link is congested (A(t) > C) we choose α̂(t) to be the unique solution to F(α̂) = C. If the link is not congested (A(t) < C) we take α̂(t) to be the largest rate among the flows that traverse the link, i.e., α̂(t) = max_i r_i(t). From Eq. (4) note that if we knew the arrival rates r_i(t) we could then compute α(t) directly. To avoid having to keep such per-flow state, we seek instead to implicitly compute α̂(t) by using only aggregate measurements of F and A. We use the following heuristic algorithm with three aggregate state variables: α̂, the estimate for the fair share rate; Â, the estimated aggregate arrival rate; F̂, the estimated rate of the accepted traffic. The last two variables are updated upon the arrival of each packet. For Â we use exponential averaging with a parameter e^(-T/K_α), where T is the inter-arrival time between the current and the previous packet:

    Â_new = (1 - e^(-T/K_α)) * l/T + e^(-T/K_α) * Â_old,    (5)

where Â_old is the value of Â before the updating and l is the length of the current packet. We use an analogous formula to update F̂. The updating rule for α̂ depends on whether the link is congested or not. To filter out the estimation inaccuracies due to exponential smoothing we use a window of size K_c. A link is assumed to be congested if Â ≥ C at all times during an interval of length K_c. Conversely, a link is assumed to be uncongested if Â < C at all times during an interval of length K_c. The value of α̂ is updated only at the end of an interval in which the link is either congested or uncongested according to these definitions. If the link is congested then α̂ is updated based on the equation F(α̂) = C. We approximate F(·) by a linear function that intersects the origin and has slope F̂/α̂_old.
This yields h c CYnezLi = QoldT F (6) If the link is not congested, (r,,, is set to the largest rate of any active flow (i.e., the largest label seen) during the last Ii’, time units. The value of ;u^,,,,, is then used to compute dropping probabilities, according to Eq. (2). For completeness, we give the pseudocode of the CSFQ algorithm in Figure 2. We now describe two minor amendments to this algorithm related to how the buffers are managed. The goal of estimating the fair share G is to match the accepted rate to the link bandwidth. Due to estimation inaccuracies, load fluctuations between G’s updates, and the probabilistic nature of our algorithm, the accepted rate may occasionally exceed the link capacity. While ideally the router’s buffers can accommodate the extra packets, occasionally the router may be forced to drop the incoming packet, due to lack of buffer space. Since drop-tail behavior will defeat the purpose of our algorithm, and may exhibit undesirable properties in the case of adaptive flows such as TCP [9], it, is important to limit, its effect. To do so, we use a simple heuristic: every 120
time the buffer overflows, α̂ is decreased by a small fixed percentage (taken to be 1% in our simulations). Moreover, to avoid overcorrection, we make sure that during consecutive updates α̂ does not decrease by more than 25%. In addition, since there is little reason to consider a link congested if the buffer is almost empty, we apply the following rule. If the link becomes uncongested by the test in Figure 2, then we assume that it remains uncongested as long as the buffer occupancy is less than some predefined threshold. In this paper we use a threshold that is half of the total buffer capacity.

on receiving packet p
    if (edge router)
        i = classify(p);
        p.label = estimate_rate(r_i, p);    /* use Eq. (3) */
    prob = max(0, 1 - α̂/p.label);
    if (prob > unifrand(0, 1))
        α̂ = estimate_α(p, 1);
        drop(p);
    else
        α̂ = estimate_α(p, 0);
        enqueue(p);
    if (prob > 0)
        p.label = α̂;    /* relabel p */

estimate_α(p, dropped)
    estimate_rate(Â, p);    /* est. arrival rate (use Eq. (5)) */
    if (dropped == FALSE)
        estimate_rate(F̂, p);    /* est. accepted traffic rate */
    if (Â ≥ C)
        if (congested == FALSE)
            congested = TRUE;
            start_time = crt_time;
        else if (crt_time > start_time + K_c)
            α̂ = α̂ × C/F̂;
            start_time = crt_time;
    else    /* Â < C */
        if (congested == TRUE)
            congested = FALSE;
            start_time = crt_time;
            tmp_α = 0;    /* used to compute new α̂ */
        else if (crt_time < start_time + K_c)
            tmp_α = max(tmp_α, p.label);
        else
            α̂ = tmp_α;
            start_time = crt_time;
            tmp_α = 0;
    return α̂;

Figure 2: The pseudocode of CSFQ.

2.2.3 Label Rewriting

Our rate estimation algorithm in Section 2.2.1 allows us to label packets with their flow's rate as they enter the island. Our packet dropping algorithm described in Section 2.2.2 allows us to limit flows to their fair share of the bandwidth. After a flow experiences significant losses at a congested link inside the island, however, the packet labels are no longer an accurate estimate of its rate. We cannot rerun our estimation algorithm, because it involves per-flow state. Fortunately, as noted in Section 2.1, the outgoing rate is merely the minimum between the incoming rate and the fair rate α. Therefore, we rewrite the packet label L as

    L_new = min(L_old, α).

By doing so, the outgoing flow rates will be properly represented by the packet labels.

2.3 Weighted CSFQ

The CSFQ algorithm can be extended to support flows with different weights. Let w_i denote the weight of flow i. Returning to our fluid model, the meaning of these weights is that we say a fair allocation is one in which all bottlenecked flows have the same value for r_i/w_i. Then, if A(t) > C, the normalized fair rate α(t) is the unique value such that Σ_i w_i min(α, r_i/w_i) = C. The expression for the dropping probabilities in the weighted case is max(0, 1 - α w_i/r_i). The only other major change is that the label is now r_i/w_i, instead of simply r_i. Finally, without going into details we note that the weighted packet-by-packet version is virtually identical to the corresponding version of the plain CSFQ algorithm. It is important to note that with weighted CSFQ we can only approximate islands in which each flow has the same weight at all routers in an island. That is, our algorithm cannot accommodate situations where the relative weights of flows differ from router to router within an island. However, even with this limitation, weighted CSFQ may prove a valuable mechanism in implementing differential services, such as the one proposed in [24].

2.4 Performance Bounds

We now present the main theoretical result of the paper. For generality, this result is given for weighted CSFQ. The proof is given in [22]. Our algorithm is built around several estimation procedures, and thus is inherently inexact. One natural concern is whether a flow can purposely "exploit" these inaccuracies to get more than its fair share of bandwidth. We cannot answer this question in full generality, but we can analyze a simplified situation where the normalized fair share rate α is held fixed and there is no buffering, so the drop probabilities are applied precisely. In addition, we assume that when a packet arrives a fraction of that packet equal to the flow's forwarding probability is transmitted. Note that during any time interval [t1, t2) a flow with weight w is entitled to receive at most wα(t2 - t1) service time; we call any amount above this the excess service. We can bound this excess service, and the bounds are independent of both the arrival process and the length of the time interval during which the flow is active. The bound does depend crucially on the maximal rate R at which a flow's packets can arrive at a router (limited, for example, by the speed of the flow's access link); the smaller this rate R the tighter the bound.

Theorem 1 Consider a link with a constant normalized fair rate α, and a flow with weight w. Then, the excess service received by a flow with weight w, that sends at a rate no larger than R, is bounded above by
We cannot answer this question in full generality, but we can analyze a simplified situation where the normalized fair share rate N is held fixed and there is no buffering, so the drop probabilities are precisely given by Eq. (2). In addition, we assume that when a packet arrives a fraction of that, packet equal to the flow’s forwarding probability is transmitted. Note that during any time interval [tl, t2) a flow with weight w is entitled to receive at most, zua(tz - tl) service time; we call any amount above this the excess service. We can bound this excess service, and the bounds are independent of both the arrival process and the length of the time interval during which the How is active. The bound does depend crucially on the maximal rat,e R at which a Hows packets can arrive at, a router (limited, for example, by the speed of the flow’s access link); the smaller this rate R the tighter the bound. Theorem 1 Consider a link with a constant normalizedfair rute CY, and a flow with weight u). Then, the e.rcess seruzce received by a flow ,witla weight w, that sends at a ratr, no lurger than R is bounded above by 121
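To make the fluid model of Section 2.3 concrete, the sketch below solves the defining equation Σ_i w_i min(α, r_i/w_i) = C for the normalized fair rate α by bisection and derives each flow's drop probability max(0, 1 - α·w_i/r_i). This is an illustration of the fluid-model definition, not the paper's iterative estimation algorithm of Figure 2; the rates, weights, and tolerance are example values of ours.

```python
def normalized_fair_rate(rates, weights, C, eps=1e-9):
    """Solve sum_i w_i * min(alpha, r_i / w_i) = C for alpha by bisection.
    Assumes the link is congested, i.e., sum(rates) > C."""
    def served(alpha):
        # Aggregate accepted traffic if the normalized fair rate were alpha.
        return sum(w * min(alpha, r / w) for r, w in zip(rates, weights))
    lo, hi = 0.0, max(r / w for r, w in zip(rates, weights))
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if served(mid) > C:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def drop_prob(rate, weight, alpha):
    """Weighted CSFQ dropping probability, max(0, 1 - alpha * w / r)."""
    return max(0.0, 1.0 - alpha * weight / rate)

# Example: three equal-weight flows sending 2, 4, and 8 Mbps over a
# 10 Mbps link. alpha works out to 4 Mbps: the 2 and 4 Mbps flows pass
# untouched, the 8 Mbps flow is dropped with probability 0.5, and the
# accepted traffic sums to exactly C.
alpha = normalized_fair_rate([2e6, 4e6, 8e6], [1, 1, 1], C=10e6)
```

With unit weights this reduces to plain CSFQ, where the drop probability is max(0, 1 - α/r_i), matching Eq. (2).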
2.7 Miscellaneous Details

Having presented the basic CSFQ algorithm, we now return to discuss a few aspects in more detail.

We have used exponential averaging to estimate the arrival rate in Eq. (3). However, instead of using a constant exponential weight we used e^(-T/K), where T is the inter-packet arrival time and K is a constant. Our motivation was that e^(-T/K) more closely reflects a fluid averaging process which is independent of the packetizing structure. More specifically, it can be shown that if a constant weight is used, the estimated rate will be sensitive to the packet length distribution, and there are pathological cases where the estimated rate differs from the real arrival rate by a constant factor; this would allow flows to exploit the estimation process and obtain more than their fair share. In contrast, by using a parameter of e^(-T/K), the estimated rate will asymptotically converge to the real rate, and this allows us to bound the excess service that can be achieved (as in Theorem 1). We used a similar averaging process in Eq. (5) to estimate the total arrival rate A.

The choice of K in the above expression e^(-T/K) presents us with several tradeoffs. First, while a smaller K increases the system responsiveness to rapid rate fluctuations, a larger K better filters the noise and avoids potential system instability. Second, K should be large enough that the estimated rate, calculated at the edge of the network, remains reasonably accurate after a packet traverses multiple links. This is because the delay-jitter changes the packets' inter-arrival pattern, which may result in an increased discrepancy between the estimated rate (received in the packets' labels) and the real rate. To counteract this effect, as a rule of thumb, K should be one order of magnitude larger than the delay-jitter experienced by a flow over a time interval of the same size, K. Third, K should be no larger than the average duration of a flow.
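The estimator described above can be sketched as follows. This is a minimal illustration that assumes Eq. (3), which is not reproduced in this excerpt, has the standard exponential-averaging form r_new = (1 - e^(-T/K))·(l/T) + e^(-T/K)·r_old for a packet of length l; the value K = 100 ms and the bootstrap estimate for a flow's first packet are example choices of ours.

```python
import math

class RateEstimator:
    """Exponentially averaged rate estimator with the packet-dependent
    weight e^(-T/K) described above (assumed form of Eq. (3))."""

    def __init__(self, K):
        self.K = K          # averaging constant, in seconds
        self.rate = 0.0     # current rate estimate, in bits/sec
        self.last = None    # arrival time of the previous packet

    def update(self, now, length_bits):
        if self.last is None:
            # Bootstrap for the first packet: a modeling choice of ours.
            self.rate = length_bits / self.K
        else:
            T = now - self.last                 # inter-packet arrival time
            w = math.exp(-T / self.K)
            self.rate = (1 - w) * (length_bits / T) + w * self.rate
        self.last = now
        return self.rate

# A flow sending 1000-bit packets every 1 ms has a true rate of 1 Mbps;
# the estimate converges toward it regardless of the bootstrap value.
est = RateEstimator(K=0.1)
t = 0.0
for _ in range(5000):
    r = est.update(t, 1000)
    t += 0.001
```

Because the weight depends on T, a burst of closely spaced packets moves the estimate only slightly per packet, which is what makes the estimate insensitive to the packetizing structure.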
Based on these constraints, an appropriate value for K would be between 100 and 500 ms.

A second issue relates to the requirement of CSFQ for a label to be carried in each packet. One possibility is to use the Type Of Service byte in the IP header. For example, by using a floating point representation with four bits for mantissa and four bits for exponent we can represent any rate between 1 Kbps and 65 Mbps with an accuracy of 6.25%. Another possibility is to define an IP option in the case of IPv4, or a hop-by-hop extension header in the case of IPv6.

2.5 Implementation Complexity

At core routers, both the time and space complexity of our algorithm are constant with respect to the number of competing flows, and thus we think CSFQ could be implemented in very high speed core routers. At each edge router CSFQ needs to maintain per-flow state. Upon each arrival of each packet, the edge router needs to (1) classify the packet to a flow, (2) update the fair share rate estimation for the corresponding outgoing link, (3) update the flow rate estimation, and (4) label the packet. All these operations with the exception of packet classification can be efficiently implemented today.

Efficient and general-purpose packet classification algorithms are still under active research. We expect to leverage these results. We also note that packet classification at ingress nodes is needed for a number of other purposes, such as in the context of Multiprotocol Label Switching (MPLS) [4] or for accounting purposes; therefore, the classification required for CSFQ may not be an extra cost.
In addition, if the edge routers are typically not on the high-speed backbone links then there is no problem, as classification at moderate speeds is quite practical.

2.6 Architectural Considerations

We have used the term flow without defining what we mean. This was intentional, as the CSFQ approach can be applied to varying degrees of flow granularity; that is, what constitutes a flow is arbitrary as long as all packets in the flow follow the same path within the core. In this paper, for convenience, a flow is implicitly defined as a source-destination pair, but one could easily assign fair rates to many other granularities such as source-destination-ports. Moreover, the unit of "flow" can vary from island to island as long as the rates are re-estimated when entering a new island.

Similarly, we have not been precise about the size of these CSFQ islands. In one extreme, we could take each router as an island and estimate rates at every router; this would allow us to avoid the use of complicated per-flow scheduling and dropping algorithms, but would require per-flow classification. Another possibility is that ISPs could extend their island of CSFQ routers to the very edge of their network, having their edge routers at the points where customers' packets enter the ISP's network. Building on the previous scenario, multiple ISPs could combine their islands so that classification and estimation did not have to be performed at ISP-ISP boundaries. The key obstacle here is one of trust between ISPs.

3 Simulations

In this section we evaluate our algorithm by simulation. To provide some context, we compare CSFQ's performance to four additional algorithms. Two of these, FIFO and RED, represent baseline cases where routers do not attempt to achieve fair bandwidth allocations. The other two algorithms, FRED and DRR, represent different approaches to achieving fairness.
- FIFO (First In First Out) - Packets are served in a first-in first-out order, and the buffers are managed using a simple drop-tail strategy; i.e., incoming packets are dropped when the buffer is full.

- RED (Random Early Detection) - Packets are served in a first-in first-out order, but the buffer management is significantly more sophisticated than drop-tail. RED [9] starts to probabilistically drop packets long
before the buffer is full, providing early congestion indication to flows, which can then gracefully back off before the buffer overflows. RED maintains two buffer thresholds. When the exponentially averaged buffer occupancy is smaller than the first threshold, no packet is dropped, and when the exponentially averaged buffer occupancy is larger than the second threshold all packets are dropped. When the exponentially averaged buffer occupancy is between the two thresholds, the packet dropping probability increases linearly with buffer occupancy.

- FRED (Flow Random Early Drop) - This algorithm extends RED to provide some degree of fair bandwidth allocation [14]. To achieve fairness, FRED maintains state for all flows that have at least one packet in the buffer. Unlike RED, where the dropping decision is based only on the buffer state, in FRED dropping decisions are based on this flow state. Specifically, FRED preferentially drops a packet of a flow that has either (1) had many packets dropped in the past, or (2) a queue larger than the average queue size. FRED has two variants, which we will call FRED-1 and FRED-2. The main difference between the two is that FRED-2 guarantees to each flow a minimum number of buffers. As a general rule, FRED-2 performs better than FRED-1 only when the number of flows is large. In the following data, when we do not distinguish between the two, we are quoting the results from the version of FRED which performed better.

- DRR (Deficit Round Robin) - This algorithm [20] represents an efficient implementation of the well-known weighted fair queueing (WFQ) discipline. The buffer management scheme assumes that when the buffer is full the packet from the longest queue is dropped. DRR is the only one of the four to use a sophisticated per-flow queueing algorithm, and thus achieves the highest degree of fairness.

Figure 3: Simulation results for a 10 Mbps link shared by N flows. (a) The average throughput over 10 sec when N = 32 and all flows are UDPs. The arrival rate for flow i is (i + 1) times larger than its fair share. The flows are indexed from 0. (b) The throughputs of one UDP flow (indexed 0) sending at 10 Mbps, and of 31 TCP flows sharing a 10 Mbps link.

These four algorithms represent four different levels of complexity. DRR and FRED have to classify incoming flows, whereas FIFO and RED do not. DRR in addition has to implement its packet scheduling algorithm, whereas the rest all use first-in-first-out scheduling. CSFQ edge routers have complexity comparable to FRED, and CSFQ core routers have complexity comparable to RED.

We have examined the behavior of CSFQ under a variety of conditions. We use an assortment of traffic sources (mainly TCP sources and constant bit rate UDP sources,⁴ but also some on-off sources) and topologies. Due to space limitations, we only report on a small sampling of the simulations we have run.⁵ All simulations were performed in ns-2 [17], which provides accurate packet-level implementations of various network protocols, such as TCP and RLM [15] (Receiver-driven Layered Multicast), and various buffer management and scheduling algorithms, such as RED and DRR. All algorithms used in the simulation, except CSFQ and FRED, were part of the standard ns-2 distribution.

Unless otherwise specified, we use the following parameters for the simulations in this section. Each output link has a capacity of 10 Mbps, a latency of 1 ms, and a buffer of 64 KB. In the RED and FRED cases the first threshold is set to 16 KB, while the second one is set to 32 KB. The averaging constants K (used in estimating the flow rate), K_α (used in estimating the fair rate), and K_c (used in making the decision of whether a link is congested or not) are all set to 100 ms unless specified otherwise.
The general rule of thumb we follow in this paper is to choose these constants to be roughly two times larger than the maximum queueing delay (i.e., 64 KB / 10 Mbps = 51.2 ms).⁶ Finally, in all topologies the first router on the path of each flow is always assumed to be an edge router; all other routers are assumed without exception to be core routers.

⁴ This source, referred to as UDP in the remainder of the paper, has fixed size packets, and the packet interarrival times are uniformly distributed between [0.5 × avg, 1.5 × avg), where avg is the average interarrival time.

⁵ A fuller set of tests, and the scripts used to run them, is available at http://www.cs.cmu.edu/~istoica/csfq.

⁶ It can be shown that by using this rule an idle link that becomes suddenly congested by a set of identical UDP sources will not experience buffer overflow before the algorithm detects the congestion, as long as the aggregate arrival rate is less than 10 times the link capacity (see [22]).

We simulated the other four algorithms to give us benchmarks against which to assess these results. We use DRR as our model of fairness and use the baseline cases, FIFO and
Figure 4: The normalized bandwidth of a TCP flow that competes with N - 1 UDP flows sending at t,wice their allocated rates, as a function of N. RED, as representing the (unfair) status quo. The goal of these experiments is to determine where CSFQ sits between these two ext,remes. FRED is a more ambiguous benchmark, being somewhat more complex than CSFQ but not as complex as DRR. In general, we find that CSFQ achieves a reasonable degree of fairness, significantly closer to DRR t,han to FIFO or RED. CSFQ’s performance is typically comparable to FRED’s, although there are several situations where CSFQ significantly outperforms FRED. There are a large number of experiments and each experiment involves rather complex dynamics. Due to space limitations, in the sections that follow we will merely highlight a few important points and omit detailed explanations of the dynamics. 3.1 A Single Congested Link We first consider a single 10 Mbps congested link shared by N flows. The propagation delay along the link is 1 ms. We performed three related experiments. In the first, experiment, we have 32 UDP flows, indexed from 0, where flow i sends i + 1 times more than its fair share of 0.31.25 Mbps. Thus flow 0 sends 0.3125 Mbps, flow 1 sends 0.625 Mbps, and so on. Figure 3(a) shows the average throughput of each flow over a 10 set interval; FIFO, RED, and FRED-1 fail to ensure fairness, with each flow gett,ing a share proportional to its incoming rate, while DRR is extremely effective in achieving a fair bandwidth distribution. CSFQ and FRED-2 achieve a less precise degree of fairness; for CSFQ the throughputs of all flows are between -11% and +5% of the ideal value. In the second experiment we consider the impact of an ill-behaved IJDP flow on a set of TCP flows. More precisely, the traffic of flow 0 comes from a UDP source that sends at 10 Mbps, while all the other flows (from 1 to 31) are TCPs. Figure 3(b) shows the throughput of each flow averaged over a 10 set interval. 
The only two algorithms that can most effectively contain the UDP flow are DRR and CSFQ. Under FRED the UDP flow gets almost 1.8 Mbps - close to six times more than its fair share - while the LJDP only gets 0.396 Mbps and 0.361 Mbps under DRR and CSFQ, respectively. As expected FIFO and RED perform poorly, with the UDP flow getting over 8 Mbps in both cases. In the final experiment, we measure how well the algorit hms can protect a single TCP flow against multiple Figure 5: Topology for analyzing the effects of multiple congested links on the throughput of a flow. Each link has ten cross flows (all UDPs). All links have 10 Mbps capacities. The sending rates of all UDPs, excepting UDP-0, are 2 Mbps, which leads to all links between routers being congested. ill-behaved flows. We perform 31 simulations, each for a different value of N, N = 2.. .32. In each simulation we take one TCP flow and N - 1 UDP flows; each UDP sends at twice its fair share rate of $-bps. Figure 4 plots the ratio between the average throughput of the TCP flow over 10 seconds and the fair share bandwidth it should receive as a function of the total number of flows in the system N. There are three points of interest. First, DRR performs very well when there are less than 22 flows, but its performances decreases afterwards. This is because the TCP flow’s buffer share is less than three buffers, which significantly affects its throughput. Second, CSFQ performs better than DRR when the number of flows is large. This is because CSFQ is able to cope better with the TCP burstiness by allowing the TCP flow to have several packets buffered for short time intervals. Finally, across the entire range, CSFQ provides similar or better performance as compared to FRED. 3.2 Multiple Congested Links We now analyze how the throughput of a well-behaved flow is affected when the flow traverses more than one congested link. We performed two experiments based on the topology shown in Figure 5. 
All UDPs, except UDP-0, send at 2 Mbps. Since each link in the system has 10 Mbps capacity, this results in all links between routers being congested. In the first experiment, we have a UDP flow (denoted UDP-0) sending at its fair share rate of 0.909 Mbps. Figure 6(a) shows the fraction of UDP-0's traffic that is forwarded versus the number of congested links. CSFQ and FRED perform reasonably well, although not quite as well as DRR. In the second experiment we replace UDP-0 with a TCP flow. Similarly, Figure 6(b) plots the normalized TCP throughput against the number of congested links. Again, DRR and CSFQ prove to be effective. In comparison, FRED performs significantly worse, though still much better than RED and FIFO. The reason is that while DRR and CSFQ try to allocate bandwidth fairly among competing flows during congestion, FRED tries to allocate buffers fairly. Flows with different end-to-end congestion control algorithms will achieve different throughputs even if routers try to allocate buffers fairly.
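As a sanity check on the 0.909 Mbps figure above: each congested link in Figure 5 carries UDP-0 plus ten cross UDPs sending 2 Mbps each, so the max-min fair rate alpha must satisfy min(r_0, alpha) + 10 * min(2, alpha) = 10 Mbps. With UDP-0 sending exactly its fair share (r_0 = alpha), this reduces to 11 * alpha = 10, i.e. alpha = 10/11, so ideally all of UDP-0's traffic would be forwarded at every hop. A two-line check of this arithmetic (our own, not from the paper):

```python
# Fair rate on one link of Figure 5: ten cross UDPs demand 2 Mbps each
# and UDP-0 demands exactly its fair share, over a 10 Mbps link.
alpha = 10.0 / 11.0                               # ~0.909 Mbps
total = min(alpha, alpha) + 10 * min(2.0, alpha)  # allocations exhaust capacity
```

Because UDP-0 never exceeds alpha on any hop, any packets it loses in Figure 6(a) reflect the approximation error of the scheme, not a fair-share violation.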
Figure 6: (a) The normalized throughput of UDP-0 as a function of the number of congested links. (b) The same plot when UDP-0 is replaced by a TCP flow.

Table 1: Statistics for an ON-OFF flow with 19 competing TCP flows (all numbers are in packets).

  Algorithm  delivered  dropped
  DRR        601        6157
  CSFQ       1680       5078

Table 2: The mean transfer times (in ms) and the corresponding standard deviations for 60 short TCPs in the presence of a UDP flow that sends at the link capacity, i.e., 10 Mbps.

  Algorithm  mean time  std. dev
  RED        592        1274
  FIFO       840        1695

Table 3: The mean throughputs (in packets) and standard deviations for 19 TCPs in the presence of a UDP flow along a link with propagation delay of 100 ms. The UDP sends at the link capacity of 10 Mbps.

  Algorithm  mean  std. dev
  DRR        6080  64
  CSFQ       5761  220
  FRED       4974  190
  RED        628   80
  FIFO       378   69

3.3 Coexistence of Different Adaptation Schemes

In this experiment we investigate the extent to which CSFQ can deal with flows that employ different adaptation schemes. Receiver-driven Layered Multicast (RLM) [15] is an adaptive scheme in which the source sends the information encoded into a number of layers (each to its own multicast group) and the receiver joins or leaves the groups associated with the layers based on how many packet drops it is experiencing. We consider a 4 Mbps link traversed by one TCP and three RLM flows. Each source uses a seven layer encoding, where layer i sends 2^(i+4) Kbps; each layer is modeled by a UDP traffic source. The fair share of each flow is 1 Mbps. In the RLM case this will correspond to each receiver subscribing to the first five layers. (More precisely, we have sum_{i=1}^{5} 2^(i+4) Kbps = 0.992 Mbps.)

The receiving rates averaged over 1 second intervals for each algorithm are plotted in Figure 7. Since in this experiment the link bandwidth is 4 Mbps and the router buffer size is 64 KB, we set the constants K, Kα, and Kc to be 250 ms, i.e., about two times larger than the maximum queue delay.
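The five-layer figure follows directly from the exponential layer rates: layer i contributes 2^(i+4) Kbps, so the cumulative rate of the first k layers is the sum of those terms. A quick check of this arithmetic (our own, mirroring the parenthetical above):

```python
# Cumulative receiving rate (in Kbps) when subscribed to the first k
# RLM layers, with layer i sending 2**(i+4) Kbps.
def cumulative_kbps(k):
    return sum(2 ** (i + 4) for i in range(1, k + 1))

# Against a 1 Mbps (1000 Kbps) fair share: five layers fit (992 Kbps),
# while adding the sixth would more than double the rate (2016 Kbps).
five = cumulative_kbps(5)
six = cumulative_kbps(6)
```

This is why, under a perfectly fair allocation, each RLM receiver should stabilize at exactly five layers.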
An interesting point to notice is that, unlike DRR and CSFQ, FRED does not provide fair bandwidth allocation in this scenario. Again, as discussed in Section 3.2, this is due to the fact that RLM and TCP use different end-to-end congestion control algorithms.

3.4 Different Traffic Models

So far we have only considered UDP, TCP and layered multicast traffic sources. We now look at two additional source models with greater degrees of burstiness. We again consider a single 10 Mbps congested link. In the first experiment, this link is shared by one ON-OFF source and 19 TCPs. The ON and OFF periods of the ON-OFF source are both drawn from exponential distributions with means of 100 ms and 1900 ms respectively. During the ON period the ON-OFF source sends at 10 Mbps. Note that the ON time is on the same order as the averaging intervals K, Kα, and Kc, which are all 100 ms, so this experiment is designed to test to what extent CSFQ can react over short timescales.

The ON-OFF source sent 6758 packets over the course of the experiment. Table 1 shows the number of packets from the ON-OFF source dropped at the congested link. The DRR results show what happens when the ON-OFF source is restricted to its fair share at all times. FRED and CSFQ also are able to achieve a high degree of fairness.

Our next experiment simulates Web traffic. There are 60 TCP transfers whose inter-arrival times are exponentially distributed with a mean of 0.05 ms, and the length of each
transfer is drawn from a Pareto distribution with a mean of 20 packets (1 packet = 1 KB) and a shaping parameter of 1.06. These values are consistent with those presented in [5]. In addition, there is a single 10 Mbps UDP flow. Table 2 presents the mean transfer time and the corresponding standard deviations. Here, CSFQ performs worse than FRED, mainly because it has a larger average queue size, but still almost one order of magnitude better than FIFO and RED.

Figure 7: The throughput of three RLM flows and one TCP flow along a 4 Mbps link.

3.5 Large Latency

All of our experiments so far have had small link delays (1 ms). In this experiment we again consider a single 10 Mbps congested link, but now with a propagation delay of 100 ms. The load is comprised of one UDP flow that sends at the link speed and 19 TCP flows. Due to the large propagation delay, in this experiment we set the buffer size to be 256 KB, and K, Kα, and Kc to be 400 ms. Table 3 shows the average number of packets of a TCP flow during a 100 seconds interval. Both CSFQ and FRED perform reasonably well.

3.6 Packet Relabeling

Recall that when the dropping probability of a packet is non-zero we relabel it with the fair rate α so that the label of the packet will reflect the new rate of the flow. To test how well this works in practice, we consider the topology in Figure 8, where each link is 10 Mbps. Note that as long as all three flows attempt to use their full fair share, the fair shares of flows 1 and 2 are less on link 2 (3.33 Mbps) than on link 1 (5 Mbps), so there will be dropping on both links. This will test the relabeling function to make sure that the incoming rates are accurately reflected on the second link. We perform two experiments (only looking at CSFQ's performance). In the first, there are three UDPs sending data
at 10 Mbps each. Table 4 shows the average throughputs over 10 sec of the three UDP flows. As expected, these rates are close to 3.33 Mbps. In the second experiment, we replace the three UDPs by three TCPs. Again, despite the TCP burstiness, which may negatively affect the rate estimation and relabeling accuracy, each TCP gets close to its fair share.

Figure 8: Simulation scenario for the packet relabeling experiment. Each link has 10 Mbps capacity, and a propagation delay of 1 ms.

Table 4: The throughputs resulting from CSFQ averaged over 10 seconds for the three flows in Figure 8 along link 2.

  Traffic  Flow 1  Flow 2  Flow 3
  UDP      3.36    3.32    3.28
  TCP      3.43    3.13    3.43

3.7 Discussion of Simulation Results

We have tested CSFQ under a wide range of conditions, conditions purposely designed to stress its ability to achieve fair allocations. These tests, and the others we have run but cannot show here because of space limitations, suggest that CSFQ achieves a reasonable approximation of fair bandwidth allocations in most conditions. Certainly CSFQ is far superior in this regard to the status quo (FIFO or RED). Moreover, in all situations CSFQ is roughly comparable with FRED, and in some cases it achieves significantly fairer allocations. Recall that FRED requires per-packet flow classification while CSFQ does not, so we are achieving these levels of fairness in a more scalable manner. However, there is clearly room for improvement in CSFQ, as there are cases where its performance is significantly below that of its benchmark, DRR. We do not yet know if these are due to our particular choices for the estimation algorithms, or are inherent properties of the CSFQ architecture.

4 Why Are Fair Allocations Important?
In the Introduction we stated that one of the underlying assumptions of this work is that fairly allocating bandwidth is beneficial, and perhaps even crucial, for congestion control. In this section we motivate the role of fair allocations in congestion control by discussing the problem of unfriendly flows, and then presenting two approaches to this problem; we end this section with a discussion of the role of punishment. In what follows we borrow heavily from [7], [3], and [8], and have benefited greatly from conversations with Steve Deering and Sally Floyd. We should note that the matters addressed in this section are rather controversial and this overview unavoidably reflects our prejudices. This section, however, is merely intended to provide some perspective on our motivation for this work, and any biases in this overview should not undercut the technical aspects of the CSFQ proposal that are the main focus of the previous sections.

4.1 The Unfriendly Flow Problem

Data networks such as the Internet, because of their reliance on statistical multiplexing, must provide some mechanism to control congestion. The current Internet, which has mostly FIFO queueing and drop-tail mechanisms in its routers, relies on end-to-end congestion control in which hosts curtail their transmission rates when they detect that the network is congested. The most widely utilized form of end-to-end congestion control is that embodied in TCP [11], which has been tremendously successful in preventing congestion collapse. The efficacy of this approach depends on two fundamental assumptions: (1) all (or almost all) flows are cooperative in that they implement congestion control algorithms, and (2) these algorithms are homogeneous - or roughly equivalent - in that they produce similar bandwidth allocations if used in similar circumstances.
In particular, assumption (2) requires, in the language of [8], that all flows are TCP-friendly.⁸

The assumption of universal cooperation can be violated in three general ways. First, some applications are unresponsive in that they don't implement any congestion control algorithms at all. Most of the early multimedia and multicast applications, like vat, nv, vic, wb and RealAudio, fall into this category. Second, some applications use congestion control algorithms that, while responsive, are not TCP-friendly. RLM is such an algorithm.⁹ Third, some users will cheat and use a non-TCP congestion control algorithm to get more bandwidth. An example of this would be using a modified form of TCP with, for instance, a larger initial window and window opening constants.

Each of these forms of noncooperation can have a significant negative impact on the performance obtained by cooperating flows. At present, we do not yet know how widespread noncooperation will be, and thus cannot assess the level of harm it will cause. However, in lieu of more solid evidence that noncooperation will not be a problem, it seems unsound to base the Internet's congestion control paradigm on the assumption of universal cooperation. We therefore started this paper with the fundamental assumption that one needs to deal with the problem of unfriendly flows.

8. Actually, the term TCP-friendly in [8] means that "their arrival rate does not exceed that of any TCP connection in the same circumstances." Here we use it to mean that the arrival rates are roughly comparable, a property that should be more precisely called TCP-equivalent. We blur the distinction between TCP-friendly and TCP-equivalent to avoid an overly unwieldy set of terms in this short overview. However, we think the distinction may be rendered moot since it is unlikely that congestion control algorithms that are not TCP-equivalent but are TCP-friendly - i.e., they get much less than their fair share - will be widely deployed.
9. Although our data in Section 3.3 showed RLM receiving less than its fair share, when we change the simulation scenario so that the TCP flow starts after all the RLM flows, the TCP flow receives less than half of its fair share. This hysteresis in the RLM versus TCP behavior was first pointed out to us by Steve McCanne [15].