Congestion Avoidance and Control∗ Van Jacobson† Lawrence Berkeley Laboratory Michael J. Karels‡ University of California at Berkeley November, 1988 Introduction Computer networks have experienced an explosive growth over the past few years and with that growth have come severe congestion problems. For example, it is now common to see internet gateways drop 10% of the incoming packets because of local buffer overflows. Our investigation of some of these problems has shown that much of the cause lies in transport protocol implementations (not in the protocols themselves): The ‘obvious’ ways to implement a window-based transport protocol can result in exactly the wrong behavior in response to network congestion. We give examples of ‘wrong’ behavior and describe some simple algorithms that can be used to make right things happen. The algorithms are rooted in the idea of achieving network stability by forcing the transport connection to obey a ‘packet conservation’ principle. We show how the algorithms derive from this principle and what effect they have on traffic over congested networks. In October of ’86, the Internet had the first of what became a series of ‘congestion collapses’. During this period, the data throughput from LBL to UC Berkeley (sites separated by 400 yards and two IMP hops) dropped from 32 Kbps to 40 bps. We were fascinated by this sudden factor-of-thousand drop in bandwidth and embarked on an investigation of why things had gotten so bad. In particular, we wondered if the 4.3BSD (Berkeley UNIX) TCP was mis-behaving or if it could be tuned to work better under abysmal network conditions. The answer to both of these questions was “yes”. ∗Note: This is a very slightly revised version of a paper originally presented at SIGCOMM ’88 [12]. If you wish to reference this work, please cite [12]. †This work was supported in part by the U.S. Department of Energy under Contract Number DE-AC03- 76SF00098. ‡This work was supported by the U.S. 
Department of Commerce, National Bureau of Standards, under Grant Number 60NANB8D0830.
Since that time, we have put seven new algorithms into the 4BSD TCP:

(i) round-trip-time variance estimation
(ii) exponential retransmit timer backoff
(iii) slow-start
(iv) more aggressive receiver ack policy
(v) dynamic window sizing on congestion
(vi) Karn’s clamped retransmit backoff
(vii) fast retransmit

Our measurements and the reports of beta testers suggest that the final product is fairly good at dealing with congested conditions on the Internet. This paper is a brief description of (i) – (v) and the rationale behind them. (vi) is an algorithm recently developed by Phil Karn of Bell Communications Research, described in [16]. (vii) is described in a soon-to-be-published RFC (ARPANET “Request for Comments”). Algorithms (i) – (v) spring from one observation: The flow on a TCP connection (or ISO TP-4 or Xerox NS SPP connection) should obey a ‘conservation of packets’ principle. And, if this principle were obeyed, congestion collapse would become the exception rather than the rule. Thus congestion control involves finding places that violate conservation and fixing them.

By ‘conservation of packets’ we mean that for a connection ‘in equilibrium’, i.e., running stably with a full window of data in transit, the packet flow is what a physicist would call ‘conservative’: A new packet isn’t put into the network until an old packet leaves. The physics of flow predicts that systems with this property should be robust in the face of congestion.1 Observation of the Internet suggests that it was not particularly robust. Why the discrepancy?

There are only three ways for packet conservation to fail:

1. The connection doesn’t get to equilibrium, or
2. A sender injects a new packet before an old packet has exited, or
3. The equilibrium can’t be reached because of resource limits along the path.

In the following sections, we treat each of these in turn.
1 Getting to Equilibrium: Slow-start

Failure (1) has to be from a connection that is either starting or restarting after a packet loss. Another way to look at the conservation property is to say that the sender uses acks as a ‘clock’ to strobe new packets into the network. Since the receiver can generate acks no faster than data packets can get through the network, the protocol is ‘self clocking’ (fig. 1). Self clocking systems automatically adjust to bandwidth and delay variations and have a wide dynamic range (important considering that TCP spans a range from 800 Mbps Cray channels to 1200 bps packet radio links).

Figure 1: Window Flow Control ‘Self-clocking’

[Figure: sender and receiver funnels joined by a narrow bottleneck link, labeled with bottleneck packet spacing Pb, receiver-side spacings Pr and Ar, and return-path ack spacings Ab and As.] This is a schematic representation of a sender and receiver on high bandwidth networks connected by a slower, long-haul net. The sender is just starting and has shipped a window’s worth of packets, back-to-back. The ack for the first of those packets is about to arrive back at the sender (the vertical line at the mouth of the lower left funnel). The vertical dimension is bandwidth, the horizontal dimension is time. Each of the shaded boxes is a packet. Bandwidth × Time = Bits so the area of each box is the packet size. The number of bits doesn’t change as a packet goes through the network so a packet squeezed into the smaller long-haul bandwidth must spread out in time. The time Pb represents the minimum packet spacing on the slowest link in the path (the bottleneck). As the packets leave the bottleneck for the destination net, nothing changes the inter-packet interval so on the receiver’s net packet spacing Pr = Pb. If the receiver processing time is the same for all packets, the spacing between acks on the receiver’s net Ar = Pr = Pb. If the time slot Pb was big enough for a packet, it’s big enough for an ack so the ack spacing is preserved along the return path. Thus the ack spacing on the sender’s net As = Pb. So, if packets after the first burst are sent only in response to an ack, the sender’s packet spacing will exactly match the packet time on the slowest link in the path.

1A conservative flow means that for any given time, the integral of the packet density around the sender–receiver–sender loop is a constant. Since packets have to ‘diffuse’ around this loop, the integral is sufficiently continuous to be a Lyapunov function for the system. A constant function trivially meets the conditions for Lyapunov stability so the system is stable and any superposition of such systems is stable. (See [3], chap. 11–12 or [21], chap. 9 for excellent introductions to system stability theory.)
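A small worked example of the spacing argument above (the link speeds and packet size here are illustrative choices, not measurements from the paper):

```python
# Illustration of the bottleneck spacing P_b from figure 1.
# Numbers are hypothetical: 512-byte packets, a 56 Kbps bottleneck,
# and a 10 Mbps local Ethernet at the sender.
PACKET_BYTES = 512
BOTTLENECK_BPS = 56_000      # slowest link on the path
LOCAL_BPS = 10_000_000       # sender's local net

p_b = PACKET_BYTES * 8 / BOTTLENECK_BPS   # min packet spacing at the bottleneck
p_local = PACKET_BYTES * 8 / LOCAL_BPS    # spacing the sender could manage locally

print(f"P_b   = {p_b * 1000:.1f} ms per packet")
print(f"local = {p_local * 1000:.2f} ms per packet")
# Because acks come back with spacing A_s = P_b, an ack-clocked sender
# automatically slows to the bottleneck rate (about one packet per 73 ms
# here), even though its local net could ship a packet every ~0.4 ms.
```

The point of the sketch is only that the ack clock imposes the bottleneck's packet time on the sender with no explicit rate computation.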
But the same thing that makes a self-clocked system stable when it’s running makes it hard to start — to get data flowing there must be acks to clock out packets but to get acks there must be data flowing. To start the ‘clock’, we developed a slow-start algorithm to gradually increase the amount of data in-transit.2 Although we flatter ourselves that the design of this algorithm is rather subtle, the implementation is trivial — one new state variable and three lines of code in the sender:

2Slow-start is quite similar to the CUTE algorithm described in [14]. We didn’t know about CUTE at the time we were developing slow-start but we should have—CUTE preceded our work by several months. When describing our algorithm at the Feb., 1987, Internet Engineering Task Force (IETF) meeting, we called it soft-start, a reference to an electronics engineer’s technique to limit in-rush current. The name slow-start was coined by John Nagle in a message to the IETF mailing list in March, ’87. This name was clearly superior to ours and we promptly adopted it.
Figure 2: The Chronology of a Slow-start

[Figure: grey numbered boxes (packets) and white numbered boxes (acks) arranged in rows labeled 0R through 3R; one round trip time per row, one packet time per column.] The horizontal direction is time. The continuous time line has been chopped into one-round-trip-time pieces stacked vertically with increasing time going down the page. The grey, numbered boxes are packets. The white numbered boxes are the corresponding acks. As each ack arrives, two packets are generated: one for the ack (the ack says a packet has left the system so a new packet is added to take its place) and one because an ack opens the congestion window by one packet. It may be clear from the figure why an add-one-packet-to-window policy opens the window exponentially in time. If the local net is much faster than the long haul net, the ack’s two packets arrive at the bottleneck at essentially the same time. These two packets are shown stacked on top of one another (indicating that one of them would have to occupy space in the gateway’s outbound queue). Thus the short-term queue demand on the gateway is increasing exponentially and opening a window of size W packets will require W/2 packets of buffer capacity at the bottleneck.

• Add a congestion window, cwnd, to the per-connection state.
• When starting or restarting after a loss, set cwnd to one packet.
• On each ack for new data, increase cwnd by one packet.
• When sending, send the minimum of the receiver’s advertised window and cwnd.

Actually, the slow-start window increase isn’t that slow: it takes time R log2 W where R is the round-trip-time and W is the window size in packets (fig. 2). This means the window opens quickly enough to have a negligible effect on performance, even on links with a large bandwidth–delay product. And the algorithm guarantees that a connection will source data at a rate at most twice the maximum possible on the path.
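The four rules above can be sketched as sender-side code (a minimal sketch; apart from cwnd, the names and bookkeeping are ours, not the actual 4BSD TCP source):

```python
# Minimal sketch of the slow-start rules above. Window units are packets;
# everything except cwnd is an illustrative name, not 4BSD TCP code.

class SlowStartSender:
    def __init__(self, advertised_window: int):
        self.adv_wnd = advertised_window  # receiver's advertised window
        self.cwnd = 1                     # start (or restart) at one packet

    def on_loss(self) -> None:
        # Restarting after a loss: back to one packet.
        self.cwnd = 1

    def on_ack_for_new_data(self) -> None:
        # Each ack for new data opens the congestion window by one packet.
        self.cwnd += 1

    def send_window(self) -> int:
        # Never send more than min(advertised window, cwnd).
        return min(self.adv_wnd, self.cwnd)

s = SlowStartSender(advertised_window=32)
for _ in range(3):
    s.on_ack_for_new_data()
print(s.send_window())  # cwnd grew 1 -> 4; prints 4
```

Because every ack grows cwnd by one, each round trip roughly doubles the packets in flight, which is the exponential opening figure 2 illustrates.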
Without slow-start, by contrast, when 10 Mbps Ethernet hosts talk over the 56 Kbps Arpanet via IP gateways, the first-hop gateway sees a burst of eight packets delivered at 200 times the path bandwidth. This burst of packets often puts the connection into a persistent failure mode of continuous retransmissions (figures 3 and 4).

Figure 3: Startup behavior of TCP without Slow-start

[Figure: scatter plot of packet sequence number (KB, 0–70) against send time (0–10 sec).] Trace data of the start of a TCP conversation between two Sun 3/50s running Sun OS 3.5 (the 4.3BSD TCP). The two Suns were on different Ethernets connected by IP gateways driving a 230.4 Kbps point-to-point link (essentially the setup shown in fig. 7). The window size for the connection was 16KB (32 512-byte packets) and there were 30 packets of buffer available at the bottleneck gateway. The actual path contains six store-and-forward hops so the pipe plus gateway queue has enough capacity for a full window but the gateway queue alone does not. Each dot is a 512 data-byte packet. The x-axis is the time the packet was sent. The y-axis is the sequence number in the packet header. Thus a vertical array of dots indicates back-to-back packets and two dots with the same y but different x indicate a retransmit. ‘Desirable’ behavior on this graph would be a relatively smooth line of dots extending diagonally from the lower left to the upper right. The slope of this line would equal the available bandwidth. Nothing in this trace resembles desirable behavior. The dashed line shows the 20 KBps bandwidth available for this connection. Only 35% of this bandwidth was used; the rest was wasted on retransmits. Almost everything is retransmitted at least once and data from 54 to 58 KB is sent five times.

2 Conservation at equilibrium: round-trip timing

Once data is flowing reliably, problems (2) and (3) should be addressed. Assuming that the protocol implementation is correct, (2) must represent a failure of the sender’s retransmit timer. A good round trip time estimator, the core of the retransmit timer, is the single most
important feature of any protocol implementation that expects to survive heavy load. And it is frequently botched ([26] and [13] describe typical problems). One mistake is not estimating the variation, σR, of the round trip time, R. From queuing theory we know that R and the variation in R increase quickly with load. If the load is ρ (the ratio of average arrival rate to average departure rate), R and σR scale like (1 − ρ)⁻¹. To make this concrete, if the network is running at 75% of capacity, as the Arpanet was in last April’s collapse, one should expect round-trip-time to vary by a factor of sixteen (−2σ to +2σ).

Figure 4: Startup behavior of TCP with Slow-start

[Figure: scatter plot of packet sequence number (KB, 0–160) against send time (0–10 sec).] Same conditions as the previous figure (same time of day, same Suns, same network path, same buffer and window sizes), except the machines were running the 4.3+TCP with slow-start. No bandwidth is wasted on retransmits but two seconds is spent on the slow-start so the effective bandwidth of this part of the trace is 16 KBps — two times better than figure 3. (This is slightly misleading: Unlike the previous figure, the slope of the trace is 20 KBps and the effect of the 2 second offset decreases as the trace lengthens. E.g., if this trace had run a minute, the effective bandwidth would have been 19 KBps. The effective bandwidth without slow-start stays at 7 KBps no matter how long the trace.)
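The factor-of-sixteen claim above can be checked numerically, assuming the (1 − ρ)⁻¹ scaling and reading the −2σ to +2σ range as a 4σ spread:

```python
# Numeric check of the load-scaling claim (assumptions: sigma scales as
# 1/(1 - rho), and the quoted "factor" is the -2sigma..+2sigma spread).
rho = 0.75                 # network at 75% of capacity
scale = 1 / (1 - rho)      # sigma grows by this factor: 4
spread = 4 * scale         # -2sigma to +2sigma is a 4-sigma spread: 16
print(scale, spread)       # 4.0 16.0
```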
The TCP protocol specification [2] suggests estimating mean round trip time via the low-pass filter

R ← αR + (1 − α)M

where R is the average RTT estimate, M is a round trip time measurement from the most recently acked data packet, and α is a filter gain constant with a suggested value of 0.9. Once the R estimate is updated, the retransmit timeout interval, rto, for the next packet sent is set to βR.
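The two rules above transcribe directly (α = 0.9 and β = 2 are the values the text discusses; the function names are ours):

```python
# Sketch of the RFC793-style retransmit timer described above.
# ALPHA and BETA are the suggested constants; the rest is illustrative.
ALPHA = 0.9   # filter gain
BETA = 2.0    # RTT variation fudge factor

def update_rtt(r_avg: float, m: float) -> float:
    """Low-pass filter: R <- alpha*R + (1 - alpha)*M."""
    return ALPHA * r_avg + (1 - ALPHA) * m

def rto(r_avg: float) -> float:
    """Retransmit timeout for the next packet: beta * R."""
    return BETA * r_avg

r = 1.0                                   # current average RTT (seconds)
r = update_rtt(r, 2.0)                    # a 2-second measurement arrives
print(round(r, 3), round(rto(r), 3))      # 1.1 2.2
```

Note how slowly the average moves: a measurement twice the current estimate shifts R by only 10%, which is exactly why a fixed β gets into trouble when load (and hence variation) rises.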
Figure 5: Performance of an RFC793 retransmit timer

[Figure: per-packet RTT (0–12 sec) plotted against packet number (0–110); dots joined by a dashed line, with the RFC793 timer drawn as a solid line.] Trace data showing per-packet round trip time on a well-behaved Arpanet connection. The x-axis is the packet number (packets were numbered sequentially, starting with one) and the y-axis is the elapsed time from the send of the packet to the sender’s receipt of its ack. During this portion of the trace, no packets were dropped or retransmitted. The packets are indicated by a dot. A dashed line connects them to make the sequence easier to follow. The solid line shows the behavior of a retransmit timer computed according to the rules of RFC793.

The parameter β accounts for RTT variation (see [5], section 5). The suggested β = 2 can adapt to loads of at most 30%. Above this point, a connection will respond to load increases by retransmitting packets that have only been delayed in transit. This forces the network to do useless work, wasting bandwidth on duplicates of packets that will eventually be delivered, at a time when it’s known to be having trouble with useful work. I.e., this is the network equivalent of pouring gasoline on a fire.

We developed a cheap method for estimating variation (see appendix A)3 and the resulting retransmit timer essentially eliminates spurious retransmissions. A pleasant side effect of estimating β rather than using a fixed value is that low load as well as high load performance improves, particularly over high delay paths such as satellite links (figures 5 and 6).

Another timer mistake is in the backoff after a retransmit: If a packet has to be retransmitted more than once, how should the retransmits be spaced?
For a transport endpoint embedded in a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations, only one scheme has any hope of working—exponential backoff—but a proof of this is beyond the scope of this paper.4 3We are far from the first to recognize that transport needs to estimate both mean and variation. See, for example, [6]. But we do think our estimator is simpler than most. 4See [8]. Several authors have shown that backoffs ‘slower’ than exponential are stable given finite populations and knowledge of the global traffic. However, [17] shows that nothing slower than exponential behavior will work in the general case. To feed your intuition, consider that an IP gateway has essentially the same behavior as the ‘ether’ in an ALOHA net or Ethernet. Justifying exponential retransmit backoff is the same as
[Figure 6: Performance of a Mean+Variance retransmit timer (plot omitted). Same data as above but the solid line shows a retransmit timer computed according to the algorithm in appendix A.]

To finesse a proof, note that a network is, to a very good approximation, a linear system. That is, it is composed of elements that behave like linear operators — integrators, delays, gain stages, etc. Linear system theory says that if a system is stable, the stability is exponential. This suggests that an unstable system (a network subject to random load shocks and prone to congestive collapse5) can be stabilized by adding some exponential damping (exponential timer backoff) to its primary excitation (senders, traffic sources).

3 Adapting to the path: congestion avoidance

If the timers are in good shape, it is possible to state with some confidence that a timeout indicates a lost packet and not a broken timer. At this point, something can be done about (3). Packets get lost for two reasons: they are damaged in transit, or the network is congested and somewhere on the path there was insufficient buffer capacity. On most network paths, loss due to damage is rare (< 1%) so it is probable that a packet loss is due to congestion in the network.6

(Footnote 4, continued) showing that no collision backoff slower than an exponential will guarantee stability on an Ethernet. Unfortunately, with an infinite user population even exponential backoff won’t guarantee stability (although it ‘almost’ does—see [1]). Fortunately, we don’t (yet) have to deal with an infinite user population.

5The phrase congestion collapse (describing a positive feedback instability due to poor retransmit timers) is again the coinage of John Nagle, this time from [23].
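The exponential backoff policy argued for above is simple to state: each successive retransmission of the same packet doubles the timeout. A sketch, where the 64-second ceiling is an illustrative cap rather than a value from this paper:

```python
def backed_off_rto(rto, attempt, ceiling=64.0):
    """Exponential retransmit backoff: double the timeout for each
    successive retransmission of the same packet (attempt 0 is the
    first send). The 64 s ceiling is an illustrative choice."""
    return min(rto * (2 ** attempt), ceiling)

print([backed_off_rto(1.5, n) for n in range(4)])  # [1.5, 3.0, 6.0, 12.0]
```

In the linear-system view above, the doubling supplies the exponential damping needed to counter the network's exponential instability under load shocks.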
6Because a packet loss empties the window, the throughput of any window flow control protocol is quite sensitive to damage loss. For an RFC793 standard TCP running with window w (where w is at most the bandwidth-delay product), a loss probability of p degrades throughput by a factor of (1 + 2pw)^−1. E.g., a 1% damage loss rate on an Arpanet path (8 packet window) degrades TCP throughput by 14%. The congestion control scheme we propose is insensitive to damage loss until the loss rate is on the order of the window equilibration length (the number of packets it takes the window to regain its original size after a loss). If the pre-loss size is w, equilibration takes roughly w^2/3 packets so, for the Arpanet, the loss sensitivity
threshold is about 5%. At this high loss rate, the empty window effect described above has already degraded throughput by 44% and the additional degradation from the congestion avoidance window shrinking is the least of one’s problems. We are concerned that the congestion control noise sensitivity is quadratic in w but it will take at least another generation of network evolution to reach window sizes where this will be significant. If experience shows this sensitivity to be a liability, a trivial modification to the algorithm makes it linear in w. An in-progress paper explores this subject in detail.

7This is not an accident: We copied Jain’s scheme after hearing his presentation at [10] and realizing that the scheme was, in a sense, universal.

8See any good control theory text for the relationship between a system model and admissible controls for that system. A nice introduction appears in [21], chap. 8.

9I.e., the system behaves like Li ≈ γLi−1, a difference equation with the solution Ln = γ^n L0, which goes exponentially to infinity for any γ > 1.
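The arithmetic in footnote 6 can be checked directly: with p = 0.01 and w = 8, the factor (1 + 2pw)^−1 gives about a 14% throughput loss, and the equilibration length w^2/3 ≈ 21 packets gives the roughly 5% sensitivity threshold.

```python
p, w = 0.01, 8                       # 1% damage loss, 8 packet window
degradation = 1 - 1 / (1 + 2 * p * w)
print(round(degradation, 3))         # 0.138: the ~14% throughput loss

equilibration = w ** 2 / 3           # packets to regain the pre-loss window
print(round(1 / equilibration, 3))   # 0.047: the ~5% sensitivity threshold
```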
A ‘congestion avoidance’ strategy, such as the one proposed in [15], will have two components: The network must be able to signal the transport endpoints that congestion is occurring (or about to occur). And the endpoints must have a policy that decreases utilization if this signal is received and increases utilization if the signal isn’t received.

If packet loss is (almost) always due to congestion and if a timeout is (almost) always due to a lost packet, we have a good candidate for the ‘network is congested’ signal. Particularly since this signal is delivered automatically by all existing networks, without special modification (as opposed to [15] which requires a new bit in the packet headers and a modification to all existing gateways to set this bit).

The other part of a congestion avoidance strategy, the endnode action, is almost identical in the DEC/ISO scheme and our TCP7 and follows directly from a first-order time-series model of the network:8 Say network load is measured by average queue length over fixed intervals of some appropriate length (something near the round trip time). If Li is the load at interval i, an uncongested network can be modeled by saying Li changes slowly compared to the sampling time. I.e.,

Li = N

(N constant). If the network is subject to congestion, this zeroth order model breaks down. The average queue length becomes the sum of two terms, the N above that accounts for the average arrival rate of new traffic and intrinsic delay, and a new term that accounts for the fraction of traffic left over from the last time interval and the effect of this left-over traffic (e.g., induced retransmits):

Li = N + γLi−1

(These are the first two terms in a Taylor series expansion of L(t). There is reason to believe one might eventually need a three term, second order model, but not until the Internet has grown substantially.)
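Iterating the first-order model makes the two regimes visible; the values of N and γ below are arbitrary illustrations, not measurements.

```python
def queue_series(N, gamma, steps, L0=0.0):
    """Iterate the first-order model L_i = N + gamma * L_{i-1}."""
    L, series = L0, []
    for _ in range(steps):
        L = N + gamma * L
        series.append(L)
    return series

# Uncongested (gamma < 1): load settles near the fixed point N/(1-gamma).
print(round(queue_series(2.0, 0.5, 20)[-1], 3))   # 4.0
# Congested (gamma > 1): queue length grows exponentially.
print(queue_series(2.0, 1.5, 20)[-1] > 1000)      # True
```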
When the network is congested, γ must be large and the queue lengths will start increasing exponentially.9 The system will stabilize only if the traffic sources throttle back at least as quickly as the queues are growing. Since a source controls load in a window-based protocol by adjusting the size of the window, W, we end up with the sender policy

On congestion: Wi = dWi−1 (d < 1)

I.e., a multiplicative decrease of the window size (which becomes an exponential decrease over time if the congestion persists).
If there’s no congestion, γ must be near zero and the load approximately constant. The network announces, via a dropped packet, when demand is excessive but says nothing if a connection is using less than its fair share (since the network is stateless, it cannot know this). Thus a connection has to increase its bandwidth utilization to find out the current limit. E.g., you could have been sharing the path with someone else and converged to a window that gives you each half the available bandwidth. If she shuts down, 50% of the bandwidth will be wasted unless your window size is increased. What should the increase policy be?

The first thought is to use a symmetric, multiplicative increase, possibly with a longer time constant, Wi = bWi−1, 1 < b ≤ 1/d. This is a mistake. The result will oscillate wildly and, on the average, deliver poor throughput. The analytic reason for this has to do with the fact that it is easy to drive the net into saturation but hard for the net to recover (what [18], chap. 2.1, calls the rush-hour effect).10 Thus overestimating the available bandwidth is costly. But an exponential, almost regardless of its time constant, increases so quickly that overestimates are inevitable.

Without justification, we’ll state that the best increase policy is to make small, constant changes to the window size:11

On no congestion: Wi = Wi−1 + u (u ≪ Wmax)

where Wmax is the pipesize (the delay-bandwidth product of the path minus protocol overhead — i.e., the largest sensible window for the unloaded path). This is the additive increase / multiplicative decrease policy suggested in [15] and the policy we’ve implemented in TCP. The only difference between the two implementations is the choice of constants for d and u. We used 0.5 and 1 for reasons partially explained in appendix D. A more complete analysis is in yet another in-progress paper.
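With the constants just quoted (d = 0.5, u = 1), the policy can be sketched in a few lines; the per-ack 1/cwnd form is the one used in the TCP implementation, and spreading the additive increase over a window of acks opens the window by about one packet per round trip. This is an illustrative sketch, not the 4.3BSD code.

```python
def on_timeout(cwnd):
    """Multiplicative decrease (d = 0.5): halve the window on a
    timeout, with a floor of one packet."""
    return max(1.0, cwnd / 2)

def on_ack(cwnd):
    """Additive increase (u = 1), spread over a round trip: opening
    by 1/cwnd per ack opens the window by about one packet per RTT."""
    return cwnd + 1.0 / cwnd

cwnd = 8.0
for _ in range(8):                     # one window's worth of acks
    cwnd = on_ack(cwnd)
print(round(cwnd, 2))                  # 8.95: ~one packet larger per RTT

print(round(on_timeout(cwnd), 2))      # 4.48: halved on a timeout
```

The increase comes to slightly less than a full packet per round trip because each ack's increment shrinks as the window grows, which errs on the conservative side.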
The preceding has probably made the congestion control algorithm sound hairy but it’s not. Like slow-start, it’s three lines of code: • On any timeout, set cwnd to half the current window size (this is the multiplicative decrease). • On each ack for new data, increase cwnd by 1/cwnd (this is the additive increase).12 10In fig. 1, note that the ‘pipesize’ is 16 packets, 8 in each path, but the sender is using a window of 22 packets. The six excess packets will form a queue at the entry to the bottleneck and that queue cannot shrink, even though the sender carefully clocks out packets at the bottleneck link rate. This stable queue is another, unfortunate, aspect of conservation: The queue would shrink only if the gateway could move packets into the skinny pipe faster than the sender dumped packets into the fat pipe. But the system tunes itself so each time the gateway pulls a packet off the front of its queue, the sender lays a new packet on the end. A gateway needs excess output capacity (i.e., ρ < 1) to dissipate a queue and the clearing time will scale like (1−ρ)−2 ([18], chap. 2 is an excellent discussion of this). Since at equilibrium our transport connection ‘wants’ to run the bottleneck link at 100% (ρ = 1), we have to be sure that during the non-equilibrium window adjustment, our control policy allows the gateway enough free bandwidth to dissipate queues that inevitably form due to path testing and traffic fluctuations. By an argument similar to the one used to show exponential timer backoff is necessary, it’s possible to show that an exponential (multiplicative) window increase policy will be ‘faster’ than the dissipation time for some traffic mix and, thus, leads to an unbounded growth of the bottleneck queue. 11See [4] for a complete analysis of these increase and decrease policies. Also see [8] and [9] for a control-theoretic analysis of a similar class of control policies. 12This increment rule may be less than obvious. 
We want to increase the window by at most one packet over a time interval of length R (the round trip time). To make the algorithm ‘self-clocked’, it’s better to increment by a small amount on each ack rather than by a large amount at the end of the interval. (Assuming, of course