Resilient Overlay Networks David Andersen, Hari Balakrishnan, Frans Kaashoek, and Robert Morris MIT Laboratory for Computer Science ron@nms.lcs.mit.edu http://nms.lcs.mit.edu/ron/ Abstract A Resilient Overlay Network (RON) is an architecture that allows distributed Internet applications to detect and recover from path outages and periods of degraded performance within several seconds, improving over today’s wide-area routing protocols that take at least several minutes to recover. A RON is an application-layer overlay on top of the existing Internet routing substrate. The RON nodes monitor the functioning and quality of the Internet paths among themselves, and use this information to decide whether to route packets directly over the Internet or by way of other RON nodes, optimizing application-specific routing metrics. Results from two sets of measurements of a working RON deployed at sites scattered across the Internet demonstrate the benefits of our architecture. For instance, over a 64-hour sampling period in March 2001 across a twelve-node RON, there were 32 significant outages, each lasting over thirty minutes, over the 132 measured paths. RON’s routing mechanism was able to detect, recover, and route around all of them, in less than twenty seconds on average, showing that its methods for fault detection and recovery work well at discovering alternate paths in the Internet. Furthermore, RON was able to improve the loss rate, latency, or throughput perceived by data transfers; for example, about 5% of the transfers doubled their TCP throughput and 5% of our transfers saw their loss probability reduced by 0.05. We found that forwarding packets via at most one intermediate RON node is sufficient to overcome faults and improve performance in most cases. These improvements, particularly in the area of fault detection and recovery, demonstrate the benefits of moving some of the control over routing into the hands of end-systems. 1. 
Introduction The Internet is organized as independently operating autonomous systems (AS’s) that peer together. In this architecture, detailed routing information is maintained only within a single AS and its constituent networks, usually operated by some network service provider. (This research was sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Space and Naval Warfare Systems Center, San Diego, under contract N66001-00-1-8933.) The information shared with other providers and AS’s is heavily filtered and summarized using the Border Gateway Protocol (BGP-4) running at the border routers between AS’s [21], which allows the Internet to scale to millions of networks. This wide-area routing scalability comes at the cost of reduced fault-tolerance of end-to-end communication between Internet hosts. This cost arises because BGP hides many topological details in the interests of scalability and policy enforcement, has little information about traffic conditions, and damps routing updates when potential problems arise to prevent large-scale oscillations. As a result, BGP’s fault recovery mechanisms sometimes take many minutes before routes converge to a consistent form [12], and there are times when path outages even lead to significant disruptions in communication lasting tens of minutes or more [3, 18, 19]. The result is that today’s Internet is vulnerable to router and link faults, configuration errors, and malice—hardly a week goes by without some serious problem affecting the connectivity provided by one or more Internet Service Providers (ISPs) [15]. Resilient Overlay Networks (RONs) are a remedy for some of these problems. Distributed applications layer a “resilient overlay network” over the underlying Internet routing substrate. The nodes comprising a RON reside in a variety of routing domains, and cooperate with each other to forward data on behalf of any pair of communicating nodes in the RON.
Because AS’s are independently administered and configured, and routing domains rarely share interior links, they generally fail independently of each other. As a result, if the underlying topology has physical path redundancy, RON can often find paths between its nodes, even when wide-area Internet routing protocols like BGP-4 cannot. The main goal of RON is to enable a group of nodes to communicate with each other in the face of problems with the underlying Internet paths connecting them. RON detects problems by aggressively probing and monitoring the paths connecting its nodes. If the underlying Internet path is the best one, that path is used and no other RON node is involved in the forwarding path. If the Internet path is not the best one, the RON will forward the packet by way of other RON nodes. In practice, we have found that RON can route around most failures by using only one intermediate hop. RON nodes exchange information about the quality of the paths among themselves via a routing protocol and build forwarding tables based on a variety of path metrics, including latency, packet loss rate, and available throughput. Each RON node obtains the path metrics using a combination of active probing experiments and passive observations of ongoing data transfers. In our implementation, each RON is explicitly designed to be limited in size—between two and fifty nodes—to facilitate aggressive path maintenance via probing without excessive bandwidth overhead. This
allows RON to recover from problems in the underlying Internet in several seconds rather than several minutes.

[Figure 1: The current sixteen-node RON deployment. Five sites are at universities in the USA, two are European universities (not shown), three are “broadband” home Internet hosts connected by Cable or DSL, one is located at a US ISP, and five are at corporations in the USA. Node labels: CCI, Aros, Utah, CMU, MIT, MA-Cable, Cisco, Cornell, NYU, NC-Cable, OR-DSL, CA-T1, PDI, Mazu; links to vu.nl and Lulea.se.]

The second goal of RON is to integrate routing and path selection with distributed applications more tightly than is traditionally done. This integration includes the ability to consult application-specific metrics in selecting paths, and the ability to incorporate application-specific notions of what network conditions constitute a “fault.” As a result, RONs can be used in a variety of ways. A multimedia conferencing program may link directly against the RON library, transparently forming an overlay between all participants in the conference, and using loss rates, delay jitter, or application-observed throughput as metrics on which to choose paths. An administrator may wish to use a RON-based router application to form an overlay network between multiple LANs as an “Overlay VPN.” This idea can be extended further to develop an “Overlay ISP,” formed by linking (via RON) points of presence in different traditional ISPs after buying bandwidth from them. Using RON’s routing machinery, an Overlay ISP can provide more resilient and failure-resistant Internet service to its customers. The third goal of RON is to provide a framework for the implementation of expressive routing policies, which govern the choice of paths in the network. For example, RON facilitates classifying packets into categories that could implement notions of acceptable use, or enforce forwarding rate controls.
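The application-specific path selection described above can be sketched in a few lines. This is an illustrative sketch only: the metric structure, node names, and numbers are assumptions for the example, not RON’s actual API or measurements.

```python
# Sketch: application-specific path selection over an overlay.
# Illustrative only; metric fields and values are assumptions, not RON's API.

from dataclasses import dataclass

@dataclass
class PathMetrics:
    latency_ms: float       # estimated round-trip latency
    loss_rate: float        # estimated packet loss probability (0.0 - 1.0)
    throughput_kbps: float  # estimated available TCP throughput

# Candidate routes from node A to node B: the direct Internet path and
# two one-intermediate-hop overlay paths (hypothetical measurements).
paths = {
    "direct":    PathMetrics(latency_ms=45, loss_rate=0.20, throughput_kbps=300),
    "via-node1": PathMetrics(latency_ms=60, loss_rate=0.01, throughput_kbps=900),
    "via-node2": PathMetrics(latency_ms=30, loss_rate=0.05, throughput_kbps=500),
}

# Each application ranks the same candidate paths by its own metric.
bulk_transfer = max(paths, key=lambda p: paths[p].throughput_kbps)  # "via-node1"
interactive   = min(paths, key=lambda p: paths[p].latency_ms)       # "via-node2"
low_loss      = min(paths, key=lambda p: paths[p].loss_rate)        # "via-node1"
```

Note how the latency-optimizing and throughput-optimizing applications select different overlay paths between the same pair of nodes, mirroring the observation that different metrics can favor different routes.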
This paper describes the design and implementation of RON, and presents several experiments that evaluate whether RON is a good idea. To conduct this evaluation and demonstrate the benefits of RON, we have deployed a working sixteen-node RON at sites sprinkled across the Internet (see Figure 1). The RON client we experiment with is a resilient IP forwarder, which allows us to compare connections between pairs of nodes running over a RON against running straight over the Internet. We have collected a few weeks’ worth of experimental results of path outages and performance failures and present a detailed analysis of two separate datasets: RON1 with twelve nodes measured in March 2001 and RON2 with sixteen nodes measured in May 2001. In both datasets, we found that RON was able to route around between 60% and 100% of all significant outages. Our implementation takes 18 seconds, on average, to detect and route around a path failure and is able to do so in the face of an active denial-of-service attack on a path. We also found that these benefits of quick fault detection and successful recovery are realized on the public Internet and do not depend on the existence of non-commercial or private networks (such as the Internet2 backbone that interconnects many educational institutions); our ability to determine this was enabled by RON’s policy routing feature that allows the expression and implementation of sophisticated policies that determine how paths are selected for packets. We also found that RON successfully routed around performance failures: in RON1, the loss probability improved by at least 0.05 in 5% of the samples, end-to-end communication latency reduced by 40ms in 11% of the samples, and TCP throughput doubled in 5% of all samples. 
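For intuition on how active probing can detect failures on this time-scale, consider a back-of-the-envelope sketch. The probe interval, timeout, and loss count below are assumed values chosen for illustration; they are not RON’s actual parameters.

```python
# Sketch: probe-timeout failure detection (illustrative parameters, not RON's).
# A node probes each peer periodically; several consecutive unanswered probes
# trigger a route change around the failed path.

PROBE_INTERVAL_S = 4    # assumed: seconds between probes to a peer
PROBE_TIMEOUT_S = 3     # assumed: how long to wait for a probe reply
CONSECUTIVE_LOSSES = 4  # assumed: lost probes before declaring the path dead

def worst_case_detection_s(interval, timeout, losses):
    # After the first probe is lost, (losses - 1) further probe intervals
    # must elapse, plus the timeout on the final unanswered probe.
    return (losses - 1) * interval + timeout

print(worst_case_detection_s(PROBE_INTERVAL_S, PROBE_TIMEOUT_S,
                             CONSECUTIVE_LOSSES))  # prints 15
```

With constants in this range, detection completes within roughly 15 to 20 seconds of a failure, consistent with the recovery times reported above.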
In addition, we found cases when RON’s loss, latency, and throughput-optimizing path selection mechanisms all chose different paths between the same two nodes, suggesting that application-specific path selection techniques are likely to be useful in practice. A noteworthy finding from the experiments and analysis is that in most cases, forwarding packets via at most one intermediate RON node is sufficient both for recovering from failures and for improving communication latency. 2. Related Work To our knowledge, RON is the first wide-area network overlay system that can detect and recover from path outages and periods of degraded performance within several seconds. RON builds on previous studies that quantify end-to-end network reliability and performance, on IP-based routing techniques for fault-tolerance, and on overlay-based techniques to enhance performance. 2.1 Internet Performance Studies Labovitz et al. [12] use a combination of measurement and analysis to show that inter-domain routers in the Internet may take tens of minutes to reach a consistent view of the network topology after a fault, primarily because of routing table oscillations during BGP’s rather complicated path selection process. They find that during this period of “delayed convergence,” end-to-end communication is adversely affected. In fact, outages on the order of minutes cause active TCP connections (i.e., connections in the ESTABLISHED state with outstanding data) to terminate when TCP does not receive an acknowledgment for its outstanding data. They also find that, while part of the convergence delays can be fixed with changes to the deployed BGP implementations, long delays and temporary oscillations are a fundamental consequence of the BGP path vector routing protocol. 
Paxson’s probe experiments show that routing pathologies prevent selected Internet hosts from communicating up to 3.3% of the time averaged over a long time period, and that this percentage has not improved with time [18]. Labovitz et al. find, by examining routing table logs at Internet backbones, that 10% of all considered routes were available less than 95% of the time, and that less than 35% of all routes were available more than 99.99% of the time [13]. Furthermore, they find that about 40% of all path outages take more than 30 minutes to repair and are heavy-tailed in their duration. More recently, Chandra et al. find, using active probing, that 5% of all detected failures last more than 10,000 seconds (2 hours, 45 minutes), and that failure durations are heavy-tailed and can last for as long as 100,000 seconds before being repaired [3]. These findings do not augur well for mission-critical services that require a higher degree of end-to-end communication availability. The Detour measurement study made the observation, using Paxson’s and their own data collected at various times between 1995 and 1999, that path selection in the wide-area Internet is suboptimal from the standpoint of end-to-end latency, packet loss rate, and TCP throughput [23]. This study showed the potential long-term benefits of “detouring” packets via a third node by comparing
the long-term average properties of detoured paths against Internet- phasis on high performance packet classification and routing. It uses IP-in-IP encapsulation to send packets along alternate paths While ron shares with detour the idea of routing via other 2.2 Network-layer Techniques nodes, our work differs from Detour in three significant ways. First, Much work has been done on performance-based and fault- RON seeks to prevent disruptions in end-to-end communication in tolerant routing within a single routing domain, but practical mecl the face of failures. RoN takes advantage of underlying Internet anisms for wide-area Internet recovery from outages or badly per- path redundancy on time-scales of a few seconds, reacting respon- forming paths are lacking sively to path outages and performance failures. Second,RON As though today's wide-area BGP-4 routing is based largely on designed as an application-controlled routing overlay; because each hop-counts, early ARPANET routing was more dynamic, re RON is more closely tied to the application using it, RON more ding to the current delay and utilization of the network. by readily integrates application-specific path metrics and path selec- 1989, the ARPANET evolved to using a delay- and congestion- tion policies. Third, we present and analyze experimental results based distributed shortest path routing algorithm [11]. However, from a real-world deployment of a ron to demonstrate fast re- the diversity and size of today's decentralized Internet necessitated covery from failure and improved latency and loss-rates even over the deployment of protocols that perform more aggregation and short time -scales fewer updates. 
As a result, unlike some interior routing protocols An alternative design to ron would be to use a generic overlay within ASS, BGP-4 routing between As's optimizes for scalable infrastructure like the X-Bone and port a standard network routing operation over all el protocol (like OSPF or RIP) with low timer values. However, this By treating vast collections of subnetworks as a single entity for by itself will not improve the resilience of Internet communications global routing purposes, BGP-4 is able to summarize and aggregate for two reasons. First, a reliable and low-overhead outage detection enormous amounts of routing information into a format that scales module is required, to distinguish between packet losses caused by to hundreds of millions of hosts. To prevent costly route oscilla congestion or error-prone links from legitimate problems with a tions, BGP-4 explicitly damps changes in routes. Unfortunately, path. Second, generic network-level routing protocols do not utilize while aggregation and damping provide good scalability, they in application-specific definitions of faults terfere with rapid detection and recovery when faults occur. RON Various Content Delivery Networks(CDNs) use overlay tech- handles this by leaving scalable operation to the underlying Inter- niques and caching to improve the performance of content del het substrate moving fault detection and recovery to a higher layer for specific applications such as Http and streaming video The overlay that is capable of faster response because it does not have functionality provided by RON may ease future CDn development to worry about scalabil by providing some routing components required by these services An oft-cited"solution" to achieving fault-tolerant network con- advertising a customer network through multiple ISPs. The idea 3. Design Goals is that an outage in one ISP would leave the customer connected The design of RoN seeks to meet three main design goals: (i) via the other. 
However, this solution does not generally achieve failure detection and recovery in less than 20 seconds, (ii) tighter fault detection and recovery within several seconds because of the integration of routing and path selection with the application, and degree of aggregation used to achieve wide-area routing scalabil (ii) expressive policy routing cept routing announcements for fewer than 8192 contiguous ad- 3.1 Fast Failure Detection and recovery dresses(a"/19 netblock). Small companies, regardless of their Todays wide-area Internet routing system based on BGP-4 does fault-tolerance needs, do not often require such a large address not handle failures well. From a network perspective, we define block, and cannot effectively multi-home. One alternative may be two kinds of failures. Link failures occur when a router or a link rovider-based addressing, where an organization gets addresses rom multiple providers, but this requires handling two distinct sets problem, or link disconnection. Path failures occur for a variety of of addresses on its hosts. It is unclear how on-going connections reasons, including denial-of-service attacks or other bursts of traffic on one address set can seamlessly switch on a failure in this model that cause a high degree of packet loss or high, variable latencies 2.3 Overlay-based Techniques Applications perceive all failures in one of two ways: outages or performance failures. Link failures and extreme path failures cause Overlay networks are an old idea; in fact, the Internet itself was outages, when the average packet loss rate over a sustained period developed as an overlay on the telephone network. 
Several Inter- net overlays have been designed in the past for various purpose tocols including TCP to degrade by several orders of magnitude including providing OSI network-layer connectivity [10), easing Performance failures are less extreme, for example, throughput, la IP multicast deployment using the MBone [6, and providing IPv6 tency, or loss-rates might degrade by a factor of two or three onnectivity using the 6-Bone [9). The X-Bone is a recent infras- BGP-4 takes a long time on the order of several minutes, to con- tructure project designed to speed the deployment of IP-based over- verge to a new valid route after a link failure causes an outage [ 12] lay networks [26]. It provides management functions and mecha- In contrast, RON's goal is to detect and recover from outages and nisms to insert packets into the overlay, but does not yet support performance failures within several seconds. Compounding this fault-tolerant operation or application-controlled path selection. Few overlay networks have been designed for efficient fault de such as packet foods and persistent congestion on links or paths tection and recovery, although some have been designed for better that greatly degrade end-to-end performance. As long as a link is end-to-end performance. The Detour framework [5, 22] was mo- deemed"live"(i.e, the BGP session is still alive), BGP's AS-path architecture designed to support altermate-hop routing, with an em- for an application usikk may nop ackets down the faulty path tivated by the potential long-term performance benefits of indirect based routing will continue to rout routing [23]. It is an in-kernel packet encapsulation and routing fortunately, such a path ovide adequate performance
the long-term average properties of detoured paths against Internetchosen paths. 2.2 Network-layer Techniques Much work has been done on performance-based and faulttolerant routing within a single routing domain, but practical mechanisms for wide-area Internet recovery from outages or badly performing paths are lacking. Although today’s wide-area BGP-4 routing is based largely on AS hop-counts, early ARPANET routing was more dynamic, responding to the current delay and utilization of the network. By 1989, the ARPANET evolved to using a delay- and congestionbased distributed shortest path routing algorithm [11]. However, the diversity and size of today’s decentralized Internet necessitated the deployment of protocols that perform more aggregation and fewer updates. As a result, unlike some interior routing protocols within AS’s, BGP-4 routing between AS’s optimizes for scalable operation over all else. By treating vast collections of subnetworks as a single entity for global routing purposes, BGP-4 is able to summarize and aggregate enormous amounts of routing information into a format that scales to hundreds of millions of hosts. To prevent costly route oscillations, BGP-4 explicitly damps changes in routes. Unfortunately, while aggregation and damping provide good scalability, they interfere with rapid detection and recovery when faults occur. RON handles this by leaving scalable operation to the underlying Internet substrate, moving fault detection and recovery to a higher layer overlay that is capable of faster response because it does not have to worry about scalability. An oft-cited “solution” to achieving fault-tolerant network connectivity for a small- or medium-sized customer is to multi-home, advertising a customer network through multiple ISPs. The idea is that an outage in one ISP would leave the customer connected via the other. 
However, this solution does not generally achieve fault detection and recovery within several seconds because of the degree of aggregation used to achieve wide-area routing scalability. To limit the size of their routing tables, many ISPs will not accept routing announcements for fewer than 8192 contiguous addresses (a "/19" netblock). Small companies, regardless of their fault-tolerance needs, do not often require such a large address block, and cannot effectively multi-home. One alternative may be "provider-based addressing," where an organization gets addresses from multiple providers, but this requires handling two distinct sets of addresses on its hosts. It is unclear how on-going connections on one address set can seamlessly switch on a failure in this model.

2.3 Overlay-based Techniques

Overlay networks are an old idea; in fact, the Internet itself was developed as an overlay on the telephone network. Several Internet overlays have been designed in the past for various purposes, including providing OSI network-layer connectivity [10], easing IP multicast deployment using the MBone [6], and providing IPv6 connectivity using the 6-Bone [9]. The X-Bone is a recent infrastructure project designed to speed the deployment of IP-based overlay networks [26]. It provides management functions and mechanisms to insert packets into the overlay, but does not yet support fault-tolerant operation or application-controlled path selection. Few overlay networks have been designed for efficient fault detection and recovery, although some have been designed for better end-to-end performance. The Detour framework [5, 22] was motivated by the potential long-term performance benefits of indirect routing [23]. It is an in-kernel packet encapsulation and routing architecture designed to support alternate-hop routing, with an emphasis on high-performance packet classification and routing. It uses IP-in-IP encapsulation to send packets along alternate paths.
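The alternate-path encapsulation idea behind Detour (and behind RON's entry-node tagging, described later) can be sketched in a few lines. This is an illustrative layout only: the header format and field names below are hypothetical, not Detour's or RON's actual wire format.

```python
import struct

# Hypothetical overlay header: (exit_node_ip, flow_tag) prepended to the
# original packet, so an intermediate overlay node can forward the packet
# toward the exit node without re-running path selection.
HDR_FMT = "!4sI"  # 4-byte IPv4 address + 32-bit flow tag, network byte order


def encapsulate(payload: bytes, exit_ip: bytes, flow_tag: int) -> bytes:
    """Prepend the overlay header at the entry node."""
    return struct.pack(HDR_FMT, exit_ip, flow_tag) + payload


def decapsulate(packet: bytes):
    """Strip the overlay header at an intermediate or exit node."""
    hdr_len = struct.calcsize(HDR_FMT)
    exit_ip, flow_tag = struct.unpack(HDR_FMT, packet[:hdr_len])
    return exit_ip, flow_tag, packet[hdr_len:]


pkt = encapsulate(b"hello", b"\x12\x1a\x00\x01", flow_tag=7)
exit_ip, tag, payload = decapsulate(pkt)
```

An intermediate node only inspects the fixed-size header; the inner packet is carried opaquely, which is what lets an overlay route "via" another node over the ordinary IP substrate.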
While RON shares with Detour the idea of routing via other nodes, our work differs from Detour in three significant ways. First, RON seeks to prevent disruptions in end-to-end communication in the face of failures. RON takes advantage of underlying Internet path redundancy on time-scales of a few seconds, reacting responsively to path outages and performance failures. Second, RON is designed as an application-controlled routing overlay; because each RON is more closely tied to the application using it, RON more readily integrates application-specific path metrics and path selection policies. Third, we present and analyze experimental results from a real-world deployment of a RON to demonstrate fast recovery from failure and improved latency and loss-rates even over short time-scales. An alternative design to RON would be to use a generic overlay infrastructure like the X-Bone and port a standard network routing protocol (like OSPF or RIP) with low timer values. However, this by itself will not improve the resilience of Internet communications for two reasons. First, a reliable and low-overhead outage detection module is required, to distinguish between packet losses caused by congestion or error-prone links from legitimate problems with a path. Second, generic network-level routing protocols do not utilize application-specific definitions of faults. Various Content Delivery Networks (CDNs) use overlay techniques and caching to improve the performance of content delivery for specific applications such as HTTP and streaming video. The functionality provided by RON may ease future CDN development by providing some routing components required by these services.

3. Design Goals

The design of RON seeks to meet three main design goals: (i) failure detection and recovery in less than 20 seconds; (ii) tighter integration of routing and path selection with the application; and (iii) expressive policy routing.
3.1 Fast Failure Detection and Recovery

Today's wide-area Internet routing system based on BGP-4 does not handle failures well. From a network perspective, we define two kinds of failures. Link failures occur when a router or a link connecting two routers fails because of a software error, hardware problem, or link disconnection. Path failures occur for a variety of reasons, including denial-of-service attacks or other bursts of traffic that cause a high degree of packet loss or high, variable latencies. Applications perceive all failures in one of two ways: outages or performance failures. Link failures and extreme path failures cause outages, when the average packet loss rate over a sustained period of several minutes is high (about 30% or higher), causing most protocols including TCP to degrade by several orders of magnitude. Performance failures are less extreme; for example, throughput, latency, or loss-rates might degrade by a factor of two or three. BGP-4 takes a long time, on the order of several minutes, to converge to a new valid route after a link failure causes an outage [12]. In contrast, RON's goal is to detect and recover from outages and performance failures within several seconds. Compounding this problem, IP-layer protocols like BGP-4 cannot detect problems such as packet floods and persistent congestion on links or paths that greatly degrade end-to-end performance. As long as a link is deemed "live" (i.e., the BGP session is still alive), BGP's AS-path-based routing will continue to route packets down the faulty path; unfortunately, such a path may not provide adequate performance for an application using it.
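The outage and performance-failure definitions above can be stated as a tiny classifier. The 30% sustained-loss threshold and the factor-of-two degradation come from the text; the function name and interface are our own illustration, not part of RON.

```python
def classify_path(loss_rate: float, baseline_tput: float, current_tput: float) -> str:
    """Classify a path's condition using the Section 3.1 definitions
    (illustrative): sustained loss of roughly 30% or more behaves like an
    outage for most protocols, including TCP; a factor-of-two-or-three
    degradation in throughput is a performance failure."""
    if loss_rate >= 0.30:
        return "outage"
    if current_tput * 2 <= baseline_tput:  # degraded by a factor of two or more
        return "performance failure"
    return "normal"
```

The point of the taxonomy is that both conditions are invisible to BGP-4 as long as the BGP session stays up, which is why RON measures them directly.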
Figure 2: Internet interconnections are often complex. The dotted links are private and are not announced globally.

3.2 Tighter Integration with Applications

Failures and faults are application-specific notions: network conditions that are fatal for one application may be acceptable for another, more adaptive one. For instance, a UDP-based Internet audio application not using good packet-level error correction may not work at all at loss rates larger than 10%. At this loss rate, a bulk transfer application using TCP will continue to work because of TCP's adaptation mechanisms, albeit at lower performance. However, at loss rates of 30% or more, TCP becomes essentially unusable because it times out for most packets [16]. RON allows applications to independently define and react to failures. In addition, applications may prioritize some metrics over others (e.g., latency over throughput, or low loss over latency) in their path selection. They may also construct their own metrics to select paths. A routing system may not be able to optimize all of these metrics simultaneously; for example, a path with a one-second latency may appear to be the best throughput path, but this degree of latency may be unacceptable to an interactive application. Currently, RON's goal is to allow applications to influence the choice of paths using a single metric. We plan to explore multi-criteria path selection in the future.

3.3 Expressive Policy Routing

Despite the need for policy routing and enforcement of acceptable use and other policies, today's approaches are primitive and cumbersome. For instance, BGP-4 is incapable of expressing fine-grained policies aimed at users or hosts.
This lack of precision not only reduces the set of paths available in the case of a failure, but also inhibits innovation in the use of carefully targeted policies, such as end-to-end per-user rate controls or enforcement of acceptable use policies (AUPs) based on packet classification. Because RONs will typically run on relatively powerful end-points, we believe they are well-suited to providing fine-grained policy routing. Figure 2 shows the AS-level network connectivity between four of our RON hosts; the full graph for (only) 12 hosts traverses 36 different autonomous systems. The figure gives a hint of the considerable underlying path redundancy available in the Internet (the reason RON works) and shows situations where BGP's blunt policy expression inhibits fail-over. For example, if the Aros-UUNET connection failed, users at Aros would be unable to reach MIT even if they were authorized to use Utah's network resources to get there. This is because it is impossible to announce a BGP route only to particular users, so the Utah-MIT link is kept completely private.

Figure 3: The RON system architecture. Data enters the RON from RON clients via a conduit at an entry node. At each node, the RON forwarder consults with its router to determine the best path for the packet, and sends it to the next node. Path selection is done at the entry node, which also tags the packet, simplifying the forwarding path at other nodes. When the packet reaches the RON exit node, the forwarder there hands it to the appropriate output conduit, which passes the data to the client. To choose paths, RON nodes monitor the quality of their virtual links using active probing and passive observation. RON nodes use a link-state routing protocol to disseminate the topology and virtual-link quality of the overlay network.

4.
Design

The conceptual design of RON, shown in Figure 3, is quite simple. RON nodes, deployed at various locations on the Internet, form an application-layer overlay to cooperatively route packets for each other. Each RON node monitors the quality of the Internet paths between it and the other nodes, and uses this information to intelligently select paths for packets. Each Internet path between two nodes is called a virtual link. To discover the topology of the overlay network and obtain information about all virtual links in the topology, every RON node participates in a routing protocol to exchange information about a variety of quality metrics. Most of RON's design supports routing through multiple intermediate nodes, but our results (Section 6) show that using at most one intermediate RON node is sufficient most of the time. Therefore, parts of our design focus on finding better paths via a single intermediate RON node.

4.1 Software Architecture

Each program that communicates with the RON software on a node is a RON client. The overlay network is defined by a single group of clients that collaborate to provide a distributed service or application. This group of clients can use service-specific routing metrics when deciding how to forward packets in the group. Our design accommodates a variety of RON clients, ranging from a generic IP packet forwarder that improves the reliability of IP packet delivery, to a multi-party conferencing application that incorporates application-specific metrics in its route selection. A RON client interacts with RON across an API called a conduit, which the client uses to send and receive packets. On the data path, the first node that receives a packet (via the conduit) classifies it to determine the type of path it should use (e.g., low-latency, high-throughput, etc.).
This node is called the entry node: it determines a path from its topology table, encapsulates the packet into a RON header, tags it with some information that simplifies forwarding by downstream RON nodes, and forwards it on. Each subsequent RON node simply determines the next forwarding hop based on the destination address and the tag. The final RON node that delivers the packet to the RON application is called the exit node. The conduits access RON via two functions:

1. send(pkt, dst, via_ron) allows a node to forward a packet to a destination RON node either along the RON or
using the direct Internet path. RON's delivery, like UDP, is best-effort and unreliable.

2. recv(pkt, via_ron) is a callback function that is called when a packet arrives for the client program. This callback is invoked after the RON conduit matches the type of the packet in the RON header to the set of types pre-registered by the client when it joins the RON. The RON packet type is a demultiplexing field for incoming packets.

The basic RON functionality is provided by the forwarder object, which implements the above functions. It also provides a timer registration and callback mechanism to perform periodic operations, and a similar service for network socket data availability. Each client must instantiate a forwarder and hand to it two modules: a RON router and a RON membership manager. The RON router implements a routing protocol. The RON membership manager implements a protocol to maintain the list of members of a RON. By default, RON provides a few different RON router and membership manager modules for clients to use. RON routers and membership managers exchange packets using RON as their forwarding service, rather than over direct IP paths. This feature of our system is beneficial because it allows these messages to be forwarded even when some underlying IP paths fail.

4.2 Routing and Path Selection

Routing is the process of building up the forwarding tables that are used to choose paths for packets. In RON, the entry node has more control over subsequent path selection than in traditional datagram networks. This node tags the packet's RON header with an identifier that identifies the flow to which the packet belongs; subsequent routers attempt to keep a flow ID on the same path it first used, barring significant link changes. Tagging, like the IPv6 flow ID, helps support multi-hop routing by speeding up the forwarding path at intermediate nodes.
It also helps tie a packet flow to a chosen path, making performance more predictable, and provides a basis for future support of multi-path routing in RON. By tagging at the entry node, the application is given maximum control over what the network considers a "flow." The small size of a RON relative to the Internet allows it to maintain information about multiple alternate routes and to select the path that best suits the RON client according to a client-specified routing metric. By default, it maintains information about three specific metrics for each virtual link: (i) latency, (ii) packet loss rate, and (iii) throughput, as might be obtained by a bulk-transfer TCP connection between the end-points of the virtual link. RON clients can override these defaults with their own metrics, and the RON library constructs the appropriate forwarding table to pick good paths. The router builds up forwarding tables for each combination of policy routing and chosen routing metric.

4.2.1 Link-State Dissemination

The default RON router uses a link-state routing protocol to disseminate topology information between routers, which in turn is used to build the forwarding tables. Each node in an N-node RON has N − 1 virtual links. Each node's router periodically requests summary information of the different performance metrics to the N − 1 other nodes from its local performance database and disseminates its view to the others. This information is sent via the RON forwarding mesh itself, to ensure that routing information is propagated in the event of path outages and heavy loss periods. Thus, the RON routing protocol is itself a RON client, with a well-defined RON packet type. This leads to an attractive property: the only time a RON router has incomplete information about any other one is when all paths in the RON from the other RON nodes to it are unavailable.
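The link-state dissemination loop of Section 4.2.1 can be sketched as follows. The message layout, class names, and sequence-number scheme are hypothetical illustrations; RON's actual protocol carries per-metric summaries in its own packet type over the overlay mesh.

```python
from dataclasses import dataclass


@dataclass
class LinkSummary:
    """Summary metrics for one virtual link from the announcing node
    (illustrative fields, mirroring RON's latency/loss/throughput metrics)."""
    dst: str
    latency_ms: float
    loss_rate: float
    tput_score: float


@dataclass
class LinkStateUpdate:
    """One announcement: the N-1 link summaries of node `src`."""
    src: str
    seq: int
    links: list  # list[LinkSummary]


class Topology:
    """Each router keeps the newest update per node. Because updates travel
    over the RON mesh itself, a node's view of another goes stale only when
    *all* overlay paths from that node are down."""

    def __init__(self):
        self.latest = {}  # src -> LinkStateUpdate

    def apply(self, upd: LinkStateUpdate) -> bool:
        cur = self.latest.get(upd.src)
        if cur is not None and upd.seq <= cur.seq:
            return False  # stale or duplicate: do not re-flood
        self.latest[upd.src] = upd
        return True       # new information: forward over the RON mesh
```

The boolean returned by apply() stands in for the usual flooding rule: re-forward an update to the other overlay nodes only when it carries news.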
4.2.2 Path Evaluation and Selection

The RON routers need an algorithm to determine if a path is still alive, and a set of algorithms with which to evaluate potential paths. The responsibility of these metric evaluators is to provide a number quantifying how "good" a path is according to that metric. These numbers are relative, and are only compared to other numbers from the same evaluator. The two important aspects of path evaluation are the mechanism by which the data for two links are combined into a single path, and the formula used to evaluate the path.

Every RON router implements outage detection, which it uses to determine if the virtual link between it and another node is still working. It uses an active probing mechanism for this. On detecting the loss of a probe, the normal low-frequency probing is replaced by a sequence of consecutive probes, sent in relatively quick succession spaced by PROBE_TIMEOUT seconds. If OUTAGE_THRESH probes in a row elicit no response, then the path is considered "dead." If even one of them gets a response, then the subsequent higher-frequency probes are canceled. Paths experiencing outages are rated on their packet loss rate history; a path having an outage will always lose to a path not experiencing an outage. The OUTAGE_THRESH and the frequency of probing (PROBE_INTERVAL) permit a trade-off between outage detection time and the bandwidth consumed by the (low-frequency) probing process (Section 6.2 investigates this).

By default, every RON router implements three different routing metrics: the latency-minimizer, the loss-minimizer, and the TCP throughput-optimizer. The latency-minimizer forwarding table is computed by computing an exponential weighted moving average (EWMA) of round-trip latency samples with parameter α. For any link l, its latency estimate lat_l is updated as:

    lat_l ← α · lat_l + (1 − α) · new_sample_l    (1)

We use α = 0.9, which means that 10% of the current latency estimate is based on the most recent sample.
This number is similar to the values suggested for TCP’s round-trip time estimator [20]. For a RON path, the overall latency is the sum of the individual virtual link latencies: latpath = P l2path latl . To estimate loss rates, RON uses the average of the last k = 100 probe samples as the current average. Like Floyd et al. [7], we found this to be a better estimator than EWMA, which retains some memory of samples obtained in the distant past as well. It might be possible to further improve our estimator by unequally weighting some of the k samples [7]. Loss metrics are multiplicative on a path: if we assume that losses are independent, the probability of success on the entire path is roughly equal to the probability of surviving all hops individually: lossratepath = 1 l2path(1 lossratel). RON does not attempt to find optimal throughput paths, but strives to avoid paths of low throughput when good alternatives are available. Given the time-varying and somewhat unpredictable nature of available bandwidth on Internet paths [2, 19], we believe this is an appropriate goal. From the standpoint of improving the reliability of path selection in the face of performance failures, avoiding bad paths is more important than optimizing to eliminate small throughput differences between paths. While a characterization of the utility received by programs at different available bandwidths may help determine a good path selection threshold, we believe that more than a 50% bandwidth reduction is likely to reduce the utility of many programs. This threshold also falls outside the typical variation observed on a given path over time-scales of tens of min-
utes [2]. We therefore concentrate on avoiding throughput faults of this order of magnitude.

Throughput-intensive applications typically use TCP or TCP-like congestion control, so the throughput optimizer focuses on this type of traffic. The performance of a bulk TCP transfer is a function of the round-trip latency and the packet loss rate it observes. Throughput optimization combines the latency and loss metrics using a simplified version of the TCP throughput equation [16], which provides an upper-bound on TCP throughput. Our granularity of loss rate detection is 1%, and the throughput equation is more sensitive at lower loss rates. We set a minimum packet loss rate of 2% to prevent an infinite bandwidth and to prevent large route oscillations from single packet losses. The formula used is:

    score = √1.5 / (rtt · √p)    (2)

where p is the one-way end-to-end packet loss probability and rtt is the end-to-end round-trip time estimated from the hop-by-hop samples as described above. The non-linear combination of loss and round-trip time makes this scoring difficult to optimize with standard shortest-path algorithms; for instance, the TCP throughput between nodes A and C via a node B is not usually the smaller of the TCP throughputs between A and B and B and C. While more complicated search algorithms can be employed to handle this, RON routing currently takes advantage of the resulting simplicity when only single-intermediate paths are considered to obtain throughput-optimized paths.

Our probe samples give us the two-way packet loss probability, from which we estimate the one-way loss probability by (unrealistically, for some paths) assuming symmetric loss rates and solving the resulting quadratic equation. Assuming that losses are independent on the two paths A → B and B → A, then the maximum absolute error in the one-way loss rate estimate occurs when all of the loss is in only one direction, resulting in an error equal to loss_two-way / 2.
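As a rough sketch of the two calculations above (in Python; the paper's implementation is in C++, and the function names here are illustrative, not from the paper): the one-way loss estimate solves (1 − p)² = 1 − loss_two-way under the symmetric-loss assumption, and the throughput score applies Equation 2 with the 2% loss floor.

```python
import math

MIN_LOSS = 0.02  # 2% floor: avoids infinite scores and single-loss route oscillation


def one_way_loss(two_way_loss):
    """Estimate one-way loss assuming symmetric, independent losses.

    Two-way success probability is (1 - p)^2, so solving the quadratic
    gives p = 1 - sqrt(1 - two_way_loss).
    """
    return 1.0 - math.sqrt(1.0 - two_way_loss)


def throughput_score(two_way_loss, rtt):
    """Equation 2 score: sqrt(1.5) / (rtt * sqrt(p)), with p floored at 2%."""
    p = max(one_way_loss(two_way_loss), MIN_LOSS)
    return math.sqrt(1.5) / (rtt * math.sqrt(p))
```

A loss-free path is thus scored as if it had 2% loss, so two high-quality links are compared on round-trip time alone.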
Assuming a minimum loss rate of 2% ensures that, when choosing between two high-quality links, our loss rate estimate is within 50% of the true value. At large loss rates, highly asymmetric loss patterns cause this method to disregard a potentially good path because the reverse direction of that path has a high loss rate. However, the benefits of avoiding the high-loss path seem to outweigh the cost of missing one good link; we intend to remedy this problem soon by explicitly estimating one-way loss rates.

This method of throughput comparison using Equation 2 is not without faults. For instance, a slow but relatively unused bottleneck link on a path would almost never observe packet losses, and none of our probes would be lost. The equation we use would predict an unreasonably high bandwidth estimate. One way of tackling this would be to send pairs of probe packets to estimate an upper-bound on link bandwidth, while another might be to use non-invasive probes to predict throughput [8].

Oscillating rapidly between multiple paths is harmful to applications sensitive to packet reordering or delay jitter. While each of the evaluation metrics applies some smoothing, this is not enough to avoid "flapping" between two nearly equal routes: RON routers therefore employ hysteresis. Based on an analysis of 5000 snapshots from a RON node's link-state table, we chose to apply a simple 5% hysteresis bonus to the "last good" route for the three metrics. This simple method appears to provide a reasonable trade-off between responsiveness to changes in the underlying paths and unnecessary route flapping.

4.2.3 Performance Database

Figure 4: The RON performance database.

To make good routing decisions RON needs to have detailed performance information.
It is impractical to send large performance histories to all participants in the RON. A reliable performance repository must handle participants that crash, reboot, or rejoin the RON. Measurement data is often noisy, and different clients may have different ways in which they will use that data; for instance, an outage detector may want to know if any packets were successfully sent in the last 15 seconds, but a throughput improver may be interested in a longer-term packet loss average. Therefore, the system needs a flexible summarization mechanism.

To support these requirements, each RON node or local group of nodes uses a separate performance database to store samples (Figure 4). The database is a generalization of SPAND [24], supporting different data types and summarization mechanisms. Sources of information and clients that use it agree on a class name for the data and the meaning of the values inserted into the database. The probes insert data into the database by calling pdb_update(class, src, dst, type, value). The src and dst fields contain the IP addresses of the sampled data.

4.3 Policy Routing

RON allows users or administrators to define the types of traffic allowed on particular network links. In traditional Internet policy routing, "type" is typically defined only by the packet's source and destination addresses [4]; RON generalizes this notion to include other information about the packet. RON separates policy routing into two components: classification and routing table formation.

When a packet enters the RON, it is classified and given a policy tag; this tag is used to perform lookups in the appropriate set of routing tables at each RON router. A separate set of routing tables is constructed for each policy by re-running the routing computation, removing the links disallowed by the corresponding policy. The routing computation computes the shortest path from the RON node to all other nodes, for any given metric.
This construction applies to multi-hop or single-hop indirection; because a standard shortest-paths algorithm may not work for all metrics (specifically, TCP throughput), our implementation, described in Section 5.2, is specific to single-hop indirection. The result of running the table construction is the set of multi-level routing tables shown in Figure 5.

The construction of these routing tables and the production of packet tags is facilitated by the policy classifier component. The policy classifier produces the permits function that determines if a given policy is allowed to use a particular virtual link. It also provides a conduit-specific data classifier module that helps the RON node decide which policy is appropriate for the incoming packet. For example, if packets from a commercial site are not allowed to go across the Internet2 educational backbone, and we received a packet from such a site at an MIT RON node, the data classifier
module would determine which policy to use and set the corresponding tag on the packet. Subsequent RON nodes examine the policy tag instead of reclassifying the packet.

Figure 5: The forwarding control and data paths. The forwarder passes the packet header to the routing table, which first checks if a valid flow cache entry exists. If not, it performs three levels of lookups. The first level is based on the policy type, the second is based on the routing preference, and the third is a hash lookup of next hops indexed by destination.

Figure 6: The RON packet header, containing the version, hop limit, routing flags, flow ID, RON source and destination addresses, source and destination ports, packet type, and policy tag. The routing flags and flow ID are set by the input conduit. The packet type is a demultiplexing key indicating the appropriate conduit or protocol at the receiver.

We have designed two policy mechanisms: exclusive cliques and general policies. In an exclusive clique, only data originating from and destined to other members of the clique may traverse inter-clique links. This mimics, for instance, the "educational only" policy on the Internet2 backbone. This policy classifier takes only a list of networks and subnet masks to match against. Our general policy component is more powerful; it accepts a BPF-like packet matcher [14], and a list of links that are denied by this policy. It returns the first policy that matches a packet's fields to the stored BPF-based information. We do not currently address policy composition, although users may create composed versions of their own policies using the general policy component. In Section 7, we discuss possible abuse of AUPs and network transit policies.

4.4 Data Forwarding

The forwarder at each RON node examines every incoming packet to determine if it is destined for a local client or a remote destination.
If it requires further delivery, the forwarder passes the RON packet header to the routing table, as shown in Figure 5. The RON packet header is shown in Figure 6. It is inspired by the design of IPv6 [17]. Because RON needs to support more than simply IP-in-IP encapsulation, RON uses its own header. RON does not fragment packets, but does inform applications if they exceed the Maximum Transmission Unit (MTU). As in IPv6, applications must perform end-to-end path MTU discovery. RON also provides a policy tag that is interpreted by the forwarders to decide which network routing policies apply to the packet. If the packet is destined for the local node, the forwarder uses the packet type field to demultiplex the packet to the RON client.

If the packet's flow ID has a valid flow cache entry, the forwarder short-cuts the routing process with this entry. Otherwise, the routing table lookup occurs in three stages. The first stage examines the policy tag, and locates the proper routing preference table. There is one routing preference table for each known policy tag. Policy adherence supersedes all other routing preferences. Next, the lookup procedure examines the routing preference flags to find a compatible route selection metric for the packet. The flags are examined from the right, and the first flag understood by the router directs the packet to the next table. All RON routers understand the basic system metrics of latency, loss, and throughput; this provides a way for a shared RON environment to support users who may also have client-defined metrics, and still provide them with good default metrics.

A lookup in the routing preference table leads to a hash table of next-hops based upon the destination RON node. The entry in the next-hop table is returned to the forwarder, which places the packet on the network destined for the next hop.
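The three-stage lookup above can be sketched as follows (a Python illustration with a hypothetical table layout; the actual router is a C++ component):

```python
def route_lookup(tables, policy_tag, pref_flags, dest):
    """Sketch of the three-stage routing lookup.

    `tables` maps policy tag -> {metric name -> {destination -> next hop}}.
    Stage 1: the policy tag selects a routing preference table, so policy
             adherence supersedes all other preferences.
    Stage 2: preference flags are scanned from the right; the first metric
             this router understands selects a metric table.
    Stage 3: a hash lookup on the destination yields the next hop.
    """
    pref_table = tables[policy_tag]          # stage 1: policy demux
    for metric in reversed(pref_flags):      # stage 2: rightmost flag first
        if metric in pref_table:             # first understood flag wins
            return pref_table[metric][dest]  # stage 3: next-hop hash table
    raise KeyError("no compatible routing metric for packet")
```

Scanning the flags for the first understood metric is what lets a router that knows only latency, loss, and throughput still forward packets tagged with client-defined metrics it does not implement.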
4.5 Bootstrap and Membership Management

In addition to allowing clients to define their own membership mechanisms, RON provides two system membership managers: a simple static membership mechanism that loads its peers from a file, and a dynamic announcement-based, soft-state membership protocol. To bootstrap the dynamic membership protocol, a new node needs to know the identity of at least one peer in the RON. The new node uses this neighbor to broadcast its existence using a flooder, which is a special RON client that implements a general-purpose, resilient flooding mechanism using a RON forwarder.

The main challenge in the dynamic membership protocol is to avoid confusing a path outage to a node with that node having left the RON. Each node builds up and periodically (every five minutes on average in our implementation) floods to all other nodes its list of peer RON nodes. If a node has not heard about a peer in sixty minutes, it assumes that the peer is no longer participating in the RON. Observe that this mechanism allows two nodes in the same RON to have a non-functioning direct Internet path for long periods of time: as long as there is some path in the RON between the two nodes, neither will think that the other has left.

The overhead of broadcasting is minuscule compared to the traffic caused by active probing and routing updates, especially given the limited size of a RON. The resulting robustness is high: each node receives up to N − 1 copies of the peer list sent from a given other node. This redundancy causes a node to be deleted from another node's view of the RON only if the former node is genuinely partitioned for over an hour from every other node in the RON. A RON node will re-bootstrap from a well-known peer after a long partition.

5. Implementation

The RON system is implemented at user-level as a flexible set of C++ libraries to which user-level RON clients link. Each client can pick and choose the components that best suit its needs.
To provide a specific example of a client using the RON library, we describe the implementation of a resilient IP forwarder. This forwarder improves IP packet delivery without any modification to the transport protocols and applications running at end-nodes. The architecture of the resilient IP forwarder is shown in Figure 7. RON uses UDP to forward data, since TCP's reliable byte-stream is mismatched to many RON clients, and IP-based encapsulation
would restrict its application-specific forwarding capabilities. The RON core services run without special kernel support or elevated privileges.

Figure 7: The resilient IP forwarder. Packets enter the system through FreeBSD's divert sockets. They are handed to the RON forwarder, which looks up the next hop. When the packet arrives at its destination, the conduit writes the packet out on a local raw socket, where it re-enters IP processing.

The conduit at the entry node is the client's gateway to the RON, and classifies every packet. This classification is client-specific; it labels the packet with information that decides what routing metric is later used to route the packet. As an example, a RON IP forwarder conduit might label DNS and HTTP traffic as "latency-sensitive" and FTP traffic as "throughput-intensive," which would cause the downstream RON forwarders to use the appropriate routing metric for each packet type. Non-entry RON nodes route the packet based only on the attached label and destination of the packet.

This design ensures that all client- and application-specific routing functions occur only at the entry and exit conduits, and that forwarding inside the RON is independent of client-specific logic. In addition to reducing the per-packet overhead at intermediate nodes, it also localizes the client-specific computation required on each packet. This means, for example, that a RON node implementing an IP forwarder may also participate in an overlay with a conferencing application, and help improve the reliability of data delivery for the conference.
The conference node conduits implement the application-specific methods that label packets to tell the IP forwarder which routing metrics to use for conference packets.

5.1 The IP Forwarder

We implemented the resilient IP forwarder using FreeBSD's divert sockets to automatically send IP traffic over the RON, and emit it at the other end. The resilient IP forwarder provides classification, encapsulation, and decapsulation of IP packets through a special conduit called the ip_conduit (top of Figure 7).

5.2 Routers

RON routers implement the router virtual interface, which has only a single function call, lookup(pkt *mypkt). The RON library provides a trivial static router, and a dynamic router that routes based upon different metric optimizations. The dynamic router is extensible by linking it with a different set of metric descriptions. Metric descriptions provide an evaluation function that returns the "score" of a link, and a list of metrics that the routing table needs to generate and propagate. The implementation's routing table creation is specific to single-hop indirection, which also eliminates the need for the flow cache. The following algorithm fills in the multi-level routing table at the bottom of Figure 7:

    MAKEROUTINGTABLE(POLICIES, METRICS, PEERS)
      foreach P in POLICIES
        foreach M in METRICS
          foreach Dest in PEERS
            best_score = 0
            foreach Hop in PEERS
              if P.permits(me, Hop) AND P.permits(Hop, Dest) {
                sc = M.eval(me, Hop, Dest)
                if sc > best_score {
                  best_score = sc
                  next_hop = Hop
                }
              }
            table[P][M][Dest] = next_hop

We have implemented the clique classifier discussed in Section 4.3, and are implementing a general policy classifier. Both provide classifiers for the resilient IP forwarder.

5.3 Monitoring Virtual Links

Figure 8: The active prober's probing mechanism.
With three packets, both participants get an RTT sample without requiring synchronized clocks.

Each RON node in an N-node RON monitors its N − 1 virtual links using randomized periodic probes. The active prober component maintains a copy of a peers table with a next probe time field per peer. When this field expires, the prober sends a small UDP probe packet to the remote peer. Each probe packet has a random 64-bit ID. The process used by the prober is shown in Figure 8. When a node receives an initial probe request from a peer, it sends response 1 to that peer, and resets its probe timer for that peer. When the originating node sees response 1, it sends response 2 back to the peer, so that both sides get reachability and RTT information from 3 packets.

The probing protocol is implemented as a RON client, which communicates with the performance database (implemented as a standalone application running on the Berkeley DB3 backend) using a simple UDP-based protocol.
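The three-packet exchange can be modeled as a small state machine (an illustrative Python sketch with hypothetical names, not the C++ prober): each side timestamps with its own clock when it sends, and closes out the sample when the matching response arrives, so no clock synchronization is needed.

```python
import random


class Prober:
    """Sketch of the three-packet probe: initial ping, response 1, response 2.

    Both sides record an RTT sample computed purely from their local clock.
    """

    def __init__(self, clock):
        self.clock = clock      # callable returning this node's local time
        self.pending = {}       # probe ID -> local send time
        self.samples = []       # (peer, rtt) reachability successes

    def send_probe(self, peer):
        probe_id = random.getrandbits(64)   # random 64-bit probe ID
        self.pending[probe_id] = self.clock()
        return ("initial", probe_id, peer)

    def on_packet(self, kind, probe_id, peer):
        if kind == "initial":
            # Receiver answers immediately; its own RTT sample will come
            # from response 2, so it starts a timer of its own.
            self.pending[probe_id] = self.clock()
            return ("response1", probe_id, peer)
        # response 1 or response 2: close out our pending sample.
        rtt = self.clock() - self.pending.pop(probe_id)
        self.samples.append((peer, rtt))
        if kind == "response1":
            return ("response2", probe_id, peer)
        return None  # response 2 ends the exchange
```

Running the exchange with the local times from Figure 8 (A sends at 10 and hears back at 15; B replies at 33 and hears back at 39, on its own clock) yields RTT samples of 5 and 6 respectively.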
RON was able to successfully detect and recover from 100% (in RON1) and 60% (in RON2) of all complete outages and all periods of sustained high loss rates of 30% or more. (Section 6.2)
RON takes 18 seconds, on average, to route around a failure and can do so in the face of a flooding attack. (Section 6.2)
RON successfully routed around bad throughput failures, doubling TCP throughput in 5% of all samples. (Section 6.3)
In 5% of the samples, RON reduced the loss probability by 0.05 or more. (Section 6.3)
Single-hop route indirection captured the majority of benefits in our RON deployment, for both outage recovery and latency optimization. (Section 6.4)

Table 1: Major results from measurements of the RON testbed.

6. Evaluation

The goal of the RON system is to overcome path outages and performance failures, without introducing excessive overhead or new failure modes. In this section, we present an evaluation of how well RON meets these goals. We evaluate the performance of the resilient IP forwarder, which uses RON for outage detection and loss, latency, and throughput optimization. Our evaluation has three main parts. First, we study RON's ability to detect outages and recover quickly from them. Next, we investigate performance failures and RON's ability to improve the loss rate, latency, and throughput of badly performing paths. Finally, we investigate two important aspects of RON's routing, showing the effectiveness of its one-intermediate-hop strategy compared to more general alternatives and the stability of RON-generated routes. Table 1 summarizes our key findings.

6.1 Methodology

Most of our results come from experiments with a wide-area RON deployed at several Internet sites. There are N(N−1) different paths between the hosts in an N-site RON deployment. Table 2 shows the location of sites for our experiments. We analyze two distinct datasets: RON1 with N = 12 nodes and 132 distinct paths, and RON2 with N = 16 nodes and 240 distinct paths.
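The path counts follow directly from counting ordered pairs of distinct hosts; a trivial check (num_paths is our own name):

```python
# An N-site RON measures every ordered pair of distinct hosts,
# so the number of paths is N*(N-1).
def num_paths(n):
    return n * (n - 1)

print(num_paths(12), num_paths(16))   # → 132 240, matching RON1 and RON2
```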
In RON1, traceroute data shows that 36 different AS's were traversed, with at least 74 distinct inter-AS links; for RON2, 50 AS's and 118 inter-AS links were traversed. The O(N²) scaling of path diversity suggests that even small numbers of confederating hosts expose a large number of Internet paths to examination [18, 19]. We do not claim that our experiments and results are typical or representative of anything other than our deployment, but present and analyze them to demonstrate the kinds of gains one might realize with RONs.

Several of our host sites are Internet2-connected educational sites, and we found that this high-speed experimental network presents opportunities for path improvement not available in the public Internet. To demonstrate that our policy routing module works, and to make our measurements closer to what Internet hosts in general would observe, all the measurements reported herein were taken with a policy that prohibited sending traffic to or from commercial sites over the Internet2. Therefore, for instance, a packet could travel from Utah to Cornell to NYU, but not from Aros to Utah to NYU. This is consistent with the AUP of Internet2 that precludes commercial traffic. As a result, in our measurements, all path improvements involving a commercial site came from only commercial links, and never from the more reliable Internet2 links.

  Name       Description
  Aros       ISP in Salt Lake City, UT
  CCI        .com in Salt Lake City, UT
  Cisco-MA   .com in Waltham, MA
  * CMU      Pittsburgh, PA
  * Cornell  Ithaca, NY
  Lulea      Lulea University, Sweden
  MA-Cable   MediaOne Cable in Cambridge, MA
  * MIT      Cambridge, MA
  CA-T1      .com in Foster City, CA
  * NYU      New York, NY
  * Utah     Salt Lake City, UT
  VU-NL      Vrije Univ., Amsterdam, Netherlands
  Additional hosts used in dataset RON2:
  OR-DSL     DSL in Corvallis, OR
  NC-Cable   MediaOne Cable in Durham, NC
  PDI        .com in Palo Alto, CA
  Mazu       .com in Boston, MA

Table 2: The hosts in our sixteen-node RON deployment, which we study in detail to determine the effectiveness of RON in practice. Asterisks indicate U.S. universities on the Internet2 backbone. The two European universities, Lulea and VU-NL, are classified as non-Internet2.

The raw measurement data used in this paper consists of probe packets, throughput samples, and traceroute results. To probe, each RON node independently repeated the following steps:

1. Pick a random node, j.
2. Pick a probe-type from one of {direct, latency, loss} using round-robin selection. Send a probe to node j.
3. Delay for a random time interval between 1 and 2 seconds.

This characterizes all N(N−1) paths. For RON1, we analyze 10.9 million packet departure and arrival times, collected over 64 hours between 21 March 2001 and 23 March 2001. This produces 2.6 million individual RTT, one-way loss, and jitter data samples, for which we calculated time-averaged samples averaged over a 30-minute duration.¹ We also took 8,855 throughput samples from 1 MByte bulk transfers (or 30 seconds, whichever came sooner), recording the time at which each power of two's worth of data was sent and the duration of the transfer. RON2 data was collected over 85 hours from 7 May 2001 to 11 May 2001, and consists of 13.8 million packet arrival and departure times. Space constraints allow us to present only the path outage analysis of RON2, but the performance failure results from RON2 are similar to RON1 [1].

Most of the RON hosts were Intel Celeron/733-based machines running FreeBSD, with 256MB RAM and 9GB of disk space. The processing capability of a host was never a bottleneck in any of our wide-area experiments. Our measurement data is available at http://nms.lcs.mit.edu/ron/.

6.2 Overcoming Path Outages

Precisely measuring a path outage is harder than one might think. One possibility is to define an outage as the length of time during which no packets get through the path.
¹Some hosts came and went during the study, so the number of samples in the averages is not always as large as one might expect if all hosts were continually present. More details are available in the documentation accompanying the data.

As a yardstick, we could use ranges of time in which TCP implementations will time out and shut down a connection; these vary from about 120 seconds
(e.g., some Solaris versions) to 511 seconds (e.g., most BSDs). This gives us a metric that characterizes when batch applications will fail; from our own experience, we believe that most users running interactive applications have a far lower threshold for declaring their Internet connectivity "dead."

The problem with this small time-scale definition is that robustly measuring it and differentiating between a high packet loss rate and true disconnections is hard. However, since distributed applications suffer in either case, we assert that it is operationally useful to define a path outage as the length of time over which the packet loss rate is larger than some threshold. We define outage(τ, p) = 1 if the observed packet loss rate averaged over an interval τ is larger than p on the path, and 0 otherwise. For values of τ on the order of several minutes, a measured value of p larger than 30% degrades TCP performance by orders of magnitude, by forcing TCP into frequent timeout-based retransmissions [16].

[Figure 9: Packet loss rate averaged over 30-minute intervals for direct Internet paths vs. RON paths for the RON1 dataset. There are 32 points above the p = 0.3 horizontal line, and 20 points above p = 0.5, including 10 overlapping points at (0, 1). In contrast, RON's loss-optimizing router avoided these failures and never experienced a 30-minute loss rate larger than 30%.]

How often is outage(τ, p) = 1, and how often is RON able to route around these situations? Figure 9 shows a scatterplot of the improvement in loss rate, averaged over τ = 1800 s, achieved by RON for the RON1 measurements. To identify outages, consider the 32 points above the p = 0.3 line parallel to the horizontal axis, which signifies a condition bad enough to kill most applications.
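The outage predicate can be sketched directly from its definition (the function names and the boolean-list representation of one τ-second probe window are our own; RON computes loss rates from its probe history):

```python
def loss_rate(lost):
    """Fraction of probes lost; `lost` is a list of booleans (True = lost)."""
    return sum(lost) / len(lost)

def outage(lost, p=0.30):
    """outage(tau, p): 1 if the loss rate averaged over the interval
    (the `lost` list stands in for one tau-second window) exceeds p,
    and 0 otherwise."""
    return 1 if loss_rate(lost) > p else 0

window = [True] * 40 + [False] * 60   # 40% loss over one 30-minute window
print(outage(window, p=0.30), outage(window, p=0.50))   # → 1 0
```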
There are no points to the right of the p = 0.3 line parallel to the vertical axis. The scatterplot conceals most of the data in the lower-left corner; we will revisit this data in the form of a CDF (Figure 11) in the next section.

The precise number of times outage(τ, p) was equal to 1 for τ = 1800 s is shown in Table 3. These statistics are obtained by calculating 13,650 30-minute loss-rate averages of a 51-hour subset of the RON1 packet trace, involving 132 different communication paths. We count a "RON Win" if the time-averaged loss rate on the Internet was ≥ p% and the loss rate with RON was < p%; "No Change" and "RON Loss" are analogously defined. We find that the number of complete communication outages was 10 in this dataset, which means that there were 10 instances when the sampled paths had a 100% loss rate. The numbers in this table are not the number of link or routing failures observed in the Internet across the sampled paths; the number of such failures could have been lower (e.g., multiple 30-minute averaged samples with a 100% loss rate may have been the result of the same problem) or higher (e.g., a time-averaged loss rate of 50% could have resulted from multiple link outages of a few minutes each).

We emphasize that Internet2 paths were never used to improve a non-Internet2 connection's performance. In fact, the vast majority of the problems corrected by RON involved only commercial Internet connections, as is shown [in brackets] by the number of outages when we remove all Internet2 paths from consideration.

  Loss Rate   RON Win     No Change   RON Loss
  10%         526 [517]   58 [51]     47 [45]
  20%         142 [140]   4 [3]       15 [15]
  30%         32 [32]     0           0
  40%         23 [23]     0           0
  50%         20 [20]     0           0
  60%         19 [19]     0           0
  70%         15 [15]     0           0
  80%         14 [14]     0           0
  90%         12 [12]     0           0
  100%        10 [10]     0           0

Table 3: Outage data for RON1. A "RON win" at p% means that the loss rate of the direct Internet path was ≥ p% and the RON loss rate was < p%.
Numbers in brackets show the contribution to the total outage number after eliminating all the (typically more reliable) Internet2 paths, which reflects the public Internet better.

The numbers and percentage of outages for RON2 (Table 4) are noticeably higher than in RON1, showing the variability of path reliability in the Internet today. RON2 had 34,000 30-minute samples, about 2.5× more samples than RON1.

  Loss Rate   RON Win   No Change   RON Loss
  10%         557       165         113
  20%         168       112         33
  30%         131       84          18
  40%         110       75          7
  50%         106       69          7
  60%         100       62          5
  70%         93        57          1
  80%         87        54          0
  90%         85        48          2
  100%        67        45          1

Table 4: Outage data for RON2.

These results show that RON offers substantial improvements during a large fraction of outages, but is not infallible in picking the best path at lower outage rates when p is between 0 and 20%. However, in RON1, RON's outage detection and path selection machinery was able to successfully route around all the outage situations! This is especially revealing because it suggests that all the outages in RON1 were not on "edge" links connecting the site to the Internet, but elsewhere where path diversity allowed RON to provide connectivity. In RON2, about 60% of the serious outage situations were overcome; the remaining 40% were almost all due to individual sites being unreachable from any other site in the RON.²

²The one situation where RON made things worse at a 100% loss