Automated Worm Fingerprinting Sumeet Singh, Cristian Estan, George Varghese and Stefan Savage Department of Computer Science and Engineering University of California, San Diego Abstract Network worms are a clear and growing threat to the security of today’s Internet-connected hosts and networks. The combination of the Internet’s unrestricted connectivity and widespread software homogeneity allows network pathogens to exploit tremendous parallelism in their propagation. In fact, modern worms can spread so quickly, and so widely, that no human-mediated reaction can hope to contain an outbreak. In this paper, we propose an automated approach for quickly detecting previously unknown worms and viruses based on two key behavioral characteristics – a common exploit sequence together with a range of unique sources generating infections and destinations being targeted. More importantly, our approach – called “content sifting” – automatically generates precise signatures that can then be used to filter or moderate the spread of the worm elsewhere in the network. Using a combination of existing and novel algorithms we have developed a scalable content sifting implementation with low memory and CPU requirements. Over months of active use at UCSD, our Earlybird prototype system has automatically detected and generated signatures for all pathogens known to be active on our network as well as for several new worms and viruses which were unknown at the time our system identified them. Our initial experience suggests that, for a wide range of network pathogens, it may be practical to construct fully automated defenses – even against so-called “zero-day” epidemics. 1 Introduction In the last three years, large-scale Internet worm outbreaks have profoundly demonstrated the threat posed by self-propagating programs. The combination of widespread software homogeneity and the Internet’s unrestricted communication model creates an ideal climate for infectious pathogens. 
Worse, each new epidemic has demonstrated increased speed, virulence or sophistication over its predecessors. While the Code Red worm took over fourteen hours to infect its vulnerable population in 2001, the Slammer worm, released some 18 months later, did the same in under 10 minutes [22, 21]. The Code Red worm is thought to have infected roughly 360,000 hosts, while, by some estimates, the Nimda worm compromised over two million [8]. While early worms typically spread by a single mechanism and did little else, modern variants such as SoBig.F and MyDoom can spread through multiple vectors and have added backdoors, mail-relays and denial-of-service attacks to their payloads. Unfortunately, our current ability to defend against these outbreaks is extremely poor and has not advanced significantly since the Code Red episode in mid-2001. In fact, the basic approach of detection, characterization, and containment has not changed significantly over the last five years. Typically, new worms are detected in an ad hoc fashion by a combination of intrusion detection systems and administrator legwork. Then, after isolating an instance of the worm, skilled security professionals manually characterize a worm signature and finally, this signature is used to contain subsequent infections via updates to anti-virus software and network filtering products. While this approach is qualitatively sound, it is quantitatively insufficient. Manual signature extraction is an expensive, slow procedure that can take hours or even days to complete. It requires isolating a new worm, decompiling it, looking for invariant code sequences and testing the signature for uniqueness. However, recent simulations by Moore et al. suggest that effective worm containment can require a reaction time of well under sixty seconds [23]. More concretely, consider that in the time it took to read this section, the Slammer worm had contacted well over a billion distinct Internet hosts.
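The sixty-second reaction-time requirement can be made concrete with a toy growth model. This is our own illustration; the contact rate, population size, and one-second time steps are assumed parameters, not figures from Moore et al.'s simulations:

```python
# Toy logistic worm-growth model (illustrative assumptions, not Moore et
# al.'s simulation): each infected host recruits beta new victims per
# second from the remaining susceptible population.
def infected_after(seconds, beta=0.1, population=360_000, i0=1):
    i = float(i0)
    for _ in range(seconds):
        i += beta * i * (1 - i / population)  # logistic growth, 1s steps
    return int(i)

# Under these assumed parameters, a 60-second reaction catches the
# outbreak while it is still small; an hour-long one finds the entire
# vulnerable population already compromised.
early = infected_after(60)
late = infected_after(3600)
```

Even this crude model shows why exponential early growth makes manual, hours-long signature extraction ineffective: almost all the damage occurs before a human can react.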
This paper investigates the challenges in addressing this problem and describes a prototype system, called Earlybird, that can automatically detect and contain new worms on the network using precise signatures. Our approach, which we call content sifting, is based on two observations: first, that some portion of the content in existing worms is invariant – typically the code exploiting a latent host vulnerability – and second, that the spreading dynamics of a worm are atypical of Internet applications. Simply stated, it is rare to observe the same string recurring within packets sent from many sources to many destinations. By sifting through network traffic for content strings that are both frequently repeated and widely dispersed, we can automatically identify new worms and their precise signatures. In our prototype system, we have developed approximate versions of this algorithm that are amenable to high-speed implementation. In live experiments on a portion of the UCSD campus network, we have deployed Earlybird and demonstrated that it can automatically extract the signature for all known active worms (e.g. CodeRed,
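The content-sifting idea can be sketched as a naive exact prevalence table: for each fixed-length substring of a packet payload, count how often it recurs and how many distinct sources and destinations it travels between. The 8-byte window and the thresholds below are illustrative choices of ours, not Earlybird's actual parameters:

```python
from collections import defaultdict

# Toy content sifter: maps each fixed-length payload substring to a
# repetition count and the distinct source/destination sets seen with it.
# Window length and thresholds are illustrative, not Earlybird's.
SUBSTR_LEN = 8
PREVALENCE_THRESHOLD = 3
ADDRESS_THRESHOLD = 3

class ContentSifter:
    def __init__(self):
        self.count = defaultdict(int)
        self.sources = defaultdict(set)
        self.dests = defaultdict(set)

    def process_packet(self, src, dst, payload):
        """Index every overlapping SUBSTR_LEN-byte substring of the payload."""
        for i in range(len(payload) - SUBSTR_LEN + 1):
            s = payload[i:i + SUBSTR_LEN]
            self.count[s] += 1
            self.sources[s].add(src)
            self.dests[s].add(dst)

    def suspicious_signatures(self):
        """Substrings both frequently repeated and widely dispersed."""
        return [s for s in self.count
                if self.count[s] >= PREVALENCE_THRESHOLD
                and len(self.sources[s]) >= ADDRESS_THRESHOLD
                and len(self.dests[s]) >= ADDRESS_THRESHOLD]

sifter = ContentSifter()
# A "worm" payload repeated from many sources to many destinations...
for i in range(4):
    sifter.process_packet(f"10.0.0.{i}", f"192.168.0.{i}", b"XEXPLOIT!" + bytes([i]))
# ...versus ordinary traffic between a single client/server pair.
sifter.process_packet("10.0.9.9", "192.168.9.9", b"GET /index.html")

sigs = sifter.suspicious_signatures()
```

An exact table like this requires memory proportional to all traffic observed; the scalable approximations that replace it are the subject of Section 5.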
Slammer). Moreover, during our experiments Earlybird detected and extracted a signature for the Blaster, MyDoom and Kibuv.B worms – significantly before they had been publicly disclosed and hours or days before any public detection signatures were distributed. Finally, in our testing over a period of eight months, we have experienced relatively few false positives, and exceptions are typically due to structural properties of a few popular protocols (SPAM via SMTP and NetBIOS) that recur consistently and can be procedurally “white-listed”. The remainder of this paper is structured as follows. In Section 2 we survey the field of worm research that we have built upon and describe how it motivates our work. Section 3 describes how we define worm behavior. Section 4 outlines a naive approach to detecting such behaviors, followed by a concrete description of practical content sifting algorithms in Section 5. Section 6 describes the implementation of the Earlybird prototype and an analysis of our live experiments using it. We describe limitations and extensions in Section 7. Finally, in Section 8 we summarize our findings and conclude. 2 Background and Related Work Worms are simply small programs. They spread by exploiting a latent software vulnerability in some popular network service – such as email, Web or terminal access – seizing control of program execution and then sending a copy of themselves to other susceptible hosts. While the potential threat posed by network worms has a long past – originating with fictional accounts in Gerrold’s “When Harlie was One” and Brunner’s “Shockwave Rider” – it is only recently that this threat has enjoyed significant research attention. Fred Cohen first laid the theoretical foundations for understanding computer viruses in 1984 [4, 5], and the Internet worm of 1988 demonstrated that self-replication via a network could dramatically amplify the virulence of such pathogens [33, 39].
However, the analysis and understanding of network worms did not advance substantially until the CodeRed outbreak of 2001. In this section, we attempt to summarize the contemporary research literature – especially in its relation to our own work. The first research papers in the “modern worm era” focused on characterizations and analyses of particular worm outbreaks. For example, Moore et al. published one of the first empirical analyses of the CodeRed worm’s growth, based on unsolicited scans passively observed on an unused network [22]. Further, the authors estimated the operational “repair” rate by actively probing a subsample of the 360,000 infected sites over time. They found that, despite unprecedented media coverage, the repair rate during the initial outbreak averaged under 2 percent per day. This reinforces our belief that fully automated intervention is necessary to effectively manage worm outbreaks. Staniford et al.’s landmark paper anticipated the development of far faster worms and extrapolated their growth analytically [42] – foreshadowing the release of the Slammer worm in 2003. Moore et al. subsequently analyzed the Slammer outbreak and estimated that almost all of the Internet address space was scanned by the worm in under 10 minutes – limited only by bandwidth constraints at the infected sites [21]. This experience also motivates the need for fast and automated reaction times. Finally, based on these results, Moore et al. analyzed the engineering requirements for reactive defenses – exploring the tradeoffs between reaction time, deployment and the granularity of containment mechanisms (signature based vs. IP address based) [23]. Two of their key findings motivate our work. First, they demonstrated that signature-based methods can be an order of magnitude more effective than simply quarantining infected hosts piecemeal.
The rough intuition for this is simple: if a worm can compromise a new host with an average latency of x seconds, then an address based quarantine must react more quickly than x seconds to prevent the worm from spreading. By contrast, a signature based system can, in principle, halt all subsequent spreading once a signature is identified. The second important result was their derivation, via simulation, of “benchmarks” for how quickly such signatures must be generated to offer effective containment. Slow-spreading worms, such as CodeRed, can be effectively contained if signatures are generated within 60 minutes, while containing high-speed worms, such as Slammer, may require signature generation in well under 5 minutes – perhaps as little as 60 seconds. Our principal contribution is demonstrating practical mechanisms for achieving this requirement. In the remainder of this section we examine existing techniques for detecting worm outbreaks, characterizing worms and proposed countermeasures for mitigating worm spread. 2.1 Worm Detection Three current classes of methods are used for detecting new worms: scan detection, honeypots, and behavioral techniques at end hosts. We consider each of these in turn. Worms spread by selecting susceptible target hosts, infecting them over the network, and then repeating this process in a distributed recursive fashion. Many existing worms, excepting email viruses, will select targets using a random process. For instance, CodeRed selected target IP addresses uniformly from the entire address space. As a result, a worm will be highly unusual in the number, frequency and distribution of addresses that it scans. This can be leveraged to detect worms in several ways. To monitor random scanning worms from a global per-
spective, one approach is to use network telescopes – passive network monitors that observe large ranges of unused, yet routable, address space [25, 22, 26]. Under the assumption that worms will select target victims at random, a new worm will scan a given network telescope with a probability directly proportional to the worm’s scan rate and the network telescope’s “size”; that is, the number of IP addresses monitored. Consequently, large network telescopes will be able to detect fast spreading worms of this type fairly quickly. At the enterprise level, Staniford provides a comprehensive analysis of the factors impacting the ability of a network monitor to successfully detect and quarantine infected hosts in an online fashion [41]. However, there are two key limitations to the scan detection approach. First, it is not well suited to worms which spread in a non-random fashion, such as e-mail viruses or worms spread via instant messenger or peer-to-peer applications. Such worms generate a target list from address books or buddy lists at the victim and therefore spread topologically – according to the implicit relationship graph between individuals. Consequently, they do not exhibit anomalous scanning patterns and will not be detected as a consequence. The second drawback is that scan detection can only provide the IP address of infected sites, not a signature identifying their behavior. Consequently, defenses based on scan detection must be an order of magnitude faster than those based on signature extraction [23]. A different approach to worm detection is demonstrated by honeypots. First introduced to the community via Cliff Stoll’s book, “The Cuckoo’s Egg”, and Bill Cheswick’s paper “An Evening with Berferd”, honeypots are simply monitored idle hosts with untreated vulnerabilities. Any outside interaction with the host is, by definition, unsolicited and any malicious actions can be observed directly.
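The telescope-sizing argument above reduces to simple arithmetic: a /8 telescope observes 2^24 of the 2^32 IPv4 addresses, i.e. 1/256 of the address space, so a worm emitting r random scans per second is first observed, in expectation, after 256/r seconds. A short sketch (the scan rates are illustrative, not measurements):

```python
# Back-of-the-envelope network-telescope detection latency.
# A /8 telescope monitors 2^24 of the 2^32 IPv4 addresses.
telescope_addrs = 2**24
total_addrs = 2**32
p_hit = telescope_addrs / total_addrs  # chance one random scan is observed

def expected_seconds_to_first_hit(scans_per_sec):
    """Mean time until a uniformly-scanning worm first hits the telescope."""
    return 1.0 / (scans_per_sec * p_hit)

fast = expected_seconds_to_first_hit(4000)  # a fast, Slammer-class scanner
slow = expected_seconds_to_first_hit(10)    # a slow scanner
```

This is why large telescopes see fast random-scanning worms almost immediately, while slow or non-random worms can evade them for much longer.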
Consequently, any unsolicited outbound traffic generated by a honeypot represents undeniable evidence of an intrusion and possibly a worm infection. Moreover, since the honeypot host is directly controlled, malicious code can be differentiated from the default configuration. In this manner, the “body” of a worm can be isolated and then analyzed to extract a signature. This approach is commonly used for acquiring worm instances for manual analysis [18]. There are two principal drawbacks to honeypots: they require a significant amount of slow manual analysis and they depend on the honeypot being quickly infected by a new worm. Finally, a technique that has found increasing traction in the commercial world (e.g. via recently acquired startups, Okena and Entercept) is host-based behavioral detection. Such systems dynamically analyze the patterns of system calls for anomalous activity [31, 28, 3] indicating code injection or propagation. For example, an attempt to send a packet from the same buffer containing a received packet is often indicative of suspicious activity. While behavioral techniques are able to leverage large amounts of detailed context about application and system behavior, they can be expensive to manage and deploy ubiquitously. Moreover, end-host systems can, by definition, only detect an attack against a single host and not infer the presence of a large-scale outbreak. Clearly, from a management, cost and reuse standpoint, it is ideal to detect and block new attacks in the network. That said, end-host approaches offer a level of sensitivity that is difficult to match in the network and can be a useful complement – particularly for detecting potential slow or stealthy worms that do not leave a significant imprint on the network. 2.2 Characterization Characterization is the process of analyzing and identifying a new worm or exploit, so that targeted defenses may be deployed.
One approach is to create a priori vulnerability signatures that match known exploitable vulnerabilities in deployed software [44, 45]. For example, a vulnerability signature for the Slammer worm might match all UDP traffic on port 1434 that is longer than 100 bytes. By searching for such traffic, either in the network or on the host, a new worm exploiting the same vulnerability will be revealed. This is very similar to traditional intrusion detection systems (IDS), such as Snort [1] and Bro [29], which compare traffic content to databases of strings used in known attacks. This general approach has the advantage that it can be deployed before the outbreak of a new worm and therefore can offer an added measure of defense. However, this sort of proactive characterization can only be applied to vulnerabilities that are already well-known and well-characterized manually. Further, the tradeoff between vulnerability signature specificity, complexity and false positives remains an open question. Wang et al.’s Shield is by far the best-known vulnerability blocking system and it focuses on an end-host implementation precisely to better manage some of these tradeoffs [44]. We do not consider this approach further in this paper, but we believe it can be a valuable complement to the automated signature extraction alternative we explore. The earliest automation for signature extraction is due to Kephart and Arnold [15]. Their system, used commercially by IBM, allows viruses to infect known “decoy” programs in a controlled environment, extracts the infected (i.e., modified) regions of the decoys and then uses a variety of heuristics to identify invariant code strings across infection instances. Among this set of candidates an “optimal” signature is determined by estimating the false positive probability against a measured corpus of
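The invariant-string idea behind decoy-based extraction can be illustrated with a brute-force longest-common-substring search across infection instances. This is a toy sketch with hypothetical payloads, not Kephart and Arnold's actual heuristics, which also weigh false positive probability against a corpus of normal programs:

```python
# Toy invariant-string extraction: find the longest byte string common
# to every infected sample (brute force; fine for short illustrative data).
def longest_common_substring(samples):
    first, rest = samples[0], samples[1:]
    best = b""
    for i in range(len(first)):
        # Try candidates longer than the current best, longest first.
        for j in range(len(first), i + len(best), -1):
            cand = first[i:j]
            if all(cand in s for s in rest):
                best = cand
                break
    return best

# Three "infected decoys" sharing an invariant exploit sequence amid
# per-instance padding (hypothetical data).
decoys = [b"aaaaJMPEXPLOITCODEzz", b"qqJMPEXPLOITCODEppp", b"JMPEXPLOITCODExy"]
sig = longest_common_substring(decoys)
```

The invariant region recovered this way is exactly the kind of candidate a real system would then vet for false positives before deploying it as a signature.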
n-grams found in normal computer programs. This approach is extremely powerful, but assumes the presence of a known instance of a virus and a controlled environment to monitor. The former limitation is partially addressed by the Honeycomb system of Kreibich and Crowcroft [17]. Honeycomb is a host-based intrusion detection system that automatically generates signatures by looking for longest common subsequences among sets of strings found in message exchanges. This basic procedure is similar to our own, but there are also important structural and algorithmic differences between our two approaches, the most important of which is scale. Honeycomb is designed for a host-based context with orders of magnitude less processing required. To put this in context, our Earlybird system currently processes more traffic in one second than the prototype Honeycomb observed in 24 hours. However, one clear advantage offered by the host context is its natural imperviousness to network evasion techniques [30]. We discuss this issue further in Section 7. Finally, over the last two years of Earlybird’s development [34, 35, 37], the clearest parallels can be drawn to Kim and Karp’s contemporaneously-developed Autograph system [16]. Like Earlybird, Autograph also uses network-level data to infer worm signatures and both systems employ Rabin fingerprints to index counters of content substrings and use white-lists to set aside well-known false positives. However, there are several important differences as well. First, Autograph relies on a prefiltering step that identifies flows with suspicious scanning activity (particularly the number of unsuccessful TCP connection attempts) before calculating content prevalence. By contrast, Earlybird measures the prevalence of all content entering the network and only then considers the addressing activity. 
This difference means that Autograph cannot detect large classes of worms that Earlybird can – including almost all e-mail borne worms, such as MyDoom, UDP-based worms such as Slammer, spoofed source worms, or worms carried via IM or P2P clients. Second, Autograph has extensive support for distributed deployments – involving active cooperation between multiple sensors. By contrast, Earlybird has focused almost entirely on the algorithmics required to support a robust and scalable wire-speed implementation in a single sensor and only supports distribution through a centralized aggregator. Third, Earlybird is an on-line system that has been in near-production use for eight months and handles over 200 megabits of live traffic, while, as described, Autograph is an off-line system that has only been evaluated using traces. Finally, there are many differences in the details of the algorithms used (e.g. Autograph breaks content into non-overlapping variable-length chunks while Earlybird manages overlapping fixed-length content strings over each byte offset) although it is not currently clear what the impact of these differences is. 2.3 Containment Containment refers to the mechanism used to slow or stop the spread of an active worm. There are three containment mechanisms in use today: host quarantine, string-matching and connection throttling. Host quarantine is simply the act of preventing an infected host from communicating with other hosts – typically implemented via IP-level access control lists on routers or firewalls. String-matching containment – typified by signature-based network intrusion prevention systems (NIPS) – matches network traffic against particular strings, or signatures, of known worms and can then drop associated packets. To enable high-bandwidth deployments, several hardware vendors are now producing high-speed string matching and regular expression checking chips for worm and virus filtering. Lockwood et al. 
describe an FPGA-based research prototype programmed for this application [19]. Finally, a different strategy, proposed by Twycross and Williamson [43], is to proactively limit the rate of all outgoing connections made by a machine and thereby slow – but not stop – the spread of any worm. Their approach was proposed in a host context, but there is no reason such connection throttling cannot be applied at the network level as well. In this paper, we assume the availability of string-matching containment (perhaps in concert with throttling) and our Earlybird prototype generates signatures for a Snort in-line intrusion detection system – blocking all packets containing discovered worm signatures. 3 Defining Worm Behavior Network worms, due to their distinct purpose, tend to behave quite differently from the popular client-server and peer-to-peer applications deployed on today’s networks. In this section we explore these key behaviors in more detail and how they can be exploited to detect and characterize network worms. 3.1 Content invariance In all existing worms of which we are aware, some or all of the worm program is invariant across every copy. Typically, the entire worm program is identical across every host it infects. However, some worms make use of limited polymorphism – by encrypting each worm instance independently and/or randomizing filler text. In these cases, much of the worm body is variable, but key portions are still invariant (e.g., the decryption routine). For the purposes of this paper, we assume that a worm has some amount of invariant content or has relatively few variants. We discuss violations of this assumption in Section 7.
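The invariance assumption can be illustrated with a toy construction (ours, not the paper's): simulated "instances" that differ in random filler but share an invariant substring standing in for a fixed decryption routine.

```python
import os

# Hypothetical invariant region; a stand-in for a worm's decryption routine.
INVARIANT = b"\x90DECODE-LOOP\x90"

def make_instance() -> bytes:
    """Build one polymorphic 'instance': random filler around the
    invariant region, so whole-body comparison fails but a substring
    signature still matches."""
    return os.urandom(64) + INVARIANT + os.urandom(64)

instances = [make_instance() for _ in range(3)]
# Each instance differs as a whole string, yet every one contains the
# invariant substring a signature could be built from.
assert len(set(instances)) == 3
assert all(INVARIANT in inst for inst in instances)
```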
3.2 Content prevalence Since worms are designed foremost to spread, the invariant portion of a worm’s content will appear frequently on the network as it spreads or attempts to spread. Conversely, content which is not prevalent will not represent the invariant portion of a worm and therefore is not a useful candidate for constructing signatures. 3.3 Address dispersion For the same reasons, the number of distinct hosts infected by a worm will grow over time. Consequently, packets containing a live worm will tend to reflect a variety of different source and destination addresses. Moreover, during a major outbreak, the number of such addresses can grow extremely quickly. Finally, it is reasonable to expect that the distribution of these addresses will be far more uniform than typical network traffic which can have significant clustering [9].1 In this paper we only take advantage of the first of these three observations, but we believe there is potential value in considering all of them. 4 Finding worm signatures From these assumptions, we can conclude that network worms must generate significant traffic to spread and that this traffic will contain common substrings and will be directed between a variety of different sources and destinations. While it is not yet clear that this characterization is exclusively caused by worms, for now we will assume that identifying this traffic pattern is sufficient for detecting worms. We examine the issue of false positives later in the paper. In principle, detecting this traffic pattern is relatively straightforward. Figure 1 shows an idealized algorithm that achieves this goal. For each network packet, the content is extracted and all substrings processed. Each substring is indexed into a prevalence table that increments a count field for a given substring each time it is found. In effect, this table implements a histogram of all observed substrings. 
To maintain a count of unique source and destination addresses, each table entry also maintains two lists, containing IP addresses, that are searched and potentially updated each time a substring count is incremented. Sorting this table on the substring count and the size of the address lists will produce the set of likely worm traffic. Better still, the table entries meeting these worm behavior criteria are exactly those containing the invariant substrings of the worm. It is these substrings that can be used as signatures to filter the worm out of legitimate network traffic. We call this approach content sifting because it effectively implements a high-pass filter on the contents of network traffic.

1 As described earlier, the presence of “dark” IP addresses can also provide qualitatively strong evidence of worm-like behavior.

ProcessTraffic(payload, srcIP, dstIP)
1    prevalence[payload]++
2    Insert(srcIP, dispersion[payload].sources)
3    Insert(dstIP, dispersion[payload].dests)
4    if (prevalence[payload] > PrevalenceTh
5        and size(dispersion[payload].sources) > SrcDispTh
6        and size(dispersion[payload].dests) > DstDispTh)
7        if (payload in knownSignatures)
8            return
9        endif
10       Insert(payload, knownSignatures)
11       NewSignatureAlarm(payload)
12   endif

Figure 1: The idealized content sifting algorithm detects all packet contents that are seen often enough and are coming from enough sources and going to enough destinations. The value of the detection thresholds and the time window over which each table is used are both parameters of the algorithm.
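For concreteness, the idealized algorithm of Figure 1 can be transcribed into runnable form. The thresholds below are illustrative placeholders, not values from the system:

```python
from collections import defaultdict

# Illustrative thresholds, not the deployed system's parameters.
PREVALENCE_TH, SRC_DISP_TH, DST_DISP_TH = 3, 2, 2

prevalence = defaultdict(int)
dispersion = defaultdict(lambda: {"sources": set(), "dests": set()})
known_signatures = set()

def process_traffic(payload: bytes, src_ip: str, dst_ip: str):
    """Direct transcription of ProcessTraffic (Figure 1): count each
    payload and its distinct sources/destinations; return the payload
    the first time it crosses all three thresholds (the alarm)."""
    prevalence[payload] += 1
    d = dispersion[payload]
    d["sources"].add(src_ip)
    d["dests"].add(dst_ip)
    if (prevalence[payload] > PREVALENCE_TH
            and len(d["sources"]) > SRC_DISP_TH
            and len(d["dests"]) > DST_DISP_TH):
        if payload in known_signatures:
            return None          # already reported
        known_signatures.add(payload)
        return payload           # NewSignatureAlarm(payload)
    return None
```

As the text notes, this version is hopelessly inefficient — it stores every distinct payload and full address lists — but it states the detection criterion precisely.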
[Figure 2 diagram: a packet's payload is hashed by h1(F), h2(F) and h3(F) into Stage 1, Stage 2 and Stage 3 counter tables; the content passes only if all stage counters are large.]

Figure 2: Multi-stage Filters. A piece of content is hashed using hash function h1 into a Stage 1 table, h2 into a Stage 2 table, etc., and each table entry contains a counter that is incremented. If all the hashed counters are above the prevalence threshold, then the content string is saved for address dispersion measurements. In previous work we have shown that the probability of an approximation error decreases exponentially with the number of stages and consequently is extremely small in practice [10].

Network content which is not prevalent or not widely dispersed is sifted out, leaving only the worm-like content. However, while content sifting can correctly identify worm signatures, the basic algorithm we have described is far too inefficient to be practical. 
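A minimal sketch of the multi-stage filter of Figure 2, including the conservative-update refinement, follows. The stage count, table sizes, threshold, and the SHA-256-derived per-stage hashes are our own illustrative choices, not the prototype's:

```python
import hashlib

class MultiStageFilter:
    """Sketch of a multi-stage filter with conservative update: content
    is hashed into one counter per stage, and only passes when every
    stage's counter reaches the prevalence threshold."""

    def __init__(self, num_stages: int = 3, counters_per_stage: int = 1024,
                 threshold: int = 3):
        self.threshold = threshold
        self.tables = [[0] * counters_per_stage for _ in range(num_stages)]

    def _buckets(self, content: bytes):
        # Derive one index per stage from a single SHA-256 digest
        # (an illustrative stand-in for independent hash functions).
        digest = hashlib.sha256(content).digest()
        for stage, table in enumerate(self.tables):
            idx = int.from_bytes(digest[4*stage:4*stage+4], "big") % len(table)
            yield table, idx

    def update(self, content: bytes) -> bool:
        """Record one occurrence; return True if the content now exceeds
        the threshold in every stage. Conservative update: increment only
        the counters currently holding the minimum value, which tightens
        the overestimate caused by collisions."""
        slots = list(self._buckets(content))
        low = min(table[idx] for table, idx in slots)
        for table, idx in slots:
            if table[idx] == low:
                table[idx] += 1
        return all(table[idx] >= self.threshold for table, idx in slots)
```

Because a rare string would have to collide with prevalent content in every stage simultaneously to pass, the false-pass probability falls off sharply as stages are added, which is the property the caption cites.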
In the next section we describe algorithms for approximating correctness in exchange for efficiency and practicality. 5 Practical Content Sifting For automated signature extraction to scale to high-speed links, the algorithms must have small processing requirements (ideally well-suited to parallelization), and small memory requirements. Finally, to allow arbitrary deployment strategies, the algorithm should not depend on having a symmetric vantage point in the network. To satisfy these requirements, we now describe scalable and accurate algorithms for estimating content prevalence and address dispersion, and techniques for managing CPU overload through smooth tradeoffs between detection time and overhead. For simplicity, in this section we describe our algorithms at packet (and not flow) granularity. 5.1 Estimating content prevalence Identifying common content involves finding the packet payloads that appear at least x times among the N packets sent during a given interval. However, a table indexed
by payload can quickly consume huge amounts of memory. For example, on a fully loaded 1 Gbps link, this naive approach could generate a 1 GByte table in less than 10 seconds. Memory consumption can be reduced considerably by indexing the table using a fixed size hash of the packet payload instead of the full payload. After a certain hash value has repeated x − 1 times, the next packet with this hash is reported. In the absence of collisions, the associated content will have appeared exactly x times. By selecting a hash function with suitably large range (e.g., 32 or 64 bits) the collision probability can be minimized. Assuming a 16 byte hash table entry and an average packet size of 500 bytes, this algorithm would take over 4 minutes to generate the same 1 GByte table. Memory efficiency can be improved further by observing that identifying prevalent content is isomorphic to the well-studied problem of identifying high-bandwidth flows, frequently called “heavy hitters” [13, 10]. By modifying the definition of “flow” to reflect content fingerprints instead of the (srcip, dstip, srcport, dstport, protocol) tuple used for flow analysis, heavy-hitter approximation algorithms can be used to find prevalent content using comparatively small amounts of memory. Our prototype uses multi-stage filters with conservative update to dramatically reduce the memory footprint of the problem (see Figure 2 for a general description and [13, 10] for a thorough analysis). While simple, we believe this notion of using a content signature as a “flow identifier” on which to maintain counters is a powerful technique.2 An important modification is to append the destination port and protocol to the content before hashing. 
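The modification just described might be sketched as follows; the helper name and the 64-bit truncation are our own choices, and SHA-256 stands in for whatever hash the prototype uses:

```python
import hashlib

def content_fingerprint(payload: bytes, dst_port: int, protocol: int) -> int:
    """Fold the destination port and protocol into the content hash, so
    identical payloads sent to different services land in different
    table entries. Truncating to 64 bits keeps the collision
    probability negligible while bounding entry size."""
    data = (payload
            + dst_port.to_bytes(2, "big")
            + protocol.to_bytes(1, "big"))
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
```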
Since worms typically target a particular service (they are designed to exploit a vulnerability in that service) this will not impact the ability to track worm traffic, but can effectively exclude large amounts of prevalent content not generated by worms (i.e., potential false positives).3 For example, if two users on the same network both download the Yahoo home page they will receive many packets with identical payloads. However, traffic sent from the Web server will be directed to a so-called “ephemeral” port selected by each client. Since these ports are selected independently, adding them to the hash input will generally differentiate these different clients even when the content being carried is identical. So far, we have only discussed content at the whole packet granularity. While this is sufficient for detecting most existing worms, it is easy to envision worms for which the invariant content is a string smaller than a single packet or for which the invariant content occurs at different offsets in each instance. However, detecting common strings of at least a minimum length is computationally complex. Instead we address the related – yet far easier – problem of detecting repeating strings with a small fixed length β.

2 We are not the first to use hashing techniques to analyze the content makeup of network traffic. Snoeren et al. and Duffield et al. both use hashing to match packet observations across a network [38, 7], and both Spring et al. and Muthitacharoen et al. use Rabin fingerprints for compressing content sent over a network [40, 27].

3 Note that it is possible for this assumption to be violated under unusual circumstances. In particular, the Witty worm exploited promiscuous network devices and only required a fixed source port to exploit its vulnerability – the destination port was random [6]. Catching this worm required us to maintain an additional matching table in which the source port is appended to the hash output instead.
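The incremental fingerprinting of β-byte substrings can be sketched with a Rabin-Karp-style polynomial rolling hash, used here as a stand-in for the Rabin fingerprint variant the system computes; the base and modulus are illustrative:

```python
def rolling_fingerprints(payload: bytes, beta: int = 4):
    """Compute a fingerprint for every beta-byte substring of payload in
    O(1) amortized work per byte. Equal substrings yield equal
    fingerprints regardless of their offset in the packet."""
    base, mod = 257, (1 << 61) - 1
    if len(payload) < beta:
        return []
    high = pow(base, beta - 1, mod)      # weight of the byte sliding out
    fp = 0
    for b in payload[:beta]:             # fingerprint of the first window
        fp = (fp * base + b) % mod
    fps = [fp]
    for i in range(beta, len(payload)):  # slide the window one byte
        fp = ((fp - payload[i - beta] * high) * base + payload[i]) % mod
        fps.append(fp)
    return fps
```

A payload of s bytes yields s − β + 1 fingerprints, which is why the text goes on to introduce value sampling to cut the per-packet memory references.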
As with full packet contents, storing individual substrings can require exorbitant memory and computational resources. Instead, we use a method similar to the one proposed by Manber for finding similar files in a large file system [20]. We compute a variant of Rabin fingerprints for all possible substrings of a certain length [32]. As these fingerprints are polynomials they can be computed incrementally while retaining the property that two equal substrings will generate the same fingerprint, no matter where they are in a packet. However, each packet with a payload of s bytes has s − β + 1 strings of length β, so the memory references used per packet is still substantially greater than that consumed by a single per-packet hash. In Section 5.3, we describe a technique called value sampling to considerably reduce memory references. 5.2 Estimating address dispersion While content prevalence is the key metric for identifying potential worm signatures, address dispersion is critical for avoiding false positives among this set. Without this additional test a system could not distinguish between a worm and a piece of content that frequently occurs between two computers – for example a mail client sending the same user name repeatedly as it checks for new mail on the mail server regularly. To quantify address dispersion one must count the distinct source IP addresses and destination IP addresses associated with each piece of content suspected of being generated by a worm (note that this is different from the previous content prevalence problem which only requires estimating the repetitions of each distinct string). While one could simply count source-destination address pairs, counting the source and destination addresses independently allows finer distinctions to be made. 
For example, mail messages sent to a popular mailing list are associated with many source-destination address pairs, but with only two sources – the mail server of the original sender of the message and the mail server running the list. While it is possible to count IP addresses exactly using a simple list or hash table, more efficient solutions are needed if there are many pieces of content suspected of being generated by worms. Our solution is to trade off some precision in these counters for dramatic reductions in memory requirements. Our first approach was to appropriate the direct bitmap data structure originally
developed for approximate flow counting [46, 11]. Each content source is hashed to a bitmap, the corresponding bit is set, and an alarm is raised when the number of bits set exceeds a threshold. For example, if the dispersion threshold T is 30, the source address is hashed into a bitmap of 32 bits and an alarm is raised if the number of bits set crosses 20 (the value 20 is calculated analytically to account for hash collisions). This approach has minimal memory requirements, but in exchange it loses the ability to estimate the actual values of each counter – important for measuring the rate of infection or prioritizing alerts. While other techniques such as probabilistic counting [12] and multiresolution bitmaps [11] can provide accurate counts, they require significantly more memory. For example, a multiresolution bitmap requires 512 bits to count to 1 million. Instead, we have invented a counting algorithm that leverages the fact that address dispersion continuously increases during an outbreak. Using this observation we devise a new, compact data structure, called a scaled bitmap, that accurately estimates address dispersion using five times less memory than existing algorithms. The scaled bitmap achieves this reduction by subsampling the range of the hash space. For example, to count up to 64 sources using 32 bits, one might hash sources into a space from 0 to 63 yet only set bits for values that hash between 0 and 31 – thus ignoring half of the sources. At the end of a fixed measurement interval, this subsampling is adjusted by scaling the resulting count to estimate the true count (a factor of two in the previous example). Generalizing, we track a continuously increasing count by simply increasing this scaling factor whenever the bitmap is filled. For example, the next configuration of the bitmap might map one quarter of the hash space to a 32-bit bitmap and scale the resulting count by four.
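The bitmap arithmetic above can be checked directly. With b bits and z bits still unset, the collision-corrected estimate of distinct sources is b·ln(b/z); a bitmap that observes only 1/scale of the hash space multiplies the result by scale. A minimal sketch (function names here are illustrative, not from the paper):

```python
import math

# Collision-corrected distinct-count estimate for a b-bit bitmap, with an
# optional subsampling scale factor as used by the scaled bitmap.
def bitmap_estimate(bits_set: int, b: int = 32, scale: int = 1) -> float:
    return scale * b * math.log(b / (b - bits_set))

# 20 of 32 bits set corresponds to roughly 31 distinct sources, which is
# why the text alarms at 20 set bits for a dispersion threshold of about 30.
print(round(bitmap_estimate(20)))            # → 31
# Subsampling half the hash space: the same fill level represents ~63 sources.
print(round(bitmap_estimate(20, scale=2)))   # → 63
```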
This allows the storage of the bitmap to remain constant across an enormous range of counts. However, once the bitmap is scaled to a new configuration, the addresses that were active throughout the previous configuration are lost and adjusting for this bias directly can lead to double counting. To minimize these errors, the final scaled bitmap algorithm, shown in Figure 3, uses multiple bitmaps (numbmps = 3 in this example), each mapped to progressively smaller portions of the hash space. To calculate the count, the estimated numbers of sources hashing to each bitmap are added, and then this sum is divided by the fraction of the hash space covered by all the bitmaps. When the bitmap covering the largest portion of the hash space has too many bits set to be accurate, it is advanced to the next configuration by recycling it: the bitmap is reset and then mapped to the next slice of the hash space (Figure 4). Consequently, each bitmap covers half the hash space covered by its predecessor. The first bitmap,

UpdateBitmap(IP)
1 code = Hash(IP)
2 level = CountLeadingZeroes(code)
3 bitcode = FirstBits(code << (level+1))
4 if (level ≥ base and level < base+numbmps)
5   SetBit(bitcode, bitmaps[level-base])
6   if (level == base and CountBitsSet(bitmaps[0]) == max)
7     NextConfiguration()
8   endif
9 endif

ComputeEstimate(bitmaps, base)
1 numIPs = 0
2 for i = 0 to numbmps-1
3   numIPs = numIPs + b ln(b/CountBitsNotSet(bitmaps[i]))
4 endfor
5 correction = 2(2^base − 1)/(2^numbmps − 1) · b ln(b/(b − max))
6 return numIPs · 2^base/(1 − 2^−numbmps) + correction

Figure 3: A scaled bitmap uses numbmps bitmaps of size b bits each. The bitmaps cover progressively smaller portions of the hash space. When the bitmap covering the largest portion of the hash space gets too full to be accurate (the number of bits set reaches max), we advance to the next configuration by “recycling” the bitmap (see Figure 4).
To compute an estimate of the number of distinct IP addresses, we multiply an estimate of the number of addresses that mapped to the bitmaps by the inverse of the fraction of the hash space covered by the bitmaps. A correction is added to the result to account for the IP addresses that were active in earlier configurations, while the current bitmaps were not in use at their present levels. the one covering the largest portion of the hash space, is the most important in computing the estimate, but the other bitmaps provide “memory” for counts that are still small and serve to minimize the previously mentioned biases. Consequently, not much correction is needed when these bitmaps become the most important. Formally, we can prove that the maximum ratio between the bias of the algorithm and the number of active addresses is 2/(2^numbmps − 1) [36]. Overall, this new technique allows us to count sources and destinations quite accurately using only 3 bitmaps, with roughly 5 times less memory than previously known techniques [12, 11]. This is critical for practical scaling because it reduces the system’s sensitivity to the effectiveness of the low-pass filter provided by the content prevalence test.

5.3 CPU scaling

Using multistage filters to detect content prevalence and scaled bitmaps to estimate address dispersion decreases memory usage and limits the amount of processing. However, each payload string still requires significant processing. In our prototype implementation (detailed in Section 6), the CPU can easily manage processing each packet payload as a single string, but when applying Rabin fingerprints, the processing of every substring of length β can overload the CPU during high traffic load. For example, a packet with 1,000 bytes of payload and β = 40 requires processing 961 Rabin fingerprints. While computing the Rabin fingerprints themselves incurs overhead, it is the three orders of magnitude
Figure 4: When the bitmap covering the largest portion of the hash space fills up, it is recycled. The bitmap is cleared and it is mapped to the largest uncovered portion of the hash space, which is half the size of the portion covered by the rightmost bitmap before recycling. Recycling increments the variable base (see Figure 3) by one.

increase in the number of content sifting operations that exceeds the capacity of our current CPU. While a faster CPU might solve this problem for a given traffic profile, the possibility of traffic surges and denial-of-service attacks on a sensor produces the same problem again. We believe that a security device should not fail in these circumstances but instead smoothly scale back functionality to match capacity – still performing the same functions but perhaps with reduced fidelity or responsiveness. The obvious approach to address this problem is via dynamic sampling. However, randomly sampling which substrings to process could cause us to miss a large fraction of the occurrences of each substring and thus delay the generation of a worm’s signature. Instead, we use value sampling [20] and select only those substrings for which the fingerprint matches a certain pattern (e.g. the last 6 bits of the fingerprint are 0). Consequently, the algorithm will systematically ignore some substrings, but track all occurrences of others. However, if a worm contains even a single tracked substring, it will be detected as promptly as without the sampling. For example, if f is the fraction of tracked substrings (e.g. f = 1/64 if we track the substrings whose Rabin fingerprint ends in six 0s), then the probability of detecting a worm with a signature of length x is p_track(x) = 1 − e^(−f(x−β+1)). Since Rabin fingerprints are themselves randomly distributed, the probability of tracking a worm substring of length β is f. Thus, the probability of not tracking a given substring is p_miss(β) = 1 − f.
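The value-sampling arithmetic above can be checked directly: the mask test below tracks a substring iff the low 6 bits of its fingerprint are zero (so f = 1/64), and p_track follows the formula in the text. The function names are ours, not the paper's.

```python
import math

# Value sampling: track a substring iff the last 6 bits of its fingerprint
# are 0, giving a deterministic sampling fraction f = 1/64.
SAMPLE_BITS = 6
F = 1 / (1 << SAMPLE_BITS)

def is_tracked(fingerprint: int) -> bool:
    return fingerprint & ((1 << SAMPLE_BITS) - 1) == 0

def p_track(x: int, f: float = F, beta: int = 40) -> float:
    """Probability of detecting a worm whose invariant content is x bytes."""
    return 1 - math.exp(-f * (x - beta + 1))

# Exactly 1 in 64 fingerprint values is tracked, and every occurrence of a
# tracked substring is sampled, so sampling never delays detection of it.
print(sum(is_tracked(fp) for fp in range(1 << 12)))   # → 64 of 4096 values
print(f"{p_track(200):.1%}, {p_track(400):.1%}")      # → 91.9%, 99.6%
```

Because the selection is a deterministic function of the fingerprint value (rather than a random coin flip per occurrence), a tracked substring is counted at every occurrence in every packet.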
The probability of not tracking the worm is the probability of not tracking any of its substrings. If the worm signature has length x, it has x − β + 1 substrings of length β. Assuming that no substring of length β repeats in the signature, the probability of not tracking the worm is p_miss(x) = (1 − f)^(x−β+1) ≈ e^(−f(x−β+1)). For example, with f = 1/64 and β = 40, the probability of tracking a worm with a signature of 100 bytes is 55%, but for a worm with a signature of 200 bytes it increases to 92%, and for 400 bytes to 99.64%.

Figure 5: Content Sifting Algorithm as used in EarlyBird.

The sampling value f represents a tradeoff between processing and the probability of missing a worm; processing decreases linearly with f and the length of the invariant content required increases linearly with f. Note that all current worms have had invariant content of at least 400 bytes, for which the probability of false negatives is at most 0.36%. Our user-space software implementation requires f = 1/64 to keep up with roughly 200Mbps of traffic on a Gigabit Ethernet interface. Finally, since the parameters of the Rabin fingerprint algorithm p and M are not known, the worm writer cannot determine in advance which strings will not be sampled.

5.4 Putting it together

Figure 5 depicts the content sifting algorithm implemented in the EarlyBird prototype. As each packet arrives, its content (or substrings of its content) is hashed and appended with the protocol identifier and destination port to produce a content hash code. In our implementation, we use a 32-bit Cyclic Redundancy Check (CRC) as a packet hash and 40-byte Rabin fingerprints for substring hashes.
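The whole-packet path just described (hash the payload together with the protocol identifier and destination port) can be sketched as follows. The field layout and the use of zlib's CRC32 are illustrative assumptions, not the implementation's exact encoding.

```python
import zlib

# A sketch of the per-packet content key: a CRC over the protocol
# identifier, the destination port, and the payload (layout assumed).
def content_key(payload: bytes, protocol: int, dst_port: int) -> int:
    header = bytes([protocol]) + dst_port.to_bytes(2, "big")
    return zlib.crc32(header + payload)

# Identical web content sent to two clients' ephemeral ports yields two
# different keys, while worm probes of one service port collapse to one key.
page = b"<html>front page</html>"
key_a = content_key(page, 6, 51234)      # TCP to client A's ephemeral port
key_b = content_key(page, 6, 62901)      # TCP to client B's ephemeral port
probe1 = content_key(b"EXPLOIT", 6, 445)
probe2 = content_key(b"EXPLOIT", 6, 445)
```

This mirrors the earlier ephemeral-port observation: including the destination port in the hash input separates benign repeated downloads while leaving worm traffic aimed at a single service port aggregated under one key.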
Each Rabin fingerprint is subsampled with f = 1/64. The resulting hash codes are used to index the address dispersion table. If an entry already exists (the content has been determined to be prevalent) then the address dispersion table entries for source and destination IP addresses (implemented as scaled bitmaps) are updated. If the source and destination counts exceed the dispersion threshold, then the content string is reported. If the content hash is not found in the dispersion table, it is indexed into the content prevalence table. In our implementation, we use four independent hash functions of the content hash to create 4 indexes into four counter arrays. Using the conservative update optimization, only the smallest among the four counters is incremented [10]. If all four counters are greater than the prevalence threshold, then a new entry is made in the address dispersion table – with high probability, the content has appeared frequently enough to be a candidate worm signature. Pseudocode for the main loop of the EarlyBird system is shown in Figure 6.

ProcessPacket()
1 InitializeIncrementalHash(payload, payloadLength, dstPort)
2 while (currentHash = GetNextHash())
3   if (currentADEntry = ADEntryMap.Find(currentHash))
4     UpdateADEntry(currentADEntry, srcIP, dstIP, packetTime)
5     if ((currentADEntry.srcCount > SrcDispTh) and (currentADEntry.dstCount > DstDispTh))
6       ReportAnomalousADEntry(currentADEntry, packet)
7     endif
8   else
9     if (MsfIncrement(currentHash) > PrevalenceTh)
10      newADEntry = InitializeADEntry(srcIP, dstIP, packetTime)
11      ADEntryMap.Insert(currentHash, newADEntry)
12    endif
13  endif
14 endwhile

Figure 6: The EarlyBird loop performed on every packet. When the prevalence threshold is exceeded, dispersion counting is done by creating an ADEntry. An ADEntry contains the source and destination bitmaps and the scale factors required for the scaled bitmap implementation.

The content prevalence table sees the most activity in the system and serves as a high-pass filter for frequent content. The multi-stage filter data structure is cleared on a regular interval (60 seconds in our implementation). By contrast, the address dispersion table typically has fewer values – only those strings exceeding the prevalence threshold – and can be garbage collected over longer time scales (even hours). Each of these mechanisms can be implemented at high speeds in either software or hardware, with relatively modest memory requirements, as we quantify in the next section. Moreover, our approach makes no assumptions about the point of deployment, whether at the endpoint, edge, or core. However, the optimal parameter settings may depend on the point of deployment. In Section 6 we empirically explore the parameter settings used by our EarlyBird prototype.
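The per-hash path of Figure 6 can be sketched end to end: a four-stage filter with conservative update acts as the content prevalence test, and per-content source/destination sets (standing in for scaled bitmaps) track dispersion. Stage sizes, thresholds, and the use of plain Python sets are simplifying assumptions.

```python
import hashlib

# A compact sketch of the content sifting fast path (illustrative sizes).
STAGES, STAGE_SIZE = 4, 8192
PREVALENCE_TH, DISPERSION_TH = 3, 30

counters = [[0] * STAGE_SIZE for _ in range(STAGES)]
dispersion = {}   # content hash -> (set of source IPs, set of dest IPs)

def stage_index(content_hash: int, stage: int) -> int:
    h = hashlib.sha256(f"{stage}:{content_hash}".encode()).digest()
    return int.from_bytes(h[:4], "big") % STAGE_SIZE

def sift(content_hash: int, src_ip: str, dst_ip: str) -> str:
    entry = dispersion.get(content_hash)
    if entry is not None:                  # already prevalent: count addresses
        srcs, dsts = entry
        srcs.add(src_ip)
        dsts.add(dst_ip)
        if len(srcs) > DISPERSION_TH and len(dsts) > DISPERSION_TH:
            return "suspected worm"
        return "prevalent"
    idx = [stage_index(content_hash, s) for s in range(STAGES)]
    smallest = min(counters[s][idx[s]] for s in range(STAGES))
    for s in range(STAGES):                # conservative update: raise only
        if counters[s][idx[s]] == smallest:  # the smallest counter(s)
            counters[s][idx[s]] += 1
    if min(counters[s][idx[s]] for s in range(STAGES)) > PREVALENCE_TH:
        dispersion[content_hash] = (set(), set())
    return "rare"

# Drive it: the same content hash seen repeatedly, then from many distinct
# source/destination pairs, passes the prevalence and dispersion tests.
for _ in range(4):
    status = sift(0xBEEF, "1.1.1.1", "2.2.2.2")
for i in range(31):
    status = sift(0xBEEF, f"10.0.0.{i}", f"20.0.0.{i}")
```

The conservative update means only content that is genuinely frequent can push all four of its counters past the prevalence threshold, which keeps the dispersion table small.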
6 Experience

Based on the content sifting algorithm just described, we have built a prototype system which has been in use on the UCSD campus for over eight months. In this section, we describe our overall system design, the implementation and experimental environment, our initial experiments exploring the parameter space of the content sifting algorithm, our evaluation of false positives and false negatives, and our preliminary results in finding live worms at our site.

6.1 System design

The EarlyBird system consists of two major components: Sensors and an Aggregator. Each sensor sifts through traffic on configurable address-space “zones” of responsibility and reports anomalous signatures. The aggregator

Figure 7: A screenshot of the main screen of the EarlyBird user interface. Each zone is labeled by a prefix and shows the current anomalies (worms), and prevalence/dispersion parameters which can be changed by the user. More detailed screens show detailed counts for each anomaly, as shown for Sasser in Figure 12.

coordinates real-time updates from the sensors, coalesces related signatures, activates any network-level or host-level blocking services and is responsible for administrative reporting and control. Our implementation is written in C and the aggregator also uses the MySQL database to log all events, the popular rrd-tools library for graphical reporting, and PHP scripting for administrative control. A screenshot of the main screen of the EarlyBird user interface showing zones and a summary of the current system activity is shown in Figure 7. Finally, in order to automatically block outbreaks, the EarlyBird system automatically generates and deploys precise content-based signatures formatted for the Snort-inline intrusion prevention system [1]. A sample signature for Kibvu.B is shown below.
drop tcp $HOME_NET any -> $EXTERNAL_NET 5000 (msg:"2712067784 Fri May 14 03:51:00 2004"; rev:1; content:"|90 90 90 90 4d 3f e3 77 90 90 90 90 ff 63 64 90 90 90 90 90|";)

6.2 Implementation and environment

The current prototype EarlyBird sensor executes on a 1.6GHz AMD Opteron 242 1U server configured with a standard Linux 2.6 kernel. The server is equipped with two Broadcom Gigabit copper network interfaces for data capture. The EarlyBird sensor itself is a single-threaded application which executes at user-level and captures packets using the popular libpcap library. The system is roughly 5,000 lines of code (not including external libraries), with the bulk of the code dedicated to self-monitoring for the purposes of this paper. The scalable implementation itself is a much smaller fraction of this code base. In its present untuned form, EarlyBird sifts through over 1TB of traffic per day and is able to keep up with over 200Mbps of continuous traffic when
using Rabin fingerprints with a value sampling probability of 1/64 (and at even higher speeds using whole-packet CRCs). The experiments in the remainder of this paper are based on data collected from a Cisco Catalyst router configured to mirror all in-bound and out-bound traffic to our sensor (Earlybird currently makes no distinction between incoming and outgoing packets). The router manages traffic to and from roughly 5000 hosts, primarily clients, as well as all traffic to and from a few dedicated campus servers for DNS, SMTP/POP/IMAP, NFS, etc. The measured links experience a sustained traffic rate of roughly 100Mbps, with bursts of up to 500Mbps.

6.3 Parameter tuning

The key parameters used by our algorithm are the content prevalence threshold (currently 3), the address dispersion threshold (currently 30 sources and 30 destinations), and the time to garbage collect address dispersion table entries (currently several hours). We now describe the rationale behind these initial choices.

Content prevalence threshold: Figure 8 shows the distribution of signature repetitions on a trace for different hash functions. For example, using a 60 second measurement interval and a whole-packet CRC, over 97 percent of all signatures repeat two or fewer times and 94.5 percent are observed only once. Using a finer-grained content hash or a longer measurement interval increases these numbers even further. However, to a first approximation, all reasonable values of these parameters reveal that very few signatures ever repeat more than 3 times.

Recall that the principal benefit provided by the content prevalence table is to remove from consideration the enormous number of substrings which appear rarely and therefore are not possible worm signature candidates. We have repeated these experiments on several datasets at differing times and observed the same pattern. Consequently, for the remainder of this paper we use a prevalence threshold of 3.
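The fingerprinting step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it computes a rolling Rabin-style hash over 40-byte substrings ("Rabin40") and applies 1/64 value sampling by tracking only fingerprints whose low 6 bits are zero. The base, modulus, and window width are assumed illustrative constants.

```python
WIDTH = 40            # substring length ("Rabin40" in the text)
BETA = 257            # hash base (assumed)
MOD = (1 << 61) - 1   # large prime modulus (assumed)
SAMPLE_MASK = 0x3F    # low 6 bits == 0  =>  sampling probability 1/64

def sampled_fingerprints(payload: bytes):
    """Yield (offset, fingerprint) for each sampled WIDTH-byte substring.

    The hash of the window starting at i+1 is derived from the hash at i
    in O(1) by removing the outgoing byte and appending the incoming one,
    so the whole payload is fingerprinted in a single linear pass.
    """
    if len(payload) < WIDTH:
        return
    top = pow(BETA, WIDTH - 1, MOD)  # weight of the outgoing byte
    h = 0
    for i in range(WIDTH):           # hash of the first window
        h = (h * BETA + payload[i]) % MOD
    if h & SAMPLE_MASK == 0:
        yield 0, h
    for i in range(WIDTH, len(payload)):
        # slide the window: drop payload[i-WIDTH], add payload[i]
        h = ((h - payload[i - WIDTH] * top) * BETA + payload[i]) % MOD
        if h & SAMPLE_MASK == 0:
            yield i - WIDTH + 1, h
```

Because sampling is a deterministic function of the fingerprint value, every sensor samples the same substrings of a worm payload, which is what makes the 1/64 reduction safe for detection.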
Address dispersion threshold: Once a signature has passed the prevalence threshold, it is still unlikely that it represents a worm. Figure 9 shows the number of distinct signatures found, as a function of time, for different address dispersion thresholds. For example, after 10 minutes there are over 1000 signatures (note the log scale) with a low dispersion threshold of 2, meaning that the same string has been observed in packets with two different source IP addresses and two different destination IP addresses. However, as the dispersion threshold is increased, the number of such strings decreases dramatically. By the time a threshold of 30 is reached, there are only 5 or 6 prevalent strings meeting the dispersion criteria, and the increase in this number is very slow over time. In this particular trace, two of these strings represent live worms and the others are benign but consistently reoccurring strings that are post-filtered by a whitelist.

Figure 8: Cumulative distribution function of content signatures for different hash functions (CRC 60sec, CRC 300sec, Rabin40 60sec). This CDF is computed from the set of repetitions found in each measurement interval over a period of 10 minutes. Note that the y axis is artificially truncated to show more detail.

Figure 9: Number of distinct signatures detected over time for different address dispersion thresholds (S>1 AND D>1, S>4 AND D>4, S>9 AND D>9, S>30 AND D>30).

Note that there is an inherent tradeoff between the speed of detecting a new worm and the likelihood of a false positive. By using a lower dispersion threshold one can respond to a worm more quickly, but it is increasingly likely that many such signatures will be benign.
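The two-stage filter described above can be sketched as a small data structure. This is an assumed simplification, not the authors' code: it uses plain Python sets to count distinct source and destination addresses per signature, where a production implementation would use compact approximate counters, and it applies the prevalence threshold of 3 and the S>30 AND D>30 dispersion test from the text.

```python
from collections import defaultdict

PREVALENCE_THRESHOLD = 3
DISPERSION_THRESHOLD = 30   # 30 sources AND 30 destinations

class DispersionTable:
    """Two-stage filter: content prevalence, then address dispersion."""

    def __init__(self):
        self.counts = defaultdict(int)   # repetitions per signature
        # per signature: (distinct source IPs, distinct destination IPs)
        self.dispersion = defaultdict(lambda: (set(), set()))

    def update(self, signature, src_ip, dst_ip):
        """Process one observation; return True once the signature
        exceeds both the prevalence and dispersion thresholds."""
        self.counts[signature] += 1
        if self.counts[signature] < PREVALENCE_THRESHOLD:
            return False    # too rare to track addresses yet
        srcs, dsts = self.dispersion[signature]
        srcs.add(src_ip)
        dsts.add(dst_ip)
        return (len(srcs) > DISPERSION_THRESHOLD and
                len(dsts) > DISPERSION_THRESHOLD)
```

The tradeoff in the text falls directly out of this structure: lowering DISPERSION_THRESHOLD makes update return True after fewer distinct address pairs, so a real worm is flagged sooner, but benign strings exchanged among a handful of hosts are also more likely to qualify.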
For example, we find that the Slammer worm signature is detected within one second with an address dispersion threshold of 2, yet takes up to 5 seconds to discover using the more conservative threshold of 30. At the same time there are two orders of magnitude more signatures that will be reported with the lowest dispersion threshold – most of which will likely be false positives. Garbage collection: The final key parameter of our algorithm is the elapsed time before an entry in the address dispersion table is garbage collected. The impact of this setting is shown in Figure 10. When the timeout is set to 100 seconds, then almost 60 percent of all signatures are garbage collected before a subsequent update