Querying and Mining Data Streams you only Get One Look A Tutorial Minos Garofalakis Johannes Gehrke° Rajeev rastogi ○Be| Laboratories cOrnell University Garofalakis, Gehrke, Rastogi, VLDB02 #1
Garofalakis, Gehrke, Rastogi, VLDB’02 # 1 Querying and Mining Data Streams: You Only Get One Look A Tutorial Minos Garofalakis Johannes Gehrke Rajeev Rastogi Bell Laboratories Cornell University
Ou uTIne Introduction Motivation Stream computation model applications Basic stream synopses computation Samples, Equi-depth histograms, Wavelets Mining data streams Decision trees, clustering, association rules Sketch-based computation techniques Self-joins, Joins, Wavelets, V-optimal histograms Advanced techniques Sliding windows, Distinct values, Hot lists Future directions Conclusions Garofalakis, Gehrke, Rastogi, VLDB02 #2
Garofalakis, Gehrke, Rastogi, VLDB’02 # 2 Outline • Introduction & Motivation – Stream computation model, Applications • Basic stream synopses computation – Samples, Equi-depth histograms, Wavelets • Mining data streams – Decision trees, clustering, association rules • Sketch-based computation techniques – Self-joins, Joins, Wavelets, V-optimal histograms • Advanced techniques – Sliding windows, Distinct values, Hot lists • Future directions & Conclusions
Processing data Streams: MotivatioAont sdmelege O CRe A growing number of applications generate streams of data Performance measurements in network monitoring and traffic management Call detail records in telecommunications Transactions in retail chains, A TM operations in banks Log records generated by Web Servers Sensor network data Application characteristics Massive volumes of data( several terabytes Records arrive at a rapid rate Goa: Mine patterns, process queries and compute statistics on data streams in real-time Garofalakis, Gehrke, Rastogi, VLDB02 #3
Garofalakis, Gehrke, Rastogi, VLDB’02 # 3 Processing Data Streams: Motivation • A growing number of applications generate streams of data – Performance measurements in network monitoring and traffic management – Call detail records in telecommunications – Transactions in retail chains, ATM operations in banks – Log records generated by Web Servers – Sensor network data • Application characteristics – Massive volumes of data (several terabytes) – Records arrive at a rapid rate • Goal: Mine patterns, process queries and compute statistics on data streams in real-time
Data Streams: Computation Model A data stream is a(massive) sequence of elements:el,-.,en Synopsis in Memory Data streams Stream Processing (Apro×mate) Engine Answer Stream processing requirements Single pass: Each record is examined at most once Bounded storage: Limited Memory(M)for storing synopsis Real-time: Per record processing time(to maintain synopsis)must be low Garofalakis, Gehrke, Rastogi, VLDB'02 #4
Garofalakis, Gehrke, Rastogi, VLDB’02 # 4 Data Streams: Computation Model • A data stream is a (massive) sequence of elements: • Stream processing requirements – Single pass: Each record is examined at most once – Bounded storage: Limited Memory (M) for storing synopsis – Real-time: Per record processing time (to maintain synopsis) must be low Stream Processing Engine (Approximate) Answer Synopsis in Memory Data Streams e en ,..., 1
Network Management Application uren gdm: O owL Network Management involves monitoring and configuring network hardware and software to ensure smooth operation Monitor link bandwidth usage, estimate traffic demands Quickly detect faults, congestion and isolate root cause oad balancing, improve utilization of network resources Network Operations Measurements Center Alarms Network Garofalakis, Gehrke, Rastogi, VLDB02 #5
Garofalakis, Gehrke, Rastogi, VLDB’02 # 5 Network Management Application • Network Management involves monitoring and configuring network hardware and software to ensure smooth operation – Monitor link bandwidth usage, estimate traffic demands – Quickly detect faults, congestion and isolate root cause – Load balancing, improve utilization of network resources Network Operations Center Network Measurements Alarms
IP Network Measurement Data O Po IP session data (collected using Cisco NetFlow Source Destination Duration Bytes Protocol 10.1.0.2 16.2.3.7 12 20K http 18.6.7.1 124.0.3 24K http 13.94.3 116.8.2 15 20K 152.2.9 17.1.2.1 9 40K http 12.4.38 14.8.7.4 26 58K http 10.5.1.3 13.0.0.1 27 100K ftp 11.1.0.6 10.3.4.5 32 300K ftp 19.7.1.2 16.5.5.8 18 80K ft a t&t collects 100 GBs of net flow data each day! Garofalakis, Gehrke, Rastogi, VLDB02 #6
Garofalakis, Gehrke, Rastogi, VLDB’02 # 6 IP Network Measurement Data • IP session data (collected using Cisco NetFlow) • AT&T collects 100 GBs of NetFlow data each day! Source Destination Duration Bytes Protocol 10.1.0.2 16.2.3.7 12 20K http 18.6.7.1 12.4.0.3 16 24K http 13.9.4.3 11.6.8.2 15 20K http 15.2.2.9 17.1.2.1 19 40K http 12.4.3.8 14.8.7.4 26 58K http 10.5.1.3 13.0.0.1 27 100K ftp 11.1.0.6 10.3.4.5 32 300K ftp 19.7.1.2 16.5.5.8 18 80K ftp
Network Data Processing Traffic estimation How many bytes were sent between a pair of Ip addresses? What fraction network ip addresses are active? List the top 100 iP addresses in terms of traffic Traffic analysis What is the average duration of an IP session? What is the median of the number of bytes in each IP session? raud detection List all sessions that transmitted more than 1000 bytes Identify all sessions whose duration was more than twice the normal Security/Denial of Service List all IP addresses that have witnessed a sudden spike in traffic Identify ip addresses involved in more than 1000 sessions Garofalakis, Gehrke, Rastogi, VLDB'02 #7
Garofalakis, Gehrke, Rastogi, VLDB’02 # 7 Network Data Processing • Traffic estimation – How many bytes were sent between a pair of IP addresses? – What fraction network IP addresses are active? – List the top 100 IP addresses in terms of traffic • Traffic analysis – What is the average duration of an IP session? – What is the median of the number of bytes in each IP session? • Fraud detection – List all sessions that transmitted more than 1000 bytes – Identify all sessions whose duration was more than twice the normal • Security/Denial of Service – List all IP addresses that have witnessed a sudden spike in traffic – Identify IP addresses involved in more than 1000 sessions
Data Stream Processing Algorithms Generally, algorithms compute approximate answers Difficult to compute answers accurately with limited memory Approximate answers-Deterministic bounds Algorithms only compute an approximate answer, but bounds on error Approximate answers-Probabilistic bounds Algorithms compute an approximate answer with high probability With probability at least 1-8, the computed answer is within a factor f of the actual answer Single-pass algorithms for processing streams also pplicable to(massive) terabyte databases Garofalakis, Gehrke, Rastogi, VLDB02 #8
Garofalakis, Gehrke, Rastogi, VLDB’02 # 8 Data Stream Processing Algorithms • Generally, algorithms compute approximate answers – Difficult to compute answers accurately with limited memory • Approximate answers - Deterministic bounds – Algorithms only compute an approximate answer, but bounds on error • Approximate answers - Probabilistic bounds – Algorithms compute an approximate answer with high probability • With probability at least , the computed answer is within a factor of the actual answer • Single-pass algorithms for processing streams also applicable to (massive) terabyte databases! 1−
Ou uTIne Introduction Motivation Basic stream synopses computation Samples: Answering queries using samples, Reservoir sampling Histograms: Equi-depth histograms, On-line quantile computation Wavelets: Haar-wavelet histogram construction maintenance Mining data streams Sketch-based computation techniques Advanced techniques Future directions conclusions Garofalakis, Gehrke, Rastogi, VLDB02 #9
Garofalakis, Gehrke, Rastogi, VLDB’02 # 9 Outline • Introduction & Motivation • Basic stream synopses computation – Samples: Answering queries using samples, Reservoir sampling – Histograms: Equi-depth histograms, On-line quantile computation – Wavelets: Haar-wavelet histogram construction & maintenance • Mining data streams • Sketch-based computation techniques • Advanced techniques • Future directions & Conclusions
Sampling: Basics Idea: A small random sample s of the data often well represents all the data For a fast approx answer, apply"modified"query to s Example: select ggq from r where R e is odd Data stream: 9 3 5 2 7 165849 Sample s: 9 5 1 8 If agg is avg, return average of odd elements in s answer: 5 If agg is count, return average over all elements e in S of n if e is odd answer:12*3/4=9 o if e is even Unbiased: For expressions involving count, sum, avg: the estimator is unbiased, i. e, the expected value of the answer is the actual answer Garofalakis, Gehrke, Rastogi, VLDB02 #10
Garofalakis, Gehrke, Rastogi, VLDB’02 # 10 Sampling: Basics • Idea: A small random sample S of the data often wellrepresents all the data – For a fast approx answer, apply “modified” query to S – Example: select agg from R where R.e is odd (n=12) – If agg is avg, return average of odd elements in S – If agg is count, return average over all elements e in S of • n if e is odd • 0 if e is even Unbiased: For expressions involving count, sum, avg: the estimator is unbiased, i.e., the expected value of the answer is the actual answer Data stream: 9 3 5 2 7 1 6 5 8 4 9 1 Sample S: 9 5 1 8 answer: 5 answer: 12*3/4 =9