Chapter 5: Mining Frequent Patterns Association and correlations Basic concepts and a road map Efficient and scalable frequent itemset mining methods Mining various kinds of association rules From association mining to correlation analysIs Constraint-based association mining Summary February 4, 2021 Data Mining: Concepts and Techniques
February 4, 2021 Data Mining: Concepts and Techniques 3 Chapter 5: Mining Frequent Patterns, Association and Correlations ◼ Basic concepts and a road map ◼ Efficient and scalable frequent itemset mining methods ◼ Mining various kinds of association rules ◼ From association mining to correlation analysis ◼ Constraint-based association mining ◼ Summary
What Is frequent pattern analysis? Frequent pattern: a pattern(a set of items, subsequences, substructures etc. that occurs frequently in a data set First proposed by agrawal, imielinski, and Swami [ais3] in the context of frequent itemsets and association rule mining Motivation Finding inherent regularities in data What products were often purchased together?-Beer and diapers? What are the subsequent purchases after buying a pc? What kinds of DNa are sensitive to this new drug Can we automatically classify web documents? pplications Basket data analysis cross-marketing catalog design, sale campaign analysis, Web log(click stream)analysis and dNa sequence analysis February 4, 2021 Data Mining: Concepts and Techniques
February 4, 2021 Data Mining: Concepts and Techniques 4 What Is Frequent Pattern Analysis? ◼ Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set ◼ First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining ◼ Motivation: Finding inherent regularities in data ◼ What products were often purchased together?— Beer and diapers?! ◼ What are the subsequent purchases after buying a PC? ◼ What kinds of DNA are sensitive to this new drug? ◼ Can we automatically classify web documents? ◼ Applications ◼ Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
Why Is Freq. Pattern Mining Important? Discloses an intrinsic and important property of data sets Forms the foundation for many essential data mining tasks Association, correlation, and causality analysis Sequential, structural(e.g,, sub-graph) patterns Pattern analysis in spatiotemporal, multimedia time series and stream data Classification associative classification Cluster analysis: frequent pattern-based clustering Data warehousing iceberg cube and cube -gradient Semantic data compression fascicles Broad applications February 4, 2021 Data Mining: Concepts and Techniques 5
February 4, 2021 Data Mining: Concepts and Techniques 5 Why Is Freq. Pattern Mining Important? ◼ Discloses an intrinsic and important property of data sets ◼ Forms the foundation for many essential data mining tasks ◼ Association, correlation, and causality analysis ◼ Sequential, structural (e.g., sub-graph) patterns ◼ Pattern analysis in spatiotemporal, multimedia, timeseries, and stream data ◼ Classification: associative classification ◼ Cluster analysis: frequent pattern-based clustering ◼ Data warehousing: iceberg cube and cube-gradient ◼ Semantic data compression: fascicles ◼ Broad applications
Basic Concepts: Frequent Patterns and Association rules Transaction-id Items boug Itemset X={X1…,} A.B. D Find all the rules x>with minimum 20 A.C. D support and confidence 30 A,DE pport s, probability that a 40 B,EF transaction contains xu y 50 B, C,,EF confidence, c conditional Customer Customer probability that a transaction DUVS DO buys diaper having X also contains r Let supmin=50%, COl 50% Freg. Pat :A: 3, B: 3, D 4 E: 3, AD: 3y Association rules. Customer A→D(60%,100% buys beer D→A(60%,75%) February 4, 2021 Data Mining: Concepts and Techniques
February 4, 2021 Data Mining: Concepts and Techniques 6 Basic Concepts: Frequent Patterns and Association Rules ◼ Itemset X = {x1 , …, xk} ◼ Find all the rules X → Y with minimum support and confidence ◼ support, s, probability that a transaction contains X Y ◼ confidence, c, conditional probability that a transaction having X also contains Y Let supmin = 50%, confmin = 50% Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3} Association rules: A → D (60%, 100%) D → A (60%, 75%) Customer buys diaper Customer buys both Customer buys beer Transaction-id Items bought 10 A, B, D 20 A, C, D 30 A, D, E 40 B, E, F 50 B, C, D, E, F
Closed patterns and max-Patterns A long pattern contains a combinatorial number of sub- patterns e.g,{a…,a1o0 contains(10)+(10)+…+ (100)=20-1=1.27*1030 sub-patterns Solution: Mine closed patterns and max-patterns instead An itemset X is closed if X is freguent and there exists no super-patternY x, with the same support as X (proposed by pasquier, et al. ICDT99) An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern y X(proposed by Bayardo SigMOd98) Closed pattern is a lossless compression of freq. patterns Reducing the of patterns and rules February 4, 2021 Data Mining: Concepts and Techniques 7
February 4, 2021 Data Mining: Concepts and Techniques 7 Closed Patterns and Max-Patterns ◼ A long pattern contains a combinatorial number of subpatterns, e.g., {a1 , …, a100} contains (100 1 ) + (100 2 ) + … + (1 1 0 0 0 0 ) = 2 100 – 1 = 1.27*1030 sub-patterns! ◼ Solution: Mine closed patterns and max-patterns instead ◼ An itemset X is closed if X is frequent and there exists no super-pattern Y כ X, with the same support as X (proposed by Pasquier, et al. @ ICDT’99) ◼ An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y כ X (proposed by Bayardo @ SIGMOD’98) ◼ Closed pattern is a lossless compression of freq. patterns ◼ Reducing the # of patterns and rules
Closed patterns and max-Patterns ■ Exercise.DB={ 50 a Min_sup=1 What is the set of closed itemset? ■:1 :2 What is the set of max-pattern <a1n…ta100 What is the set of all patterns? February 4, 2021 Data Mining: Concepts and Techniques 8
February 4, 2021 Data Mining: Concepts and Techniques 8 Closed Patterns and Max-Patterns ◼ Exercise. DB = {, } ◼ Min_sup = 1. ◼ What is the set of closed itemset? ◼ : 1 ◼ : 2 ◼ What is the set of max-pattern? ◼ : 1 ◼ What is the set of all patterns? ◼ !!
Chapter 5: Mining Frequent Patterns Association and correlations Basic concepts and a road map Efficient and scalable frequent itemset mining methods Mining various kinds of association rules From association mining to correlation analysIs Constraint-based association mining Summary February 4, 2021 Data Mining: Concepts and Techniques
February 4, 2021 Data Mining: Concepts and Techniques 9 Chapter 5: Mining Frequent Patterns, Association and Correlations ◼ Basic concepts and a road map ◼ Efficient and scalable frequent itemset mining methods ◼ Mining various kinds of association rules ◼ From association mining to correlation analysis ◼ Constraint-based association mining ◼ Summary
Scalable methods for Mining frequent patterns The downward closure property of frequent patterns a Any subset of a frequent itemset must be frequent If ibeer, diaper, nuts is frequent, so is ibeer diaper i. e, every transaction having beer, diaper, nuts also contains (beer, diaper Scalable mining methods: Three major approaches Apriori (agrawal srikant@VLDB94 Freq. pattern growth(ePgrowth-Han, Pei yin @SIGMOD00) Vertical data format approach(Charm-Zaki Hsiao @SDM02) February 4, 2021 Data Mining: Concepts and Techniques 10
February 4, 2021 Data Mining: Concepts and Techniques 10 Scalable Methods for Mining Frequent Patterns ◼ The downward closure property of frequent patterns ◼ Any subset of a frequent itemset must be frequent ◼ If {beer, diaper, nuts} is frequent, so is {beer, diaper} ◼ i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper} ◼ Scalable mining methods: Three major approaches ◼ Apriori (Agrawal & Srikant@VLDB’94) ◼ Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00) ◼ Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)
Apriori: a candidate Generation-and-Test approach Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated /tested (Agrawal srikant @VLDB 94, Mannila, et al.@ KDD 94) Method Initially, scan DB once to get frequent 1-itemset Generate length( k+1)candidate itemsets from length k frequent itemsets Test the candidates against dB Terminate when no frequent or candidate set can be generated February 4, 2021 Data Mining: Concepts and Techniques 11
February 4, 2021 Data Mining: Concepts and Techniques 11 Apriori: A Candidate Generation-and-Test Approach ◼ Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94) ◼ Method: ◼ Initially, scan DB once to get frequent 1-itemset ◼ Generate length (k+1) candidate itemsets from length k frequent itemsets ◼ Test the candidates against DB ◼ Terminate when no frequent or candidate set can be generated
The apriori algorithm-An Example Supin =2 ItemsetIsup Database TDB Itemset I sup IAN 2 Tid Items B}3 IAS 4C}3 {B} A, C D B, C. E can IC 20 2333 30A, B, C, E 但}3 3 40 B, E 2L Itemset sup 2 Itemset L,「 ItemsetIsup {A,B} {A,C} 匚AC}2 2nd scan A, B {A,C} tB, C 2 B,C}2 {A,E} 1B, Ey 3 {B, {B,C} C, El 2 {B,E} C. Itemset 3rd scan Itemset I sup AB, CE B, C,E] 2 February 4, 2021 Data Mining: Concepts and Techniques 12
February 4, 2021 Data Mining: Concepts and Techniques 12 The Apriori Algorithm—An Example Database TDB 1 st scan C1 L1 L2 C2 C2 2 nd scan C3 3 L3 rd scan Tid Items 10 A, C, D 20 B, C, E 30 A, B, C, E 40 B, E Itemset sup {A} 2 {B} 3 {C} 3 {D} 1 {E} 3 Itemset sup {A} 2 {B} 3 {C} 3 {E} 3 Itemset {A, B} {A, C} {A, E} {B, C} {B, E} {C, E} Itemset sup {A, B} 1 {A, C} 2 {A, E} 1 {B, C} 2 {B, E} 3 {C, E} 2 Itemset sup {A, C} 2 {B, C} 2 {B, E} 3 {C, E} 2 Itemset {B, C, E} Itemset sup {B, C, E} 2 Supmin = 2