COMP 578 Data Warehousing & Data Mining
Ch 2 Discovering Association Rules
Keith C.C. Chan
Department of Computing
The Hong Kong Polytechnic University

The AR Mining Problem
◼ Given a database of transactions.
  ◼ Each transaction is a list of items.
  ◼ E.g., the items purchased by a customer in a visit.
◼ Find all rules that correlate the presence of one set of items with that of another set of items.
  ◼ E.g., 30% of people who buy diapers also buy beer.

Motivation & Applications (1)
◼ If we can find such associations, we will be able to answer:
  ◼ ??? → beer (What should the company do to boost beer sales?)
  ◼ Diapers → ??? (What other products should the store stock up on?)
◼ Attached mailing in direct marketing.

Motivation & Applications (2)
◼ Originally for marketing to understand purchasing trends.
  ◼ What products or services do customers tend to purchase at the same time, or later on?
◼ Use market basket analysis to plan:
  ◼ Coupons and discounting:
    ◼ Do not offer simultaneous discounts on beer and diapers if they tend to be bought together.
    ◼ Discount one to pull in sales of the other.
  ◼ Product placement:
    ◼ Place products that have a strong purchasing relationship close together.
    ◼ Or place such products far apart to increase traffic past other items.

Measure of Interestingness
◼ For a data mining algorithm to mine for interesting association rules, users have to define a measure of “interestingness”.
◼ Two popular interestingness measures have been proposed:
  ◼ Support and Confidence
  ◼ Lift Ratio (Interest)
◼ MineSet from SGI uses the terms prevalence and predictability instead of support and confidence, respectively.

The Support and Confidence
Given rule X & Y => Z
◼ Support, S = P(X ∪ Y ∪ Z), where A ∪ B indicates that a transaction contains both A and B (the union of item sets A and B).
  [# of tuples containing X, Y and Z / total # of tuples]
◼ Confidence, C = P(Z | X ∪ Y), the conditional probability that a transaction containing X ∪ Y also contains Z.
  [# of tuples containing X, Y and Z / # of tuples containing X and Y]

The Support and Confidence
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]
Let minimum support be 50% and minimum confidence be 50%. Find the S and C of:
1. A → C
2. C → A
Answer:
A → C (50%, 66.7%)
C → A (50%, 100%)
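
The answer above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `support` and `confidence` and the dictionary layout are illustrative assumptions, and the data is the four-transaction table from the slide.

```python
# Toy transactions copied from the slide's table (transaction ID -> items bought).
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= items for items in db.values()) / len(db)

def confidence(antecedent, consequent, db):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, db) / support(antecedent, db)

for lhs, rhs in [("A", "C"), ("C", "A")]:
    s = support({lhs, rhs}, transactions)
    c = confidence({lhs}, {rhs}, transactions)
    print(f"{lhs} -> {rhs}: support={s:.1%}, confidence={c:.1%}")
```

Both rules share the same support because support depends only on the combined item set {A, C}; the confidences differ because A appears in three transactions and C in only two.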

How Good is a Predictive Model?
◼ Response curves: how does the response rate of a targeted selection compare to a random selection?
[Figure: response curves for optimal, targeted, and random selection; y-axis: response rate (up to 100%), x-axis: customers ordered from most likely to least likely to respond]
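
One common way to build such a curve is to rank customers by a model score and track the cumulative share of responders captured as more customers are contacted. The sketch below takes that reading as an assumption (the slide does not define the axes precisely); the function `response_curve`, the scores, and the response flags are all hypothetical.

```python
def response_curve(scores, responded):
    """Cumulative share of responders captured, contacting highest scores first."""
    total = sum(responded)
    if total == 0:
        raise ValueError("at least one responder is required")
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    captured, curve = 0, []
    for _, flag in ranked:
        captured += flag
        curve.append(captured / total)
    return curve

# Hypothetical data: model scores and 1/0 response flags for eight customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
responded = [1, 1, 0, 1, 0, 0, 1, 0]

targeted = response_curve(scores, responded)
# A random selection captures responders in proportion to the share contacted.
random_baseline = [(i + 1) / len(scores) for i in range(len(scores))]

for contacted, (t, r) in enumerate(zip(targeted, random_baseline), start=1):
    print(f"top {contacted}: targeted={t:.2f}, random={r:.2f}")
```

The further the targeted curve sits above the random baseline, the more useful the model is for selecting whom to contact.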

What is a Lift Ratio? (1)
◼ Consider the rule:
  ◼ When people buy diapers they also buy beer 50 percent of the time.
  ◼ It states an explicit percentage (50% of the time).
◼ Consider this other rule:
  ◼ People who purchase a VCR are three times more likely to also purchase a camcorder.
  ◼ The rule uses the comparative phrase “three times more likely”.

What is a Lift Ratio? (2)
◼ The probability is compared to the baseline likelihood.
  ◼ The baseline likelihood is the probability of the event occurring independently.
  ◼ E.g., if people normally buy beer 5% of the time, then the diapers-and-beer rule could have said “10 times more likely” (50% / 5% = 10).
◼ The ratio in this kind of comparison is called lift.
◼ A key goal of an association rule mining exercise is to find rules that have the desired lift.
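
The lift calculation can be sketched in Python using the numbers from the slides (a rule confidence of 50% and a baseline beer purchase rate of 5%). The function name `lift` and its parameter names are illustrative assumptions, not taken from any particular tool.

```python
def lift(rule_confidence, baseline):
    """Ratio of a rule's confidence to the consequent's baseline probability."""
    if baseline <= 0:
        raise ValueError("baseline probability must be positive")
    return rule_confidence / baseline

# Numbers from the slides: diapers -> beer holds 50% of the time,
# and beer is bought 5% of the time overall.
diapers_beer_lift = lift(0.50, 0.05)
print(f"lift = {diapers_beer_lift:.1f}")
```

A lift above 1 means the consequent is more likely when the antecedent is present than it is at baseline; a lift near 1 suggests the rule adds little beyond the baseline rate.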