COMP 578 Discovering Classification Rules Keith c.c. chan Department of Computing The Hong Kong Polytechnic University
COMP 578 Discovering Classification Rules Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University
An Example Classification Problem Patient records Recovered Symptoms Treatment Not recovered A B?
2 An Example Classification Problem Patient Records Symptoms & Treatment Recovered Not Recovered A? B?
Classification in Relational DB Patient Symptom Treatment/Recovered Mike Headache TypeAYes Mary Fever typeANo Bill Cough Type B2 No Fever Type C1Yes Dave Doug h Type C1Yes Anne Headache Type B2 Yes Will John, having a headache Class Label and treated with Type CI recover?
3 Classification in Relational DB Patient Symptom TreatmentRecovered Mike Headache Type A Yes Mary Fever Type A No Bill Cough Type B2 No Jim Fever Type C1 Yes Dave Cough Type C1 Yes Anne Headache Type B2 Yes Class Label Will John, having a headache and treated with Type C1, recover?
Discovering of Classification Rules Minins Classification Training Rules Data NAME Symptom Treat. Recover? Mike Headache Type AYes Mary Fever Type ANo Classification Cough Type B2No Rules Jim Fever ype C1 Yes Dave Cough Typ ype C1 YesIF Symptom=Headache Anne Headache Type B2 Yes AND Treatment=Cl Then Recover Yes Based on the classification rule discovered. John will recover
4 Discovering of Classification Rules Training Data NAME Symptom Treat. Recover? Mike Headache Type A Yes Mary Fever Type A No Bill Cough Type B2 No Jim Fever Type C1 Yes Dave Cough Type C1 Yes Anne Headache Type B2 Yes Mining Classification Rules IF Symptom = Headache AND Treatment = C1 THEN Recover = Yes Classification Rules Based on the classification rule discovered, John will recover!!!
The classification problem a Given a database consisting of n records Each record characterized by m attributes Each record pre-classified into p different classes Find A set of classification rules(that constitutes a classification model) that characterizes the different classes so that records not originally in the database can be accurately classified I. e predicting"class labels
5 The Classification Problem Given: – A database consisting of n records. – Each record characterized by m attributes. – Each record pre-classified into p different classes. Find: – A set of classification rules (that constitutes a classification model) that characterizes the different classes – so that records not originally in the database can be accurately classified. – I.e “predicting” class labels
Typical Applications Credit approval Classes can be high risk Low risk? 罐 Target marketing What are the classes? Medical diagnosis Classes can be customers with different diseases w Treatment effectiveness analysIs Classes can be patience with different degrees of recovery
6 Typical Applications Credit approval. – Classes can be High Risk, Low Risk? Target marketing. – What are the classes? Medical diagnosis – Classes can be customers with different diseases. Treatment effectiveness analysis. – Classes can be patience with different degrees of recovery
Techniques for Discoveirng of Classification Rules s The k-Nearest Neighbor Algorithm s The linear discriminant function s The Bayesian Approach s The decision tree approach The Neural Network approach s The genetic algorithm approach
7 Techniques for Discoveirng of Classification Rules The k-Nearest Neighbor Algorithm. The Linear Discriminant Function. The Bayesian Approach. The Decision Tree approach. The Neural Network approach. The Genetic Algorithm approach
Example Using The K-NN Algorithm Salary Age Insurance 15K 28 Bt 31K 39 Buy 41K 53 Buy 10K 45 Buy 14K 55 Bu 25K 27 Not buy 42K 32 Not Buy 18K 38 Not bur 33K 44 Not Buy John earns 24K per month and is 42 years old Will he buy insurance?
8 Example Using The k-NN Algorithm Salary Age Insurance 15K 28 Buy 31K 39 Buy 41K 53 Buy 10K 45 Buy 14K 55 Buy 25K 27 Not Buy 42K 32 Not Buy 18K 38 Not Buy 33K 44 Not Buy John earns 24K per month and is 42 years old. Will he buy insurance?
The k-Nearest Neighbor Algorithm All data records correspond to points in the n Dimensional space Nearest neighbor defined in terms of Euclidean distance s k-nn returns the most common class label among k training examples nearest to xq
9 The k-Nearest Neighbor Algorithm All data records correspond to points in the nDimensional space. Nearest neighbor defined in terms of Euclidean distance. k-NN returns the most common class label among k training examples nearest to xq. . _ + _ xq + _ _ + _ _ +
The K-NN Algorithm(2) N k-nn can be for continuous-valued labels Calculate the mean values of the k nearest neighbors w Distance-weighted nearest neighbor algorithm Weight the contribution of each of the k neighbors according to their distance to the query point x a Advantage X Robust to noisy data by averaging k-nearest neighbors 罐 Disadvantage Distance between neighbors could be dominated by irrelevant attributes 10
10 The k-NN Algorithm (2) k-NN can be for continuous-valued labels. – Calculate the mean values of the k nearest neighbors Distance-weighted nearest neighbor algorithm – Weight the contribution of each of the k neighbors according to their distance to the query point xq Advantage: – Robust to noisy data by averaging k-nearest neighbors Disadvantage: – Distance between neighbors could be dominated by irrelevant attributes. w d x q x i 1 2 ( , )