Big Data Analysis and Mining
Decision Tree
Qinpei Zhao 赵钦佩
qinpeizhao@tongji.edu.cn
2015 Fall
2021/2/9
Illustrating Classification Task

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Induction: Training Set → Learning algorithm → Learn Model → Model
Deduction: Model + Test Set → Apply Model
Classification: Definition

◼ Given a collection of records (training set)
  ◆ Each record contains a set of attributes; one of the attributes is the class.
◼ Find a model for the class attribute as a function of the values of the other attributes.
◼ Goal: previously unseen records should be assigned a class as accurately as possible.
  ◆ A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
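The train/test workflow in the definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`train_test_split`, `learn_majority_class`, `accuracy`), the 30% test fraction, and the fixed seed are all hypothetical, and the "model" is a majority-class placeholder, not a decision tree. The ten labelled records are the ones from the "Illustrating Classification Task" slide.

```python
import random
from collections import Counter

# Labelled records: (Attrib1, Attrib2, Attrib3 in thousands, Class)
records = [
    ("Yes", "Large", 125, "No"), ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"), ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"), ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"), ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"), ("No", "Small", 90, "Yes"),
]

def train_test_split(data, test_fraction=0.3, seed=0):
    """Shuffle a copy of the data and return (training_set, test_set)."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def learn_majority_class(training_set):
    """Placeholder learner (an assumption, not a decision tree):
    the model always predicts the most common class in the training set."""
    majority = Counter(r[-1] for r in training_set).most_common(1)[0][0]
    return lambda attributes: majority

def accuracy(model, labelled_set):
    """Fraction of records whose predicted class equals the true class."""
    hits = sum(model(r[:-1]) == r[-1] for r in labelled_set)
    return hits / len(labelled_set)

training_set, test_set = train_test_split(records)
model = learn_majority_class(training_set)   # build the model on the training set
test_accuracy = accuracy(model, test_set)    # validate it on the held-out test set
print(f"{len(training_set)} training records, {len(test_set)} test records")
print(f"test accuracy: {test_accuracy:.2f}")
```

The point of the sketch is the separation of roles: the model never sees the test records while it is being built, so the measured accuracy reflects performance on previously unseen records.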
Examples of Classification Task

◼ Predicting tumor cells as benign or malignant
◼ Classifying credit card transactions as legitimate or fraudulent
◼ Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
◼ Categorizing news stories as finance, weather, entertainment, sports, etc.
What is a Decision Tree?

◼ An inductive learning task
  ◆ Use particular facts to make more generalized conclusions
◼ A predictive model based on a branching series of Boolean tests
  ◆ These smaller Boolean tests are less complex than a one-stage classifier
◼ Let's look at a sample decision tree…
Example: Tax cheating

Training Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund?
  Yes → NO
  No  → MarSt?
          Married          → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
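Read as a branching series of Boolean tests, the tree above amounts to a few nested conditions. The following is a minimal sketch, not the author's code: the function name `classify` and the argument names are made up for illustration, and treating an income of exactly 80K as "not below 80K" is an assumption, since the slide does not say which branch that value takes.

```python
def classify(refund, marital_status, taxable_income_k):
    """Predict the Cheat class by walking the Refund -> MarSt -> TaxInc tree."""
    if refund == "Yes":                  # test 1: Refund
        return "No"
    if marital_status == "Married":      # test 2: MarSt
        return "No"
    # test 3: TaxInc (assumption: exactly 80K goes to the "Yes" branch)
    if taxable_income_k < 80:
        return "No"
    return "Yes"

# Training data from the slide: (Refund, Marital Status, Taxable Income in K, Cheat)
training_data = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

# Count how many training records the tree classifies correctly.
correct = sum(classify(r, m, inc) == cheat for r, m, inc, cheat in training_data)
print(f"{correct} of {len(training_data)} training records classified correctly")
```

Each test looks at a single attribute, which is what makes the individual decisions simpler than a classifier that has to weigh all attributes at once.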
Example: Tax cheating

Training Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree
MarSt?
  Married          → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                               < 80K → NO
                               > 80K → YES

There could be more than one tree that fits the same data!
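To make the last remark concrete, the sketch below encodes both trees from the tax-cheating example and checks each against the ten training records. The function names (`tree_refund_first`, `tree_marst_first`, `fits`) are hypothetical, and sending an income of exactly 80K to the "Yes" branch is an assumption that the slides leave open.

```python
# Training data from the slide: (Refund, Marital Status, Taxable Income in K, Cheat)
training_data = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

def tree_refund_first(refund, marital_status, income_k):
    """Tree with Refund at the root, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "No" if income_k < 80 else "Yes"   # assumption: 80K itself -> "Yes"

def tree_marst_first(refund, marital_status, income_k):
    """Tree with MarSt at the root, then Refund, then TaxInc."""
    if marital_status == "Married":
        return "No"
    if refund == "Yes":
        return "No"
    return "No" if income_k < 80 else "Yes"   # assumption: 80K itself -> "Yes"

def fits(tree, data):
    """True if the tree predicts the recorded class for every record."""
    return all(tree(r, m, inc) == cheat for r, m, inc, cheat in data)

print("Refund-first tree fits:", fits(tree_refund_first, training_data))
print("MarSt-first tree fits:", fits(tree_marst_first, training_data))
```

Both checks succeed, so the training data alone does not single out one tree; the two differ only in the order in which the attributes are tested.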
Decision Tree Classification Task

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Induction: Training Set → Tree Induction algorithm → Learn Model → Decision Tree
Deduction: Decision Tree + Test Set → Apply Model
Apply Model to Test Data

Test Data
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of tree.

Refund?
  Yes → NO
  No  → MarSt?
          Married          → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
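Applying the model means starting at the root and following, at each node, the branch that matches the record until a leaf is reached. The sketch below shows one way to do this with the tree stored as nested data. The representation (tuples and dicts), the helper names `branch_for` and `classify`, and the ">= 80K" label for the right-hand TaxInc branch are all assumptions made for illustration.

```python
# Internal node: (attribute name, {branch label: subtree}); leaf: a class label string.
tree = ("Refund", {
    "Yes": "NO",
    "No": ("MarSt", {
        "Married": "NO",
        "Single, Divorced": ("TaxInc", {
            "< 80K": "NO",
            ">= 80K": "YES",   # assumption: exactly 80K takes this branch
        }),
    }),
})

def branch_for(attribute, record):
    """Map the record's value for this attribute to one of the branch labels."""
    value = record[attribute]
    if attribute == "MarSt":
        return "Married" if value == "Married" else "Single, Divorced"
    if attribute == "TaxInc":
        return "< 80K" if value < 80 else ">= 80K"
    return value  # Refund: the value is already "Yes" or "No"

def classify(node, record):
    """Walk from the root to a leaf; return (predicted class, tests taken)."""
    path = []
    while not isinstance(node, str):          # stop when a leaf is reached
        attribute, branches = node
        label = branch_for(attribute, record)
        path.append(f"{attribute} = {label}")
        node = branches[label]
    return node, path

# Test record from the slide (Taxable Income in thousands)
test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
predicted, path = classify(tree, test_record)
print(" -> ".join(path), "=>", predicted)
# prints: Refund = No -> MarSt = Married => NO
```

For this record the walk ends at the MarSt node, so the TaxInc test is never evaluated and the 80K boundary assumption does not affect the prediction.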