Chapter 7. Classification: Basic Concepts
◼ Classification: Basic Concepts
◼ Decision Tree Induction
◼ Bayes Classification Methods
◼ Rule-Based Classification
◼ Model Evaluation and Selection
◼ Summary
Classification vs. Prediction
◼ Classification
  ◼ predicts categorical class labels (discrete or nominal)
  ◼ classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
◼ Prediction
  ◼ models continuous-valued functions, i.e., predicts unknown or missing values
Classification: Definition
◼ Given a collection of records (training set)
  ◼ Each record contains a set of attributes; one of the attributes is the class.
◼ Find a model for the class attribute as a function of the values of the other attributes.
◼ Goal: previously unseen records should be assigned a class as accurately as possible.
  ◼ A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it. A sketch of this workflow appears below.
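As a minimal sketch of this train/test workflow (scikit-learn and the iris data set are illustrative assumptions; the slides do not prescribe a library or data set):

    # Minimal sketch of the train/test workflow described above.
    # scikit-learn and the iris data set are illustrative assumptions.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)        # records (attributes) and class labels

    # Divide the given data set into a training set (to build the model)
    # and a test set (to validate it).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier().fit(X_train, y_train)   # build model on training set
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))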
Supervised vs. Unsupervised Learning
◼ Supervised learning (classification)
  ◼ Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  ◼ New data is classified based on the training set
◼ Unsupervised learning (clustering)
  ◼ The class labels of the training data are unknown
  ◼ Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data (see the contrast sketched below)
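A small contrast of the two settings (the toy points, scikit-learn, and the specific estimators are illustrative assumptions, not part of the slides):

    # Supervised: class labels accompany the training data and guide the model.
    # Unsupervised: no labels; clusters are discovered from the data alone.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.cluster import KMeans

    X = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]])
    y = np.array([0, 0, 1, 1])                            # class labels (supervision)

    clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)   # supervised learning
    print(clf.predict([[1.2, 0.9]]))                      # new data classified from the training set

    km = KMeans(n_clusters=2, n_init=10).fit(X)           # unsupervised learning
    print(km.labels_)                                     # clusters found without labels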
Prediction Problems: Classification vs. Numeric Prediction
◼ Classification
  ◼ predicts categorical class labels (discrete or nominal)
  ◼ classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
◼ Numeric Prediction
  ◼ models continuous-valued functions, i.e., predicts unknown or missing values (contrasted with classification in the sketch below)
◼ Typical applications
  ◼ Credit/loan approval
  ◼ Medical diagnosis: whether a tumor is cancerous or benign
  ◼ Fraud detection: whether a transaction is fraudulent
  ◼ Web page categorization: which category a page belongs to
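The same attributes can feed either kind of prediction; the sketch below contrasts the two (the loan-style toy data and the scikit-learn estimators are assumptions):

    # Classification predicts a categorical label; numeric prediction a continuous value.
    # Hypothetical loan-style data; scikit-learn assumed.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    X = np.array([[25, 30], [40, 90], [35, 60], [50, 120]])   # age, income (in K)
    approve = np.array(["no", "yes", "no", "yes"])            # categorical class label
    credit_limit = np.array([1000, 8000, 3000, 12000])        # continuous-valued target

    clf = DecisionTreeClassifier().fit(X, approve)            # classification
    reg = DecisionTreeRegressor().fit(X, credit_limit)        # numeric prediction

    print(clf.predict([[45, 100]]))    # e.g. ['yes']
    print(reg.predict([[45, 100]]))    # e.g. a predicted credit limit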
Classification: A Two-Step Process
◼ Model construction: describing a set of predetermined classes
  ◼ Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  ◼ The set of tuples used for model construction is the training set
  ◼ The model is represented as classification rules, decision trees, or mathematical formulae
◼ Model usage: classifying future or unknown objects
  ◼ Estimate the accuracy of the model
    ◼ The known label of each test sample is compared with the classification produced by the model
    ◼ The accuracy rate is the percentage of test set samples that are correctly classified by the model
    ◼ The test set is independent of the training set (otherwise overfitting results)
  ◼ If the accuracy is acceptable, use the model to classify new data
◼ Note: if the test set is used to select among models, it is called a validation (test) set; the sketch below keeps the two roles separate
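A sketch of the two steps, with a validation set for model selection and an independent test set for the final accuracy estimate (scikit-learn and the wine data set are assumptions):

    # Step 1: construct models on the training set; select among them with a validation set.
    # Step 2: estimate accuracy on an independent test set, then use the chosen model.
    from sklearn.datasets import load_wine
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_wine(return_X_y=True)
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    # Model construction + selection: the validation set picks the tree depth.
    candidates = [DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_train, y_train)
                  for d in (2, 4, 8)]
    best = max(candidates, key=lambda m: m.score(X_val, y_val))

    # Model usage: accuracy estimated on a test set kept independent of training.
    print("estimated accuracy:", best.score(X_test, y_test))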
Illustrating Classification Task

Training Set → Learning Algorithm → Model (Induction: Learn Model)
Test Set → Model → Predictions (Deduction: Apply Model)

Training Set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

Test Set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?
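A sketch of this induction/deduction loop using the tables above (pandas, scikit-learn, and one-hot encoding are illustrative choices, not given in the slides):

    # Encode the training set from the table, learn a model (induction),
    # then apply it to the test set (deduction).
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    train = pd.DataFrame({
        "Attrib1": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
        "Attrib2": ["Large", "Medium", "Small", "Medium", "Large",
                    "Medium", "Large", "Small", "Medium", "Small"],
        "Attrib3": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # in K
        "Class":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
    })
    test = pd.DataFrame({
        "Attrib1": ["No", "Yes", "Yes", "No", "No"],
        "Attrib2": ["Small", "Medium", "Large", "Small", "Large"],
        "Attrib3": [55, 80, 110, 95, 67],
    })

    X_train = pd.get_dummies(train.drop(columns="Class"))          # induction: learn model
    model = DecisionTreeClassifier(random_state=0).fit(X_train, train["Class"])

    X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)
    print(model.predict(X_test))                                   # deduction: apply model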
Process (1): Model Construction

Training Data → Classification Algorithms → Classifier (Model)

Training Data:
  NAME  RANK            YEARS  TENURED
  Mike  Assistant Prof  3      no
  Mary  Assistant Prof  7      yes
  Bill  Professor       2      yes
  Jim   Associate Prof  7      yes
  Dave  Assistant Prof  6      no
  Anne  Associate Prof  3      no

Classifier (Model): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
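The induced classifier can be written out directly as a rule; the check below confirms it fits every training tuple (the Python encoding is an illustrative choice):

    # The classifier (model) constructed from the training data above,
    # written out as the rule shown on the slide.
    def tenured(rank: str, years: int) -> str:
        """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
        return "yes" if rank.lower() == "professor" or years > 6 else "no"

    training_data = [
        ("Mike", "Assistant Prof", 3, "no"),
        ("Mary", "Assistant Prof", 7, "yes"),
        ("Bill", "Professor", 2, "yes"),
        ("Jim", "Associate Prof", 7, "yes"),
        ("Dave", "Assistant Prof", 6, "no"),
        ("Anne", "Associate Prof", 3, "no"),
    ]
    # The rule classifies every training tuple correctly.
    assert all(tenured(rank, years) == label for _, rank, years, label in training_data)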
Process (2): Using the Model in Prediction

Testing Data → Classifier (estimate accuracy); Unseen Data → Classifier → prediction

Testing Data:
  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) → Tenured? Yes
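Continuing the sketch from Process (1): the rule is applied to the testing data to estimate accuracy, then to the unseen tuple (the Python encoding is an illustrative choice):

    # Apply the rule from Process (1) to the testing data, then to the unseen tuple.
    def tenured(rank: str, years: int) -> str:
        return "yes" if rank.lower() == "professor" or years > 6 else "no"

    testing_data = [
        ("Tom", "Assistant Prof", 2, "no"),
        ("Merlisa", "Associate Prof", 7, "no"),
        ("George", "Professor", 5, "yes"),
        ("Joseph", "Assistant Prof", 7, "yes"),
    ]
    correct = sum(tenured(rank, years) == label for _, rank, years, label in testing_data)
    print("accuracy on testing data:", correct / len(testing_data))   # 0.75: Merlisa is misclassified

    print("Jeff, Professor, 4 ->", tenured("Professor", 4))           # 'yes'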
Algorithm for Decision Tree Induction
◼ Basic algorithm (a greedy algorithm)
  ◼ The tree is constructed in a top-down, recursive, divide-and-conquer manner
  ◼ At the start, all the training examples are at the root
  ◼ Attributes are categorical (if continuous-valued, they are discretized in advance)
  ◼ Examples are partitioned recursively based on selected attributes
  ◼ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
◼ Conditions for stopping partitioning
  ◼ All samples for a given node belong to the same class
  ◼ There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
  ◼ There are no samples left
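A minimal sketch of this greedy, top-down induction for categorical attributes, using information gain as the selection measure and majority voting when attributes run out (the data layout and the bucketed professor data are illustrative assumptions):

    # Greedy, top-down, recursive, divide-and-conquer tree induction (ID3-style),
    # selecting test attributes by information gain.
    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        """Expected reduction in entropy from partitioning on attr."""
        total, n = entropy(labels), len(labels)
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            total -= len(subset) / n * entropy(subset)
        return total

    def build_tree(rows, labels, attrs):
        # Stop: all samples at this node belong to the same class.
        if len(set(labels)) == 1:
            return labels[0]
        # Stop: no remaining attributes (or no samples) -- majority voting.
        if not attrs or not rows:
            return Counter(labels).most_common(1)[0][0]
        # Greedy step: pick the attribute with the highest information gain.
        best = max(attrs, key=lambda a: info_gain(rows, labels, a))
        tree = {best: {}}
        for value in set(row[best] for row in rows):
            part = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
            sub_rows, sub_labels = [r for r, _ in part], [l for _, l in part]
            tree[best][value] = build_tree(sub_rows, sub_labels,
                                           [a for a in attrs if a != best])
        return tree

    # Toy usage with the professor data from the earlier slides (years bucketed,
    # since this sketch handles categorical attributes only):
    rows = [{"rank": "assistant", "years": "<=6"}, {"rank": "assistant", "years": ">6"},
            {"rank": "professor", "years": "<=6"}, {"rank": "associate", "years": ">6"},
            {"rank": "assistant", "years": "<=6"}, {"rank": "associate", "years": "<=6"}]
    labels = ["no", "yes", "yes", "yes", "no", "no"]
    print(build_tree(rows, labels, ["rank", "years"]))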