Statistical Classification

Minsoo Kim
Pomona College
Advisor: Jo Hardin

April 2, 2010
Contents

1 Introduction  5

2 Basic Discriminants  7
  2.1 Linear Discriminant Analysis for Two Populations  7
    2.1.1 Classification of Normal Populations when Σ1 = Σ2 = Σ  9
  2.2 Fisher's Discriminant Analysis  12
  2.3 Further Concerns  14

3 Support Vector Machines  17
  3.1 Lagrange Multipliers  17
  3.2 Hard Margin Support Vector Machines  19
  3.3 L1 Soft Margin Support Vector Machines  25
  3.4 Kernel Methods  28
  3.5 Further Concerns  33

4 Applications in R  35
  4.1 LDA in R  36
  4.2 Support Vector Machines in R  38

5 Conclusion  43
Chapter 1

Introduction

Consider the following problem: suppose you are a doctor whose patient has just been hospitalized with a heart attack. Will he or she have another heart attack within the next six months? Or this one: suppose you are a bank, and given a person who wants to take out a small business loan, will she default on the loan? Or how about this one: how can your email server tell which emails are spam and which ones are actual mail?

Intuitively, there is a sense that all of the above situations can be resolved by examining empirical data and figuring out which factors are important in each. For example, in the heart attack case, one might look through hospitalization records of patients who have had two heart attacks in a six-month period and see whether your patient resembles them in age, demographics, diet and exercise habits, and other clinical measurements. Or for loans, one might evaluate future borrowers on their credit score, income, and number of credit cards, and make a decision based on previous borrowers with similar profiles.

The above situations are examples of classification problems. Classification is a statistical method used to build predictive models that separate and classify new data points. Let us examine the spam filtering example in more detail. The populations we want to distinguish between are junk emails, those emails from advertisers or hackers that fill up your inbox and cause everyone a headache, and real emails. Next, suppose you already have a collection of emails, say N of them, some of them junk and some of them real, and from these emails we want to build a way to filter out spam. We refer to this collection of emails as the training set, which is, in general, a set of already classified data that we use to build our classification model. We refer to the training data as X = {x1, x2, . . . , xN}.

Now, from our training data, we need to determine which variables, called features, we want to measure to assess whether future emails are spam or not. Features can be continuous or discrete: an example of a discrete feature is whether or not the sender of an email comes from a .edu address, and an example of a continuous feature is the percentage of an email that a certain word or character (like "free", "your", or "!") takes up. If m features are measured, each email corresponds to a 1 × m row vector xi of data, for i ∈ {1, . . . , N}, and we refer to the space R^m as the feature space. The training data then form an N × m matrix, where entry xij represents the jth feature of the ith email.

Finally, using our training data X, classification will produce a decision function, D(x), a function
that takes a new data point and produces a prediction of its population.

The three examples above are just a small sample of the uses of classification. Given these preliminaries, this paper will examine various methods of classification, starting with linear discriminant analysis and Fisher's discriminant analysis and moving on to Support Vector Machines (SVMs). Our work will conclude with an analysis of the linear classifiers and SVMs applied to an actual data set.
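To make this notation concrete, here is a small illustrative sketch in R, the language used for the applications in Chapter 4. All feature values, labels, and the 0.05 threshold below are invented for illustration, and the decision function is deliberately naive; it is not one of the classifiers developed later in this paper.

# Toy training set: N = 6 emails and m = 2 features (all values invented).
# Feature 1: proportion of the email taken up by the word "free".
# Feature 2: 1 if the sender's address ends in .edu, 0 otherwise.
X <- matrix(c(0.08, 0,
              0.11, 0,
              0.00, 1,
              0.01, 1,
              0.07, 0,
              0.00, 0),
            ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("pct_free", "edu_sender")))
y <- c("spam", "spam", "real", "real", "spam", "real")  # known labels

# A deliberately naive decision function D(x): call an email spam when the
# word "free" is frequent and the sender is not from a .edu address.
D <- function(x) {
  if (x["pct_free"] > 0.05 && x["edu_sender"] == 0) "spam" else "real"
}

apply(X, 1, D)                          # apply the rule to each training email
D(c(pct_free = 0.09, edu_sender = 0))   # classify a new, unlabeled email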
Chapter 2

Basic Discriminants

The first classification methods we examine are linear discriminants, namely Linear Discriminant Analysis and Fisher's Discriminant Analysis. They are similar in that both produce linear decision functions that are in fact nearly identical, but the two methods rest on different assumptions and take different approaches. In this chapter the two methods are compared and contrasted.

2.1 Linear Discriminant Analysis for Two Populations

Given a pair of known populations, π1 and π2, assume that π1 has a probability density function (pdf) f1(x) and, similarly, π2 has a pdf f2(x), where f1(x) ≠ f2(x). Then, intuitively, a decision function for the two populations would arise from looking at the probability ratio

D(x) = f1(x) / f2(x).

A new observation x is classified as π1 if D(x) > 1 and as π2 if D(x) < 1. Letting Ω denote the sample space of all possible observations, we denote the set of x ∈ Ω where f1(x)/f2(x) > 1 as R1, similarly the set of x ∈ Ω where f1(x)/f2(x) < 1 as R2, and the remaining set where f1(x) = f2(x) as R3. In other words, an observation x is classified as π1 when f1(x) > f2(x). (Or vice versa.)

The conditional probability, P(2|1), of classifying an observation x as π2 when in fact x ∈ π1 is

P(2|1) = P(x ∈ R2 | π1) = ∫_{R2} f1(x) dx.   (2.1)

Similarly, the conditional probability of classifying an observation x as π1 when in fact x ∈ π2 is

P(1|2) = P(x ∈ R1 | π2) = ∫_{R1} f2(x) dx.   (2.2)
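As a concrete illustration of the ratio rule and of the misclassification probabilities (2.1) and (2.2), the R sketch below assumes two hypothetical univariate normal populations; the means and standard deviations are invented and do not come from any data set used in this paper.

# Two hypothetical populations with known densities (parameters invented).
f1 <- function(x) dnorm(x, mean = 0, sd = 1)   # pdf of population pi_1
f2 <- function(x) dnorm(x, mean = 2, sd = 1)   # pdf of population pi_2

# Decision function D(x) = f1(x)/f2(x): classify x as pi_1 when D(x) > 1.
D <- function(x) f1(x) / f2(x)
classify <- function(x) ifelse(D(x) > 1, "pi_1", "pi_2")
classify(0.4)                                  # "pi_1"

# With these densities, R2 = {x : f1(x) < f2(x)} = {x : x > 1}, so the
# misclassification probabilities (2.1) and (2.2) are tail integrals:
P21 <- integrate(f1, lower = 1, upper = Inf)$value    # P(2|1), about 0.159
P12 <- integrate(f2, lower = -Inf, upper = 1)$value   # P(1|2), about 0.159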
Oftentimes, the costs of misclassification for the two populations are different; take, for example, testing for a fatal but easily curable disease. The cost of misclassifying a patient as healthy when he or she is sick would in this case be higher than that of misclassifying a healthy patient as sick. To account for such situations, we would like a classification method that classifies patients as healthy only when there is sufficient evidence to overcome a certain cost threshold. We define c(1|2) as the cost of misclassifying a data point into π1 when it is actually a member of π2, and similarly, c(2|1) as the cost of misclassification into π2 when x ∈ π1.

Another factor that can affect accurate classification is the prior probability of belonging to one or the other of the populations. Again referring to the above example, if said disease has a low prevalence, even a test with high sensitivity will, when administered to enough people, classify many patients as sick when they are not, making it difficult to determine whether a positive test actually indicates a sick patient. Some weight ought to be given to the prior probability that a random observation x is from π1 or π2, and we denote these probabilities by p1 and p2 respectively.

Note that with our prior probabilities, we can find the overall probabilities of misclassification by combining the priors with the earlier conditional probabilities:

P(observation comes from π1 and is misclassified as π2)   (2.3)
  = P(x ∈ R2 | π1) P(π1)   (2.4)
  = P(2|1) × p1,   (2.5)

and

P(observation comes from π2 and is misclassified as π1)   (2.6)
  = P(x ∈ R1 | π2) P(π2)   (2.7)
  = P(1|2) × p2.   (2.8)

Then the expected cost of misclassification (ECM) is given by multiplying, for each population, the cost of misclassification by the overall probability of misclassification and summing:

ECM = c(2|1) P(2|1) p1 + c(1|2) P(1|2) p2.   (2.9)

One criterion for a classification method is to minimize the ECM, which leads us to the following result:

Theorem 2.1.1. (Johnson and Wichern, [4]) The regions R1 and R2 that minimize the ECM are defined by the values of x for which the following inequalities hold:

R1:  f1(x)/f2(x) > (c(1|2)/c(2|1)) × (p2/p1),   (2.10)

R2:  f1(x)/f2(x) < (c(1|2)/c(2|1)) × (p2/p1).   (2.11)

Proof. Let us substitute the integral expressions for P(2|1) and P(1|2) given by (2.1) and (2.2) into the ECM:

ECM = c(2|1) p1 ∫_{R2} f1(x) dx + c(1|2) p2 ∫_{R1} f2(x) dx.   (2.12)

Note that Ω = R1 ∪ R2 ∪ R3, so

1 = ∫_Ω f1(x) dx = ∫_{R1} f1(x) dx + ∫_{R2} f1(x) dx + ∫_{R3} f1(x) dx.   (2.13)

Since we know that f1(x) ≠ f2(x), R3 must be a union of distinct points, meaning ∫_{R3} f1(x) dx = 0, leaving us with

1 = ∫_{R1} f1(x) dx + ∫_{R2} f1(x) dx.   (2.14)

Plugging (2.14) into (2.12), we see

ECM = c(2|1) p1 [ 1 − ∫_{R1} f1(x) dx ] + c(1|2) p2 ∫_{R1} f2(x) dx   (2.15)

⇒ ECM = ∫_{R1} [ c(1|2) p2 f2(x) − c(2|1) p1 f1(x) ] dx + c(2|1) p1.   (2.16)

Recall that p1, p2, c(1|2), and c(2|1) are all positive, since the first two are probabilities and the latter two are costs, and that f1(x) and f2(x) are both positive functions. Then by inspection, we can see that the ECM is minimized when R1 is defined as the set of x such that c(1|2) p2 f2(x) − c(2|1) p1 f1(x) ≤ 0. However, if in (2.13) we had used f2(x) instead of f1(x), (2.16) would have been

ECM = ∫_{R2} [ c(2|1) p1 f1(x) − c(1|2) p2 f2(x) ] dx + c(1|2) p2,

leading to the definition of R2 as the set {x | c(2|1) p1 f1(x) − c(1|2) p2 f2(x) ≤ 0}. Since any x satisfying c(1|2) p2 f2(x) = c(2|1) p1 f1(x) could therefore be placed in either R1 or R2, we require that the inequalities in Theorem 2.1.1 defining the classification regions be strict, and we call the case of equality unclassifiable. (Note that unclassifiable points occur with probability zero.)
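To see how costs and priors shift the classification boundary, here is an R sketch of the minimum-ECM rule of Theorem 2.1.1, reusing the two hypothetical univariate normal densities from the earlier sketch; the costs and prior probabilities are again made-up numbers chosen only for illustration.

# Minimum ECM rule of Theorem 2.1.1 (densities f1, f2 as in the sketch above).
f1  <- function(x) dnorm(x, mean = 0, sd = 1)
f2  <- function(x) dnorm(x, mean = 2, sd = 1)
c12 <- 5    # hypothetical cost of classifying into pi_1 when x is from pi_2
c21 <- 1    # hypothetical cost of classifying into pi_2 when x is from pi_1
p1  <- 0.7  # hypothetical prior probability of pi_1
p2  <- 0.3  # hypothetical prior probability of pi_2

threshold <- (c12 / c21) * (p2 / p1)   # right-hand side of (2.10) and (2.11)

classify_ecm <- function(x) {
  ratio <- f1(x) / f2(x)
  if (ratio > threshold)      "pi_1"
  else if (ratio < threshold) "pi_2"
  else                        "unclassifiable"  # equality occurs with probability zero
}

classify_ecm(0.4)   # still "pi_1"; the boundary has moved from x = 1 to about x = 0.62

Because c(1|2) is larger than c(2|1) in this made-up example, the threshold exceeds one, so R1 shrinks relative to the simple ratio rule of Section 2.1.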
The decision function given in Theorem 2.1.1 compares the probability ratio to the product of the cost ratio and the ratio of prior probabilities. The use of ratios is important because it is often much easier to estimate the cost ratio than to estimate each cost explicitly. For example, if we are considering the cost to a state university of educating an eventual dropout versus the cost of not educating an eventual graduate, the former can be estimated from school taxes and tuition, but the latter is more difficult to gauge. However, it could still be reasonable to estimate that such a cost ratio is roughly 5:1.

Let us examine the case where the fi(x) are multivariate normal densities with known mean vectors µi and covariance matrices Σi.

2.1.1 Classification of Normal Populations when Σ1 = Σ2 = Σ

Suppose that the density functions f1(x) and f2(x) for populations π1 and π2 are given by

fi(x) = 1 / ((2π)^{m/2} |Σ|^{1/2}) exp[ −(1/2) (x − µi)^T Σ^{−1} (x − µi) ]   for i = 1, 2,   (2.17)

for x ∈ R^m.
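To show what evaluating (2.17) involves, the following sketch computes the multivariate normal density by hand in base R for an invented mean vector and covariance matrix with m = 2, so that the quadratic form in the exponent is explicit; the helper name dmvnormal and all parameter values are assumptions made for this example.

# Evaluate the multivariate normal density (2.17) directly, using only base R.
dmvnormal <- function(x, mu, Sigma) {
  m  <- length(mu)
  qf <- t(x - mu) %*% solve(Sigma) %*% (x - mu)   # (x - mu)^T Sigma^{-1} (x - mu)
  as.numeric(exp(-qf / 2) / ((2 * pi)^(m / 2) * sqrt(det(Sigma))))
}

# Invented parameters for a two-dimensional (m = 2) example.
mu1   <- c(0, 0)
Sigma <- matrix(c(1.0, 0.5,
                  0.5, 1.0), nrow = 2)
x     <- c(0.3, -0.2)

dmvnormal(x, mu1, Sigma)   # density of pi_1 evaluated at the point x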
Then by substituting (2.17) into Theorem 2.1.1, we get the following minimum ECM regions:

R1:  exp[ −(1/2)(x − µ1)^T Σ^{−1}(x − µ1) + (1/2)(x − µ2)^T Σ^{−1}(x − µ2) ] > (c(1|2)/c(2|1)) × (p2/p1),   (2.18)

R2:  exp[ −(1/2)(x − µ1)^T Σ^{−1}(x − µ1) + (1/2)(x − µ2)^T Σ^{−1}(x − µ2) ] < (c(1|2)/c(2|1)) × (p2/p1).   (2.19)

Taking logarithms, these regions give the following classification rule: allocate x to π1 if

(µ1 − µ2)^T Σ^{−1} x − (1/2)(µ1 − µ2)^T Σ^{−1}(µ1 + µ2) > ln[ (c(1|2)/c(2|1)) × (p2/p1) ].

Else, allocate x to π2. And as before, equality implies that the data point is unclassifiable.

Proof. Note that

−(1/2)(x − µ1)^T Σ^{−1}(x − µ1) + (1/2)(x − µ2)^T Σ^{−1}(x − µ2)   (2.20)

= (1/2)[ (x^T Σ^{−1} − µ2^T Σ^{−1})(x − µ2) − (x^T Σ^{−1} − µ1^T Σ^{−1})(x − µ1) ]   (2.21)

= (1/2)[ x^T Σ^{−1} x − x^T Σ^{−1} µ2 − µ2^T Σ^{−1} x + µ2^T Σ^{−1} µ2 − x^T Σ^{−1} x + x^T Σ^{−1} µ1 + µ1^T Σ^{−1} x − µ1^T Σ^{−1} µ1 ].   (2.22)

Here, we can cancel out the x^T Σ^{−1} x terms; moreover, since x^T Σ^{−1} µ2 is a scalar, it equals its own transpose, (x^T Σ^{−1} µ2)^T = µ2^T Σ^{−1} x, so

−x^T Σ^{−1} µ2 − µ2^T Σ^{−1} x = −2 µ2^T Σ^{−1} x,   (2.23)

x^T Σ^{−1} µ1 + µ1^T Σ^{−1} x = 2 µ1^T Σ^{−1} x.   (2.24)

(Recall that Σ is a symmetric matrix.) Substituting equations (2.23) and (2.24) into equation (2.22)