
Statistical Classification

Minsoo Kim
Pomona College

Advisor: Jo Hardin

April 2, 2010


Contents

1 Introduction
2 Basic Discriminants
  2.1 Linear Discriminant Analysis for Two Populations
    2.1.1 Classification of Normal Populations when Σ1 = Σ2 = Σ
  2.2 Fisher's Discriminant Analysis
  2.3 Further Concerns
3 Support Vector Machines
  3.1 Lagrange Multipliers
  3.2 Hard Margin Support Vector Machines
  3.3 L1 Soft Margin Support Vector Machines
  3.4 Kernel Methods
  3.5 Further Concerns
4 Applications in R
  4.1 LDA in R
  4.2 Support Vector Machines in R
5 Conclusion

Chapter 1
Introduction

Consider the following problem: suppose you are a doctor whose patient has just been hospitalized with a heart attack. Will he or she have another heart attack within the next six months? Or this one: suppose you work at a bank, and a person wants to take out a small business loan; will she default on the loan? Or how about this one: how can your email server tell which emails are spam and which ones are actual mail?

Intuitively, all of the above situations can be resolved by examining empirical data and figuring out which factors are important in each. For example, in the heart attack case, one might look through hospitalization records of patients who have had two heart attacks within a six-month period and see whether your patient resembles them in age, demographics, diet and exercise habits, and other clinical measurements. For loans, one might evaluate future borrowers on their credit score, income, and the number of credit cards they hold, and make a decision based on previous borrowers with similar profiles.

The above situations are examples of classification problems. Classification is a statistical method used to build predictive models that separate and classify new data points. Let us examine the spam filtering example in more detail. The populations we want to distinguish between are junk emails, those emails from advertisers or hackers that fill up your inbox and cause everyone a headache, and real emails. Next, suppose you already have a collection of emails, say N of them, some junk and some real, and from these emails we want to build a way to filter out spam. We refer to this collection of emails as the training set, which is, in general, a set of already-classified data that we use to build our classification model. We refer to the training data as X = {x1, x2, . . . , xN}.

Now, from our training data, we need to determine which variables, called features, we want to measure in order to assess whether future emails are spam or not. Features can be continuous or discrete: an example of a discrete feature is whether the sender of an email comes from a .edu address, and an example of a continuous feature is the percentage of an email that a certain word or character (like "free", "your", or "!") takes up. If m features are measured, each email corresponds to a 1 × m row vector xi of data, for i ∈ {1, . . . , N}, and we refer to the space R^m as the feature space. The training data then form an N × m matrix, where entry xij represents the jth feature of the ith email. Finally, using the training data X, a classification method produces a decision function D(x), a function that takes a new data point and returns a prediction of its population.
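As a concrete, entirely made-up illustration of this setup, the R sketch below builds a tiny training set with N = 4 emails and m = 2 features and a stand-in decision function. The feature names, numbers, and threshold are hypothetical and are not taken from this paper.

```r
# Toy version of the spam setup: 4 training emails, 2 features
# (whether the sender has a .edu address, and the percentage of the
# email taken up by the word "free"). All numbers are invented.
X <- matrix(c(1, 0.0,
              0, 4.5,
              1, 0.2,
              0, 6.1),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("edu_sender", "pct_free")))
y <- c("real", "spam", "real", "spam")     # known labels for the training set

# A decision function D(x) maps a new 1 x m feature vector to a predicted
# population; this stand-in rule simply thresholds the "free" percentage.
D <- function(x) if (x["pct_free"] > 2) "spam" else "real"
D(c(edu_sender = 0, pct_free = 3.7))       # returns "spam"
```

The methods developed in the following chapters construct such decision functions from the training data itself rather than by hand.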

The three examples above are just a small sample of the uses of classification. Given these preliminaries, this paper examines various methods of classification, starting with linear discriminant analysis and Fisher's discriminant analysis and moving on to Support Vector Machines (SVMs). Our work concludes with an analysis of the linear classifiers and SVMs applied to an actual data set.

Chapter 2
Basic Discriminants

The first classification methods we examine are linear discriminants: Linear Discriminant Analysis and Fisher's Discriminant Analysis. The two methods are similar in that they produce linear decision functions that are in fact nearly identical, but they rest on different assumptions and take different approaches. In this chapter the two methods are compared and contrasted.

2.1 Linear Discriminant Analysis for Two Populations

Given a pair of known populations, π1 and π2, assume that π1 has a probability density function (pdf) f1(x) and, similarly, that π2 has a pdf f2(x), where f1(x) ≠ f2(x). Then, intuitively, a decision function for the two populations arises from the probability ratio
\[ D(\mathbf{x}) = \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})}. \]
A new observation x is classified as π1 if D(x) > 1 and as π2 if D(x) < 1. Writing Ω for the feature space, we denote by R1 the set of x ∈ Ω where f1(x)/f2(x) > 1, and similarly by R2 the set of x ∈ Ω where f1(x)/f2(x) < 1; that is, x is allocated to π1 precisely when f1(x) > f2(x). (Or vice versa.)
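As a small numerical illustration of this ratio rule, the R sketch below classifies points from two univariate normal populations by comparing D(x) to 1; the densities are invented for illustration and do not come from the paper.

```r
# Density-ratio rule for two hypothetical univariate normal populations:
# pi_1 ~ N(0, 1) and pi_2 ~ N(2, 1).
f1 <- function(x) dnorm(x, mean = 0, sd = 1)   # pdf of pi_1
f2 <- function(x) dnorm(x, mean = 2, sd = 1)   # pdf of pi_2
D  <- function(x) f1(x) / f2(x)                # probability ratio D(x)

classify <- function(x) {
  if (D(x) > 1) {
    "pi_1"
  } else if (D(x) < 1) {
    "pi_2"
  } else {
    "boundary"
  }
}
classify(0.3)   # D(0.3) is about 4.1 > 1, so x lies in R1 and goes to pi_1
classify(1.8)   # D(1.8) is about 0.2 < 1, so x lies in R2 and goes to pi_2
```

With equal variances the two densities cross at x = 1, which is exactly where D(x) = 1.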

The conditional probability, P(2|1), of classifying an observation x as π2 when in fact x ∈ π1 is
\[ P(2 \mid 1) = P(\mathbf{x} \in R_2 \mid \pi_1) = \int_{R_2} f_1(\mathbf{x})\, d\mathbf{x}. \tag{2.1} \]
Similarly, the conditional probability of classifying an observation x as π1 when in fact x ∈ π2 is
\[ P(1 \mid 2) = P(\mathbf{x} \in R_1 \mid \pi_2) = \int_{R_1} f_2(\mathbf{x})\, d\mathbf{x}. \tag{2.2} \]

Oftentimes, the costs of misclassification for the two populations are different; take, for example, testing for a fatal but easily curable disease. The cost of misclassifying a patient as healthy when he or she is sick would in this case be higher than that of misclassifying a healthy patient as sick. To account for such situations, we would like a classification method that classifies patients as healthy only when there is sufficient evidence to overcome a certain cost threshold. We define c(1|2) as the cost of misclassifying a data point into π1 when it is actually a member of π2, and similarly, c(2|1) as the cost of misclassification into π2 when x ∈ π1.

Another factor that can affect accurate classification is the prior probability of belonging to one or the other of the populations. Again referring to the above example, if the disease has a low prevalence, then even a test with high sensitivity, administered to enough people, will classify many patients as sick when they are not, making it difficult to determine whether a positive test actually indicates a sick patient. Some weight ought therefore to be given to the prior probability that a random observation x is from π1 or π2, and we denote these probabilities by p1 and p2 respectively.

Note that with our prior probabilities, we can find the overall probabilities of misclassification by substituting the priors into the earlier conditional probabilities:
\[ P(\text{observation comes from } \pi_1 \text{ and is misclassified as } \pi_2) \tag{2.3} \]
\[ = P(\mathbf{x} \in R_2 \mid \pi_1)\, P(\pi_1) \tag{2.4} \]
\[ = P(2 \mid 1) \times p_1, \tag{2.5} \]
and
\[ P(\text{observation comes from } \pi_2 \text{ and is misclassified as } \pi_1) \tag{2.6} \]
\[ = P(\mathbf{x} \in R_1 \mid \pi_2)\, P(\pi_2) \tag{2.7} \]
\[ = P(1 \mid 2) \times p_2. \tag{2.8} \]
Then the expected cost of misclassification (ECM) is obtained by multiplying, for each population, the cost of misclassification by the overall probability of misclassification and summing:
\[ \mathrm{ECM} = c(2 \mid 1)\, P(2 \mid 1)\, p_1 + c(1 \mid 2)\, P(1 \mid 2)\, p_2. \tag{2.9} \]
One criterion for a classification method is to minimize the ECM, which leads us to the following result:

Theorem 2.1.1. (Johnson and Wichern, [4]) The regions R1 and R2 that minimize the ECM are defined by the values of x for which the following inequalities hold:
\[ R_1 : \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} > \frac{c(1 \mid 2)}{c(2 \mid 1)} \times \frac{p_2}{p_1}, \tag{2.10} \]
\[ R_2 : \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} < \frac{c(1 \mid 2)}{c(2 \mid 1)} \times \frac{p_2}{p_1}. \tag{2.11} \]

Proof. Let us substitute the integral expressions for P(2|1) and P(1|2) given by (2.1) and (2.2) into the ECM:
\[ \mathrm{ECM} = c(2 \mid 1)\, p_1 \int_{R_2} f_1(\mathbf{x})\, d\mathbf{x} + c(1 \mid 2)\, p_2 \int_{R_1} f_2(\mathbf{x})\, d\mathbf{x}. \tag{2.12} \]

Note that Ω = R1 ∪ R2 ∪ R3, so
\[ 1 = \int_{\Omega} f_1(\mathbf{x})\, d\mathbf{x} = \int_{R_1} f_1(\mathbf{x})\, d\mathbf{x} + \int_{R_2} f_1(\mathbf{x})\, d\mathbf{x} + \int_{R_3} f_1(\mathbf{x})\, d\mathbf{x}. \tag{2.13} \]
Since we know that f1(x) ≠ f2(x), R3 must be a union of distinct points, meaning ∫_{R3} f1(x) dx = 0, leaving us with
\[ 1 = \int_{R_1} f_1(\mathbf{x})\, d\mathbf{x} + \int_{R_2} f_1(\mathbf{x})\, d\mathbf{x}. \tag{2.14} \]
Plugging (2.14) into (2.12), we see
\[ \mathrm{ECM} = c(2 \mid 1)\, p_1 \left[ 1 - \int_{R_1} f_1(\mathbf{x})\, d\mathbf{x} \right] + c(1 \mid 2)\, p_2 \int_{R_1} f_2(\mathbf{x})\, d\mathbf{x} \tag{2.15} \]
\[ \Rightarrow \mathrm{ECM} = \int_{R_1} \left[ c(1 \mid 2)\, p_2 f_2(\mathbf{x}) - c(2 \mid 1)\, p_1 f_1(\mathbf{x}) \right] d\mathbf{x} + c(2 \mid 1)\, p_1. \tag{2.16} \]
Recall that p1 and p2 are probabilities and c(1|2) and c(2|1) are costs, so all four are positive, and f1(x) and f2(x) are both nonnegative functions. Then by inspection, we can see that the ECM is minimized when R1 is defined by those x such that c(1|2) p2 f2(x) − c(2|1) p1 f1(x) ≤ 0. However, if in (2.13) we had decided to use f2(x) instead of f1(x), (2.16) would have been
\[ \mathrm{ECM} = \int_{R_2} \left[ c(2 \mid 1)\, p_1 f_1(\mathbf{x}) - c(1 \mid 2)\, p_2 f_2(\mathbf{x}) \right] d\mathbf{x} + c(1 \mid 2)\, p_2, \]
leading to the definition of R2 as the set {x | c(2|1) p1 f1(x) − c(1|2) p2 f2(x) ≤ 0}. As those x satisfying c(1|2) p2 f2(x) = c(2|1) p1 f1(x) could therefore be assigned to both R1 and R2, we require that the inequalities in Theorem 2.1.1 defining the classification regions be strict, and we call the case of equality unclassifiable. (Note that unclassifiable points occur with probability zero.)

The decision function given in Theorem 2.1.1 compares the probability ratio to the product of the cost ratio and the prior ratio. The use of ratios is important because it is often much easier to estimate the cost ratio than to specify each cost explicitly. For example, if we consider a state university's cost of educating an eventual dropout versus its cost of not educating an eventual graduate, the former can be estimated from school taxes and tuition, but the latter is more difficult to gauge. Even so, it may still be reasonable to judge that such a cost ratio is about 5:1.
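To make the role of the cost and prior ratios concrete, the sketch below extends the earlier univariate-normal illustration with made-up costs and priors and applies the minimum-ECM rule of Theorem 2.1.1; none of these numbers come from the paper.

```r
# Minimum-ECM rule of Theorem 2.1.1 for the same two hypothetical
# populations, pi_1 ~ N(0, 1) and pi_2 ~ N(2, 1), with invented costs/priors.
f1 <- function(x) dnorm(x, mean = 0, sd = 1)
f2 <- function(x) dnorm(x, mean = 2, sd = 1)
c12 <- 5; c21 <- 1      # c(1|2), c(2|1): misclassifying a pi_2 point is costlier
p1  <- 0.5; p2 <- 0.5   # prior probabilities of pi_1 and pi_2

threshold <- (c12 / c21) * (p2 / p1)   # right-hand side of (2.10) and (2.11)

classify_ecm <- function(x) {
  ratio <- f1(x) / f2(x)
  if (ratio > threshold) {
    "pi_1"
  } else if (ratio < threshold) {
    "pi_2"
  } else {
    "unclassifiable"    # equality occurs with probability zero
  }
}
classify_ecm(0.1)   # still allocated to pi_1
classify_ecm(0.3)   # now allocated to pi_2: the cost ratio moves the
                    # boundary from x = 1 down to roughly x = 0.2
```

Raising c(1|2) relative to c(2|1) enlarges R2, since stronger evidence is then required before a point is assigned to π1.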

Let us examine the case where the fi(x) are multivariate normal densities with known mean vectors µi and covariance matrices Σi.

2.1.1 Classification of Normal Populations when Σ1 = Σ2 = Σ

Suppose that the density functions f1(x) and f2(x) for populations π1 and π2 are given by
\[ f_i(\mathbf{x}) = \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\!\left[ -\frac{1}{2} (\mathbf{x} - \mu_i)^T \Sigma^{-1} (\mathbf{x} - \mu_i) \right] \quad \text{for } i = 1, 2, \tag{2.17} \]
for x ∈ R^m. Then, by substituting (2.17) into Theorem 2.1.1, we get the following minimum ECM regions:
\[ R_1 : \exp\!\left[ -\frac{1}{2} (\mathbf{x} - \mu_1)^T \Sigma^{-1} (\mathbf{x} - \mu_1) + \frac{1}{2} (\mathbf{x} - \mu_2)^T \Sigma^{-1} (\mathbf{x} - \mu_2) \right] > \frac{c(1 \mid 2)}{c(2 \mid 1)} \times \frac{p_2}{p_1}, \tag{2.18} \]
\[ R_2 : \exp\!\left[ -\frac{1}{2} (\mathbf{x} - \mu_1)^T \Sigma^{-1} (\mathbf{x} - \mu_1) + \frac{1}{2} (\mathbf{x} - \mu_2)^T \Sigma^{-1} (\mathbf{x} - \mu_2) \right] < \frac{c(1 \mid 2)}{c(2 \mid 1)} \times \frac{p_2}{p_1}. \tag{2.19} \]
Since the exponential function is monotone, taking logarithms and simplifying (as in the proof below) gives the equivalent rule: allocate x to π1 if
\[ (\mu_1 - \mu_2)^T \Sigma^{-1} \mathbf{x} - \frac{1}{2} (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 + \mu_2) > \ln\!\left[ \frac{c(1 \mid 2)}{c(2 \mid 1)} \times \frac{p_2}{p_1} \right]. \]
Else, allocate x to π2. And, as before, equality implies that the data point is unclassifiable.

Proof. Note that
\[ -\frac{1}{2} (\mathbf{x} - \mu_1)^T \Sigma^{-1} (\mathbf{x} - \mu_1) + \frac{1}{2} (\mathbf{x} - \mu_2)^T \Sigma^{-1} (\mathbf{x} - \mu_2) \tag{2.20} \]
\[ = \frac{1}{2} \left[ (\mathbf{x}^T \Sigma^{-1} - \mu_2^T \Sigma^{-1})(\mathbf{x} - \mu_2) - (\mathbf{x}^T \Sigma^{-1} - \mu_1^T \Sigma^{-1})(\mathbf{x} - \mu_1) \right] \tag{2.21} \]
\[ = \frac{1}{2} \left[ \mathbf{x}^T \Sigma^{-1} \mathbf{x} - \mathbf{x}^T \Sigma^{-1} \mu_2 - \mu_2^T \Sigma^{-1} \mathbf{x} + \mu_2^T \Sigma^{-1} \mu_2 - \mathbf{x}^T \Sigma^{-1} \mathbf{x} + \mathbf{x}^T \Sigma^{-1} \mu_1 + \mu_1^T \Sigma^{-1} \mathbf{x} - \mu_1^T \Sigma^{-1} \mu_1 \right]. \tag{2.22} \]
Here we can cancel the x^T Σ^{-1} x terms; moreover, since (x^T Σ^{-1} µ2)^T = µ2^T Σ^{-1} x is a scalar,
\[ -\mathbf{x}^T \Sigma^{-1} \mu_2 - \mu_2^T \Sigma^{-1} \mathbf{x} = -2 \mu_2^T \Sigma^{-1} \mathbf{x}, \tag{2.23} \]
\[ \mathbf{x}^T \Sigma^{-1} \mu_1 + \mu_1^T \Sigma^{-1} \mathbf{x} = 2 \mu_1^T \Sigma^{-1} \mathbf{x}. \tag{2.24} \]
(Recall that Σ is a symmetric matrix.) Substituting equations (2.23) and (2.24) into equation (2.22) gives
\[ \frac{1}{2} \left[ 2 \mu_1^T \Sigma^{-1} \mathbf{x} - 2 \mu_2^T \Sigma^{-1} \mathbf{x} + \mu_2^T \Sigma^{-1} \mu_2 - \mu_1^T \Sigma^{-1} \mu_1 \right] = (\mu_1 - \mu_2)^T \Sigma^{-1} \mathbf{x} - \frac{1}{2} (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 + \mu_2), \]
which is the linear form used in the rule above.
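A minimal numerical sketch of this equal-covariance rule follows; it uses the mvtnorm package for the multivariate normal densities, and the values of µ1, µ2, Σ, the costs, and the priors are all invented for illustration (this is not the data analysis of Chapter 4).

```r
# Equal-covariance normal classification rule: compare the log density ratio,
# or equivalently the linear form derived above, to the log cost/prior threshold.
library(mvtnorm)   # provides dmvnorm() for multivariate normal densities

mu1 <- c(0, 0); mu2 <- c(2, 1)
Sigma <- matrix(c(1, 0.3,
                  0.3, 1), nrow = 2)
c12 <- 1; c21 <- 1; p1 <- 0.5; p2 <- 0.5
log_k <- log((c12 / c21) * (p2 / p1))   # ln of the threshold; zero here

# log f1(x) - log f2(x), computed directly from the densities
log_ratio <- function(x) {
  dmvnorm(x, mean = mu1, sigma = Sigma, log = TRUE) -
    dmvnorm(x, mean = mu2, sigma = Sigma, log = TRUE)
}

# The simplified linear form:
# (mu1 - mu2)' Sigma^{-1} x - (1/2) (mu1 - mu2)' Sigma^{-1} (mu1 + mu2)
linear_form <- function(x) {
  a <- solve(Sigma, mu1 - mu2)          # Sigma^{-1} (mu1 - mu2)
  sum(a * x) - 0.5 * sum(a * (mu1 + mu2))
}

x_new <- c(0.4, 0.9)
log_ratio(x_new)     # the two expressions agree (about 0.95 here)
linear_form(x_new)
if (linear_form(x_new) > log_k) "allocate to pi_1" else "allocate to pi_2"
```

With equal costs and priors the logarithmic threshold is zero, so the rule allocates x to π1 whenever the linear discriminant is positive.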
