Chapter 1

Introduction

Consider the following problem: suppose you are a doctor whose patient has just been hospitalized with a heart attack. Will he or she have another heart attack within the next six months? Or this one: suppose you are a bank, and a person wants to take out a small business loan; will she default on the loan? Or how about this one: how can your email server tell which emails are spam and which ones are actual mail?

Intuitively, all of the above situations can be resolved by examining empirical data and figuring out which factors are important in each. For example, in the heart attack case, one might look through the hospitalization records of patients who have had two heart attacks in a six-month period and see whether the new patient resembles them in age, demographics, diet and exercise habits, and other clinical measurements. Or for loans, one might evaluate future borrowers on their credit score, income, and number of credit cards, and make a decision based on previous borrowers with similar profiles.

The above situations are examples of classification problems. Classification is a statistical method used to build predictive models that separate and classify new data points. Let us examine the spam filtering example in more detail. The populations we want to distinguish between are junk emails, those emails from advertisers or hackers that fill up your inbox and cause everyone a headache, and real emails. Next, suppose you already have a collection of emails, say $N$ of them, some of them junk and some of them real, and from these emails we want to build ways to filter out spam. We refer to this collection of emails as the training set, which is, in general, a set of already classified data that we use to build our classification model. We refer to the training data as $X = \{x_1, x_2, \ldots, x_N\}$.

Now, from our training data, we need to determine which variables, called features, we want to measure to assess whether future emails are spam or not. Features can be continuous or discrete: an example of a discrete feature is whether or not the sender of an email has a .edu address, and an example of a continuous feature is the percentage of an email that a certain word or character (like "free", "your", or "!") takes up. If $m$ features are measured, each email is represented by a $1 \times m$ row vector $x_i$, for $i \in \{1, \ldots, N\}$, and we refer to the space $\mathbb{R}^m$ as the feature space. The training data is then an $N \times m$ matrix, where entry $x_{ij}$ represents the $j$th feature of the $i$th email.
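
To make the feature construction in this chapter concrete, here is a minimal Python sketch that turns a few emails into rows of a feature matrix with one discrete feature (a .edu sender) and three continuous ones (the percentages taken up by "free", "your", and "!"). The three sample emails, the helper names `word_percentage`, `char_percentage`, and `extract_features`, and the choice of $m = 4$ are assumptions made purely for illustration; they do not come from the original text, and a real spam filter would likely measure many more features.

```python
# Toy training emails as (sender, body) pairs. All of them are made up.
emails = [
    ("alice@college.edu", "Your homework is due Friday"),
    ("promo@deals.com", "free free prizes! claim your free gift now!!"),
    ("bob@company.com", "Meeting moved to noon"),
]


def word_percentage(body, word):
    """Percentage of the whitespace-separated words in `body` equal to `word`.

    Words are lowercased and stripped of surrounding punctuation first.
    """
    words = [w.strip(".,!?").lower() for w in body.split()]
    if not words:
        return 0.0
    return 100.0 * words.count(word) / len(words)


def char_percentage(body, char):
    """Percentage of the characters in `body` equal to `char`."""
    if not body:
        return 0.0
    return 100.0 * body.count(char) / len(body)


def extract_features(sender, body):
    """Return the feature row x_i for one email (here m = 4)."""
    return [
        1.0 if sender.endswith(".edu") else 0.0,  # discrete: .edu sender?
        word_percentage(body, "free"),            # continuous: share of "free"
        word_percentage(body, "your"),            # continuous: share of "your"
        char_percentage(body, "!"),               # continuous: share of "!"
    ]


# The training data as an N x m matrix: one row per email, one column per feature.
X = [extract_features(sender, body) for sender, body in emails]
N, m = len(X), len(X[0])

for row in X:
    print(row)
```

Each row of `X` plays the role of one $x_i$, and `X[i][j]` corresponds to the entry $x_{ij}$ (with indices starting at 0 rather than 1). A plain list of lists is used so that the sketch has no dependencies; an array library would be a natural choice for larger data.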

Finally, using our training data $X$, classification will make a decision function, $D(x)$, a function