where h is a non-negative integer called the Vapnik Chervonenkis (VC) dimension, and is a measure of the notion of capacity mentioned above.
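Eq. (3) itself appears just before this excerpt and is not reproduced here. For reference, the bound in question, as given by Vapnik, has the form

R(α) ≤ R_emp(α) + √( (h (log(2l/h) + 1) − log(η/4)) / l ),

which holds with probability 1 − η, where l is the number of training samples; the square-root term is the "VC confidence" referred to below. Since this term depends only on h, l, and η, it is simple to evaluate once h is known. A minimal sketch in Python (the function name and the example values are ours, not the paper's; natural logarithms are assumed):

import math

def vc_confidence(h, l, eta):
    """VC confidence term of the risk bound:
    sqrt((h * (ln(2l/h) + 1) - ln(eta/4)) / l)."""
    return math.sqrt((h * (math.log(2.0 * l / h) + 1.0) - math.log(eta / 4.0)) / l)

# Hypothetical example: VC dimension h = 10, l = 1000 training points, eta = 0.05.
print(vc_confidence(10, 1000, 0.05))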
In the following we will call the right hand side of Eq. (3) the “risk bound.” We depart here from some previous nomenclature: the authors of (Guyon et al., 1992) call it the “guaranteed risk”, but this is something of a misnomer, since it is really a bound on a risk, not a risk, and it holds only with a certain probability, and so is not guaranteed. The second term on the right hand side is called the “VC confidence.”

We note three key points about this bound. First, remarkably, it is independent of P(x, y). It assumes only that both the training data and the test data are drawn independently according to some P(x, y). Second, it is usually not possible to compute the left hand side. Third, if we know h, we can easily compute the right hand side. Thus given several different learning machines (recall that “learning machine” is just another name for a family of functions f(x, α)), and choosing a fixed, sufficiently small η, by then taking that machine which minimizes the right hand side, we are choosing that machine which gives the lowest upper bound on the actual risk. This gives a principled method for choosing a learning machine for a given task, and is the essential idea of structural risk minimization (see Section 2.6).

Given a fixed family of learning machines to choose from, to the extent that the bound is tight for at least one of the machines, one will not be able to do better than this. To the extent that the bound is not tight for any, the hope is that the right hand side still gives useful information as to which learning machine minimizes the actual risk. The bound not being tight for the whole chosen family of learning machines gives critics a justifiable target at which to fire their complaints. At present, for this case, we must rely on experiment to be the judge.

2.1. The VC Dimension

The VC dimension is a property of a set of functions {f(α)} (again, we use α as a generic set of parameters: a choice of α specifies a particular function), and can be defined for various classes of function f. Here we will only consider functions that correspond to the two-class pattern recognition case, so that f(x, α) ∈ {−1, 1} ∀x, α. Now if a given set of l points can be labeled in all possible 2^l ways, and for each labeling, a member of the set {f(α)} can be found which correctly assigns those labels, we say that that set of points is shattered by that set of functions. The VC dimension for the set of functions {f(α)} is defined as the maximum number of training points that can be shattered by {f(α)}. Note that, if the VC dimension is h, then there exists at least one set of h points that can be shattered, but in general it will not be true that every set of h points can be shattered.

2.2. Shattering Points with Oriented Hyperplanes in Rⁿ

Suppose that the space in which the data live is R², and the set {f(α)} consists of oriented straight lines, so that for a given line, all points on one side are assigned the class 1, and all points on the other side, the class −1. The orientation is shown in Figure 1 by an arrow, specifying on which side of the line points are to be assigned the label 1. While it is possible to find three points that can be shattered by this set of functions, it is not possible to find four. Thus the VC dimension of the set of oriented lines in R² is three.
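The statement that three points can be shattered by oriented lines in R² while four cannot can be checked numerically by brute force: enumerate all 2^l labelings of a small point set and, for each labeling, test whether some oriented line realizes it. The sketch below is ours, not the paper's; it assumes NumPy and SciPy are available and decides whether a labeling is realizable by solving a small linear-programming feasibility problem.

import itertools

import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """True if some oriented line realizes the labeling, i.e. there exist (w, b) with
    sign(w . x_i + b) = y_i for all i.  This is equivalent (by rescaling w and b) to
    feasibility of y_i (w . x_i + b) >= 1, which is a linear program in (w1, w2, b)."""
    A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success  # feasible <=> the labeling is realizable by an oriented line

def shattered(points):
    """True if oriented lines shatter the points, i.e. all 2^l labelings are realizable."""
    return all(separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))

# Three points in general position (cf. Figure 1) are shattered ...
print(shattered([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]))                    # True
# ... but four points are not; the "XOR" configuration already fails:
print(shattered([(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]))        # False

Note that the snippet only exhibits one four-point configuration that cannot be shattered; the claim in the text is the stronger one that no set of four points in R² can be shattered. (When all labels agree, the LP may return w = 0, a degenerate "line"; a genuine line placed beyond all the points realizes the same labeling, so the answer is unaffected.)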