Applied Nonparametric Statistics, course reading material: W. J. Braun's nonparametric regression notes

1 Nonparametric Regression

Given data of the form (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), we seek an estimate of the regression function g(x) satisfying the model

    y = g(x) + ε

where the noise term satisfies the usual conditions assumed for simple linear regression. There are several approaches to this problem. In this chapter, we will describe methods involving splines as well as local polynomial methods.

1.1 Spline Regression

The discovery that piecewise polynomials, or splines, could be used in place of polynomials occurred in the early twentieth century. Splines have since become one of the most popular ways of approximating nonlinear functions.

In this section, we describe some of the basic properties of splines and two bases for them. We then go on to discuss how to estimate the coefficients of a spline using least-squares regression. We close this section with a discussion of smoothing splines.

1.1.1 Basic properties of splines

Splines are essentially defined as piecewise polynomials. In this subsection, we describe how splines can be viewed as linear combinations of truncated power functions. We then describe the B-spline basis, which is a more convenient basis for computing splines.

Truncated power functions

Let t be any real number. Then we can define a pth degree truncated power function as

    (x - t)_+^p = (x - t)^p I_{x > t}(x).

As a function of x, this function takes the value 0 to the left of t, and it takes the value (x - t)^p to the right of t. The number t is called a knot.

The above truncated power function is a basic example of a spline. In fact, it is a member of a set of basis functions for the space of splines.
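The truncated power function is simple to compute directly. Below is a minimal R sketch (our own illustration; the function name tpower is not from the notes) that evaluates and plots (x - 0.4)_+^3:

    # pth degree truncated power function (x - t)_+^p
    tpower <- function(x, t, p) (x - t)^p * (x > t)

    x <- seq(0, 1, length = 101)
    plot(x, tpower(x, t = 0.4, p = 3), type = "l")
    # the curve is identically 0 to the left of the knot and (x - 0.4)^3 to its right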

Properties of splines

For simplicity, let us consider a general pth degree spline with a single knot at t. Let P(x) denote an arbitrary pth degree polynomial

    P(x) = β_0 + β_1 x + β_2 x^2 + ... + β_p x^p.

Then

    S(x) = P(x) + β_{p+1} (x - t)_+^p                                  (1.1)

takes on the value P(x) for any x ≤ t, and it takes on the value P(x) + β_{p+1}(x - t)^p for any x > t. Thus, restricted to each region, the function is a pth degree polynomial. As a whole, this function is a pth degree piecewise polynomial; there are two pieces.

Note that we require p + 2 coefficients to specify this piecewise polynomial. This is one more coefficient than the number needed to specify a pth degree polynomial, and this is a result of the addition of the truncated power function specified by the knot at t. In general, we may add k truncated power functions specified by knots at t_1, t_2, ..., t_k, each multiplied by a different coefficient. This would result in p + k + 1 degrees of freedom.

An important property of splines is their smoothness. Polynomials are very smooth, possessing all derivatives everywhere. Splines possess all derivatives only at points which are not knots. The number of derivatives at a knot depends on the degree of the spline.

Consider the spline defined by (1.1). We can show that S(x) is continuous at t, when p > 0, by noting that

    S(t) = P(t)    and    lim_{x ↓ t} β_{p+1} (x - t)_+^p = 0,

so that

    lim_{x ↓ t} S(x) = P(t).

In fact, we can argue similarly for the first p - 1 derivatives:

    S^{(j)}(t) = P^{(j)}(t),    j = 1, 2, ..., p - 1,

and

    lim_{x ↓ t} β_{p+1} p(p - 1) ... (p - j + 1)(x - t)^{p-j} = 0,

so that

    lim_{x ↓ t} S^{(j)}(x) = P^{(j)}(t).

The pth derivative behaves differently:

    S^{(p)}(t) = p! β_p    and    lim_{x ↓ t} S^{(p)}(x) = p! β_p + p! β_{p+1},
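This continuity argument can be checked numerically for the cubic case p = 3. The following sketch is our own illustration, not code from the notes: it compares finite-difference derivatives of the two polynomial pieces at the knot and shows that orders 0, 1 and 2 agree while order 3 jumps by 3!·β_4.

    # Cubic spline (p = 3) with a single knot at t0 = 0.5: S(x) = P(x) + b4 * (x - t0)_+^3
    t0 <- 0.5
    P  <- function(x) 1 - 2*x + 0.5*x^2 + x^3     # an arbitrary cubic P(x)
    b4 <- 2                                       # coefficient of the truncated power term

    # one-sided limits of S and its derivatives at the knot are the derivatives
    # of the corresponding polynomial piece evaluated at t0
    left.piece  <- function(x) P(x)                      # S(x) for x <= t0
    right.piece <- function(x) P(x) + b4 * (x - t0)^3    # S(x) for x >  t0

    # simple recursive central-difference derivative of order k
    h <- 1e-3
    deriv.num <- function(f, x, k) {
      if (k == 0) f(x)
      else (deriv.num(f, x + h, k - 1) - deriv.num(f, x - h, k - 1)) / (2 * h)
    }

    sapply(0:3, function(k) c(left  = deriv.num(left.piece,  t0, k),
                              right = deriv.num(right.piece, t0, k)))
    # orders 0, 1 and 2 agree across the knot; order 3 differs by 3! * b4 = 12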

so usually there is a discontinuity in the pth derivative. Thus, pth degree splines are usually said to have no more than p - 1 continuous derivatives. For example, cubic splines often have two continuous derivatives. This is one reason for their popularity: having two continuous derivatives is often sufficient to give smooth approximations to functions. Furthermore, third degree piecewise polynomials are usually numerically well-behaved.

B-splines

The discussion below (1.1) indicates that we can represent any piecewise polynomial of degree p in the following way:

    S(x) = β_0 + β_1 x + β_2 x^2 + ... + β_p x^p + β_{p+1}(x - t_1)_+^p + ... + β_{p+k}(x - t_k)_+^p.    (1.2)

This can be rigorously proven; see de Boor (1978), for example. Display (1.2) says that any piecewise polynomial can be expressed as a linear combination of truncated power functions and polynomials of degree p. In other words,

    {1, x, x^2, ..., x^p, (x - t_1)_+^p, (x - t_2)_+^p, ..., (x - t_k)_+^p}

is a basis for the space of pth degree splines possessing knots at t_1, t_2, ..., t_k.

By adding a noise term to (1.2), we can obtain a spline regression model relating a response y = S(x) + ε to the predictor x. Least-squares can be used to estimate the coefficients.

The normal equations associated with the truncated power basis are highly ill-conditioned. In the polynomial context, this problem is avoided by considering orthogonal polynomials. For p > 0, there are no orthogonal splines. However, there is a basis which is reasonably well-conditioned, the B-spline basis.

Let us first consider the piecewise constant functions (0th degree splines). These are normally not of much practical use. However, they provide us with a starting point from which to build up the B-spline basis.

The truncated power basis for the piecewise constant functions on an interval [a, b] having knots at t_1, t_2, ..., t_k ∈ [a, b] is

    {1, (x - t_1)_+^0, ..., (x - t_k)_+^0}.

Note that these basis functions overlap a lot. That is, they are often 0 together or 1 together. This leads to the ill-conditioning, mentioned earlier, when applying least-squares. A more natural basis for piecewise constant functions is the following:

    {1_{x ∈ (a, t_1)}(x), 1_{x ∈ (t_1, t_2)}(x), 1_{x ∈ (t_2, t_3)}(x), ..., 1_{x ∈ (t_{k-1}, t_k)}(x), 1_{x ∈ (t_k, b)}(x)}.

To see that this is a basis for piecewise constant functions on [a, b] with jumps at t_1, ..., t_k, note that any such function can be written as

    S_0(x) = β_0 1_{x ∈ (a, t_1)}(x) + Σ_{j=1}^{k-1} β_j 1_{x ∈ (t_j, t_{j+1})}(x) + β_k 1_{x ∈ (t_k, b)}(x).
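The conditioning contrast between the two bases is easy to see numerically. Here is a small R sketch of our own (the variable names are not from the notes) that builds both model matrices for the 0th degree case and compares their condition numbers with kappa(); the contrast becomes far more dramatic for higher degree truncated power bases.

    set.seed(1)
    x     <- sort(runif(100))
    knots <- c(0.3, 0.6, 0.9)

    # truncated power basis for 0th degree splines: 1, (x - t1)_+^0, ..., (x - tk)_+^0
    X.tp <- cbind(1, sapply(knots, function(t) as.numeric(x > t)))

    # disjoint indicator basis: one indicator per inter-knot interval
    breaks <- c(0, knots, 1)
    X.ind  <- sapply(1:(length(breaks) - 1),
                     function(i) as.numeric(x > breaks[i] & x <= breaks[i + 1]))

    kappa(X.tp)    # overlapping supports give correlated columns and a larger condition number
    kappa(X.ind)   # disjoint supports give orthogonal columns and a small condition number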

Note that if we try to fit such a function to data of the form (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), we require at least one x value between each adjacent pair of knots (which include a and b, the so-called 'boundary knots') in order to obtain a full-rank model matrix. This is required for invertibility of the normal equations. Note, as well, that the columns of the model matrix are orthogonal, which implies that this least-squares problem is now very well conditioned. The reason for the orthogonality is that the regions where the functions are non-zero do not overlap; each basis function is non-zero only on a single interval of the form [t_i, t_{i+1}]. Finally, observe that the size of this basis is the same as that of the truncated power basis, so we are requiring the same number of degrees of freedom.

Moving to the piecewise linear case, we note first that the piecewise constant B-splines could be obtained from their truncated counterparts by differencing. Two levels of differencing applied to the truncated linear functions on [a, b] with knots at t_1, ..., t_k gives a set of continuous piecewise linear functions B_{i,1}(x) which are non-zero on intervals of the form [t_i, t_{i+2}]. These functions are called 'hat' or 'chapeau' functions. They are not orthogonal, but since B_{i,1}(x) overlaps only with B_{i-1,1}(x) and B_{i+1,1}(x), the resulting model matrix used in computing the least-squares regression estimates is well-conditioned.

The more usual way of computing the linear B-spline sequence is through the formula

    B_{i,1}(x) = (x - t_i)/(t_{i+1} - t_i) 1_{x ∈ [t_i, t_{i+1})}(x) + (t_{i+2} - x)/(t_{i+2} - t_{i+1}) 1_{x ∈ [t_{i+1}, t_{i+2})}(x),    i = -1, 0, ..., k.    (1.3)

By taking t_{-1} = t_0 = a and t_{k+1} = b, we can use the above formula to compute B_{-1,1}(x) and B_{0,1}(x), B_{k-1,1}(x) and B_{k,1}(x), for x ∈ [a, b], using the convention that 0 × ∞ = 0. In total, this gives us k + 2 basis functions for continuous piecewise linear splines. For the case of knots at .3 and .6 and boundary knots at 0 and 1, these functions are plotted in Figure 1.1.[1] Note that, with two knots, there are 4 basis functions. The middle two have the 'hat' shape, while the first and last are 'incomplete' hats. Note that because the B-splines are nonzero on at most the union of two neighbouring inter-knot subintervals, some pairs of B-splines are mutually orthogonal.

[1] To obtain the plot, type

    library(splines)
    x <- seq(0, 1, length=50)
    matplot(x, bs(x, knots=c(.3, .6), intercept=TRUE, degree=1)[, 1:4],
            type="l")
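Formula (1.3) can be checked against the basis returned by bs(). The sketch below is our own illustration (it assumes, as holds in practice, that bs() orders its columns from left to right): the second basis function in Figure 1.1 should be the hat with knots (0, .3, .6).

    library(splines)

    # hat function B_{i,1} from formula (1.3) for a knot triple t = (t_i, t_{i+1}, t_{i+2})
    hat <- function(x, t) {
      (x - t[1]) / (t[2] - t[1]) * (x >= t[1] & x < t[2]) +
      (t[3] - x) / (t[3] - t[2]) * (x >= t[2] & x < t[3])
    }

    x <- seq(0, 1, length = 50)
    B <- bs(x, knots = c(.3, .6), intercept = TRUE, degree = 1)

    max(abs(B[, 2] - hat(x, c(0, .3, .6))))   # should be essentially zero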

Figure 1.1: Linear B-splines on [0, 1] corresponding to knots at .3 and .6.

Higher degree B-splines can be obtained, recursively, using an extension of the formula given at (1.3). That is, the pth degree B-spline functions are evaluated from the (p-1)st degree B-spline functions using

    B_{i,p}(x) = (x - t_i)/(t_{i+p} - t_i) B_{i,p-1}(x) + (t_{i+p+1} - x)/(t_{i+p+1} - t_{i+1}) B_{i+1,p-1}(x),    i = -p, -p + 1, ..., k.    (1.4)

Here, t_0 = t_{-1} = t_{-2} = ... = t_{-p} = a and t_{k+1} = b.

Figure 1.2 illustrates the 7 (i.e. p + k + 1) cubic B-splines on [0, 1] having knots at .3, .6 and .9. The knot locations have been highlighted using the rug() function.[2]

[2] This plot is obtained using

    matplot(x, bs(x, knots=c(.3, .6, .9), intercept=TRUE, degree=3)[, 1:7],
            type="l")
    rug(c(0, .3, .6, .9, 1), lwd=2, ticksize=.1)
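The recursion (1.4) can be implemented directly in a few lines. The following sketch is our own illustration, not the routine used for the figures; it evaluates one B-spline of degree p from a knot vector, taking terms with a zero denominator to be zero, and reproduces the seven cubic B-splines of Figure 1.2 when the boundary knots are repeated.

    # Evaluate the degree p B-spline that begins at knots[i] via the recursion (1.4);
    # terms with a zero denominator are taken to be zero (the 0 x Inf convention).
    bspline <- function(x, knots, i, p) {
      if (p == 0)
        return(as.numeric(x >= knots[i] & x < knots[i + 1]))
      d1 <- knots[i + p] - knots[i]
      d2 <- knots[i + p + 1] - knots[i + 1]
      w1 <- if (d1 > 0) (x - knots[i]) / d1 else 0
      w2 <- if (d2 > 0) (knots[i + p + 1] - x) / d2 else 0
      w1 * bspline(x, knots, i, p - 1) + w2 * bspline(x, knots, i + 1, p - 1)
    }

    # cubic B-splines on [0, 1] with interior knots .3, .6, .9 and repeated boundary knots
    kn <- c(rep(0, 4), .3, .6, .9, rep(1, 4))
    x  <- seq(0, 1, length = 200)
    B  <- sapply(1:7, function(i) bspline(x, kn, i, p = 3))
    matplot(x, B, type = "l")   # should resemble Figure 1.2 (apart from the right endpoint)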

Figure 1.2: Cubic B-splines on [0, 1] corresponding to knots at .3, .6 and .9.

From this figure, it can be seen that the ith cubic B-spline is nonzero only on the interval [t_i, t_{i+4}]. In general, the ith pth degree B-spline is nonzero only on the interval [t_i, t_{i+p+1}]. This property ensures that the ith and (i + j + 1)st B-splines are orthogonal, for j ≥ p. B-splines whose supports overlap are linearly independent.

1.1.2 Least-Squares Splines

Fitting a cubic spline to bivariate data can be done using least-squares. Using the truncated power basis, the model to be fit is of the form

    y_j = β_0 + β_1 x_j + ... + β_p x_j^p + β_{p+1}(x_j - t_1)_+^p + ... + β_{p+k}(x_j - t_k)_+^p + ε_j,    j = 1, 2, ..., n,

where ε_j satisfies the usual conditions. In vector-matrix form, we may write

    y = T β + ε    (1.5)

where T is an n × (p + k + 1) matrix whose first p + 1 columns correspond to the model matrix for pth degree polynomial regression, and whose (j, p + 1 + i) element is (x_j - t_i)_+^p. Applying least-squares to (1.5), we see that

    β̂ = (T^T T)^{-1} T^T y.

Thus, all of the usual linear regression technology is at our disposal here, including standard error estimates for coefficients and confidence and prediction intervals. Even regression diagnostics are applicable in the usual manner.
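Here is a minimal R sketch of a least-squares spline fit in the truncated power basis, using simulated data of our own (the data, knots and variable names are illustrative, not from the notes); lm() performs the least-squares solve rather than forming (T^T T)^{-1} explicitly.

    set.seed(2)
    n <- 100
    x <- sort(runif(n))
    y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)   # simulated data

    p     <- 3
    knots <- c(0.25, 0.5, 0.75)

    # model matrix T: polynomial columns (including the intercept) followed by
    # one truncated power column per knot
    Tmat <- cbind(outer(x, 0:p, "^"),
                  sapply(knots, function(t) pmax(x - t, 0)^p))

    fit <- lm(y ~ Tmat - 1)   # '- 1' because Tmat already contains an intercept column
    plot(x, y)
    lines(x, fitted(fit))     # the fitted cubic least-squares spline
    summary(fit)              # usual standard errors, t statistics, and so on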

The only difficulty is the poor conditioning of the truncated power basis, which will result in inaccuracies in the calculation of β̂. It is for this reason that the B-spline basis was introduced. Using this basis, we re-formulate the regression model as

    y_j = Σ_{i=0}^{p+k} β_i B_{i,p}(x_j) + ε_j    (1.6)

or in vector-matrix form

    y = B β + ε

where the (j, i) element of B is B_{i,p}(x_j). The least-squares estimate of β is then

    β̂ = (B^T B)^{-1} B^T y.

The orthogonality of the B-splines which are far enough apart results in a banded matrix B^T B, which has better conditioning properties than the matrix T^T T. The bandedness property actually allows for the use of more efficient numerical techniques in computing β̂. Again, all of the usual regression techniques are available. The only drawback with this model is that the coefficients are uninterpretable, and the B-splines are a little less intuitive than the truncated power functions.

We have been assuming that the knots are known. In general, they are unknown, and they must be chosen. Badly chosen knots can result in bad approximations. Because the spline regression problem can be formulated as an ordinary regression problem with a transformed predictor, it is possible to apply variable selection techniques such as backward selection to choose a set of knots. The usual approach is to start with a set of knots located at a subset of the order statistics of the predictor. Then backward selection is applied, using the truncated power basis form of the model. Each time a basis function is eliminated, the corresponding knot is eliminated. The method has drawbacks, notably the ill-conditioning of the basis as mentioned earlier.

Figure 1.3 exhibits an example of a least-squares spline with automatically generated knots, applied to a data set consisting of titanium measurements.[3] A version of backward selection was used to generate these knots; the stopping rule used was similar to the Akaike Information Criterion (AIC) discussed in Chapter 6. Although this least-squares spline fit to these data is better than what could be obtained using polynomial regression, it is unsatisfactory in many ways. The flat regions are not modelled smoothly enough, and the peak is cut off.

[3] To obtain Figure 1.3, type

    library(splines)
    attach(titanium)
    y.lm <- lm(g ~ bs(temperature, knots=c(755, 835, 905, 975),
                      Boundary.knots=c(550, 1100)))
    plot(titanium)
    lines(temperature, predict(y.lm))
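Continuing the simulated example above, the following sketch (our own illustration) fits the same cubic spline through the B-spline basis and compares the condition numbers of the two sets of normal equations:

    library(splines)

    # B-spline model matrix for the same x, degree and knots as the simulated example
    Bmat <- bs(x, knots = knots, degree = p, intercept = TRUE)

    fit.bs <- lm(y ~ Bmat - 1)
    max(abs(fitted(fit.bs) - fitted(fit)))   # near zero: both bases span the same spline space

    kappa(crossprod(Tmat))   # truncated power basis: badly conditioned normal equations
    kappa(crossprod(Bmat))   # B-spline basis: much better conditioned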

Figure 1.3: A least-squares spline fit to the titanium heat data using automatically generated knots. The knots used were 755, 835, 905, and 975.

Figure 1.4: A least-squares spline fit to the titanium heat data using manually-selected knots.

A substantial improvement can be obtained by manually selecting additional knots, and removing some of the automatically generated knots. In particular, we can render the peak more effectively by adding an additional knot in its vicinity. Adjusting the knot

that was already there improves the fit as well.[4]

1.1.3 Smoothing Splines

One way around the problem of choosing knots is to use lots of them. A result analogous to the Weierstrass approximation theorem says that any sufficiently smooth function can be approximated arbitrarily well by spline functions with enough knots.

The use of large numbers of knots alone is not sufficient to avoid trouble, since we will over-fit the data if the number of knots k is taken so large that p + k + 1 > n. In that case, we would have no degrees of freedom left for estimating the residual variance. A standard way of coping with this over-fitting problem is to apply a penalty term to the least-squares problem. One requires that the resulting spline regression estimate have low curvature, as measured by the square of the second derivative.

More precisely, one may try to minimize (for a given constant λ)

    Σ_{j=1}^n (y_j - S(x_j))^2 + λ ∫_a^b (S''(x))^2 dx

over the set of all functions S(x) which are twice continuously differentiable. The solution to this minimization problem has been shown to be a cubic spline which is surprisingly easy to calculate.[5] Thus, the problem of choosing a set of knots is replaced by selecting a value for the smoothing parameter λ. Note that if λ is small, the solution will be a cubic spline which almost interpolates the data; increasing values of λ render increasingly smooth approximations.

The usual way of choosing λ is by cross-validation. The ordinary cross-validation choice of λ minimizes

    CV(λ) = Σ_{j=1}^n (y_j - Ŝ_{λ,(j)}(x_j))^2

where Ŝ_{λ,(j)}(x) is the smoothing spline obtained with parameter λ from all of the data except the jth observation.

[4] The plot in Figure 1.4 can be generated using

    y.lm <- lm(g ~ bs(temperature, knots=c(755, 835, 885, 895, 915, 975),
                      Boundary.knots=c(550, 1100)))
    plot(titanium)
    lines(spline(temperature, predict(y.lm)))

[5] The B-spline coefficients for this spline can be obtained from an expression of the form

    β̂ = (B^T B + λ D^T D)^{-1} B^T y

where B is the matrix used for least-squares regression splines and D is a matrix that arises in the calculation involving the squared second derivatives of the spline. Details can be found in de Boor (1978). It is sufficient to note here that this approach has similarities with ridge regression, and that the estimated regression is a linear function of the responses.
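The CV(λ) criterion can be computed by brute force for a small data set. The sketch below is our own illustration using the simulated data from the earlier sketches and an arbitrary grid of λ values; in practice smooth.spline() performs (generalized) cross-validation internally and far more efficiently.

    # brute-force ordinary cross-validation for the smoothing parameter
    cv.score <- function(lam) {
      press <- 0
      for (j in seq_along(x)) {
        fit.j <- smooth.spline(x[-j], y[-j], lambda = lam)   # leave out observation j
        press <- press + (y[j] - predict(fit.j, x[j])$y)^2
      }
      press
    }

    lambdas <- 10^seq(-8, -2, length = 13)
    cv.vals <- sapply(lambdas, cv.score)
    lambdas[which.min(cv.vals)]   # the ordinary cross-validation choice of lambda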

Note that the CV function is similar in spirit to the PRESS statistic, but it is much more computationally intensive, since the regression must be computed for each value of j. Recall that PRESS could be computed from one regression, upon exploiting the diagonal of the hat matrix. If one pursues this idea in the choice of smoothing parameter, one obtains the so-called generalized cross-validation criterion, which turns out to have very similar behaviour to ordinary cross-validation but with lower computational requirements. The generalized cross-validation calculation is carried out using the trace of a matrix which plays the role of the hat matrix.[6]

Figure 1.5 exhibits a plot which is very similar to what could be obtained using

    lines(spline(smooth.spline(temperature, g)))

This invokes the smoothing spline with smoothing parameter selected by generalized cross-validation. To use ordinary cross-validation, include the argument cv=TRUE.

Penalized splines

The smoothing spline is an example of the use of penalized least-squares. In this case, regression functions which have large second derivatives are penalized; that is, the objective function to be minimized has something positive added to it. Regression functions which have small second derivatives are penalized less, but they may fail to fit the data as well. Thus, the penalty approach seeks a compromise between fitting the data and satisfying a constraint (in this case, a small second derivative).

Eilers and Marx (1996) generalized this idea by observing that other forms of penalties may be used. The penalized splines of Eilers and Marx (1996) are based on a similar idea to the smoothing spline, with two innovations. The penalty term is no longer in terms of a second derivative, but a second divided difference, and the number of knots can be specified. (For smoothing splines, the number of knots is effectively equal to the number of observations.)[7]

Recall that the least-squares spline problem is to find coefficients β_0, β_1, ..., β_{k+p} to minimize

    Σ_{j=1}^n (y_j - S(x_j))^2

where S(x) = Σ_{i=0}^{k+p} β_i B_{i,p}(x). Eilers and Marx (1996) advocate the use of k equally spaced knots, instead of the order statistics of the predictor variable. Note that the number of knots must be chosen somehow, but this is simpler than choosing knot locations.

The smoothness of a spline is related to its B-spline coefficients; thus, Eilers and Marx (1996) replace the second derivative penalty for the smoothing spline with ℓth order differences of the B-spline coefficients. These are defined as follows:

    ∆β_i = β_i - β_{i-1}

[6] The hat matrix is called the smoother matrix in this context, and for smoothing splines it is of the form H(λ) = B(B^T B + λ D^T D)^{-1} B^T.

[7] Penalized splines are implemented in the function psplinereg() in the R package kspline, which can be obtained from http://www.stats.uwo.ca/faculty/braun/Rlinks.htm
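To make the Eilers and Marx construction concrete, here is a rough R sketch of a penalized spline fit (our own illustration, not the psplinereg() function mentioned in footnote 7): a generous number of equally spaced knots, a cubic B-spline basis from bs(), and a second-order difference penalty on the coefficients, solved with the ridge-type formula of footnote 5. The smoothing parameter λ would again be chosen by (generalized) cross-validation.

    library(splines)

    # P-spline sketch: many equally spaced knots plus a difference penalty on the coefficients
    pspline.fit <- function(x, y, nknots = 20, lambda = 1, ord = 2) {
      knots <- seq(min(x), max(x), length = nknots + 2)[-c(1, nknots + 2)]  # interior knots
      B <- bs(x, knots = knots, degree = 3, intercept = TRUE)
      D <- diff(diag(ncol(B)), differences = ord)        # ord-th order difference matrix
      beta <- solve(crossprod(B) + lambda * crossprod(D), crossprod(B, y))
      list(beta = beta, fitted = as.vector(B %*% beta))
    }

    # using the simulated (x, y) from the earlier sketches
    fit1 <- pspline.fit(x, y, lambda = 0.1)
    fit2 <- pspline.fit(x, y, lambda = 100)   # heavier penalty, smoother fit
    plot(x, y)
    lines(x, fit1$fitted)
    lines(x, fit2$fitted, lty = 2)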
