1 Nonparametric Regression

Given data of the form (x_1, y_1), (x_2, y_2), . . . , (x_n, y_n), we seek an estimate of the regression function g(x) satisfying the model

    y = g(x) + ε

where the noise term ε satisfies the usual conditions assumed for simple linear regression. There are several approaches to this problem. In this chapter, we will describe methods involving splines as well as local polynomial methods.

1.1 Spline Regression

The discovery that piecewise polynomials, or splines, could be used in place of polynomials occurred in the early twentieth century. Splines have since become one of the most popular ways of approximating nonlinear functions. In this section, we will describe some of the basic properties of splines, introducing two bases. We will then go on to discuss how to estimate the coefficients of a spline using least-squares regression. We close this section with a discussion of smoothing splines.

1.1.1 Basic properties of splines

Splines are essentially defined as piecewise polynomials. In this subsection, we will describe how splines can be viewed as linear combinations of truncated power functions. We will then describe the B-spline basis, which is a more convenient basis for computing splines.

Truncated power functions

Let t be any real number; then we can define a pth degree truncated power function as

    (x − t)_+^p = (x − t)^p I_{x>t}(x).

As a function of x, this function takes on the value 0 to the left of t, and it takes on the value (x − t)^p to the right of t. The number t is called a knot.

The above truncated power function is a basic example of a spline. In fact, it is a member of the set of basis functions for the space of splines.
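To make the definition concrete, the following R sketch (an illustration added here, not part of the original text; the helper name tpower() is our own, not a standard R function) evaluates and plots a truncated power function.

    # A pth degree truncated power function with knot t:
    # zero to the left of t, (x - t)^p to the right.
    tpower <- function(x, t, p) (x - t)^p * (x > t)

    x <- seq(0, 1, length = 200)
    plot(x, tpower(x, t = 0.4, p = 3), type = "l")   # cubic truncated power function, knot at 0.4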
Properties of splines

For simplicity, let us consider a general pth degree spline with a single knot at t. Let P(x) denote an arbitrary pth degree polynomial

    P(x) = β_0 + β_1 x + β_2 x^2 + · · · + β_p x^p.

Then

    S(x) = P(x) + β_{p+1} (x − t)_+^p                                    (1.1)

takes on the value P(x) for any x ≤ t, and it takes on the value P(x) + β_{p+1}(x − t)^p for any x > t. Thus, restricted to each region, the function is a pth degree polynomial. As a whole, this function is a pth degree piecewise polynomial; there are two pieces.

Note that we require p + 2 coefficients to specify this piecewise polynomial. This is one more coefficient than the number needed to specify a pth degree polynomial, and this is a result of the addition of the truncated power function specified by the knot at t. In general, we may add k truncated power functions specified by knots at t_1, t_2, . . . , t_k, each multiplied by a different coefficient. This would result in p + k + 1 degrees of freedom.

An important property of splines is their smoothness. Polynomials are very smooth, possessing all derivatives everywhere. Splines possess all derivatives only at points which are not knots. The number of derivatives at a knot depends on the degree of the spline.

Consider the spline defined by (1.1). We can show that S(x) is continuous at t, when p > 0, by noting that S(t) = P(t) and

    lim_{x↓t} β_{p+1} (x − t)_+^p = 0,

so that

    lim_{x↓t} S(x) = P(t).

In fact, we can argue similarly for the first p − 1 derivatives:

    S^{(j)}(t) = P^{(j)}(t),   j = 1, 2, . . . , p − 1,

and

    lim_{x↓t} β_{p+1} p(p − 1) · · · (p − j + 1)(x − t)^{p−j} = 0,

so that

    lim_{x↓t} S^{(j)}(x) = P^{(j)}(t).

The pth derivative behaves differently:

    S^{(p)}(t) = p! β_p   and   lim_{x↓t} S^{(p)}(x) = p! β_p + p! β_{p+1},

so there is usually a discontinuity in the pth derivative at the knot.
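The jump in the pth derivative can be verified numerically. The short R sketch below (an added illustration; the cubic degree, the knot location and the coefficient values are arbitrary choices) checks that a cubic spline with one knot is continuous there while its third derivative jumps by 3! β_4.

    # Numerical check for p = 3: S(x) = P(x) + beta_4 * (x - t)^3_+ with knot t = 0.5.
    beta <- c(1, -2, 0.5, 1, 4)        # beta_0, ..., beta_3 and beta_4 (arbitrary values)
    t0 <- 0.5
    S <- function(x) drop(outer(x, 0:3, `^`) %*% beta[1:4]) + beta[5] * pmax(x - t0, 0)^3
    h <- 1e-3
    # Third forward difference approximates the third derivative (exact for cubics)
    d3 <- function(x) (S(x + 3*h) - 3*S(x + 2*h) + 3*S(x + h) - S(x)) / h^3
    d3(t0 + h) - d3(t0 - 4*h)          # approximately 3! * beta[5] = 24
    abs(S(t0 + 1e-8) - S(t0 - 1e-8))   # S itself is continuous at the knot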
Thus, pth degree splines are usually said to have no more than p − 1 continuous derivatives. For example, cubic splines often have two continuous derivatives. This is one reason for their popularity: having two continuous derivatives is often sufficient to give smooth approximations to functions. Furthermore, third degree piecewise polynomials are usually numerically well-behaved.

B-splines

The discussion below (1.1) indicates that we can represent any piecewise polynomial of degree p in the following way:

    S(x) = β_0 + β_1 x + β_2 x^2 + · · · + β_p x^p + β_{p+1}(x − t_1)_+^p + · · · + β_{p+k}(x − t_k)_+^p.     (1.2)

This can be rigorously proven; see de Boor (1978), for example. Display (1.2) says that any piecewise polynomial can be expressed as a linear combination of truncated power functions and polynomials of degree p. In other words,

    {1, x, x^2, . . . , x^p, (x − t_1)_+^p, (x − t_2)_+^p, . . . , (x − t_k)_+^p}

is a basis for the space of pth degree splines possessing knots at t_1, t_2, . . . , t_k.

By adding a noise term to (1.2), we can obtain a spline regression model relating a response y = S(x) + ε to the predictor x. Least-squares can be used to estimate the coefficients.

The normal equations associated with the truncated power basis are highly ill-conditioned. In the polynomial context, this problem is avoided by considering orthogonal polynomials. For p > 0, there are no orthogonal splines. However, there is a basis which is reasonably well-conditioned: the B-spline basis.

Let us first consider the piecewise constant functions (0th degree splines). These are normally not of much practical use. However, they provide us with a starting point to build up the B-spline basis.

The truncated power basis for the piecewise constant functions on an interval [a, b] having knots at t_1, t_2, . . . , t_k ∈ [a, b] is

    {1, (x − t_1)_+^0, . . . , (x − t_k)_+^0}.

Note that these basis functions overlap a lot. That is, they are often 0 together or 1 together. This will lead to the ill-conditioning mentioned earlier when applying least-squares. A more natural basis for piecewise constant functions is the following:

    {1_{x∈(a,t_1)}(x), 1_{x∈(t_1,t_2)}(x), 1_{x∈(t_2,t_3)}(x), . . . , 1_{x∈(t_{k−1},t_k)}(x), 1_{x∈(t_k,b)}(x)}.

To see that this is a basis for piecewise constant functions on [a, b] with jumps at t_1, . . . , t_k, note that any such function can be written as

    S_0(x) = β_0 1_{x∈(a,t_1)}(x) + Σ_{j=1}^{k−1} β_j 1_{x∈(t_j,t_{j+1})}(x) + β_k 1_{x∈(t_k,b)}(x).
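The contrast between the two bases can be seen directly in R. The sketch below (an added illustration; the knot locations and evaluation points are arbitrary choices) builds both model matrices for piecewise constant splines and compares their condition numbers, anticipating the conditioning discussion in the next paragraph: the indicator-basis columns are orthogonal because their supports do not overlap.

    # Truncated power basis versus indicator basis for piecewise constant splines
    x <- seq(0, 1, length = 50)
    knots <- c(0.3, 0.6)
    breaks <- c(0, knots, 1)

    # Truncated power basis: 1, (x - t1)^0_+, (x - t2)^0_+
    Tmat <- cbind(1, sapply(knots, function(t) as.numeric(x > t)))

    # Indicator basis: one column per inter-knot interval
    Bmat <- sapply(1:(length(breaks) - 1),
                   function(i) as.numeric(x > breaks[i] & x <= breaks[i + 1]))

    crossprod(Bmat)          # diagonal: the columns are mutually orthogonal
    kappa(crossprod(Tmat))   # larger condition number than
    kappa(crossprod(Bmat))   # that of the indicator basis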
Note that if we try to fit such a function to data of the form (x_1, y_1), (x_2, y_2), . . . , (x_n, y_n), we require at least one x value between any pair of knots (which include a and b, the so-called 'Boundary knots') in order to obtain a full-rank model matrix. This is required for invertibility of the normal equations. Note, as well, that the columns of the model matrix are orthogonal, which implies that this least-squares problem is now very well conditioned. The reason for the orthogonality is that the regions where the functions are non-zero do not overlap; each basis function is non-zero only on a single interval of the form [t_i, t_{i+1}]. Finally, observe that the size of this basis is the same as that of the truncated power basis, so we are requiring the same number of degrees of freedom.

Moving to the piecewise linear case, we note first that the piecewise constant B-splines could be obtained from their truncated counterparts by differencing. Two levels of differencing applied to the truncated linear functions on [a, b] with knots at t_1, . . . , t_k gives a set of continuous piecewise linear functions B_{i,1}(x) which are non-zero on intervals of the form [t_i, t_{i+2}]. These functions are called 'hat' or 'chapeau' functions. They are not orthogonal, but since B_{i,1}(x) overlaps only with B_{i−1,1}(x) and B_{i+1,1}(x), the resulting model matrix used in computing the least-squares regression estimates is well-conditioned.

The more usual way of computing the linear B-spline sequence is through the formula

    B_{i,1}(x) = (x − t_i)/(t_{i+1} − t_i) 1_{x∈[t_i,t_{i+1})}(x) + (t_{i+2} − x)/(t_{i+2} − t_{i+1}) 1_{x∈[t_{i+1},t_{i+2})}(x),   i = −1, 0, . . . , k.     (1.3)

By taking t_{−1} = t_0 = a and t_{k+1} = t_{k+2} = b, we can use the above formula to compute B_{−1,1}(x) and B_{0,1}(x), B_{k−1,1}(x) and B_{k,1}(x), for x ∈ [a, b], using the convention that 0 × ∞ = 0. In total, this gives us k + 2 basis functions for continuous piecewise linear splines. For the case of knots at .3 and .6 and boundary knots at 0 and 1, these functions are plotted in Figure 1.1.[1] Note that, with two knots, there are 4 basis functions. The middle two have the 'hat' shape, while the first and last are 'incomplete' hats. Note that because the B-splines are nonzero on at most the union of two neighbouring inter-knot subintervals, some pairs of B-splines are mutually orthogonal.

[1] To obtain the plot, type

    library(splines)
    x <- seq(0, 1, length=50)
    matplot(x, bs(x, knots=c(.3, .6), intercept=TRUE, degree=1)[, 1:4], type="l")
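Formula (1.3) can also be evaluated directly, without bs(). The sketch below (an added illustration; the helper name hat_basis() and the particular knot extension are our own choices, and the handling of repeated boundary knots is omitted) computes one interior hat function for the knot sequence of Figure 1.1.

    # Direct evaluation of the linear B-spline ("hat") formula (1.3) for an
    # interior index i; repeated boundary knots are not handled here.
    hat_basis <- function(x, knots, i) {
      # knots holds the extended sequence t_{-1}, t_0, ..., t_{k+2}, so t_i = knots[i + 2]
      ti <- knots[i + 2]; ti1 <- knots[i + 3]; ti2 <- knots[i + 4]
      (x - ti) / (ti1 - ti) * (x >= ti & x < ti1) +
        (ti2 - x) / (ti2 - ti1) * (x >= ti1 & x < ti2)
    }

    ext_knots <- c(0, 0, 0.3, 0.6, 1, 1)   # t_{-1}, ..., t_4 for knots .3, .6 on [0, 1]
    x <- seq(0, 1, length = 200)
    plot(x, hat_basis(x, ext_knots, i = 0), type = "l")   # hat peaking at t_1 = 0.3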
Figure 1.1: Linear B-splines on [0, 1] corresponding to knots at .3 and .6.

Higher degree B-splines can be obtained recursively, using an extension of the formula given at (1.3). That is, the pth degree B-spline functions are evaluated from the (p − 1)st degree B-spline functions using

    B_{i,p}(x) = (x − t_i)/(t_{i+p} − t_i) B_{i,p−1}(x) + (t_{i+p+1} − x)/(t_{i+p+1} − t_{i+1}) B_{i+1,p−1}(x),   i = −p, −p + 1, . . . , k.     (1.4)

Here, t_0 = t_{−1} = t_{−2} = · · · = t_{−p} = a and t_{k+1} = t_{k+2} = · · · = t_{k+p+1} = b.

Figure 1.2 illustrates the 7 (i.e. p + k + 1) cubic B-splines on [0, 1] having knots at .3, .6 and .9. The knot locations have been highlighted using the rug() function.[2]

[2] This plot is obtained using

    # x and library(splines) carry over from the previous footnote
    matplot(x, bs(x, knots=c(.3, .6, .9), intercept=TRUE, degree=3)[, 1:7], type="l")
    rug(c(0, .3, .6, .9, 1), lwd=2, ticksize=.1)
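Recursion (1.4) is straightforward to implement. The sketch below (an added illustration; the function name and the direct 1-based indexing into the extended knot vector are our own, and splines::bs() performs an equivalent construction internally) evaluates a B-spline of degree p by recursing down to the degree 0 indicator functions.

    # Evaluate B-splines by the recursion (1.4), starting from degree-0 indicators.
    # 'tk' is the full extended knot sequence (boundary knots repeated), and 'i'
    # indexes into it directly (1-based), rather than using the text's shifted indices.
    bspline <- function(x, tk, i, p) {
      if (p == 0) return(as.numeric(x >= tk[i] & x < tk[i + 1]))
      w1 <- 0
      if (tk[i + p] > tk[i]) w1 <- (x - tk[i]) / (tk[i + p] - tk[i])
      w2 <- 0
      if (tk[i + p + 1] > tk[i + 1]) w2 <- (tk[i + p + 1] - x) / (tk[i + p + 1] - tk[i + 1])
      w1 * bspline(x, tk, i, p - 1) + w2 * bspline(x, tk, i + 1, p - 1)
    }

    # The cubic B-splines of Figure 1.2: knots .3, .6, .9 on [0, 1]
    tk <- c(rep(0, 4), 0.3, 0.6, 0.9, rep(1, 4))   # boundary knots repeated p + 1 times
    x  <- seq(0, 1, length = 200)
    B  <- sapply(1:7, function(i) bspline(x, tk, i, p = 3))
    matplot(x, B, type = "l")
    # (the half-open intervals make the last basis function vanish exactly at x = 1;
    #  this right-endpoint detail is ignored here)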
Figure 1.2: Cubic B-splines on [0, 1] corresponding to knots at .3, .6 and .9.

From this figure, it can be seen that the ith cubic B-spline is nonzero only on the interval [t_i, t_{i+4}]. In general, the ith pth degree B-spline is nonzero only on the interval [t_i, t_{i+p+1}]. This property ensures that the ith and (i + j + 1)st B-splines are orthogonal, for j ≥ p. B-splines whose supports overlap are linearly independent.

1.1.2 Least-Squares Splines

Fitting a cubic spline to bivariate data can be done using least-squares. Using the truncated power basis, the model to be fit is of the form

    y_j = β_0 + β_1 x_j + · · · + β_p x_j^p + β_{p+1}(x_j − t_1)_+^p + · · · + β_{p+k}(x_j − t_k)_+^p + ε_j,   j = 1, 2, . . . , n,

where ε_j satisfies the usual conditions. In vector-matrix form, we may write

    y = Tβ + ε     (1.5)

where T is an n × (p + k + 1) matrix whose first p + 1 columns correspond to the model matrix for pth degree polynomial regression, and whose (j, p + 1 + i) element is (x_j − t_i)_+^p. Applying least-squares to (1.5), we see that

    β̂ = (T^T T)^{−1} T^T y.

Thus, all of the usual linear regression technology is at our disposal here, including standard error estimates for coefficients and confidence and prediction intervals. Even regression diagnostics are applicable in the usual manner.
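As a concrete illustration of (1.5), the following R sketch (added here; the simulated data and the helper name tpb() are our own, and lm() supplies the intercept column of T itself) builds the truncated power model matrix for a cubic spline and fits it by ordinary least squares, so that the usual standard errors, intervals and diagnostics are immediately available.

    # Least-squares cubic spline via the truncated power basis (1.5), simulated data
    set.seed(1)
    n <- 100
    x <- sort(runif(n))
    y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)

    tpb <- function(x, knots, p = 3) {
      # columns x, x^2, ..., x^p, then (x - t)^p_+ for each knot
      cbind(outer(x, 1:p, `^`),
            sapply(knots, function(t) pmax(x - t, 0)^p))
    }

    knots <- c(0.25, 0.5, 0.75)
    fit <- lm(y ~ tpb(x, knots))   # p + k + 1 = 7 coefficients, including the intercept
    summary(fit)                   # usual standard errors and t-tests
    plot(x, y)
    lines(x, fitted(fit))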
The only difficulty is the poor conditioning of the truncated power basis, which will result in inaccuracies in the calculation of β̂. It is for this reason that the B-spline basis was introduced. Using this basis, we re-formulate the regression model as

    y_j = Σ_{i=0}^{p+k} β_i B_{i,p}(x_j) + ε_j     (1.6)

or, in vector-matrix form,

    y = Bβ + ε

where the (j, i) element of B is B_{i,p}(x_j). The least-squares estimate of β is then

    β̂ = (B^T B)^{−1} B^T y.

The orthogonality of the B-splines which are far enough apart results in a banded matrix B^T B, which has better conditioning properties than the matrix T^T T. The bandedness property actually allows for the use of more efficient numerical techniques in computing β̂. Again, all of the usual regression techniques are available. The only drawback with this model is that the coefficients are uninterpretable, and the B-splines are a little less intuitive than the truncated power functions.

We have been assuming that the knots are known. In general, they are unknown, and they must be chosen. Badly chosen knots can result in bad approximations. Because the spline regression problem can be formulated as an ordinary regression problem with a transformed predictor, it is possible to apply variable selection techniques such as backward selection to choose a set of knots. The usual approach is to start with a set of knots located at a subset of the order statistics of the predictor. Then backward selection is applied, using the truncated power basis form of the model. Each time a basis function is eliminated, the corresponding knot is eliminated. The method has drawbacks, notably the ill-conditioning of the basis as mentioned earlier.

Figure 1.3 exhibits an example of a least-squares spline with automatically generated knots, applied to a data set consisting of titanium measurements.[3] A version of backward selection was used to generate these knots; the stopping rule used was similar to the Akaike Information Criterion (AIC) discussed in Chapter 6. Although this least-squares spline fit to these data is better than what could be obtained using polynomial regression, it is unsatisfactory in many ways. The flat regions are not modelled smoothly enough, and the peak is cut off.

[3] To obtain Figure 1.3, type

    library(splines)
    attach(titanium)   # the titanium data set, with columns temperature and g
    y.lm <- lm(g ~ bs(temperature, knots=c(755, 835, 905, 975),
                      Boundary.knots=c(550, 1100)))
    plot(titanium)
    lines(temperature, predict(y.lm))
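The conditioning claim can be checked numerically. The sketch below (an added illustration; the evaluation points and knot sequence are arbitrary choices) compares the condition numbers of T^T T and B^T B for a cubic spline and displays part of the banded matrix B^T B.

    # Conditioning of the truncated power basis versus the B-spline basis
    library(splines)
    x <- seq(0.01, 0.99, length = 200)
    knots <- seq(0.1, 0.9, by = 0.1)

    # Truncated power basis matrix T (intercept, polynomial and truncated terms)
    Tmat <- cbind(1, outer(x, 1:3, `^`),
                  sapply(knots, function(t) pmax(x - t, 0)^3))

    # B-spline basis matrix B for the same knots
    Bmat <- bs(x, knots = knots, degree = 3, intercept = TRUE)

    kappa(crossprod(Tmat))   # very large: ill-conditioned normal equations
    kappa(crossprod(Bmat))   # orders of magnitude smaller
    round(crossprod(Bmat), 3)[1:6, 1:6]   # banded: zeros once the splines no longer overlap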
Figure 1.3: A least-squares spline fit to the titanium heat data using automatically generated knots. The knots used were 755, 835, 905, and 975.

Figure 1.4: A least-squares spline fit to the titanium heat data using manually-selected knots.

A substantial improvement can be obtained by manually selecting additional knots, and removing some of the automatically generated knots. In particular, we can render the peak more effectively by adding an additional knot in its vicinity. Adjusting the knot that was already there improves the fit as well.[4]
1.1.3 Smoothing Splines

One way around the problem of choosing knots is to use lots of them. A result analogous to the Weierstrass approximation theorem says that any sufficiently smooth function can be approximated arbitrarily well by spline functions with enough knots.

The use of large numbers of knots alone is not sufficient to avoid trouble, since we will over-fit the data if the number of knots k is taken so large that p + k + 1 > n. In that case, we would have no degrees of freedom left for estimating the residual variance. A standard way of coping with the over-fitting problem is to apply a penalty term to the least-squares problem. One requires that the resulting spline regression estimate have low curvature, as measured by the square of the second derivative.

More precisely, one may try to minimize (for a given constant λ)

    Σ_{j=1}^{n} (y_j − S(x_j))^2 + λ ∫_a^b (S''(x))^2 dx

over the set of all functions S(x) which are twice continuously differentiable. The solution to this minimization problem has been shown to be a cubic spline which is surprisingly easy to calculate.[5] Thus, the problem of choosing a set of knots is replaced by selecting a value for the smoothing parameter λ. Note that if λ is small, the solution will be a cubic spline which almost interpolates the data; increasing values of λ render increasingly smooth approximations.

The usual way of choosing λ is by cross-validation. The ordinary cross-validation choice of λ minimizes

    CV(λ) = Σ_{j=1}^{n} (y_j − Ŝ_{λ,(j)}(x_j))^2

where Ŝ_{λ,(j)}(x) is the smoothing spline obtained with parameter λ from all of the data except the jth observation. Note that the CV function is similar in spirit to the PRESS statistic, but it is much more computationally intensive, since the regression must be computed for each value of j.

[4] The plot in Figure 1.4 can be generated using

    # continues the session of footnote 3 (splines loaded, titanium attached)
    y.lm <- lm(g ~ bs(temperature, knots=c(755, 835, 885, 895, 915, 975),
                      Boundary.knots=c(550, 1100)))
    plot(titanium)
    lines(spline(temperature, predict(y.lm)))

[5] The B-spline coefficients for this spline can be obtained from an expression of the form

    β̂ = (B^T B + λ D^T D)^{−1} B^T y

where B is the matrix used for least-squares regression splines and D is a matrix that arises in the calculation involving the squared second derivatives of the spline. Details can be found in de Boor (1978). It is sufficient to note here that this approach has similarities with ridge regression, and that the estimated regression is a linear function of the responses.
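The ridge-like formula in footnote 5 can be turned into code directly. The sketch below (an added illustration, not the author's implementation; for simplicity it uses a second-order difference matrix as D, which is the penalty of the penalized splines discussed at the end of this section rather than the exact second-derivative penalty) computes penalized B-spline coefficients on simulated data for a few values of λ.

    # Penalized B-spline fit via beta = (B'B + lambda D'D)^{-1} B'y, with D taken
    # to be a second-order difference matrix (a simplification of footnote 5).
    library(splines)
    set.seed(1)
    n <- 100
    x <- sort(runif(n))
    y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)

    B <- bs(x, df = 20, degree = 3, intercept = TRUE)   # a generous number of knots
    D <- diff(diag(ncol(B)), differences = 2)           # second-order differences

    pen_fit <- function(lambda) {
      beta <- solve(crossprod(B) + lambda * crossprod(D), crossprod(B, y))
      drop(B %*% beta)
    }

    plot(x, y)
    for (lambda in c(0.01, 1, 100)) lines(x, pen_fit(lambda))
    # small lambda: close to the unpenalized least-squares spline; large lambda: smoother fits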
Recall that PRESS could be computed from one regression, by exploiting the diagonal of the hat matrix. If one pursues this idea in the choice of smoothing parameter, one obtains the so-called generalized cross-validation criterion, which turns out to have very similar behaviour to ordinary cross-validation but with lower computational requirements. The generalized cross-validation calculation is carried out using the trace of a matrix which plays the role of the hat matrix.[6]

Figure 1.5 exhibits a plot which is very similar to what could be obtained using

    lines(spline(smooth.spline(temperature, g)))

This invokes the smoothing spline with smoothing parameter selected by generalized cross-validation. To use ordinary cross-validation, include the argument cv=TRUE.

Penalized splines

The smoothing spline is an example of the use of penalized least-squares. In this case, regression functions which have large second derivatives are penalized; that is, the objective function to be minimized has something positive added to it. Regression functions which have small second derivatives are penalized less, but they may fail to fit the data as well. Thus, the penalty approach seeks a compromise between fitting the data and satisfying a constraint (in this case, a small second derivative).

Eilers and Marx (1996) generalized this idea by observing that other forms of penalties may be used. The penalized splines of Eilers and Marx (1996) are based on a similar idea to the smoothing spline, with two innovations. The penalty term is no longer in terms of a second derivative, but a second divided difference, and the number of knots can be specified. (For smoothing splines, the number of knots is effectively equal to the number of observations.)[7]

Recall that the least-squares spline problem is to find coefficients β_0, β_1, . . . , β_{k+p} to minimize

    Σ_{j=1}^{n} (y_j − S(x_j))^2

where S(x) = Σ_{i=0}^{k+p} β_i B_{i,p}(x). Eilers and Marx (1996) advocate the use of k equally spaced knots, instead of the order statistics of the predictor variable. Note that the number of knots must be chosen somehow, but this is simpler than choosing knot locations.

The smoothness of a spline is related to its B-spline coefficients; thus, Eilers and Marx (1996) replace the second derivative penalty for the smoothing spline with ℓth order differences of the B-spline coefficients. These are defined as follows:

    Δβ_i = β_i − β_{i−1}

[6] The hat matrix is called the smoother matrix in this context, and for smoothing splines it is of the form H(λ) = B(B^T B + λ D^T D)^{−1} B^T.

[7] Penalized splines are implemented in the function psplinereg() in the R package kspline, which can be obtained from http://www.stats.uwo.ca/faculty/braun/Rlinks.htm.
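If the psplinereg() function mentioned in footnote 7 is not available, the approach is easy to sketch from scratch. The code below (an added illustration under stated assumptions: equally spaced knots, a second-order difference penalty, and a simple grid search over λ using generalized cross-validation; the function and variable names are our own) fits a penalized spline in the spirit of Eilers and Marx (1996).

    # P-spline sketch: B-spline basis on equally spaced knots plus a difference penalty
    library(splines)
    set.seed(2)
    n <- 150
    x <- sort(runif(n))
    y <- exp(-3 * x) * sin(6 * x) + rnorm(n, sd = 0.05)

    k <- 20                                       # number of interior knots (user's choice)
    knots <- seq(min(x), max(x), length = k + 2)[-c(1, k + 2)]   # equally spaced interior knots
    B <- bs(x, knots = knots, degree = 3, intercept = TRUE)
    D <- diff(diag(ncol(B)), differences = 2)     # second-order difference penalty

    # Generalized cross-validation score for a given lambda
    gcv <- function(lambda) {
      H <- B %*% solve(crossprod(B) + lambda * crossprod(D), t(B))   # smoother matrix
      fit <- drop(H %*% y)
      edf <- sum(diag(H))                         # effective degrees of freedom
      n * sum((y - fit)^2) / (n - edf)^2
    }

    lambdas <- 10^seq(-4, 4, length = 50)
    best <- lambdas[which.min(sapply(lambdas, gcv))]
    beta <- solve(crossprod(B) + best * crossprod(D), crossprod(B, y))
    plot(x, y)
    lines(x, drop(B %*% beta))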