Statistical Science 2004, Vol. 19, No. 4, 588–597
DOI 10.1214/088342304000000297
© Institute of Mathematical Statistics, 2004

Density Estimation

Simon J. Sheather

Simon J. Sheather is Professor of Statistics, Australian Graduate School of Management, University of New South Wales and the University of Sydney, Sydney, NSW 2052, Australia (e-mail: simonsh@agsm.edu.au).

Abstract. This paper provides a practical description of density estimation based on kernel methods. An important aim is to encourage practicing statisticians to apply these methods to data. As such, reference is made to implementations of these methods in R, S-PLUS and SAS.

Key words and phrases: Kernel density estimation, bandwidth selection, local likelihood density estimates, data sharpening.

1. INTRODUCTION

Density estimation has experienced an explosion of interest over the last 20 years. Silverman's (1986) book on this topic has been cited over 2000 times. Recent texts on smoothing which include detailed treatments of density estimation include Bowman and Azzalini (1997), Simonoff (1996) and Wand and Jones (1995). Density estimation has been applied in many fields, including archaeology (e.g., Baxter, Beardah and Westwood, 2000), banking (e.g., Tortosa-Ausina, 2002), climatology (e.g., Ferreyra et al., 2001), economics (e.g., DiNardo, Fortin and Lemieux, 1996), genetics (e.g., Segal and Wiemels, 2002), hydrology (e.g., Kim and Heo, 2002) and physiology (e.g., Paulsen and Heggelund, 1996).

This paper provides a practical description of density estimation based on kernel methods. An important aim is to encourage practicing statisticians to apply these methods to data. As such, reference is made to implementations of these methods in R, S-PLUS and SAS.

Section 2 provides a description of the basic properties of kernel density estimators. It is well known that the performance of kernel density estimators depends crucially on the value of the smoothing parameter, commonly referred to as the bandwidth. We describe methods for selecting the value of the bandwidth in Section 3. In Section 4, we describe two recent important improvements to kernel methods, namely, local likelihood density estimates and data sharpening. We compare the performance of some of the methods that have been discussed using a new example involving data from the U.S. PGA tour in Section 5. Finally, in Section 6 we provide some overall recommendations.

2. THE BASICS OF KERNEL DENSITY ESTIMATION

Let X_1, X_2, ..., X_n denote a sample of size n from a random variable with density f. The kernel density estimate of f at the point x is given by

(1)   \hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{x - X_i}{h} \right),

where the kernel K satisfies \int K(x)\,dx = 1 and the smoothing parameter h is known as the bandwidth. In practice, the kernel K is generally chosen to be a unimodal probability density symmetric about zero. In this case, K satisfies the conditions

\int K(y)\,dy = 1, \quad \int y K(y)\,dy = 0, \quad \int y^2 K(y)\,dy = \mu_2(K) > 0.

A popular choice for K is the Gaussian kernel, namely,

(2)   K(y) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{y^2}{2} \right).

Throughout this section we consider a small generated data set to illustrate the ideas presented. The data consist of a random sample of size n = 10 from a normal mixture distribution made up of observations from N(\mu = -1, \sigma^2 = (1/3)^2) and N(\mu = 1, \sigma^2 = (1/3)^2), each with probability 0.5.
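To make (1) concrete, the following is a minimal R sketch (ours, not from the paper) that evaluates the kernel estimate directly with the Gaussian kernel (2) on a sample of this kind; the helper name kde_at and the simulated mixture sample are illustrative assumptions.

```r
# A minimal sketch of the kernel density estimate in (1), using the
# Gaussian kernel (2); kde_at is a hypothetical helper name.
kde_at <- function(x, data, h) {
  # (1/(n*h)) * sum_i K((x - X_i)/h), vectorized over the grid x
  sapply(x, function(x0) mean(dnorm((x0 - data) / h)) / h)
}

set.seed(1)
data <- c(rnorm(5, -1, 1/3), rnorm(5, 1, 1/3))   # n = 10 mixture sample
grid <- seq(-2, 2, length.out = 200)
plot(grid, kde_at(grid, data, h = 0.3), type = "l",
     xlab = "Data", ylab = "Density")            # the setting of Figure 1
```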
Figure 1 shows a kernel estimate of the density for these data using the Gaussian kernel with bandwidth h = 0.3 (the dashed curve) along with the true underlying density (the solid curve). The 10 data points are marked by vertical lines on the horizontal axis. Centered at each data point is each point's contribution to the overall density estimate, namely, (1/(nh)) K((x - X_i)/h) (i.e., 1/n times a normal density with mean X_i and standard deviation h). The density estimate (the dashed curve) is the sum of these scaled normal densities. Increasing the value of h widens each normal curve, smoothing out the two modes currently apparent in the estimate.

FIG. 1. Kernel density estimate and contributions from each data point (dashed curve) along with the true underlying density (solid curve).

A Java applet that allows the user to watch the effects of changing the bandwidth and the shape of the kernel function on the resulting density estimate can be found at http://www-users.york.ac.uk/~jb35/mygr2.htm. It is well known that the value of the bandwidth is of critical importance, while the shape of the kernel function has little practical impact.

Assuming that the underlying density is sufficiently smooth and that the kernel has finite fourth moment, it can be shown using Taylor series that

\mathrm{Bias}\{\hat{f}_h(x)\} = \frac{h^2}{2} \mu_2(K) f''(x) + o(h^2),

\mathrm{Var}\{\hat{f}_h(x)\} = \frac{1}{nh} R(K) f(x) + o\!\left( \frac{1}{nh} \right),

where R(K) = \int K^2(y)\,dy (e.g., Wand and Jones, 1995, pages 20–21). In addition to the visual advantage of being a smooth curve, the kernel estimate has an advantage over the histogram in terms of bias. The bias of a histogram estimator with bin width h is of order h, whereas centering the kernel at each data point and using a symmetric kernel zeroes this term and as such produces a leading bias term for the kernel estimate of order h^2.

Adding the leading variance and squared bias terms produces the asymptotic mean squared error (AMSE)

\mathrm{AMSE}\{\hat{f}_h(x)\} = \frac{1}{nh} R(K) f(x) + \frac{h^4}{4} \mu_2(K)^2 [f''(x)]^2.

A widely used choice of an overall measure of the discrepancy between \hat{f} and f is the mean integrated squared error (MISE), which is given by

\mathrm{MISE}(\hat{f}_h) = E \int \{\hat{f}_h(y) - f(y)\}^2\,dy = \int \mathrm{Bias}(\hat{f}_h(y))^2\,dy + \int \mathrm{Var}(\hat{f}_h(y))\,dy.

Under an integrability assumption on f, integrating the expression for AMSE gives the expression for the asymptotic mean integrated squared error (AMISE), that is,

(3)   \mathrm{AMISE}\{\hat{f}_h\} = \frac{1}{nh} R(K) + \frac{h^4}{4} \mu_2(K)^2 R(f''),

where

R(f'') = \int [f''(y)]^2\,dy.

The value of the bandwidth that minimizes the AMISE is given by

(4)   h_{\mathrm{AMISE}} = \left[ \frac{R(K)}{\mu_2(K)^2 R(f'')} \right]^{1/5} n^{-1/5}.

Assuming that f is sufficiently smooth, we can use integration by parts to show that

R(f'') = \int [f''(y)]^2\,dy = \int f^{(4)}(y) f(y)\,dy.

Thus, the functional R(f'') is a measure of the underlying roughness or curvature. In particular, the larger the value of R(f'') is, the larger is the value of AMISE (i.e., the more difficult it is to estimate f) and the smaller is the value of h_AMISE (i.e., the smaller the bandwidth needed to capture the curvature in f).
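As a check on (3) and (4), the following R sketch (ours, not from the paper) evaluates the constants numerically for the Gaussian kernel and a standard normal f; the answer reproduces the factor 1.06 n^{-1/5} that appears in the normal reference rules of Section 3.1.

```r
# Sketch: evaluate h_AMISE in (4) for the Gaussian kernel and f = N(0,1).
RK  <- integrate(function(y) dnorm(y)^2, -Inf, Inf)$value      # R(K) = 1/(2*sqrt(pi))
mu2 <- integrate(function(y) y^2 * dnorm(y), -Inf, Inf)$value  # mu_2(K) = 1
f2  <- function(y) (y^2 - 1) * dnorm(y)                        # f'' for the N(0,1) density
Rf2 <- integrate(function(y) f2(y)^2, -Inf, Inf)$value         # R(f'') = 3/(8*sqrt(pi))
n   <- 100
(RK / (mu2^2 * Rf2))^(1/5) * n^(-1/5)    # approximately 1.06 * n^(-1/5)
```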
3. BANDWIDTH SELECTION FOR KERNEL DENSITY ESTIMATES

In this section, we briefly review methods for choosing a global value of the bandwidth h. Where applicable, reference is made to implementations of these methods in R, S-PLUS and SAS.

In SAS, PROC KDE produces kernel density estimates based on the usual Gaussian kernel (i.e., the Gaussian density with mean 0 and standard deviation 1), whereas S-PLUS has a function density which produces kernel density estimates with a default kernel, the Gaussian density with mean 0 and standard deviation 1/4. Thus, the bandwidths described in what follows must be multiplied by 4 when used in S-PLUS. The program R also has a function density which produces kernel density estimates with a default kernel, the Gaussian density with mean 0 and standard deviation 1.

3.1 Rules of Thumb

The computationally simplest method for choosing a global bandwidth h is based on replacing R(f''), the unknown part of h_AMISE, by its value for a parametric family expressed as a multiple of a scale parameter, which is then estimated from the data. The method seems to date back to Deheuvels (1977) and Scott (1979), who each proposed it for histograms. However, the method was popularized for kernel density estimates by Silverman (1986, Section 3.2), who used the normal distribution as the parametric family.

Let \sigma and IQR denote the standard deviation and interquartile range of X, respectively. Take the kernel K to be the usual Gaussian kernel. Assuming that the underlying distribution is normal, Silverman (1986, pages 45 and 47) showed that (4) reduces to

h_{\mathrm{AMISE,NORMAL}} = 1.06\, \sigma\, n^{-1/5}  and, equivalently,  h_{\mathrm{AMISE,NORMAL}} = 0.79\, \mathrm{IQR}\, n^{-1/5}.

Jones, Marron and Sheather (1996) studied the Monte Carlo performance of the normal reference bandwidth based on the standard deviation, that is, they considered

h_{\mathrm{SNR}} = 1.06\, S\, n^{-1/5},

where S is the sample standard deviation. In SAS PROC KDE, this method is called the simple normal reference (METHOD = SNR). Jones, Marron and Sheather (1996) found that h_SNR had a mean that was usually unacceptably large and thus often produced oversmoothed density estimates.

Furthermore, Silverman (1986, page 48) recommended reducing the factor 1.06 in the previous equation to 0.9, in an attempt not to miss bimodality, and using the smaller of two scale estimates. This rule is commonly used in practice and is often referred to as Silverman's reference bandwidth or Silverman's rule of thumb. It is given by

h_{\mathrm{SROT}} = 0.9\, A\, n^{-1/5},

where A = min{sample standard deviation, (sample interquartile range)/1.34}. In SAS PROC KDE, this method is called Silverman's rule of thumb (METHOD = SROT). In R, Silverman's bandwidth is invoked by bw = "bw.nrd0". In S-PLUS, Silverman's bandwidth with constant 1.06 rather than 0.9 is invoked by width = "nrd".

Terrell and Scott (1985) and Terrell (1990) developed a bandwidth selection method based on the maximal smoothing principle so as to produce oversmoothed density estimates. The method is based on choosing the "largest degree of smoothing compatible with the estimated scale of the density" (Terrell, 1990, page 470). Looking back at (3), this amounts to finding, for a given value of scale, the density f with the smallest value of R(f''). Taking the variance \sigma^2 as the scale parameter, Terrell (1990, page 471) found that the beta(4, 4) family of distributions with variance \sigma^2 minimizes R(f''). For the standard Gaussian kernel this leads to the oversmoothed bandwidth

h_{\mathrm{OS}} = 1.144\, S\, n^{-1/5}.

In SAS PROC KDE, this method is called oversmoothed (METHOD = OS).

Comparing the oversmoothed bandwidth with the normal reference bandwidth h_SNR, we see that the oversmoothed bandwidth is 1.08 times larger. Thus, in practice there is often very little visual difference between density estimates produced using either the oversmoothed bandwidth or the normal reference bandwidth.
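The rules of thumb above are simple enough to compute by hand. A brief R sketch (ours, not from the paper) follows, with the built-in bw.nrd0 shown for comparison since it implements Silverman's rule.

```r
# Sketch: the rules of thumb computed directly, compared with R's bw.nrd0.
set.seed(2)
x <- rnorm(200)
n <- length(x)
A <- min(sd(x), IQR(x) / 1.34)
c(SROT_by_hand = 0.9 * A * n^(-1/5),        # Silverman's rule of thumb
  SROT_builtin = bw.nrd0(x),                # R's implementation of the same rule
  SNR          = 1.06 * sd(x) * n^(-1/5),   # simple normal reference
  OS           = 1.144 * sd(x) * n^(-1/5))  # Terrell's oversmoothed bandwidth
```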
3.2 Cross-Validation Methods

A measure of the closeness of \hat{f} and f for a given sample is the integrated squared error (ISE), which is given by

\mathrm{ISE}(\hat{f}_h) = \int \{\hat{f}_h(y) - f(y)\}^2\,dy = \int (\hat{f}_h(y))^2\,dy - 2 \int \hat{f}_h(y) f(y)\,dy + \int f^2(y)\,dy.

Notice that the last term on the right-hand side of the previous expression does not involve h.
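For simulated data with a known f, the ISE can be traced over a grid of bandwidths; the R sketch below (ours, not from the paper) locates the sample-specific minimizer, the random quantity denoted \hat{h}_ISE later in this section.

```r
# Sketch: trace ISE(h) for simulated N(0,1) data, where f is known.
set.seed(3)
x <- rnorm(100)
fhat <- function(y, h) sapply(y, function(y0) mean(dnorm((y0 - x) / h)) / h)
ise  <- function(h)
  integrate(function(y) (fhat(y, h) - dnorm(y))^2, -4, 4)$value
h_grid <- seq(0.1, 1, by = 0.02)
h_grid[which.min(sapply(h_grid, ise))]     # this sample's ISE-optimal bandwidth
```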
Bowman (1984) proposed choosing the bandwidth as the value of h that minimizes the estimate of the two other terms in the last expression, namely

(5)   \frac{1}{n} \sum_{i=1}^{n} \int (\hat{f}_{-i}(y))^2\,dy - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{-i}(X_i),

where \hat{f}_{-i}(y) denotes the kernel estimator constructed from the data without the observation X_i. The method is commonly referred to as least squares cross-validation, since it is based on the so-called leave-one-out density estimator \hat{f}_{-i}(y). Rudemo (1982) proposed the same technique from a slightly different viewpoint. Bowman and Azzalini (1997, page 37) provided an explicit expression for (5) for the Gaussian kernel.

Stone (1984, pages 1285–1286) provided the following straightforward demonstration that the second term in (5) is an unbiased estimate of the second term in ISE. Observe

E\left[ \int \hat{f}_h(y) f(y)\,dy \right] = \int\!\!\int \frac{1}{h} K\!\left( \frac{y - x}{h} \right) f(x)\,dx\, f(y)\,dy = E\left[ \frac{1}{h} K\!\left( \frac{Y - X}{h} \right) \right].

This leads to the unbiased estimate of \int \hat{f}(y) f(y)\,dy:

\frac{1}{n(n-1)h} \sum_{i \neq j} K\!\left( \frac{X_i - X_j}{h} \right) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}_{-i}(X_i).

Hall (1983, page 1157) showed that

\frac{1}{n} \sum_{i=1}^{n} \int (\hat{f}_{-i}(y))^2\,dy = \int (\hat{f}_h(y))^2\,dy + O_p\!\left( \frac{1}{n^2 h} \right)

and hence changed the least squares cross-validation (LSCV) based criterion from (5) to

\mathrm{LSCV}(h) = \int (\hat{f}_h(y))^2\,dy - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{-i}(X_i),

since "it is slightly simpler to compute, without affecting the asymptotics." This version is the one used by most authors. We denote the value of h that minimizes LSCV(h) by h_LSCV. Least squares cross-validation is also referred to as unbiased cross-validation since

E[\mathrm{LSCV}(h)] = E\left[ \int \{\hat{f}_h(y) - f(y)\}^2\,dy \right] - \int f^2(y)\,dy = \mathrm{MISE} - \int f^2(y)\,dy.

In S-PLUS, h_LSCV is invoked by width = "bandwidth.ucv", while in R it is invoked by bw = "bw.ucv". The least squares cross-validation function LSCV(h) can have more than one local minimum (Hall and Marron, 1991). Thus, in practice, it is prudent to plot LSCV(h) and not just rely on the result of a minimization routine. Jones, Marron and Sheather (1996) recommended that the largest local minimizer of LSCV(h) be used as h_LSCV, since this value produces better empirical performance than the global minimizer. The Bowman and Azzalini (1997) library of S-PLUS functions contains the function cv(y, h) which produces values of LSCV(h) for the data set y over the vector of different bandwidth values in h.

For the least squares cross-validation based criterion, by using the representation

(6)   \mathrm{LSCV}(h) = \frac{1}{nh} R(K) + \frac{2}{n^2 h} \sum_{i<j} \gamma\!\left( \frac{X_i - X_j}{h} \right),

where \gamma(c) = \int K(w) K(w + c)\,dw - 2K(c), Scott and Terrell (1987) showed that

E[\mathrm{LSCV}(h)] = \frac{1}{nh} R(K) + \frac{h^4}{4} \mu_2(K)^2 R(f'') - R(f) + O(n^{-1}) = \mathrm{AMISE}\{\hat{f}_h\} - R(f) + O(n^{-1}).

Thus, least squares cross-validation essentially provides estimates of R(f''), the only unknown quantity in AMISE\{\hat{f}_h\}.

For a given set of data, denote the bandwidth that minimizes ISE(\hat{f}_h) by \hat{h}_{\mathrm{ISE}}. A number of authors (e.g., Gu, 1998) argued that the ideal bandwidth is the random quantity \hat{h}_{\mathrm{ISE}}, since it minimizes the ISE for the given sample. However, \hat{h}_{\mathrm{ISE}} is an inherently difficult quantity to estimate. In particular, Hall and Marron (1987a) showed that the smallest possible relative error for any data-based bandwidth \hat{h} is

\left| \frac{\hat{h}}{\hat{h}_{\mathrm{ISE}}} - 1 \right| = O_p(n^{-1/10}).

Hall and Marron (1987b) and Scott and Terrell (1987) showed that the least squares cross-validation bandwidth h_LSCV achieves this best possible convergence rate.
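Since the representation (6) makes LSCV(h) cheap to evaluate for the Gaussian kernel, a direct R sketch is easy. The closed form below uses the fact that the convolution of two standard normal kernels is a normal density with standard deviation sqrt(2); the function name lscv is ours, not from the paper.

```r
# Sketch of LSCV(h) for the Gaussian kernel: integral(fhat^2) has the closed
# form (1/(n^2 h)) * sum_{i,j} phi_sqrt2((X_i - X_j)/h).
lscv <- function(h, x) {
  n <- length(x)
  d <- outer(x, x, "-") / h                  # all (X_i - X_j)/h
  term1 <- sum(dnorm(d, sd = sqrt(2))) / (n^2 * h)
  diag(d) <- NA                              # drop i = j from the leave-one-out sum
  term2 <- 2 * sum(dnorm(d), na.rm = TRUE) / (n * (n - 1) * h)
  term1 - term2
}

set.seed(4)
x <- rnorm(500)
h_grid <- seq(0.02, 0.5, length.out = 100)
plot(h_grid, sapply(h_grid, lscv, x = x), type = "l",
     xlab = "h", ylab = "LSCV(h)")           # plot LSCV(h), as advised above
```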
In particular, they showed that

n^{1/10} \left( \frac{h_{\mathrm{LSCV}}}{\hat{h}_{\mathrm{ISE}}} - 1 \right)

has an asymptotic N(0, \sigma^2_{\mathrm{LSCV}}) distribution. The slow n^{-1/10} rate of convergence means that h_LSCV is highly variable in practice, a fact that has been demonstrated in many simulation studies (see Simonoff, 1996, page 76 for references). In addition to high variability, least squares cross-validation "often undersmooths in practice, in that it leads to spurious bumpiness in the underlying density" (Simonoff, 1996, page 76). On the other hand, a major advantage of least squares cross-validation over other methods is that it is widely applicable.

Figure 2 shows Gaussian kernel density estimates based on two different bandwidths for a sample of 500 data points from the standard normal distribution. The 500 data points are marked as vertical bars above the horizontal axis in Figure 2. The dramatically undersmoothed density estimate (depicted by the dashed line in Figure 2) is based on the bandwidth obtained from least squares cross-validation, in this case h_LSCV = 0.059, while the density estimate depicted by the solid curve in Figure 2 is based on the Sheather–Jones plug-in bandwidth, h_SJ (which is discussed below). In this case, h_SJ = 0.277. Since both the data and the kernel in this example are Gaussian, it is possible to perform exact MISE calculations (see Wand and Jones, 1995, pages 24–25 for details). The bandwidth which minimizes MISE in this case is h_MISE = 0.315.

FIG. 2. Kernel density estimates based on LSCV (dashed curve) and the Sheather–Jones plug-in (solid curve) for 500 data points from a standard normal distribution.

Following the advice given above, Figure 3 contains a plot of the least squares cross-validation function LSCV(h) against h. It is clear from this figure that the value h_LSCV = 0.059 is the unambiguous minimizer of LSCV(h). Thus, least squares cross-validation has performed poorly by depicting many modes in a situation in which the underlying density is easy to estimate (see Wand and Jones, 1995, pages 36–39 for further material on measuring how difficult a density is to estimate).

FIG. 3. Plot of the least squares cross-validation function LSCV(h) against h.

Scott and Terrell (1987) proposed a method called biased cross-validation (BCV), which is based on choosing the bandwidth that minimizes an estimate of AMISE rather than an estimate of ISE. The BCV objective function is just the estimate of AMISE obtained by replacing R(f'') in (3) by

(7)   R(\hat{f}''_h) - \frac{1}{nh^5} R(K''),

where \hat{f}''_h is the second derivative of the kernel density estimate (1) and the subscript h denotes the fact that the bandwidth used for this estimate is the same one used to estimate the density f itself. The reason for subtracting the second term in (7) is that this term is the positive constant bias term that corresponds to the diagonal terms in R(\hat{f}''_h).

The BCV objective function is thus given by

\mathrm{BCV}(h) = \frac{1}{nh} R(K) + \frac{h^4}{4} \mu_2(K)^2 \left[ R(\hat{f}''_h) - \frac{1}{nh^5} R(K'') \right] = \frac{1}{nh} R(K) + \frac{\mu_2(K)^2}{2 n^2 h} \sum_{i<j} \phi\!\left( \frac{X_i - X_j}{h} \right),

where \phi(c) = \int K''(w) K''(w + c)\,dw (Scott and Terrell, 1987). Notice the similarity between this last equation and the version of least squares cross-validation given in (6). We denote the bandwidth that minimizes BCV(h) by h_BCV.
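A direct, if slow, implementation of the BCV objective is possible by computing the curvature of the kernel estimate numerically. The sketch below (ours, not from the paper) follows the first form of BCV(h) above for the Gaussian kernel, where \mu_2(K) = 1.

```r
# Sketch of BCV(h) for the Gaussian kernel: estimate R(f'') from the kernel
# estimate's second derivative, subtract the diagonal bias term, plug into (3).
bcv <- function(h, x) {
  n <- length(x)
  fhat2 <- function(y)                       # second derivative of the estimate
    sapply(y, function(y0) {
      u <- (y0 - x) / h
      sum((u^2 - 1) * dnorm(u)) / (n * h^3)  # K''(u) = (u^2 - 1) * dnorm(u)
    })
  Rf2 <- integrate(function(y) fhat2(y)^2,
                   min(x) - 4 * h, max(x) + 4 * h)$value
  RK  <- 1 / (2 * sqrt(pi))                  # R(K) for the Gaussian kernel
  RK2 <- 3 / (8 * sqrt(pi))                  # R(K'') for the Gaussian kernel
  RK / (n * h) + (h^4 / 4) * (Rf2 - RK2 / (n * h^5))
}
# Usage: plot sapply(h_grid, bcv, x = x) over a grid, as for LSCV(h) above.
```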
In S-PLUS, h_BCV is invoked by width = "bandwidth.bcv", while in R it is invoked by bw = "bw.bcv". Scott (1992, page 167) pointed out that \lim_{h \to \infty} \mathrm{BCV}(h) = 0 and hence he recommended that h_BCV be taken as the largest local minimizer less than or equal to the oversmoothed bandwidth h_OS. On the other hand, Jones, Marron and Sheather (1996) recommend that h_BCV be taken as the smallest local minimizer, since they claim it gives better empirical performance.

Scott and Terrell (1987) showed that

n^{1/10} \left( \frac{h_{\mathrm{BCV}}}{h_{\mathrm{AMISE}}} - 1 \right)

has an asymptotic N(0, \sigma^2_{\mathrm{BCV}}) distribution. A related result holds for least squares cross-validation, namely, that

n^{1/10} \left( \frac{h_{\mathrm{LSCV}}}{h_{\mathrm{AMISE}}} - 1 \right)

has an asymptotic N(0, \sigma^2_{\mathrm{LSCV}}) distribution (Hall and Marron, 1987a; Scott and Terrell, 1987). According to Wand and Jones (1995, page 80), the ratio of the two asymptotic variances for the Gaussian kernel is

\frac{\sigma^2_{\mathrm{LSCV}}}{\sigma^2_{\mathrm{BCV}}} \approx 15.7,

thus indicating that bandwidths obtained from least squares cross-validation are expected to be much more variable than those obtained from biased cross-validation.

3.3 Plug-in Methods

The slow rate of convergence of LSCV and BCV encouraged much research on faster converging methods. A popular approach, commonly called the plug-in method, is to replace the unknown quantity R(f'') in the expression for h_AMISE given by (4) with an estimate. The method is commonly thought to date back to Woodroofe (1970), who proposed it for estimating the density at a given point. Estimating R(f'') by R(\hat{f}''_g) requires the user to choose the bandwidth g for this so-called pilot estimate. There are many ways this can be done. We next describe the "solve-the-equation" plug-in approach developed by Sheather and Jones (1991), since this method is widely recommended (e.g., Simonoff, 1996, page 77; Bowman and Azzalini, 1997, page 34; Venables and Ripley, 2002, page 129).

Different versions of the plug-in approach depend on the exact form of the estimate of R(f''). The Sheather and Jones (1991) approach is based on writing g, the pilot bandwidth for the estimate R(\hat{f}''), as a function of h, namely,

g(h) = C(K) \left[ \frac{R(f'')}{R(f''')} \right]^{1/7} h^{5/7},

and estimating the resulting unknown functionals of f using kernel density estimates with bandwidths based on normal rules of thumb. In this situation, the only unknown in the following equation is h:

h = \left[ \frac{R(K)}{\mu_2(K)^2 R(\hat{f}''_{g(h)})} \right]^{1/5} n^{-1/5}.

The Sheather–Jones plug-in bandwidth h_SJ is the solution to this equation. In S-PLUS, h_SJ is invoked by width = "bandwidth.sj", while in R it is invoked by bw = "bw.SJ". In SAS PROC KDE, this method is called Sheather–Jones plug-in (METHOD = SJPI).

Under smoothness assumptions on the underlying density,

n^{5/14} \left( \frac{h_{\mathrm{SJ}}}{h_{\mathrm{AMISE}}} - 1 \right)

has an asymptotic N(0, \sigma^2_{\mathrm{SJ}}) distribution. Thus, the Sheather–Jones plug-in bandwidth has a relative convergence rate of order n^{-5/14}, which is much faster than that of BCV. Most of the improvement is because BCV effectively uses the same bandwidth to estimate R(f'') as it does to estimate f, while the Sheather–Jones plug-in approach uses different bandwidths. However, it is important to note that the Sheather–Jones plug-in approach assumes more smoothness of the underlying density than either LSCV or BCV.

Jones, Marron and Sheather (1996) found that for easy to estimate densities [i.e., those for which R(f'') is relatively small], the distribution of h_SJ tends to be centered near h_AMISE and has much lower variability than the distribution of h_LSCV. For hard to estimate densities [i.e., those for which |f''(x)| varies widely], they found that the distribution of h_SJ tends to be centered at values larger than h_AMISE (and thus oversmooths) and again has much lower variability than the distribution of h_LSCV.
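All of the selectors discussed so far have implementations in R's stats package, so they can be compared on a single sample in a few lines; a sketch (ours) follows.

```r
# Sketch: the bandwidth selectors of Section 3 on one bimodal sample.
set.seed(5)
x <- c(rnorm(100, -1, 1/3), rnorm(100, 1, 1/3))
c(SROT = bw.nrd0(x),   # Silverman's rule of thumb
  SNR  = bw.nrd(x),    # normal reference with the constant 1.06
  LSCV = bw.ucv(x),    # least squares (unbiased) cross-validation
  BCV  = bw.bcv(x),    # biased cross-validation
  SJ   = bw.SJ(x))     # Sheather-Jones solve-the-equation plug-in
plot(density(x, bw = "SJ"))   # density() also accepts the selectors by name
```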
A number of authors have recommended that density estimates be drawn with more than one value of the bandwidth. Scott (1992, page 161) advised looking at "a sequence of (density) estimates based on the sequence of smoothing parameters h = h_OS/1.05^k for k = 0, 1, 2, ..., starting with the sample oversmoothed bandwidth h_OS and stopping when the estimate displays some instability and very local noise near the peaks."
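Scott's suggestion is straightforward to follow in R; the sketch below (ours, not from the paper) overlays the sequence of estimates starting from the oversmoothed bandwidth.

```r
# Sketch: Scott's family of estimates h = h_OS / 1.05^k, k = 0, 1, 2, ...
set.seed(6)
x <- c(rnorm(100, -1, 1/3), rnorm(100, 1, 1/3))
h_os <- 1.144 * sd(x) * length(x)^(-1/5)   # oversmoothed starting bandwidth
plot(density(x, bw = h_os), main = "",
     ylim = c(0, 0.8))                     # ylim chosen for this example
for (k in 1:10)
  lines(density(x, bw = h_os / 1.05^k), col = "grey")
```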
Marron and Chung (2001) also recommended looking at a family of density estimates for the given data set based on different values of the smoothing parameter. Marron and Chung (2001, page 198) advised that this family be based around a "center point" which is "an effective choice of the global smoothing parameter." They recommended the Sheather–Jones plug-in bandwidth for this purpose. Silverman (1981) showed that an important advantage of using a Gaussian kernel in this case is that the number of modes in the density estimate decreases monotonically as the bandwidth h increases. This means that the number of features in the estimated density is a decreasing function of the amount of smoothing.

4. POTENTIAL IMPROVEMENTS TO KERNEL DENSITY ESTIMATES

In this section, we consider recent improvements to kernel density estimates. We focus on two such improvements, namely local likelihood density estimates and data sharpening for density estimation.

4.1 Local Likelihood Density Estimates

One of the shortcomings of kernel density estimates is increased bias at and near the boundary. Wand and Jones (1995, pages 46–47) provided a detailed discussion of this phenomenon. One way to overcome this bias is to use a so-called boundary kernel, which is a modification of the kernel at and near the boundary. An alternative and more general approach is to use a local likelihood density estimate, which we discuss next.

The local log-polynomial likelihood density estimate of f(x) is produced by fitting, via maximum likelihood, a density of the form

\psi(u) = \psi(u - x \mid \theta_0, \theta_1, \ldots, \theta_p) = \exp\!\left( \sum_{j=0}^{p} \theta_j (u - x)^j \right)

in the neighborhood of x. As such, \theta = (\theta_0, \theta_1, \ldots, \theta_p) is chosen to maximize

(8)   \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{X_i - x}{h} \right) \log \psi(X_i - x \mid \theta) - \frac{1}{h} \int K\!\left( \frac{u - x}{h} \right) \psi(u - x \mid \theta)\,du.

If the bandwidth h is large, then the second term in (8) is close to zero and the first term in (8) is close to being proportional to the log-likelihood, assuming that the density of X is \psi.

Let \hat{\theta} = \hat{\theta}(x) = (\hat{\theta}_0, \hat{\theta}_1, \ldots, \hat{\theta}_p) denote this maximum. Then the local log-polynomial likelihood density estimate of f(x) of degree p is given by

\hat{f}_{\mathrm{LLPE}}(x) = \exp(\hat{\theta}_0)

(Hjort and Jones, 1996; Loader, 1996). The Loader (1999) library of S-PLUS functions, LOCFIT, contains functions that calculate local likelihood density estimates.

Taking p = 1 in (8) produces a density estimator with asymptotic bias of the same order as a kernel density estimator and thus it too suffers from the boundary bias problem described above. Taking p = 2 in (8) produces a density estimator with asymptotic bias identical to that of a boundary kernel, which corrects for boundary bias. In other words, any local likelihood density estimator based on p = 2 automatically corrects for boundary bias without having to explicitly define a boundary kernel. Another advantage of local likelihood density estimators is that choosing a high value of p in (8) produces density estimators with optimal rates of convergence without the spurious bumps and wiggles, and without the problem of taking negative values, that are characteristic of higher-order kernel estimators.
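For readers working in R rather than S-PLUS, Loader's LOCFIT library is available as the locfit package. A hedged sketch of a local log-quadratic (p = 2) density estimate follows, assuming that package is installed and using its lp() interface.

```r
# Sketch using the locfit package (Loader, 1999): a local log-quadratic
# (p = 2) density estimate for data with a boundary at zero.
library(locfit)
set.seed(7)
x <- rexp(200)                   # boundary at zero makes p = 2 attractive
fit <- locfit(~ lp(x, deg = 2))  # a formula with no response requests a density fit
plot(fit, get.data = TRUE)
```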
On the other hand, Hall and Tao (2002) argued that kernel density estimators can have distinct advantages over local likelihood density estimators when edge effects are not present. In the log-linear case (i.e., p = 1), Hall and Tao (2002) showed that "the asymptotic integrated squared bias (ISB) of a local log-linear estimator is strictly greater than its counterpart in the kernel case, whereas the asymptotic integrated squared variances are identical. Moreover, the ISB for local log-linear estimators can be up to four times greater, for densities that have two square integrable derivatives. Furthermore, this excess of bias occurs in cases where the bias is already large, and that fact tends particularly to exacerbate global performance difficulties experienced by local log-linear likelihood."

4.2 Data Sharpening

A new general method for reducing bias in density estimation was recently proposed by Hall and Minnotte (2002). The method is known as data sharpening, since it involves moving the data away from regions where they are sparse toward regions where the density is higher.

The method of Hall and Minnotte (2002) is based on the following result: If H_r is a smooth distribution function with the property

\int K(u)\, H_r'(x - hu)\,du = f(x) + O(h^r)

and \gamma_r = H_r^{-1}(\bar{F}), where \bar{F} is the distribution function that corresponds to the density \bar{f} = E(\hat{f}_h), then under smoothness conditions on f,

\frac{1}{h} E\left[ K\!\left( \frac{x - \gamma_r(X_i)}{h} \right) \right] = f(x) + O(h^r).

Comparing this last result with (1), we see that replacing the data X_i by \gamma_r(X_i), its so-called sharpened form, produces an estimator of f(x) which is always positive and for which the bias is O(h^r) (r = 4, 6, 8, ...) rather than the O(h^2) bias of \hat{f}_h(x).

In practice, \gamma_r has to be estimated. Using a Taylor series expansion on H_r and plug-in estimators, Hall and Minnotte (2002) produced the estimators

\hat{\gamma}_4 = I + \frac{h^2 \mu_2(K)}{2} \frac{\hat{f}'}{\hat{f}},

\hat{\gamma}_6 = \hat{\gamma}_4 + h^4 \left[ \left( \frac{\mu_4(K)}{24} - \frac{\mu_2(K)^2}{2} \right) \frac{\hat{f}'''}{\hat{f}} + \frac{\mu_2(K)^2}{2} \frac{\hat{f}' \hat{f}''}{\hat{f}^2} - \frac{\mu_2(K)^2}{8} \frac{(\hat{f}')^3}{\hat{f}^3} \right],

where I denotes the identity function. Thus, the data sharpened density estimator of order r is given by

\hat{f}_{r,h}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{x - \hat{\gamma}_r(X_i)}{h} \right).

Using a different approach, Samiuddin and El-Sayyad (1990) obtained the expression for \hat{f}_{4,h}. Note that the same bandwidth h is used for the original estimate \hat{f}, all necessary derivatives and the final sharpened estimate. This ensures that bias terms cancel. Finally, Hall and Minnotte (2002) found by Monte Carlo simulation that the optimal bandwidth for a sharpened density estimator is larger than the optimal bandwidth for a second-order kernel density estimator.
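The fourth-order sharpening transformation \hat{\gamma}_4 is simple to apply directly. The following R sketch (ours, not Minnotte's code) sharpens a sample with a Gaussian kernel, for which \mu_2(K) = 1, and then re-estimates the density from the sharpened points.

```r
# Sketch of fourth-order data sharpening with a Gaussian kernel:
# move each X_i by (h^2/2) * fhat'(X_i)/fhat(X_i), then re-estimate.
set.seed(8)
x <- c(rnorm(100, -1, 1/3), rnorm(100, 1, 1/3))
n <- length(x); h <- 0.15                    # illustrative bandwidth
fhat  <- function(y) sapply(y, function(y0) mean(dnorm((y0 - x) / h)) / h)
fhat1 <- function(y) sapply(y, function(y0) {
  u <- (y0 - x) / h
  -sum(u * dnorm(u)) / (n * h^2)             # first derivative of the estimate
})
x_sharp <- x + (h^2 / 2) * fhat1(x) / fhat(x)          # gamma-hat_4 applied
grid <- seq(-2.5, 2.5, length.out = 200)
f4 <- sapply(grid, function(y0) mean(dnorm((y0 - x_sharp) / h)) / h)
plot(grid, f4, type = "l", xlab = "Data", ylab = "Density")
```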
5. REAL DATA EXAMPLE

In this section we compare the performance of a number of the bandwidth selection methods described in Section 3 on a new example that involves data from the PGA golf tour. A number of other examples can be found in Sheather (1992).

In this example we look at data on putts per round, which is the average number of putts per round played for the top 175 players on the 1980 and 2001 PGA tours. The data were taken from http://www.golfweb.com/stats/leaders/r/1980/119 and http://www.golfweb.com/stats/leaders/r/2001/119. Interest centers on comparing the results for the two years to determine if there has been any improvement.

Figure 4 shows Gaussian kernel density estimates based on two different bandwidths for the data from 1980 and 2001 combined. The resulting 350 data points are marked as vertical bars above the horizontal axis in Figure 4. The density estimate depicted by the dashed line in Figure 4 is based on the bandwidth obtained from least squares cross-validation. In this case, h_LSCV = 0.054, producing an estimate with at least four modes, while the density estimate depicted by the solid curve in Figure 4 is based on the Sheather–Jones plug-in bandwidth h_SJ. In this case, h_SJ = 0.154, producing an estimate with just two modes, which we see below correspond to the fact that the data come from two separate years.

FIG. 4. Kernel density estimates based on LSCV (dashed curve) and the Sheather–Jones plug-in (solid curve) for the data from 1980 and 2001 combined.

Figure 5 shows Gaussian kernel density estimates based on two different bandwidths for the separate data sets from 1980 and 2001. The density estimates depicted by the dashed lines in Figure 5 are based on the bandwidths obtained from least squares cross-validation. In this case, h_LSCV = 0.061 (for 2001) and 0.187 (for 1980), while the density estimates depicted by the solid curves in Figure 5 are based on the Sheather–Jones plug-in bandwidth h_SJ. In this case, h_SJ = 0.121 (for 2001) and 0.158 (for 1980). While the two density estimates are very similar for 1980, the same is not true for 2001.
FIG. 5. Kernel density estimates based on LSCV (dashed curve) and the Sheather–Jones plug-in (solid curve) produced separately for the data from 1980 and 2001.

The density estimate for 2001 based on LSCV produces at least three modes. When one considers that the data are in the form of averages taken over at least 43 rounds, the density estimates based on the Sheather–Jones plug-in bandwidth seem to be more reasonable.
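The comparisons in Figures 4 and 5 can be sketched in R along the following lines, assuming the putts-per-round averages have been read into vectors putts1980 and putts2001 (hypothetical names; the data themselves are at the URLs above).

```r
# Hedged sketch of Figures 4 and 5; putts1980 and putts2001 are assumed
# to hold the putts-per-round averages for the two tours.
putts <- c(putts1980, putts2001)
plot(density(putts, bw = "SJ"))                   # combined data: two modes
plot(density(putts2001, bw = "SJ"))
lines(density(putts2001, bw = "ucv"), lty = 2)    # LSCV estimate for comparison
```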
FIG. 5. Kernel density estimates based on LSCV (dashed curve) and the Sheather–Jones plug-in (solid curve) produced separately for the data from 1980 and 2001.

The density estimate for 2001 based on LSCV produces at least three modes. When one considers that the data are in the form of averages taken over at least 43 rounds, the density estimates based on the Sheather–Jones plug-in bandwidth seem to be more reasonable.

Figure 6 shows a Gaussian kernel density estimate and a sixth-order sharpened estimate for the PGA data from 1980 and 2001 combined. The density estimate depicted by the solid curve in Figure 6 is based on the Sheather–Jones plug-in bandwidth $h_{\mathrm{SJ}}$. The density estimate depicted by the dashed line in Figure 6 is the sharpened estimate with bandwidth equal to $h_{\mathrm{SJ}}$. In this case, the sharpened estimate displays the two modes more distinctly.

FIG. 6. Kernel density estimate based on the Sheather–Jones plug-in (solid curve) and a sixth-order sharpened estimate for the data from 1980 and 2001 combined.

6. RECOMMENDED APPROACH

We conclude with the following recommended approach to density estimation: Always produce a family of density estimates based on a number of values of the bandwidth. Following Marron and Chung (2001), we recommend that this set of estimates be based around a "center point" bandwidth. Natural choices for this center-point bandwidth include the Sheather–Jones plug-in bandwidth $h_{\mathrm{SJ}}$ and the least squares cross-validation bandwidth $h_{\mathrm{LSCV}}$. The Sheather–Jones plug-in bandwidth is widely recommended because of its overall good performance. However, for hard-to-estimate densities [i.e., those for which $|f''(x)|$ varies widely] it tends to oversmooth. In these situations, least squares cross-validation often provides some protection against this, due to its tendency to undersmooth. Recall that it is important to plot the least squares cross-validation function LSCV(h) and not just rely on the result of a minimization routine. Finally, density estimates with more than one mode can generally be improved by using a higher-order sharpened estimate.
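To make this recommendation concrete, here is a rough R sketch under stated assumptions: the data are in a numeric vector `x` (the file name `putts.txt` is hypothetical), the family is built around the Sheather–Jones center-point bandwidth at arbitrary multiples, and the LSCV criterion is written out in the standard closed form for a Gaussian kernel so that the whole curve, not just the minimizer, can be inspected.

```r
## LSCV(h) for a Gaussian kernel, in closed form:
##   LSCV(h) = int fhat^2 - (2/n) sum_i fhat_{-i}(X_i)
lscv <- function(h, x) {
  n <- length(x)
  d <- outer(x, x, "-")                    # all pairwise differences X_i - X_j
  # integral of fhat^2: (1/n^2) sum_{i,j} phi(X_i - X_j; sd = h*sqrt(2))
  int.f2 <- mean(dnorm(d, sd = h * sqrt(2)))
  # leave-one-out term: average of fhat_{-i}(X_i), diagonal removed
  loo <- (sum(dnorm(d, sd = h)) - n * dnorm(0, sd = h)) / (n * (n - 1))
  int.f2 - 2 * loo
}

## Usage sketch: family of estimates around the center-point bandwidth,
## then a plot of the full LSCV(h) curve over a grid.
# x    <- scan("putts.txt")                       # hypothetical data file
# h.sj <- bw.SJ(x)
# plot(density(x, bw = h.sj))                     # center-point estimate
# for (c in c(1/2, 2)) lines(density(x, bw = c * h.sj), lty = 2)
# hs <- seq(h.sj / 5, 3 * h.sj, length.out = 100)
# plot(hs, sapply(hs, lscv, x = x), type = "l", xlab = "h", ylab = "LSCV(h)")
```

Plotting the curve this way makes any local minima of LSCV(h) visible at a glance, which is exactly the protection the minimization routine alone does not give.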
ACKNOWLEDGMENT

Michael Minnotte kindly provided his S-PLUS functions for calculating sharpened density estimates.

REFERENCES

BAXTER, M. J., BEARDAH, C. C. and WESTWOOD, S. (2000). Sample size and related issues in the analysis of lead isotope data. J. Archaeological Science 27 973–980.
BOWMAN, A. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika 71 353–360.
BOWMAN, A. and AZZALINI, A. (1997). Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. Oxford Univ. Press.
DEHEUVELS, P. (1977). Estimation nonparamétrique de la densité par histogrammes généralisés. Rev. Statist. Appl. 25 5–42.
DINARDO, J., FORTIN, N. M. and LEMIEUX, T. (1996). Labor market institutions and the distribution of wages, 1973–1992: A semiparametric approach. Econometrica 64 1001–1044.
FERREYRA, R. A., PODESTA, G. P., MESSINA, C. D., LETSON, D., DARDANELLI, J., GUEVARA, E. and MEIRA, S. (2001). A linked-modeling framework to estimate maize production risk associated with ENSO-related climate variability in Argentina. Agricultural and Forest Meteorology 107 177–192.
GU, C. (1998). Model indexing and smoothing parameter selection in nonparametric function estimation (with discussion). Statist. Sinica 8 607–646.
HALL, P. (1983). Large sample optimality of least-squares cross-validation in density estimation. Ann. Statist. 11 1156–1174.
HALL, P. and MARRON, J. S. (1987a). Extent to which least-squares cross-validation minimizes integrated square error in nonparametric density estimation. Probab. Theory Related Fields 74 567–581.
HALL, P. and MARRON, J. S. (1987b). Estimation of integrated squared density derivatives. Statist. Probab. Lett. 6 109–115.
HALL, P. and MARRON, J. S. (1991). Local minima in cross-validation functions. J. Roy. Statist. Soc. Ser. B 53 245–252.
HALL, P. and MINNOTTE, M. C. (2002). Higher order data sharpening for density estimation. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 141–157.
HALL, P. and TAO, T. (2002). Relative efficiencies of kernel and local likelihood density estimators. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 537–547.
HJORT, N. L. and JONES, M. C. (1996). Locally parametric nonparametric density estimation. Ann. Statist. 24 1619–1647.
JONES, M. C., MARRON, J. S. and SHEATHER, S. J. (1996). A brief survey of bandwidth selection for density estimation. J. Amer. Statist. Assoc. 91 401–407.
KIM, K.-D. and HEO, J.-H. (2002). Comparative study of flood quantiles estimation by nonparametric models. J. Hydrology 260 176–193.
LOADER, C. R. (1996). Local likelihood density estimation. Ann. Statist. 24 1602–1618.
LOADER, C. R. (1999). Local Regression and Likelihood. Springer, New York.
MARRON, J. S. and CHUNG, S. S. (2001). Presentation of smoothers: The family approach. Comput. Statist. 16 195–207.
PAULSEN, O. and HEGGELUND, P. (1996). Quantal properties of spontaneous EPSCs in neurones of the Guinea-pig dorsal lateral geniculate nucleus. J. Physiology 496 759–772.
RUDEMO, M. (1982). Empirical choice of histograms and kernel density estimators. Scand. J. Statist. 9 65–78.
SAMIUDDIN, M. and EL-SAYYAD, G. M. (1990). On nonparametric kernel density estimates. Biometrika 77 865–874.
SCOTT, D. W. (1979). On optimal and data-based histograms. Biometrika 66 605–610.
SCOTT, D. W. (1992). Multivariate Density Estimation: Theory, Practice and Visualization. Wiley, New York.
SCOTT, D. W. and TERRELL, G. R. (1987). Biased and unbiased cross-validation in density estimation. J. Amer. Statist. Assoc. 82 1131–1146.
SEGAL, M. R. and WIEMELS, J. L. (2002). Clustering of translocation breakpoints. J. Amer. Statist. Assoc. 97 66–76.
SHEATHER, S. J. (1992). The performance of six popular bandwidth selection methods on some real data sets (with discussion). Comput. Statist. 7 225–281.
SHEATHER, S. J. and JONES, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. J. Roy. Statist. Soc. Ser. B 53 683–690.
SILVERMAN, B. W. (1981). Using kernel density estimates to investigate multimodality. J. Roy. Statist. Soc. Ser. B 43 97–99.
SILVERMAN, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
SIMONOFF, J. S. (1996). Smoothing Methods in Statistics. Springer, New York.
STONE, C. J. (1984). An asymptotically optimal window selection rule for kernel density estimates. Ann. Statist. 12 1285–1297.
TERRELL, G. R. (1990). The maximal smoothing principle in density estimation. J. Amer. Statist. Assoc. 85 470–477.
TERRELL, G. R. and SCOTT, D. W. (1985). Oversmoothed nonparametric density estimates. J. Amer. Statist. Assoc. 80 209–214.
TORTOSA-AUSINA, E. (2002). Financial costs, operating costs, and specialization of Spanish banking firms as distribution dynamics. Applied Economics 34 2165–2176.
VENABLES, W. N. and RIPLEY, B. D. (2002). Modern Applied Statistics with S, 4th ed. Springer, New York.
WAND, M. P. and JONES, M. C. (1995). Kernel Smoothing. Chapman and Hall, London.
WOODROOFE, M. (1970). On choosing a delta-sequence. Ann. Math. Statist. 41 1665–1671.