A Review on Empirical Likelihood Methods for Regression

Song Xi Chen · Ingrid Van Keilegom

S.X. Chen
Department of Statistics, Iowa State University, Ames, Iowa 50011-1210, USA and Guanghua School of Management, Peking University, China
Tel.: 1-515-2942729
Fax: 1-515-2944040
E-mail: songchen@iastate.edu; csx@gsm.pku.edu.cn

I. Van Keilegom
Institute of Statistics, Université catholique de Louvain, Voie du Roman Pays 20, 1348 Louvain-la-Neuve, Belgium
Tel.: +32 10 47 43 30
Fax: +32 10 47 30 32
E-mail: ingrid.vankeilegom@uclouvain.be

Abstract We provide a review of the empirical likelihood method for regression-type inference problems. The regression models considered in this review include parametric, semiparametric and nonparametric models. Both missing data and censored data are accommodated.

Keywords Censored data; empirical likelihood; missing data; nonparametric regression; parametric regression; semiparametric regression; Wilks' theorem.

1 Introduction

It has been twenty years since Art Owen published his seminal paper (Owen, 1988) that introduced the notion of empirical likelihood (EL). Since then, there has been a rich body of literature on the novel idea of formulating versions of nonparametric likelihood in various settings of statistical inference. There have been two major reviews of the empirical likelihood. The first review was given by Hall and La Scala (1990) in the early years of the EL method and summarized some key properties of the method. The second was the book by the inventor of the methodology (Owen, 2001), which provided a comprehensive overview up to that time.

The body of empirical likelihood literature is increasing rapidly, and it would be a daunting task to review the entire field in one review paper like this one. We therefore decided to concentrate our review on regression due to its prominence in statistical
inference. The regression models considered in this review cover parametric, nonparametric and semiparametric regression models. In addition to the case of completely observed data, we also accommodate missing and censored data in this review.

The EL method (Owen, 1988, 1990) owes its broad usage and fast research development to a number of important advantages. Generally speaking, it combines the reliability of nonparametric methods with the effectiveness of the likelihood approach. It yields confidence regions that respect the boundaries of the support of the target parameter. The regions are invariant under transformations and often behave better than confidence regions based on asymptotic normality when the sample size is small. Moreover, they are of natural shape and orientation since the regions are obtained by contouring a log-likelihood ratio, and they often do not require the estimation of the variance, as the studentization is carried out internally via the optimization procedure. The EL method turns out to be appealing not only for obtaining confidence regions, but it also has its unique attractions in parameter estimation and in formulating goodness-of-fit tests.

2 Parametric regression

Suppose that we observe a sample of independent observations $\{(X_i^T, Y_i)^T\}_{i=1}^n$, where each $Y_i$ is regarded as the response of a $d$-dimensional design (covariate) variable $X_i$. The primary interest here is in the conditional mean function (regression function) of $Y_i$ given $X_i$. One distinguishes between the design $X_i$ being either fixed or random. Although regression is conventionally associated with fixed designs, for ease of presentation we will concentrate on random designs. The empirical likelihood analysis for fixed designs can usually be extended by regularizing the random designs.

Consider first the following parametric regression model:
\[
Y_i = m(X_i;\beta) + \varepsilon_i, \quad \text{for } i = 1,\ldots,n, \tag{1}
\]
where $m(x;\beta)$ is the known regression function with an unknown $p$-dimensional parameter $\beta$, and the residuals $\varepsilon_i$ have conditional mean zero given $X_i$. Commonly used special cases are the linear regression model $m(x;\beta) = x^T\beta$ and the generalized linear model $m(x;\beta) = G(x^T\beta)$ for a known link function $G$, in which case the conditional variance of $Y_i$ given $X_i$ may also be modelled parametrically, say as $\sigma^2 V\{m(X_i;\beta)\}$ for a known variance function $V$ and an unknown constant $\sigma^2 > 0$. Note that for these two special cases $p = d$.

In the absence of model information on the conditional variance, the least squares (LS) regression estimator of $\beta$ is obtained by minimizing the sum of squares
\[
S_n(\beta) = \sum_{i=1}^n \{Y_i - m(X_i;\beta)\}^2.
\]
The LS estimator of $\beta$ is $\hat\beta_{ls} = \arg\inf_\beta S_n(\beta)$. When the regression function $m(x;\beta)$ is smooth enough with respect to $\beta$, $\hat\beta_{ls}$ will be a solution of the following estimating
equation:
\[
\sum_{i=1}^n \frac{\partial m(X_i;\beta)}{\partial\beta}\,\{Y_i - m(X_i;\beta)\} = 0. \tag{2}
\]

Suppose that $\beta_0$ is the true parameter value, that is, the unique value of $\beta$ for which $E\big[\frac{\partial m(X_i;\beta)}{\partial\beta}\{Y_i - m(X_i;\beta)\} \,\big|\, X_i\big] = 0$. Let $p_1,\ldots,p_n$ be a set of probability weights allocated to the data. The empirical likelihood (EL) for $\beta$, in the spirit of Owen (1988, 1991), is
\[
L_n(\beta) = \max \prod_{i=1}^n p_i, \tag{3}
\]
where the maximization is subject to the constraints
\[
\sum_{i=1}^n p_i = 1 \quad\text{and} \tag{4}
\]
\[
\sum_{i=1}^n p_i\,\frac{\partial m(X_i;\beta)}{\partial\beta}\,\{Y_i - m(X_i;\beta)\} = 0. \tag{5}
\]
The empirical likelihood, as conveyed by (3), is essentially a constrained profile likelihood, with the trivial constraint (4) indicating that the $p_i$'s are probability weights. The constraint (5) is the most important one, as it defines the nature of the parameters. This formulation is similar to the original one given in Owen (1988, 1990) for the mean parameter, say $\mu$, of $X_i$. There the second constraint, reflecting the nature of $\mu$, was given by $\sum_{i=1}^n p_i(X_i - \mu) = 0$.

To obtain the empirical likelihood at each candidate parameter value $\beta$, the above optimization problem as given in (3), (4) and (5) has to be solved for the optimal $p_i$'s. It may be surprising at first that the above optimization problem can admit a solution, as there are $n$ weights $p_i$ to be determined with only $p+1$ constraints. As the objective function $L_n(\beta)$ is concave and the constraints are linear in the $p_i$'s, the optimization problem does admit unique solutions.

The algorithm for computing $L_n(\beta)$ at a candidate $\beta$ is as follows. If the convex hull of the set of points (depending on $\beta$) $\{\frac{\partial m(X_i;\beta)}{\partial\beta}\{Y_i - m(X_i;\beta)\}\}_{i=1}^n$ in $R^p$ contains the origin (zero) of $R^p$, then the EL optimization problem for $L_n(\beta)$ admits a solution. If the zero of $R^p$ is not contained in the convex hull of the points for the given $\beta$, then $L_n(\beta)$ does not admit a finite solution, as some weights $p_i$ would be forced to take negative values; see Owen (1988, 1990) for a discussion of this aspect. Tsao (2004) studied the probability of the EL not admitting a finite solution and the dependence of this probability on dimensionality.
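To make this convex-hull condition concrete, the following is a minimal illustrative sketch (not taken from the papers cited above; the function name and the use of scipy.optimize.linprog are illustrative assumptions): it checks whether the origin of $R^p$ lies in the convex hull of the points $Z_i(\beta) = \frac{\partial m(X_i;\beta)}{\partial\beta}\{Y_i - m(X_i;\beta)\}$ by solving a small feasibility linear program.

```python
import numpy as np
from scipy.optimize import linprog

def origin_in_convex_hull(Z):
    """Check whether 0 in R^p lies in the convex hull of the rows of Z (n x p).

    Feasibility LP: find weights w_i >= 0 with sum(w) = 1 and Z^T w = 0.
    If such weights exist, the EL problem for this beta admits a solution.
    """
    n, p = Z.shape
    A_eq = np.vstack([Z.T, np.ones((1, n))])       # Z^T w = 0 and 1^T w = 1
    b_eq = np.concatenate([np.zeros(p), [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * n, method="highs")
    return res.status == 0                          # status 0: a feasible point exists

# Example with a linear model m(x; beta) = x^T beta, so Z_i = X_i (Y_i - X_i^T beta).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
Y = X @ np.array([1.0, -0.5]) + rng.normal(size=50)
Z = X * (Y - X @ np.array([1.0, -0.5]))[:, None]
print(origin_in_convex_hull(Z))
```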
By introducing the Lagrange multipliers $\lambda_0 \in R$ and $\lambda_1 \in R^p$, the constrained optimization problem (3)-(5) can be translated into an unconstrained one with objective function
\[
T(p,\lambda_0,\lambda_1) = \sum_{i=1}^n \log(p_i) + \lambda_0\Big(\sum_{i=1}^n p_i - 1\Big) + \lambda_1^T \sum_{i=1}^n p_i\,\frac{\partial m(X_i;\beta)}{\partial\beta}\,\{Y_i - m(X_i;\beta)\}, \tag{6}
\]
where $p = (p_1,\ldots,p_n)^T$. Differentiating $T(p,\lambda_0,\lambda_1)$ with respect to each $p_i$ and setting the derivative to zero, it can be shown after some algebra that $\lambda_0 = -n$ and, by defining $\lambda = -n\lambda_1$, we find that the optimal $p_i$'s are given by
\[
p_i = \frac{1}{n}\,\frac{1}{1 + \lambda^T \frac{\partial m(X_i;\beta)}{\partial\beta}\{Y_i - m(X_i;\beta)\}},
\]
where, from the structural constraint (5), $\lambda$ satisfies
\[
\sum_{i=1}^n \frac{\frac{\partial m(X_i;\beta)}{\partial\beta}\{Y_i - m(X_i;\beta)\}}{1 + \lambda^T \frac{\partial m(X_i;\beta)}{\partial\beta}\{Y_i - m(X_i;\beta)\}} = 0. \tag{7}
\]
Substituting the optimal weights into the empirical likelihood in (3), we get
\[
L_n(\beta) = \prod_{i=1}^n \frac{1}{n}\,\frac{1}{1 + \lambda^T \frac{\partial m(X_i;\beta)}{\partial\beta}\{Y_i - m(X_i;\beta)\}}
\]
and the log empirical likelihood is
\[
\ell_n(\beta) = \log\{L_n(\beta)\} = -\sum_{i=1}^n \log\Big[1 + \lambda^T \frac{\partial m(X_i;\beta)}{\partial\beta}\{Y_i - m(X_i;\beta)\}\Big] - n\log(n). \tag{8}
\]

The computationally intensive nature of the empirical likelihood is clear from the above discussion. Indeed, to evaluate the EL at a given $\beta$, one needs to solve the non-linear equation (7) for $\lambda$, which depends on $\beta$. An alternative computational approach, as given in Owen (1990), is to translate the optimization problem (3)-(5) with respect to the EL weights $\{p_i\}_{i=1}^n$ into its dual problem with respect to $\lambda$.

The dual problem to (3)-(5) involves minimizing the objective function
\[
Q(\lambda) = -\sum_{i=1}^n \log\Big[1 + \lambda^T \frac{\partial m(X_i;\beta)}{\partial\beta}\{Y_i - m(X_i;\beta)\}\Big],
\]
which is the first term of the log empirical likelihood in (8), subject to
\[
1 + \lambda^T \frac{\partial m(X_i;\beta)}{\partial\beta}\{Y_i - m(X_i;\beta)\} \ge 1/n \quad\text{for each } i = 1,\ldots,n. \tag{9}
\]
The constraint (9) comes from $0 \le p_i \le 1$ for each $i$, whereas the gradient of $Q(\lambda)$ is minus the function on the left hand side of (7). Let
\[
D = \Big\{\lambda :\, 1 + \lambda^T \frac{\partial m(X_i;\beta)}{\partial\beta}\{Y_i - m(X_i;\beta)\} \ge 1/n \text{ for each } i = 1,\ldots,n\Big\}.
\]
Then the dual problem becomes the problem of minimizing $Q(\lambda)$ over the set $D$. It can be verified that $D$ is convex, closed and compact. Hence, there is a unique minimum within $D$. As suggested in Owen (1990), the constraint set $D$ can be removed by replacing the $\log(x)$ function in $Q(\lambda)$ by a function $\log^*(x)$ such that $\log^*(x) = \log(x)$ for $x \ge 1/n$ and $\log^*(x) = -n^2x^2/2 + 2nx - 3/2 - \log(n)$ for $x < 1/n$, which is the quadratic function that matches $\log(x)$ and its first two derivatives at $x = 1/n$.
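This dual formulation translates directly into a short numerical routine. The sketch below is illustrative (it follows the $\log^*$ device of Owen, 1990, but the function names and the choice of optimizer are assumptions, not prescriptions from the cited papers): for a given $\beta$, it computes the multiplier $\lambda$ by minimizing the modified dual objective and returns the log EL ratio $2\sum_i\log\{1+\lambda^T Z_i\}$ that appears in (11) below, with $Z_i = \frac{\partial m(X_i;\beta)}{\partial\beta}\{Y_i - m(X_i;\beta)\}$.

```python
import numpy as np
from scipy.optimize import minimize

def log_star(x, n):
    """log(x) for x >= 1/n; below 1/n, the quadratic that matches log and its
    first two derivatives at x = 1/n (the log* device of Owen, 1990)."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 1.0 / n,
                    np.log(np.maximum(x, 1.0 / n)),   # maximum() only guards the unused branch
                    -0.5 * n**2 * x**2 + 2.0 * n * x - 1.5 - np.log(n))

def el_log_ratio(Z):
    """Log EL ratio for estimating-function values Z (n x p array, i-th row Z_i):
    minimize the modified dual Q*(lambda) = -sum_i log*(1 + lambda' Z_i), then
    return r_n = 2 sum_i log*(1 + lambda' Z_i) together with lambda."""
    n, p = Z.shape
    q_star = lambda lam: -np.sum(log_star(1.0 + Z @ lam, n))
    lam = minimize(q_star, np.zeros(p), method="BFGS").x
    return 2.0 * np.sum(log_star(1.0 + Z @ lam, n)), lam
```

Because $-\sum_i \log^*(1+\lambda^T Z_i)$ is convex and twice continuously differentiable in $\lambda$, an off-the-shelf quasi-Newton minimizer suffices for this inner optimization.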
We note that the profile likelihood $\prod_{i=1}^n p_i$ achieves its maximum $n^{-n}$ when all the weights $p_i$ equal $n^{-1}$ for $i = 1,\ldots,n$. Thus, if there exists a $\beta$, say $\hat\beta$, which solves (7) with $\lambda = 0$, namely
\[
\sum_{i=1}^n \frac{\partial m(X_i;\beta)}{\partial\beta}\,\{Y_i - m(X_i;\beta)\} = 0, \tag{10}
\]
then the EL attains its maximum $L_n(\hat\beta) = n^{-n}$ at $\hat\beta$. In the parametric regression we are considering, the number of parameters and the number of equations in (10) are the same. Hence, (10) has a solution $\hat\beta$ with probability approaching one in large samples. There are inference situations where the number of estimating equations is larger than the number of parameters (strictly speaking, the dimension of the parameter space), for instance the Generalized Method of Moments in econometrics (Hansen, 1982). Here, more model information is accounted for by imposing more moment restrictions, leading to more estimating equations than the number of parameters in the model. In statistics, these arise in the form of extra model information. In these so-called over-identified situations, the maximum EL, still denoted $L_n(\hat\beta)$, may be different from $n^{-n}$. See Qin and Lawless (1994) for a discussion of this issue.

Following the convention of the standard parametric likelihood, we can define from (8) the log EL ratio
\[
r_n(\beta) = -2\log\{L_n(\beta)/L_n(\hat\beta)\} = 2\sum_{i=1}^n \log\Big[1 + \lambda^T \frac{\partial m(X_i;\beta)}{\partial\beta}\{Y_i - m(X_i;\beta)\}\Big]. \tag{11}
\]
Wilks' theorem (Wilks, 1938) is a key property of the parametric likelihood ratio. If we replace the EL $L_n(\beta)$ by the corresponding parametric likelihood, say $L_{pn}(\beta)$, and use $r_{pn}(\beta)$ to denote the parametric likelihood ratio, then, according to Wilks' theorem and under certain regularity conditions,
\[
r_{pn}(\beta_0) \xrightarrow{d} \chi^2_p \quad\text{as } n \to \infty. \tag{12}
\]
This property is maintained by the EL, as is demonstrated in Owen (1990) for the mean parameter, in Owen (1991) for linear regression, and in many other situations (Qin and Lawless, 1994; Molanes López, Van Keilegom and Veraverbeke, 2009). In the context of parametric regression,
\[
r_n(\beta_0) \xrightarrow{d} \chi^2_p \quad\text{as } n \to \infty. \tag{13}
\]
This can be viewed as a nonparametric version of Wilks' theorem, and it is quite remarkable that the empirical likelihood achieves such a property in a nonparametric setting with far weaker distributional assumptions. We call this sharing of Wilks' theorem the first order analogue between the parametric and the empirical likelihood.

To appreciate why the nonparametric version of Wilks' theorem is valid, we present a few steps of the derivation that offer some insight into the nonparametric likelihood. Typically, the first step in a study on EL is to consider an expansion for $\lambda$, defined in (7), at $\beta_0$, the true value of $\beta$, and to determine its order of magnitude. It can be shown that for the current parametric regression,
\[
\lambda = O_p(n^{-1/2}). \tag{14}
\]
Such a rate for $\lambda$ is obtained in the original papers of Owen (1988, 1990) for the mean parameter (which can be treated as a trivial case of regression without covariates), in Owen (1991) for linear regression, and in Qin and Lawless (1994) and Molanes López, Van Keilegom and Veraverbeke (2009) for the more general case of estimating equations.
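To illustrate the $\chi^2_p$ calibration in (13) numerically, the following sketch (the model, sample size and names are illustrative assumptions) simulates a linear regression, evaluates $r_n$ at the true coefficient vector using the el_log_ratio routine sketched earlier, and compares it with the 95% $\chi^2_p$ quantile.

```python
import numpy as np
from scipy.stats import chi2

# Simulated linear model Y = X beta0 + eps, so Z_i(beta) = X_i (Y_i - X_i^T beta).
rng = np.random.default_rng(1)
n, p = 200, 2
beta0 = np.array([1.0, -0.5])
X = rng.normal(size=(n, p))
Y = X @ beta0 + rng.normal(size=n)

Z0 = X * (Y - X @ beta0)[:, None]          # estimating-function values at beta0
r_n, _ = el_log_ratio(Z0)                  # from the earlier sketch
print(r_n, chi2.ppf(0.95, df=p))           # by (13), r_n exceeds the quantile
                                           # in roughly 5% of replications
```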
With (14), (7) can be inverted (see DiCiccio, Hall and Romano, 1989, for more details). To simplify the notation, define $Z_{ni} = \frac{\partial m(X_i;\beta_0)}{\partial\beta}\{Y_i - m(X_i;\beta_0)\}$. Then, (7) can be rewritten as
\[
n^{-1}\sum_{i=1}^n Z_{ni}\,(1 - \lambda^T Z_{ni}) + n^{-1}\sum_{i=1}^n \frac{Z_{ni}\,\lambda^T Z_{ni} Z_{ni}^T \lambda}{1 + \lambda^T Z_{ni}} = 0.
\]
The last term on the left hand side (LHS) is $O_p(n^{-1})$, which is negligible relative to the first term on the LHS. Therefore,
\[
\lambda = S_n^{-1}\, n^{-1}\sum_{i=1}^n Z_{ni} + o_p(n^{-1/2}),
\]
where $S_n = n^{-1}\sum_{i=1}^n Z_{ni}Z_{ni}^T$. Applying a Taylor expansion of $\log(\cdot)$ around 1, and substituting this expansion into the EL ratio $r_n(\beta_0)$ in (11), we have, for some $\gamma_i$ between $1$ and $1 + \lambda^T Z_{ni}$ ($i = 1,\ldots,n$):
\[
\begin{aligned}
r_n(\beta_0) &= 2\sum_{i=1}^n \log(1 + \lambda^T Z_{ni}) \\
&= 2\sum_{i=1}^n \Big\{\lambda^T Z_{ni} - \tfrac{1}{2}(\lambda^T Z_{ni})^2 + \tfrac{1}{3}(\lambda^T Z_{ni})^3\,\gamma_i^{-3}\Big\} \\
&= 2\lambda^T \sum_{i=1}^n Z_{ni} - \lambda^T \sum_{i=1}^n Z_{ni}Z_{ni}^T\,\lambda + O_p(n^{-1/2}) \\
&= \Big(n^{-1/2}\sum_{i=1}^n Z_{ni}\Big)^T S_n^{-1} \Big(n^{-1/2}\sum_{i=1}^n Z_{ni}\Big) + o_p(1),
\end{aligned} \tag{15}
\]
which leads to Wilks' theorem since $S_n \xrightarrow{p} \Sigma(\beta_0) = E\{Z_{ni}Z_{ni}^T\}$ and
\[
n^{-1/2}\sum_{i=1}^n Z_{ni} \xrightarrow{d} N(0, \Sigma(\beta_0)) \quad\text{as } n \to \infty.
\]

As is common practice with the parametric likelihood, the above nonparametric version of Wilks' theorem can be used to construct likelihood ratio confidence regions for $\beta_0$. An EL confidence region with a nominal level of confidence $1-\alpha$ is
\[
I_{1-\alpha} = \{\beta : r_n(\beta) \le \chi^2_{p,1-\alpha}\},
\]
where $\chi^2_{p,1-\alpha}$ is the $(1-\alpha)$-quantile of the $\chi^2_p$ distribution. Wilks' theorem in (13) ensures that $P\{\beta_0 \in I_{1-\alpha}\} \to 1-\alpha$ as $n \to \infty$. This construction mirrors the conventional likelihood ratio confidence regions, except that the EL ratio is employed here instead of the parametric likelihood ratio.
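Inverting the EL ratio to obtain $I_{1-\alpha}$ is straightforward numerically. The sketch below (a one-dimensional toy model on a grid; the model and settings are illustrative assumptions) collects all candidate values of $\beta$ whose EL ratio stays below the $\chi^2_1$ quantile, using el_log_ratio from the earlier sketch.

```python
import numpy as np
from scipy.stats import chi2

# Toy model m(x; beta) = beta * x (regression through the origin), so that
# Z_i(beta) = X_i (Y_i - beta X_i); the 95% EL region is an interval on the grid.
rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=n)
Y = 2.0 * X + rng.normal(size=n)

cutoff = chi2.ppf(0.95, df=1)
grid = np.linspace(1.5, 2.5, 401)
kept = [b for b in grid
        if el_log_ratio((X * (Y - b * X))[:, None])[0] <= cutoff]
print(min(kept), max(kept))     # approximate endpoints of I_{0.95}
```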
Note that equation (15) also shows that the EL method is (first order) asymptotically equivalent to the normal approximation method. However, the normal method requires the estimation of the variance $\Sigma(\beta_0)$, whereas the EL method does not require any explicit variance estimation. This is because the studentization is carried out internally via the optimization procedure.

In addition to the first order analogue between the parametric and the empirical likelihood, there is a second order analogue between them in the form of the Bartlett correction. The Bartlett correction is an elegant second order property of parametric likelihood ratios, which was conjectured and proposed in Bartlett (1937). It was formally established and studied in a series of papers including Lawley (1956), Hayakawa (1977), Barndorff-Nielsen and Cox (1984) and Barndorff-Nielsen and Hall (1988).

Let $w_i = \Sigma(\beta_0)^{-1/2}Z_{ni} = (w_i^1,\ldots,w_i^p)^T$ and, for $j_l \in \{1,\ldots,p\}$, $l = 1,\ldots,k$, define $\alpha^{j_1\cdots j_k} = E(w_i^{j_1}\cdots w_i^{j_k})$ as the $k$-th order multivariate cross moments of $w_i$. By assuming the existence of higher order moments of $Z_{ni}$, it may be shown, by developing Edgeworth expansions, that the distribution of the empirical likelihood ratio admits the following expansion:
\[
P\{r_n(\beta_0) \le \chi^2_{p,1-\alpha}\} = 1 - \alpha - a\,\chi^2_{p,1-\alpha}\, g_p(\chi^2_{p,1-\alpha})\, n^{-1} + O(n^{-3/2}), \tag{16}
\]
where $g_p$ is the density of the $\chi^2_p$ distribution, and
\[
a = p^{-1}\Big(\tfrac{1}{2}\sum_{j,m=1}^p \alpha^{jjmm} - \tfrac{1}{3}\sum_{j,k,m=1}^p \alpha^{jkm}\alpha^{jkm}\Big). \tag{17}
\]
This means that for the parametric regression both parametric and empirical likelihood ratio confidence regions $I_{1-\alpha}$ have coverage error of order $n^{-1}$. Part of the coverage error is due to the fact that the mean of $r_n(\beta_0)$ does not agree with $p$, the mean of $\chi^2_p$; that is, $E\{r_n(\beta_0)\} \ne p$, but rather
\[
E\{r_n(\beta_0)\} = p\,(1 + a n^{-1}) + O(n^{-2}),
\]
where $a$ has been given above.

The idea of the Bartlett correction is to adjust the EL ratio $r_n(\beta_0)$ to $r_n^*(\beta_0) = r_n(\beta_0)/(1 + an^{-1})$ so that $E\{r_n^*(\beta_0)\} = p + O(n^{-2})$. Remarkably, this simple adjustment to the mean improves (16) by one order of magnitude (DiCiccio, Hall and Romano, 1991; Chen, 1993; Chen and Cui, 2007), so that
\[
P\{r_n^*(\beta_0) \le \chi^2_{p,1-\alpha}\} = 1 - \alpha + O(n^{-2}). \tag{18}
\]

3 Nonparametric regression

Consider in this section the nonparametric regression model
\[
Y_i = m(X_i) + \varepsilon_i, \tag{19}
\]
where the regression function $m(x) = E(Y_i \mid X_i = x)$ is nonparametric, and $X_i$ is $d$-dimensional. We assume the regression can be heteroscedastic in that $\sigma^2(x) = \mathrm{Var}(Y_i \mid X_i = x)$, the conditional variance of $Y_i$ given $X_i = x$, may depend on $x$.

The kernel smoothing method is a popular method for estimating $m(x)$ nonparametrically; see Härdle (1990) and Fan and Gijbels (1996) for comprehensive overviews. Other nonparametric methods for estimating $m(x)$ include splines, orthogonal series
and wavelet methods. The simplest kernel regression estimator of $m(x)$ is the following Nadaraya-Watson estimator:
\[
\hat m(x) = \frac{\sum_{i=1}^n K_h(x - X_i)\,Y_i}{\sum_{i=1}^n K_h(x - X_i)}, \tag{20}
\]
where $K_h(t) = K(t/h)/h^d$, $K$ is a $d$-dimensional kernel function and $h$ is a bandwidth. The above kernel estimator can be obtained by minimizing the following locally weighted sum of squares:
\[
\sum_{i=1}^n K_h(x - X_i)\,\{Y_i - m(x)\}^2
\]
with respect to $m(x)$. It is effectively the solution of the following estimating equation:
\[
\sum_{i=1}^n K_h(x - X_i)\,\{Y_i - m(x)\} = 0. \tag{21}
\]

Under the nonparametric regression model, the unknown 'parameter' is the regression function $m(x)$ itself. The empirical likelihood for $m(x)$ at a fixed $x$ can be formulated in a fashion similar to the parametric regression setting considered in the previous section. Alternatively, since the empirical likelihood is applied to the kernel-weighted average $\sum_{i=1}^n K_h(x - X_i)\,m(x)$, it is also similar to the EL of a mean.

Let $p_1,\ldots,p_n$ be probability weights adding up to one. The empirical likelihood evaluated at $\theta(x)$, a candidate value of $m(x)$, is
\[
L_n\{\theta(x)\} = \max \prod_{i=1}^n p_i, \tag{22}
\]
where the maximization is subject to $\sum_{i=1}^n p_i = 1$ and
\[
\sum_{i=1}^n p_i\, K_h(x - X_i)\,\{Y_i - \theta(x)\} = 0. \tag{23}
\]

By comparing this formulation of the EL with that for the parametric regression, we see that the two formulations are largely similar, except that (23) is used as the structural constraint instead of (5). This comparison highlights the role played by the structural constraint in the EL formulation. Indeed, different structural constraints give rise to ELs for different 'parameters' (quantities of interest), just like different densities give rise to different parametric likelihoods. In general, the empirical likelihood is formulated for the parameters of interest via the structural constraints, whereas the parametric likelihood is fully based on a parametric model.

The algorithm for solving the above optimization problem (22)-(23) is similar to the EL algorithm for the parametric regression given under (4) and (5), except that it may be viewed as easier since the 'parameter' is one-dimensional, if we ignore the issue of bandwidth selection for nonparametric regression. By introducing Lagrange multipliers as we did in (6) in the previous section, we have that the optimal EL weights for the above optimization problem at $\theta(x)$ are given by
\[
p_i = \frac{1}{n}\,\frac{1}{1 + \lambda(x) K_h(x - X_i)\{Y_i - \theta(x)\}},
\]
where $\lambda(x)$ is a univariate Lagrange multiplier that satisfies
\[
\sum_{i=1}^n \frac{K_h(x - X_i)\,\{Y_i - \theta(x)\}}{1 + \lambda(x) K_h(x - X_i)\{Y_i - \theta(x)\}} = 0. \tag{24}
\]
Substituting the optimal weights into the empirical likelihood in (22), the empirical likelihood evaluated at $\theta(x)$ is
\[
L_n\{\theta(x)\} = \prod_{i=1}^n \frac{1}{n}\,\frac{1}{1 + \lambda(x) K_h(x - X_i)\{Y_i - \theta(x)\}}
\]
and the log empirical likelihood is
\[
\ell_n\{\theta(x)\} = \log[L_n\{\theta(x)\}] = -\sum_{i=1}^n \log\big[1 + \lambda(x) K_h(x - X_i)\{Y_i - \theta(x)\}\big] - n\log(n). \tag{25}
\]

The overall EL is maximized at $p_i = n^{-1}$, which corresponds to $\theta(x)$ being the Nadaraya-Watson estimator $\hat m(x)$ in (20). Hence, we can define the log EL ratio at $\theta(x)$ as
\[
r_n\{\theta(x)\} = -2\log[L_n\{\theta(x)\}/n^{-n}] = 2\sum_{i=1}^n \log\big[1 + \lambda(x) K_h(x - X_i)\{Y_i - \theta(x)\}\big]. \tag{26}
\]

The above EL is not actually for $m(x)$, the true underlying function value at $x$, but rather for $E\{\hat m(x)\}$. This can actually be seen from the form of the structural constraint (23). It is well known in kernel estimation that $\hat m(x)$ is not an unbiased estimator of $m(x)$, as is the case for almost all nonparametric estimators. For the Nadaraya-Watson estimator,
\[
E\{\hat m(x)\} = m(x) + b(x) + o(h^2),
\]
where $b(x) = \tfrac{1}{2}h^2\{m''(x) + 2m'(x)f'(x)/f(x)\}$ is the leading bias of the kernel estimator, and $f$ is the density of $X_i$. Hence, the EL is actually evaluated at a $\theta(x)$ that is a candidate value of $m(x) + b(x)$ rather than of $m(x)$. There are two strategies to reduce the effect of the bias (Hall, 1991). One is to undersmooth with a bandwidth $h = o(n^{-1/(4+d)})$, that is, of smaller order than the bandwidth that minimizes the mean squared error of estimation with a second order kernel ($d$ being the dimension of $X$). The other is to explicitly estimate the bias and then subtract it from the kernel estimate. We consider the first approach of undersmoothing here for reasons of simplicity.

When undersmoothing so that $n^{2/(4+d)}h^2 \to 0$, Wilks' theorem is valid for the EL under the current nonparametric regression in that
\[
r_n\{m(x)\} \xrightarrow{d} \chi^2_1 \quad\text{as } n \to \infty.
\]
This means that an empirical likelihood confidence interval with nominal coverage equal to $1-\alpha$, denoted by $I_{1-\alpha,el}$, is given by
\[
I_{1-\alpha,el} = \{\theta(x) : r_n\{\theta(x)\} \le \chi^2_{1,1-\alpha}\}. \tag{27}
\]
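A pointwise EL interval of the form (27) is simple to compute. The following sketch is illustrative (a one-dimensional Gaussian kernel, a grid search, and all names and settings are assumptions): it forms $Z_i = K_h(x - X_i)\{Y_i - \theta\}$ for candidate values $\theta$ and keeps those with $r_n\{\theta\} \le \chi^2_{1,1-\alpha}$, reusing el_log_ratio from the parametric section.

```python
import numpy as np
from scipy.stats import chi2

def gauss_kernel(u, h):
    """K_h(u) = K(u/h)/h for d = 1, with a standard Gaussian K."""
    return np.exp(-0.5 * (u / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)

def el_interval_at_x(x, X, Y, h, level=0.95, n_grid=401):
    """Pointwise EL confidence interval for m(x): grid values theta with
    r_n{theta} <= chi^2_{1,1-alpha}, where Z_i = K_h(x - X_i)(Y_i - theta)."""
    w = gauss_kernel(x - X, h)
    m_hat = np.sum(w * Y) / np.sum(w)            # Nadaraya-Watson estimate (20)
    cutoff = chi2.ppf(level, df=1)
    grid = np.linspace(m_hat - 2 * np.std(Y), m_hat + 2 * np.std(Y), n_grid)
    kept = [t for t in grid
            if el_log_ratio((w * (Y - t))[:, None])[0] <= cutoff]
    return min(kept), max(kept)

# Example with one covariate and a deliberately small (undersmoothed) bandwidth.
rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, size=300)
Y = np.sin(np.pi * X) + 0.3 * rng.normal(size=300)
print(el_interval_at_x(0.5, X, Y, h=0.1))
```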
A special feature of the empirical likelihood confidence interval is that no explicit variance estimator is required in its construction, as the studentization is carried out internally via the optimization procedure.

Define $\omega_i = K_h(x - X_i)\{Y_i - m(x)\}$ and, for positive integers $j$,
\[
\bar\omega_j = n^{-1}\sum_{i=1}^n \omega_i^j, \qquad \mu_j = E(\bar\omega_j) \qquad\text{and}\qquad R_j(K) = \int K^j(u)\,du.
\]
We note here that the bias in the kernel smoothing makes $\mu_1 = O(h^2)$, while in the parametric regression case $\mu_1 = 0$.

It is shown in Chen and Qin (2003) that the coverage probability of $I_{1-\alpha,el}$ admits the following Edgeworth expansion:
\[
\begin{aligned}
P\{m(x) \in I_{1-\alpha,el}\}
&= 1 - \alpha - \big\{nh^d\mu_1^2\mu_2^{-1} + \big(\tfrac{1}{2}\mu_2^{-2}\mu_4 - \tfrac{1}{3}\mu_2^{-3}\mu_3^2\big)(nh^d)^{-1}\big\}\, z_{1-\frac{\alpha}{2}}\,\phi(z_{1-\frac{\alpha}{2}}) \\
&\quad + O\{nh^{d+6} + h^4 + (nh^d)^{-1}h^2 + (nh^d)^{-2}\},
\end{aligned} \tag{28}
\]
where $\phi$ and $z_{1-\frac{\alpha}{2}}$ are the density and the $(1-\frac{\alpha}{2})$-quantile of a standard normal random variable.

The above expansion is non-standard in that the leading coverage error consists of two terms. The first term, $nh^d\mu_1^2\mu_2^{-1}$, of order $nh^{d+4}$, is due to the bias in the kernel smoothing. The second term, of order $(nh^d)^{-1}$, is largely similar to the leading coverage error for parametric regression in (16). We note that in the second term the effective sample size in the nonparametric estimation near $x$ is $nh^d$ instead of $n$, the effective sample size in the parametric regression.

The next question is whether the Bartlett correction is still valid under the nonparametric regression. The answer is yes. It may be shown that
\[
E[r_n\{m(x)\}] = 1 + (nh^d)^{-1}\gamma + o\{nh^{d+4} + (nh^d)^{-1}\},
\]
where
\[
\gamma = \mu_2^{-1}(nh^d\mu_1)^2 + \tfrac{1}{2}\mu_2^{-2}\mu_4 - \tfrac{1}{3}\mu_2^{-3}\mu_3^2. \tag{29}
\]
Note that $\gamma$ appears in the leading coverage error term in (28). Based on (28), and choosing $h = O(n^{-1/(d+2)})$, we have, with $c_\alpha = \chi^2_{1,1-\alpha}$,
\[
\begin{aligned}
P\big[r_n\{m(x)\} \le c_\alpha\{1 + \gamma(nh^d)^{-1}\}\big]
&= P\big[\chi^2_1 \le c_\alpha\{1 + \gamma(nh^d)^{-1}\}\big] \\
&\quad - (nh^d)^{-1}\gamma\, c_\alpha^{1/2}\{1 + \gamma(nh^d)^{-1}\}^{1/2}\,\phi\big[c_\alpha^{1/2}\{1 + \gamma(nh^d)^{-1}\}^{1/2}\big] + O\{(nh^d)^{-2}\} \\
&= P(\chi^2_1 \le c_\alpha) + (nh^d)^{-1}\gamma\, z_{1-\frac{\alpha}{2}}\,\phi(z_{1-\frac{\alpha}{2}}) - (nh^d)^{-1}\gamma\, z_{1-\frac{\alpha}{2}}\,\phi(z_{1-\frac{\alpha}{2}}) + O\{(nh^d)^{-2}\} \\
&= 1 - \alpha + O\big(n^{-\frac{4}{d+2}}\big).
\end{aligned} \tag{30}
\]
Therefore, the empirical likelihood is Bartlett correctable in the current context of nonparametric regression. In practice, the Bartlett factor $\gamma$ has to be estimated, say by a consistent estimator $\hat\gamma$. Chen and Qin (2003) gave more details on the practical implementation; see also Chen (1996) for an implementation in the case of density estimation.
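To convey how such a correction might be applied in practice, here is a rough plug-in sketch (it is not the estimator of Chen and Qin, 2003; the names and the simplifications are our own assumptions): it estimates the moments $\mu_j$ by the sample moments of $\hat\omega_i = K_h(x - X_i)\{Y_i - \hat m(x)\}$, drops the squared-bias term of (29) for simplicity, and inflates the $\chi^2_1$ cutoff used in (27) accordingly. It reuses gauss_kernel from the earlier sketch.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_corrected_cutoff(x, X, Y, h, m_hat, level=0.95):
    """Rough plug-in Bartlett adjustment of the chi^2_1 cutoff for the pointwise
    EL interval: gamma_hat is formed from sample moments of
    omega_i = K_h(x - X_i)(Y_i - m_hat), ignoring the squared-bias term of (29)
    (a simplification made here)."""
    omega = gauss_kernel(x - X, h) * (Y - m_hat)          # d = 1, so nh^d = n h
    mu2, mu3, mu4 = (np.mean(omega ** j) for j in (2, 3, 4))
    gamma_hat = 0.5 * mu4 / mu2**2 - mu3**2 / (3.0 * mu2**3)
    return chi2.ppf(level, df=1) * (1.0 + gamma_hat / (len(X) * h))
```

The Bartlett-corrected interval then simply collects all $\theta(x)$ whose EL ratio stays below this adjusted cutoff, exactly as in the construction following (27).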