1. Introduction

1.1 Why empirical likelihood

• nonparametric method: no need to assume the form of the underlying distribution
• likelihood-based inference: retains the advantages of likelihood methods
• an alternative when other (more conventional) methods are not applicable

Example 1. Somites of earthworms. Earthworms have segmented bodies. The segments are known as somites. As a worm grows, both the number and the length of its somites increase. The dataset contains the number of somites on each of 487 worms gathered near Ann Arbor in 1902. The histogram shows that the distribution is skewed to the left and has a heavier left tail.

Skewness: γ = E{(X − EX)³} / {Var(X)}^{3/2}, a measure of asymmetry.
Kurtosis: κ = E{(X − EX)⁴} / {Var(X)}² − 3, a measure of tail-heaviness.

Remark. (i) For N(µ, σ²), γ = 0 and κ = 0. (ii) For symmetric distributions, γ = 0. (iii) When κ > 0, the tails are heavier than those of N(µ, σ²).

Estimation of γ and κ. Let X̄ = n^{-1} Σ_{1≤i≤n} X_i and σ̂² = (n − 1)^{-1} Σ_{1≤i≤n} (X_i − X̄)². Then

    γ̂ = (1/(n σ̂³)) Σ_{i=1}^n (X_i − X̄)³,    κ̂ = (1/(n σ̂⁴)) Σ_{i=1}^n (X_i − X̄)⁴ − 3.

(A short R sketch of these estimators is given below.)

How do we find confidence sets for (γ, κ)? Answer: empirical likelihood contours. Let l(γ, κ) be the (log-)empirical likelihood function of (γ, κ). The confidence region for (γ, κ) is defined as {(γ, κ) : l(γ, κ) > C}, where C > 0 is a constant determined by the confidence level, i.e. P{l(γ, κ) > C} = 1 − α.
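To make the estimators concrete, here is a minimal R sketch (the function name skew.kurt is ours, and since the 1902 somite counts are not reproduced in these notes the example call uses simulated data):

    # Moment estimators of skewness (gamma) and excess kurtosis (kappa)
    skew.kurt <- function(x) {
      n <- length(x)
      xbar <- mean(x)
      sigma <- sd(x)                                 # (n - 1) divisor, as in the definition of sigma-hat
      gamma <- sum((x - xbar)^3) / (n * sigma^3)
      kappa <- sum((x - xbar)^4) / (n * sigma^4) - 3
      c(skewness = gamma, kurtosis = kappa)
    }

    # For a normal sample of the same size both estimates should be close to 0
    set.seed(1)
    skew.kurt(rnorm(487))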
[Figure: two panels. Left, "Raw data": histogram of the number of somites. Right, "E.L. Confidence Regions": empirical likelihood contours in the (somite skewness, kurtosis) plane.]
In the second panel, the empirical likelihood confidence regions (i.e. contours) correspond to confidence levels of 50%, 90%, 95%, 99%, 99.9% and 99.99%.

Note. (γ, κ) = (0, 0) is not contained in any of the confidence regions, pointing to a clear departure from normality.

Why do conventional methods not apply?

Parametric likelihood. The data are not normally distributed, and likelihood inference for higher moments is typically not robust with respect to a misspecified distribution.

Bootstrap. It is difficult to pick out a confidence region from a point cloud consisting of a large number of bootstrap estimates of (γ, κ). For example, given 1000 bootstrap estimates of (γ, κ), ideally a 95% confidence region should contain the 950 most central points. In practice we restrict attention to rectangular or elliptical regions in order to make the region easy to determine.

1.2 Introducing empirical likelihood

Let X = (X_1, ..., X_n)^τ be a random sample from an unknown distribution F(·). We know nothing about F(·). In practice we observe X_i = x_i (i = 1, ..., n), where x_1, ..., x_n are n known numbers.

Basic idea. Assume F is a discrete distribution on {x_1, ..., x_n} with

    p_i = F(x_i) (the probability mass placed on x_i),   i = 1, ..., n,

where p_i ≥ 0 and Σ_{i=1}^n p_i = 1.

What is the likelihood function of {p_i}? What is the MLE?

Since P{X_1 = x_1, ..., X_n = x_n} = p_1 ··· p_n, the likelihood is

    L(p_1, ..., p_n) ≡ L(p_1, ..., p_n; X) = Π_{i=1}^n p_i,

which is called an empirical likelihood.

Remark. The number of parameters is the same as the number of observations.

Note that, by the AM-GM inequality,

    (Π_{i=1}^n p_i)^{1/n} ≤ (1/n) Σ_{i=1}^n p_i = 1/n,

with equality iff p_1 = ··· = p_n = 1/n.
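A one-line numerical check of this inequality in R: any probability vector does no better than the uniform weights.

    # AM-GM check: prod(p) <= (1/n)^n for any probability vector p,
    # with equality only when every p_i equals 1/n
    set.seed(2)
    n <- 10
    p <- runif(n); p <- p / sum(p)     # a random probability vector
    c(prod(p), (1/n)^n)                # the first value is smaller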
Put p̂_i = 1/n. Then

    L(p_1, ..., p_n; X) ≤ L(p̂_1, ..., p̂_n; X)

for any p_i ≥ 0 with Σ_i p_i = 1. Hence the MLE based on the empirical likelihood, called the maximum empirical likelihood estimator (MELE), puts equal probability mass 1/n on each of the n observed values x_1, ..., x_n. In other words, the MELE of F is the uniform distribution on the observed data points. The corresponding distribution function

    F_n(x) = (1/n) Σ_{i=1}^n I(X_i ≤ x)

is called the empirical distribution of the sample X = (X_1, ..., X_n)^τ.

Example 2. Find the MELE of µ ≡ EX_1. Corresponding to the EL,

    µ = Σ_{i=1}^n p_i x_i = µ(p_1, ..., p_n).

Therefore the MELE of µ is

    µ̂ = µ(p̂_1, ..., p̂_n) = (1/n) Σ_{i=1}^n x_i = (1/n) Σ_{i=1}^n X_i = X̄.

Similarly, the MELE of µ_k ≡ E(X_1^k) is simply the k-th sample moment:

    µ̂_k = (1/n) Σ_{i=1}^n X_i^k.

Remarks. (i) MELEs without further constraints are simply the method-of-moments estimators, which is nothing new. (ii) Empirical likelihood is a powerful tool for testing hypotheses and constructing interval estimates in a nonparametric manner within the likelihood tradition; this involves evaluating MELEs under further constraints.

2. Empirical likelihood for means

Let X_1, ..., X_n be a random sample from an unknown distribution.

Goal: test hypotheses on µ ≡ EX_1, or find confidence intervals for µ.
Tool: empirical likelihood ratios (ELR).

2.1 Tests

Consider the hypotheses H_0: µ = µ_0 vs H_1: µ ≠ µ_0. Let L(p_1, ..., p_n) = Π_i p_i. We reject H_0 for large values of the ELR

    T = max L(p_1, ..., p_n) / max_{H_0} L(p_1, ..., p_n) = L(n^{-1}, ..., n^{-1}) / L(p̃_1, ..., p̃_n),

where {p̃_i} are the constrained MELEs of {p_i} under H_0.
Two problems:
(i) p̃_i = ?
(ii) What is the distribution of T under H_0?

(i) The constrained MELEs are p̃_i = p_i(µ_0), where {p_i(µ)} solve the maximisation problem

    max_{p_i} Σ_{i=1}^n log p_i   subject to   p_i ≥ 0,  Σ_{i=1}^n p_i = 1,  Σ_{i=1}^n p_i x_i = µ.

The solution is given in the theorem below. Note that

    x_(1) ≡ min_i x_i ≤ Σ_{i=1}^n p_i x_i ≤ max_i x_i ≡ x_(n),

so it is natural to require x_(1) ≤ µ ≤ x_(n).

Theorem 1. For µ ∈ (x_(1), x_(n)),

    p_i(µ) = 1 / {n − λ(x_i − µ)} > 0,   1 ≤ i ≤ n,   (1)

where λ is the unique solution of the equation

    Σ_{j=1}^n (x_j − µ) / {n − λ(x_j − µ)} = 0   (2)

in the interval (n/(x_(1) − µ), n/(x_(n) − µ)).

Proof. We use the Lagrange multiplier technique. Put

    Q = Σ_i log p_i + ψ (Σ_i p_i − 1) + λ (Σ_i p_i x_i − µ).

Setting the partial derivatives of Q with respect to p_i, ψ and λ equal to 0, we obtain

    p_i^{-1} + ψ + λ x_i = 0,   (3)
    Σ_i p_i = 1,   (4)
    Σ_i p_i x_i = µ.   (5)

By (3),

    p_i = −1/(ψ + λ x_i).   (6)

Hence 1 + ψ p_i + λ x_i p_i = 0; summing over i and using (4) and (5) gives n + ψ + λµ = 0, i.e. ψ = −(n + λµ). This together with (6) implies (1). By (1) and (5),

    Σ_i x_i / {n − λ(x_i − µ)} = µ.   (7)

It follows from (4) that

    µ = µ Σ_i p_i = Σ_i µ / {n − λ(x_i − µ)}.

This together with (7) implies (2).

Now let g(λ) be the function on the left-hand side of (2). Then

    ġ(λ) = Σ_i (x_i − µ)² / {n − λ(x_i − µ)}² > 0.

Hence g(λ) is a strictly increasing function. Note that

    lim_{λ ↑ n/(x_(n) − µ)} g(λ) = ∞,   lim_{λ ↓ n/(x_(1) − µ)} g(λ) = −∞.

Hence g(λ) = 0 has a unique solution in the interval (n/(x_(1) − µ), n/(x_(n) − µ)). Note that for any λ in this interval,

    1/{n − λ(x_(1) − µ)} > 0,   1/{n − λ(x_(n) − µ)} > 0,

and 1/{n − λ(x − µ)} is a monotonic function of x. It follows that p_i(µ) > 0 for all 1 ≤ i ≤ n.
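Theorem 1 translates directly into a small numerical routine: solve equation (2) for λ by a univariate root search over the stated interval, then substitute into (1). A minimal R sketch follows (the helper name el.weights is ours; the interval endpoints are shrunk slightly so that uniroot() is evaluated at finite values):

    # p_i(mu) from Theorem 1: p_i = 1 / {n - lambda * (x_i - mu)},
    # where lambda is the root of equation (2) in (n/(x_(1) - mu), n/(x_(n) - mu))
    el.weights <- function(x, mu, eps = 1e-8) {
      n <- length(x)
      stopifnot(mu > min(x), mu < max(x))    # Theorem 1 requires x_(1) < mu < x_(n)
      g  <- function(lam) sum((x - mu) / (n - lam * (x - mu)))
      lo <- n / (min(x) - mu)                # negative endpoint of the interval
      hi <- n / (max(x) - mu)                # positive endpoint of the interval
      lam <- uniroot(g, lower = lo * (1 - eps), upper = hi * (1 - eps))$root
      1 / (n - lam * (x - mu))
    }

    # Sanity check (cf. Remark (a) below): at mu = x-bar all weights equal 1/n
    x <- rnorm(20)
    range(el.weights(x, mean(x)))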
Remarks. (a) When µ = x̄ = X̄, λ = 0 and p_i(µ) = 1/n, i = 1, ..., n. It may be shown that, for µ close to E(X_i) and n large,

    p_i(µ) ≈ (1/n) · 1 / {1 + (x̄ − µ)(x_i − µ)/S(µ)},

where S(µ) = (1/n) Σ_i (x_i − µ)².

(b) We may view L(µ) = L{p_1(µ), ..., p_n(µ)} as a profile empirical likelihood for µ. Hypothetically, consider a one-to-one parameter transformation from {p_1, ..., p_n} to {µ, θ_1, ..., θ_{n−1}}. Then

    L(µ) = max_{θ_i} L(µ, θ_1, ..., θ_{n−1}) = L{µ, θ̂_1(µ), ..., θ̂_{n−1}(µ)}.

(c) The likelihood function L(µ) may be calculated using the R and S-Plus code available at http://www-stat.stanford.edu/~owen/empirical/.

(d) The asymptotic theorem for classical likelihood ratio tests (i.e. Wilks' theorem) still holds for ELR tests. Let X_1, ..., X_n be i.i.d. and µ = E(X_1). To test

    H_0: µ = µ_0 vs H_1: µ ≠ µ_0,

the ELR statistic is

    T = max L(p_1, ..., p_n) / max_{H_0} L(p_1, ..., p_n) = (1/n)^n / L(µ_0) = Π_{i=1}^n 1/{n p_i(µ_0)} = Π_{i=1}^n {1 − (λ/n)(X_i − µ_0)},

where λ is the unique solution of

    Σ_{j=1}^n (X_j − µ_0) / {n − λ(X_j − µ_0)} = 0.

Theorem 2. Let E(X_1²) < ∞. Then under H_0,

    2 log T = 2 Σ_{i=1}^n log{1 − (λ/n)(X_i − µ_0)} → χ²_1

in distribution as n → ∞.

A sketch proof. Under H_0, EX_i = µ_0, so µ_0 is close to X̄ for large n. Hence λ, or more precisely λ_n ≡ λ/n, is small; it is the solution of f(λ_n) = 0, where

    f(λ_n) = (1/n) Σ_{j=1}^n (X_j − µ_0) / {1 − λ_n(X_j − µ_0)}.

By a simple Taylor expansion, 0 = f(λ_n) ≈ f(0) + ḟ(0) λ_n, so

    λ_n ≈ −f(0)/ḟ(0) = −(X̄ − µ_0) / {(1/n) Σ_j (X_j − µ_0)²}.

Now

    2 log T ≈ 2 Σ_i {−λ_n(X_i − µ_0) − (λ_n²/2)(X_i − µ_0)²}
            = −2 λ_n n (X̄ − µ_0) − λ_n² Σ_i (X_i − µ_0)²
            ≈ n(X̄ − µ_0)² / {n^{-1} Σ_i (X_i − µ_0)²}.

By the LLN, n^{-1} Σ_i (X_i − µ_0)² → Var(X_1). By the CLT, √n(X̄ − µ_0) → N(0, Var(X_1)) in distribution. Hence 2 log T → χ²_1 in distribution.
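Given the weights p_i(µ_0), the test of Theorem 2 is a one-liner. A sketch reusing the el.weights() helper defined after Theorem 1 (both function names are ours, not part of Owen's el.S code):

    # ELR statistic 2 log T = -2 * sum(log(n * p_i(mu0))) and its chi-square(1) p-value
    el.2logT <- function(x, mu0) -2 * sum(log(length(x) * el.weights(x, mu0)))

    el.test.mean <- function(x, mu0) {
      stat <- el.2logT(x, mu0)
      c(statistic = stat, p.value = 1 - pchisq(stat, df = 1))
    }

    set.seed(3)
    x <- rexp(50)                 # a skewed sample whose true mean is 1
    el.test.mean(x, mu0 = 1)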
2.2 Confidence intervals for µ

For a given α ∈ (0, 1), since we do not reject the null hypothesis H_0: µ = µ_0 iff 2 log T < χ²_{1,1−α}, a (1 − α) confidence interval for µ is

    {µ : Σ_{i=1}^n log p_i(µ) > −0.5 χ²_{1,1−α} − n log n} = {µ : Σ_{i=1}^n log{n p_i(µ)} > −0.5 χ²_{1,1−α}}.

Example 3. Darwin's data: gains in height of plants from cross-fertilisation.

    X = height(cross-fertilised) − height(self-fertilised)

15 observations: 6.1, −8.4, 1.0, 2.0, 0.7, 2.9, 3.5, 5.1, 1.8, 3.6, 7.0, 3.0, 9.3, 7.5, −6.0.

The sample mean is X̄ = 2.61 and the sample standard deviation is s = 4.71. Is the gain significant? Intuitively yes, were it not for the two negative observations −8.4 and −6.0.

Let µ = EX_i. We test H_0: µ = 0 vs H_1: µ > 0.

(i) Standard approach: assume {X_1, ..., X_15} is a random sample from N(µ, σ²). MLE: µ̂ = X̄ = 2.61. The t-test statistic is T = √n X̄/s = 2.14. Since T ~ t(14) under H_0, the p-value is 0.06: significant but not overwhelming.

Is N(µ, σ²) an appropriate assumption? The data do not appear to be normal (they have a heavy left tail); see Fig. (a).

[Figure: (a) Normal plot: a QQ-plot of the N(0, 1) quantiles against the quantiles of the empirical distribution. (b) L_k likelihood: the profile likelihood l_k(µ) plotted against µ for k = 1 (solid), 2 (dashed), 4 (dotted) and 8 (dot-dashed).]
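For reference, the normal-theory calculation of part (i) in R; the data are the 15 differences listed above.

    # Darwin's data: sample mean, standard deviation and t statistic (Example 3(i))
    x <- c(6.1, -8.4, 1.0, 2.0, 0.7, 2.9, 3.5, 5.1, 1.8, 3.6,
           7.0, 3.0, 9.3, 7.5, -6.0)
    n <- length(x)
    c(mean = mean(x), sd = sd(x))            # about 2.61 and 4.71
    sqrt(n) * mean(x) / sd(x)                # t statistic, about 2.14
    # the p-value is then read off the t(14) reference distribution, e.g. via pt()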
(ii) Consider a generalised normal family

    f_k(x | µ, σ) = 2^{−1−1/k} / {Γ(1 + 1/k) σ} · exp{−(1/2) |(x − µ)/σ|^k},

which has mean µ; when k = 2 it is N(µ, σ²). To find the profile likelihood of µ, note that the 'MLE' of σ is

    σ̂^k ≡ σ̂(µ)^k = (k/(2n)) Σ_{i=1}^n |X_i − µ|^k.

Hence

    l_k(µ) = l_k(µ, σ̂) = −n log Γ(1 + 1/k) − n(1 + 1/k) log 2 − n log σ̂ − n/k.

Fig. (b) shows how the MLE µ̂ = µ̂(k) varies with k; in fact µ̂(k) increases as k decreases. If we fit the data with the k = 1 member of this family, the p-value of the test is 0.03, much more significant than under the assumption of a normal distribution.

(iii) The empirical likelihood ratio test statistic is 2 log T = 3.56, which rejects H_0 with a p-value of 0.04. The 95% confidence interval is

    {µ : Σ_{i=1}^{15} log p_i(µ) > −1.92 − 15 log(15)} = [0.17, 4.27].

The DE (double exponential, k = 1) density is of the form (1/(2σ)) e^{−|x−µ|/σ}. With µ fixed, the MLE of σ is n^{-1} Σ_i |X_i − µ|. Hence the parametric log (profile) likelihood is, up to an additive constant, −n log Σ_i |X_i − µ|.

[Figure: parametric log-likelihood (solid curve) based on the DE distribution and the empirical log-likelihood (dashed curve), plotted against µ; both curves are shifted vertically by their own maximum values.]

3. Empirical likelihood for random vectors

Let X_1, ..., X_n be i.i.d. random vectors with distribution F. As in the univariate case, we assume

    p_i = F(X_i) (the probability mass placed on X_i),   i = 1, ..., n,

where p_i ≥ 0 and Σ_i p_i = 1. The empirical likelihood is

    L(p_1, ..., p_n) = Π_{i=1}^n p_i.

Without any further constraints, the MELEs are p̂_i = 1/n, i = 1, ..., n.
3.1 EL for multivariate means

The profile empirical likelihood for µ = EX_1 is

    L(µ) = max{ Π_{i=1}^n p_i : p_i ≥ 0, Σ_{i=1}^n p_i = 1, Σ_{i=1}^n p_i X_i = µ } ≡ Π_{i=1}^n p_i(µ),

where p_i(µ) is the MELE of p_i under the additional constraint EX_1 = µ. Define the ELR

    T ≡ T(µ) = L(1/n, ..., 1/n) / L(µ) = 1 / Π_{i=1}^n {n p_i(µ)}.

Theorem 3. Let X_1, ..., X_n be d × 1 i.i.d. random vectors with mean µ and finite covariance matrix Σ with |Σ| ≠ 0. Then as n → ∞,

    2 log{T(µ)} = −2 Σ_{i=1}^n log{n p_i(µ)} → χ²_d

in distribution.

Remarks. (i) In the case |Σ| = 0, there exists an integer q < d such that X_1 − µ is concentrated, with probability 1, on a q-dimensional subspace; the limit in Theorem 3 is then χ²_q.

(ii) An α-level test of the hypotheses H_0: µ = µ_0 vs H_1: µ ≠ µ_0 rejects H_0 iff −2 Σ_{i=1}^n log{n p_i(µ_0)} > χ²_{d,1−α}.

(iii) A (1 − α) confidence region for µ is

    {µ : −2 Σ_{i=1}^n log{n p_i(µ)} ≤ χ²_{d,1−α}}.

(iv) Bootstrap calibration: the critical value χ²_{d,1−α} may be replaced by a bootstrap estimate, obtained as follows. Draw bootstrap samples X*_1, ..., X*_n from the uniform distribution on {X_1, ..., X_n} and compute the statistic 2 log{T*(X̄)}, with the sample mean X̄ playing the role of µ; repeating this B times yields T*_1, ..., T*_B, and the empirical (1 − α)-quantile of these B values is used as the critical value. We may draw X* from the uniform distribution on {X_1, ..., X_n} as follows: draw Z ~ U(0, 1) and define X* = X_i if Z ∈ [(i − 1)/n, i/n).
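A univariate sketch of the bootstrap calibration in remark (iv), reusing the el.weights() and el.2logT() helpers from Section 2 (the helper names are ours; resamples whose range does not contain x-bar are skipped for simplicity):

    # Bootstrap calibration of the EL critical value (d = 1 sketch).
    # Under the bootstrap, the sample mean xbar plays the role of the true mean mu.
    boot.critical <- function(x, B = 1000, alpha = 0.05) {
      xbar <- mean(x)
      Tstar <- replicate(B, {
        xs <- sample(x, replace = TRUE)     # X* drawn uniformly from {X_1, ..., X_n}
        if (min(xs) < xbar && max(xs) > xbar) el.2logT(xs, xbar) else NA
      })
      quantile(Tstar, 1 - alpha, na.rm = TRUE)   # compare with qchisq(1 - alpha, df = 1)
    }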
Computation of the profile EL L(µ). We restrict attention to µ in the set {Σ_{i=1}^n p_i X_i : p_i > 0, Σ_{i=1}^n p_i = 1}, i.e. the interior of the convex hull of the data; this ensures that solutions p_i(µ) > 0 exist.

We solve the problem in 3 steps:
1. Transform the constrained n-dimensional problem to a constrained d-dimensional problem.
2. Transform the constrained problem to an unconstrained problem.
3. Apply a Newton-Raphson algorithm.

Put

    l(µ) ≡ log L(µ) = Σ_{i=1}^n log p_i(µ) = max{ Σ_{i=1}^n log p_i : p_i ≥ 0, Σ_{i=1}^n p_i = 1, Σ_{i=1}^n p_i X_i = µ }.

Step 1: Similar to Theorem 1, the Lagrange multiplier method yields

    p_i(µ) = 1 / {n − λ^τ(X_i − µ)},   i = 1, ..., n,

where λ is the solution of

    Σ_{j=1}^n (X_j − µ) / {n − λ^τ(X_j − µ)} = 0.   (8)

Hence

    l(µ) = −Σ_{i=1}^n log{n − λ^τ(X_i − µ)} ≡ M(λ).   (9)

Note that ∂M(λ)/∂λ = 0 leads to (8), and

    ∂²M(λ)/∂λ∂λ^τ = Σ_{i=1}^n (X_i − µ)(X_i − µ)^τ / {n − λ^τ(X_i − µ)}² > 0.

Thus M(·) is a convex function on any connected set satisfying

    n − λ^τ(X_i − µ) > 0,   i = 1, ..., n.   (10)

Note. (10) and (8) together imply Σ_i p_i(µ) = 1.

The original n-dimensional optimisation problem is therefore equivalent to a d-dimensional problem: minimising M(λ) subject to the constraints (10).

Let H_λ be the set of all values of λ satisfying

    n − λ^τ(X_i − µ) > 1,   i = 1, ..., n.

Then H_λ is a convex set in R^d which contains the minimiser of the convex function M(λ) (see the Note above). Unfortunately, M(λ) is not defined on the sets

    {λ : n − λ^τ(X_i − µ) = 0},   i = 1, ..., n.

Step 2: We extend M(λ) outside the set H_λ so that it remains a convex function on the whole of R^d. Define

    log⋆(z) = log z for z ≥ 1,   log⋆(z) = −1.5 + 2z − 0.5z² for z < 1.

It is easy to see that log⋆(z) has two continuous derivatives on R. Put M⋆(λ) = −Σ_{i=1}^n log⋆{n − λ^τ(X_i − µ)}. Then
• M⋆(λ) = M(λ) on H_λ;
• M⋆(λ) is a convex function on the whole of R^d.

Hence M⋆(λ) and M(λ) share the same minimiser, which is the solution of (8).
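The Step 2 extension is straightforward to code. A sketch (the names log.star and M.star are ours; X is an n × d data matrix, mu and lambda are d-vectors):

    # Extended logarithm log_star and extended objective M_star(lambda)
    log.star <- function(z) {
      out <- -1.5 + 2 * z - 0.5 * z^2      # quadratic branch, z < 1
      ok <- z >= 1
      out[ok] <- log(z[ok])                # logarithmic branch, z >= 1
      out
    }

    M.star <- function(lambda, X, mu) {
      z <- nrow(X) - drop(sweep(X, 2, mu) %*% lambda)   # z_i = n - lambda'(X_i - mu)
      -sum(log.star(z))
    }

    # At lambda = 0, M_star equals -n log n regardless of mu
    X <- matrix(rnorm(20), 10, 2)
    M.star(c(0, 0), X, mu = colMeans(X))   # equals -10 * log(10)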
Step 3: We apply a Newton-Raphson algorithm to compute λ iteratively:

    λ_{k+1} = λ_k − {M̈⋆(λ_k)}^{-1} Ṁ⋆(λ_k),

where Ṁ⋆ and M̈⋆ denote the gradient vector and the Hessian matrix of M⋆. A convenient initial value is λ_0 = 0, corresponding to p_i = 1/n. (An R sketch combining this iteration with the Section 3.2 example is given at the end of Section 3.2.)

Remark. The S code "el.S", available from www-stat.stanford.edu/~owen/empirical, calculates the empirical likelihood ratio Σ_{i=1}^n log{n p_i(µ)} and other related quantities.

3.2 EL for smooth functions of means

Basic idea. Let Y_1, ..., Y_n be i.i.d. random variables with variance σ². Note that

    σ² = EY_1² − (EY_1)² = h(µ),

where µ = EX_1 and X_i = (Y_i, Y_i²)^τ. We may therefore deduce a confidence interval for σ² from one for µ.

Theorem 4. Let X_1, ..., X_n be d × 1 i.i.d. random vectors with mean µ_0 and |Var(X_1)| ≠ 0. Let θ = h(µ) be a smooth function from R^d to R^q (q ≤ d), and θ_0 = h(µ_0). Assume |G G^τ| ≠ 0, where G = ∂θ/∂µ^τ (evaluated at µ_0). For any r > 0, let

    C_{1,r} = {µ : Σ_{i=1}^n log{n p_i(µ)} ≥ −0.5 r}   and   C_{3,r} = {θ_0 + G(µ − µ_0) : µ ∈ C_{1,r}}.

Then as n → ∞,

    P{θ ∈ C_{3,r}} → P(χ²_q ≤ r).

Remarks. (i) The idea of bootstrap calibration may be applied here too.

(ii) Under additional conditions,

    P{θ ∈ C_{2,r}} → P(χ²_q ≤ r),   where   C_{2,r} = {h(µ) : µ ∈ C_{1,r}}.

C_{2,r} is a practically feasible confidence set, while C_{3,r} is not, since µ_0 and θ_0 are unknown in practice. Note that for µ close to µ_0, θ_0 + G(µ − µ_0) ≈ h(µ).

(iii) In general, P{µ ∈ C_{1,r}} ≤ P{θ ∈ C_{2,r}}. (By Theorem 3, P{µ ∈ C_{1,r}} → P(χ²_d ≤ r).)
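To tie Steps 1-3 and Section 3.2 together, here is a self-contained R sketch (all function names are ours, and the Newton step is the plain iteration of Step 3, without the step-halving a production implementation would add). It returns l(µ) = Σ_i log{n p_i(µ)}, which can be compared with −0.5 χ²_{d,1−α} to trace out C_{1,r}; evaluating h(·) over that set then gives C_{2,r}.

    # Derivatives of log_star: only these are needed for the Newton-Raphson step
    dlog.star  <- function(z) ifelse(z >= 1, 1 / z, 2 - z)
    ddlog.star <- function(z) ifelse(z >= 1, -1 / z^2, -1)

    # Profile empirical log-likelihood ratio l(mu) = sum_i log{n * p_i(mu)},
    # obtained by minimising M_star(lambda) with Newton-Raphson (Step 3)
    el.loglik <- function(X, mu, maxit = 50, tol = 1e-10) {
      n <- nrow(X)
      Xc <- sweep(X, 2, mu)                       # rows are (X_i - mu)'
      lambda <- rep(0, ncol(X))                   # lambda_0 = 0, i.e. p_i = 1/n
      for (k in 1:maxit) {
        z <- n - drop(Xc %*% lambda)              # z_i = n - lambda'(X_i - mu)
        grad <- colSums(dlog.star(z) * Xc)        # gradient of M_star
        hess <- -t(Xc) %*% (ddlog.star(z) * Xc)   # Hessian of M_star (positive definite)
        step <- solve(hess, grad)
        lambda <- lambda - step
        if (sum(step^2) < tol) break
      }
      p <- 1 / (n - drop(Xc %*% lambda))          # p_i(mu)
      sum(log(n * p))                             # compare with -0.5 * qchisq(1 - alpha, d)
    }

    # Section 3.2 example: sigma^2 = E(Y^2) - (EY)^2 = h(mu) with X_i = (Y_i, Y_i^2)
    set.seed(4)
    y <- rnorm(100, sd = 2)
    X <- cbind(y, y^2)
    el.loglik(X, mu = c(mean(y), mean(y^2)))      # equals 0 at the sample mean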