An Inconsistent Maximum Likelihood Estimate

THOMAS S. FERGUSON*

An example is given of a family of distributions on [−1, 1] with a continuous one-dimensional parameterization that joins the triangular distribution (when θ = 0) to the uniform (when θ = 1), for which the maximum likelihood estimates exist and converge strongly to θ = 1 as the sample size tends to infinity, whatever be the true value of the parameter. A modification that satisfies Cramér's conditions is also given.

KEY WORDS: Maximum likelihood estimates; Inconsistency; Asymptotic efficiency; Mixtures.

* Thomas S. Ferguson is Professor, Department of Mathematics, University of California, Los Angeles, CA 90024. Research was supported in part by the National Science Foundation under Grant MCS77-2121. The author wishes to acknowledge the help of an exceptionally good referee whose very detailed comments benefited this article substantially.

Journal of the American Statistical Association, December 1982, Volume 77, Number 380, Theory and Methods Section.

1. INTRODUCTION

There are many examples in the literature of estimation problems for which the maximum likelihood principle does not yield a consistent sequence of estimates, notably Neyman and Scott (1948), Basu (1955), Kraft and LeCam (1956), and Bahadur (1958). In this article a very simple example of inconsistency of the maximum likelihood method is presented that shows clearly one danger to be wary of in an otherwise regular-looking situation. A recent article by Berkson (1980), followed by a lively discussion, shows that there is still interest in these problems.

The discussion in this article is centered on a sequence of independent, identically distributed, and, for the sake of convenience, real random variables, X_1, X_2, ..., distributed according to a distribution, F(x | θ), for some θ in a fixed parameter space Θ. It is assumed that there is a σ-finite measure with respect to which densities, f(x | θ), exist for all θ ∈ Θ. The maximum likelihood estimate of θ based on X_1, ..., X_n is a value, θ̂_n(x_1, ..., x_n), of θ ∈ Θ, if any, that maximizes the likelihood function

    L_n(θ) = ∏_{i=1}^n f(x_i | θ).

The maximum likelihood method of estimation goes back to Gauss, Edgeworth, and Fisher. For historical points, see LeCam (1953) and Edwards (1972). For a general survey of the area and a large bibliography, see Norton (1972).

The starting point of our discussion is the theorem of Cramér (1946, p. 500), which states that under certain regularity conditions on the densities involved, if θ is real valued and if the true value θ_0 is an interior point of Θ, then there exists a sequence of roots, θ̂_n, of the likelihood equation,

    (∂/∂θ) log L_n(θ) = 0,

that converges in probability to θ_0 as n → ∞. Moreover, any such sequence θ̂_n is asymptotically normal and asymptotically efficient. It is known that Cramér's theorem extends to the multiparameter case.

To emphasize the point that this is a local result and may have nothing to do with maximum likelihood estimation, we consider the following well-known example, a special case of some quite practical problems mentioned recently by Quandt and Ramsey (1978). Let the density f(x | θ) be a mixture of two normals, N(0, 1) and N(μ, σ²), with mixing parameter ½,

    f(x | μ, σ) = ½ φ(x) + ½ φ((x − μ)/σ)/σ,

where φ is the density of the standard normal distribution, and the parameter space is Θ = {(μ, σ): σ > 0}. It is clear that for any given sample, X_1, ..., X_n, from this density the likelihood function can be made as large as desired by taking μ = X_1, say, and σ sufficiently small. Nevertheless, Cramér's conditions are satisfied, and so there exists a consistent, asymptotically efficient sequence of roots of the likelihood equation even though maximum likelihood estimates do not exist.

A more disturbing example is given by Kraft and LeCam (1956), in which Cramér's conditions are satisfied, the maximum likelihood estimate exists, is unique, and satisfies the likelihood equation, but is not consistent. In such examples, it is possible to find the asymptotically efficient sequence of roots of the likelihood equation by first finding a consistent estimate and then finding the closest root, or by improving the estimate by the method of scoring as in Rao (1965). See Lehmann (1980) for a discussion of these problems.

Other, more practical examples of inconsistency in the maximum likelihood method involve an infinite number of parameters. Neyman and Scott (1948) show that the maximum likelihood estimate of the common variance of a sequence of normal populations with unknown means, based on a fixed sample size k taken from each population, converges to a value lower than the true value as the number of populations tends to infinity. This example led directly to the paper of Kiefer and Wolfowitz (1956) on the consistency and efficiency of the maximum likelihood estimates with infinitely many nuisance parameters.
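The degeneracy in the normal-mixture example above is easy to exhibit numerically: with μ fixed at an observation, the term φ(0)/(2σ) makes the likelihood blow up as σ → 0. A minimal sketch (the fixed sample and the grid of σ values are illustrative, not from the article):

```python
import math

def mixture_loglik(xs, mu, sigma):
    """Log-likelihood of the equal-weight mixture of N(0, 1) and N(mu, sigma^2)."""
    phi = lambda z: math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)
    return sum(math.log(0.5 * phi(x) + 0.5 * phi((x - mu) / sigma) / sigma)
               for x in xs)

# An arbitrary fixed sample; any sample exhibits the same degeneracy.
xs = [-1.5, -0.8, -0.3, 0.0, 0.4, 0.9, 1.7]

# Put mu at the first observation and shrink sigma: the spike term
# phi(0)/(2*sigma) at x = mu grows without bound, and so does the likelihood.
lls = [mixture_loglik(xs, xs[0], s) for s in (1e-2, 1e-4, 1e-6, 1e-8)]
print(lls)  # strictly increasing, unbounded as sigma -> 0
```

Each factor of one degenerate term grows like 1/σ while every other factor stays bounded below, so no maximizer over the full parameter space exists.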
Another example, mentioned in Barlow et al. (1972), involves estimating a distribution known to be star-shaped (i.e., F(λx) ≤ λF(x) for all 0 ≤ λ ≤ 1 and all x such that F(x) < 1); there too the maximum likelihood estimate is inconsistent.

In Section 3, the example of Section 2 is modified so that Cramér's conditions are satisfied. This gives an example in which asymptotically efficient estimates exist and may be found by improving any convenient √n-consistent estimate by scoring, and yet the maximum likelihood estimate exists and eventually satisfies the likelihood equation but converges to a fixed point with probability 1 no matter what the true value of the parameter happens to be. Such an example was announced by LeCam in the discussion of Berkson's (1980) paper.

2. THE EXAMPLE

The following densities on [−1, 1] provide a continuous parameterization between the triangular distribution (when θ = 0) and the uniform distribution (when θ = 1):

    f(x | θ) = θ/2 + ((1 − θ)/δ(θ)) [1 − |x − θ|/δ(θ)]⁺,  −1 ≤ x ≤ 1,

where [t]⁺ = max(t, 0) and δ(θ) satisfies

1. δ(θ) is continuous in θ,
2. δ(0) = 1,
3. 0 < δ(θ) ≤ 1 − θ for 0 < θ < 1, and
4. δ(θ) → 0 as θ → 1.

This leads to inconsistency provided δ(θ) → 0 sufficiently fast as θ → 1, since then θ̂_n will eventually be greater than a for any preassigned a < 1. Let M_n = max{X_1, ..., X_n}. Then M_n → 1 with probability one whatever be the true value of θ, and since 0 < M_n < 1 with probability one,

    max_{0≤θ≤1} (1/n) l_n(θ) ≥ (1/n) l_n(M_n)
        ≥ ((n − 1)/n) log(M_n/2) + (1/n) log((1 − M_n)/δ(M_n)),

where l_n(θ) = log L_n(θ): at θ = M_n, each observation contributes at least the uniform part, M_n/2, and the observation at M_n itself contributes at least the triangular peak, (1 − M_n)/δ(M_n).
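The lower bound above can be evaluated by simulation. The sketch below samples from the triangular density (true θ = 0) and compares the bound on (1/n) l_n(M_n) with the normalized log-likelihood at the truth, using the concrete rate δ(θ) = (1 − θ)exp{1 − (1 − θ)⁻⁴} introduced later in the article; sample size and seed are illustrative:

```python
import math
import random

def sample_triangular(n, rng):
    """Draw n variates from the triangular density 1 - |x| on [-1, 1] (theta = 0)."""
    out = []
    for _ in range(n):
        u = rng.random()
        out.append(-1.0 + math.sqrt(2.0 * u) if u < 0.5
                   else 1.0 - math.sqrt(2.0 * (1.0 - u)))
    return out

rng = random.Random(1)
n = 1000
xs = sample_triangular(n, rng)
mn = max(xs)

# Normalized log-likelihood at the true value theta = 0, where f(x | 0) = 1 - |x|.
ll_true = sum(math.log(1.0 - abs(x)) for x in xs) / n

# Lower bound on (1/n) l_n(Mn): with delta(t) = (1 - t)*exp(1 - (1 - t)**-4),
# log((1 - Mn)/delta(Mn)) simplifies to (1 - Mn)**-4 - 1.
bound = ((n - 1) / n) * math.log(mn / 2.0) + ((1.0 - mn) ** -4 - 1.0) / n

print(mn, ll_true, bound)  # the bound at theta = Mn typically dwarfs ll_true
```

Working on the log scale avoids evaluating δ(M_n) itself, which underflows to zero in floating point once M_n is close to 1.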
Therefore, with probability one,

    lim inf_n max_{0≤θ≤1} (1/n) l_n(θ) ≥ −log 2 + lim inf_n (1/n) log((1 − M_n)/δ(M_n)).

Whatever be the value of θ, M_n converges to 1 at a certain rate, the slowest rate being for the triangular (θ = 0), since this distribution has smaller mass than any of the others in sufficiently small neighborhoods of 1. Thus we can choose δ(θ) → 0 so fast as θ → 1 that (1/n) log((1 − M_n)/δ(M_n)) → ∞ with probability one for the triangular, and hence for all other possible true values of θ, completing the proof.

How fast is fast enough? Take θ = 0 and note that for 0 < ε < 1, P_0(M_n ≤ 1 − ε) = P_0(X ≤ 1 − ε)^n = (1 − ε²/2)^n. Hence

    Σ_n P_0(M_n ≤ 1 − n^{−3/8}) ≤ Σ_n exp(−n^{1/4}/2) < ∞,

so that by the Borel-Cantelli lemma, with probability one, 1 − M_n < n^{−3/8} for all sufficiently large n. Therefore, the choice

    δ(θ) = (1 − θ) exp{1 − (1 − θ)^{−4}}

gives a δ(θ) that is continuous, decreasing, with δ(0) = 1 and δ(θ) → 0 as θ → 1, and for which

    (1/n) log((1 − M_n)/δ(M_n)) = ((1 − M_n)^{−4} − 1)/n ≥ (n^{3/2} − 1)/n → ∞

with probability one. Consistent estimates of θ do exist; for example, the minimum distance method of Wolfowitz (1957) provides one.

If one simple condition were added to conditions 1 through 4, the argument of Wald (1949) would imply the strong consistency of the maximum likelihood estimates. This is a uniform boundedness condition that may be stated as follows: Let θ_0 denote the true value of the parameter. Then the maximum likelihood estimate θ̂_n converges to θ_0 with probability one provided conditions 1 through 4 hold and

5. there is a function K(x) ≥ 0 with finite expectation, E_{θ_0} K(X) = ∫ K(x) f(x | θ_0) dx < ∞, such that f(x | θ) ≤ K(x) for all θ and all x.

Condition 5 fails in the example, but it would hold if θ were restricted to an interval [0, 1 − ε], since the density would then be bounded.

3. A DIFFERENTIABLE MODIFICATION

To modify the example so that Cramér's conditions are satisfied, the triangular densities are replaced by beta densities,

    g(x | α, β) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1},  0 ≤ x ≤ 1.

Thus we take Θ = [1/2, 1], α(θ) = θ δ(θ), β(θ) = (1 − θ) δ(θ), and f(x | θ) = g(x | α(θ), β(θ)) for 1/2 ≤ θ < 1. The particular form of δ(θ) is not important. What is important is that

1. δ(θ) is twice continuously differentiable,
2. (1 − θ) δ(θ), and hence δ(θ), is increasing on [1/2, 1),
3. δ(1/2) > 2 (to obtain identifiability), and
4. δ(θ) tends to ∞ sufficiently fast as θ → 1.

For θ = 1, f(x | 1) is defined to be g(x | 1, 1). Then f(x | θ) is continuous in θ ∈ [1/2, 1] for each x, and for the true θ_0 ∈ (1/2, 1), Cramér's conditions are satisfied.

The proof that every maximum likelihood sequence converges to 1 with probability one as n → ∞, no matter what the true value of θ ∈ [1/2, 1], is completely analogous to the corresponding proof in Section 2, except that in the inequalities, Stirling's formula in the form

    √(2π) a^{a − 1/2} e^{−a} ≤ Γ(a) ≤ √(2π) a^{a − 1/2} exp(−a + 1/(12a)),

as in Feller (1950, p. 44), is useful.
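Stirling's two-sided bound for Γ used in these inequalities, valid for all a > 0, is easy to check numerically with the standard library's Γ (the grid of test points is illustrative):

```python
import math

# Two-sided Stirling bound:
#   sqrt(2*pi) * a**(a - 0.5) * exp(-a) <= Gamma(a)
#                                       <= sqrt(2*pi) * a**(a - 0.5) * exp(-a + 1/(12*a))
checks = []
for a in (0.5, 1.0, 2.5, 10.0, 100.0):
    lower = math.sqrt(2.0 * math.pi) * a ** (a - 0.5) * math.exp(-a)
    upper = lower * math.exp(1.0 / (12.0 * a))
    checks.append(lower <= math.gamma(a) <= upper)
print(checks)  # [True, True, True, True, True]
```

The bound is tight: already at a = 10 the two sides bracket Γ(10) = 362880 to within about one part in a thousand.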
In this example, the slowest rate of convergence of max_{i≤n} X_i to 1 occurs for θ = 1/2. By the method of Section 2, it may be calculated that the function

    δ(θ) = (1 − θ)^{−1} exp{(1 − θ)^{−2}}

converges to ∞ sufficiently fast and satisfies conditions 1 to 4 of this section.

[Received October 1980. Revised April 1982.]

REFERENCES

BAHADUR, R.R. (1958), "Examples of Inconsistency of Maximum Likelihood Estimates," Sankhya, 20, 207-210.
BAHADUR, R.R. (1967), "Rates of Convergence of Estimates and Test Statistics," Annals of Mathematical Statistics, 38, 303-324.
BAHADUR, R.R. (1971), Some Limit Theorems in Statistics, Regional Conference Series in Applied Mathematics, 4, Philadelphia: SIAM.
BARLOW, R.E., BARTHOLOMEW, D.J., BREMNER, J.M., and BRUNK, H.D. (1972), Statistical Inference Under Order Restrictions, New York: John Wiley.
BASU, D. (1955), "An Inconsistency of the Method of Maximum Likelihood," Annals of Mathematical Statistics, 26, 144-145.
BERKSON, J. (1980), "Minimum Chi-Square, not Maximum Likelihood!" Annals of Statistics, 8, 457-487.
CRAMER, H. (1946), Mathematical Methods of Statistics, Princeton: Princeton University Press.
DOOB, J. (1948), "Application of the Theory of Martingales," Le Calcul des Probabilités et ses Applications, Colloques Internationaux du Centre National de la Recherche Scientifique, Paris, 23-28.
EDWARDS, A.W.F. (1972), Likelihood, Cambridge: Cambridge University Press.
FELLER, W. (1950), An Introduction to Probability Theory and Its Applications (Vol. 1, 1st ed.), New York: John Wiley.
KIEFER, J., and WOLFOWITZ, J. (1956), "Consistency of the Maximum Likelihood Estimator in the Presence of Infinitely Many Incidental Parameters," Annals of Mathematical Statistics, 27, 887-906.
KRAFT, C.H., and LECAM, L.M. (1956), "A Remark on the Roots of the Maximum Likelihood Equation," Annals of Mathematical Statistics, 27, 1174-1177.
LECAM, L.M. (1953), "On Some Asymptotic Properties of Maximum Likelihood Estimates and Related Bayes Estimates," University of California Publications in Statistics, 1, 277-328.
LEHMANN, E.L. (1980), "Efficient Likelihood Estimators," The American Statistician, 34, 233-235.
NEYMAN, J., and SCOTT, E. (1948), "Consistent Estimates Based on Partially Consistent Observations," Econometrica, 16, 1-32.
NORTON, R.H. (1972), "A Survey of Maximum Likelihood Estimation," Review of the International Statistical Institute, 40, 329-354, and Part II (1973), 41, 39-58.
PERLMAN, M.D. (1972), "On the Strong Consistency of Approximate Maximum Likelihood Estimates," Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, 1, 263-281.
QUANDT, R.E., and RAMSEY, J.B. (1978), "Estimating Mixtures of Normal Distributions and Switching Regressions," Journal of the American Statistical Association, 73, 730-738.
RAO, C.R. (1965), Linear Statistical Inference and Its Applications, New York: John Wiley.
WALD, A. (1949), "Note on the Consistency of the Maximum Likelihood Estimate," Annals of Mathematical Statistics, 20, 595-601.
WOLFOWITZ, J. (1957), "The Minimum Distance Method," Annals of Mathematical Statistics, 28, 75-88.