International Statistical Review (1990), 58, 2, pp. 153-171. Printed in Great Britain
© International Statistical Institute

Maximum Likelihood: An Introduction

L. Le Cam
Department of Statistics, University of California, Berkeley, California 94720, USA

Summary

Maximum likelihood estimates are reported to be best under all circumstances. Yet there are numerous simple examples where they plainly misbehave. One gives some examples for problems that had not been invented for the purpose of annoying maximum likelihood fans. Another example, imitated from Bahadur, has been specially created with just such a purpose in mind. Next, we present a list of principles leading to the construction of good estimates. The main principle says that one should not believe in principles but study each problem for its own sake.

Key words: Estimation; Maximum likelihood; One-step approximations.

1 Introduction

One of the most widely used methods of statistical estimation is that of maximum likelihood. Opinions on who was the first to propose the method differ. However Fisher is usually credited with the invention of the name 'maximum likelihood', with a major effort intended to spread its use and with the derivation of the optimality properties of the resulting estimates.

Qualms about the general validity of the optimality properties have been expressed occasionally. However as late as 1970 L.J. Savage could imply in his 'Fisher lecture' that the difficulties arising in some examples would have rightly been considered 'mathematical caviling' by R.A. Fisher.

Of course nobody has been able to prove that maximum likelihood estimates are 'best' under all circumstances. The lack of any such proof is not sufficient by itself to invalidate Fisher's claims. It might simply mean that we have not yet translated into mathematics the basic principles which underlay Fisher's intuition.

The present author has, unwittingly, contributed to the confusion by writing two papers which have been interpreted by some as attempts to substantiate Fisher's claims.

To clarify the situation we present a few known facts which should be kept in mind as one proceeds along through the various proofs of consistency, asymptotic normality or asymptotic optimality of maximum likelihood estimates.

The examples given here deal mostly with the case of independent identically distributed observations. They are intended to show that maximum likelihood does possess disquieting features which rule out the possibility of existence of undiscovered underlying principles which could be used to justify it. One of the very gross forms of misbehavior can be stated as follows.

Maximum likelihood estimates computed with all the information available may turn out to be inconsistent. Throwing away a substantial part of the information may render them consistent.

The examples show that, in spite of all its presumed virtues, the maximum likelihood procedure cannot be universally recommended. This does not mean that we advocate
some other principle instead, although we give a few guidelines in § 6. For other views see the discussion of the paper by Berkson (1980).

This paper is adapted from lectures given at the University of Maryland, College Park, in the Fall of 1975. We are greatly indebted to Professor Grace L. Yang for the invitation to give the lectures and for the permission to reproduce them.

2 A Few Old Examples

Let X_1, X_2, ..., X_n be independent identically distributed observations with values in some space {X, A}. Suppose that there is a σ-finite measure μ on A and that the distribution P_θ of X_j has a density f(x, θ) with respect to μ. The parameter θ takes its values in some set Θ.

For n observations x_1, x_2, ..., x_n the maximum likelihood estimate is any value θ̂ such that

  \prod_{j=1}^{n} f(x_j, \hat{\theta}) = \sup_{\theta \in \Theta} \prod_{j=1}^{n} f(x_j, \theta).

Note that such a θ̂ need not exist, and that, when it does, it usually depends on what version of the densities f(x, θ) was selected. A function (x_1, ..., x_n) → θ̂(x_1, ..., x_n) selecting a value θ̂ for each n-tuple (x_1, ..., x_n) may or may not be measurable.

However all of this is not too depressing. Let us consider some examples.

Example 1. (This may be due to Kiefer and Wolfowitz or to whoever first looked at mixtures of Normal distributions.) Let α be the number α = 10^{-10^{17}}. Let θ = (μ, σ), μ ∈ (-∞, +∞), σ > 0. Let f_1(x, θ) be the density defined with respect to Lebesgue measure λ on the line by

  f_1(x, \theta) = \frac{1-\alpha}{\sqrt{2\pi}} \exp\left\{-\frac{(x-\mu)^2}{2}\right\} + \frac{\alpha}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}.

Then, for (x_1, ..., x_n) one can take μ = x_1 and note that

  \sup_{\sigma} \prod_{j=1}^{n} f_1(x_j; \mu, \sigma) = \infty.

If σ = 0 was allowed one could claim that θ̂ = (x_1, 0) is maximum likelihood.
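The blow-up in Example 1 is easy to see numerically. The sketch below is not part of the paper; the sample, the seed and the value α = 10⁻¹⁰ are arbitrary choices (the α of the example itself is far below machine precision), but any positive contamination weight shows the same behaviour.

```python
# Numerical illustration of Example 1 (not part of the original paper): with mu fixed
# at x_1, the mixture log-likelihood grows without bound as sigma -> 0.
import numpy as np

rng = np.random.default_rng(0)
alpha = 1e-10                                  # any tiny positive contamination weight
x = rng.normal(loc=2.0, scale=1.0, size=20)    # hypothetical sample
mu = x[0]

def log_lik(sigma):
    main = (1 - alpha) / np.sqrt(2 * np.pi) * np.exp(-0.5 * (x - mu) ** 2)
    spike = alpha / (sigma * np.sqrt(2 * np.pi)) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return np.sum(np.log(main + spike))

for sigma in [1.0, 1e-2, 1e-4, 1e-8, 1e-16]:
    print(f"sigma = {sigma:8.0e}   log-likelihood = {log_lik(sigma):10.2f}")
# The printed values keep increasing: the supremum over sigma is +infinity.
```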
Example 2. The above Example 1 is obviously contaminated and not fit to drink. Now a variable X is called log normal if there are numbers (a, b, c) such that X = c + e^{aY+b} with a Y which is N(0, 1). Let θ = (a, b, c) in R³. The density of X can be taken zero for x ≤ c and, for x > c, equal to

  f_2(x, \theta) = \frac{1}{a\sqrt{2\pi}} \exp\left\{-\frac{1}{2a^2}[\log(x-c) - b]^2\right\} \frac{1}{x-c}.

A sample (x_1, ..., x_n) from this density will almost surely have no ties and a unique minimum z = min_j x_j.

The only values to consider are those for which c < z. Fix a value of b, say b = 0. Take a c ∈ (z-1, z) so close to z that

  |\log(z-c)| = \max_j |\log(x_j - c)|.

Then the sum of squares in the exponent of the joint density does not exceed

  \frac{n}{2a^2} |\log(z-c)|^2.

One can make sure that this does not get too large by taking a = n|log(z-c)|. The extra factor in the density has then a term of the type

  [n|\log(z-c)|]^{-n} \frac{1}{z-c},

which can still be made as large as you please.

If you do not believe my algebra, look at the paper by Hill (1963).
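The same kind of check can be run for Example 2. Again this is only an illustration: the sample and the sequence of values of z − c are arbitrary, and log(x − c) is evaluated as log((x − z) + (z − c)) to avoid floating-point cancellation near the minimum.

```python
# Numerical illustration of Example 2 (not in the paper): with b = 0, c = z - eps
# pushed toward z = min(x), and a = n*|log(z - c)|, the log-likelihood diverges.
import numpy as np

rng = np.random.default_rng(1)
x = 5.0 + np.exp(rng.normal(size=30))   # hypothetical sample, true (a, b, c) = (1, 0, 5)
n, z = len(x), x.min()
d = x - z                               # so that d + eps avoids cancellation below

def log_lik(a, eps):
    u = np.log(d + eps)                 # log(x - c) with c = z - eps and b = 0
    return np.sum(-np.log(a) - 0.5 * np.log(2 * np.pi) - u ** 2 / (2 * a ** 2) - u)

for eps in [1e-2, 1e-100, 1e-200, 1e-300]:
    a = n * abs(np.log(eps))
    print(f"z - c = {eps:7.0e}   a = {a:10.1f}   log-likelihood = {log_lik(a, eps):10.1f}")
# The last column can be pushed as high as one pleases by letting c approach z.
```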
Example 3. The preceding example shows that the log normal distribution misbehaves. Everybody knows that taking logarithms is unfair. The following shows that three dimensional parameters are often unfair as well. (The example can be refined to apply to θ ∈ R².)

Let X = R³ = Θ. Let ‖x‖ be the usual Euclidean length of x. Take a density

  f_3(x, \theta) = C \, \frac{\exp\{-\|x - \theta\|^2\}}{\|x - \theta\|^{\beta}},

with β ∈ (0, 1) fixed, say β = ½. Here again

  \prod_{j=1}^{n} f_3(x_j, \theta)

will have a supremum equal to +∞. This time it is even attained by taking θ = x_1, or x_2.

One can make the situation a bit worse selecting a dense countable subset {a_k}, k = 1, 2, ..., in R³ and taking

  f_4(x, \theta) = \sum_k C(k) \, \frac{\exp\{-\|x - \theta - a_k\|^2\}}{\|x - \theta - a_k\|^{\beta}}

with suitable coefficients C(k) which decrease rapidly to zero.

Now take again α = 10^{-10^{137}} and take

  f_5(x, \theta) = \frac{1-\alpha}{(2\pi)^{3/2}} e^{-\|x-\theta\|^2/2} + \alpha f_3(x, \theta).

If we do take into account the contamination αf_3(x, θ) the supremum is infinite and attained at each x_j. If we ignore it everything seems fine, but then the maximum likelihood estimate is the mean

  \bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j,

which, says C. Stein, is not admissible.

Example 4. The following example shows that, as in Examples 2 and 3, one should not shift things. Take independent identically distributed observations X_1, ..., X_n from the gamma density shifted to start at ξ, so that it is

  f(x, \theta) = \beta^{\alpha} \Gamma^{-1}(\alpha) e^{-\beta(x-\xi)} (x - \xi)^{\alpha-1}

for x ≥ ξ and zero otherwise. Let β and α take positive values and let ξ be arbitrary real. Here, for arbitrary θ_0, one will have

  \sup_{\theta} \prod_{j=1}^{n} f(x_j, \theta) = \infty.

One can achieve +∞ by taking ξ = min_j X_j, α ∈ (0, 1) and β arbitrary. The shape of your observed histogram may be trying to tell you that it comes from an α ≥ 10, but that must be ignored.

Example 5. The previous examples have infinite contaminated inadmissible difficulties. Let us be more practical. Suppose that X_1, X_2, ..., X_n are independent uniformly distributed on [0, θ], θ > 0. Let Z = max_j X_j. Then θ̂_n = Z is the m.l.e. It is obviously pretty good. For instance

  E_\theta(\hat{\theta}_n - \theta)^2 = \theta^2 \frac{2}{(n+1)(n+2)}.

Except for mathematical caviling, as L.J. Savage says, it is also obviously best for all purposes. So, let us not cavil, but try

  \theta^*_n = \frac{n+2}{n+1} Z.

Then

  E_\theta(\theta^*_n - \theta)^2 = \theta^2 \frac{1}{(n+1)^2}.

The ratio of the two is

  \frac{E_\theta(\hat{\theta}_n - \theta)^2}{E_\theta(\theta^*_n - \theta)^2} = \frac{2(n+1)}{n+2}.

This must be less than unity. Therefore one must have 2(n+1) ≤ n+2 or equivalently n ≤ 0.

It is hard to design experiments where the number of observations is strictly negative. Thus our best bet is to design them with n = 0 and uphold the faith.
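A small simulation, added here only as an illustration (θ, n, the seed and the number of replications are arbitrary), reproduces both mean squared errors and their ratio.

```python
# Numerical check of Example 5 (illustration only, not from the paper): the m.l.e.
# Z = max(X) has larger mean squared error than (n + 2)/(n + 1) * Z for every n >= 1.
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 1.0, 10, 200_000        # hypothetical true value and sample size

z = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)
mse_mle = np.mean((z - theta) ** 2)
mse_alt = np.mean(((n + 2) / (n + 1) * z - theta) ** 2)

print("simulated MSE of Z:                ", mse_mle)
print("exact 2*theta^2/((n+1)(n+2)):      ", 2 * theta ** 2 / ((n + 1) * (n + 2)))
print("simulated MSE of (n+2)/(n+1) * Z:  ", mse_alt)
print("exact theta^2/(n+1)^2:             ", theta ** 2 / (n + 1) ** 2)
print("ratio 2(n+1)/(n+2):                ", 2 * (n + 1) / (n + 2))
```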
3 A More Disturbing Example

This one is due to Neyman and Scott. Suppose that (X_j, Y_j), j = 1, 2, ..., are all independent random variables with X_j and Y_j both Normal N(ξ_j, σ²). We wish to estimate σ². A natural way to proceed would be to eliminate the nuisances ξ_j and use the differences Z_j = X_j - Y_j, which are now N(0, 2σ²). One could then estimate σ² by

  s^2 = \frac{1}{2n} \sum_{j=1}^{n} Z_j^2.

That looks possible, but we may have forgotten about some of the information which is contained in the pairs (X_j, Y_j) but not in their differences Z_j. Certainly a direct application of maximum likelihood principles would be better and much less likely to lose information. So we compute σ̂² by taking suprema over all ξ_j and over σ. This gives

  \hat{\sigma}_n^2 = \frac{1}{2} s^2.

Now, we did not take logarithms, nothing was contaminated, there was no infinity involved. In fact nothing seems amiss. So the best estimate must be not the intuitive s² but σ̂² = ½ s².
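A quick simulation of the Neyman-Scott pairs (again an illustration only; σ², n, the nuisance means and the seed are arbitrary) shows the factor of two concretely.

```python
# Simulation of the Neyman-Scott example (illustration only, not from the paper):
# the full m.l.e. of sigma^2 converges to sigma^2 / 2, while s^2 stays consistent.
import numpy as np

rng = np.random.default_rng(3)
sigma2, n = 4.0, 50_000                  # hypothetical true variance and number of pairs
xi = rng.uniform(-10.0, 10.0, size=n)    # arbitrary nuisance means
x = rng.normal(xi, np.sqrt(sigma2))
y = rng.normal(xi, np.sqrt(sigma2))

s2 = np.sum((x - y) ** 2) / (2 * n)                                       # from differences
mle = np.sum((x - (x + y) / 2) ** 2 + (y - (x + y) / 2) ** 2) / (2 * n)   # full m.l.e.

print("true sigma^2:          ", sigma2)
print("s^2 (differences):     ", s2)
print("m.l.e. (all the data): ", mle, "  ~ sigma^2 / 2")
```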
The usual explanation for this discrepancy is that Neyman and Scott had too many parameters. This may be, but how many is too many? When there are too many should one correct the m.l.e. by a factor of two or (n+2)/(n+1) as in Example 5, or by taking a square root as in the m.l.e. for a star-like distribution? For this latter case, see Barlow et al. (1972).

The number of parameters, by itself, does not seem to be that relevant. Take, for instance, i.i.d. observations X_1, X_2, ..., X_n on the line with a totally unknown distribution function F. The m.l.e. of F is the empirical cumulative F_n. It is not that bad. Yet, a crude evaluation shows that F depends on very many parameters indeed, perhaps even more than Barlow et al. had for their star-like distributions.

Note that in the above examples we did not let n tend to infinity. It would not have helped, but now let us consider some examples where the misbehavior will be described as n → ∞.

4 An Example of Bahadur

The following is a slight modification of an example given by Bahadur in 1958. The modification does not have the purity of the original but it is more transparent and the purity can be recovered.

Take a function, say h, defined on (0, 1]. Assume that h is decreasing, that h(x) > 1 for all x ∈ (0, 1] and that

  \int_0^1 h(x) \, dx = \infty.

Select a number c, c ∈ (0, 1), and proceed as follows. One probability measure, say p_0, on [0, 1] is the Lebesgue measure λ itself. Define a number a_1 by the property

  \int_{a_1}^{1} [h(x) - c] \, dx = 1 - c.

Take for p_1 the measure whose density with respect to λ is c for 0 ≤ x ≤ a_1 and h(x) for a_1 < x ≤ 1.

If a_1, a_2, ..., a_{k-1} have been determined define a_k by the relation

  \int_{a_k}^{a_{k-1}} [h(x) - c] \, dx = 1 - c

and take for p_k the measure whose density, say f_k, with respect to λ on [0, 1] is c for x ∉ (a_k, a_{k-1}] and h(x) for x ∈ (a_k, a_{k-1}].

Since

  \int_0^1 h(x) \, dx = \infty

the process can be continued indefinitely, giving a countable family of measures p_k, k = 1, 2, .... Note that any two of them, say p_j and p_k with j < k, are mutually absolutely continuous.

If x_1, x_2, ..., x_n are n observations taken on [0, 1] the corresponding logarithm of likelihood ratio is given by the expression:

  \Lambda_k^{(n)} = \log \prod_{i=1}^{n} \frac{dp_k(x_i)}{dp_j(x_i)} = \sum^{(k)} \log \frac{h(x_i)}{c} - \sum^{(j)} \log \frac{h(x_i)}{c},

where the first sum Σ^{(k)} is for x_i ∈ (a_k, a_{k-1}] and the second is for x_i ∈ (a_j, a_{j-1}].

Now assume that the X_1, ..., X_n are actually i.i.d. from some distribution p_{j_0}. They have a minimum

  Z_n = \min_i X_i.

With probability unity this will fall in some interval (a_{k_n}, a_{k_n - 1}] with k_n = k_n(Z_n). Fix a value j and consider n^{-1} Λ_{k_n}^{(n)}. This is at least equal to

  \frac{1}{n} \log \frac{h(Z_n)}{c} - \frac{1}{n} \nu_{j,n} \log \frac{h(a_j)}{c},

where ν_{j,n} is the number of X_i's which fall in (a_j, a_{j-1}].

According to the strong law of large numbers n^{-1} ν_{j,n} converges to some constant p_{j_0,j} ≤ 1. Also, j_0 being fixed, Z_n tends almost surely to zero. In fact if y < a_{j_0} then

  P_{j_0}[Z_n > y] = (1 - cy)^n \le e^{-ncy}.

Thus, as long as h increases fast enough near zero, for instance if h(x) = e^{1/x²}, this bound and the Borel-Cantelli lemma show that n^{-1} log h(Z_n) tends almost surely to +∞. Since the subtracted term stays bounded, n^{-1} Λ_{k_n}^{(n)} → +∞ for every fixed j, and the maximum likelihood estimate of the index tends almost surely to infinity, whatever the true value j_0 may be.
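The construction of the a_k is easy to carry out numerically. The sketch below is only an illustration: the choices h(x) = e^{1/x²} and c = ½ are mine, and only the first few a_k are computed.

```python
# Sketch of the construction (illustrative choices, not from the paper):
# h(x) = exp(1/x^2) and c = 1/2; each a_k solves int_{a_k}^{a_{k-1}} [h(x) - c] dx = 1 - c.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

c = 0.5
h = lambda x: np.exp(1.0 / x ** 2)

def next_a(prev):
    F = lambda a: quad(lambda x: h(x) - c, a, prev)[0] - (1 - c)
    lower = prev - (1 - c) / (h(prev) - c)   # h decreasing => the root lies above this
    return brentq(F, lower, prev)

a = [1.0]
for _ in range(8):
    a.append(next_a(a[-1]))
print("a_1, ..., a_8:", [round(v, 4) for v in a[1:]])

# each p_k has density c outside (a_k, a_{k-1}] and h inside; its total mass is one
for k in range(1, len(a)):
    mass = c * (1 - (a[k - 1] - a[k])) + quad(h, a[k], a[k - 1])[0]
    print(f"total mass of p_{k}: {mass:.6f}")
```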
One can smooth the example out. To do so take a function u with u(t) = 0 for t ≤ 0 and u(t) = 1 for t ≥ 1. One can find functions u of that kind which are strictly increasing on (0, 1) and are infinitely differentiable on (-∞, +∞).

Now let p_θ = p_k if θ is equal to the integer k. If θ ∈ (k, k+1) let

  p_\theta = [1 - u(\theta - k)] p_k + u(\theta - k) p_{k+1}.

Taking for each p_k the densities f_k used previously, we obtain similarly densities

  f(x, \theta) = [1 - u(\theta - k)] f_k(x) + u(\theta - k) f_{k+1}(x).

The function u can be constructed, for instance, by taking a multiple of the indefinite integral of the function

  \exp\left\{-\left[\frac{1}{t} + \frac{1}{1-t}\right]\right\}

for t ∈ [0, 1) and zero otherwise. If so f(x, θ) is certainly infinitely differentiable in θ. Also the integral ∫ f(x, θ) dx can be differentiated infinitely under the integral sign. There is a slight annoyance that at all integer values of θ all the derivatives vanish. To cure this take α = 10^{-10^{137}} and let

  g(x, \theta) = \tfrac{1}{2}[f(x, \theta) + f(x, \theta + \alpha e^{-\theta})].

Then, certainly, everything is under control and the famous conditions in Cramér's text are all duly satisfied. Furthermore, θ ≠ θ' implies

  \int |g(x, \theta) - g(x, \theta')| \, dx > 0.

In spite of all this, whatever may be the true value θ_0, the maximum likelihood estimate still tends almost surely to infinity.
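The smooth interpolation above needs a function u of the stated kind; for concreteness, one way of computing it numerically is sketched below (the scaling constant is the 'multiple' mentioned earlier, and the construction is the one just described).

```python
# A possible numerical construction of u (illustration only): u is the scaled
# indefinite integral of exp{-[1/t + 1/(1-t)]} over (0, 1), zero to the left, one to the right.
import numpy as np
from scipy.integrate import quad

bump = lambda t: np.exp(-(1.0 / t + 1.0 / (1.0 - t))) if 0.0 < t < 1.0 else 0.0
total = quad(bump, 0.0, 1.0)[0]

def u(t):
    if t <= 0.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    return quad(bump, 0.0, t)[0] / total   # strictly increasing, all derivatives vanish at 0 and 1

print([round(u(t), 4) for t in (-1.0, 0.0, 0.25, 0.5, 0.75, 1.0, 2.0)])
```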
Let us return to the initial example with measures p_k, k = 1, 2, ..., and let us waste some information. Having observed X_1, ..., X_n according to one of the p_k, take independent identically distributed N(0, 10^6) variables Y_1, ..., Y_n and consider V_j = X_j + Y_j for j = 1, 2, ..., n.

Certainly one who observes V_j, j = 1, ..., n, instead of X_j, j = 1, ..., n, must be at a gross disadvantage!

Maximum likelihood estimates do not really think so.

The densities of the new variables V_j are functions, say ψ_k, defined, positive, analytic, etc. on the whole line R = (-∞, +∞). They still are all different. In other words

  \int |\psi_k(x) - \psi_j(x)| \, dx > 0 \quad (k \ne j).

Compute the maximum likelihood estimate θ̂_n = θ̂_n(V_1, ..., V_n) for these new observations. We claim that

  p_j[\hat{\theta}_n(V_1, ..., V_n) = j] \to 1 \quad \text{as } n \to \infty.

To prove this let σ = 10³ and note that ψ_j(v) is a moderately small distortion of the function

  \bar{\psi}_j(v) = c \int_0^1 \frac{1}{\sigma\sqrt{2\pi}} e^{-(v-\xi)^2/(2\sigma^2)} \, d\xi + (1-c) \frac{1}{\sigma\sqrt{2\pi}} e^{-(v-a_j)^2/(2\sigma^2)}.

Furthermore, as m → ∞ the function ψ_m(v) converges pointwise to

  \psi_\infty(v) = c \int_0^1 \frac{1}{\sigma\sqrt{2\pi}} e^{-(v-\xi)^2/(2\sigma^2)} \, d\xi + (1-c) \frac{1}{\sigma\sqrt{2\pi}} e^{-v^2/(2\sigma^2)}.

Thus, we can compactify the set Θ = {1, 2, ...} by addition of a point at infinity with ψ_∞(v) as described above.

We now have a family {ψ_θ; θ ∈ Θ} such that ψ_θ(v) is continuous in θ for each v. Also

  \sup_{k \ge m} \left| \log \frac{\psi_k(v)}{\psi_j(v)} \right|

does not exceed 10^{-6}[|(v-1)^2 - v^2| + 1]. Since this is certainly integrable, the theorem due to Wald (1949) is applicable and θ̂_n is consistent.

So throwing away quite a bit of information made the m.l.e. consistent. Here we wasted information by fudging the observations. Another way would be to enlarge the parameter space and introduce irrelevant other measures p_θ.

For this purpose consider our original variables X_j, but record only in which interval (a_k, a_{k-1}] the variable X_j falls. We obtain then discrete variables, say Y_j, such that P_i[Y_j = k] is the integral q_i(k) of p_i(x) on (a_k, a_{k-1}]. Now, the set Θ of all possible discrete measures on the integers k = 1, 2, ... can be metrized, for instance by the metric

  \|Q_s - Q_r\| = \sum_k |q_s(k) - q_r(k)|.

For this metric the space Θ is a complete separable space.

Given discrete observations Y_j, j = 1, ..., n, we can compute a maximum likelihood estimate, say θ*_n, in this whole space Θ. The value of θ*_n is that element of Θ which assigns to the integer k a probability θ*_n(k) equal to the frequency of k in the sample.

Now, if θ is any element whatsoever of Θ, for every ε > 0, P_θ{‖θ - θ*_n‖ > ε} tends to zero as n → ∞. More precisely, θ*_n → θ almost surely.

The family we are interested in, the q_i, i = 1, 2, ..., constructed above form a certain subset, say Θ_0, of Θ. It is a nice closed (even discrete) subset of Θ.

Suppose that we do know that θ ∈ Θ_0. Then, certainly, one should waste that information. However if we insist on taking a θ̂_n ∈ Θ_0 that maximizes the likelihood there, then θ̂_n will almost never tend to θ. If on the contrary we maximize the likelihood over the entire space of all probability measures on the integers, we get an estimate θ*_n that is consistent.

It is true that this is not the answer to the problem of estimating a θ that lies in Θ_0. Maybe that is too hard a problem? Let us try to select a point θ̃_n ∈ Θ_0 closest to θ*_n. If there is no such closest point just take θ̃_n such that

  \|\tilde{\theta}_n - \theta^*_n\| \le 2^{-n} + \inf\{\|\theta^*_n - \theta\|;\ \theta \in \Theta_0\}.

Then

  P_\theta\{\tilde{\theta}_n = \theta \text{ for all sufficiently large } n\} = 1.

So the problem cannot be too terribly hard. In addition Doob (1948) says that, if we place on Θ_0 a prior measure that charges every point, the corresponding Bayes estimate will behave in the same manner as our θ̃_n.

As explained this example is imitated from one given by Bahadur (1958). Another example imitated from Bahadur and from the mixture of Example 1 has been given by Ferguson (1982).
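The two-step recipe above, empirical frequencies followed by a nearest point of Θ_0, is easy to mimic. In the sketch below the family Θ_0 is replaced by a small made-up stand-in of three fixed discrete laws, since the true q_i would have to be computed from h and c.

```python
# Sketch of the two-step estimate (illustration only): theta*_n is the empirical
# frequency vector, theta_tilde is a nearest member of a stand-in for Theta_0 in the
# metric  ||Q_s - Q_r|| = sum_k |q_s(k) - q_r(k)|.
import numpy as np

def empirical_pmf(sample, k_max):
    counts = np.bincount(sample, minlength=k_max + 1)[1:]   # categories 1, ..., k_max
    return counts / counts.sum()

def project(theta_star, theta0_family):
    dists = [np.abs(theta_star - q).sum() for q in theta0_family]
    return int(np.argmin(dists)), min(dists)

# hypothetical stand-in for Theta_0: three fixed pmfs q_1, q_2, q_3 on {1, ..., 5}
theta0 = [np.array(q) for q in ([.6, .2, .1, .05, .05],
                                [.2, .5, .2, .05, .05],
                                [.1, .2, .4, .2, .1])]
rng = np.random.default_rng(4)
sample = rng.choice(np.arange(1, 6), p=theta0[1], size=500)  # data drawn from q_2

theta_star = empirical_pmf(sample, 5)
index, dist = project(theta_star, theta0)
print("theta*_n:", np.round(theta_star, 3),
      "  nearest q_i: i =", index + 1, "  distance =", round(dist, 3))
```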
Ferguson takes Θ = [0, 1] and considers i.i.d. variables taking values in [-1, +1]. The densities, with respect to Lebesgue measure on [-1, +1], are of the form

  f(x, \theta) = \frac{\theta}{2} + \frac{1-\theta}{\delta(\theta)} \left[1 - \frac{|x - \theta|}{\delta(\theta)}\right]^{+},

where δ is a continuous function that decreases from 1 to 0 on [0, 1]. If it tends to zero rapidly enough as θ → 1, the peaks of the triangles will distract the m.l.e. from its appointed rounds. In Example 1, § 2, the m.l.e. led a precarious existence. Here everything is compact and continuous and all of Wald's conditions, except one, are satisfied. To convert the example into one that satisfies Cramér's conditions, for θ ∈ (0, 1), Ferguson replaces the triangles by Beta densities.

The above example relies heavily on the fact that ratios of the type f(x, θ)/f(x, θ_0) are unbounded functions of θ. One can also make up examples where the ratios stay bounded and the m.l.e. still misbehaves.

A possible example is as follows. For each integer m ≥ 1 divide the interval (0, 1] by binary division, getting 2^m intervals of the form

  (j 2^{-m}, (j+1) 2^{-m}] \quad (j = 0, 1, ..., 2^m - 1).

For each such division there are

  \binom{2^m}{2^{m-1}}

ways of selecting 2^{m-1} of the intervals. Make a selection s. On the selected ones, let φ_{s,m} be equal to 1. On the remaining ones let φ_{s,m} be equal to (-1). This gives a certain countable family of functions.

Now for given m and for the selection s let p_{s,m} be the measure whose density with respect to Lebesgue measure on (0, 1] is 1 + (1 - e^{-m})φ_{s,m}. In this case the ratio of the densities always stays between 0 and 2. The measures are all distinct from one another.

Application of a maximum likelihood technique would lead us to estimate m by +∞. (This is essentially equivalent to another example of Bahadur.)
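A numerical illustration of the last example (again not from the paper; the sample and the seed are arbitrary): at each level m one can pick the selection s covering as many observations as possible, and the resulting maximal log-likelihood creeps up toward n log 2 without ever attaining it.

```python
# Illustration (not from the paper): at division level m, pick the selection s that
# covers as many observations as possible; once every observation sits in a selected
# interval, the best log-likelihood is n*log(2 - e^{-m}), which keeps rising with m.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, size=20)     # hypothetical sample (here the true law is uniform)
n = len(x)

for m in range(1, 13):
    counts = np.bincount(np.minimum((x * 2 ** m).astype(int), 2 ** m - 1),
                         minlength=2 ** m)
    best = np.sort(counts)[::-1][: 2 ** (m - 1)].sum()     # observations covered by s
    loglik = best * np.log(2.0 - np.exp(-m)) + (n - best) * (-m)
    print(f"m = {m:2d}   covered = {best:2d}/{n}   max log-likelihood = {loglik:8.3f}")
# No finite m maximises the likelihood: the supremum n*log 2 is approached as m -> infinity.
```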
5 An Example from Biostatistics

The following is intended to show that even for 'straight' exponential families one can sometimes do better than the m.l.e.

The example has a long history, which we shall not recount. It arises in the evaluation of dose responses in biostatistics.

Suppose that a chemical can be injected into rats at various doses y_1, y_2, ..., all positive. For a particular dose, one just observes whether or not there is a response. There is then for each y a certain probability of response. Biostatisticians, being complicated people, prefer to work not with the dose y but with its logarithm x = log y.

We shall then let p(x) be the probability of response if the animal is given the log dose x.

Some people, including Sir Ronald, felt that the relation x → p(x) would be well described by a cumulative normal distribution, in standard form

  p(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2} \, dt.

I do not know why. Some other people felt that the probability p has a derivative p' about proportional to p, except that for p close to unity (large dose) the poor animal is saturated so that the curve has a ceiling at 1.

Thus, somebody, perhaps Raymond Pearl, following Verhulst, proposed the 'logistic'

  p'(x) = p(x)[1 - p(x)],

whose solution is

  p(x) = \frac{1}{1 + e^{-x}},

give or take a few constants.

Therefore, we shall assume that p(x) has the form

  p(x) = \frac{1}{1 + e^{-(\alpha + \beta x)}}

with two constants α and β, β > 0.

Since we are not particularly interested in the actual animals, we shall consider only the case where β is known, say β = 1, so that α is the only parameter and

  p(x) = \frac{1}{1 + e^{-(\alpha + x)}}.

Now we select a few log doses x_1, x_2, .... At x_j we inject n_j animals and count the number of responses r_j. We want to estimate α.

For reasons which are not biostatistical but historical (or more precisely routine of thought) it is decided that the estimate α̂ should be such that

  R(\hat{\alpha}, \alpha) = E_\alpha(\hat{\alpha} - \alpha)^2

be as small as possible.

A while back, Cramér and Rao said that, for unbiased estimates, R(α̂, α) cannot be smaller than 1/I(α), where I(α) is the Fisher information

  I(\alpha) = \sum_j n_j p(x_j)[1 - p(x_j)].

So, to take into account the fact that some positions of α are better than some others we shall use instead of R the ratio

  F = I(\alpha) E_\alpha(\hat{\alpha} - \alpha)^2.

The joint density of the observations is easy to write. It is just

  \prod_j \binom{n_j}{r_j} [p(x_j)]^{r_j} [1 - p(x_j)]^{s_j},

where s_j is the number of non-respondents at log dose x_j. Now

  \frac{1 - p(x_j)}{p(x_j)} = e^{-(\alpha + x_j)},

so that the business term of the above is just

  \prod_j e^{-(\alpha + x_j) s_j},

in which one recognizes immediately a standard one-dimensional, linearly indexed exponential family, with sufficient statistic \sum_j s_j.
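A sketch of the computation that the next sentence begins to describe, namely solving the likelihood equation for α when β = 1 is known. The doses and counts below are made up, and the likelihood equation Σ_j [r_j − n_j p(x_j)] = 0 is the standard one for this family.

```python
# A sketch of the m.l.e. for alpha with beta = 1 known (illustrative data only).
import numpy as np
from scipy.optimize import brentq

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])      # hypothetical log doses
n = np.array([10, 10, 10, 10, 10])             # animals per dose
r = np.array([1, 3, 6, 8, 10])                 # hypothetical response counts

p = lambda a: 1.0 / (1.0 + np.exp(-(a + x)))
score = lambda a: np.sum(r - n * p(a))          # derivative of the log-likelihood

alpha_hat = brentq(score, -20.0, 20.0)
info = np.sum(n * p(alpha_hat) * (1.0 - p(alpha_hat)))   # Fisher information I(alpha)
print("m.l.e. of alpha:", round(alpha_hat, 4), "  I(alpha_hat):", round(info, 3))
```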
The first thing to try is of course the best estimate of all, namely m.l.e. That leads to a