International Statistical Review (1990), 58, 2, pp. 153-171. Printed in Great Britain
© International Statistical Institute

Maximum Likelihood: An Introduction

L. Le Cam
Department of Statistics, University of California, Berkeley, California 94720, USA

Summary

Maximum likelihood estimates are reported to be best under all circumstances. Yet there are numerous simple examples where they plainly misbehave. One gives some examples for problems that had not been invented for the purpose of annoying maximum likelihood fans. Another example, imitated from Bahadur, has been specially created with just such a purpose in mind. Next, we present a list of principles leading to the construction of good estimates. The main principle says that one should not believe in principles but study each problem for its own sake.

Key words: Estimation; Maximum likelihood; One-step approximations.

1 Introduction

One of the most widely used methods of statistical estimation is that of maximum likelihood. Opinions on who was the first to propose the method differ. However Fisher is usually credited with the invention of the name 'maximum likelihood', with a major effort intended to spread its use and with the derivation of the optimality properties of the resulting estimates.

Qualms about the general validity of the optimality properties have been expressed occasionally. However as late as 1970 L.J. Savage could imply in his 'Fisher lecture' that the difficulties arising in some examples would have rightly been considered 'mathematical caviling' by R.A. Fisher.

Of course nobody has been able to prove that maximum likelihood estimates are 'best' under all circumstances. The lack of any such proof is not sufficient by itself to invalidate Fisher's claims. It might simply mean that we have not yet translated into mathematics the basic principles which underlay Fisher's intuition.

The present author has, unwittingly, contributed to the confusion by writing two papers which have been interpreted by some as attempts to substantiate Fisher's claims.

To clarify the situation we present a few known facts which should be kept in mind as one proceeds along through the various proofs of consistency, asymptotic normality or asymptotic optimality of maximum likelihood estimates.

The examples given here deal mostly with the case of independent identically distributed observations. They are intended to show that maximum likelihood does possess disquieting features which rule out the possibility of existence of undiscovered underlying principles which could be used to justify it. One of the very gross forms of misbehavior can be stated as follows.

Maximum likelihood estimates computed with all the information available may turn out to be inconsistent. Throwing away a substantial part of the information may render them consistent.

The examples show that, in spite of all its presumed virtues, the maximum likelihood procedure cannot be universally recommended. This does not mean that we advocate
some other principle instead, although we give a few guidelines in § 6. For other views see the discussion of the paper by Berkson (1980).

This paper is adapted from lectures given at the University of Maryland, College Park, in the Fall of 1975. We are greatly indebted to Professor Grace L. Yang for the invitation to give the lectures and for the permission to reproduce them.

2 A Few Old Examples

Let X_1, X_2, ..., X_n be independent identically distributed observations with values in some space {X, A}. Suppose that there is a σ-finite measure μ on A and that the distribution P_θ of X_j has a density f(x, θ) with respect to μ. The parameter θ takes its values in some set Θ.

For n observations x_1, x_2, ..., x_n the maximum likelihood estimate is any value θ̂ such that

  \prod_{j=1}^{n} f(x_j, \hat{\theta}) = \sup_{\theta \in \Theta} \prod_{j=1}^{n} f(x_j, \theta).

Note that such a θ̂ need not exist, and that, when it does, it usually depends on what version of the densities f(x, θ) was selected. A function (x_1, ..., x_n) → θ̂(x_1, ..., x_n) selecting a value θ̂ for each n-tuple (x_1, ..., x_n) may or may not be measurable.

However all of this is not too depressing. Let us consider some examples.

Example 1. (This may be due to Kiefer and Wolfowitz or to whoever first looked at mixtures of Normal distributions.) Let α be the number α = 10^{-10^{17}}. Let θ = (μ, σ), μ ∈ (-∞, +∞), σ > 0. Let f_1(x, θ) be the density defined with respect to Lebesgue measure λ on the line by

  f_1(x, \theta) = \frac{1-\alpha}{\sqrt{2\pi}} \exp\left\{-\frac{(x-\mu)^2}{2}\right\} + \frac{\alpha}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}.

Then, for (x_1, ..., x_n) one can take μ = x_1 and note that

  \sup_{\sigma} \prod_{j=1}^{n} f_1(x_j; \mu, \sigma) = \infty.

If σ = 0 was allowed one could claim that θ̂ = (x_1, 0) is maximum likelihood.
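The blow-up in Example 1 is easy to see numerically. The sketch below is not part of the paper; the sample, the seed and the value α = 10⁻¹⁰ are arbitrary choices (the α of the example itself is far below machine precision), but any positive contamination weight shows the same behaviour.

```python
# Numerical illustration of Example 1 (not part of the original paper): with mu fixed
# at x_1, the mixture log-likelihood grows without bound as sigma -> 0.
import numpy as np

rng = np.random.default_rng(0)
alpha = 1e-10                                  # any tiny positive contamination weight
x = rng.normal(loc=2.0, scale=1.0, size=20)    # hypothetical sample
mu = x[0]

def log_lik(sigma):
    main = (1 - alpha) / np.sqrt(2 * np.pi) * np.exp(-0.5 * (x - mu) ** 2)
    spike = alpha / (sigma * np.sqrt(2 * np.pi)) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return np.sum(np.log(main + spike))

for sigma in [1.0, 1e-2, 1e-4, 1e-8, 1e-16]:
    print(f"sigma = {sigma:8.0e}   log-likelihood = {log_lik(sigma):10.2f}")
# The printed values keep increasing: the supremum over sigma is +infinity.
```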
Example 2. The above Example 1 is obviously contaminated and not fit to drink. Now a variable X is called log normal if there are numbers (a, b, c) such that X = c + e^{aY+b} with a Y which is N(0, 1). Let θ = (a, b, c) in R³. The density of X can be taken zero for x ≤ c and, for x > c, equal to

  f_2(x, \theta) = \frac{1}{a\sqrt{2\pi}} \exp\left\{-\frac{1}{2a^2}[\log(x-c) - b]^2\right\} \frac{1}{x-c}.

A sample (x_1, ..., x_n) from this density will almost surely have no ties and a unique minimum z = min_j x_j.

The only values to consider are those for which c < z. Fix a value of b, say b = 0. Take a c ∈ (z-1, z) so close to z that

  |\log(z-c)| = \max_j |\log(x_j - c)|.

Then the sum of squares in the exponent of the joint density does not exceed

  \frac{n}{2a^2} |\log(z-c)|^2.

One can make sure that this does not get too large by taking a = n|log(z-c)|. The extra factor in the density has then a term of the type

  [n|\log(z-c)|]^{-n} \frac{1}{z-c},

which can still be made as large as you please.

If you do not believe my algebra, look at the paper by Hill (1963).
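The same kind of check can be run for Example 2. Again this is only an illustration: the sample and the sequence of values of z − c are arbitrary, and log(x − c) is evaluated as log((x − z) + (z − c)) to avoid floating-point cancellation near the minimum.

```python
# Numerical illustration of Example 2 (not in the paper): with b = 0, c = z - eps
# pushed toward z = min(x), and a = n*|log(z - c)|, the log-likelihood diverges.
import numpy as np

rng = np.random.default_rng(1)
x = 5.0 + np.exp(rng.normal(size=30))   # hypothetical sample, true (a, b, c) = (1, 0, 5)
n, z = len(x), x.min()
d = x - z                               # so that d + eps avoids cancellation below

def log_lik(a, eps):
    u = np.log(d + eps)                 # log(x - c) with c = z - eps and b = 0
    return np.sum(-np.log(a) - 0.5 * np.log(2 * np.pi) - u ** 2 / (2 * a ** 2) - u)

for eps in [1e-2, 1e-100, 1e-200, 1e-300]:
    a = n * abs(np.log(eps))
    print(f"z - c = {eps:7.0e}   a = {a:10.1f}   log-likelihood = {log_lik(a, eps):10.1f}")
# The last column can be pushed as high as one pleases by letting c approach z.
```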
Example 3. The preceding example shows that the log normal distribution misbehaves. Everybody knows that taking logarithms is unfair. The following shows that three dimensional parameters are often unfair as well. (The example can be refined to apply to θ ∈ R².)

Let X = R³ = Θ. Let ‖x‖ be the usual Euclidean length of x. Take a density

  f_3(x, \theta) = C \, \frac{\exp\{-\|x - \theta\|^2\}}{\|x - \theta\|^{\beta}},

with β ∈ (0, 1) fixed, say β = ½. Here again

  \prod_{j=1}^{n} f_3(x_j, \theta)

will have a supremum equal to +∞. This time it is even attained by taking θ = x_1, or x_2.

One can make the situation a bit worse selecting a dense countable subset {a_k}, k = 1, 2, ..., in R³ and taking

  f_4(x, \theta) = \sum_k C(k) \, \frac{\exp\{-\|x - \theta - a_k\|^2\}}{\|x - \theta - a_k\|^{\beta}}

with suitable coefficients C(k) which decrease rapidly to zero.

Now take again α = 10^{-10^{137}} and take

  f_5(x, \theta) = \frac{1-\alpha}{(2\pi)^{3/2}} e^{-\|x-\theta\|^2/2} + \alpha f_3(x, \theta).

If we do take into account the contamination αf_3(x, θ) the supremum is infinite and attained at each x_j. If we ignore it everything seems fine, but then the maximum likelihood estimate is the mean

  \bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j,

which, says C. Stein, is not admissible.

Example 4. The following example shows that, as in Examples 2 and 3, one should not shift things. Take independent identically distributed observations X_1, ..., X_n from the gamma density shifted to start at ξ, so that it is

  f(x, \theta) = \beta^{\alpha} \Gamma^{-1}(\alpha) e^{-\beta(x-\xi)} (x - \xi)^{\alpha-1}

for x ≥ ξ and zero otherwise. Let β and α take positive values and let ξ be arbitrary real. Here, for arbitrary θ_0, one will have

  \sup_{\theta} \prod_{j=1}^{n} f(x_j, \theta) = \infty.

One can achieve +∞ by taking ξ = min_j X_j, α ∈ (0, 1) and β arbitrary. The shape of your observed histogram may be trying to tell you that it comes from an α ≥ 10, but that must be ignored.

Example 5. The previous examples have infinite contaminated inadmissible difficulties. Let us be more practical. Suppose that X_1, X_2, ..., X_n are independent uniformly distributed on [0, θ], θ > 0. Let Z = max_j X_j. Then θ̂_n = Z is the m.l.e. It is obviously pretty good. For instance

  E_\theta(\hat{\theta}_n - \theta)^2 = \theta^2 \frac{2}{(n+1)(n+2)}.

Except for mathematical caviling, as L.J. Savage says, it is also obviously best for all purposes. So, let us not cavil, but try

  \theta^*_n = \frac{n+2}{n+1} Z.

Then

  E_\theta(\theta^*_n - \theta)^2 = \theta^2 \frac{1}{(n+1)^2}.

The ratio of the two is

  \frac{E_\theta(\hat{\theta}_n - \theta)^2}{E_\theta(\theta^*_n - \theta)^2} = \frac{2(n+1)}{n+2}.

This must be less than unity. Therefore one must have 2(n+1) ≤ n+2 or equivalently n ≤ 0.

It is hard to design experiments where the number of observations is strictly negative. Thus our best bet is to design them with n = 0 and uphold the faith.
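A small simulation, added here only as an illustration (θ, n, the seed and the number of replications are arbitrary), reproduces both mean squared errors and their ratio.

```python
# Numerical check of Example 5 (illustration only, not from the paper): the m.l.e.
# Z = max(X) has larger mean squared error than (n + 2)/(n + 1) * Z for every n >= 1.
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 1.0, 10, 200_000        # hypothetical true value and sample size

z = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)
mse_mle = np.mean((z - theta) ** 2)
mse_alt = np.mean(((n + 2) / (n + 1) * z - theta) ** 2)

print("simulated MSE of Z:                ", mse_mle)
print("exact 2*theta^2/((n+1)(n+2)):      ", 2 * theta ** 2 / ((n + 1) * (n + 2)))
print("simulated MSE of (n+2)/(n+1) * Z:  ", mse_alt)
print("exact theta^2/(n+1)^2:             ", theta ** 2 / (n + 1) ** 2)
print("ratio 2(n+1)/(n+2):                ", 2 * (n + 1) / (n + 2))
```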
3 A More Disturbing Example

This one is due to Neyman and Scott. Suppose that (X_j, Y_j), j = 1, 2, ..., are all independent random variables with X_j and Y_j both Normal N(ξ_j, σ²). We wish to estimate σ². A natural way to proceed would be to eliminate the nuisances ξ_j and use the differences Z_j = X_j - Y_j, which are now N(0, 2σ²). One could then estimate σ² by

  s^2 = \frac{1}{2n} \sum_{j=1}^{n} Z_j^2.

That looks possible, but we may have forgotten about some of the information which is contained in the pairs (X_j, Y_j) but not in their differences Z_j. Certainly a direct application of maximum likelihood principles would be better and much less likely to lose information. So we compute σ̂² by taking suprema over all ξ_j and over σ. This gives

  \hat{\sigma}_n^2 = \frac{1}{2} s^2.

Now, we did not take logarithms, nothing was contaminated, there was no infinity involved. In fact nothing seems amiss. So the best estimate must be not the intuitive s² but σ̂² = ½ s².
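A quick simulation of the Neyman-Scott pairs (again an illustration only; σ², n, the nuisance means and the seed are arbitrary) shows the factor of two concretely.

```python
# Simulation of the Neyman-Scott example (illustration only, not from the paper):
# the full m.l.e. of sigma^2 converges to sigma^2 / 2, while s^2 stays consistent.
import numpy as np

rng = np.random.default_rng(3)
sigma2, n = 4.0, 50_000                  # hypothetical true variance and number of pairs
xi = rng.uniform(-10.0, 10.0, size=n)    # arbitrary nuisance means
x = rng.normal(xi, np.sqrt(sigma2))
y = rng.normal(xi, np.sqrt(sigma2))

s2 = np.sum((x - y) ** 2) / (2 * n)                                       # from differences
mle = np.sum((x - (x + y) / 2) ** 2 + (y - (x + y) / 2) ** 2) / (2 * n)   # full m.l.e.

print("true sigma^2:          ", sigma2)
print("s^2 (differences):     ", s2)
print("m.l.e. (all the data): ", mle, "  ~ sigma^2 / 2")
```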
The usual explanation for this discrepancy is that Neyman and Scott had too many parameters. This may be, but how many is too many? When there are too many should one correct the m.l.e. by a factor of two or (n+2)/(n+1) as in Example 5, or by taking a square root as in the m.l.e. for a star-like distribution? For this latter case, see Barlow et al. (1972).

The number of parameters, by itself, does not seem to be that relevant. Take, for instance, i.i.d. observations X_1, X_2, ..., X_n on the line with a totally unknown distribution function F. The m.l.e. of F is the empirical cumulative F_n. It is not that bad. Yet, a crude evaluation shows that F depends on very many parameters indeed, perhaps even more than Barlow et al. had for their star-like distributions.

Note that in the above examples we did not let n tend to infinity. It would not have helped, but now let us consider some examples where the misbehavior will be described as n → ∞.

4 An Example of Bahadur

The following is a slight modification of an example given by Bahadur in 1958. The modification does not have the purity of the original but it is more transparent and the purity can be recovered.

Take a function, say h, defined on (0, 1]. Assume that h is decreasing, that h(x) > 1 for all x ∈ (0, 1] and that

  \int_0^1 h(x) \, dx = \infty.

Select a number c, c ∈ (0, 1), and proceed as follows. One probability measure, say p_0, on [0, 1] is the Lebesgue measure λ itself. Define a number a_1 by the property

  \int_{a_1}^{1} [h(x) - c] \, dx = 1 - c.

Take for p_1 the measure whose density with respect to λ is c for 0 ≤ x ≤ a_1 and h(x) for a_1 < x ≤ 1.

If a_1, a_2, ..., a_{k-1} have been determined define a_k by the relation

  \int_{a_k}^{a_{k-1}} [h(x) - c] \, dx = 1 - c

and take for p_k the measure whose density, say f_k, with respect to λ on [0, 1] is c for x ∉ (a_k, a_{k-1}] and h(x) for x ∈ (a_k, a_{k-1}].

Since

  \int_0^1 h(x) \, dx = \infty

the process can be continued indefinitely, giving a countable family of measures p_k, k = 1, 2, .... Note that any two of them, say p_j and p_k with j < k, are mutually absolutely continuous.

If x_1, x_2, ..., x_n are n observations taken on [0, 1] the corresponding logarithm of likelihood ratio is given by the expression:

  \Lambda_k^{(n)} = \log \prod_{i=1}^{n} \frac{dp_k(x_i)}{dp_j(x_i)} = \sum^{(k)} \log \frac{h(x_i)}{c} - \sum^{(j)} \log \frac{h(x_i)}{c},

where the first sum Σ^{(k)} is for x_i ∈ (a_k, a_{k-1}] and the second is for x_i ∈ (a_j, a_{j-1}].

Now assume that the X_1, ..., X_n are actually i.i.d. from some distribution p_{j_0}. They have a minimum

  Z_n = \min_i X_i.

With probability unity this will fall in some interval (a_{k_n}, a_{k_n - 1}] with k_n = k_n(Z_n). Fix a value j and consider n^{-1} Λ_{k_n}^{(n)}. This is at least equal to

  \frac{1}{n} \log \frac{h(Z_n)}{c} - \frac{1}{n} \nu_{j,n} \log \frac{h(a_j)}{c},

where ν_{j,n} is the number of X_i's which fall in (a_j, a_{j-1}].

According to the strong law of large numbers n^{-1} ν_{j,n} converges to some constant p_{j_0,j} ≤ 1. Also, j_0 being fixed, Z_n tends almost surely to zero. In fact if y < a_{j_0} then

  P_{j_0}[Z_n > y] = (1 - cy)^n \le e^{-ncy}.

Thus, as long as h increases fast enough near zero, for instance if h(x) = e^{1/x²}, this bound and the Borel-Cantelli lemma show that n^{-1} log h(Z_n) tends almost surely to +∞. Since the subtracted term stays bounded, n^{-1} Λ_{k_n}^{(n)} → +∞ for every fixed j, and the maximum likelihood estimate of the index tends almost surely to infinity, whatever the true value j_0 may be.
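The construction of the a_k is easy to carry out numerically. The sketch below is only an illustration: the choices h(x) = e^{1/x²} and c = ½ are mine, and only the first few a_k are computed.

```python
# Sketch of the construction (illustrative choices, not from the paper):
# h(x) = exp(1/x^2) and c = 1/2; each a_k solves int_{a_k}^{a_{k-1}} [h(x) - c] dx = 1 - c.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

c = 0.5
h = lambda x: np.exp(1.0 / x ** 2)

def next_a(prev):
    F = lambda a: quad(lambda x: h(x) - c, a, prev)[0] - (1 - c)
    lower = prev - (1 - c) / (h(prev) - c)   # h decreasing => the root lies above this
    return brentq(F, lower, prev)

a = [1.0]
for _ in range(8):
    a.append(next_a(a[-1]))
print("a_1, ..., a_8:", [round(v, 4) for v in a[1:]])

# each p_k has density c outside (a_k, a_{k-1}] and h inside; its total mass is one
for k in range(1, len(a)):
    mass = c * (1 - (a[k - 1] - a[k])) + quad(h, a[k], a[k - 1])[0]
    print(f"total mass of p_{k}: {mass:.6f}")
```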
One can smooth the example out. To do so take a function u with u(t) = 0 for t ≤ 0 and u(t) = 1 for t ≥ 1. One can find functions u of that kind which are strictly increasing on (0, 1) and are infinitely differentiable on (-∞, +∞).

Now let p_θ = p_k if θ is equal to the integer k. If θ ∈ (k, k+1) let

  p_\theta = [1 - u(\theta - k)] p_k + u(\theta - k) p_{k+1}.

Taking for each p_k the densities f_k used previously, we obtain similarly densities

  f(x, \theta) = [1 - u(\theta - k)] f_k(x) + u(\theta - k) f_{k+1}(x).

The function u can be constructed, for instance, by taking a multiple of the indefinite integral of the function

  \exp\left\{-\left[\frac{1}{t} + \frac{1}{1-t}\right]\right\}

for t ∈ [0, 1) and zero otherwise. If so f(x, θ) is certainly infinitely differentiable in θ. Also the integral ∫ f(x, θ) dx can be differentiated infinitely under the integral sign. There is a slight annoyance that at all integer values of θ all the derivatives vanish. To cure this take α = 10^{-10^{137}} and let

  g(x, \theta) = \tfrac{1}{2}[f(x, \theta) + f(x, \theta + \alpha e^{-\theta})].

Then, certainly, everything is under control and the famous conditions in Cramér's text are all duly satisfied. Furthermore, θ ≠ θ' implies

  \int |g(x, \theta) - g(x, \theta')| \, dx > 0.

In spite of all this, whatever may be the true value θ_0, the maximum likelihood estimate still tends almost surely to infinity.
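The smooth interpolation above needs a function u of the stated kind; for concreteness, one way of computing it numerically is sketched below (the scaling constant is the 'multiple' mentioned earlier, and the construction is the one just described).

```python
# A possible numerical construction of u (illustration only): u is the scaled
# indefinite integral of exp{-[1/t + 1/(1-t)]} over (0, 1), zero to the left, one to the right.
import numpy as np
from scipy.integrate import quad

bump = lambda t: np.exp(-(1.0 / t + 1.0 / (1.0 - t))) if 0.0 < t < 1.0 else 0.0
total = quad(bump, 0.0, 1.0)[0]

def u(t):
    if t <= 0.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    return quad(bump, 0.0, t)[0] / total   # strictly increasing, all derivatives vanish at 0 and 1

print([round(u(t), 4) for t in (-1.0, 0.0, 0.25, 0.5, 0.75, 1.0, 2.0)])
```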
Let us return to the initial example with measures p_k, k = 1, 2, ..., and let us waste some information. Having observed X_1, ..., X_n according to one of the p_k, take independent identically distributed N(0, 10^6) variables Y_1, ..., Y_n and consider V_j = X_j + Y_j for j = 1, 2, ..., n.

Certainly one who observes V_j, j = 1, ..., n, instead of X_j, j = 1, ..., n, must be at a gross disadvantage!

Maximum likelihood estimates do not really think so.

The densities of the new variables V_j are functions, say ψ_k, defined, positive, analytic, etc. on the whole line R = (-∞, +∞). They still are all different. In other words

  \int |\psi_k(x) - \psi_j(x)| \, dx > 0 \quad (k \ne j).

Compute the maximum likelihood estimate θ̂_n = θ̂_n(V_1, ..., V_n) for these new observations. We claim that

  p_j[\hat{\theta}_n(V_1, ..., V_n) = j] \to 1 \quad \text{as } n \to \infty.

To prove this let σ = 10³ and note that ψ_j(v) is a moderately small distortion of the function

  \bar{\psi}_j(v) = c \int_0^1 \frac{1}{\sigma\sqrt{2\pi}} e^{-(v-\xi)^2/(2\sigma^2)} \, d\xi + (1-c) \frac{1}{\sigma\sqrt{2\pi}} e^{-(v-a_j)^2/(2\sigma^2)}.

Furthermore, as m → ∞ the function ψ_m(v) converges pointwise to

  \psi_\infty(v) = c \int_0^1 \frac{1}{\sigma\sqrt{2\pi}} e^{-(v-\xi)^2/(2\sigma^2)} \, d\xi + (1-c) \frac{1}{\sigma\sqrt{2\pi}} e^{-v^2/(2\sigma^2)}.

Thus, we can compactify the set Θ = {1, 2, ...} by addition of a point at infinity with ψ_∞(v) as described above.

We now have a family {ψ_θ; θ ∈ Θ} such that ψ_θ(v) is continuous in θ for each v. Also

  \sup_{k \ge m} \left| \log \frac{\psi_k(v)}{\psi_j(v)} \right|

does not exceed 10^{-6}[|(v-1)^2 - v^2| + 1]. Since this is certainly integrable, the theorem due to Wald (1949) is applicable and θ̂_n is consistent.

So throwing away quite a bit of information made the m.l.e. consistent. Here we wasted information by fudging the observations. Another way would be to enlarge the parameter space and introduce irrelevant other measures p_θ.

For this purpose consider our original variables X_j, but record only in which interval (a_k, a_{k-1}] the variable X_j falls. We obtain then discrete variables, say Y_j, such that P_i[Y_j = k] is the integral q_i(k) of p_i(x) on (a_k, a_{k-1}]. Now, the set Θ of all possible discrete measures on the integers k = 1, 2, ... can be metrized, for instance by the metric

  \|Q_s - Q_r\| = \sum_k |q_s(k) - q_r(k)|.

For this metric the space Θ is a complete separable space.

Given discrete observations Y_j, j = 1, ..., n, we can compute a maximum likelihood estimate, say θ*_n, in this whole space Θ. The value of θ*_n is that element of Θ which assigns to the integer k a probability θ*_n(k) equal to the frequency of k in the sample.

Now, if θ is any element whatsoever of Θ, for every ε > 0, P_θ{‖θ - θ*_n‖ > ε} tends to zero as n → ∞. More precisely, θ*_n → θ almost surely.

The family we are interested in, the q_i, i = 1, 2, ..., constructed above form a certain subset, say Θ_0, of Θ. It is a nice closed (even discrete) subset of Θ.

Suppose that we do know that θ ∈ Θ_0. Then, certainly, one should waste that information. However if we insist on taking a θ̂_n ∈ Θ_0 that maximizes the likelihood there, then θ̂_n will almost never tend to θ. If on the contrary we maximize the likelihood over the entire space of all probability measures on the integers, we get an estimate θ*_n that is consistent.

It is true that this is not the answer to the problem of estimating a θ that lies in Θ_0. Maybe that is too hard a problem? Let us try to select a point θ̃_n ∈ Θ_0 closest to θ*_n. If there is no such closest point just take θ̃_n such that

  \|\tilde{\theta}_n - \theta^*_n\| \le 2^{-n} + \inf\{\|\theta^*_n - \theta\|;\ \theta \in \Theta_0\}.

Then

  P_\theta\{\tilde{\theta}_n = \theta \text{ for all sufficiently large } n\} = 1.

So the problem cannot be too terribly hard. In addition Doob (1948) says that, if we place on Θ_0 a prior measure that charges every point, the corresponding Bayes estimate will behave in the same manner as our θ̃_n.

As explained this example is imitated from one given by Bahadur (1958). Another example imitated from Bahadur and from the mixture of Example 1 has been given by Ferguson (1982).
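The two-step recipe above, empirical frequencies followed by a nearest point of Θ_0, is easy to mimic. In the sketch below the family Θ_0 is replaced by a small made-up stand-in of three fixed discrete laws, since the true q_i would have to be computed from h and c.

```python
# Sketch of the two-step estimate (illustration only): theta*_n is the empirical
# frequency vector, theta_tilde is a nearest member of a stand-in for Theta_0 in the
# metric  ||Q_s - Q_r|| = sum_k |q_s(k) - q_r(k)|.
import numpy as np

def empirical_pmf(sample, k_max):
    counts = np.bincount(sample, minlength=k_max + 1)[1:]   # categories 1, ..., k_max
    return counts / counts.sum()

def project(theta_star, theta0_family):
    dists = [np.abs(theta_star - q).sum() for q in theta0_family]
    return int(np.argmin(dists)), min(dists)

# hypothetical stand-in for Theta_0: three fixed pmfs q_1, q_2, q_3 on {1, ..., 5}
theta0 = [np.array(q) for q in ([.6, .2, .1, .05, .05],
                                [.2, .5, .2, .05, .05],
                                [.1, .2, .4, .2, .1])]
rng = np.random.default_rng(4)
sample = rng.choice(np.arange(1, 6), p=theta0[1], size=500)  # data drawn from q_2

theta_star = empirical_pmf(sample, 5)
index, dist = project(theta_star, theta0)
print("theta*_n:", np.round(theta_star, 3),
      "  nearest q_i: i =", index + 1, "  distance =", round(dist, 3))
```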
Ferguson takes Θ = [0, 1] and considers i.i.d. variables taking values in [-1, +1]. The densities, with respect to Lebesgue measure on [-1, +1], are of the form

  f(x, \theta) = \frac{\theta}{2} + \frac{1-\theta}{\delta(\theta)} \left[1 - \frac{|x - \theta|}{\delta(\theta)}\right]^{+},

where δ is a continuous function that decreases from 1 to 0 on [0, 1]. If it tends to zero rapidly enough as θ → 1, the peaks of the triangles will distract the m.l.e. from its appointed rounds. In Example 1, § 2, the m.l.e. led a precarious existence. Here everything is compact and continuous and all of Wald's conditions, except one, are satisfied. To convert the example into one that satisfies Cramér's conditions, for θ ∈ (0, 1), Ferguson replaces the triangles by Beta densities.

The above example relies heavily on the fact that ratios of the type f(x, θ)/f(x, θ_0) are unbounded functions of θ. One can also make up examples where the ratios stay bounded and the m.l.e. still misbehaves.

A possible example is as follows. For each integer m ≥ 1 divide the interval (0, 1] by binary division, getting 2^m intervals of the form

  (j 2^{-m}, (j+1) 2^{-m}] \quad (j = 0, 1, ..., 2^m - 1).

For each such division there are

  \binom{2^m}{2^{m-1}}

ways of selecting 2^{m-1} of the intervals. Make a selection s. On the selected ones, let φ_{s,m} be equal to 1. On the remaining ones let φ_{s,m} be equal to (-1). This gives a certain countable family of functions.

Now for given m and for the selection s let p_{s,m} be the measure whose density with respect to Lebesgue measure on (0, 1] is 1 + (1 - e^{-m})φ_{s,m}. In this case the ratio of the densities always stays between 0 and 2. The measures are all distinct from one another.

Application of a maximum likelihood technique would lead us to estimate m by +∞. (This is essentially equivalent to another example of Bahadur.)
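A numerical illustration of the last example (again not from the paper; the sample and the seed are arbitrary): at each level m one can pick the selection s covering as many observations as possible, and the resulting maximal log-likelihood creeps up toward n log 2 without ever attaining it.

```python
# Illustration (not from the paper): at division level m, pick the selection s that
# covers as many observations as possible; once every observation sits in a selected
# interval, the best log-likelihood is n*log(2 - e^{-m}), which keeps rising with m.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, size=20)     # hypothetical sample (here the true law is uniform)
n = len(x)

for m in range(1, 13):
    counts = np.bincount(np.minimum((x * 2 ** m).astype(int), 2 ** m - 1),
                         minlength=2 ** m)
    best = np.sort(counts)[::-1][: 2 ** (m - 1)].sum()     # observations covered by s
    loglik = best * np.log(2.0 - np.exp(-m)) + (n - best) * (-m)
    print(f"m = {m:2d}   covered = {best:2d}/{n}   max log-likelihood = {loglik:8.3f}")
# No finite m maximises the likelihood: the supremum n*log 2 is approached as m -> infinity.
```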
5 An Example from Biostatistics

The following is intended to show that even for 'straight' exponential families one can sometimes do better than the m.l.e.

The example has a long history, which we shall not recount. It arises in the evaluation of dose responses in biostatistics.

Suppose that a chemical can be injected into rats at various doses y_1, y_2, ..., all positive. For a particular dose, one just observes whether or not there is a response. There is then for each y a certain probability of response. Biostatisticians, being complicated people, prefer to work not with the dose y but with its logarithm x = log y.

We shall then let p(x) be the probability of response if the animal is given the log dose x.

Some people, including Sir Ronald, felt that the relation x → p(x) would be well described by a cumulative normal distribution, in standard form

  p(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2} \, dt.

I do not know why. Some other people felt that the probability p has a derivative p' about proportional to p, except that for p close to unity (large dose) the poor animal is saturated so that the curve has a ceiling at 1.

Thus, somebody, perhaps Raymond Pearl, following Verhulst, proposed the 'logistic'

  p'(x) = p(x)[1 - p(x)],

whose solution is

  p(x) = \frac{1}{1 + e^{-x}},

give or take a few constants.

Therefore, we shall assume that p(x) has the form

  p(x) = \frac{1}{1 + e^{-(\alpha + \beta x)}}

with two constants α and β, β > 0.

Since we are not particularly interested in the actual animals, we shall consider only the case where β is known, say β = 1, so that α is the only parameter and

  p(x) = \frac{1}{1 + e^{-(\alpha + x)}}.

Now we select a few log doses x_1, x_2, .... At x_j we inject n_j animals and count the number of responses r_j. We want to estimate α.

For reasons which are not biostatistical but historical (or more precisely routine of thought) it is decided that the estimate α̂ should be such that

  R(\hat{\alpha}, \alpha) = E_\alpha(\hat{\alpha} - \alpha)^2

be as small as possible.

A while back, Cramér and Rao said that, for unbiased estimates, R(α̂, α) cannot be smaller than 1/I(α), where I(α) is the Fisher information

  I(\alpha) = \sum_j n_j p(x_j)[1 - p(x_j)].

So, to take into account the fact that some positions of α are better than some others we shall use instead of R the ratio

  F = I(\alpha) E_\alpha(\hat{\alpha} - \alpha)^2.

The joint density of the observations is easy to write. It is just

  \prod_j \binom{n_j}{r_j} [p(x_j)]^{r_j} [1 - p(x_j)]^{s_j},

where s_j is the number of non-respondents at log dose x_j. Now

  \frac{1 - p(x_j)}{p(x_j)} = e^{-(\alpha + x_j)},

so that the business term of the above is just

  \prod_j e^{-(\alpha + x_j) s_j},

in which one recognizes immediately a standard one-dimensional, linearly indexed exponential family, with sufficient statistic \sum_j s_j.
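A sketch of the computation that the next sentence begins to describe, namely solving the likelihood equation for α when β = 1 is known. The doses and counts below are made up, and the likelihood equation Σ_j [r_j − n_j p(x_j)] = 0 is the standard one for this family.

```python
# A sketch of the m.l.e. for alpha with beta = 1 known (illustrative data only).
import numpy as np
from scipy.optimize import brentq

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])      # hypothetical log doses
n = np.array([10, 10, 10, 10, 10])             # animals per dose
r = np.array([1, 3, 6, 8, 10])                 # hypothetical response counts

p = lambda a: 1.0 / (1.0 + np.exp(-(a + x)))
score = lambda a: np.sum(r - n * p(a))          # derivative of the log-likelihood

alpha_hat = brentq(score, -20.0, 20.0)
info = np.sum(n * p(alpha_hat) * (1.0 - p(alpha_hat)))   # Fisher information I(alpha)
print("m.l.e. of alpha:", round(alpha_hat, 4), "  I(alpha_hat):", round(info, 3))
```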
The first thing to try is of course the best estimate of all, namely m.l.e. That leads to a