Lecture 3

Properties of MLE: consistency, asymptotic normality. Fisher information.

In this section we will try to understand why MLEs are 'good'. Let us recall two facts from probability that will be used often throughout this course.

• Law of Large Numbers (LLN): If the distribution of the i.i.d. sample $X_1, \ldots, X_n$ is such that $X_1$ has finite expectation, i.e. $|E X_1| < \infty$, then the sample average $\bar{X}_n = \frac{1}{n}(X_1 + \cdots + X_n)$ converges to $E X_1$ in probability, which means that for any $\varepsilon > 0$,
\[
P\bigl(|\bar{X}_n - E X_1| > \varepsilon\bigr) \to 0 \quad \text{as } n \to \infty.
\]

Note. Whenever we use the LLN below we will simply say that the average converges to its expectation and will not mention in what sense. More mathematically inclined students are welcome to carry out these steps more rigorously, especially when we use the LLN in combination with the Central Limit Theorem.

• Central Limit Theorem (CLT): If the distribution of the i.i.d. sample $X_1, \ldots, X_n$ is such that $X_1$ has finite expectation and variance, i.e. $|E X_1| < \infty$ and $\sigma^2 = \mathrm{Var}(X_1) < \infty$, then
\[
\sqrt{n}\,(\bar{X}_n - E X_1) \xrightarrow{d} N(0, \sigma^2)
\]
converges in distribution to the normal distribution with zero mean and variance $\sigma^2$, which means that for any interval $[a, b]$,
\[
P\bigl(\sqrt{n}\,(\bar{X}_n - E X_1) \in [a, b]\bigr) \to \int_a^b \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}}\, dx.
\]
In other words, the random variable $\sqrt{n}\,(\bar{X}_n - E X_1)$ will behave like a random variable from a normal distribution when $n$ gets large.

Exercise. Illustrate the CLT by generating 100 Bernoulli random variables $B(p)$ (or one Binomial r.v. $B(100, p)$) and then computing $\sqrt{n}\,(\bar{X}_n - E X_1)$. Repeat this many times and use 'dfittool' to see that this random quantity will be well approximated by a normal distribution.
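The following is a minimal Python sketch of this exercise; it is an illustration added here, not part of the original notes, and it uses numpy, scipy and matplotlib instead of the suggested MATLAB 'dfittool'. It repeats the experiment many times and compares the histogram of $\sqrt{n}\,(\bar{X}_n - E X_1)$ with the $N(0, p(1-p))$ density predicted by the CLT.

\begin{verbatim}
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

p, n, repetitions = 0.3, 100, 5000
rng = np.random.default_rng(0)

# For each repetition: draw n Bernoulli(p) samples and compute sqrt(n)*(sample mean - p).
samples = rng.binomial(1, p, size=(repetitions, n))
z = np.sqrt(n) * (samples.mean(axis=1) - p)

# Compare the histogram with the N(0, p(1-p)) density predicted by the CLT.
xs = np.linspace(z.min(), z.max(), 200)
plt.hist(z, bins=40, density=True, alpha=0.5, label="simulated")
plt.plot(xs, stats.norm.pdf(xs, scale=np.sqrt(p * (1 - p))), label="N(0, p(1-p))")
plt.legend()
plt.show()
\end{verbatim}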
We will prove that the MLE satisfies (usually) the following two properties, called consistency and asymptotic normality.

1. Consistency. We say that an estimate $\hat{\theta}$ is consistent if $\hat{\theta} \to \theta_0$ in probability as $n \to \infty$, where $\theta_0$ is the 'true' unknown parameter of the distribution of the sample.

2. Asymptotic Normality. We say that $\hat{\theta}$ is asymptotically normal if
\[
\sqrt{n}\,(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, \sigma^2_{\theta_0}),
\]
where $\sigma^2_{\theta_0}$ is called the asymptotic variance of the estimate $\hat{\theta}$. Asymptotic normality says that the estimator not only converges to the unknown parameter, but it converges fast enough, at a rate $1/\sqrt{n}$.

Consistency of MLE.

To make our discussion as simple as possible, let us assume that the likelihood function is smooth and behaves in a nice way, as shown in Figure 3.1, i.e. its maximum is achieved at a unique point $\hat{\theta}$.

[Figure 3.1: Maximum Likelihood Estimator (MLE) — a smooth likelihood function of $\theta$ with a unique maximum at $\hat{\theta}$.]

Suppose that the data $X_1, \ldots, X_n$ is generated from a distribution with unknown parameter $\theta_0$ and $\hat{\theta}$ is the MLE. Why does $\hat{\theta}$ converge to the unknown parameter $\theta_0$? This is not immediately obvious, and in this section we will give a sketch of why this happens.
First of all, the MLE $\hat{\theta}$ is the maximizer of
\[
L_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \log f(X_i|\theta),
\]
which is the log-likelihood function normalized by $\frac{1}{n}$ (of course, this does not affect the maximization). Notice that the function $L_n(\theta)$ depends on the data. Let us consider the function $l(X|\theta) = \log f(X|\theta)$ and define
\[
L(\theta) = E_{\theta_0} l(X|\theta),
\]
where $E_{\theta_0}$ denotes the expectation with respect to the true unknown parameter $\theta_0$ of the sample $X_1, \ldots, X_n$. If we deal with continuous distributions then
\[
L(\theta) = \int (\log f(x|\theta))\, f(x|\theta_0)\, dx.
\]
By the law of large numbers, for any $\theta$,
\[
L_n(\theta) \to E_{\theta_0} l(X|\theta) = L(\theta).
\]
Note that $L(\theta)$ does not depend on the sample; it only depends on $\theta$. We will need the following

Lemma. For any $\theta$,
\[
L(\theta) \le L(\theta_0).
\]
Moreover, the inequality is strict, $L(\theta) < L(\theta_0)$, unless
\[
P_{\theta_0}\bigl(f(X|\theta) = f(X|\theta_0)\bigr) = 1,
\]
which means that $P_{\theta} = P_{\theta_0}$.

Proof. Let us consider the difference
\[
L(\theta) - L(\theta_0) = E_{\theta_0}\bigl(\log f(X|\theta) - \log f(X|\theta_0)\bigr) = E_{\theta_0} \log \frac{f(X|\theta)}{f(X|\theta_0)}.
\]
Since $\log t \le t - 1$, we can write
\[
E_{\theta_0} \log \frac{f(X|\theta)}{f(X|\theta_0)}
\le E_{\theta_0}\Bigl(\frac{f(X|\theta)}{f(X|\theta_0)} - 1\Bigr)
= \int \Bigl(\frac{f(x|\theta)}{f(x|\theta_0)} - 1\Bigr) f(x|\theta_0)\, dx
= \int f(x|\theta)\, dx - \int f(x|\theta_0)\, dx = 1 - 1 = 0.
\]
Both integrals are equal to 1 because we are integrating probability density functions. This proves that $L(\theta) - L(\theta_0) \le 0$. The second statement of the Lemma is also clear. □
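As a quick numerical sanity check of the Lemma (an addition for illustration, not part of the original notes): for the Bernoulli family $B(p)$ we have $L(p) = E_{p_0} \log f(X|p) = p_0 \log p + (1 - p_0)\log(1 - p)$, and a short Python sketch confirms that this function is maximized exactly at $p = p_0$.

\begin{verbatim}
import numpy as np

p0 = 0.3  # 'true' parameter (an arbitrary choice for illustration)
p_grid = np.linspace(0.01, 0.99, 981)

# L(p) = E_{p0} log f(X|p) = p0*log(p) + (1 - p0)*log(1 - p) for the Bernoulli family.
L = p0 * np.log(p_grid) + (1 - p0) * np.log(1 - p_grid)

p_max = p_grid[np.argmax(L)]
print(f"L(p) is maximized at p = {p_max:.2f} (true parameter p0 = {p0})")
# Prints: L(p) is maximized at p = 0.30, in agreement with the Lemma.
\end{verbatim}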
We will use this Lemma to sketch the consistency of the MLE.

Theorem: Under some regularity conditions on the family of distributions, the MLE $\hat{\theta}$ is consistent, i.e. $\hat{\theta} \to \theta_0$ as $n \to \infty$.

The statement of this Theorem is not very precise, but rather than proving a rigorous mathematical statement our goal here is to illustrate the main idea. Mathematically inclined students are welcome to come up with a precise statement.

[Figure 3.2: Illustration to Theorem — $L_n(\theta)$, maximized at $\hat{\theta}$, approaches $L(\theta)$, maximized at $\theta_0$.]

Proof. We have the following facts:

1. $\hat{\theta}$ is the maximizer of $L_n(\theta)$ (by definition).

2. $\theta_0$ is the maximizer of $L(\theta)$ (by the Lemma).

3. For every $\theta$ we have $L_n(\theta) \to L(\theta)$ by the LLN.

This situation is illustrated in Figure 3.2. Therefore, since the two functions $L_n$ and $L$ are getting closer, their points of maximum should also get closer, which exactly means that $\hat{\theta} \to \theta_0$. □
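To see this convergence numerically, here is a short Python sketch (an illustration added here, with arbitrary parameter choices) that computes the Bernoulli MLE $\hat{p} = \bar{X}$ on progressively larger samples from $B(p_0)$ and watches it approach $p_0$.

\begin{verbatim}
import numpy as np

p0 = 0.3
rng = np.random.default_rng(1)
x = rng.binomial(1, p0, size=100_000)  # one long i.i.d. Bernoulli(p0) sample

# The MLE for Bernoulli is the sample mean; compute it on growing prefixes of the sample.
for n in [10, 100, 1_000, 10_000, 100_000]:
    p_hat = x[:n].mean()
    print(f"n = {n:6d}   p_hat = {p_hat:.4f}   |p_hat - p0| = {abs(p_hat - p0):.4f}")
# As n grows, p_hat settles near p0 = 0.3, illustrating consistency.
\end{verbatim}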
Asymptotic normality of MLE. Fisher information.

We want to show the asymptotic normality of the MLE, i.e. to show that
\[
\sqrt{n}\,(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, \sigma^2_{\mathrm{MLE}})
\]
for some $\sigma^2_{\mathrm{MLE}}$, and to compute $\sigma^2_{\mathrm{MLE}}$. This asymptotic variance in some sense measures the quality of the MLE.

First, we need to introduce a notion called Fisher information. Let us recall that above we defined the function $l(X|\theta) = \log f(X|\theta)$. To simplify the notation we will denote by $l'(X|\theta)$, $l''(X|\theta)$, etc. the derivatives of $l(X|\theta)$ with respect to $\theta$.

Definition. (Fisher information.) The Fisher information of a random variable $X$ with distribution $P_{\theta_0}$ from the family $\{P_{\theta} : \theta \in \Theta\}$ is defined by
\[
I(\theta_0) = E_{\theta_0}\bigl(l'(X|\theta_0)\bigr)^2 = E_{\theta_0}\Bigl(\frac{\partial}{\partial\theta} \log f(X|\theta)\Big|_{\theta=\theta_0}\Bigr)^2.
\]
Remark. Let us give a very informal interpretation of Fisher information. The derivative
\[
l'(X|\theta_0) = (\log f(X|\theta_0))' = \frac{f'(X|\theta_0)}{f(X|\theta_0)}
\]
can be interpreted as a measure of how quickly the distribution density or p.f. will change when we slightly change the parameter $\theta$ near $\theta_0$. When we square this and take the expectation, i.e. average over $X$, we get an averaged version of this measure. So if the Fisher information is large, this means that the distribution will change quickly when we move the parameter, so the distribution with parameter $\theta_0$ is 'quite different' and 'can be well distinguished' from the distributions with parameters not so close to $\theta_0$. This means that we should be able to estimate $\theta_0$ well based on the data. On the other hand, if the Fisher information is small, this means that the distribution is 'very similar' to distributions with parameters not so close to $\theta_0$ and, thus, more difficult to distinguish, so our estimation will be worse. We will see precisely this behavior in the Theorem below.

The next lemma gives another, often convenient, way to compute the Fisher information.

Lemma. We have
\[
E_{\theta_0} l''(X|\theta_0) = E_{\theta_0} \frac{\partial^2}{\partial\theta^2} \log f(X|\theta_0) = -I(\theta_0).
\]

Proof. First of all, we have
\[
l'(X|\theta) = (\log f(X|\theta))' = \frac{f'(X|\theta)}{f(X|\theta)}
\quad\text{and}\quad
(\log f(X|\theta))'' = \frac{f''(X|\theta)}{f(X|\theta)} - \frac{(f'(X|\theta))^2}{f^2(X|\theta)}.
\]
Also, since the p.d.f. integrates to 1,
\[
\int f(x|\theta)\, dx = 1,
\]
if we take derivatives of this equation with respect to $\theta$ (and interchange derivative and integral, which can usually be done) we will get
\[
\int \frac{\partial}{\partial\theta} f(x|\theta)\, dx = 0
\quad\text{and}\quad
\int \frac{\partial^2}{\partial\theta^2} f(x|\theta)\, dx = \int f''(x|\theta)\, dx = 0.
\]
To finish the proof we write the following computation:
\[
E_{\theta_0} l''(X|\theta_0) = E_{\theta_0} \frac{\partial^2}{\partial\theta^2} \log f(X|\theta_0) = \int (\log f(x|\theta_0))''\, f(x|\theta_0)\, dx
\]
\[
= \int \Bigl(\frac{f''(x|\theta_0)}{f(x|\theta_0)} - \Bigl(\frac{f'(x|\theta_0)}{f(x|\theta_0)}\Bigr)^2\Bigr) f(x|\theta_0)\, dx
= \int f''(x|\theta_0)\, dx - E_{\theta_0}\bigl(l'(X|\theta_0)\bigr)^2 = 0 - I(\theta_0) = -I(\theta_0). \qquad \square
\]
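Here is a small Monte Carlo sketch (again an addition for illustration, not in the original notes) checking that the two expressions for Fisher information agree for the exponential family $E(\alpha)$, where $l'(x|\alpha) = 1/\alpha - x$ and $l''(x|\alpha) = -1/\alpha^2$, so both formulas should give $I(\alpha) = 1/\alpha^2$; this example is worked out analytically at the end of the section.

\begin{verbatim}
import numpy as np

alpha0 = 2.0
rng = np.random.default_rng(2)
x = rng.exponential(scale=1 / alpha0, size=1_000_000)  # X ~ E(alpha0)

# l(x|alpha) = log(alpha) - alpha*x, so l'(x|alpha) = 1/alpha - x and l''(x|alpha) = -1/alpha^2.
score_sq = (1 / alpha0 - x) ** 2              # (l'(X|alpha0))^2
neg_second = np.full_like(x, 1 / alpha0**2)   # -l''(X|alpha0), constant in x

print("E[(l')^2]  ~", score_sq.mean())    # Monte Carlo estimate of I(alpha0)
print("-E[l'']    ~", neg_second.mean())  # exact here, since l'' does not depend on x
print("1/alpha0^2 =", 1 / alpha0**2)      # analytic value 0.25
\end{verbatim}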
We are now ready to prove the main result of this section.

Theorem. (Asymptotic normality of the MLE.) We have
\[
\sqrt{n}\,(\hat{\theta} - \theta_0) \xrightarrow{d} N\Bigl(0, \frac{1}{I(\theta_0)}\Bigr).
\]
As we can see, the asymptotic variance (the dispersion of the estimate around the true parameter) will be smaller when the Fisher information is larger.

Proof. Since the MLE $\hat{\theta}$ is the maximizer of $L_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \log f(X_i|\theta)$, we have
\[
L_n'(\hat{\theta}) = 0.
\]
Let us use the Mean Value Theorem
\[
\frac{f(a) - f(b)}{a - b} = f'(c) \quad\text{or}\quad f(a) = f(b) + f'(c)(a - b) \quad\text{for some } c \in [a, b]
\]
with $f(\theta) = L_n'(\theta)$, $a = \hat{\theta}$ and $b = \theta_0$. Then we can write
\[
0 = L_n'(\hat{\theta}) = L_n'(\theta_0) + L_n''(\hat{\theta}_1)(\hat{\theta} - \theta_0)
\]
for some $\hat{\theta}_1 \in [\hat{\theta}, \theta_0]$. From here we get that
\[
\hat{\theta} - \theta_0 = -\frac{L_n'(\theta_0)}{L_n''(\hat{\theta}_1)}
\quad\text{and}\quad
\sqrt{n}\,(\hat{\theta} - \theta_0) = -\frac{\sqrt{n}\, L_n'(\theta_0)}{L_n''(\hat{\theta}_1)}. \qquad (3.0.1)
\]
Since by the Lemma in the previous section we know that $\theta_0$ is the maximizer of $L(\theta)$, we have
\[
L'(\theta_0) = E_{\theta_0} l'(X|\theta_0) = 0. \qquad (3.0.2)
\]
Therefore, the numerator in (3.0.1),
\[
\sqrt{n}\, L_n'(\theta_0) = \sqrt{n}\Bigl(\frac{1}{n}\sum_{i=1}^{n} l'(X_i|\theta_0) - 0\Bigr)
= \sqrt{n}\Bigl(\frac{1}{n}\sum_{i=1}^{n} l'(X_i|\theta_0) - E_{\theta_0} l'(X_1|\theta_0)\Bigr)
\xrightarrow{d} N\bigl(0, \mathrm{Var}_{\theta_0}(l'(X_1|\theta_0))\bigr), \qquad (3.0.3)
\]
converges in distribution by the Central Limit Theorem.

Next, let us consider the denominator in (3.0.1). First of all, we have that for all $\theta$,
\[
L_n''(\theta) = \frac{1}{n}\sum_{i=1}^{n} l''(X_i|\theta) \to E_{\theta_0} l''(X_1|\theta) \quad\text{by the LLN.} \qquad (3.0.4)
\]
Also, since $\hat{\theta}_1 \in [\hat{\theta}, \theta_0]$ and, by the consistency result of the previous section, $\hat{\theta} \to \theta_0$, we have $\hat{\theta}_1 \to \theta_0$. Using this together with (3.0.4) we get
\[
L_n''(\hat{\theta}_1) \to E_{\theta_0} l''(X_1|\theta_0) = -I(\theta_0) \quad\text{by the Lemma above.}
\]
Combining this with (3.0.3) we get
\[
-\frac{\sqrt{n}\, L_n'(\theta_0)}{L_n''(\hat{\theta}_1)} \xrightarrow{d} N\Bigl(0, \frac{\mathrm{Var}_{\theta_0}(l'(X_1|\theta_0))}{(I(\theta_0))^2}\Bigr).
\]
Finally, the variance,
\[
\mathrm{Var}_{\theta_0}(l'(X_1|\theta_0)) = E_{\theta_0}\bigl(l'(X|\theta_0)\bigr)^2 - \bigl(E_{\theta_0} l'(X|\theta_0)\bigr)^2 = I(\theta_0) - 0,
\]
where in the last equality we used the definition of Fisher information and (3.0.2). Therefore, the asymptotic variance is $I(\theta_0)/(I(\theta_0))^2 = 1/I(\theta_0)$, which proves the Theorem. □

Let us compute the Fisher information for some particular distributions.

Example 1. The family of Bernoulli distributions $B(p)$ has p.f.
\[
f(x|p) = p^x (1 - p)^{1 - x},
\]
and taking the logarithm,
\[
\log f(x|p) = x \log p + (1 - x)\log(1 - p).
\]
The first and second derivatives with respect to the parameter $p$ are
\[
\frac{\partial}{\partial p} \log f(x|p) = \frac{x}{p} - \frac{1 - x}{1 - p},
\qquad
\frac{\partial^2}{\partial p^2} \log f(x|p) = -\frac{x}{p^2} - \frac{1 - x}{(1 - p)^2}.
\]
Then the Fisher information can be computed as
\[
I(p) = -E\frac{\partial^2}{\partial p^2} \log f(X|p) = \frac{E X}{p^2} + \frac{1 - E X}{(1 - p)^2} = \frac{p}{p^2} + \frac{1 - p}{(1 - p)^2} = \frac{1}{p(1 - p)}.
\]
The MLE of $p$ is $\hat{p} = \bar{X}$, and the asymptotic normality result states that
\[
\sqrt{n}\,(\hat{p} - p_0) \xrightarrow{d} N(0, p_0(1 - p_0)),
\]
which, of course, also follows directly from the CLT.
Example. The family of exponential distributions $E(\alpha)$ has p.d.f.
\[
f(x|\alpha) =
\begin{cases}
\alpha e^{-\alpha x}, & x \ge 0,\\
0, & x < 0,
\end{cases}
\]
and, therefore,
\[
\log f(x|\alpha) = \log \alpha - \alpha x
\;\Rightarrow\;
\frac{\partial^2}{\partial \alpha^2} \log f(x|\alpha) = -\frac{1}{\alpha^2}.
\]
This does not depend on $X$ and we get
\[
I(\alpha) = -E\frac{\partial^2}{\partial \alpha^2} \log f(X|\alpha) = \frac{1}{\alpha^2}.
\]
Therefore, the MLE $\hat{\alpha} = 1/\bar{X}$ is asymptotically normal and
\[
\sqrt{n}\,(\hat{\alpha} - \alpha_0) \xrightarrow{d} N(0, \alpha_0^2). \qquad \square
\]
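As with the CLT exercise above, this asymptotic normality statement can be checked by simulation. The following Python sketch (an illustration added here, with arbitrary choices of $\alpha_0$, $n$ and the number of repetitions) repeats the experiment many times and compares quantiles of $\sqrt{n}\,(\hat{\alpha} - \alpha_0)$, where $\hat{\alpha} = 1/\bar{X}$, with those of the limit $N(0, \alpha_0^2) = N(0, 1/I(\alpha_0))$; they should agree approximately for large $n$.

\begin{verbatim}
import numpy as np
from scipy import stats

alpha0, n, repetitions = 2.0, 500, 20_000
rng = np.random.default_rng(3)

# Each row is an i.i.d. sample of size n from E(alpha0); the MLE is alpha_hat = 1 / sample mean.
samples = rng.exponential(scale=1 / alpha0, size=(repetitions, n))
alpha_hat = 1 / samples.mean(axis=1)
z = np.sqrt(n) * (alpha_hat - alpha0)

# Compare a few quantiles of sqrt(n)*(alpha_hat - alpha0) with those of N(0, alpha0^2).
for q in [0.05, 0.25, 0.5, 0.75, 0.95]:
    print(f"q={q:.2f}: simulated {np.quantile(z, q):6.3f}   "
          f"N(0, alpha0^2) {stats.norm.ppf(q, scale=alpha0):6.3f}")
\end{verbatim}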