DEPARTMENT OF ECONOMICS
UNIVERSITY OF CYPRUS

THE MM, ME, ML, EL, EF AND GMM APPROACHES TO ESTIMATION: A SYNTHESIS

Anil K. Bera and Yannis Bilias

Discussion Paper 2001-09

P.O. Box 20537, 1678 Nicosia, CYPRUS
Tel.: ++357-2-892430, Fax: ++357-2-892432
Web site: http://www.econ.ucy.ac.cy
Abstract

The 20th century began on an auspicious statistical note with the publication of Karl Pearson's (1900) goodness-of-fit test, which is regarded as one of the most important scientific breakthroughs. The basic motivation behind this test was to see whether an assumed probability model adequately described the data at hand. Pearson (1894) also introduced a formal approach to statistical estimation through his method of moments (MM) estimation. Ronald A. Fisher, while he was a third-year undergraduate at Gonville and Caius College, Cambridge, suggested the maximum likelihood estimation (MLE) procedure as an alternative to Pearson's MM approach. In 1922 Fisher published a monumental paper that introduced such basic concepts as consistency, efficiency, sufficiency, and even the term "parameter" with its present meaning. Fisher (1922) provided the analytical foundation of MLE and studied its efficiency relative to the MM estimator. Fisher (1924a) established the asymptotic equivalence of the minimum $\chi^2$ and ML estimators and wrote in favor of using the minimum $\chi^2$ method rather than Pearson's MM approach. Recently, econometricians have found working under assumed likelihood functions restrictive and have suggested using a generalized version of Pearson's MM approach, commonly known as the GMM estimation procedure, as advocated in Hansen (1982). Earlier, Godambe (1960) and Durbin (1960) developed the estimating function (EF) approach to estimation, which has proven very useful for many statistical models. A fundamental result is that the score is the optimum EF. Ferguson (1958) considered an approach very similar to GMM and showed that estimation based on the Pearson chi-squared statistic is equivalent to efficient GMM. Golan, Judge and Miller (1996) developed an entropy-based formulation that allowed them to solve a wide range of estimation and inference problems in econometrics. More recently, Imbens, Spady and Johnson (1998), Kitamura and Stutzer (1997) and Mittelhammer, Judge and Miller (2000) put GMM within the framework of empirical likelihood (EL) and maximum entropy (ME) estimation. It can be shown that many of these estimation techniques can be obtained as special cases of minimizing the Cressie and Read (1984) power divergence criterion, which comes directly from the Pearson (1900) chi-squared statistic. In this way we are able to assimilate a number of seemingly unrelated estimation techniques into a unified framework.
1 Prologue: Karl Pearson's method of moment estimation and chi-squared test, and entropy

In this paper we are going to discuss various methods of estimation, especially those developed in the twentieth century, beginning with a review of some developments in statistics at the close of the nineteenth century. In 1892 W. F. Raphael Weldon, a zoologist turned statistician, requested Karl Pearson (1857-1936) to analyze a set of data on crabs. After some investigation Pearson realized that he could not fit the usual normal distribution to this data. By the early 1890s Pearson had developed a class of distributions that later came to be known as the Pearson system of curves, which is much broader than the normal distribution. However, for the crab data Pearson's own system of curves was not good enough. He dissected this "abnormal frequency curve" into two normal curves as follows:

$$ f(y) = \alpha f_1(y) + (1 - \alpha) f_2(y), \qquad (1) $$

where

$$ f_j(y) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\Big[-\frac{1}{2\sigma_j^2}(y - \mu_j)^2\Big], \qquad j = 1, 2. $$

This model has five parameters^1 $(\alpha, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$. Previously, there had been no method available to estimate such a model. Pearson quite unceremoniously suggested a method that simply equated the first five population moments to the respective sample counterparts. It was not easy to solve five highly nonlinear equations. Therefore, Pearson took an analytical approach of eliminating one parameter in each step. After considerable algebra he found a ninth-degree polynomial equation in one unknown. Then, after solving this equation and by repeated back-substitutions, he found solutions for the five parameters in terms of the first five sample moments. It was around the autumn of 1893 that he completed this work, and it appeared in 1894. And this was the beginning of the method of moments (MM) estimation. There is no general theory in Pearson (1894). The paper is basically a worked-out "example" (though a very difficult one as the first illustration of MM estimation) of a new estimation method.^2

^1 The term "parameter" was introduced by Fisher (1922, p.311) [also see footnote 16]. Karl Pearson described the "parameters" as "constants" of the "curve." Fisher (1912) also used "frequency curve." However, in Fisher (1922) he used the term "distribution" throughout. "Probability density function" came much later, in Wilks (1943, p.8) [see David (1995)].

^2 Shortly after Karl Pearson's death, his son Egon Pearson provided an account of the life and work of the elder Pearson [see Pearson (1936)]. He summarized (pp.219-220) the contribution of Pearson (1894) stating, "The paper is particularly noteworthy for its introduction of the method of moments as a means of fitting a theoretical curve to observed data. This method is not claimed to be the best but is advocated from the utilitarian standpoint on the grounds that it appears to give excellent fits and provides algebraic solutions for calculating the constants of the curve which are analytically possible."
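As a purely illustrative aside, and not part of Pearson's treatment, his five moment equations can also be solved numerically with modern software. The Python sketch below generates a hypothetical sample from the mixture in (1), computes the first five sample moments, and recovers $(\alpha, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$ with SciPy's root finder in place of Pearson's ninth-degree polynomial; the data, sample size and starting values are assumptions of the example, and the solver can be sensitive to those starting values.

# Sketch of Pearson-style method-of-moments estimation for the two-component
# normal mixture in (1), solved numerically rather than via Pearson's
# ninth-degree polynomial.  All data and starting values are hypothetical.
import numpy as np
from scipy.optimize import fsolve

def normal_raw_moments(mu, sig2):
    """First five raw moments of a normal distribution with mean mu, variance sig2."""
    return np.array([
        mu,
        mu**2 + sig2,
        mu**3 + 3*mu*sig2,
        mu**4 + 6*mu**2*sig2 + 3*sig2**2,
        mu**5 + 10*mu**3*sig2 + 15*mu*sig2**2,
    ])

def moment_equations(params, sample_moments):
    """Mixture population moments minus the observed sample moments."""
    a, mu1, sig21, mu2, sig22 = params
    pop = a * normal_raw_moments(mu1, sig21) + (1 - a) * normal_raw_moments(mu2, sig22)
    return pop - sample_moments

rng = np.random.default_rng(0)
n = 5000
z = rng.random(n) < 0.3                        # true alpha = 0.3 (assumed)
y = np.where(z, rng.normal(0.0, 1.0, n), rng.normal(3.0, 1.5, n))

sample_moments = np.array([np.mean(y**r) for r in range(1, 6)])
start = np.array([0.5, 0.0, 1.0, 2.5, 1.0])    # crude starting values
est = fsolve(moment_equations, start, args=(sample_moments,))
print("MM estimates (alpha, mu1, sig2_1, mu2, sig2_2):", est)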
After an experience of "some eight years" in applying the MM to a vast range of physical and social data, Pearson (1902) provided some "theoretical" justification of his methodology. Suppose we want to estimate the parameter vector $\theta = (\theta_1, \theta_2, \ldots, \theta_p)'$ of the probability density function $f(y; \theta)$. By a Taylor series expansion of $f(y) \equiv f(y; \theta)$ around $y = 0$, we can write

$$ f(y) = \phi_0 + \phi_1 y + \phi_2 \frac{y^2}{2!} + \phi_3 \frac{y^3}{3!} + \ldots + \phi_p \frac{y^p}{p!} + R, \qquad (2) $$

where $\phi_0, \phi_1, \phi_2, \ldots, \phi_p$ depend on $\theta_1, \theta_2, \ldots, \theta_p$ and $R$ is the remainder term. Let $\bar{f}(y)$ be the ordinate corresponding to $y$ given by the observations. Therefore, the problem is to fit a smooth curve $f(y; \theta)$ to $p$ histogram ordinates given by $\bar{f}(y)$. Then $f(y) - \bar{f}(y)$ denotes the distance between the theoretical and observed curves at the point corresponding to $y$, and our objective would be to make this distance as small as possible by a proper choice of $\phi_0, \phi_1, \phi_2, \ldots, \phi_p$ [see Pearson (1902, p.268)].^3 Although Pearson discussed the fit of $f(y)$ to $p$ histogram ordinates $\bar{f}(y)$, he proceeded to find a "theoretical" version of $f(y)$ that minimizes [see Mensch (1980)]

$$ \int [f(y) - \bar{f}(y)]^2 \, dy. \qquad (3) $$

Since $f(\cdot)$ is the variable, the resulting equation is

$$ \int [f(y) - \bar{f}(y)]\, \delta f \, dy = 0, \qquad (4) $$

where, from (2), the differential $\delta f$ can be written as

$$ \delta f = \sum_{j=0}^{p} \Big( \delta\phi_j \frac{y^j}{j!} + \frac{\partial R}{\partial \phi_j}\, \delta\phi_j \Big). \qquad (5) $$

Therefore, we can write equation (4) as

$$ \int [f(y) - \bar{f}(y)] \sum_{j=0}^{p} \Big( \delta\phi_j \frac{y^j}{j!} + \frac{\partial R}{\partial \phi_j}\, \delta\phi_j \Big) dy = \sum_{j=0}^{p} \int [f(y) - \bar{f}(y)] \Big( \frac{y^j}{j!} + \frac{\partial R}{\partial \phi_j} \Big) dy \; \delta\phi_j = 0. \qquad (6) $$

Since the quantities $\phi_0, \phi_1, \phi_2, \ldots, \phi_p$ are at our choice, for (6) to hold, each component should be independently zero, i.e., we should have

$$ \int [f(y) - \bar{f}(y)] \Big( \frac{y^j}{j!} + \frac{\partial R}{\partial \phi_j} \Big) dy = 0, \qquad j = 0, 1, 2, \ldots, p, \qquad (7) $$

^3 It is hard to trace the first use of smooth non-parametric density estimation in the statistics literature. Koenker (2000, p.349) mentioned Galton's (1885) illustration of "regression to the mean" where Galton averaged the counts from the four adjacent squares to achieve smoothness. Karl Pearson's minimization of the distance between $f(y)$ and $\bar{f}(y)$ looks remarkably modern in terms of ideas and could be viewed as a modern equivalent of smooth non-parametric density estimation [see also Mensch (1980)].
which is the same as

$$ \mu_j = m_j - j! \int [f(y) - \bar{f}(y)]\, \frac{\partial R}{\partial \phi_j} \, dy, \qquad j = 0, 1, 2, \ldots, p. \qquad (8) $$

Here $\mu_j$ and $m_j$ are, respectively, the $j$-th moment corresponding to the theoretical curve $f(y)$ and the observed curve $\bar{f}(y)$.^4 Pearson (1902) then ignored the integral terms, arguing that they involve the small factor $f(y) - \bar{f}(y)$ and the remainder term $R$, which by "hypothesis" is small for a large enough sample size. After neglecting the integral terms in (8), Pearson obtained the equations

$$ \mu_j = m_j, \qquad j = 0, 1, \ldots, p. \qquad (9) $$

Then, he stated the principle of the MM as [see Pearson (1902, p.270)]: "To fit a good theoretical curve $f(y; \theta_1, \theta_2, \ldots, \theta_p)$ to an observed curve, express the area and moments of the curve for the given range of observations in terms of $\theta_1, \theta_2, \ldots, \theta_p$, and equate these to the like quantities for the observations." Arguing that, if the first $p$ moments of two curves are identical, the higher moments of the curves become "ipso facto more and more nearly identical" for larger sample sizes, he concluded that the "equality of moments gives a good method of fitting curves to observations" [Pearson (1902, p.271)]. We should add that much of his theoretical argument is not very rigorous, but the 1902 paper did provide a reasonable theoretical basis for the MM and illustrated its usefulness.^5 For a detailed discussion of the properties of the MM estimator see Shenton (1950, 1958, 1959).

After developing his system of curves [Pearson (1895)], Pearson and his associates were fitting this system to a large number of data sets. Therefore, there was a need to formulate a test to check whether an assumed probability model adequately explained the data at hand. He succeeded in doing that, and the result was Pearson's celebrated (1900) $\chi^2$ goodness-of-fit test.

^4 It should be stressed that $m_j = \int y^j \bar{f}(y)\,dy = \sum_{i=1}^{n} y_i^j \pi_i$, with $\pi_i$ denoting the area of the bin of the $i$th observation; this is not necessarily equal to the sample moment $n^{-1}\sum_i y_i^j$ that is used in today's MM. Rather, Pearson's formulation of empirical moments uses the efficient weighting $\pi_i$ under a multinomial probability framework, an idea which is used in the literature of empirical likelihood and maximum entropy and will be described later in this paper.

^5 One of the first and possibly most important applications of the MM idea is the derivation of the t-distribution in Student (1908), which was a major breakthrough in introducing the concept of a finite sample (exact) distribution in statistics. Student (1908) obtained the first four moments of the sample variance $S^2$, matched them with those of the Pearson type III distribution, and concluded (p.4) "a curve of Professor Pearson's type III may be expected to fit the distribution of $S^2$." Student, however, was very cautious and quickly added (p.5), "it is probable that the curve found represents the theoretical distribution of $S^2$ so that although we have no actual proof we shall assume it to do so in what follows." And this was the basis of his derivation of the t-distribution. The name t-distribution was given by Fisher (1924b).
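To make the moment-matching principle in (9) concrete, here is a minimal sketch, not taken from the paper, for a hypothetical gamma sample: equating the population mean $k\theta$ and variance $k\theta^2$ to their sample counterparts yields closed-form MM estimators. The gamma model and the simulated data are assumptions of the illustration.

# Minimal illustration of Pearson's moment-matching principle mu_j = m_j for a
# hypothetical gamma(shape k, scale theta) sample: equate the first two moments
# (mean k*theta, variance k*theta^2) to the sample mean and variance.
import numpy as np

rng = np.random.default_rng(1)
y = rng.gamma(shape=2.0, scale=3.0, size=2000)   # hypothetical data

ybar = y.mean()
s2 = y.var()                 # second central sample moment

theta_hat = s2 / ybar        # from (k*theta^2) / (k*theta) = theta
k_hat = ybar / theta_hat     # from k*theta = ybar

print("MM estimates: shape =", round(k_hat, 3), ", scale =", round(theta_hat, 3))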
To describe the Pearson test, let us consider a distribution with $k$ classes, the probability of the $j$-th class being $q_j \, (\geq 0)$, $j = 1, 2, \ldots, k$, with $\sum_{j=1}^{k} q_j = 1$. Suppose that, according to the assumed probability model, $q_j = q_{j0}$; therefore, one would be interested in testing the hypothesis $H_0: q_j = q_{j0}$, $j = 1, 2, \ldots, k$. Let $n_j$ denote the observed frequency of the $j$-th class, with $\sum_{j=1}^{k} n_j = N$. Pearson (1900) suggested the goodness-of-fit statistic^6

$$ P = \sum_{j=1}^{k} \frac{(n_j - N q_{j0})^2}{N q_{j0}} = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j}, \qquad (10) $$

where $O_j$ and $E_j$ denote, respectively, the observed and expected frequencies of the $j$-th class. This is the first constructive test in the statistics literature. Broadly speaking, $P$ is essentially a distance measure between the observed and expected frequencies.

It is quite natural to question the relevance of this test statistic in the context of estimation. Let us note that $P$ could be used to measure the distance between any two sets of probabilities, say $(p_j, q_j)$, $j = 1, 2, \ldots, k$, by simply writing $p_j = n_j/N$ and $q_j = q_{j0}$, i.e.,

$$ P = N \sum_{j=1}^{k} \frac{(p_j - q_j)^2}{q_j}. \qquad (11) $$

As we will see shortly, a simple transformation of $P$ could generate a broad class of distance measures. And later, in Section 5, we will demonstrate that many of the current estimation procedures in econometrics can be cast in terms of minimizing the distance between two sets of probabilities subject to certain constraints. In this way, we can tie and assimilate many estimation techniques together using Pearson's MM and $\chi^2$-statistic as the unifying themes.

We can write $P$ as

$$ P = N \sum_{j=1}^{k} p_j \frac{(p_j - q_j)}{q_j} = N \sum_{j=1}^{k} p_j \Big( \frac{p_j}{q_j} - 1 \Big). \qquad (12) $$

Therefore, the essential quantity in measuring the divergence between two probability distributions is the ratio $(p_j/q_j)$.

^6 This test is regarded as one of the 20 most important scientific breakthroughs of this century, along with advances and discoveries like the theory of relativity, the IQ test, hybrid corn, antibiotics, television, the transistor and the computer [see Hacking (1984)]. In his editorial in the inaugural issue of Sankhyā, The Indian Journal of Statistics, Mahalanobis (1933) wrote, ". . . the history of modern statistics may be said to have begun from Karl Pearson's work on the distribution of $\chi^2$ in 1900. The Chi-square test supplied for the first time a tool by which the significance of the agreement or discrepancy between theoretical expectations and actual observations could be judged with precision." Even Pearson's lifelong arch-rival Ronald A. Fisher (1922, p.314) conceded, "Nor is the introduction of the Pearsonian system of frequency curves the only contribution which their author has made to the solution of problems of specification: of even greater importance is the introduction of an objective criterion of goodness of fit." For more on this see Bera (2000) and Bera and Bilias (2001).
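As a small numerical illustration, with frequencies and null probabilities that are entirely made up, the sketch below computes the statistic in (10) from observed and expected counts and again from the probability form (11), confirming that the two expressions coincide.

# Pearson's goodness-of-fit statistic, computed from frequencies as in (10) and
# from the probability form (11).  The counts and null probabilities are made up.
import numpy as np

observed = np.array([18, 29, 62, 91])        # O_j, hypothetical counts
q0 = np.array([0.10, 0.15, 0.30, 0.45])      # null class probabilities q_{j0}
N = observed.sum()

expected = N * q0                            # E_j = N q_{j0}
P_freq = np.sum((observed - expected)**2 / expected)     # equation (10)

p = observed / N                             # p_j = n_j / N
P_prob = N * np.sum((p - q0)**2 / q0)        # equation (11)

print("P from (10):", P_freq, "  P from (11):", P_prob)
# Under H0, P is approximately chi-squared with k - 1 = 3 degrees of freedom.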
Using Stevens' (1975) idea on "visual perception," Cressie and Read (1984) suggested using the relative difference between the perceived probabilities, $(p_j/q_j)^{\lambda} - 1$, where $\lambda$ "typically lies in the range from 0.6 to 0.9" but could theoretically be any real number [see also Read and Cressie (1988, p.17)]. Weighting this quantity proportionally to $p_j$ and summing over all the classes leads to the following measure of divergence:

$$ \sum_{j=1}^{k} p_j \Big[ \Big( \frac{p_j}{q_j} \Big)^{\lambda} - 1 \Big]. \qquad (13) $$

This is approximately proportional to the Cressie and Read (1984) power divergence family of statistics^7

$$ I_{\lambda}(p, q) = \frac{2}{\lambda(\lambda+1)} \sum_{j=1}^{k} p_j \Big[ \Big( \frac{p_j}{q_j} \Big)^{\lambda} - 1 \Big] = \frac{2}{\lambda(\lambda+1)} \sum_{j=1}^{k} q_j \Big[ \Big\{ 1 + \Big( \frac{p_j}{q_j} - 1 \Big) \Big\}^{\lambda+1} - 1 \Big], \qquad (14) $$

where $p = (p_1, p_2, \ldots, p_k)'$ and $q = (q_1, q_2, \ldots, q_k)'$. Lindsay (1994, p.1085) calls $\delta_j = (p_j/q_j) - 1$ the "Pearson" residual, since we can express the Pearson statistic in (11) as $P = N \sum_{j=1}^{k} q_j \delta_j^2$. From this, it is immediately seen that when $\lambda = 1$, $I_{\lambda}(p, q)$ reduces to $P/N$. In fact, a number of well-known test statistics can be obtained from $I_{\lambda}(p, q)$. When $\lambda \to 0$, we have the likelihood ratio (LR) test statistic, which, as an alternative to (10), can be written as

$$ LR = 2 \sum_{j=1}^{k} n_j \ln\Big( \frac{n_j}{N q_{j0}} \Big) = 2 \sum_{j=1}^{k} O_j \ln\Big( \frac{O_j}{E_j} \Big). \qquad (15) $$

Similarly, $\lambda = -1/2$ gives the Freeman and Tukey (FT) (1950) statistic, or Hellinger distance,

$$ FT = 4 \sum_{j=1}^{k} \big( \sqrt{n_j} - \sqrt{N q_{j0}} \big)^2 = 4 \sum_{j=1}^{k} \big( \sqrt{O_j} - \sqrt{E_j} \big)^2. \qquad (16) $$

All these test statistics are just different measures of distance between the observed and expected frequencies. Therefore, $I_{\lambda}(p, q)$ provides a very rich class of divergence measures.

^7 In the entropy literature this is known as Rényi's (1961) $\alpha$-class of generalized measures of entropy [see Maasoumi (1993, p.144), Ullah (1996, p.142) and Mittelhammer, Judge and Miller (2000, p.328)]. Golan, Judge and Miller (1996, p.36) referred to Schützenberger (1954) as well. This formulation has also been used extensively as a general class of decomposable income inequality measures, for example, see Cowell (1980) and Shorrocks (1980), and in time-series analysis to distinguish chaotic data from random data [Pompe (1994)].
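The special cases just listed are easy to verify numerically. Using the same kind of made-up frequencies as before, the sketch below evaluates $N \cdot I_{\lambda}(p, q)$ at $\lambda = 1$, $\lambda$ close to 0 and $\lambda = -1/2$ and compares the results with the Pearson, LR and FT statistics in (10), (15) and (16).

# Numerical check of the Cressie-Read power divergence family (14): N*I_lambda
# reproduces the Pearson statistic (lambda = 1), the LR statistic (lambda -> 0)
# and the Freeman-Tukey statistic (lambda = -1/2).  Frequencies are made up.
import numpy as np

def cressie_read(p, q, lam):
    """Power divergence I_lambda(p, q) of (14); lam must differ from 0 and -1."""
    return 2.0 / (lam * (lam + 1.0)) * np.sum(p * ((p / q)**lam - 1.0))

observed = np.array([18, 29, 62, 91], dtype=float)
q0 = np.array([0.10, 0.15, 0.30, 0.45])
N = observed.sum()
p = observed / N

pearson = np.sum((observed - N * q0)**2 / (N * q0))            # (10)
lr = 2.0 * np.sum(observed * np.log(observed / (N * q0)))      # (15)
ft = 4.0 * np.sum((np.sqrt(observed) - np.sqrt(N * q0))**2)    # (16)

print("lambda = 1   :", N * cressie_read(p, q0, 1.0), "vs Pearson", pearson)
print("lambda -> 0  :", N * cressie_read(p, q0, 1e-8), "vs LR", lr)
print("lambda = -1/2:", N * cressie_read(p, q0, -0.5), "vs FT", ft)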
Any probability distribution $p_i$, $i = 1, 2, \ldots, n$ (say), of a random variable taking $n$ values provides a measure of uncertainty regarding that random variable. In the information theory literature, this measure of uncertainty is called entropy. The origin of the term "entropy" goes back to thermodynamics. The second law of thermodynamics states that there is an inherent tendency for disorder to increase. A probability distribution gives us a measure of disorder. Entropy is generally taken as a measure of expected information, that is, how much information we have in the probability distribution $p_i$, $i = 1, 2, \ldots, n$. Intuitively, information should be a decreasing function of $p_i$; i.e., the more unlikely an event, the more interesting it is to know that it can happen [see Shannon and Weaver (1949, p.105) and Sen (1975, pp.34-35)]. A simple choice for such a function is $-\ln p_i$. Entropy $H(p)$ is defined as a weighted sum of the information $-\ln p_i$, $i = 1, 2, \ldots, n$, with the respective probabilities as weights, namely,

$$ H(p) = -\sum_{i=1}^{n} p_i \ln p_i. \qquad (17) $$

If $p_i = 0$ for some $i$, then $p_i \ln p_i$ is taken to be zero. When $p_i = 1/n$ for all $i$, $H(p) = \ln n$; then we have the maximum value of the entropy and consequently the least information available from the probability distribution. The other extreme case occurs when $p_i = 1$ for one $i$ and $p_i = 0$ for the rest; then $H(p) = 0$. If we do not weigh each $-\ln p_i$ by $p_i$ and simply take the sum, another measure of entropy would be

$$ H'(p) = -\sum_{i=1}^{n} \ln p_i. \qquad (18) $$

Following (17), the cross-entropy of one probability distribution $p = (p_1, p_2, \ldots, p_n)'$ with respect to another distribution $q = (q_1, q_2, \ldots, q_n)'$ can be defined as

$$ C(p, q) = \sum_{i=1}^{n} p_i \ln(p_i/q_i) = E[\ln p] - E[\ln q], \qquad (19) $$

which is yet another measure of distance between two distributions. It is easy to see the link between $C(p, q)$ and the Cressie and Read (1984) power divergence family. If we choose $q = (1/n, 1/n, \ldots, 1/n)' = i/n$, where $i$ is an $n \times 1$ vector of ones, $C(p, q)$ reduces to

$$ C(p, i/n) = \sum_{i=1}^{n} p_i \ln p_i + \ln n. \qquad (20) $$

Therefore, entropy maximization is a special case of cross-entropy minimization with respect to the uniform distribution. For more on entropy, cross-entropy and their uses in econometrics see Maasoumi (1993), Ullah (1996), Golan, Judge and Miller (1996, 1997 and 1998), Zellner and Highfield (1988), Zellner (1991) and other papers in Grandy and Schick (1991), Zellner (1997) and Mittelhammer, Judge and Miller (2000).

If we try to find a probability distribution that maximizes the entropy $H(p)$ in (17), the optimal solution is the uniform distribution, i.e., $p^* = i/n$.
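The quantities in (17)-(20) are straightforward to compute. The sketch below, using an arbitrary example distribution, verifies that $H(p)$ attains its maximum $\ln n$ at the uniform distribution and that the cross-entropy against the uniform distribution equals $\ln n - H(p)$, as in (20).

# Entropy (17) and cross-entropy (19)-(20) for a small example distribution,
# illustrating that H(p) is maximized by the uniform distribution and that
# cross-entropy against the uniform equals ln(n) - H(p).
import numpy as np

def entropy(p):
    """H(p) = -sum p_i ln p_i, with 0*ln(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def cross_entropy(p, q):
    """C(p, q) = sum p_i ln(p_i / q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.5, 0.25, 0.15, 0.10])     # arbitrary example distribution
n = len(p)
uniform = np.full(n, 1.0 / n)

print("H(p)        =", entropy(p))
print("H(uniform)  =", entropy(uniform), "= ln n =", np.log(n))
print("C(p, i/n)   =", cross_entropy(p, uniform))
print("ln n - H(p) =", np.log(n) - entropy(p))   # matches C(p, i/n), cf. (20)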
In the Bayesian literature, it is common to maximize an entropy measure to find non-informative priors. Jaynes (1957) was the first to consider the problem of finding a prior distribution that maximizes $H(p)$ subject to certain side conditions, which could be given in the form of some moment restrictions. Jaynes' problem can be stated as follows. Suppose we want to find a least informative probability distribution $p_i = \Pr(Y = y_i)$, $i = 1, 2, \ldots, n$, of a random variable $Y$ satisfying, say, $m$ moment restrictions $E[h_j(Y)] = \mu_j$ with known $\mu_j$'s, $j = 1, 2, \ldots, m$. Jaynes (1957, p.623) found an explicit solution to the problem of maximizing $H(p)$ subject to the above moment conditions and $\sum_{i=1}^{n} p_i = 1$ [for a treatment of this problem under very general conditions see Haberman (1984)]. We can always find some (in fact, many) solutions just by satisfying the constraints; however, maximization of (17) makes the resulting probabilities $p_i$ ($i = 1, 2, \ldots, n$) as smooth as possible. A numerical sketch of this constrained maximization is given at the end of this section. Jaynes' (1957) formulation has been extensively used in the Bayesian literature to find priors that are as noninformative as possible given some prior partial information [see Berger (1985, pp.90-94)]. In recent years econometricians have tried to estimate the parameter(s) of interest, say $\theta$, utilizing only certain moment conditions satisfied by the underlying probability distribution; this approach is known as generalized method of moments (GMM) estimation. The GMM procedure is an extension of Pearson's (1895, 1902) MM to the case where we have more moment restrictions than the dimension of the unknown parameter vector. The GMM estimation technique can also be cast into the information-theoretic approach of maximization of entropy following the empirical likelihood (EL) method of Owen (1988, 1990, 1991) and Qin and Lawless (1994). Back and Brown (1993), Kitamura and Stutzer (1997) and Imbens, Spady and Johnson (1998) developed information-theoretic, entropy-maximization estimation procedures that include GMM as a special case. Therefore, we observe how the seemingly distinct ideas of Pearson's $\chi^2$ test statistic and GMM estimation are tied to the common principle of measuring the distance between two probability distributions through the entropy measure. The modest aim of this review paper is essentially this idea of assimilating distinct estimation methods. In the following two sections we discuss Fisher's (1912, 1922) maximum likelihood estimation (MLE) approach and its efficiency relative to the MM estimation method. The MLE is the forerunner of the currently popular EL approach. We also discuss the minimum $\chi^2$ method of estimation, which is based on the minimization of the Pearson $\chi^2$ statistic. Section 4 proceeds with optimal estimation using an estimating function (EF) approach. In Section 5, we discuss the instrumental variable (IV) and GMM estimation procedures along with their recent variants. Both the EF and GMM approaches were devised in order to handle problems of method of moments estimation where the number of moment restrictions is larger than the number of parameters. The last section provides some concluding remarks. While doing the survey, we also try to provide some personal perspectives on the researchers who contributed to the amazing progress in statistical and econometric estimation techniques that we have witnessed in the last 100 years. We do this since in many instances the original motivation and philosophy of various statistical techniques have become clouded over time. And to the best of our knowledge, these materials have not found a place in econometric textbooks.
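As promised above, here is a small numerical sketch of Jaynes' problem for a single moment restriction. The six-point support and the target mean 4.5 (the classic "loaded die" illustration) are assumptions of the example, not taken from the paper. The sketch uses the well-known exponential form of the maximum-entropy solution, $p_i \propto \exp(\lambda\, h(y_i))$, and finds $\lambda$ by a one-dimensional root search on the moment condition.

# Sketch of Jaynes' maximum-entropy problem: find p on a hypothetical support
# y = 1..6 maximizing H(p) subject to sum(p) = 1 and E[Y] = 4.5.  The solution
# has the exponential form p_i proportional to exp(lam * y_i); we solve for lam.
import numpy as np
from scipy.optimize import brentq

y = np.arange(1, 7, dtype=float)      # hypothetical support points
target_mean = 4.5                     # assumed moment restriction E[Y] = 4.5

def maxent_probs(lam):
    w = np.exp(lam * y)
    return w / w.sum()

def mean_gap(lam):
    # Moment condition: mean under p(lam) minus the target mean.
    return maxent_probs(lam) @ y - target_mean

lam_star = brentq(mean_gap, -10.0, 10.0)     # root of the moment condition
p_star = maxent_probs(lam_star)

print("lambda* =", lam_star)
print("p* =", np.round(p_star, 4))
print("E[Y] under p* =", p_star @ y)
print("H(p*) =", -np.sum(p_star * np.log(p_star)))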
2 Fisher's (1912) maximum likelihood, and the minimum chi-squared methods of estimation

In 1912, when R. A. Fisher published his first mathematical paper, he was a third and final year undergraduate in mathematics and mathematical physics at Gonville and Caius College, Cambridge. It is now hard to envision exactly what prompted Fisher to write this paper. Possibly his tutor, the astronomer F. J. M. Stratton (1881-1960), who lectured on the theory of errors, was the instrumental factor. About Stratton's role, Edwards (1997a, p.36) wrote: "In the Easter Term 1911 he had lectured at the observatory on Calculation of Orbits from Observations, and during the next academic year on Combination of Observations in the Michaelmas Term (1911), the first term of Fisher's third and final undergraduate year. It is very likely that Fisher attended Stratton's lectures and subsequently discussed statistical questions with him during mathematics supervision in College, and he wrote the 1912 paper as a result."^8

The paper started with a criticism of two known methods of curve fitting, least squares and Pearson's MM. In particular, regarding the MM, Fisher (1912, p.156) stated that "a choice has been made without theoretical justification in selecting r equations . . ." Fisher was referring to the equations in (9), though Pearson (1902) defended his choice on the ground that these lower-order moments have the smallest relative variance [see Hald (1998, p.708)].

After disposing of these two methods, Fisher stated "we may solve the real problem directly" and set out to discuss his absolute criterion for fitting frequency curves. He took the probability density function (p.d.f.) $f(y; \theta)$ (using our notation) as an ordinate of the theoretical curve of unit area and, hence, interpreted $f(y; \theta)\,\delta y$ as the chance of an observation falling within the

^8 Fisher (1912) ends with "In conclusion I should like to acknowledge the great kindness of Mr. J.F.M. Stratton, to whose criticism and encouragement the present form of this note is due." It may not be out of place to add that in 1912 Stratton also prodded his young pupil to write directly to Student (William S. Gosset, 1876-1937), and Fisher sent Gosset a rigorous proof of the t-distribution. Gosset was sufficiently impressed to send the proof to Karl Pearson with a covering letter urging him to publish it in Biometrika as a note. Pearson, however, was not impressed and nothing more was heard of Fisher's proof [see Box (1978, pp.71-73) and Lehmann (1999, pp.419-420)]. This correspondence between Fisher and Gosset was the beginning of a lifelong mutual respect and friendship until the death of Gosset.