Available online at www.sciencedirect.com DIRECT Journal of Mathematical ACADEMIC Psychology PRESS Journal of Mathematical Psychology 47(2003)90-100 http://www.elsevier.com/locate/jmp Tutorial Tutorial on maximum likelihood estimation In Jae Myung* Department of Psychology,Ohio State University,1885 Neil Arenue Mall,Columbus,OH 43210-1222,USA Received 30 November 2001:revised 16 October 2002 Abstract In this paper.I provide a tutorial exposition on maximum likelihood estimation(MLE).The intended audience of this tutorial are researchers who practice mathematical modeling of cognition but are unfamiliar with the estimation method.Unlike least-squares estimation which is primarily a descriptive tool,MLE is a preferred method of parameter estimation in statistics and is an indispensable tool for many statistical modeling techniques,in particular in non-linear modeling with non-normal data.The purpose of this paper is to provide a good conceptual explanation of the method with illustrative examples so the reader can have a grasp of some of the basic principles. C 2003 Elsevier Science (USA).All rights reserved. 1.Introduction (i.e.r2),and root mean squared deviation.LSE,which unlike MLE requires no or minimal distributional In psychological science,we seek to uncover general assumptions,is useful for obtaining a descriptive laws and principles that govern the behavior under measure for the purpose of summarizing observed data, investigation.As these laws and principles are not but it has no basis for testing hypotheses or constructing directly observable,they are formulated in terms of confidence intervals. hypotheses.In mathematical modeling,such hypo- On the other hand.MLE is not as widely recognized theses about the structure and inner working of the among modelers in psychology,but it is a standard behavioral process of interest are stated in terms of approach to parameter estimation and inference in parametric families of probability distributions called statistics.MLE has many optimal properties in estima- models.The goal of modeling is to deduce the form of tion:sufficiency (complete information about the para- the underlying process by testing the viability of such meter of interest contained in its MLE estimator); models. consistency (true parameter value that generated the Once a model is specified with its parameters,and data recovered asymptotically,i.e.for data of suffi- data have been collected,one is in a position to evaluate ciently large samples):efficiency (lowest-possible var- its goodness of fit,that is,how well it fits the observed iance of parameter estimates achieved asymptotically); data.Goodness of fit is assessed by finding parameter and parameterization invariance (same MLE solution values of a model that best fits the data-a procedure obtained independent of the parametrization used).In called parameter estimation. contrast,no such things can be said about LSE.As such, There are two general methods of parameter estima- most statisticians would not view LSE as a general tion.They are least-squares estimation (LSE)and method for parameter estimation,but rather as an maximum likelihood estimation (MLE).The former approach that is primarily used with linear regression has been a popular choice of model fitting in psychology models.Further,many of the inference methods in (e.g.,Rubin,Hinton,Wenzel,1999;Lamberts,2000 statistics are developed based on MLE.For example, but see Usher McClelland,2001)and is tied to many MLE is a prerequisite for the chi-square test,the G- familiar statistical concepts such as linear regression, square test,Bayesian methods,inference with missing sum of squares error,proportion variance accounted for data,modeling of random effects,and many model selection criteria such as the Akaike information *Fax:+614-292-5601 criterion (Akaike,1973)and the Bayesian information E-mail address:myung.1@osu.edu criteria (Schwarz,1978). 0022-2496/03/S-see front matter C 2003 Elsevier Science (USA).All rights reserved. doi:10.1016/S0022-2496(02)00028-7
Journal of Mathematical Psychology 47 (2003) 90–100 Tutorial Tutorial on maximum likelihood estimation In Jae Myung* Department of Psychology, Ohio State University, 1885 Neil Avenue Mall, Columbus, OH 43210-1222, USA Received 30 November 2001; revised 16 October 2002 Abstract In this paper, I provide a tutorial exposition on maximum likelihood estimation (MLE). The intended audience of this tutorial are researchers who practice mathematical modeling of cognition but are unfamiliar with the estimation method. Unlike least-squares estimation which is primarily a descriptive tool, MLE is a preferred method of parameter estimation in statistics and is an indispensable tool for many statistical modeling techniques, in particular in non-linear modeling with non-normal data. The purpose of this paper is to provide a good conceptual explanation of the method with illustrative examples so the reader can have a grasp of some of the basic principles. r 2003 Elsevier Science (USA). All rights reserved. 1. Introduction In psychological science, we seekto uncover general laws and principles that govern the behavior under investigation. As these laws and principles are not directly observable, they are formulated in terms of hypotheses. In mathematical modeling, such hypotheses about the structure and inner working of the behavioral process of interest are stated in terms of parametric families of probability distributions called models. The goal of modeling is to deduce the form of the underlying process by testing the viability of such models. Once a model is specified with its parameters, and data have been collected, one is in a position to evaluate its goodness of fit, that is, how well it fits the observed data. Goodness of fit is assessed by finding parameter values of a model that best fits the data—a procedure called parameter estimation. There are two general methods of parameter estimation. They are least-squares estimation (LSE) and maximum likelihood estimation (MLE). The former has been a popular choice of model fitting in psychology (e.g., Rubin, Hinton, & Wenzel, 1999; Lamberts, 2000 but see Usher & McClelland, 2001) and is tied to many familiar statistical concepts such as linear regression, sum of squares error, proportion variance accounted for (i.e. r2), and root mean squared deviation. LSE, which unlike MLE requires no or minimal distributional assumptions, is useful for obtaining a descriptive measure for the purpose of summarizing observed data, but it has no basis for testing hypotheses or constructing confidence intervals. On the other hand, MLE is not as widely recognized among modelers in psychology, but it is a standard approach to parameter estimation and inference in statistics. MLE has many optimal properties in estimation: sufficiency (complete information about the parameter of interest contained in its MLE estimator); consistency (true parameter value that generated the data recovered asymptotically, i.e. for data of suffi- ciently large samples); efficiency (lowest-possible variance of parameter estimates achieved asymptotically); and parameterization invariance (same MLE solution obtained independent of the parametrization used). In contrast, no such things can be said about LSE. As such, most statisticians would not view LSE as a general method for parameter estimation, but rather as an approach that is primarily used with linear regression models. Further, many of the inference methods in statistics are developed based on MLE. For example, MLE is a prerequisite for the chi-square test, the Gsquare test, Bayesian methods, inference with missing data, modeling of random effects, and many model selection criteria such as the Akaike information criterion (Akaike, 1973) and the Bayesian information criteria (Schwarz, 1978). *Fax: +614-292-5601. E-mail address: myung.1@osu.edu. 0022-2496/03/$ - see front matter r 2003 Elsevier Science (USA). All rights reserved. doi:10.1016/S0022-2496(02)00028-7
I.J.Myung Journal of Mathematical Psychology 47 (2003)90-100 91 In this tutorial paper,I introduce the maximum model's parameter.As the parameter changes in value, likelihood estimation method for mathematical model- different probability distributions are generated.For- ing.The paper is written for researchers who are mally,a model is defined as the family of probability primarily involved in empirical work and publish in distributions indexed by the model's parameters. experimental journals (e.g.Journal of Experimental Let f(yw)denote the probability density function Psychology)but do modeling.The paper is intended to (PDF)that specifies the probability of observing data serve as a stepping stone for the modeler to move vector y given the parameter w.Throughout this paper beyond the current practice of using LSE to more we will use a plain letter for a vector (e.g.y)and a letter informed modeling analyses,thereby expanding his or with a subscript for a vector element (e.g.y).The her repertoire of statistical instruments,especially in parameter w=(w1,...,wk)is a vector defined on a non-linear modeling.The purpose of the paper is to multi-dimensional parameter space.If individual ob- provide a good conceptual understanding of the method servations,yi's,are statistically independent of one with concrete examples.For in-depth,technically more another,then according to the theory of probability,the rigorous treatment of the topic,the reader is directed to PDF for the data y=(v1,...,ym)given the parameter other sources (e.g.,Bickel Doksum,1977,Chap.3; vector w can be expressed as a multiplication of PDFs Casella Berger,2002,Chap.7;DeGroot Schervish, for individual observations, 2002,Chap.6;Spanos,1999,Chap.13). f(y=(vI,y2,...,yn)w)=fi(vw)f(v2w) …f(ym w) (1) 2.Model specification To illustrate the idea of a PDF,consider the simplest case with one observation and one parameter,that is, 2.1.Probability density function m=k=1.Suppose that the data y represents the number of successes in a sequence of 10 Bernoulli trials From a statistical standpoint,the data vector y= (e.g.tossing a coin 10 times)and that the probability of (y1,...,ym)is a random sample from an unknown a success on any one trial,represented by the parameter population.The goal of data analysis is to identify the w,is 0.2.The PDF in this case is given by population that is most likely to have generated the 10-n0.2y0.80- 101 sample.In statistics,each population is identified by a f0yln=10,w=0.2)= corresponding probability distribution.Associated with each probability distribution is a unique value of the 0y=0,1,,10) (2) 04 .3 0.2 Data y 0.4 0.3 0.2 1 Data y Fig.1.Binomial probability distributions of sample size n=10 and probability parameter w=0.2 (top)and w=0.7 (bottom)
In this tutorial paper, I introduce the maximum likelihood estimation method for mathematical modeling. The paper is written for researchers who are primarily involved in empirical workand publish in experimental journals (e.g. Journal of Experimental Psychology) but do modeling. The paper is intended to serve as a stepping stone for the modeler to move beyond the current practice of using LSE to more informed modeling analyses, thereby expanding his or her repertoire of statistical instruments, especially in non-linear modeling. The purpose of the paper is to provide a good conceptual understanding of the method with concrete examples. For in-depth, technically more rigorous treatment of the topic, the reader is directed to other sources (e.g., Bickel & Doksum, 1977, Chap. 3; Casella & Berger, 2002, Chap. 7; DeGroot & Schervish, 2002, Chap. 6; Spanos, 1999, Chap. 13). 2. Model specification 2.1. Probability density function From a statistical standpoint, the data vector y ¼ ðy1;y; ymÞ is a random sample from an unknown population. The goal of data analysis is to identify the population that is most likely to have generated the sample. In statistics, each population is identified by a corresponding probability distribution. Associated with each probability distribution is a unique value of the model’s parameter. As the parameter changes in value, different probability distributions are generated. Formally, a model is defined as the family of probability distributions indexed by the model’s parameters. Let fðyjwÞ denote the probability density function (PDF) that specifies the probability of observing data vector y given the parameter w: Throughout this paper we will use a plain letter for a vector (e.g. y) and a letter with a subscript for a vector element (e.g. yi). The parameter w ¼ ðw1;y; wkÞ is a vector defined on a multi-dimensional parameter space. If individual observations, yi’s, are statistically independent of one another, then according to the theory of probability, the PDF for the data y ¼ ðy1;y; ymÞ given the parameter vector w can be expressed as a multiplication of PDFs for individual observations, fðy ¼ ðy1; y2;y; ynÞ j wÞ ¼ f1ðy1 j wÞ f2ðy2 j wÞ ?fnðym j wÞ: ð1Þ To illustrate the idea of a PDF, consider the simplest case with one observation and one parameter, that is, m ¼ k ¼ 1: Suppose that the data y represents the number of successes in a sequence of 10 Bernoulli trials (e.g. tossing a coin 10 times) and that the probability of a success on any one trial, represented by the parameter w; is 0.2. The PDF in this case is given by fðy j n ¼ 10; w ¼ 0:2Þ ¼ 10! y!ð10 yÞ! ð0:2Þ y ð0:8Þ 10y ðy ¼ 0; 1;y; 10Þ ð2Þ Fig. 1. Binomial probability distributions of sample size n ¼ 10 and probability parameter w ¼ 0:2 (top) and w ¼ 0:7 (bottom). I.J. Myung / Journal of Mathematical Psychology 47 (2003) 90–100 91
92 1J.Myung I Journal of Mathematical Psychology 47(2003)90-100 which is known as the binomial distribution with interest,find the one PDF,among all the probability parameters n=10,w=0.2.Note that the number of densities that the model prescribes,that is most likely to trials(n)is considered as a parameter.The shape of this have produced the data.To solve this inverse problem, PDF is shown in the top panel of Fig.1.If the we define the likelihood function by reversing the roles of parameter value is changed to say w=0.7,a new PDF the data vector y and the parameter vector w in f(yw), is obtained as i.e. 10川 f0yln=10,w=0.7)= 10-70.7y(0.3)10- L(wly)=f(ylw). (5) 0=0,1,,10) (3) Thus L(wly)represents the likelihood of the parameter w given the observed data y,and as such is a function of whose shape is shown in the bottom panel of Fig.1.The w.For the one-parameter binomial example in Eq.(4), following is the general expression of the PDF of the the likelihood function for y =7 and n=10 is given by binomial distribution for arbitrary values of w and n: L(w|n=10,y=7)=fy=7|n=10,w) n! f6a,w)=-n"(1-w =73n7(1-w320≤w≤1). 101 (6) (0≤w≤1;y=0,1,,n) (4) The shape of this likelihood function is shown in Fig.2. which as a function ofy specifies the probability of data There exist an important difference between the PDF y for a given value of n and w.The collection of all such f(yw)and the likelihood function L(wly).As illustrated PDFs generated by varying the parameter across its in Figs.I and 2,the two functions are defined on range (0-1 in this case for w,n1)defines a model. different axes,and therefore are not directly comparable to each other.Specifically,the PDF in Fig.I is a 2.2.Likelihood function function of the data given a particular set of parameter values,defined on the data scale.On the other hand,the Given a set of parameter values,the corresponding likelihood function is a function of the parameter given PDF will show that some data are more probable than a particular set of observed data,defined on the other data.In the previous example,the PDF with w= parameter scale.In short,Fig.I tells us the probability 0.2,y=2 is more likely to occur than y=5(0.302 vs. of a particular data value for a fixed parameter,whereas 0.026).In reality,however,we have already observed the Fig.2 tells us the likelihood ("unnormalized probabil- data.Accordingly,we are faced with an inverse ity")of a particular parameter value for a fixed data set. problem:Given the observed data and a model of Note that the likelihood function in this figure is a curve 0.35 0.3 0.25 0.2 0.15 0 0.05 0 0.1 2 030,4650.日070a Parameter w Fig.2.The likelihood function given observed data y=7 and sample size n=10 for the one-parameter model described in the text
which is known as the binomial distribution with parameters n ¼ 10; w ¼ 0:2: Note that the number of trials ðnÞ is considered as a parameter. The shape of this PDF is shown in the top panel of Fig. 1. If the parameter value is changed to say w ¼ 0:7; a new PDF is obtained as fðy j n ¼ 10; w ¼ 0:7Þ ¼ 10! y!ð10 yÞ! ð0:7Þ y ð0:3Þ 10y ðy ¼ 0; 1;y; 10Þ ð3Þ whose shape is shown in the bottom panel of Fig. 1. The following is the general expression of the PDF of the binomial distribution for arbitrary values of w and n: fðyjn; wÞ ¼ n! y!ðn yÞ! wy ð1 wÞ ny ð0pwp1; y ¼ 0; 1;y; nÞ ð4Þ which as a function of y specifies the probability of data y for a given value of n and w: The collection of all such PDFs generated by varying the parameter across its range (0–1 in this case for w; nX1) defines a model. 2.2. Likelihood function Given a set of parameter values, the corresponding PDF will show that some data are more probable than other data. In the previous example, the PDF with w ¼ 0:2; y ¼ 2 is more likely to occur than y ¼ 5 (0.302 vs. 0.026). In reality, however, we have already observed the data. Accordingly, we are faced with an inverse problem: Given the observed data and a model of interest, find the one PDF, among all the probability densities that the model prescribes, that is most likely to have produced the data. To solve this inverse problem, we define the likelihood function by reversing the roles of the data vector y and the parameter vector w in fðyjwÞ; i.e. LðwjyÞ ¼ fðyjwÞ: ð5Þ Thus LðwjyÞ represents the likelihood of the parameter w given the observed data y; and as such is a function of w: For the one-parameter binomial example in Eq. (4), the likelihood function for y ¼ 7 and n ¼ 10 is given by Lðw j n ¼ 10; y ¼ 7Þ ¼ fðy ¼ 7 j n ¼ 10; wÞ ¼ 10! 7!3! w7 ð1 wÞ 3 ð0pwp1Þ: ð6Þ The shape of this likelihood function is shown in Fig. 2. There exist an important difference between the PDF fðyjwÞ and the likelihood function LðwjyÞ: As illustrated in Figs. 1 and 2, the two functions are defined on different axes, and therefore are not directly comparable to each other. Specifically, the PDF in Fig. 1 is a function of the data given a particular set of parameter values, defined on the data scale. On the other hand, the likelihood function is a function of the parameter given a particular set of observed data, defined on the parameter scale. In short, Fig. 1 tells us the probability of a particular data value for a fixed parameter, whereas Fig. 2 tells us the likelihood (‘‘unnormalized probability’’) of a particular parameter value for a fixed data set. Note that the likelihood function in this figure is a curve Fig. 2. The likelihood function given observed data y ¼ 7 and sample size n ¼ 10 for the one-parameter model described in the text. 92 I.J. Myung / Journal of Mathematical Psychology 47 (2003) 90–100
I.J.Myung Journal of Mathematical Psychology 47 (2003)90-100 93 because there is only one parameter beside n,which is at wi=wiMLE for all i=1,...,k.This is because the assumed to be known.If the model has two parameters, definition of maximum or minimum of a continuous the likelihood function will be a surface sitting above the differentiable function implies that its first derivatives parameter space.In general,for a model with k vanish at such points parameters,the likelihood function L(wly)takes the The likelihood equation represents a necessary con- shape of a k-dim geometrical "surface"sitting above a dition for the existence of an MLE estimate.An k-dim hyperplane spanned by the parameter vector w= additional condition must also be satisfied to ensure (1,,wk) that In L(wy)is a maximum and not a minimum,since the first derivative cannot reveal this.To be a maximum, the shape of the log-likelihood function should be 3.Maximum likelihood estimation convex (it must represent a peak,not a valley)in the neighborhood of wMLE.This can be checked by Once data have been collected and the likelihood calculating the second derivatives of the log-likelihoods function of a model given the data is determined,one is and showing whether they are all negative at wi=wi.MLE in a position to make statistical inferences about the fori=1,...,k population,that is,the probability distribution that underlies the data.Given that different parameter values m L(wb)o. (8) Ow? index different probability distributions(Fig.1),we are interested in finding the parameter value that corre- To illustrate the MLE procedure,let us again consider sponds to the desired probability distribution. the previous one-parameter binomial example given a The principle of maximum likelihood estimation fixed value of n.First,by taking the logarithm of the (MLE),originally developed by R.A.Fisher in the likelihood function L(wln =10,y=7)in Eq.(6),we 1920s,states that the desired probability distribution is obtain the log-likelihood as the one that makes the observed data "most likely," 101 which means that one must seek the value of the nL(wln=10,y=7)=n73+7lnw+3ln(1-w9) parameter vector that maximizes the likelihood function Next,the first derivative of the log-likelihood is L(wly).The resulting parameter vector,which is sought calculated as by searching the multi-dimensional parameter space,is called the MLE estimate,and is denoted by wMLE d1nL(w|n=10,y=7)_7_37-10w (10) (WI.MLE,...,W&.MLE).For example,in Fig.2,the MLE dw Γw1-ww(1-w) estimate is wMLE=0.7 for which the maximized like- By requiring this equation to be zero,the desired MLE lihood value is L(wMLE 0.7In 10,y=7)=0.267. estimate is obtained as wMLE =0.7.To make sure that The probability distribution corresponding to this the solution represents a maximum,not a minimum,the MLE estimate is shown in the bottom panel of Fig.1. second derivative of the log-likelihood is calculated and According to the MLE principle,this is the population evaluated at w WMLE, that is most likely to have generated the observed data of y=7.To summarize,maximum likelihood estima- d2In L(wIn =10,y=7) 7 、3 tion is a method to seek the probability distribution that dw2 -(1-w makes the observed data most likely. =-47.62<0 (11) 3.1.Likelihood equation which is negative,as desired. In practice,however,it is usually not possible to MLE estimates need not exist nor be unique.In this obtain an analytic form solution for the MLE estimate, section,we show how to compute MLE estimates when especially when the model involves many parameters and its PDF is highly non-linear.In such situations,the they exist and are unique.For computational conve- nience,the MLE estimate is obtained by maximizing the MLE estimate must be sought numerically using non- log-likelihood function,In L(wly).This is because the linear optimization algorithms.The basic idea of non- two functions,In L(wly)and L(wly),are monotonically linear optimization is to quickly find optimal parameters related to each other so the same MLE estimate is that maximize the log-likelihood.This is done by obtained by maximizing either one.Assuming that the log-likelihood function,In L(wly),is differentiable,if Consider the Hessian matrix H(w)defined as (w)= WMLE exists,it must satisfy the following partial differential equation known as the likelihood equation: L)(.).Then a more accurate test of the convexity owiowi aln L(wly)=0 condition requires that the determinant of H(w)be negative definite. that is,H(w WMLE)z<0 for any kxl real-numbered vector z,where O z'denotes the transpose of z
because there is only one parameter beside n; which is assumed to be known. If the model has two parameters, the likelihood function will be a surface sitting above the parameter space. In general, for a model with k parameters, the likelihood function LðwjyÞ takes the shape of a k-dim geometrical ‘‘surface’’ sitting above a k-dim hyperplane spanned by the parameter vector w ¼ ðw1;y; wkÞ: 3. Maximum likelihood estimation Once data have been collected and the likelihood function of a model given the data is determined, one is in a position to make statistical inferences about the population, that is, the probability distribution that underlies the data. Given that different parameter values index different probability distributions (Fig. 1), we are interested in finding the parameter value that corresponds to the desired probability distribution. The principle of maximum likelihood estimation (MLE), originally developed by R.A. Fisher in the 1920s, states that the desired probability distribution is the one that makes the observed data ‘‘most likely,’’ which means that one must seekthe value of the parameter vector that maximizes the likelihood function LðwjyÞ: The resulting parameter vector, which is sought by searching the multi-dimensional parameter space, is called the MLE estimate, and is denoted by wMLE ¼ ðw1;MLE;y; wk;MLEÞ: For example, in Fig. 2, the MLE estimate is wMLE ¼ 0:7 for which the maximized likelihood value is LðwMLE ¼ 0:7jn ¼ 10; y ¼ 7Þ ¼ 0:267: The probability distribution corresponding to this MLE estimate is shown in the bottom panel of Fig. 1. According to the MLE principle, this is the population that is most likely to have generated the observed data of y ¼ 7: To summarize, maximum likelihood estimation is a method to seekthe probability distribution that makes the observed data most likely. 3.1. Likelihood equation MLE estimates need not exist nor be unique. In this section, we show how to compute MLE estimates when they exist and are unique. For computational convenience, the MLE estimate is obtained by maximizing the log-likelihood function, ln LðwjyÞ: This is because the two functions, ln LðwjyÞ and LðwjyÞ; are monotonically related to each other so the same MLE estimate is obtained by maximizing either one. Assuming that the log-likelihood function, ln LðwjyÞ; is differentiable, if wMLE exists, it must satisfy the following partial differential equation known as the likelihood equation: @ln LðwjyÞ @wi ¼ 0 ð7Þ at wi ¼ wi;MLE for all i ¼ 1;y; k: This is because the definition of maximum or minimum of a continuous differentiable function implies that its first derivatives vanish at such points. The likelihood equation represents a necessary condition for the existence of an MLE estimate. An additional condition must also be satisfied to ensure that ln LðwjyÞ is a maximum and not a minimum, since the first derivative cannot reveal this. To be a maximum, the shape of the log-likelihood function should be convex (it must represent a peak, not a valley) in the neighborhood of wMLE: This can be checked by calculating the second derivatives of the log-likelihoods and showing whether they are all negative at wi ¼ wi;MLE for i ¼ 1;y; k; 1 @2 ln LðwjyÞ @w2 i o0: ð8Þ To illustrate the MLE procedure, let us again consider the previous one-parameter binomial example given a fixed value of n: First, by taking the logarithm of the likelihood function Lðwjn ¼ 10; y ¼ 7Þ in Eq. (6), we obtain the log-likelihood as ln Lðw j n ¼ 10; y ¼ 7Þ ¼ ln 10! 7!3! þ 7 ln w þ 3 lnð1 wÞð:9Þ Next, the first derivative of the log-likelihood is calculated as d ln Lðw j n ¼ 10; y ¼ 7Þ dw ¼ 7 w 3 1 w ¼ 7 10w wð1 wÞ : ð10Þ By requiring this equation to be zero, the desired MLE estimate is obtained as wMLE ¼ 0:7: To make sure that the solution represents a maximum, not a minimum, the second derivative of the log-likelihood is calculated and evaluated at w ¼ wMLE; d2 ln Lðw j n ¼ 10; y ¼ 7Þ dw2 ¼ 7 w2 3 ð1 wÞ 2 ¼ 47:62o0 ð11Þ which is negative, as desired. In practice, however, it is usually not possible to obtain an analytic form solution for the MLE estimate, especially when the model involves many parameters and its PDF is highly non-linear. In such situations, the MLE estimate must be sought numerically using nonlinear optimization algorithms. The basic idea of nonlinear optimization is to quickly find optimal parameters that maximize the log-likelihood. This is done by 1Consider the Hessian matrix HðwÞ defined as HijðwÞ ¼ @2 ln LðwÞ @wi@wj ði; j ¼ 1;y; kÞ: Then a more accurate test of the convexity condition requires that the determinant of HðwÞ be negative definite, that is, z0 Hðw ¼ wMLEÞzo0 for any kx1 real-numbered vector z; where z0 denotes the transpose of z: I.J. Myung / Journal of Mathematical Psychology 47 (2003) 90–100 93
94 I.J.Myung Journal of Mathematical Psychology 47 (2003)90-100 a1 a2 83 Parameter w Fig.3.A schematic plot of the log-likelihood function for a fictitious one-parameter model.Point B is the global maximum whereas points A and C are two local maxima.The series of arrows depicts an iterative optimization process searching much smaller sub-sets of the multi-dimen- tries to improve upon an initial set of parameters that is sional parameter space rather than exhaustively search- supplied by the user.Initial parameter values are chosen ing the whole parameter space,which becomes either at random or by guessing.Depending upon the intractable as the number of parameters increases.The choice of the initial parameter values,the algorithm "intelligent"search proceeds by trial and error over the could prematurely stop and return a sub-optimal set of course of a series of iterative steps.Specifically,on each parameter values.This is called the local maxima iteration,by taking into account the results from the problem.As an example,in Fig.3 note that although previous iteration,a new set of parameter values is the starting parameter value at point a2 will lead to the obtained by adding small changes to the previous optimal point B called the global maximum,the starting parameters in such a way that the new parameters are parameter value at point al will lead to point A,which is likely to lead to improved performance.Different a sub-optimal solution.Similarly,the starting parameter optimization algorithms differ in how this updating value at a3 will lead to another sub-optimal solution at routine is conducted.The iterative process,as shown by point C. a series of arrows in Fig.3,continues until the Unfortunately,there exists no general solution to the parameters are judged to have converged (i.e.,point B local maximum problem.Instead,a variety of techni- in Fig.3)on the optimal set of parameters on an ques have been developed in an attempt to avoid the appropriately predefined criterion.Examples of the problem,though there is no guarantee of their stopping criterion include the maximum number of effectiveness.For example,one may choose different iterations allowed or the minimum amount of change in starting values over multiple runs of the iteration parameter values between two successive iterations. procedure and then examine the results to see whether the same solution is obtained repeatedly.When that happens,one can conclude with some confidence that a 3.2.Local maxima global maximum has been found.2 It is worth noting that the optimization algorithm does not necessarily guarantee that a set of parameter 2A stochastic optimization algorithm known as simulated annealing values that uniquely maximizes the log-likelihood will be (Kirkpatrick,Gelatt,Vecchi.1983)can overcome the local maxima problem,at least in theory,though the algorithm may not be a feasible found.Finding optimum parameters is essentially a option in practice as it may take an realistically long time to find the heuristic process in which the optimization algorithm solution
searching much smaller sub-sets of the multi-dimensional parameter space rather than exhaustively searching the whole parameter space, which becomes intractable as the number of parameters increases. The ‘‘intelligent’’ search proceeds by trial and error over the course of a series of iterative steps. Specifically, on each iteration, by taking into account the results from the previous iteration, a new set of parameter values is obtained by adding small changes to the previous parameters in such a way that the new parameters are likely to lead to improved performance. Different optimization algorithms differ in how this updating routine is conducted. The iterative process, as shown by a series of arrows in Fig. 3, continues until the parameters are judged to have converged (i.e., point B in Fig. 3) on the optimal set of parameters on an appropriately predefined criterion. Examples of the stopping criterion include the maximum number of iterations allowed or the minimum amount of change in parameter values between two successive iterations. 3.2. Local maxima It is worth noting that the optimization algorithm does not necessarily guarantee that a set of parameter values that uniquely maximizes the log-likelihood will be found. Finding optimum parameters is essentially a heuristic process in which the optimization algorithm tries to improve upon an initial set of parameters that is supplied by the user. Initial parameter values are chosen either at random or by guessing. Depending upon the choice of the initial parameter values, the algorithm could prematurely stop and return a sub-optimal set of parameter values. This is called the local maxima problem. As an example, in Fig. 3 note that although the starting parameter value at point a2 will lead to the optimal point B called the global maximum, the starting parameter value at point a1 will lead to point A, which is a sub-optimal solution. Similarly, the starting parameter value at a3 will lead to another sub-optimal solution at point C. Unfortunately, there exists no general solution to the local maximum problem. Instead, a variety of techniques have been developed in an attempt to avoid the problem, though there is no guarantee of their effectiveness. For example, one may choose different starting values over multiple runs of the iteration procedure and then examine the results to see whether the same solution is obtained repeatedly. When that happens, one can conclude with some confidence that a global maximum has been found.2 Fig. 3. A schematic plot of the log-likelihood function for a fictitious one-parameter model. Point B is the global maximum whereas points A and C are two local maxima. The series of arrows depicts an iterative optimization process. 2A stochastic optimization algorithm known as simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983) can overcome the local maxima problem, at least in theory, though the algorithm may not be a feasible option in practice as it may take an realistically long time to find the solution. 94 I.J. Myung / Journal of Mathematical Psychology 47 (2003) 90–100
1.J.Myung I Journal of Mathematical Psychology 47(2003)90-100 95 3.3.Relation to least-squares estimation time,and p(w,t)the model's prediction of the prob- ability of correct recall at time t.The two models are Recall that in MLE we seek the parameter values that defined as are most likely to have produced the data.In LSE,on power model:p(w,t)=wit-"2 (wI,w2>0), the other hand,we seek the parameter values that provide the most accurate description of the data, exponential model p(w,t)=wi exp(-w2t) (13) measured in terms of how closely the model fits the (w1,w2>0) data under the square-loss function.Formally,in LSE, the sum of squares error(SSE)between observations and Suppose that data y=(v1,...,ym)consists of m predictions is minimized: observations in which yi(0syis1)represents an ob- served proportion of correct recall at time ti(i= 77 SSE(w)= CGy-prd4(w)》2, (12) 1,...,m).We are interested in testing the viability of i=l these models.We do this by fitting each to observed data and examining its goodness of fit. where prdi(w)denotes the model's prediction for the ith Application of MLE requires specification of the PDF observation.Note that SSE(w)is a function of the f(yw)of the data under each model.To do this,first we parameter vector w=(wi,...,wk). note that each observed proportion yi is obtained by As in MLE,finding the parameter values that dividing the number of correct responses (xi)by the minimize SSE generally requires use of a non-linear total number of independent trials (n),yi= optimization algorithm.Minimization of LSE is also xi/n (0syis1)We then note that each xi is binomially subject to the local minima problem,especially when the distributed with probability p(w,t)so that the PDFs for model is non-linear with respect to its parameters.The the power model and the exponential model are choice between the two methods of estimation can have obtained as non-trivial consequences.In general,LSE estimates tend l to differ from MLE estimates,especially for data that power: are not normally distributed such as proportion correct f:n w)=(n and response time.An implication is that one might (w1店")(1-wt)”- possibly arrive at different conclusions about the same n! data set depending upon which method of estimation is exponential f(xiln,w)= (14) (n-xi)!xi! employed in analyzing the data.When this occurs,MLE should be preferred to LSE,unless the probability (w1 exp(-w2ti)) density function is unknown or difficult to obtain in an (1-w1exp(-w2t)"- easily computable form,for instance,for the diffusion wherexi=0,1,...,n,i=1,...,m. model of recognition memory (Ratcliff,1978).3 There is There are two points to be made regarding the PDFs a situation,however,in which the two methods in the above equation.First,the probability parameter intersect.This is when observations are independent of of a binomial probability distribution (i.e.w in Eq.(4)) one another and are normally distributed with a is being modeled.Therefore,the PDF for each model in constant variance.In this case,maximization of the Eq.(14)is obtained by simply replacing the probability log-likelihood is equivalent to minimization of SSE,and parameter w in Eq.(4)with the model equation,p(w,t),in therefore,the same parameter values are obtained under Eq.(13).Second,note that yi is related to xi by a fixed either MLE or LSE. scaling constant,1/n.As such,any statistical conclusion regardingx;is applicable directly to yi,except for the scale transformation.In particular,the PDF for yi,f(vin,w), 4.Illustrative example is obtained by simply replacing xi in f(xin,w)with nyi. Now,assuming that xi's are statistically independent In this section,I present an application example of of one another,the desired log-likelihood function for maximum likelihood estimation.To illustrate the the power model is given by method,I chose forgetting data given the recent surge of interest in this topic (e.g.Rubin Wenzel,1996; In L(w =(w1,w2)n,x) Wickens,1998;Wixted Ebbesen,1991). =ln(f(x1ln,w)·f(x2ln,w)…f(xmln,w)】 Among a half-dozen retention functions that have been proposed and tested in the past,I provide an ∑nf(xi,wm example of MLE for the two functions,power and exponential.Let w=(wi.w2)be the parameter vector, ∑6xn(m)+a-切n-wm6") 3For this model,the PDF is expressed as an infinite sum of ranscendental functions +lnml-ln(n-x!-lnx) (15)
3.3. Relation to least-squares estimation Recall that in MLE we seekthe parameter values that are most likely to have produced the data. In LSE, on the other hand, we seekthe parameter values that provide the most accurate description of the data, measured in terms of how closely the model fits the data under the square-loss function. Formally, in LSE, the sum of squares error (SSE) between observations and predictions is minimized: SSEðwÞ ¼ Xm i¼1 ðyi prdiðwÞÞ2 ; ð12Þ where prdiðwÞ denotes the model’s prediction for the ith observation. Note that SSEðwÞ is a function of the parameter vector w ¼ ðw1;y; wkÞ: As in MLE, finding the parameter values that minimize SSE generally requires use of a non-linear optimization algorithm. Minimization of LSE is also subject to the local minima problem, especially when the model is non-linear with respect to its parameters. The choice between the two methods of estimation can have non-trivial consequences. In general, LSE estimates tend to differ from MLE estimates, especially for data that are not normally distributed such as proportion correct and response time. An implication is that one might possibly arrive at different conclusions about the same data set depending upon which method of estimation is employed in analyzing the data. When this occurs, MLE should be preferred to LSE, unless the probability density function is unknown or difficult to obtain in an easily computable form, for instance, for the diffusion model of recognition memory (Ratcliff, 1978).3 There is a situation, however, in which the two methods intersect. This is when observations are independent of one another and are normally distributed with a constant variance. In this case, maximization of the log-likelihood is equivalent to minimization of SSE, and therefore, the same parameter values are obtained under either MLE or LSE. 4. Illustrative example In this section, I present an application example of maximum likelihood estimation. To illustrate the method, I chose forgetting data given the recent surge of interest in this topic (e.g. Rubin & Wenzel, 1996; Wickens, 1998; Wixted & Ebbesen, 1991). Among a half-dozen retention functions that have been proposed and tested in the past, I provide an example of MLE for the two functions, power and exponential. Let w ¼ ðw1;w2Þ be the parameter vector, t time, and pðw; tÞ the model’s prediction of the probability of correct recall at time t: The two models are defined as power model : pðw; tÞ ¼ w1t w2 ðw1; w240Þ; exponential model : pðw; tÞ ¼ w1 expðw2tÞ ðw1; w240Þ: ð13Þ Suppose that data y ¼ ðy1;y; ymÞ consists of m observations in which yið0pyip1Þ represents an observed proportion of correct recall at time ti ði ¼ 1;y; mÞ: We are interested in testing the viability of these models. We do this by fitting each to observed data and examining its goodness of fit. Application of MLE requires specification of the PDF fðyjwÞ of the data under each model. To do this, first we note that each observed proportion yi is obtained by dividing the number of correct responses ðxiÞ by the total number of independent trials ðnÞ; yi ¼ xi=n ð0pyip1Þ We then note that each xi is binomially distributed with probability pðw; tÞ so that the PDFs for the power model and the exponential model are obtained as power : fðxi j n; wÞ ¼ n! ðn xiÞ!xi! ðw1t w2 i Þ xi ð1 w1t w2 i Þ nxi ; exponential : fðxi j n; wÞ ¼ n! ðn xiÞ!xi! ðw1 expðw2tiÞÞxi ð1 w1 expðw2tiÞÞnxi ; ð14Þ where xi ¼ 0; 1;y; n; i ¼ 1;y; m: There are two points to be made regarding the PDFs in the above equation. First, the probability parameter of a binomial probability distribution (i.e. w in Eq. (4)) is being modeled. Therefore, the PDF for each model in Eq. (14) is obtained by simply replacing the probability parameter w in Eq. (4) with the model equation, pðw; tÞ; in Eq. (13). Second, note that yi is related to xi by a fixed scaling constant, 1=n: As such, any statistical conclusion regarding xi is applicable directly to yi; except for the scale transformation. In particular, the PDF for yi; fðyijn; wÞ; is obtained by simply replacing xi in fðxijn; wÞ with nyi: Now, assuming that xi’s are statistically independent of one another, the desired log-likelihood function for the power model is given by ln Lðw ¼ ðw1; w2Þjn; xÞ ¼ lnðfðx1jn; wÞ fðx2 j n; wÞ?fðxm j n; wÞÞ ¼ Xm i¼1 ln fðxijn;wÞ ¼ Xm i¼1 ðxi lnðw1t w2 i Þþðn xiÞ lnð1 w1t w2 i Þ þ ln n! lnðn xiÞ! ln xi!Þ: ð15Þ 3 For this model, the PDF is expressed as an infinite sum of transcendental functions. I.J. Myung / Journal of Mathematical Psychology 47 (2003) 90–100 95
96 I.J.Myung Journal of Mathematical Psychology 47(2003)90-100 Power Expanential Data 0. > 0.5 0.4 0.3 0.2 0.1 2 B10 14 18 18 20 Time in Seconds Fig.4.Modeling forgetting data.Squares represent the data in Murdock (1961).The thick (respectively,thin)curves are best fits by the power (respectively,exponential)models. Table 1 Summary fits of Murdock(1961)data for the power and exponential models under the maximum likelihood estimation(MLE)method and the least- squares estimation (LSE)method. MLE LSE Power Exponential Power Exponential Loglik/SSE() -313.37(0.886) -305.31(0.963) 0.0540(0.894) 0.0169(0.967) Parameter wi 0.953 1.070 1.003 1.092 Parameter w2 0.498 0.131 0.511 0.141 Note:For each model fitted,the first row shows the maximized log-likelihood value for MLE and the minimized sum of squares error value for LSE. Each number in the parenthesis is the proportion of variance accounted for(i.e.r2)in that case.The second and third rows show MLE and LSE parameter estimates for each of w and w2.The above results were obtained using Matlab code described in the appendix. This quantity is to be maximized with respect to the yield the observed data (v1,....)=(0.94,0.77,0.40, two parameters,w and w2.It is worth noting that the 0.26,0.24,0.16),from which the number of correct last three terms of the final expression in the above responses,xi,is obtained as 100yi,i=1,...,6.In equation (i.e.,In n!-In(n-xi)!-Inxi!)do not depend Fig.4,the proportion recall data are shown as squares. upon the parameter vector,thereby do not affecting the The curves in Fig.4 are best fits obtained under MLE. MLE results.Accordingly,these terms can be ignored, Table 1 summarizes the MLE results,including fit and their values are often omitted in the calculation of measures and parameter estimates,and also include the the log-likelihood.Similarly,for the exponential model, LSE results,for comparison.Matlab code used for the its log-likelihood function can be obtained from Eq.(15) calculations is included in the appendix. by substituting wi exp(-w2ti)for wit2. The results in Table 1 indicate that under either In illustrating MLE,I used a data set from Murdock method of estimation,the exponential model fit better (1961).In this experiment subjects were presented with a than the power model.That is,for the former,the log- set of words or letters and were asked to recall the items likelihood was larger and the SSE smaller than for the after six different retention intervals,(t1,...,16)= latter.The same conclusion can be drawn even in terms (1,3,6,9,12,18)in seconds and thus,m=6.The of r2.Also note the appreciable discrepancies in proportion recall at each retention interval was calcu- parameter estimate between MLE and LSE.These lated based on 100 independent trials (i.e.n=100)to differences are not unexpected and are due to the fact
This quantity is to be maximized with respect to the two parameters, w1 and w2: It is worth noting that the last three terms of the final expression in the above equation (i.e., ln n! lnðn xiÞ! ln xi!) do not depend upon the parameter vector, thereby do not affecting the MLE results. Accordingly, these terms can be ignored, and their values are often omitted in the calculation of the log-likelihood. Similarly, for the exponential model, its log-likelihood function can be obtained from Eq. (15) by substituting w1 expðw2tiÞ for w1t w2 i : In illustrating MLE, I used a data set from Murdock (1961). In this experiment subjects were presented with a set of words or letters and were asked to recall the items after six different retention intervals, ðt1;y; t6Þ ¼ ð1; 3; 6; 9; 12; 18Þ in seconds and thus, m ¼ 6: The proportion recall at each retention interval was calculated based on 100 independent trials (i.e. n ¼ 100) to yield the observed data ðy1;y; y6Þ¼ð0:94; 0:77; 0:40; 0:26; 0:24; 0:16Þ; from which the number of correct responses, xi; is obtained as 100yi; i ¼ 1;y; 6: In Fig. 4, the proportion recall data are shown as squares. The curves in Fig. 4 are best fits obtained under MLE. Table 1 summarizes the MLE results, including fit measures and parameter estimates, and also include the LSE results, for comparison. Matlab code used for the calculations is included in the appendix. The results in Table 1 indicate that under either method of estimation, the exponential model fit better than the power model. That is, for the former, the loglikelihood was larger and the SSE smaller than for the latter. The same conclusion can be drawn even in terms of r2: Also note the appreciable discrepancies in parameter estimate between MLE and LSE. These differences are not unexpected and are due to the fact Fig. 4. Modeling forgetting data. Squares represent the data in Murdock(1961). The thick(respectively, thin) curves are best fits by the power (respectively, exponential) models. Table 1 Summary fits of Murdock(1961) data for the power and exponential models under the maximum likelihood estimation (MLE) method and the leastsquares estimation (LSE) method. MLE LSE Power Exponential Power Exponential Loglik/SSE ðr2Þ 313:37 ð0:886Þ 305:31 ð0:963Þ 0.0540 (0.894) 0.0169 (0.967) Parameter w1 0.953 1.070 1.003 1.092 Parameter w2 0.498 0.131 0.511 0.141 Note: For each model fitted, the first row shows the maximized log-likelihood value for MLE and the minimized sum of squares error value for LSE. Each number in the parenthesis is the proportion of variance accounted for (i.e. r2) in that case. The second and third rows show MLE and LSE parameter estimates for each of w1 and w2: The above results were obtained using Matlab code described in the appendix. 96 I.J. Myung / Journal of Mathematical Psychology 47 (2003) 90–100
I.J.Myung Journal of Mathematical Psychology 47 (2003)90-100 97 that the proportion data are binomially distributed,not issues in model selection,the reader is referred elsewhere normally distributed.Further,the constant variance assump- (e.g.Linhart Zucchini,1986;Myung,Forster, tion required for the equivalence between MLE and LSE Browne,2000;Pitt,Myung,Zhang,2002). does not hold for binomial data for which the variance,2= np(1-p),depends upon proportion correct p. 5.Concluding remarks 4.1.MLE interpretation This article provides a tutorial exposition of max- imum likelihood estimation.MLE is of fundamental What does it mean when one model fits the data better importance in the theory of inference and is a basis of than does a competitor model?It is important not to many inferential techniques in statistics.unlike LSE. jump to the conclusion that the former model does a which is primarily a descriptive tool.In this paper,I better job of capturing the underlying process and provide a simple,intuitive explanation of the method so therefore represents a closer approximation to the true that the reader can have a grasp of some of the basic model that generated the data.A good fit is a necessary, principles.I hope the reader will apply the method in his but not a sufficient,condition for such a conclusion.A or her mathematical modeling efforts so a plethora of superior fit (i.e.,higher value of the maximized log- widely available MLE-based analyses (e.g.Batchelder likelihood)merely puts the model in a list of candidate Crowther,1997;Van Zandt,2000)can be performed on models for further consideration.This is because a model data,thereby extracting as much information and can achieve a superior fit to its competitors for reasons insight as possible into the underlying mental process that have nothing to do with the model's fidelity to the under investigation. underlying process.For example,it is well established in statistics that a complex model with many parameters fits data better than a simple model with few parameters, Acknowledgments even if it is the latter that generated the data.The central question is then how one should decide among a set of This work was supported by research Grant R0l competing models.A short answer is that a model should MH57472 from the National Institute of Mental Health. be selected based on its generalizability,which is defined The author thanks Mark Pitt.Richard Schweickert.and as a model's ability to fit current data but also to predict two anonymous reviewers for valuable comments on future data.For a thorough treatment of this and related earlier versions of this paper. Appendix This appendix presents Matlab code that performs MLE and LSE analyses for the example described in the text. Matlab Code for MLE This is the main program that finds MLE estimates.Given a model,it takes sample size (n),time intervals (t)and observed proportion correct %(y)as inputs.It returns the parameter values that maximize the log- likelihood function global n t x;%define global variables opts=optimset(DerivativeCheck',‘off','Display',‘off',‘TolX',1e-6,‘TolFun',le-6, ‘Diagnostics',‘off',‘MaxIter',200,LargeScale',‘off'); option settings for optimization algorithm n =100;%number of independent Bernoulli trials (i.e.,sample size) t=1 3 6 9 12 18:%time intervals as a column vector y [.94 .77.40.26.24.16];%observed proportion correct as a column vector x=ny:%number of correct responses init-w rand(2,1);%starting parameter values low-w zeros(2,1):%parameter lower bounds up_w =100ones(2,1):%parameter upper bounds while 1, [w1,lik1,exit1]fmincon ('power-mle',init-w,[],],][],low-w,up-w,[,opts);
that the proportion data are binomially distributed, not normally distributed. Further, the constant variance assumption required for the equivalence between MLE and LSE does not hold for binomial data for which the variance, s2 ¼ npð1 pÞ; depends upon proportion correct p: 4.1. MLE interpretation What does it mean when one model fits the data better than does a competitor model? It is important not to jump to the conclusion that the former model does a better job of capturing the underlying process and therefore represents a closer approximation to the true model that generated the data. A good fit is a necessary, but not a sufficient, condition for such a conclusion. A superior fit (i.e., higher value of the maximized loglikelihood) merely puts the model in a list of candidate models for further consideration. This is because a model can achieve a superior fit to its competitors for reasons that have nothing to do with the model’s fidelity to the underlying process. For example, it is well established in statistics that a complex model with many parameters fits data better than a simple model with few parameters, even if it is the latter that generated the data. The central question is then how one should decide among a set of competing models. A short answer is that a model should be selected based on its generalizability, which is defined as a model’s ability to fit current data but also to predict future data. For a thorough treatment of this and related issues in model selection, the reader is referred elsewhere (e.g. Linhart & Zucchini, 1986; Myung, Forster, & Browne, 2000; Pitt, Myung, & Zhang, 2002). 5. Concluding remarks This article provides a tutorial exposition of maximum likelihood estimation. MLE is of fundamental importance in the theory of inference and is a basis of many inferential techniques in statistics, unlike LSE, which is primarily a descriptive tool. In this paper, I provide a simple, intuitive explanation of the method so that the reader can have a grasp of some of the basic principles. I hope the reader will apply the method in his or her mathematical modeling efforts so a plethora of widely available MLE-based analyses (e.g. Batchelder & Crowther, 1997; Van Zandt, 2000) can be performed on data, thereby extracting as much information and insight as possible into the underlying mental process under investigation. Acknowledgments This workwas supported by research Grant R01 MH57472 from the National Institute of Mental Health. The author thanks Mark Pitt, Richard Schweickert, and two anonymous reviewers for valuable comments on earlier versions of this paper. Appendix This appendix presents Matlab code that performs MLE and LSE analyses for the example described in the text. Matlab Code for MLE % This is the main program that finds MLE estimates. Given a model, it % takes sample size (n), time intervals (t) and observed proportion correct % (y) as inputs. It returns the parameter values that maximize the log- % likelihood function global n t x; % define global variables opts ¼ optimset (‘DerivativeCheck’,‘off’,’Display’,‘off’,‘TolX’,1e-6,‘TolFun’,1e-6, ‘Diagnostics’,‘off’,‘MaxIter’,200,LargeScale’,‘off’); % option settings for optimization algorithm n ¼ 100 ;% number of independent Bernoulli trials (i.e., sample size) t ¼ ½1 3 6 9 12 18 0 ;% time intervals as a column vector y ¼ ½:94 :77 :40 :26 :24 :16 0 ;% observed proportion correct as a column vector x ¼ nny;% number of correct responses init w ¼ randð2; 1Þ;% starting parameter values low w ¼ zerosð2; 1Þ;% parameter lower bounds up w ¼ 100nonesð2; 1Þ;% parameter upper bounds while 1, ½w1; lik1; exit1 ¼ fmincon (‘power mle’,init w,[],[],[],[],low w,up w,[],opts); I.J. Myung / Journal of Mathematical Psychology 47 (2003) 90–100 97
98 1.J.Myung Journal of Mathematical Psychology 47(2003)90-100 %optimization for power model that minimizes minus log-likelihood (note that minimization of minus log-likelihood is equivalent to maximization of log-likelihood) %w1:MLE parameter estimates lik1:maximized log-likelihood value exit1:optimization has converged if exit1 >0 or not otherwise [w2,lik2,exit2]FMINCON(EXPO_MLE',INIT_W,[,[0,[,LOW_W,UP_W,[OPTS); optimization for exponential model that minimizes minus log-likelihood prd1=w1(1,1)t.(-w1(2,1)):%best fit prediction by power model r2(1,1)=1-sum((prd1-y).2)/sum((y-mean(y)).2);r2 for power model prd2=w2(1,1)"exp(-w2(2,1)"t):%best fit prediction by exponential model r2(2,1)=1-sum((prd2-y).2)/sum((y-mean(y)).2);%r2 for exponential model if sum(r2>0)==2 break; else init_w=rand(2,1); end; end; format long; disp(num2str([w1 w2 r2],5));%display results disp(num2str([lik1 lik2 exit1 exit2],5));%display results end end of the main program function loglik power_mle(w) POWER_MLE The log-likelihood function of the power model global n t x; p=w(1,1)t.(-w(2,1)):%power model prediction given parameter p=p+(p==zeros(6,1)*1e-5-(p==ones(6,1)*1e-5;%ensure0<p<1 1og1ik=(-1)*(x.*1og(p)+(血-x.*1og(1-p): minus log-likelihood for individual observations loglik sum(loglik);%overall minus log-likelihood being minimized function loglik expo_mle(w) EXPO_MLE The log-likelihood function of the exponential model global n t x; p=w(1,1)"exp(-w(2,1)t):%exponential model prediction p=p+(p==zeros(6,1)*1e-5-(p==ones(6,1)*1e-5;%ensure0<p<1 1og1ik=(-1)*(x.*1og(p)+(n-x).*1og(1p) minus log-likelihood for individual observations loglik sum(loglik);%overall minus log-likelihood being minimized Matlab Code for LSE This is the main program that finds LSE estimates.Given a model,it takes sample size (n),time intervals (t)and observed proportion correct %(y)as inputs.It returns the parameter values that minimize the sum of squares error global t;define global variable opts=optimset(‘DerivativeCheck',‘off',‘Display',‘off',‘TolX',1e-6,‘TolFun',le-6,Diagnostic- s',off','MaxIter',200,'LargeScale','off'); option settings for optimization algorithm n=100;%number of independent binomial trials (i.e.,sample size) t =[1 3 6 9 12 18)%time intervals as a column vector
% optimization for power model that minimizes minus log-likelihood (note that minimization of minus log-likelihood is equivalent to maximization of log-likelihood) % w1: MLE parameter estimates % lik1: maximized log-likelihood value % exit1: optimization has converged if exit1 40 or not otherwise ½w2; lik2; exit2 ¼ FMINCONð‘EXPO MLE’; INIT W;½ ; ½ ; ½ ; ½ ; LOW W; UP W; ½ ; OPTSÞ; % optimization for exponential model that minimizes minus log-likelihood prd1 ¼ w1ð1; 1Þ n t: # ð-w1ð2; 1ÞÞ;% best fit prediction by power model r2ð1; 1Þ ¼ 1-sumððprd1-yÞ: # 2Þ=sumððy-meanðyÞÞ: # 2Þ; % r# 2 for power model prd2 ¼ w2ð1; 1Þ n expð-w2ð2; 1Þ n tÞ;% best fit prediction by exponential model r2ð2; 1Þ ¼ 1-sumððprd2-yÞ: # 2Þ=sumððy-meanðyÞÞ: # 2Þ; %r# 2 for exponential model if sumðr240Þ ¼¼ 2 break; else init w ¼ randð2; 1Þ; end; end; format long; disp(num2str([w1 w2 r2],5));% display results disp(num2str([lik1 lik2 exit1 exit2],5));% display results end % end of the main program function loglik ¼ power mleðwÞ % POWER MLE The log-likelihood function of the power model global n t x; p ¼ wð1; 1Þ n t: # ð-wð2; 1ÞÞ;% power model prediction given parameter p ¼ p þ ðp ¼¼ zerosð6; 1ÞÞn 1e-5 ðp ¼¼ onesð6; 1ÞÞn 1e-5; % ensure 0opo1 loglik ¼ ð1Þ n ðx: * logðpÞþðn-xÞ: * logð1-pÞÞ; % minus log-likelihood for individual observations loglik ¼ sumðloglikÞ;% overall minus log-likelihood being minimized function loglik ¼ expo mleðwÞ % EXPO MLE The log-likelihood function of the exponential model global n t x; p ¼ wð1; 1Þ n expð-wð2; 1Þ n tÞ;% exponential model prediction p ¼ p þ ðp ¼¼ zerosð6; 1ÞÞn 1e-5 ðp ¼¼ onesð6; 1ÞÞn 1e-5; % ensure 0opo1 loglik ¼ ð1Þ n ðx: * logðpÞþðn-xÞ: * logð1pÞÞ; % minus log-likelihood for individual observations loglik ¼ sumðloglikÞ;% overall minus log-likelihood being minimized Matlab Code for LSE % This is the main program that finds LSE estimates. Given a model, it % takes sample size (n), time intervals (t) and observed proportion correct % (y) as inputs. It returns the parameter values that minimize the sum of % squares error global t; % define global variable opts ¼ optimset(‘DerivativeCheck’,‘off’,‘Display’,‘off’,‘TolX’,1e-6,‘TolFun’,1e-6, ‘Diagnostics’,‘off’,‘MaxIter’,200,‘LargeScale’,‘off’); % option settings for optimization algorithm n ¼ 100; % number of independent binomial trials (i.e., sample size) t ¼ ½1 3 6 9 12 18 0 ;% time intervals as a column vector 98 I.J. Myung / Journal of Mathematical Psychology 47 (2003) 90–100
1.J.Myung I Journal of Mathematical Psychology 47(2003)90-100 99 y=[.94.77 .40.26.24.16];%observed proportion correct as a column vector init_w=rand(2,1);%starting parameter values low-w zeros(2,1);%parameter lower bounds up-w =100*ones(2,1):%parameter upper bounds [w1,sse1,res1,exit1]lsqnonlin('power_lse',init_w,low-w,up-w,opts,y); optimization for power model %w1:LSE estimates sse1:minimized SSE value %res1:value of the residual at the solution exit1:optimization has converged if exit1 >0 or not otherwise [w2,sse2,res2,exit2]=1sqnonlin('expo_lse',init-w,low-w,up-w,opts,y); optimization for exponential model r2(1,1)=1-sse1/sum((y-mean(y)).2);r.2 for power model r2(2,1)=1-sse2/sum((y-mean(y)).2);r2 for exponential model format long; disp(num2str([w1 w2 r2],5));%display out results disp(num2str([sse1 sse2 exit1 exi2],5));%display out results end end of the main program function dev power_lse(w,y) POWER LSE The deviation between observation and prediction of the power model global t; p=w(1,1)t(-w(2,1));%power model prediction dev=p-y; deviation between prediction and observation,the square of which is being minimized function dev expo_lse(w,y) %EXPO_LSE The deviation between observation and prediction of the exponential model global t; p=w(1,1)exp(-w(2,1)*t):%exponential model prediction dev =p-y; deviation between prediction and observation,the square of which is being minimized References Lamberts,K.(2000).Information-accumulation theory of speeded categorization.Psychological Review,107(2).227-260. Akaike,H.(1973).Information theory and an extension of Linhart,H..Zucchini,W.(1986).Model selection.New York,NY: the maximum likelihood principle.In:Petrox,B.N.,Caski, Wiley. F.Second international symposium on information theory Murdock Jr..B.B.(1961).The retention of individual items.Journal of (pp.267-281).Budapest:Akademiai Kiado. Experimental Psychology,62,618-625. Batchelder,W.H.,Crowther,C.S.(1997).Multinomial processing Myung,I.J.,Forster,M.,Browne,M.W.(2000).Special issue tree models of factorial categorization.Journal of Mathematical on model selection.Journal of Mathematical Psychology,44, Psychology,41.45-55. 1-2. Bickel,P.J..&Doksum,K.A.(1977).Mathematical statistics. Pitt,M.A..Myung.I.J.,&Zhang.S.(2002).Toward a method of Oakland,CA:Holden-day,Inc. selecting among computational models of cognition.Psychological Casella,G.,Berger,R.L.(2002).Statistical inference (2nd ed.). Review,109,472-491. Pacific Grove,CA:Duxberry. Ratcliff,R.(1978).A theory of memory retrieval.Psychological DeGroot,M.H.,Schervish,M.J.(2002).Probability and statistics Review,85.59-108. (3rd ed.).Boston,MA:Addison-Wesley. Rubin,D.C..Hinton,S..&Wenzel,A.(1999).The precise time course Kirkpatrick,S..Gelatt,C.D.,Vecchi,M.P.(1983).Optimization by of retention.Journal of Experimental Psychology:Learning. simulated annealing.Science,220.671-680. Memory,and Cognition,25.1161-1176
References Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In: Petrox, B.N., & Caski, F. Second international symposium on information theory (pp. 267–281). Budapest: Akademiai Kiado. Batchelder, W. H., & Crowther, C. S. (1997). Multinomial processing tree models of factorial categorization. Journal of Mathematical Psychology, 41, 45–55. Bickel, P. J., & Doksum, K. A. (1977). Mathematical statistics. Oakland, CA: Holden-day, Inc. Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Pacific Grove, CA: Duxberry. DeGroot, M. H., & Schervish, M. J. (2002). Probability and statistics (3rd ed.). Boston, MA: Addison-Wesley. Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671–680. Lamberts, K. (2000). Information-accumulation theory of speeded categorization. Psychological Review, 107(2), 227–260. Linhart, H., & Zucchini, W. (1986). Model selection. New York, NY: Wiley. MurdockJr., B. B. (1961). The retention of individual items. Journal of Experimental Psychology, 62, 618–625. Myung, I. J., Forster, M., & Browne, M. W. (2000). Special issue on model selection. Journal of Mathematical Psychology, 44, 1–2. Pitt, M. A., Myung, I. J., & Zhang, S. (2002). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472–491. Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59–108. Rubin, D. C., Hinton, S., & Wenzel, A. (1999). The precise time course of retention. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 1161–1176. y ¼ ½:94 :77 :40 :26 :24 :16 0 ;% observed proportion correct as a column vector init w ¼ randð2; 1Þ;% starting parameter values low w ¼ zerosð2; 1Þ;% parameter lower bounds up w ¼ 100nonesð2; 1Þ;% parameter upper bounds ½w1; sse1; res1; exit1 ¼ lsqnonlinð‘power lse0 ; init w; low w; up w; opts; yÞ; % optimization for power model % w1: LSE estimates % sse1: minimized SSE value % res1: value of the residual at the solution % exit1: optimization has converged if exit1 40 or not otherwise ½w2; sse2; res2; exit2 ¼ lsqnonlinð‘expo lse0 ; init w; low w; up w; opts; yÞ; % optimization for exponential model r2ð1; 1Þ ¼ 1-sse1=sumððy-meanðyÞÞ: # 2Þ; % r# 2 for power model r2ð2; 1Þ ¼ 1-sse2=sumððy-meanðyÞÞ: # 2Þ; % r# 2 for exponential model format long; disp(num2str([w1 w2 r2],5));% display out results disp(num2str([sse1 sse2 exit1 exi2],5));% display out results end % end of the main program function dev ¼ power lseðw; yÞ % POWER LSE The deviation between observation and prediction of the power % model global t; p ¼ wð1; 1Þ n t: # ð-wð2; 1ÞÞ;% power model prediction dev ¼ p y; % deviation between prediction and observation, the square of which is being minimized function dev ¼ expo lseðw; yÞ % EXPO LSE The deviation between observation and prediction of the % exponential model global t; p ¼ wð1; 1Þ n expðwð2; 1Þ n tÞ;% exponential model prediction dev ¼ p y; % deviation between prediction and observation, the square of which is being minimized I.J. Myung / Journal of Mathematical Psychology 47 (2003) 90–100 99