A History of Markov Chain Monte Carlo∗ —Subjective Recollections from Incomplete Data—

Christian Robert† Université Paris Dauphine and CREST, INSEE
George Casella‡ University of Florida

August 14, 2008

Abstract

In this note we attempt to trace the history and development of Markov chain Monte Carlo (MCMC) from its early inception in the late 1940’s through its use today. We see how the earlier stages of the Monte Carlo (MC, not MCMC) research have led to the algorithms currently in use. More importantly, we see how the development of this methodology has not only changed our solutions to problems, but has changed the way we think about problems.

1 Introduction

Markov Chain Monte Carlo (MCMC) methods have been around for almost as long as Monte Carlo techniques, even though their impact on Statistics has not been truly felt until the very early 1990s. (The emergence of Markov based techniques in Physics and, in particular, Particle Physics is another story that will remain mostly untold within this survey. See Landau and Binder 2005 for a review.) Also, we will not launch into a description of MCMC techniques, unless they have some historical link, as the remainder of this volume covers

∗ We are grateful for comments and suggestions from Olivier Cappé, David Spiegelhalter, Alan Gelfand, Jun Liu, Sharon McGrayne, Peter Muller, Gareth Roberts, and Adrian Smith
† Professor of Statistics, CEREMADE, Université Paris Dauphine, 75785 Paris cedex 16, France. Supported by the Agence Nationale de la Recherche (ANR, 212, rue de Bercy 75012 Paris) through the 2006-2008 project Adap’MC. Email: xian@ceremade.dauphine.fr.
‡ Distinguished Professor, Department of Statistics, University of Florida, Gainesville, FL 32611. Supported by National Science Foundation Grants DMS-04-05543, DMS-0631632 and SES-0631588. Email: casella@stat.ufl.edu.
the technical aspects. A comprehensive entry with further references can also be found in Robert and Casella (2004).

We will distinguish the introduction of Metropolis-Hastings based algorithms from those related to Gibbs sampling, since they each stem from radically different origins, even though their mathematical justification via Markov chain theory is the same. Tracing the development of Monte Carlo methods, we will also briefly mention what we might call the “second-generation MCMC revolution”. Starting in the mid-to-late 1990s, this includes the development of particle filters, reversible jump and perfect sampling, and concludes with more current work on population or sequential Monte Carlo and regeneration and the computing of “honest” standard errors. (But is it still history?!)

As mentioned above, the realization that Markov chains could be used in a wide variety of situations only came (to mainstream statisticians) with Gelfand and Smith (1990), despite earlier publications in the statistical literature like Hastings (1970), Geman and Geman (1984) and Tanner and Wong (1987). Several reasons can be advanced: lack of computing machinery (think of the computers of 1970!), lack of background on Markov chains, lack of trust in the practicality of the method... It thus required visionary researchers like Alan Gelfand and Adrian Smith to spread the good news, backed up with a collection of papers that demonstrated, through a series of applications, that the method was easy to understand, easy to implement and practical (Gelfand et al. 1990, 1992, Smith and Gelfand 1992, Wakefield et al. 1994).
The rapid emergence of the dedicated BUGS (Bayesian inference Using Gibbs Sampling) software as early as 1991 (when a paper on BUGS was presented at the Valencia meeting) was another compelling argument for adopting (at large) MCMC algorithms.1

2 Before the Revolution

Monte Carlo methods were born in Los Alamos, New Mexico during World War II, eventually resulting in the Metropolis algorithm in the early 1950s. While Monte Carlo methods were in use by that time, MCMC was brought closer to statistical practicality by the work of Hastings in the 1970s.

What can be reasonably seen as the first MCMC algorithm is the Metropolis algorithm, published by Metropolis et al. (1953). It emanates from the same group of scientists who produced the Monte Carlo method, namely the research scientists of Los Alamos, mostly

1 Historically speaking, the development of BUGS initiated from Geman and Geman (1984) and Pearl (1987), in tune with the developments in the artificial intelligence community, and it pre-dates Gelfand and Smith (1990).
physicists working on mathematical physics and the atomic bomb.2

MCMC algorithms therefore date back to the same time as the development of regular (MC only) Monte Carlo methods, which are usually traced to Ulam and von Neumann in the late 1940s. Stanislaw Ulam associates the original idea with an intractable combinatorial computation he attempted in 1946 (calculating the probability of winning at the card game “solitaire”). This idea was enthusiastically adopted by John von Neumann for implementation with direct applications to neutron diffusion, the name “Monte Carlo” being suggested by Nicholas Metropolis. (See the interesting recounting of this in Eckhardt 1987.)

These occurrences very closely coincide with the appearance of the very first computer, the ENIAC, which came to life in February 1946, after three years of construction. The Monte Carlo method was set up by von Neumann, who was using it on thermonuclear and fission problems as early as 1947. At the same time, that is, 1947, Ulam and von Neumann invented inversion and accept-reject techniques (also recounted in Eckhardt 1987) to simulate from non-uniform distributions. Without computers, a rudimentary version invented by Fermi in the 1930s did not get any recognition (Metropolis 1987). Note also that, as early as 1949, a symposium on Monte Carlo was supported by Rand, NBS and the Oak Ridge laboratory and that Metropolis and Ulam (1949) published the very first paper about the Monte Carlo method.

2.1 The Metropolis et al. (1953) paper

The first MCMC algorithm is associated with a second computer, called MANIAC(!), built3 in Los Alamos in early 1952. Both a physicist and a mathematician, Metropolis, who died in Los Alamos in 1999, came to this place in April 1943. The other members of the team also came to Los Alamos during those years, with Edward Teller being the most controversial character of the group.
As early as 1942, he was one of the first scientists to work on the Manhattan Project that led to the production of the A bomb. Almost as early, he became obsessed with the hydrogen (H) bomb, which he eventually managed to design with Stanislaw Ulam using the better computer facilities in the early 1950s.4

Published in June 1953 in the Journal of Chemical Physics, Metropolis et al. (1953)

2 The atomic bomb construction did not involve simulation techniques, even though the subsequent development of the hydrogen bomb did.
3 MANIAC stands for Mathematical Analyzer, Numerical Integrator and Computer.
4 On a somber note, Edward Teller later testified against Robert Oppenheimer in the McCarthy trials and, much later, was a fervent proponent of the “Star Wars” defense system under the Reagan administration.
primary focus is the computation of integrals of the form

I = \frac{\int F(p,q)\,\exp\{-E(p,q)/kT\}\,dp\,dq}{\int \exp\{-E(p,q)/kT\}\,dp\,dq}\,,

with the energy E being defined as

E(p,q) = \frac{1}{2}\,\sum_{i=1}^{N}\ \sum_{\substack{j=1 \\ j\neq i}}^{N} V(d_{ij})\,,

where N is the number of particles, V a potential function and d_{ij} the distance between particles i and j. The Boltzmann distribution exp{−E(p, q)/kT} is parameterised by the temperature T, k being the Boltzmann constant, with a normalisation factor

Z(T) = \int \exp\{-E(p,q)/kT\}\,dp\,dq

that is not available in closed form. Since p and q are 2N-dimensional vectors, numerical integration is impossible. Given the large dimension of the problem, even standard Monte Carlo techniques fail to correctly approximate I, since exp{−E(p, q)/kT} is very small for most realizations of the random configurations of the particle system (uniformly in the 2N or 4N square). In order to improve the efficiency of the Monte Carlo method, Metropolis et al. (1953) propose a random walk modification of the N particles. That is, for each particle i (1 ≤ i ≤ N), values x′_i = x_i + αξ_{1i} and y′_i = y_i + αξ_{2i} are proposed, where both ξ_{1i} and ξ_{2i} are uniform U(−1, 1). The energy difference ∆E between the new configuration and the previous one is then computed and the new configuration is accepted with probability

1 ∧ exp{−∆E/kT} ,     (1)

and otherwise the previous configuration is replicated (in the sense that it will count one more time in the final average of the F(p_t, q_t)’s over the τ moves of the random walk, 1 ≤ t ≤ τ). Note that Metropolis et al. (1953) move one particle at a time, rather than moving all of them together, which makes the initial algorithm appear as a primitive kind of Gibbs sampling (!).

The authors of Metropolis et al. (1953) demonstrate the validity of the algorithm by first establishing irreducibility (that they call ergodicity) and second proving ergodicity, that is convergence to the stationary distribution.
The second part is obtained via a discretization of the space: They first note that the proposal move is reversible, then establish that exp{−E/kT} is invariant. The result is therefore proven in its full generality (modulo the
discretization). The remainder of the paper is concerned with the specific problem of the rigid-sphere collision model. The number of iterations of the Metropolis algorithm seems to be limited: 16 steps for burn-in and 48 to 64 subsequent iterations (that still required four to five hours on the Los Alamos MANIAC).

An interesting variation of (1) is the Simulated Annealing algorithm, developed by Kirkpatrick et al. (1983), who connected optimization with annealing, the cooling of a metal. Their variation is to allow T of (1) to change as the algorithm runs, according to a “cooling schedule”, and the Simulated Annealing algorithm can be shown to find the global maximum with probability 1, although the analysis is quite complex due to the fact that, with varying T, the algorithm is no longer a time-homogeneous Markov chain.

2.2 The Hastings (1970) paper

The Metropolis algorithm was later generalized by Hastings (1970) and Peskun (1973, 1981) as a statistical simulation tool that could overcome the curse of dimensionality met by regular Monte Carlo methods (already emphasized in Metropolis et al. 1953).5

In his Biometrika paper,6 Hastings (1970) also defines his methodology on finite and reversible Markov chains, treating the continuous case by using a discretization analogy. The generic probability of acceptance for a move from state i to state j is

\alpha_{ij} = \frac{s_{ij}}{1 + \dfrac{\pi_i}{\pi_j}\,\dfrac{q_{ij}}{q_{ji}}}\,,

where s_{ij} is a symmetric function. This generic form of probability encompasses the forms of both Metropolis et al. (1953) and Barker (1965). At this stage, Peskun’s ordering is not yet discovered and Hastings thus mentions that little is known about the relative merits of those two choices (even though) Metropolis’s method may be preferable. He also warns against high rejection rates as indicative of a poor choice of transition matrix, but does not mention the opposite pitfall of low rejection rates, associated with a slow exploration of the target.
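The random-walk scheme with acceptance probability (1) can be sketched in a few lines of code. The sketch below is ours, not the authors'; a one-dimensional standard normal plays the role of the Boltzmann factor exp{−E/kT}, and the function and parameter names are purely illustrative.

```python
import math
import random

def metropolis(log_target, x0, scale, n_iter, rng):
    """Random-walk Metropolis sketch: propose x' = x + scale * U(-1, 1),
    accept with probability 1 ∧ exp{log_target(x') - log_target(x)};
    on rejection the current state is replicated, exactly as in (1)."""
    x = x0
    chain = []
    for _ in range(n_iter):
        prop = x + scale * rng.uniform(-1.0, 1.0)
        log_ratio = log_target(prop) - log_target(x)
        # min(0, ·) caps the ratio so exp() cannot overflow for uphill moves
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = prop
        chain.append(x)  # rejected moves count the old state one more time
    return chain

# Toy target: standard normal log-density, up to an additive constant.
rng = random.Random(42)
chain = metropolis(lambda x: -0.5 * x * x, x0=0.0, scale=2.0, n_iter=50000, rng=rng)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
```

With a long enough run, the empirical mean and variance of the chain approach the target's 0 and 1, illustrating convergence to the stationary distribution discussed above.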
The examples given in the paper are a Poisson target with a ±1 random walk proposal, a normal target with a uniform random walk proposal mixed with its reflection (i.e. centered at −X(t) rather than X(t)), and then a multivariate target where Hastings introduces a Gibbs sampling strategy, updating one component at a time and defining the composed transition as satisfying the stationary condition because each component does leave the

5 In fact, Hastings starts by mentioning a decomposition of the target distribution into a product of one-dimensional conditional distributions but this falls short of an early Gibbs sampler!
6 Hastings (1970) is one of the ten papers reproduced in the Biometrika 100th anniversary volume by Titterington and Cox (2001).
target invariant! Hastings (1970) actually refers to Erhman et al. (1960) as a preliminary if specific instance of this sampler. More precisely, this is Metropolis-within-Gibbs except for the name. It is quite amazing that this first introduction of the Gibbs sampler has been completely overlooked, even though the proof of convergence is completely general, based on a composition argument as in Tierney (1994)! The remainder of the paper deals with (a) an importance sampling version of MCMC, (b) general remarks about assessment of the error, (c) an application to random orthogonal matrices (with yet again an example of Gibbs sampling).

Three years later, still in Biometrika, Peskun (1973) publishes a comparison of Metropolis’ and Barker’s forms of acceptance probabilities and shows (again in a discrete setup) that the optimal choice (in terms of the asymptotic variance of any empirical average) is Metropolis’. The proof is a direct consequence of a result by Kemeny and Snell (1960) on the asymptotic variance. Peskun also establishes that this asymptotic variance can improve upon the iid case if and only if the eigenvalues of P − A are all negative, when A is the transition matrix corresponding to the iid simulation and P the transition matrix corresponding to the Metropolis algorithm, but he concludes that the trace of P − A is always positive.

3 Seeds of the Revolution

A number of earlier pioneers had brought forward the seeds of Gibbs sampling; in particular, Hammersley and Clifford had produced a constructive argument in 1970 to recover a joint distribution from its conditionals, a result later called the Hammersley–Clifford theorem by Besag (1974, 1986). Besides Hastings (1970) and Geman and Geman (1984), already mentioned, other papers that contained the germs of Gibbs sampling are Besag and Clifford (1989), Broniatowski et al. (1984), Qian and Titterington (1990), and Tanner and Wong (1987).
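Peskun's comparison of the Metropolis and Barker acceptance rules can be checked numerically on the smallest possible example. The sketch below is our own toy illustration: a two-state target with a symmetric flip proposal, and the standard closed-form asymptotic variance of a two-state chain.

```python
# Two-state toy target; the flip proposal is symmetric, so Metropolis accepts a
# move i -> j with min(1, pi_j/pi_i) while Barker accepts with pi_j/(pi_i + pi_j).
pi = (0.3, 0.7)

metropolis = lambda pi_i, pi_j: min(1.0, pi_j / pi_i)
barker = lambda pi_i, pi_j: pi_j / (pi_i + pi_j)

def off_diagonal(accept):
    # Transition probabilities P(0 -> 1) and P(1 -> 0) under a deterministic flip.
    return accept(pi[0], pi[1]), accept(pi[1], pi[0])

def asymptotic_variance(a, b):
    """Asymptotic variance of the empirical mean of 1{state = 0} for a two-state
    chain with P(0->1) = a, P(1->0) = b:  pi0 * pi1 * (2 - a - b) / (a + b)."""
    pi0, pi1 = b / (a + b), a / (a + b)
    return pi0 * pi1 * (2.0 - a - b) / (a + b)

v_metropolis = asymptotic_variance(*off_diagonal(metropolis))
v_barker = asymptotic_variance(*off_diagonal(barker))
# Peskun dominance: Metropolis moves off the diagonal at least as often as
# Barker, hence yields the smaller asymptotic variance.
```

Here v_metropolis evaluates to 0.084 against 0.21 for Barker, matching Peskun's ordering: more mobile off-diagonal behavior gives smaller asymptotic variance.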
3.1 Besag’s Early Work and the Fundamental (Missing) Theorem

In the early 1970’s, Hammersley, Clifford, and Besag were working on the specification of joint distributions from conditional distributions and on necessary and sufficient conditions for the conditional distributions to be compatible with a joint distribution. What is now known as the Hammersley–Clifford theorem states that a joint distribution for a vector associated with a dependence graph (edge meaning dependence and absence of edge conditional independence) must be represented as a product of functions over the cliques of the graphs, that is, of functions depending only on the components indexed by the labels in the clique
(which is a subset of the nodes of the graphs such that every node is connected by an edge to every other node in the subset). See Cressie (1993) or Lauritzen (1996) for detailed treatments.

From an historical point of view, Hammersley (1974) explains why the Hammersley–Clifford theorem was never published as such, but only through Besag (1974). The reason is that Clifford and Hammersley were dissatisfied with the positivity constraint: The joint density could be recovered from the full conditionals only when the support of the joint was made of the product of the supports of the full conditionals (with obvious counter-examples, as in Robert and Casella 2004). While they strived to make the theorem independent of any positivity condition, their graduate student published Moussouris (1974), a counter-example that put a full stop to their endeavors.

While Julian Besag can certainly be credited to some extent with the (re-)discovery of the Gibbs sampler (as in Besag 1974), Besag (1975) has the curious and anticlimactic following comment:

    The simulation procedure is to consider the sites cyclically and, at each stage, to amend or leave unaltered the particular site value in question, according to a probability distribution whose elements depend upon the current value at neighboring sites (...) However, the technique is unlikely to be particularly helpful in many other than binary situations and the Markov chain itself has no practical interpretation.

So, while stating the basic version of the Gibbs sampler on a graph with discrete variables, Besag dismisses it as unpractical.

On the other hand, Hammersley, together with Handscomb, wrote a textbook on Monte Carlo methods, (the first?) (Hammersley and Handscomb 1964). There they cover such topics as “Crude Monte Carlo” (which is (3)); importance sampling; control variates; and “Conditional Monte Carlo”, which looks surprisingly like a missing-data completion approach.
Of course, they do not cover the Hammersley-Clifford theorem but, in contrast to Besag (1974), they state in the Preface We are convinced nevertheless that Monte Carlo methods will one day reach an impressive maturity. Well said! 7
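Besag's cyclic updating scheme quoted above is exactly a single-site Gibbs sampler. Here is a minimal sketch, our own illustration rather than code from any of the cited papers, for a binary, Ising-type Markov random field on a grid: each site is resampled in turn from its full conditional given the current values at its neighbors. The grid size, interaction parameter `beta`, and function name are hypothetical choices.

```python
import math
import random

def gibbs_ising(n=8, beta=0.5, sweeps=200, seed=1):
    """Cyclic-site Gibbs sampler for a binary (Ising-type) Markov random
    field on an n x n grid: visit each site in turn and resample its value
    from its full conditional given the neighboring spins."""
    rng = random.Random(seed)
    x = [[rng.choice([-1, 1]) for _ in range(n)] for _ in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            for j in range(n):
                # sum of neighboring spins (free boundary conditions)
                s = sum(x[a][b]
                        for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                        if 0 <= a < n and 0 <= b < n)
                # full conditional: P(x_ij = +1 | neighbors) = 1 / (1 + e^{-2*beta*s})
                p = 1.0 / (1.0 + math.exp(-2.0 * beta * s))
                x[i][j] = 1 if rng.random() < p else -1
    return x

field = gibbs_ising()
mag = sum(map(sum, field)) / 64.0  # average magnetization of the final state
```

Note that the chain explores the joint distribution of the field using only these local conditionals, which is precisely the point of the Hammersley-Clifford theorem discussed above.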
3.2 EM and its Simulated Versions as Precursors

Besides a possibly difficult computation in the E-step, problems with the EM algorithm (Dempster et al. 1977) do occur in the case of multimodal likelihoods. The increase of the likelihood function at each step of the algorithm ensures its convergence to the maximum likelihood estimator in the case of unimodal likelihoods, but implies a dependence on the initial conditions for multimodal likelihoods. Several proposals can be found in the literature to overcome this problem, one of which we now describe because of its connection with Gibbs sampling.

Broniatowski et al. (1984) and Celeux and Diebolt (1985, 1992) have tried to overcome the dependence of EM methods on the starting value by replacing the E step with a simulation step, the missing data z being generated conditionally on the observation x and on the current value of the parameter θ_m. The maximization in the M step is then done on the (simulated) complete-data log-likelihood, H̃(x, z_m | θ). The appeal of this approach is that it allows for a more systematic exploration of the likelihood surface by partially avoiding the fatal attraction of the closest mode. Unfortunately, the theoretical convergence results for these methods are limited. Celeux and Diebolt (1990) have, however, solved the convergence problem of SEM by devising a hybrid version called SAEM (for Simulated Annealing EM), where the amount of randomness in the simulations decreases with the iterations, ending up with an EM algorithm. This version actually relates to simulated annealing methods.

3.3 Gibbs, and Beyond

Although somewhat removed from statistical inference in the classical sense and based on earlier techniques used in Statistical Physics, the landmark paper by Geman and Geman (1984) brought Gibbs sampling into the arena of statistical application. This paper is also responsible for the name Gibbs sampling, because it implemented this method for the Bayesian study of Gibbs random fields which, in turn, derive their name from the physicist Josiah Willard Gibbs (1839-1903). This original implementation of the Gibbs sampler was applied to a discrete image processing problem and did not involve completion. But this was one more spark that led to the explosion, as it had a clear influence on Green, Smith, Spiegelhalter and others.

4 The Revolution

The gap of more than 30 years between Metropolis et al. (1953) and Gelfand and Smith (1990) can still be partially attributed to the lack of appropriate computing power, as most
of the examples now processed by MCMC algorithms could not have been treated previously, even though the hundreds of dimensions processed in Metropolis et al. (1953) were quite formidable. However, by the mid-1980s the pieces were all in place.

After Peskun, MCMC in the statistical world was dormant for about 10 years, and then several papers appeared that highlighted its usefulness in specific settings (see, for example, Geman and Geman 1984, Tanner and Wong 1987, Besag 1989). In particular, Geman and Geman (1984), building on Metropolis et al. (1953), Hastings (1970), and Peskun (1973), influenced Gelfand and Smith (1990) to write a paper that is the genuine starting point for an intensive use of MCMC methods by the statistical community. It sparked new interest in Bayesian methods, statistical computing, algorithms, and stochastic processes through the use of computing algorithms such as the Gibbs sampler and the Metropolis-Hastings algorithm. (See Casella and George 1992 for an elementary introduction to the Gibbs sampler.7)

Interestingly, the earlier Tanner and Wong (1987) has essentially the same ingredients as Gelfand and Smith (1990), namely the fact that simulating from the conditional distributions is sufficient to simulate (in the limiting sense) from the joint. This paper was considered important enough to be a discussion paper in the Journal of the American Statistical Association, but its impact was somewhat limited, compared with that of Gelfand and Smith (1990). There are several reasons for this; one being that the method seemed to apply only to missing-data problems (hence the name data augmentation instead of Gibbs sampling), and another being that the authors were more focused on approximating the posterior distribution. They suggested a (Markov chain) Monte Carlo approximation to the target π(θ|x) at each iteration of the sampler, based on

    (1/m) Σ_{k=1}^{m} π(θ | x, z_{t,k}),    z_{t,k} ∼ π̂_{t−1}(z | x),

that is, by replicating m times the simulations from the current approximation π̂_{t−1}(z|x) of the marginal posterior distribution of the missing data. This focus on estimation of the posterior distribution connects the original Data Augmentation algorithm to EM, as pointed out by Dempster in the discussion. Although the discussion by Morris gets very close to the two-stage Gibbs sampler for hierarchical models, he is still concerned about doing m iterations, and worries about how costly that would be. Tanner and Wong mention

7 On a humorous note, the original Technical Report of this paper was called Gibbs for Kids, which was changed because a referee did not appreciate the humor. However, our colleague Dan Gianola, an Animal Breeder at Wisconsin, liked the title. In using Gibbs sampling in his work, he gave a presentation in 1993 at the 44th Annual Meeting of the European Association for Animal Production, Århus, Denmark. The title: Gibbs for Pigs.
taking m = 1 at the end of the paper, referring to this as an "extreme case". In a sense, Tanner and Wong (1987) was still too close to Rubin's 1978 multiple imputation to start a (new) revolution. Yet another reason for this may be that the theoretical backup was based on functional analysis rather than Markov chain theory, which in particular required the Markov kernel to be uniformly bounded and equicontinuous. This may have rebuffed potential users as requiring too much math!

The authors of this review were fortunate enough to attend many focused conferences during this time, where we were able to witness the explosion of Gibbs sampling. In the summer of 1986 in Bowling Green, Ohio, Adrian Smith gave a series of ten lectures on hierarchical models. Although there was a lot of computing mentioned, the Gibbs sampler was not fully developed yet. In another lecture by Adrian Smith in June 1989 at a Bayesian workshop in Sherbrooke, Québec, Adrian exposed for the first time (?) the generic features of the Gibbs sampler, and we still remember vividly the shock induced by the sheer breadth of the method on ourselves and on the whole audience!

This development of Gibbs sampling, MCMC, and the resulting seminal paper of Gelfand and Smith (1990) was an epiphany in the world of Statistics.

Definition: epiphany n. A spiritual event in which the essence of a given object of manifestation appears to the subject, as in a sudden flash of recognition.

The explosion had begun, and just two years later, at an MCMC conference at Ohio State University organized by Alan Gelfand, Prem Goel, and Adrian Smith, there were three full days of talks. The presenters at the conference read like a (North-American) Who's Who of MCMC, and the level, intensity and impact of that conference, and the subsequent research, is immeasurable. The program of the conference is reproduced in Appendix A.
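Tanner and Wong's m-fold scheme discussed above, together with the m = 1 "extreme case" that collapses to the two-stage Gibbs sampler, can be sketched on a toy missing-data model. The model, the function name, and all parameter values below are our own hypothetical choices, not anything from Tanner and Wong (1987): the observations x and one missing value z are taken i.i.d. N(θ, 1) with a flat prior on θ, so θ | x, z is N((Σx + z)/(n + 1), 1/(n + 1)) and z | θ is N(θ, 1).

```python
import random

def data_augmentation(x, m=5, iters=500, seed=0):
    """Chained data augmentation on a toy normal model: each iteration
    imputes m copies of the missing z from the current approximation
    pi-hat_{t-1}(z | x), then refreshes m draws of theta from the
    resulting m-component mixture of conditional posteriors.
    With m = 1 this is exactly the two-stage Gibbs sampler."""
    rng = random.Random(seed)
    n, sx = len(x), sum(x)
    thetas = [0.0] * m  # current m-sample approximation of pi(theta | x)
    for _ in range(iters):
        # z_{t,k} ~ pi-hat_{t-1}(z | x): pick a theta draw, then z | theta
        zs = [rng.gauss(rng.choice(thetas), 1.0) for _ in range(m)]
        # theta ~ pi-hat_t(theta | x): pick an imputation, then theta | x, z
        thetas = [rng.gauss((sx + rng.choice(zs)) / (n + 1),
                            (1.0 / (n + 1)) ** 0.5)
                  for _ in range(m)]
    return thetas

draws = data_augmentation([1.2, 0.8, 1.0, 1.1], m=5)
```

The replication cost that worried Morris is visible here: each iteration performs m simulations of z and m of θ, which is why the m = 1 case later proved so attractive.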
Approximately one year later, in May of 1992, there was a meeting of the Royal Statistical Society on "The Gibbs sampler and other Markov chain Monte Carlo methods", where four papers were presented, followed by much discussion. The papers appear in the first volume of JRSSB in 1993, together with 49 (!) pages of discussion, again by the Who's Who of MCMC (more global this time), and the excitement is clearly evident in the writings.

Looking at these meetings, we can see the paths that Gibbs sampling would lead us down. In the next two sections we will summarize some of the advances from the early to mid 1990s.

4.1 Advances in MCMC Theory

Perhaps the most influential MCMC theory paper of the 1990s is Tierney (1994), who carefully laid out all of the assumptions needed to analyze the Markov chains and then developed