哈密尔顿蒙特卡罗的几何基础 The Geometric Foundations of Hamiltonian Monte Carlo.pdf_大学文库

arXiv:1410.5110v1 [stat.ME] 19 Oct 2014 Submitted to Statistical Science The Geometric Foundations of Hamiltonian Monte Carlo Michael Betancourt, Simon Byrne, Sam Livingstone, and Mark Girolami Abstract. Although Hamiltonian Monte Carlo has proven an empirical success, the lack of a rigorous theoretical understanding of the algorithm has in many ways impeded both principled developments of the method and use of the algorithm in practice. In this paper we develop the formal foundations of the algorithm through the construction of measures on smooth manifolds, and demonstrate how the theory naturally identifies efficient implementations and motivates promising generalizations. Key words and phrases: Markov Chain Monte Carlo, Hamiltonian Monte Carlo, Disintegration, Differential Geometry, Smooth Manifold, Fiber Bundle, Riemannian Geometry, Symplectic Geometry. The frontier of Bayesian inference requires algorithms capable of fitting complex models with hundreds, if not thousands of parameters, intricately bound together with nonlinear and often hierarchical correlations. Hamiltonian Monte Carlo (Duane et al., 1987; Neal, 2011) has proven tremendously successful at extracting inferences from these models, with applications spanning computer science (Sutherland, P´oczos and Schneider, 2013; Tang, Srivastava and Salakhutdinov, 2013), ecology (Schofield et al., 2014; Terada, Inoue and Nishihara, 2013), epidemiology (Cowling et al., 2012), linguistics (Husain, Vasishth and Srinivasan, 2014), pharmacokinetics (Weber et al., 2014), physics (Jasche et al., 2010; Porter and Carr´e, 2014; Sanders, Betancourt and Soderberg, 2014; Wang et al., 2014), and political science (Ghitza and Gelman, 2014), to name a few. Despite such widespread empirical success, however, there remains an air of mystery concerning the efficacy of the algorithm. This lack of understanding not only limits the adoption of Hamiltonian Monte Carlo but may also foster unprincipled and, ultimately, fragile implementations that restrict the Michael Betancourt is a Postdoctoral Research Associate at the University of Warwick, Coventry CV4 7AL, UK (e-mail: betanalpha@gmail.com). Simon Byrne is an EPSRC Postdoctoral Research Fellow at University College London, Gower Street, London, WC1E 6BT. Sam Livingstone is a PhD candidate at University College London, Gower Street, London, WC1E 6BT. Mark Girolami is an ESPRC Established Career Research Fellow at the University of Warwick, Coventry CV4 7AL, UK. 1

2 BETANCOURT ET AL of Fang,San tation in ical Monte Carlo (Lan 20121.1m effort to r len +1 rical inte to Ha niltonia Monte Carlo.Alth gh this lea ads dels th ce rapidly in pgimg 2012)in ndard Hami ehow critical to manc but why? paper we he theoretical fou on of Hamiltonian Monte Carlo wer que ions like We demo ritical to the of the algorith hence i several gene 1 ementa the success miltonian a We begin esPoeartiofg efficient Markov kernels and po sibl tho on mot ates the ols in differen e try, theory of probabilistic olds the penult at tha a sh the ion.and analysis of H tonian Monte this spective directs generalizations of the algorithm familiarity with differential geometry a complete understanding of this work will be a challe and we recommend that readers without a backgroun only scan through Section 2 to develop some intuition for the probabilistic interpretation bundles,Hemannian metrics,and symplectic forms,as wel I as the utility of Hamiltonia For those readers interesting in developing new implementations c Hamiltonian Monte Carlo we recommend a more careful reading of these sections and suggest introductory literature on the mathematics necessary to do so in the introduction of Section 2. 1 CONSTRUCTING EFFICIENT MARKOV KERNELS Bayesian inference is conceptually straightforward:the information about a system is first modeled with the construction of a posterior distribution,and then statistical questions can be answered by computing expectations with respect to that distribution.Many of the limitations of Bayesian inference arise not in the modeling of a posterior distribution but rather in computing the subsequent expectations.Because it provides a generic means of estimating these expectations markoy chain monte carlo has been critical to the success of the Bayesian methodology in practice. In this section we first review the Markov kernels intrinsic to Markov Chain Monte Carlo and then consider the dynamic systems perspective to motivate a strategy for constructing Markov kernels that yield computationally efficient inferences

2 BETANCOURT ET AL. scalability of the algorithm. Consider, for example, the Compressible Generalized Hybrid Monte Carlo scheme of Fang, Sanz-Serna and Skeel (2014) and the particular implementation in Lagrangian Dynamical Monte Carlo (Lan et al., 2012). In an effort to reduce the computational burden of the algorithm, the authors sacrifice the costly volume-preserving numerical integrators typical to Hamiltonian Monte Carlo. Although this leads to improved performance in some low-dimensional models, the performance rapidly diminishes with increasing model dimension (Lan et al., 2012) in sharp contrast to standard Hamiltonian Monte Carlo. Clearly, the volume-preserving numerical integrator is somehow critical to scalable performance; but why? In this paper we develop the theoretical foundation of Hamiltonian Monte Carlo in order to answer questions like these. We demonstrate how a formal understanding naturally identifies the properties critical to the success of the algorithm, hence immediately providing a framework for robust implementations. Moreover, we discuss how the theory motivates several generalizations that may extend the success of Hamiltonian Monte Carlo to an even broader array of applications. We begin by considering the properties of efficient Markov kernels and possible strategies for constructing those kernels. This construction motivates the use of tools in differential geometry, and we continue by curating a coherent theory of probabilistic measures on smooth manifolds. In the penultimate section we show how that theory provides a skeleton for the development, implementation, and formal analysis of Hamiltonian Monte Carlo. Finally, we discuss how this formal perspective directs generalizations of the algorithm. Without a familiarity with differential geometry a complete understanding of this work will be a challenge, and we recommend that readers without a background in the subject only scan through Section 2 to develop some intuition for the probabilistic interpretation of forms, fiber bundles, Riemannian metrics, and symplectic forms, as well as the utility of Hamiltonian flows. For those readers interesting in developing new implementations of Hamiltonian Monte Carlo we recommend a more careful reading of these sections and suggest introductory literature on the mathematics necessary to do so in the introduction of Section 2. 1. CONSTRUCTING EFFICIENT MARKOV KERNELS Bayesian inference is conceptually straightforward: the information about a system is first modeled with the construction of a posterior distribution, and then statistical questions can be answered by computing expectations with respect to that distribution. Many of the limitations of Bayesian inference arise not in the modeling of a posterior distribution but rather in computing the subsequent expectations. Because it provides a generic means of estimating these expectations, Markov Chain Monte Carlo has been critical to the success of the Bayesian methodology in practice. In this section we first review the Markov kernels intrinsic to Markov Chain Monte Carlo and then consider the dynamic systems perspective to motivate a strategy for constructing Markov kernels that yield computationally efficient inferences

4 BETANCOURT ET AL. that are asymptotically consistent for any initial q0 ∈ Q, lim N→∞ ˆfN (q0) P −→ E̟[f] . Here δq is the Dirac measure that concentrates on q, δq(A) ∝ 0, q /∈ A 1, q ∈ A , q ∈ Q, A ∈ B(Q). In practice we are interested not just in Markov chains that explore the target distribution as N → ∞ but in Markov chains that can explore and yield precise Markov Chain Monte Carlo estimators in only a finite number of transitions. From this perspective the efficiency of a Markov chain can be quantified in terms of the autocorrelation, which measures the dependence of any square integrable test function, f ∈ L 2 (Q, ̟), before and after the application of the Markov transition ρ[f] ≡ R f(q1) f(q2) τ (q1, dq2) ̟(dq1) − R f(q2) ̟(dq2) R f(q1) ̟(dq1) R f 2(q) ̟(dq) − R f(q) ̟(dq) 2 . In the best case the Markov kernel reproduces the target measure, τ (q, A) = ̟(A), ∀q ∈ Q, and the autocorrelation vanishes for all test functions, ρ[f] = 0. Alternatively, a Markov kernel restricted to a Dirac measure at the initial point, τ (q, A) = δq(A), moves nowhere and the autocorrelations saturate for any test function, ρ[f] = 1. Note that we are disregarding anti-autocorrelated chains, whose performance is highly sensitive to the particular f under consideration. Given a target measure, any Markov kernel will lie in between these two extremes; the more of the target measure a kernel explores the smaller the autocorrelations, while the more localized the exploration to the initial point the larger the autocorrelations. Unfortunately, common Markov kernels like Gaussian Random walk Metropolis (Robert and Casella, 1999) and the Gibbs sampler (Geman and Geman, 1984; Gelfand and Smith, 1990) degenerate into local exploration, and poor efficiency, when targeting the complex distributions of interest. Even in two-dimensions, for example, nonlinear correlations in the target distribution constrain the n-step transition kernels to small neighborhoods around the initial point (Figure 1). In order for Markov Chain Monte Carlo to perform well on these contemporary problems we need to be able to engineer Markov kernels that maintain exploration, and hence small autocorrelations, when targeting intricate distributions. The construction of such kernels is greatly eased with the use of measure-preserving maps

THE GEOMETRIC FOUNDATIONS OF HAMILTONIAN MONTE CARLO 7 where f is the density of ̟ with respect to the Lebesgue measure on R n . When targeting complex distributions either ǫ or the support of the indicator will be small and the resulting translations barely perturb the initial state. Example 2. The random scan Gibbs sampler is induced by axis-aligned translations, ti,η : qi → P −1 i (η) i ∼ U{1, . . . , n} η ∼ U[0, 1] , where Pi(qi) = Z qi ∞ ̟(d˜qi |q) is the cumulative distribution function of the ith conditional measure. When the target distribution is strongly correlated, the conditional measures concentrate near the initial q and, as above, the translations are stunted. In order to define a Markov kernel that remains efficient in difficult problems we need measure-preserving maps whose domains are not limited to local exploration. Realizations of Langevin diffusions (Øksendal, 2003), for example, yield measure-preserving maps that diffuse across the entire target distribution. Unfortunately that diffusion tends to expand across the target measures only slowly (Figure 2): for any finite diffusion time the resulting Langevin kernels are localized around the initial point (Figure 3). What we need are more coherent maps that avoid such diffusive behavior. One potential candidate for coherent maps are flows. A flow, {φt}, is a family of isomorphisms parameterized by a time, t, φt : Q → Q, ∀t ∈ R, that form a one-dimensional Lie group on composition, φt ◦ φs = φs+t φ −1 t = φ−t φ0 = IdQ, where IdQ is the natural identity map on Q. Because the inverse of a map is given only by negating t, as the time is increased the resulting φt pushes points away from their initial positions and avoids localized exploration (Figure 4). Our final obstacle is in engineering a flow comprised of measure-preserving maps. Flows are particularly natural on the smooth manifolds of differential geometry, and flows that preserve a given target measure can be engineered on one exceptional class of smooth manifolds known as symplectic manifolds. If we can understand these manifolds

THE GEOMETRIC FOUNDATIONS OF HAMILTONIAN MONTE CARLO 9 q t Flow Diffusion Fig 4. Because of the underlying group structure flows cannot double back on themselves like diffusions, forcing a coherent exploration of the target space. probabilistically then we can take advantage of their properties to build Markov kernels with small autocorrelations for even the complex, high-dimensional target distributions of practical interest. 2. MEASURES ON MANIFOLDS In this section we review probability measures on smooth manifolds of increasing sophistication, culminating in the construction of measure-preserving flows. Although we will relate each result to probabilistic theory and introduce intuition where we can, the formal details in the following require a working knowledge of differential geometry up to Lee (2013). We will also use the notation therein throughout the paper. For readers new to the subject but interested in learning more, we recommend the introduction in Baez and Muniain (1994), the applications in Schutz (1980); Jos´e and Saletan (1998), and then finally Lee (2013). The theory of symplectic geometry in which we will be particularly interested is reviewed in Schutz (1980); Jos´e and Saletan (1998); Lee (2013), with Cannas da Silva (2001) providing the most modern and thorough coverage of the subject. Smooth manifolds generalize the Euclidean space of real numbers and the corresponding calculus; in particular, a smooth manifold need only look locally like a Euclidean space (Figures 5). This more general space includes Lie groups, Stiefel manifolds, and other spaces becoming common in contemporary applications (Byrne and Girolami, 2013), not to mention regular Euclidean space as a special case. It does not, however, include any manifold with a discrete topology such as tree spaces. Formally, we assume that our sample space, Q, satisfies the properties of a smooth, connected, and orientable n-dimensional manifold. Specifically we require that Q be a Hausdorff and second-countable topological space that is locally homeomorphic to R n and