Submitted to Statistical Science

The Geometric Foundations of Hamiltonian Monte Carlo

Michael Betancourt, Simon Byrne, Sam Livingstone, and Mark Girolami

Abstract. Although Hamiltonian Monte Carlo has proven an empirical success, the lack of a rigorous theoretical understanding of the algorithm has in many ways impeded both principled developments of the method and use of the algorithm in practice. In this paper we develop the formal foundations of the algorithm through the construction of measures on smooth manifolds, and demonstrate how the theory naturally identifies efficient implementations and motivates promising generalizations.

Key words and phrases: Markov Chain Monte Carlo, Hamiltonian Monte Carlo, Disintegration, Differential Geometry, Smooth Manifold, Fiber Bundle, Riemannian Geometry, Symplectic Geometry.

The frontier of Bayesian inference requires algorithms capable of fitting complex models with hundreds, if not thousands, of parameters, intricately bound together with nonlinear and often hierarchical correlations. Hamiltonian Monte Carlo (Duane et al., 1987; Neal, 2011) has proven tremendously successful at extracting inferences from these models, with applications spanning computer science (Sutherland, Póczos and Schneider, 2013; Tang, Srivastava and Salakhutdinov, 2013), ecology (Schofield et al., 2014; Terada, Inoue and Nishihara, 2013), epidemiology (Cowling et al., 2012), linguistics (Husain, Vasishth and Srinivasan, 2014), pharmacokinetics (Weber et al., 2014), physics (Jasche et al., 2010; Porter and Carré, 2014; Sanders, Betancourt and Soderberg, 2014; Wang et al., 2014), and political science (Ghitza and Gelman, 2014), to name a few. Despite such widespread empirical success, however, there remains an air of mystery concerning the efficacy of the algorithm. This lack of understanding not only limits the adoption of Hamiltonian Monte Carlo but may also foster unprincipled and, ultimately, fragile implementations that restrict the scalability of the algorithm.

Michael Betancourt is a Postdoctoral Research Associate at the University of Warwick, Coventry CV4 7AL, UK (e-mail: betanalpha@gmail.com). Simon Byrne is an EPSRC Postdoctoral Research Fellow at University College London, Gower Street, London, WC1E 6BT. Sam Livingstone is a PhD candidate at University College London, Gower Street, London, WC1E 6BT. Mark Girolami is an EPSRC Established Career Research Fellow at the University of Warwick, Coventry CV4 7AL, UK.
Consider, for example, the Compressible Generalized Hybrid Monte Carlo scheme of Fang, Sanz-Serna and Skeel (2014) and the particular implementation in Lagrangian Dynamical Monte Carlo (Lan et al., 2012). In an effort to reduce the computational burden of the algorithm, the authors sacrifice the costly volume-preserving numerical integrators typical of Hamiltonian Monte Carlo. Although this leads to improved performance in some low-dimensional models, the performance rapidly diminishes with increasing model dimension (Lan et al., 2012), in sharp contrast to standard Hamiltonian Monte Carlo. Clearly, the volume-preserving numerical integrator is somehow critical to scalable performance; but why?

In this paper we develop the theoretical foundation of Hamiltonian Monte Carlo in order to answer questions like these. We demonstrate how a formal understanding naturally identifies the properties critical to the success of the algorithm, hence immediately providing a framework for robust implementations. Moreover, we discuss how the theory motivates several generalizations that may extend the success of Hamiltonian Monte Carlo to an even broader array of applications.

We begin by considering the properties of efficient Markov kernels and possible strategies for constructing those kernels. This construction motivates the use of tools in differential geometry, and we continue by curating a coherent theory of probabilistic measures on smooth manifolds. In the penultimate section we show how that theory provides a skeleton for the development, implementation, and formal analysis of Hamiltonian Monte Carlo. Finally, we discuss how this formal perspective directs generalizations of the algorithm.

Without a familiarity with differential geometry a complete understanding of this work will be a challenge, and we recommend that readers without a background in the subject only scan through Section 2 to develop some intuition for the probabilistic interpretation of forms, fiber bundles, Riemannian metrics, and symplectic forms, as well as the utility of Hamiltonian flows. For those readers interested in developing new implementations of Hamiltonian Monte Carlo we recommend a more careful reading of these sections, and we suggest introductory literature on the mathematics necessary to do so in the introduction of Section 2.

1. CONSTRUCTING EFFICIENT MARKOV KERNELS

Bayesian inference is conceptually straightforward: the information about a system is first modeled with the construction of a posterior distribution, and then statistical questions can be answered by computing expectations with respect to that distribution. Many of the limitations of Bayesian inference arise not in the modeling of a posterior distribution but rather in computing the subsequent expectations. Because it provides a generic means of estimating these expectations, Markov Chain Monte Carlo has been critical to the success of the Bayesian methodology in practice.

In this section we first review the Markov kernels intrinsic to Markov Chain Monte Carlo and then consider the dynamical systems perspective to motivate a strategy for constructing Markov kernels that yield computationally efficient inferences.
1.1 Markov Kernels

Consider a probability space,
\[
(Q, \mathcal{B}(Q), \varpi),
\]
with an $n$-dimensional sample space, $Q$, the Borel $\sigma$-algebra over $Q$, $\mathcal{B}(Q)$, and a distinguished probability measure, $\varpi$. In a Bayesian application, for example, the distinguished measure would be the posterior distribution and our ultimate goal would be the estimation of expectations with respect to the posterior, $\mathbb{E}_\varpi[f]$.

A Markov kernel, $\tau$, is a map from an element of the sample space and the $\sigma$-algebra to a probability,
\[
\tau : Q \times \mathcal{B}(Q) \to [0, 1],
\]
such that the kernel is a measurable function in the first argument,
\[
\tau(\cdot, A) : Q \to [0, 1], \quad \forall A \in \mathcal{B}(Q),
\]
and a probability measure in the second argument,
\[
\tau(q, \cdot) : \mathcal{B}(Q) \to [0, 1], \quad \forall q \in Q.
\]

By construction the kernel defines a map,
\[
\tau : Q \to \mathcal{P}(Q),
\]
where $\mathcal{P}(Q)$ is the space of probability measures over $Q$; intuitively, at each point in the sample space the kernel defines a probability measure describing how to sample a new point.

By averaging the Markov kernel over all initial points in the state space we can construct a Markov transition from probability measures to probability measures,
\[
T : \mathcal{P}(Q) \to \mathcal{P}(Q),
\]
by
\[
\varpi'(A) = (\varpi T)(A) = \int_Q \tau(q, A)\, \varpi(\mathrm{d}q), \quad \forall A \in \mathcal{B}(Q).
\]

When the transition is aperiodic, irreducible, Harris recurrent, and preserves the target measure, $\varpi T = \varpi$, its repeated application generates a Markov chain that will eventually explore the entirety of $\varpi$. Correlated samples, $(q_0, q_1, \ldots, q_N)$, from the Markov chain yield Markov Chain Monte Carlo estimators of any expectation (Roberts et al., 2004; Meyn and Tweedie, 2009). Formally, for any integrable function $f \in L^1(Q, \varpi)$ we can construct estimators,
\[
\hat{f}_N(q_0) = \frac{1}{N} \sum_{n=0}^{N} f(q_n),
\]
that are asymptotically consistent for any initial $q_0 \in Q$,
\[
\hat{f}_N(q_0) \xrightarrow{\;P\;} \mathbb{E}_\varpi[f], \quad N \to \infty.
\]
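To make the estimator concrete, the following minimal Python sketch iterates a generic kernel and accumulates $\hat{f}_N(q_0)$. The function `kernel_sample` is a placeholder of our own, standing in for any sampler that draws from $\tau(q, \cdot)$; it is not an object defined in the paper.

```python
import numpy as np

def chain_estimator(kernel_sample, f, q0, N, seed=1):
    """Run a Markov chain for N transitions and return the estimator f_hat_N(q0).

    kernel_sample(q, rng) is assumed to draw a new state from tau(q, .);
    the normalization 1/N over N + 1 terms follows the text's convention.
    """
    rng = np.random.default_rng(seed)
    q = np.asarray(q0, dtype=float)
    total = f(q)                      # n = 0 term
    for _ in range(N):
        q = kernel_sample(q, rng)     # one application of the Markov kernel
        total += f(q)
    return total / N
```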
Here $\delta_q$ is the Dirac measure that concentrates on $q$,
\[
\delta_q(A) =
\begin{cases}
0, & q \notin A \\
1, & q \in A
\end{cases},
\quad q \in Q,\ A \in \mathcal{B}(Q).
\]

In practice we are interested not just in Markov chains that explore the target distribution as $N \to \infty$ but in Markov chains that can explore and yield precise Markov Chain Monte Carlo estimators in only a finite number of transitions. From this perspective the efficiency of a Markov chain can be quantified in terms of the autocorrelation, which measures the dependence of any square integrable test function, $f \in L^2(Q, \varpi)$, before and after the application of the Markov transition,
\[
\rho[f] \equiv
\frac{\int f(q_1)\, f(q_2)\, \tau(q_1, \mathrm{d}q_2)\, \varpi(\mathrm{d}q_1)
      - \int f(q_2)\, \varpi(\mathrm{d}q_2) \int f(q_1)\, \varpi(\mathrm{d}q_1)}
     {\int f^2(q)\, \varpi(\mathrm{d}q) - \left( \int f(q)\, \varpi(\mathrm{d}q) \right)^2}.
\]

In the best case the Markov kernel reproduces the target measure,
\[
\tau(q, A) = \varpi(A), \quad \forall q \in Q,
\]
and the autocorrelation vanishes for all test functions, $\rho[f] = 0$. Alternatively, a Markov kernel restricted to a Dirac measure at the initial point,
\[
\tau(q, A) = \delta_q(A),
\]
moves nowhere and the autocorrelations saturate for any test function, $\rho[f] = 1$. Note that we are disregarding anti-autocorrelated chains, whose performance is highly sensitive to the particular $f$ under consideration.

Given a target measure, any Markov kernel will lie in between these two extremes; the more of the target measure a kernel explores the smaller the autocorrelations, while the more localized the exploration to the initial point the larger the autocorrelations. Unfortunately, common Markov kernels like Gaussian Random Walk Metropolis (Robert and Casella, 1999) and the Gibbs sampler (Geman and Geman, 1984; Gelfand and Smith, 1990) degenerate into local exploration, and poor efficiency, when targeting the complex distributions of interest. Even in two dimensions, for example, nonlinear correlations in the target distribution constrain the $n$-step transition kernels to small neighborhoods around the initial point (Figure 1).

In order for Markov Chain Monte Carlo to perform well on these contemporary problems we need to be able to engineer Markov kernels that maintain exploration, and hence small autocorrelations, when targeting intricate distributions. The construction of such kernels is greatly eased with the use of measure-preserving maps.
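In practice $\rho[f]$ can be estimated empirically from the output of a stationary chain by replacing the integrals over $\varpi$ and $\tau$ with averages over successive draws. The short Python sketch below does exactly that for a user-supplied test function; the function names are ours, not the paper's.

```python
import numpy as np

def estimate_rho(chain, f):
    """Empirical lag-1 autocorrelation of f along a (stationary) chain,
    a sample-based estimate of rho[f] from the text."""
    fx = np.array([f(q) for q in chain])
    fbar, var = fx.mean(), fx.var()
    cov = np.mean((fx[:-1] - fbar) * (fx[1:] - fbar))
    return cov / var
```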
[Figure 1: four panels showing the $n$-step kernels (a) $\delta_{q_0} T^1$ (Metropolis), (b) $\delta_{q_0} T^{10}$ (Metropolis), (c) $\delta_{q_0} T^1$ (Metropolis-within-Gibbs), and (d) $\delta_{q_0} T^{10}$ (Metropolis-within-Gibbs).]

Fig 1. Both (a, b) Random Walk Metropolis and (c, d) the Gibbs sampler are stymied by complex distributions, for example a warped Gaussian distribution (Haario, Saksman and Tamminen, 2001) on the sample space $Q = \mathbb{R}^2$, here represented with a 95% probability contour. Even when optimally tuned (Roberts et al., 1997), both Random Walk Metropolis and Random Walk Metropolis-within-Gibbs kernels concentrate around the initial point, even after multiple iterations.
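For readers who want to reproduce experiments of this kind, the warped (or "twisted") Gaussian of Haario, Saksman and Tamminen admits a simple closed-form log density. The sketch below assumes the common parameterization with warping parameter $b$, which is our choice and is not specified in the caption.

```python
import numpy as np

def twisted_gaussian_logpdf(q, b=0.1):
    """Unnormalized log density of a twisted Gaussian on R^2: a Gaussian with
    variances (100, 1) whose second coordinate is warped by b * q1^2."""
    q1, q2 = q
    return -q1**2 / 200.0 - 0.5 * (q2 + b * q1**2 - 100.0 * b)**2
```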
1.2 Markov Kernels Induced From Measure-Preserving Maps

Directly constructing a Markov kernel that targets $\varpi$, let alone an efficient Markov kernel, can be difficult. Instead of constructing a kernel directly, however, we can construct one indirectly by defining a family of measure-preserving maps (Petersen, 1989). Formally, let $\Gamma$ be some space of continuous, bijective maps, or isomorphisms, from the sample space into itself,
\[
t : Q \to Q, \quad \forall t \in \Gamma,
\]
each of which preserves the target measure,
\[
t_* \varpi = \varpi,
\]
where the pushforward measure, $t_* \varpi$, is defined as
\[
(t_* \varpi)(A) \equiv \left( \varpi \circ t^{-1} \right)(A), \quad \forall A \in \mathcal{B}(Q).
\]

If we can define a $\sigma$-algebra, $\mathcal{G}$, over this space, then the choice of a distinguished measure over $\mathcal{G}$, $\gamma$, defines a probability space,
\[
(\Gamma, \mathcal{G}, \gamma),
\]
which induces a Markov kernel by
\[
\tau(q, A) \equiv \int_\Gamma \gamma(\mathrm{d}t)\, \mathbb{I}_A(t(q)), \tag{1}
\]
where $\mathbb{I}$ is the indicator function,
\[
\mathbb{I}_A(q) =
\begin{cases}
0, & q \notin A \\
1, & q \in A
\end{cases},
\quad q \in Q,\ A \in \mathcal{B}(Q).
\]

In other words, the kernel assigns a probability to a set, $A \in \mathcal{B}(Q)$, by computing the measure of the preimage of that set, $t^{-1}(A)$, averaged over all isomorphisms in $\Gamma$. Because each $t$ preserves the target measure, so too will their convolution and, consequently, the Markov transition induced by the kernel.

This construction provides a new perspective on the limited performance of existing algorithms.

Example 1. We can consider Gaussian Random Walk Metropolis, for example, as being generated by random, independent translations of each point in the sample space,
\[
t_{\epsilon, \eta} : q \mapsto q + \epsilon \, \mathbb{I}\!\left[ \eta < \frac{f(q + \epsilon)}{f(q)} \right],
\quad \epsilon \sim \mathcal{N}(0, \Sigma), \quad \eta \sim U[0, 1],
\]
where $f$ is the density of $\varpi$ with respect to the Lebesgue measure on $\mathbb{R}^n$. When targeting complex distributions either $\epsilon$ or the support of the indicator will be small and the resulting translations barely perturb the initial state.
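A minimal sketch of the random translation in Example 1, written in terms of the log density to avoid numerical underflow; the argument names (`log_f`, `chol_Sigma`) are ours and purely illustrative. Averaging this map over $\epsilon$ and $\eta$ recovers the kernel of the form (1).

```python
import numpy as np

def rwm_translation(q, log_f, chol_Sigma, rng):
    """One draw of t_{eps, eta}: translate by eps ~ N(0, Sigma) only when
    eta < f(q + eps) / f(q), otherwise leave the point where it is."""
    eps = chol_Sigma @ rng.standard_normal(q.shape)
    eta = rng.uniform()
    return q + eps if np.log(eta) < log_f(q + eps) - log_f(q) else q
```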
Example 2. The random scan Gibbs sampler is induced by axis-aligned translations,
\[
t_{i, \eta} : q_i \mapsto P_i^{-1}(\eta),
\quad i \sim U\{1, \ldots, n\}, \quad \eta \sim U[0, 1],
\]
where
\[
P_i(q_i) = \int_{-\infty}^{q_i} \varpi(\mathrm{d}\tilde{q}_i \mid q)
\]
is the cumulative distribution function of the $i$th conditional measure. When the target distribution is strongly correlated, the conditional measures concentrate near the initial $q$ and, as above, the translations are stunted.

In order to define a Markov kernel that remains efficient in difficult problems we need measure-preserving maps whose domains are not limited to local exploration. Realizations of Langevin diffusions (Øksendal, 2003), for example, yield measure-preserving maps that diffuse across the entire target distribution. Unfortunately that diffusion tends to expand across the target measure only slowly (Figure 2): for any finite diffusion time the resulting Langevin kernels are localized around the initial point (Figure 3). What we need are more coherent maps that avoid such diffusive behavior.

One potential candidate for such coherent maps is a flow. A flow, $\{\phi_t\}$, is a family of isomorphisms parameterized by a time, $t$,
\[
\phi_t : Q \to Q, \quad \forall t \in \mathbb{R},
\]
that form a one-dimensional Lie group under composition,
\[
\phi_t \circ \phi_s = \phi_{s+t}, \qquad \phi_t^{-1} = \phi_{-t}, \qquad \phi_0 = \mathrm{Id}_Q,
\]
where $\mathrm{Id}_Q$ is the natural identity map on $Q$. Because the inverse of a map is given only by negating $t$, as the time is increased the resulting $\phi_t$ pushes points away from their initial positions and avoids localized exploration (Figure 4). Our final obstacle is in engineering a flow comprised of measure-preserving maps.

Flows are particularly natural on the smooth manifolds of differential geometry, and flows that preserve a given target measure can be engineered on one exceptional class of smooth manifolds known as symplectic manifolds. If we can understand these manifolds probabilistically then we can take advantage of their properties to build Markov kernels with small autocorrelations for even the complex, high-dimensional target distributions of practical interest.
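As a concrete illustration of such a flow, consider the exact phase-space rotation generated by a simple harmonic oscillator. The example below is our own and merely foreshadows the Hamiltonian flows developed later: it exhibits the group structure above and preserves the Gaussian measure proportional to $\exp(-(q^2 + p^2)/2)$.

```python
import numpy as np

def phi(t, z):
    """Exact flow of H(q, p) = (q^2 + p^2) / 2: a rotation of phase space."""
    q, p = z
    c, s = np.cos(t), np.sin(t)
    return np.array([c * q + s * p, -s * q + c * p])

z0 = np.array([1.0, -0.5])
# Composition adds times, and negating the time inverts the map.
assert np.allclose(phi(0.3, phi(0.7, z0)), phi(1.0, z0))
assert np.allclose(phi(-0.7, phi(0.7, z0)), z0)
```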
Fig 2. Langevin trajectories are, by construction, diffusive, and are just as likely to double back as they are to move forward. Consequently, even as the diffusion time grows, here to $t = 1000$ as the trajectory darkens, realizations of a Langevin diffusion targeting the twisted Gaussian distribution (Haario, Saksman and Tamminen, 2001) only slowly wander away from the initial point.

[Figure 3: three panels showing the kernels $\delta_{q_0} T$ for Langevin diffusions run for (a) $t = 1$, (b) $t = 10$, and (c) $t = 100$.]

Fig 3. Because of the diffusive nature of the underlying maps, Langevin kernels expand very slowly with increasing diffusion time, $t$. For any reasonable diffusion time the resulting kernels will concentrate around the initial point, as seen here for a Langevin diffusion targeting the twisted Gaussian distribution (Haario, Saksman and Tamminen, 2001).
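The diffusive exploration illustrated in Figures 2 and 3 is easy to reproduce. The sketch below is an Euler-Maruyama discretization of an overdamped Langevin diffusion; it is our own illustration, and note that it is the exact diffusion, not this discretization, that preserves the target measure.

```python
import numpy as np

def langevin_path(grad_log_f, q0, t_total, dt, seed=1):
    """Simulate dq = (1/2) grad log f(q) dt + dW up to time t_total."""
    rng = np.random.default_rng(seed)
    q = np.array(q0, dtype=float)
    path = [q.copy()]
    for _ in range(int(t_total / dt)):
        q = q + 0.5 * dt * grad_log_f(q) + np.sqrt(dt) * rng.standard_normal(q.shape)
        path.append(q.copy())
    return np.array(path)
```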
[Figure 4: a schematic in the $(q, t)$ plane contrasting a coherent flow trajectory with a diffusive one.]

Fig 4. Because of the underlying group structure flows cannot double back on themselves like diffusions, forcing a coherent exploration of the target space.

2. MEASURES ON MANIFOLDS

In this section we review probability measures on smooth manifolds of increasing sophistication, culminating in the construction of measure-preserving flows. Although we will relate each result to probabilistic theory and introduce intuition where we can, the formal details in the following require a working knowledge of differential geometry up to Lee (2013); we will also use the notation therein throughout the paper. For readers new to the subject but interested in learning more, we recommend the introduction in Baez and Muniain (1994), the applications in Schutz (1980) and José and Saletan (1998), and then finally Lee (2013). The theory of symplectic geometry in which we will be particularly interested is reviewed in Schutz (1980), José and Saletan (1998), and Lee (2013), with Cannas da Silva (2001) providing the most modern and thorough coverage of the subject.

Smooth manifolds generalize the Euclidean space of real numbers and the corresponding calculus; in particular, a smooth manifold need only look locally like a Euclidean space (Figure 5). This more general space includes Lie groups, Stiefel manifolds, and other spaces becoming common in contemporary applications (Byrne and Girolami, 2013), not to mention regular Euclidean space as a special case. It does not, however, include any manifold with a discrete topology such as tree spaces.

Formally, we assume that our sample space, $Q$, satisfies the properties of a smooth, connected, and orientable $n$-dimensional manifold. Specifically we require that $Q$ be a Hausdorff and second-countable topological space that is locally homeomorphic to $\mathbb{R}^n$ and
equipped with a differential structure,
\[
\{ U_\alpha, \psi_\alpha \}_{\alpha \in I},
\]
consisting of open neighborhoods in $Q$,
\[
U_\alpha \subset Q,
\]
and homeomorphic charts,
\[
\psi_\alpha : U_\alpha \to V_\alpha \subset \mathbb{R}^n,
\]
whose transition functions are smooth wherever their domains overlap (Figure 5),
\[
\psi_\beta \circ \psi_\alpha^{-1} \in C^\infty(\mathbb{R}^n), \quad \forall \alpha, \beta \mid U_\alpha \cap U_\beta \neq \emptyset.
\]

[Figure 5: (a) the cylinder $Q = S^1 \times \mathbb{R}$; (b) two neighborhoods $U_1, U_2 \subset Q$ covering it; (c) their images $\psi_1(U_1), \psi_2(U_2) \subset \mathbb{R}^2$.]

Fig 5. (a) The cylinder, $Q = S^1 \times \mathbb{R}$, is a nontrivial example of a manifold. Although not globally equivalent to a Euclidean space, (b) the cylinder can be covered in two neighborhoods (c) that are themselves isomorphic to an open neighborhood in $\mathbb{R}^2$. The manifold becomes smooth when $\psi_1 \circ \psi_2^{-1} : \mathbb{R}^2 \to \mathbb{R}^2$ is a smooth function wherever the two neighborhoods intersect (the intersections here shown in gray).

Coordinates subordinate to a chart,
\[
q^i : U_\alpha \to \mathbb{R}, \qquad q^i = \pi_i \circ \psi_\alpha,
\]
where $\pi_i$ is the $i$th Euclidean projection on the image of $\psi_\alpha$, provide local parameterizations of the manifold convenient for explicit calculations.
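To make the chart formalism concrete, here is a small Python sketch of two charts on the cylinder of Figure 5 and their transition map; the specific neighborhoods and coordinate conventions are our own illustration, not taken from the figure.

```python
import numpy as np

# A point on the cylinder Q = S^1 x R is stored as (theta, z) with theta in [0, 2*pi).

def psi_1(theta, z):
    """Chart on U_1: the cylinder minus the line theta = 0."""
    assert 0.0 < theta < 2.0 * np.pi
    return np.array([theta, z])

def psi_2(theta, z):
    """Chart on U_2: the cylinder minus the line theta = pi, image in (-pi, pi) x R."""
    assert not np.isclose(theta, np.pi)
    return np.array([theta if theta < np.pi else theta - 2.0 * np.pi, z])

def transition(x):
    """psi_2 o psi_1^{-1} on the overlap: a shift by 0 or -2*pi on each connected
    component, hence smooth, as the differential structure requires."""
    theta, z = x
    return psi_2(theta, z)
```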