DYNAMIC INFERENCE IN PROBABILISTIC GRAPHICAL MODELS

WEIMING FENG, KUN HE, XIAOMING SUN, AND YITONG YIN

Abstract. Probabilistic graphical models, such as Markov random fields (MRFs), are useful for describing high-dimensional distributions in terms of local dependence structures. Probabilistic inference is a fundamental problem related to graphical models, and sampling is a main approach to the problem. In this paper, we study probabilistic inference problems when the graphical model itself is changing dynamically with time. Such dynamic inference problems arise naturally in today's applications, e.g. multivariate time-series data and practical learning procedures.

We give a dynamic algorithm for sampling-based probabilistic inferences in MRFs, where each dynamic update can change the underlying graph and all parameters of the MRF simultaneously, as long as the total amount of changes is bounded. Suppose that the MRF has n variables and bounded maximum degree, and N(n) independent samples are sufficient for the inference for a polynomial function N(·). Our algorithm dynamically maintains an answer to the inference problem using Õ(nN(n)) space cost, and Õ(N(n) + n) incremental time cost upon each update to the MRF, as long as the Dobrushin-Shlosman condition is satisfied by the MRFs. This well-known condition has long been used for guaranteeing the efficiency of Markov chain Monte Carlo (MCMC) sampling in the traditional static setting. Compared to the static case, which requires Ω(nN(n)) time cost for redrawing all N(n) samples whenever the MRF changes, our dynamic algorithm gives a significant Ω̃(min{n, N(n)})-factor speedup. Our approach relies on a novel dynamic sampling technique, which transforms traditional local Markov chains (a.k.a. single-site dynamics) to dynamic sampling algorithms, in a scenario where all parameters of the graphical model are subject to simultaneous update at each time.
(Weiming Feng, Yitong Yin) State Key Laboratory for Novel Software Technology, Nanjing University. E-mail: fengwm@smail.nju.edu.cn, yinyt@nju.edu.cn
(Kun He) Shenzhen University; Shenzhen Institute of Computing Sciences. E-mail: hekun.threebody@foxmail.com
(Xiaoming Sun) CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences. E-mail: sunxiaoming@ict.ac.cn
This research is supported by the National Key R&D Program of China 2018YFB1003202 and the National Science Foundation of China under Grant Nos. 61722207 and 61672275.
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
 1.1 Our results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
 1.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
 1.3 Organization of the paper . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Dynamic inference problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
 2.1 Markov random fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
 2.2 Probabilistic inference and sampling . . . . . . . . . . . . . . . . . . . . . 3
 2.3 Dynamic inference problem . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
5 Outlines of algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
6 Dynamic Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
 6.1 Coupling for dynamic instances . . . . . . . . . . . . . . . . . . . . . . . . 10
 6.2 Data structure for Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . 17
 6.3 Single-sample dynamic Gibbs sampling algorithm . . . . . . . . . . . . . . 18
 6.4 Multi-sample dynamic Gibbs sampling algorithm . . . . . . . . . . . . . . . 21
7 Proofs for dynamic Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . 24
 7.1 Analysis of the couplings . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
 7.2 Implementation of the algorithms . . . . . . . . . . . . . . . . . . . . . . . 29
 7.3 Dynamic Gibbs sampling for specific models . . . . . . . . . . . . . . . . . 30
8 Proofs for dynamic inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
 8.1 Proof of the main theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
 8.2 Dynamic inference on specific models . . . . . . . . . . . . . . . . . . . . 37
9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1. Introduction

Probabilistic graphical models provide a rich language for describing high-dimensional distributions in terms of the dependence structures between random variables. The Markov random field (MRF) is a basic graphical model that encodes pairwise interactions of complex systems. Given a graph G = (V, E), each vertex v ∈ V is associated with a function φ_v : Q → ℝ, called the vertex potential, on a finite domain Q = [q] of q spin states, and each edge e ∈ E is associated with a symmetric function φ_e : Q² → ℝ, called the edge potential, which describes a pairwise interaction. Together, these induce a probability distribution μ over all configurations σ ∈ Q^V:

μ(σ) ∝ exp(H(σ)) = exp( Σ_{v∈V} φ_v(σ_v) + Σ_{e={u,v}∈E} φ_e(σ_u, σ_v) ).

This distribution μ is known as the Gibbs distribution and H(σ) is the Hamiltonian. It arises naturally from various physical models, statistics or learning problems, and combinatorial problems in computer science [MM09, KFB09].

Probabilistic inference is one of the most fundamental computational problems in graphical models. Some basic inference problems ask to calculate the marginal distribution, conditional distribution, or maximum-a-posteriori probabilities of one or several random variables [WJ08]. Sampling is perhaps the most widely used approach for probabilistic inference. Given a graphical model, independent samples are drawn from the Gibbs distribution, and certain statistics are computed from the samples to give estimates for the inferred quantity. For most typical inference problems, such statistics are easy to compute once the samples are given; for instance, to estimate the marginal distribution on a variable subset S, the statistic is the frequency of each configuration in Q^S among the samples. Thus the cost of inference is dominated by the cost of generating random samples [JVV86, ŠVV09]. The classic probabilistic inference assumes a static setting, where the input graphical model is fixed.
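To make these definitions concrete, the following is a minimal sketch (not the paper's code; the tiny path graph and the potential values are made up for illustration) that evaluates the Hamiltonian H(σ) and the Gibbs probability μ(σ) by brute-force enumeration of Q^V:

```python
import math
from itertools import product

# Hypothetical toy MRF: a path on 3 vertices with q = 2 spin states.
V = [0, 1, 2]
E = [(0, 1), (1, 2)]
Q = [0, 1]
phi_v = {v: {c: 0.1 * c for c in Q} for v in V}  # vertex potentials phi_v : Q -> R
phi_e = {e: {(a, b): (0.5 if a == b else -0.5)   # symmetric edge potentials phi_e : Q^2 -> R
             for a in Q for b in Q} for e in E}

def hamiltonian(sigma):
    """H(sigma) = sum_v phi_v(sigma_v) + sum_{e={u,v}} phi_e(sigma_u, sigma_v)."""
    h = sum(phi_v[v][sigma[v]] for v in V)
    h += sum(phi_e[(u, v)][(sigma[u], sigma[v])] for (u, v) in E)
    return h

def gibbs_prob(sigma):
    """mu(sigma) = exp(H(sigma)) / Z, with the partition function Z summed by brute force."""
    Z = sum(math.exp(hamiltonian(dict(zip(V, tau)))) for tau in product(Q, repeat=len(V)))
    return math.exp(hamiltonian(sigma)) / Z
```

Brute-force enumeration is of course exponential in |V|; it is only meant to pin down the definitions that the sampling-based approach avoids.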
In today's applications, dynamically changing graphical models naturally arise in many scenarios. In various practical algorithms for learning graphical models, e.g. the contrastive divergence algorithm for learning the restricted Boltzmann machine [Hin12] and the iterative proportional fitting algorithm for maximum likelihood estimation of graphical models [WJ08], the optimal model I∗ is obtained by updating the parameters of the graphical model iteratively (usually by gradient descent), which generates a sequence of graphical models I1, I2, . . . , IM, with the goal that IM is a good approximation of I∗. Also, in the study of multivariate time-series data, the dynamic Gaussian graphical models [CW+07], multiregression dynamic models [QS93], dynamic graphical models [QS92], and dynamic chain graph models [AQ+17] are all dynamically changing graphical models and have been used in a variety of applications. Meanwhile, with the advent of Big Data, scalable machine learning systems need to deal with continuously evolving graphical models (see e.g. [RKD+19] and [SWA09]).

Theoretical studies of probabilistic inference in dynamically changing graphical models are lacking. In the aforementioned practical scenarios, it is common that a sequence of graphical models is presented with time, where any two consecutive graphical models can differ from each other in all potentials, but by a small total amount. Recomputing the inference problem from scratch every time the graphical model changes gives a correct solution, but is very wasteful. A fundamental question is whether probabilistic inference can be solved dynamically and efficiently. In this paper, we study the problem of probabilistic inference in an MRF when the MRF itself is changing dynamically with time. At each time, the whole graphical model, including all vertices and edges as well as their potentials, is subject to changes.
Such non-local updates are very general and cover all the applications mentioned above. The problem of dynamic inference then asks to maintain a correct answer to the inference in a dynamically changing MRF, with low incremental cost proportional to the amount of changes made to the graphical model at each time.

1.1. Our results. We give a dynamic algorithm for sampling-based probabilistic inferences. Given an MRF instance with n vertices, suppose that N(n) independent samples are sufficient to give an approximate solution to the inference problem, where N : ℕ⁺ → ℕ⁺ is a polynomial function. We give dynamic algorithms for general inference problems on dynamically changing MRFs.
Suppose that the current MRF has n vertices and bounded maximum degree, and each update to the MRF may change the underlying graph or all vertex/edge potentials, as long as the total amount of changes is bounded. Our algorithm maintains an approximate solution to the inference with Õ(nN(n)) space cost, and with Õ(N(n) + n) incremental time cost upon each update, assuming that the MRFs satisfy the Dobrushin-Shlosman condition [DS85a, DS85b, DS87]. The condition has been widely used to imply the efficiency of Markov chain Monte Carlo (MCMC) sampling (e.g. see [Hay06, DGJ08]). Compared to the static algorithm, which requires Ω(nN(n)) time for redrawing all N(n) samples each time, our dynamic algorithm significantly improves the time cost with an Ω̃(min{n, N(n)})-factor speedup. On specific models, the Dobrushin-Shlosman condition has been established in the literature, which directly gives us the following efficient dynamic inference algorithms, with Õ(nN(n)) space cost and Õ(N(n) + n) time cost per update, on graphs with n vertices and maximum degree ∆ = O(1):
• for the Ising model with temperature β satisfying e^{−2|β|} > 1 − 2/(∆+1), which is close to the uniqueness threshold e^{−2|β_c|} = 1 − 2/∆, beyond which the static versions of the sampling and marginal inference problems for the anti-ferromagnetic Ising model are intractable [GŠV16, GŠV15];
• for the hardcore model with fugacity λ < 2/(∆−2), which matches the best bound known for sampling algorithms with near-linear running time on general graphs with bounded maximum degree [Vig99, LV99, DG00];
• for proper q-coloring with q > 2∆, which matches the best bound known for sampling algorithms with near-linear running time on general graphs with bounded maximum degree [Jer95].
Our dynamic inference algorithm is based on a dynamic sampling algorithm, which efficiently maintains N(n) independent samples for the current MRF while the MRF is subject to changes. More specifically, we give a dynamic version of the Gibbs sampling algorithm, a local Markov chain for sampling from the Gibbs distribution that has been studied extensively.
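The static single-site Gibbs sampler (Glauber dynamics) that the dynamic algorithm builds on can be sketched, for the Ising model, as follows. This is the standard static chain, not the paper's dynamic algorithm; the representation (`neighbors`, `beta`, `h` as plain dictionaries) is an assumption for illustration:

```python
import math
import random

def glauber_step(sigma, neighbors, beta, h, rng):
    """One step of single-site Gibbs sampling for the Ising model:
    pick a uniformly random vertex and resample its spin from the
    conditional distribution given its neighbors' current spins."""
    v = rng.randrange(len(sigma))
    # Local field at v: h_v + sum over neighbors u of beta_{uv} * sigma_u.
    field = h[v] + sum(beta[(min(u, v), max(u, v))] * sigma[u] for u in neighbors[v])
    # Conditional law: P(sigma_v = +1) proportional to exp(+field).
    p_plus = math.exp(field) / (math.exp(field) + math.exp(-field))
    sigma[v] = +1 if rng.random() < p_plus else -1
    return sigma
```

Running sufficiently many such steps (the mixing time, which the Dobrushin-Shlosman condition bounds by O(n log n)) yields one approximate sample; the dynamic algorithm's job is to avoid re-running this chain from scratch after each update.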
Our techniques are based on: (1) couplings for dynamic instances of graphical models; and (2) dynamic data structures for representing single-site Markov chains, so that the couplings can be realized algorithmically in sub-linear time. Both techniques are of independent interest, and can be naturally extended to more general settings with multi-body interactions. Our results show that on dynamically changing graphical models, sampling-based probabilistic inferences can be solved significantly faster than rerunning the static algorithm at each time. This has practical significance in speeding up the iterative procedures for learning graphical models.

1.2. Related work. The problem of dynamic sampling from graphical models was introduced very recently in [FVY19]. There, a dynamic sampling algorithm was given for sampling from graphical models with soft constraints; it can only deal with local updates that change a single vertex or edge at a time. Moreover, the regimes in which such a dynamic sampling algorithm is efficient, with small incremental cost, are much more restrictive than the conditions for the rapid mixing of Markov chains. Our algorithm greatly improves the regimes for efficient dynamic sampling for the Ising and hardcore models in [FVY19], and, for the first time, can handle non-local updates that change all vertex/edge potentials simultaneously. Besides, dynamic/online sampling from log-concave distributions was also studied in [NR17, LMV19]. Another related topic is dynamic graph problems, which ask to maintain a solution (e.g. spanners [FG19, NSWN17, WN17] or shortest paths [BC16, HKN16, HKN14]) while the input graph is dynamically changing. More recently, important progress has been made on dynamically maintaining structures related to graph random walks, such as spectral sparsifiers [DGGP19, ADK+16] or effective resistances [DGGP18, GHP18].
Instead of one particular solution, dynamic inference problems ask to maintain statistics of an exponential-sized probability space described by a dynamically changing graphical model.

1.3. Organization of the paper. In Section 2, we formally introduce the dynamic inference problem. In Section 3, we formally state the main results. Preliminaries are given in Section 4. In Section 5, we outline our dynamic inference algorithm. In Section 6, we present the algorithms for dynamic Gibbs sampling. The analyses of these dynamic sampling algorithms are given in Section 7. The proof of the main theorem on dynamic inference is given in Section 8. The conclusion is given in Section 9.
2. Dynamic inference problem

2.1. Markov random fields. An instance of Markov random field (MRF) is specified by a tuple I = (V, E, Q, Φ), where G = (V, E) is an undirected simple graph; Q is a domain of q = |Q| spin states, for some finite q > 1; and Φ = (φ_a)_{a∈V∪E} associates to each v ∈ V a vertex potential φ_v : Q → ℝ and to each e ∈ E an edge potential φ_e : Q² → ℝ, where φ_e is symmetric. A configuration σ ∈ Q^V maps each vertex v ∈ V to a spin state in Q, so that each vertex can be interpreted as a variable. The Hamiltonian of a configuration σ ∈ Q^V is defined as:

H(σ) ≜ Σ_{v∈V} φ_v(σ_v) + Σ_{e={u,v}∈E} φ_e(σ_u, σ_v).

This defines the Gibbs distribution μ_I, which is a probability distribution over Q^V such that

∀σ ∈ Q^V: μ_I(σ) = (1/Z) exp(H(σ)),

where the normalizing factor Z ≜ Σ_{σ∈Q^V} exp(H(σ)) is called the partition function. The Gibbs measure μ(σ) can be 0, as the functions φ_v, φ_e can take the value −∞. A configuration σ is called feasible if μ(σ) > 0. To trivialize the problem of constructing a feasible configuration, we further assume the following natural condition for the MRF instances considered in this paper:¹

(1) ∀v ∈ V, ∀σ ∈ Q^{Γ_G(v)}: Σ_{c∈Q} exp( φ_v(c) + Σ_{u∈Γ_G(v)} φ_{uv}(σ_u, c) ) > 0,

where Γ_G(v) ≜ {u ∈ V | {u, v} ∈ E} denotes the neighborhood of v in graph G = (V, E). Some well-studied typical MRFs include:
• Ising model: The domain of each spin is Q = {−1, +1}. Each edge e ∈ E is associated with a temperature β_e ∈ ℝ, and each vertex v ∈ V is associated with a local field h_v ∈ ℝ. For each configuration σ ∈ {−1, +1}^V, μ_I(σ) ∝ exp( Σ_{e={u,v}∈E} β_e σ_u σ_v + Σ_{v∈V} h_v σ_v ).
• Hardcore model: The domain is Q = {0, 1}. Each configuration σ ∈ Q^V indicates an independent set in G = (V, E), and μ_I(σ) ∝ λ^{‖σ‖}, where ‖σ‖ ≜ Σ_{v∈V} σ_v counts the occupied vertices and λ > 0 is the fugacity parameter.
• Proper q-coloring: the uniform distribution over all proper q-colorings of G = (V, E).

2.2. Probabilistic inference and sampling.
In graphical models, the task of probabilistic inference is to derive the probabilities regarding one or more random variables of the model. Abstractly, this is described by a function θ : M → ℝ^K that maps each graphical model I ∈ M to a target K-dimensional probability vector, where M is the class of graphical models containing the random variables we are interested in, and the K-dimensional vector describes the probabilities we want to derive. Given θ(·) and an MRF instance I ∈ M, the inference problem asks to estimate the probability vector θ(I). Here are some fundamental inference problems [WJ08] for MRF instances. Let I = (V, E, Q, Φ) be an MRF instance and A, B ⊆ V two disjoint sets where A ⊎ B ⊆ V.
• Marginal inference: estimate the marginal distribution μ_{A,I}(·) of the variables in A, where

∀σ_A ∈ Q^A: μ_{A,I}(σ_A) ≜ Σ_{τ∈Q^{V∖A}} μ_I(σ_A, τ).

• Posterior inference: given any τ_B ∈ Q^B, estimate the posterior distribution μ_{A,I}(· | τ_B) for the variables in A, where

∀σ_A ∈ Q^A: μ_{A,I}(σ_A | τ_B) ≜ μ_{A∪B,I}(σ_A, τ_B) / μ_{B,I}(τ_B).

¹This condition guarantees that the marginal probabilities are always well-defined, and the problem of constructing a feasible configuration σ, where μ_I(σ) > 0, is trivial. The condition holds for all MRFs with soft constraints, or with hard constraints where there is a permissive spin, e.g. the hardcore model. For MRFs with truly repulsive hard constraints such as proper q-coloring, the condition may translate to the condition q ≥ ∆ + 1, where ∆ is the maximum degree of graph G, which is necessary for the irreducibility of local Markov chains for q-colorings.
• Maximum-a-posteriori (MAP) inference: find the maximum-a-posteriori (MAP) probabilities P*_{A,I}(·) for the configurations over A, where
∀σ_A ∈ Q^A: P*_{A,I}(σ_A) ≜ max_{τ_B∈Q^B} µ_{A∪B,I}(σ_A, τ_B).
All these fundamental inference problems can be described abstractly by a function θ : M → ℝ^K. For instance, for marginal inference, M contains all MRF instances where A is a subset of the vertices, K = |Q|^{|A|}, and θ(I) = (µ_{A,I}(σ_A))_{σ_A∈Q^A}; and for posterior or MAP inference, M contains all MRF instances where A ⊎ B is a subset of the vertices, K = |Q|^{|A|}, and θ(I) = (µ_{A,I}(σ_A | τ_B))_{σ_A∈Q^A} (for posterior inference) or θ(I) = (P*_{A,I}(σ_A))_{σ_A∈Q^A} (for MAP inference).
One canonical approach for probabilistic inference is by sampling: sufficiently many independent samples are drawn (approximately) from the Gibbs distribution of the MRF instance, and an estimate of the target probabilities is calculated from these samples. Given a probabilistic inference problem θ(·), we use E_θ(·) to denote an estimating function that approximates θ(I) using independent samples drawn approximately from µ_I. For the aforementioned problems of marginal, posterior and MAP inference, such an estimating function E_θ(·) simply counts the frequency of the samples that satisfy certain properties. The sampling cost of an estimator is captured in two aspects: the number of samples it uses and the accuracy of each individual sample it requires.
Definition 2.1 ((N, ϵ)-estimator for θ). Let θ : M → ℝ^K be a probabilistic inference problem and E_θ(·) an estimating function for θ(·) that, for each instance I = (V, E, Q, Φ) ∈ M, maps samples in Q^V to an estimate of θ(I). Let N : ℕ⁺ → ℕ⁺ and ϵ : ℕ⁺ → (0, 1). For any instance I = (V, E, Q, Φ) ∈ M where n = |V|, the random variable E_θ(X^{(1)}, …, X^{(N(n))}) is said to be an (N, ϵ)-estimator for θ(I) if X^{(1)}, …, X^{(N(n))} ∈ Q^V are N(n) independent samples drawn approximately from µ_I such that d_TV(X^{(j)}, µ_I) ≤ ϵ(n) for all 1 ≤ j ≤ N(n).
2.3.
Dynamic inference problem. We consider the inference problem where the input graphical model is changed dynamically: at each step, the current MRF instance I = (V, E, Q, Φ) is updated to a new instance I′ = (V′, E′, Q, Φ′). We consider general update operations for MRFs that can change both the underlying graph and all edge/vertex potentials simultaneously, where the update request is made by a non-adaptive adversary independently of the randomness used by the inference algorithm. Such updates are general enough to cover many applications, e.g. analyses of time series network data [CW+07, QS93, QS92, AQ+17] and learning algorithms for graphical models [Hin12, WJ08]. The difference between the original and the updated instances is measured as follows.
Definition 2.2 (difference between MRF instances). The difference between two MRF instances I = (V, E, Q, Φ) and I′ = (V′, E′, Q, Φ′), where Φ = (φ_a)_{a∈V∪E} and Φ′ = (φ′_a)_{a∈V′∪E′}, is defined as
(2) d(I, I′) ≜ Σ_{v∈V∩V′} ‖φ_v − φ′_v‖₁ + Σ_{e∈E∩E′} ‖φ_e − φ′_e‖₁ + |V ⊕ V′| + |E ⊕ E′|,
where A ⊕ B = (A \ B) ∪ (B \ A) stands for the symmetric difference between two sets A and B, ‖φ_v − φ′_v‖₁ ≜ Σ_{c∈Q} |φ_v(c) − φ′_v(c)|, and ‖φ_e − φ′_e‖₁ ≜ Σ_{c,c′∈Q} |φ_e(c, c′) − φ′_e(c, c′)|.
Given a probability vector specified by the function θ : M → ℝ^K, the dynamic inference problem asks to maintain an estimator θ̂(I) of θ(I) for the current MRF instance I = (V, E, Q, Φ) ∈ M, with a data structure, such that when I is updated to I′ = (V′, E′, Q, Φ′) ∈ M, the algorithm updates θ̂(I) to an estimator θ̂(I′) of the new vector θ(I′), or equivalently, outputs the difference between the estimators θ̂(I) and θ̂(I′).
It is desirable to have a dynamic inference algorithm which maintains an (N, ϵ)-estimator for θ(I) for the current instance I.
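The difference measure d(I, I′) of Definition 2.2 is directly computable. Below is a hedged Python sketch; the container representation of an instance (sets for V and E, dicts of potential tables) is our own choice for illustration, not from the paper.

```python
# Hedged sketch of the difference measure d(I, I') from Definition 2.2.
# An instance is modelled as (V, E, phi_v, phi_e): vertex set, edge set
# (frozensets of endpoints), and dicts mapping vertices/edges to potential
# tables. These container choices are illustrative only.

def l1(p, q):
    """Entrywise L1 distance between two equally-indexed potential tables."""
    return sum(abs(p[k] - q[k]) for k in p)

def difference(inst, inst2):
    V, E, phi_v, phi_e = inst
    V2, E2, phi_v2, phi_e2 = inst2
    d = sum(l1(phi_v[v], phi_v2[v]) for v in V & E2 - E2 | V & V2)  # placeholder? no:
    return d

# (See corrected body below.)
def difference(inst, inst2):
    V, E, phi_v, phi_e = inst
    V2, E2, phi_v2, phi_e2 = inst2
    d = sum(l1(phi_v[v], phi_v2[v]) for v in V & V2)   # vertex potential terms
    d += sum(l1(phi_e[e], phi_e2[e]) for e in E & E2)  # edge potential terms
    d += len(V ^ V2) + len(E ^ E2)                     # symmetric differences
    return d

# Toy check: one vertex potential perturbed by 0.5, one vertex and one edge added.
Q = (-1, +1)
V = {1, 2}; E = {frozenset({1, 2})}
phi_v = {1: {-1: 0.0, +1: 0.0}, 2: {-1: 0.0, +1: 0.0}}
phi_e = {frozenset({1, 2}): {(a, b): 0.1 * a * b for a in Q for b in Q}}
V2 = {1, 2, 3}; E2 = {frozenset({1, 2}), frozenset({2, 3})}
phi_v2 = {1: {-1: 0.5, +1: 0.0}, 2: {-1: 0.0, +1: 0.0}, 3: {-1: 0.0, +1: 0.0}}
phi_e2 = dict(phi_e)
phi_e2[frozenset({2, 3})] = {(a, b): 0.0 for a in Q for b in Q}
print(difference((V, E, phi_v, phi_e), (V2, E2, phi_v2, phi_e2)))  # 0.5 + 0 + 1 + 1 = 2.5
```

The set operators `&` and `^` compute exactly the intersections and symmetric differences appearing in Equation (2).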
However, the dynamic algorithm cannot be efficient if N(n) and ϵ(n) change drastically with n (so that significantly more samples or substantially more accurate samples may be needed when a new vertex is added), or if recalculating the estimating function E_θ(·) itself is expensive. We introduce a notion of dynamical efficiency for the estimators that are suitable for dynamic inference.
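As a concrete illustration of why recalculation can be avoided, a frequency-counting estimator for marginal inference can be maintained incrementally: when a sample changes at a few vertices, only its old and new keys in a counter need adjusting. The following is a hedged Python sketch; the class and method names (`MarginalEstimator`, `update_sample`) are ours, not from the paper.

```python
# Hedged sketch of a frequency-counting marginal estimator supporting cheap
# incremental updates, in the spirit of the "small incremental cost" property.
from collections import Counter

class MarginalEstimator:
    """Estimates the marginal of the variables in A by sample frequencies."""
    def __init__(self, A, samples):
        self.A = tuple(A)
        self.samples = [dict(s) for s in samples]   # each sample: vertex -> spin
        self.counts = Counter(self._key(s) for s in self.samples)

    def _key(self, sample):
        return tuple(sample[v] for v in self.A)

    def update_sample(self, i, changes):
        """Apply {vertex: new_spin} to sample i; cost O(|changes| + |A|)."""
        self.counts[self._key(self.samples[i])] -= 1
        self.samples[i].update(changes)
        self.counts[self._key(self.samples[i])] += 1

    def estimate(self, sigma_A):
        return self.counts[tuple(sigma_A)] / len(self.samples)

# Toy usage: one binary variable, four samples.
est = MarginalEstimator(A=("u",), samples=[{"u": 0}, {"u": 0}, {"u": 1}, {"u": 1}])
print(est.estimate((0,)))          # 0.5
est.update_sample(0, {"u": 1})     # one sample flips: only O(1) counter work
print(est.estimate((1,)))          # 0.75
```

The point of the sketch is that an update touching D sample entries costs time roughly proportional to D, rather than to the total size of the sample sequence.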
Definition 2.3 (dynamical efficiency). Let N : ℕ⁺ → ℕ⁺ and ϵ : ℕ⁺ → (0, 1). Let E(·) be an estimating function for some K-dimensional probability vector of MRF instances. A tuple (N, ϵ, E) is said to be dynamically efficient if it satisfies:
• (bounded difference) there exist constants C₁, C₂ > 0 such that for any n ∈ ℕ⁺,
|N(n + 1) − N(n)| ≤ C₁ · N(n)/n and |ϵ(n + 1) − ϵ(n)| ≤ C₂ · ϵ(n)/n;
• (small incremental cost) there is a deterministic algorithm that maintains E(X^{(1)}, …, X^{(m)}) using (mn + K) · polylog(mn) bits, where X^{(1)}, …, X^{(m)} ∈ Q^V and n = |V|, such that when X^{(1)}, …, X^{(m)} ∈ Q^V are updated to Y^{(1)}, …, Y^{(m′)} ∈ Q^{V′}, where n′ = |V′|, the algorithm updates E(X^{(1)}, …, X^{(m)}) to E(Y^{(1)}, …, Y^{(m′)}) within time cost D · polylog(mm′nn′) + O(m + m′), where D is the size of the difference between the two sample sequences, defined as
(3) D ≜ Σ_{i≤max{m,m′}} Σ_{v∈V∪V′} 1[X^{(i)}(v) ≠ Y^{(i)}(v)],
where an unassigned X^{(i)}(v) or Y^{(i)}(v) is not equal to any assigned spin.
Dynamical efficiency basically asks N(·), ϵ(·), and E(·) to have some sort of "Lipschitz" property. To satisfy the bounded difference condition, N(n) and 1/ϵ(n) are necessarily polynomially bounded, and can be any constant, polylogarithmic, or polynomial functions, or products of such functions. The small incremental cost condition also holds very commonly. In particular, it is satisfied by the estimating functions for all the aforementioned problems of marginal, posterior and MAP inference, as long as the sets of variables have sizes |A|, |B| ∈ O(log n). We remark that the O(log n) upper bound is in some sense necessary for the efficiency of inference, because otherwise the dimension of θ(I) itself (which is at least q^{|A|}) becomes super-polynomial in n.

3. Main results

Let I = (V, E, Q, Φ) be an MRF instance, where G = (V, E). Let Γ_G(v) denote the neighborhood of v in G.
For any vertex v ∈ V and any configuration σ ∈ Q^{Γ_G(v)}, we use µ^σ_{v,I}(·) = µ_{v,I}(· | σ) to denote the marginal distribution on v conditional on σ:
∀c ∈ Q: µ^σ_{v,I}(c) = µ_{v,I}(c | σ) ≜ exp( φ_v(c) + Σ_{u∈Γ_G(v)} φ_{uv}(σ_u, c) ) / Σ_{a∈Q} exp( φ_v(a) + Σ_{u∈Γ_G(v)} φ_{uv}(σ_u, a) ).
Due to the assumption in (1), the marginal distribution is always well-defined. The following condition is the Dobrushin-Shlosman condition [DS85a, DS85b, DS87, Hay06, DGJ08].
Condition 3.1 (Dobrushin-Shlosman condition). Let I = (V, E, Q, Φ) be an MRF instance with Gibbs distribution µ = µ_I. Let A_I ∈ ℝ^{V×V}_{≥0} be the influence matrix, defined as
A_I(u, v) ≜ max_{(σ,τ)∈B_{u,v}} d_TV(µ^σ_v, µ^τ_v) if {u,v} ∈ E, and A_I(u, v) ≜ 0 if {u,v} ∉ E,
where the maximum is taken over the set B_{u,v} of all pairs (σ, τ) ∈ Q^{Γ_G(v)} × Q^{Γ_G(v)} that differ only at u, and d_TV(µ^σ_v, µ^τ_v) ≜ ½ Σ_{c∈Q} |µ^σ_v(c) − µ^τ_v(c)| is the total variation distance between µ^σ_v and µ^τ_v. An MRF instance I is said to satisfy the Dobrushin-Shlosman condition if there is a constant δ > 0 such that
max_{u∈V} Σ_{v∈V} A_I(u, v) ≤ 1 − δ.
Our main theorem assumes the following setup: Let θ : M → ℝ^K be a probabilistic inference problem that maps each MRF instance in M to a K-dimensional probability vector, and let E_θ be its estimating function. Let N : ℕ⁺ → ℕ⁺ and ϵ : ℕ⁺ → (0, 1). We use I = (V, E, Q, Φ) ∈ M, where n = |V|, to denote the current instance and I′ = (V′, E′, Q, Φ′) ∈ M, where n′ = |V′|, to denote the updated instance.
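For intuition, Condition 3.1 can be verified numerically on small instances by brute-forcing the influence matrix entries. The following hedged Python sketch does this for a uniform Ising model with zero fields; the helper names (`conditional`, `influence`) are ours, not from the paper.

```python
# Hedged numerical check of Condition 3.1 on a small uniform Ising model:
# brute-force each influence entry A(u, v) for an edge {u, v} by enumerating
# all boundary pairs on the neighborhood of v differing only at u.
from itertools import product
from math import exp

def conditional(v_neighbors, sigma, beta, h=0.0):
    """Marginal at v given its neighbors' spins (Ising, Q = {-1, +1})."""
    w = {c: exp(h * c + sum(beta * sigma[u] * c for u in v_neighbors))
         for c in (-1, +1)}
    Z = sum(w.values())
    return {c: w[c] / Z for c in (-1, +1)}

def influence(u, v, neighbors, beta):
    """A(u, v): worst-case TV distance over boundary pairs differing at u."""
    others = [w for w in neighbors[v] if w != u]
    best = 0.0
    for assign in product((-1, +1), repeat=len(others)):
        base = dict(zip(others, assign))
        mu = conditional(neighbors[v], {**base, u: +1}, beta)
        nu = conditional(neighbors[v], {**base, u: -1}, beta)
        tv = 0.5 * sum(abs(mu[c] - nu[c]) for c in (-1, +1))
        best = max(best, tv)
    return best

# 4-cycle with weak coupling beta = 0.1: check max_u sum_v A(u, v) < 1.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
row_max = max(sum(influence(u, v, neighbors, 0.1) for v in neighbors[u])
              for u in neighbors)
print(row_max < 1)   # True for this beta
```

The enumeration costs |Q|^{deg(v)−1} per entry, so this is only a sanity check for tiny graphs; the paper's regimes in Table 1 give closed-form sufficient conditions instead.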
Theorem 3.2 (dynamic inference algorithm). Assume that (N, ϵ, E_θ) is dynamically efficient, both I and I′ satisfy the Dobrushin-Shlosman condition, and d(I, I′) ≤ L = o(n). There is an algorithm that maintains an (N, ϵ)-estimator θ̂(I) of the probability vector θ(I) for the current MRF instance I, using Õ(nN(n) + K) bits, such that when I is updated to I′, the algorithm updates θ̂(I) to an (N, ϵ)-estimator θ̂(I′) of θ(I′) for the new instance I′, within expected time cost Õ(∆²LN(n) + ∆n), where Õ(·) hides a polylog(n) factor and ∆ = max{∆_G, ∆_{G′}}, with ∆_G and ∆_{G′} denoting the maximum degrees of G = (V, E) and G′ = (V′, E′) respectively.
Typically, the difference between two MRF instances I, I′ is small², and the underlying graphs are sparse [DSOR16], that is, L, ∆ ≤ polylog(n). In such cases, our algorithm updates the estimator within time cost Õ(N(n) + n), which significantly outperforms static sampling-based inference algorithms that require time cost Ω(n′N(n′)) = Ω(nN(n)) for redrawing all N(n′) independent samples.
Dynamic sampling. The core of our dynamic inference algorithm is a dynamic algorithm for sampling: assuming the Dobrushin-Shlosman condition, the algorithm can maintain a sequence of N(n) independent samples X^{(1)}, …, X^{(N(n))} ∈ Q^V that are ϵ(n)-close to µ_I in total variation distance, and when I is updated to I′ with difference d(I, I′) ≤ L = o(n), the algorithm can update the maintained samples to N(n′) independent samples Y^{(1)}, …, Y^{(N(n′))} ∈ Q^{V′} that are ϵ(n′)-close to µ_{I′} in total variation distance, using a time cost Õ(∆²LN(n) + ∆n) in expectation. This dynamic sampling algorithm is formally described in Theorem 6.1 and is of independent interest [FVY19].
Applications on specific models. On specific models, we have the following results, where δ > 0 is an arbitrary constant.
model        regime                         space cost       time cost for each update
Ising        e^{−2|β|} ≥ 1 − (2−δ)/(∆+1)    Õ(nN(n) + K)     Õ(∆²LN(n) + ∆n)
hardcore     λ ≤ (2−δ)/(∆−2)                Õ(nN(n) + K)     Õ(∆³LN(n) + ∆n)
q-coloring   q ≥ (2+δ)∆                     Õ(nN(n) + K)     Õ(∆²LN(n) + ∆n)

Table 1. Dynamic inference for specific models.

The results for the Ising model and q-coloring are corollaries of Theorem 3.2. The regime for the hardcore model is better than the Dobrushin-Shlosman condition (which is λ ≤ (1−δ)/(∆−1)), because we use the coupling introduced by Vigoda [Vig99] to analyze the algorithm.

4. Preliminaries

Total variation distance and coupling. Let µ and ν be two distributions over Ω. The total variation distance between µ and ν is defined as
d_TV(µ, ν) ≜ ½ Σ_{x∈Ω} |µ(x) − ν(x)|.
A coupling of µ and ν is a joint distribution (X, Y) ∈ Ω × Ω such that the marginal distribution of X is µ and the marginal distribution of Y is ν. The following coupling lemma is well known.

²In multivariate time-series data analysis, the MRF instances at two consecutive times are similar. In iterative algorithms for learning graphical models, the difference between two consecutive MRF instances generated by gradient descent is bounded to prevent oscillations. Specifically, the difference is very small when the iterative algorithm is close to convergence [Hin12, WJ08].
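For finite supports, both the total variation distance and an optimal coupling attaining it can be constructed explicitly: place the shared mass min{µ(x), ν(x)} on the diagonal and pair off the residuals arbitrarily. A hedged Python sketch (function names are ours):

```python
# Hedged sketch: total variation distance between two finite distributions,
# and an optimal coupling attaining Pr[X != Y] = d_TV (cf. the coupling lemma).
def d_tv(mu, nu):
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in support)

def optimal_coupling(mu, nu):
    """Return {(x, y): prob}: the diagonal carries the shared mass, the
    residual masses are paired off greedily."""
    support = sorted(set(mu) | set(nu))
    joint = {(x, x): min(mu.get(x, 0.0), nu.get(x, 0.0)) for x in support}
    res_mu = [(x, mu.get(x, 0.0) - joint[(x, x)]) for x in support]
    res_nu = [(y, nu.get(y, 0.0) - joint[(y, y)]) for y in support]
    i = j = 0
    while i < len(res_mu) and j < len(res_nu):
        x, a = res_mu[i]; y, b = res_nu[j]
        m = min(a, b)
        if m > 0:
            joint[(x, y)] = joint.get((x, y), 0.0) + m
        res_mu[i] = (x, a - m); res_nu[j] = (y, b - m)
        if a - m <= 1e-12: i += 1
        if b - m <= 1e-12: j += 1
    return joint

mu = {"a": 0.5, "b": 0.5}
nu = {"a": 0.2, "b": 0.3, "c": 0.5}
joint = optimal_coupling(mu, nu)
disagree = sum(p for (x, y), p in joint.items() if x != y)
print(round(d_tv(mu, nu), 6), round(disagree, 6))  # both 0.5
```

One can check that the row and column marginals of `joint` recover µ and ν, and that the off-diagonal mass equals d_TV(µ, ν), which is exactly the equality case of the coupling lemma.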
Proposition 4.1 (coupling lemma). For any coupling (X, Y) of µ and ν, it holds that
Pr[X ≠ Y] ≥ d_TV(µ, ν).
Furthermore, there is an optimal coupling that achieves equality.

Local neighborhood. Let G = (V, E) be a graph. For any vertex v ∈ V, let Γ_G(v) ≜ {u ∈ V | {u,v} ∈ E} denote the neighborhood of v, and Γ⁺_G(v) ≜ Γ_G(v) ∪ {v} the inclusive neighborhood of v. We simply write Γ_v = Γ(v) = Γ_G(v) and Γ⁺_v = Γ⁺(v) = Γ⁺_G(v) for short when G is clear from the context. We use ∆ = ∆_G ≜ max_{v∈V} |Γ_v| to denote the maximum degree of graph G.
A notion of local neighborhood for MRFs is frequently used. Let I = (V, E, Q, Φ) be an MRF instance. For v ∈ V, we denote by I_v ≜ I[Γ⁺_v] the restriction of I to the inclusive neighborhood Γ⁺_v of v, i.e. I_v = (Γ⁺_v, E_v, Q, Φ_v), where E_v = {{u,v} ∈ E} and Φ_v = (φ_a)_{a∈Γ⁺_v∪E_v}.

Gibbs sampling. Gibbs sampling (a.k.a. heat-bath, Glauber dynamics) is a classic Markov chain for sampling from Gibbs distributions. Let I = (V, E, Q, Φ) be an MRF instance and µ = µ_I its Gibbs distribution. The chain of Gibbs sampling (Algorithm 1) is on the space Ω ≜ Q^V, and has stationary distribution µ_I [LP17, Chapter 3].

Algorithm 1: Gibbs sampling
Initialization: an initial state X₀ ∈ Ω (not necessarily feasible);
1 for t = 1, 2, …, T do
2   pick v_t ∈ V uniformly at random;
3   draw a random value c ∈ Q from the marginal distribution µ_{v_t}(· | X_{t−1}(Γ_{v_t}));
4   X_t(v_t) ← c and X_t(u) ← X_{t−1}(u) for all u ∈ V \ {v_t};

Marginal distributions. Here µ_v(· | σ(Γ_v)) = µ_{v,I}(· | σ(Γ_v)) denotes the marginal distribution at v ∈ V conditioning on σ(Γ_v) ∈ Q^{Γ_v}, which is computed as:
(4) ∀c ∈ Q: µ_v(c | σ(Γ_v)) = exp(φ_v(c)) Π_{u∈Γ_v} exp(φ_{uv}(σ_u, c)) / Σ_{c′∈Q} exp(φ_v(c′)) Π_{u∈Γ_v} exp(φ_{uv}(σ_u, c′)).
Due to the assumption (1), this marginal distribution is always well-defined, and its computation uses only the information of I_v.

Coupling for mixing time. Consider a chain (X_t)_{t≥0} on space Ω with stationary distribution µ_I for an MRF instance I.
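Algorithm 1 translates directly into code. The following is a hedged Python sketch of the Gibbs sampler, using the conditional marginal of Eq. (4); the dict-based representation of potentials and all function names are our own choices, not from the paper.

```python
# Hedged Python sketch of Algorithm 1 (Gibbs sampling) for a generic MRF.
# Potentials are dicts: phi_v[v][c] and phi_e[frozenset({u, v})][(c_u, c_v)]
# hold the (symmetric) log-potentials.
import math
import random

def gibbs_step(X, v, Q, neighbors, phi_v, phi_e):
    """Resample X[v] from the conditional marginal of Eq. (4)."""
    weights = []
    for c in Q:
        s = phi_v[v][c]
        for u in neighbors[v]:
            s += phi_e[frozenset({u, v})][(X[u], c)]
        weights.append(math.exp(s))
    total = sum(weights)
    X[v] = random.choices(Q, weights=[w / total for w in weights])[0]

def gibbs_sampling(V, Q, neighbors, phi_v, phi_e, T):
    X = {v: random.choice(Q) for v in V}   # arbitrary initial state X_0
    for _ in range(T):                     # T single-site update steps
        gibbs_step(X, random.choice(V), Q, neighbors, phi_v, phi_e)
    return X

# Toy run: a 3-vertex path with ferromagnetic Ising potentials, zero fields.
random.seed(0)
V = [0, 1, 2]; Q = (-1, +1)
neighbors = {0: [1], 1: [0, 2], 2: [1]}
phi_v = {v: {-1: 0.0, +1: 0.0} for v in V}
phi_e = {frozenset({0, 1}): {(a, b): 0.2 * a * b for a in Q for b in Q},
         frozenset({1, 2}): {(a, b): 0.2 * a * b for a in Q for b in Q}}
sample = gibbs_sampling(V, Q, neighbors, phi_v, phi_e, T=200)
print(sorted(sample.items()))
```

Each step reads only the local information I_{v_t} (the potentials on v_t and its incident edges), matching the locality used throughout the paper.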
The mixing rate is defined as: for ϵ > 0,
τ_mix(I, ϵ) ≜ max_{X₀} min{ t | d_TV(X_t, µ_I) ≤ ϵ },
where d_TV(X_t, µ_I) denotes the total variation distance between µ_I and the distribution of X_t.
A coupling of a Markov chain is a joint process (X_t, Y_t)_{t≥0} such that (X_t)_{t≥0} and (Y_t)_{t≥0} marginally follow the same transition rule as the original chain. Consider the following type of coupling.
Definition 4.2 (one-step optimal coupling for Gibbs sampling). A coupling (X_t, Y_t)_{t≥0} of Gibbs sampling on an MRF instance I = (V, E, Q, Φ) is a one-step optimal coupling if it is constructed as follows: for t = 1, 2, …,
(1) pick the same random v_t ∈ V, and let (X_t(u), Y_t(u)) ← (X_{t−1}(u), Y_{t−1}(u)) for all u ≠ v_t;
(2) sample (X_t(v_t), Y_t(v_t)) from an optimal coupling D^{σ,τ}_{opt,I_{v_t}}(·, ·) of the marginal distributions µ_{v_t}(· | σ) and µ_{v_t}(· | τ), where σ = X_{t−1}(Γ_{v_t}) and τ = Y_{t−1}(Γ_{v_t}).
The coupling D^{σ,τ}_{opt,I_{v_t}}(·, ·) is an optimal coupling of µ_{v_t}(· | σ) and µ_{v_t}(· | τ), i.e. one that attains the maximum of Pr[x = y] over all couplings (x, y) of x ∼ µ_{v_t}(· | σ) and y ∼ µ_{v_t}(· | τ). The coupling D^{σ,τ}_{opt,I_{v_t}}(·, ·) is determined by the local information I_{v_t} and σ, τ ∈ Q^{deg(v_t)}.
With such a coupling, we can establish the following relation between the Dobrushin-Shlosman condition and rapid mixing of Gibbs sampling [DS85a, DS85b, DS87, BD97, Hay06, DGJ08].
Proposition 4.3 ([Hay06]). Let I = (V, E, Q, Φ) be an MRF instance with n = |V|, and Ω = Q^V the state space. Let H(σ, τ) ≜ |{v ∈ V | σ_v ≠ τ_v}| denote the Hamming distance between σ ∈ Ω and τ ∈ Ω. If I satisfies the Dobrushin-Shlosman condition (Condition 3.1) with constant δ > 0, then the one-step optimal coupling (X_t, Y_t)_{t≥0} for Gibbs sampling (Definition 4.2) satisfies
∀σ, τ ∈ Ω: E[ H(X_t, Y_t) | X_{t−1} = σ ∧ Y_{t−1} = τ ] ≤ (1 − δ/n) · H(σ, τ),
and consequently the mixing rate of Gibbs sampling on I is bounded as τ_mix(I, ϵ) ≤ (n/δ) log(n/ϵ).

5. Outlines of algorithm

Let θ : M → ℝ^K be a probabilistic inference problem that maps each MRF instance in M to a K-dimensional probability vector, and let E_θ be its estimating function. Let I = (V, E, Q, Φ) ∈ M be the current instance, where n = |V|. Our dynamic inference algorithm maintains a sequence of N(n) independent samples X^{(1)}, …, X^{(N(n))} ∈ Q^V, which are ϵ(n)-close to the Gibbs distribution µ_I in total variation distance, and an (N, ϵ)-estimator θ̂(I) of θ(I) such that θ̂(I) = E_θ(X^{(1)}, X^{(2)}, …, X^{(N(n))}).
Upon an update request that modifies I to a new instance I′ = (V′, E′, Q, Φ′) ∈ M, where n′ = |V′|, our algorithm does the following:
• Update the sample sequence. Update X^{(1)}, …, X^{(N(n))} to a new sequence of N(n′) independent samples Y^{(1)}, …, Y^{(N(n′))} ∈ Q^{V′} that are ϵ(n′)-close to µ_{I′} in total variation distance, and output the difference between the two sample sequences.
• Update the estimator. Given the difference between the two sample sequences, update θ̂(I) to θ̂(I′) = E_θ(Y^{(1)}, …, Y^{(N(n′))}) by accessing the oracle in Definition 2.3.
Obviously, the updated estimator θ̂(I′) is an (N, ϵ)-estimator for θ(I′).
Our main technical contribution is an algorithm that dynamically maintains a sequence of N(n) independent samples for µ_I while I itself is dynamically changing. The dynamic sampling problem was recently introduced in [FVY19].
The dynamic sampling algorithm given there only handles updates of a single vertex or edge, and works only for graphical models with soft constraints. In contrast, our dynamic sampling algorithm maintains a sequence of N(n) independent samples for µ_I within total variation distance ϵ(n), while the entire specification of the graphical model I is subject to dynamic update (to a new I′ with difference d(I, I′) ≤ L = o(n)). Specifically, the algorithm updates the sample sequence within expected time O(∆²N(n)L log³ n + ∆n). Note that the extra O(∆n) cost is necessary just for editing the current MRF instance I into I′, because a single update may change all the vertex and edge potentials simultaneously. This incremental time cost dominates the time cost of the dynamic inference algorithm, and is efficient for maintaining N(n) independent samples, especially when N(n) is sufficiently large, e.g. N(n) = Ω(n/L), in which case the average incremental cost for updating each sample is O(∆²L log³ n + ∆n/N(n)) = O(∆²L log³ n).
We illustrate the main idea by explaining how to maintain one sample. The idea is to represent the trace of the Markov chain for generating the sample by a dynamic data structure, and when the MRF instance is changed, to modify this trace into the trace of the new chain that generates a sample for the updated instance. This is achieved by a set of efficient dynamic data structures together with a coupling between the two Markov chains.
Specifically, let (X_t)_{t=0}^T be the Gibbs sampler chain for the distribution µ_I. When the chain is rapidly mixing, starting from an arbitrary initial configuration X₀ ∈ Q^V, after suitably many steps X = X_T is an accurate enough sample for µ_I. At each step, X_{t−1} and X_t may differ only at a vertex v_t, which is picked from V uniformly and independently at random.
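A single-site chain of this form can be recorded compactly as its initial state plus the per-step pairs of resampled vertex and new spin, and any intermediate configuration can be reconstructed by replay. A hedged minimal sketch (class and field names are ours, not from the paper):

```python
# Hedged sketch of recording a single-site chain's evolution as an initial
# state plus the per-step pairs (v_t, X_t(v_t)); replaying a prefix of the
# trace reconstructs any intermediate configuration X_t.
class ExecutionLog:
    def __init__(self, X0):
        self.X0 = dict(X0)
        self.steps = []                 # list of (v_t, new spin at v_t)

    def record(self, v, c):
        self.steps.append((v, c))

    def replay(self, t=None):
        """Reconstruct X_t (defaults to the final state X_T)."""
        X = dict(self.X0)
        for v, c in self.steps[:t]:
            X[v] = c
        return X

log = ExecutionLog({"a": 0, "b": 0})
log.record("a", 1)
log.record("b", 1)
log.record("a", 0)
print(log.replay(2), log.replay())   # {'a': 1, 'b': 1} {'a': 0, 'b': 1}
```

The point of such a representation is that a bounded change to the instance only requires editing a bounded portion of the recorded trace, rather than rerunning the chain from scratch.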
The evolution of the chain is fully captured by the initial state X₀ and the sequence of pairs ⟨v_t, X_t(v_t)⟩, from t = 1 to t = T, which is called the execution log of the chain in this paper.
Now suppose that the current instance I is updated to I′. We construct a coupling between the original chain (X_t)_{t=0}^T and the new chain (Y_t)_{t=0}^T such that (Y_t)_{t=0}^T is a faithful Gibbs sampling chain for the updated instance I′ given that (X_t)_{t=0}^T is a faithful chain for I, and the difference between