arXiv:0904.0156v1 [math.ST] 1 Apr 2009

The Annals of Statistics
2009, Vol. 37, No. 2, 905–938
DOI: 10.1214/07-AOS587
© Institute of Mathematical Statistics, 2009

THE FORMAL DEFINITION OF REFERENCE PRIORS

By James O. Berger,1 José M. Bernardo2 and Dongchu Sun3

Duke University, Universitat de València and University of Missouri-Columbia

Reference analysis produces objective Bayesian inference, in the sense that inferential statements depend only on the assumed model and the available data, and the prior distribution used to make an inference is least informative in a certain information-theoretic sense. Reference priors have been rigorously defined in specific contexts and heuristically defined in general, but a rigorous general definition has been lacking. We produce a rigorous general definition here and then show how an explicit expression for the reference prior can be obtained under very weak regularity conditions. The explicit expression can be used to derive new reference priors both analytically and numerically.

Received March 2007; revised December 2007.
1 Supported by NSF Grant DMS-01-03265.
2 Supported by Grant MTM2006-07801.
3 Supported by NSF Grants SES-0351523 and SES-0720229.
AMS 2000 subject classifications. Primary 62F15; secondary 62A01, 62B10.
Key words and phrases. Amount of information, Bayesian asymptotics, consensus priors, Fisher information, Jeffreys priors, noninformative priors, objective priors, reference priors.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2009, Vol. 37, No. 2, 905–938. This reprint differs from the original in pagination and typographic detail.

1. Introduction and notation.

1.1. Background and goals. There is a considerable body of conceptual and theoretical literature devoted to identifying appropriate procedures for the formulation of objective priors; for relevant pointers see Section 5.6 in Bernardo and Smith [13], Datta and Mukerjee [20], Bernardo [11], Berger [3], Ghosh, Delampady and Samanta [23] and references therein. Reference analysis, introduced by Bernardo [10] and further developed by Berger and Bernardo [4, 5, 6, 7], and Sun and Berger [42], has been one of the most utilized approaches to developing objective priors; see the references in Bernardo [11].

Reference analysis uses information-theoretical concepts to make precise the idea of an objective prior which should be maximally dominated by the
data, in the sense of maximizing the missing information (to be precisely defined later) about the parameter. The original formulation of reference priors in the paper by Bernardo [10] was largely informal. In continuous one-parameter problems, heuristic arguments were given to justify an explicit expression in terms of the expectation under sampling of the logarithm of the asymptotic posterior density, which reduced to Jeffreys prior (Jeffreys [31, 32]) under asymptotic posterior normality. In multiparameter problems it was argued that one should not maximize the joint missing information but proceed sequentially, thus avoiding known problems such as marginalization paradoxes. Berger and Bernardo [7] gave more precise definitions of this sequential reference process, but restricted consideration to continuous multiparameter problems under asymptotic posterior normality. Clarke and Barron [17] established regularity conditions under which joint maximization of the missing information leads to Jeffreys multivariate priors. Ghosal and Samanta [27] and Ghosal [26] provided explicit results for reference priors in some types of nonregular models.

This paper has three goals.

Goal 1. Make precise the definition of the reference prior. This has two different aspects.

• Applying Bayes theorem to improper priors is not obviously justifiable. Formalizing when this is legitimate is desirable, and is considered in Section 2.
• Previous attempts at a general definition of reference priors have had heuristic features, especially in situations in which the reference prior is improper. Replacing the heuristics with a formal definition is desirable, and is done in Section 3.

Goal 2. Present a simple constructive formula for a reference prior. Indeed, for a model described by density p(x | θ), where x is the complete data vector and θ is a continuous unknown parameter, the formula for the reference prior, π(θ), will be shown to be

π(θ) = lim_{k→∞} f_k(θ)/f_k(θ_0),    f_k(θ) = exp{ ∫ p(x^{(k)} | θ) log[π*(θ | x^{(k)})] dx^{(k)} },

where θ_0 is an interior point of the parameter space Θ, x^{(k)} = {x_1, ..., x_k} stands for k conditionally independent replications of x, and π*(θ | x^{(k)}) is the posterior distribution corresponding to some fixed, largely arbitrary prior π*(θ).
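As a preview of the numerical use of this formula (taken up in Section 4.2), the following is a minimal Monte Carlo sketch, not from the paper, for the exponential model p(x | θ) = θe^{−θx}; the convenience prior π*(θ) = e^{−θ}, the values k = 50, θ_0 = 1 and the simulation sizes are all illustrative assumptions. Since this model is regular, the ratio f_k(θ)/f_k(θ_0) should track the Jeffreys prior, here proportional to 1/θ.

import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)

def log_post(theta, s, k):
    # With pi*(theta) = exp(-theta), the posterior given x^(k) is
    # Ga(theta | k+1, s+1), where s = x_1 + ... + x_k.
    a, b = k + 1, s + 1.0
    return a * np.log(b) - gammaln(a) + (a - 1) * np.log(theta) - b * theta

def f_k(theta, k=50, n_sims=50_000):
    # f_k(theta) = exp E[ log pi*(theta | x^(k)) ], the expectation being over
    # x^(k) ~ iid Exp(theta); the sufficient statistic s is Ga(k, rate theta).
    s = rng.gamma(shape=k, scale=1.0 / theta, size=n_sims)
    return np.exp(np.mean(log_post(theta, s, k)))

theta0 = 1.0
f0 = f_k(theta0)
for theta in [0.5, 1.0, 2.0, 4.0]:
    print(theta, f_k(theta) / f0, 1.0 / theta)  # ratio should approach 1/theta

The key simplification here is that, under π*(θ) = e^{−θ}, the posterior π*(θ | x^{(k)}) depends on x^{(k)} only through s = Σ x_i, so the expectation defining f_k(θ) reduces to a one-dimensional Monte Carlo average.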
The interesting thing about this expression is that it holds (under mild conditions) for any type of continuous parameter model, regardless of the asymptotic nature of the posterior. This formula is established in Section 4.1, and various illustrations of its use are given.

A second use of the expression is that it allows straightforward computation of the reference prior numerically. This is illustrated in Section 4.2 for a difficult nonregular problem and for a problem for which analytical determination of the reference prior seems very difficult.

Goal 3. To make precise the most common practical rationale for use of improper objective priors, which proceeds as follows:

• In reality, we are always dealing with bounded parameters so that the real parameter space should, say, be some compact set Θ_0.
• It is often only known that the bounds are quite large, in which case it is difficult to accurately ascertain which Θ_0 to use.
• This difficulty can be surmounted if we can pass to the unbounded space Θ and show that the analysis on this space would yield essentially the same answer as the analysis on any very large compact Θ_0.

Establishing that the analysis on Θ is a good approximation from the reference theory viewpoint requires establishing two facts:

1. The reference prior distribution on Θ, when restricted to Θ_0, is the reference prior on Θ_0.
2. The reference posterior distribution on Θ is an appropriate limit of the reference posterior distributions on an increasing sequence of compact sets {Θ_i}_{i=1}^∞ converging to Θ.

Indicating how these two facts can be verified is the third goal of the paper.

1.2. Notation. Attention here is limited mostly to one-parameter problems with a continuous parameter, but the ideas are extendable to the multiparameter case through the sequential scheme of Berger and Bernardo [7].

It is assumed that probability distributions may be described through probability density functions, either with respect to Lebesgue measure or counting measure. No distinction is made between a random quantity and the particular values that it may take. Bold italic roman fonts are used for observable random vectors (typically data) and italic greek fonts for unobservable random quantities (typically parameters); lower case is used for variables and upper case calligraphic for their domain sets. Moreover, the standard mathematical convention of referring to functions, say f_x and g_x of x ∈ X, respectively by f(x) and g(x), will be used throughout. Thus, the conditional probability density of data x ∈ X given θ will be represented
by p(x | θ), with p(x | θ) ≥ 0 and ∫_X p(x | θ) dx = 1, and the reference posterior distribution of θ ∈ Θ given x will be represented by π(θ | x), with π(θ | x) ≥ 0 and ∫_Θ π(θ | x) dθ = 1. This admittedly imprecise notation will greatly simplify the exposition. If the random vectors are discrete, these functions naturally become probability mass functions, and integrals over their values become sums. Density functions of specific distributions are denoted by appropriate names. Thus, if x is an observable random quantity with a normal distribution of mean µ and variance σ², its probability density function will be denoted N(x | µ, σ²); if the posterior distribution of λ is Gamma with mean a/b and variance a/b², its probability density function will be denoted Ga(λ | a, b). The indicator function on a set C will be denoted by 1_C.

Reference prior theory is based on the use of logarithmic divergence, often called the Kullback–Leibler divergence.

Definition 1. The logarithmic divergence of a probability density p̃(θ) of the random vector θ ∈ Θ from its true probability density p(θ), denoted by κ{p̃ | p}, is

κ{p̃ | p} = ∫_Θ p(θ) log[p(θ)/p̃(θ)] dθ,

provided the integral (or the sum) is finite.

The properties of κ{p̃ | p} have been extensively studied; pioneering works include Gibbs [22], Shannon [38], Good [24, 25], Kullback and Leibler [35], Chernoff [15], Jaynes [29, 30], Kullback [34] and Csiszár [18, 19].

Definition 2 (Logarithmic convergence). A sequence of probability density functions {p_i}_{i=1}^∞ converges logarithmically to a probability density p if, and only if, lim_{i→∞} κ{p | p_i} = 0.

2. Improper and permissible priors.

2.1. Justifying posteriors from improper priors. Consider a model M = {p(x | θ), x ∈ X, θ ∈ Θ} and a strictly positive prior function π(θ). (We restrict attention to strictly positive functions because any believably objective prior would need to have strictly positive density, and this restriction eliminates many technical details.) When π(θ) is improper, so that ∫_Θ π(θ) dθ diverges, Bayes theorem no longer applies, and the use of the formal posterior density

(2.1)    π(θ | x) = p(x | θ)π(θ) / ∫_Θ p(x | θ)π(θ) dθ
must be justified, even when ∫_Θ p(x | θ)π(θ) dθ < ∞ so that π(θ | x) is a proper density.

The most convincing justifications revolve around showing that π(θ | x) is a suitable limit of posteriors obtained from proper priors. A variety of versions of such arguments exist; cf. Stone [40, 41] and Heath and Sudderth [28]. Here, we consider approximations based on restricting the prior to an increasing sequence of compact sets and using logarithmic convergence to define the limiting process. The main motivation is, as mentioned in the introduction, that objective priors are often viewed as being priors that will yield a good approximation to the analysis on the “true but difficult to specify” large bounded parameter space.

Definition 3 (Approximating compact sequence). Consider a parametric model M = {p(x | θ), x ∈ X, θ ∈ Θ} and a strictly positive continuous function π(θ), θ ∈ Θ, such that, for all x ∈ X, ∫_Θ p(x | θ)π(θ) dθ < ∞. An approximating compact sequence of parameter spaces is an increasing sequence of compact subsets of Θ, {Θ_i}_{i=1}^∞, converging to Θ. The corresponding sequence of posteriors with support on Θ_i, defined as {π_i(θ | x)}_{i=1}^∞, with π_i(θ | x) ∝ p(x | θ)π_i(θ), π_i(θ) = c_i^{-1} π(θ) 1_{Θ_i} and c_i = ∫_{Θ_i} π(θ) dθ, is called the approximating sequence of posteriors to the formal posterior π(θ | x).

Notice that the renormalized restrictions π_i(θ) of π(θ) to the Θ_i are proper [because the Θ_i are compact and π(θ) is continuous]. The following theorem shows that the posteriors resulting from these proper priors do converge, in the sense of logarithmic convergence, to the posterior π(θ | x).

Theorem 1. Consider model M = {p(x | θ), x ∈ X, θ ∈ Θ} and a strictly positive continuous function π(θ), such that ∫_Θ p(x | θ)π(θ) dθ < ∞, for all x ∈ X. For any approximating compact sequence of parameter spaces, the corresponding approximating sequence of posteriors converges logarithmically to the formal posterior π(θ | x) ∝ p(x | θ)π(θ).

Proof. To prove that κ{π(· | x) | π_i(· | x)} converges to zero, define the predictive densities p_i(x) = ∫_{Θ_i} p(x | θ)π_i(θ) dθ and p(x) = ∫_Θ p(x | θ)π(θ) dθ (which has been assumed to be finite). Using for the posteriors the expressions provided by Bayes theorem yields

∫_{Θ_i} π_i(θ | x) log[π_i(θ | x)/π(θ | x)] dθ = ∫_{Θ_i} π_i(θ | x) log[p(x)π_i(θ)/(p_i(x)π(θ))] dθ
  = ∫_{Θ_i} π_i(θ | x) log[p(x)/(p_i(x)c_i)] dθ
  = log[p(x)/(p_i(x)c_i)]
  = log[ ∫_Θ p(x | θ)π(θ) dθ / ∫_{Θ_i} p(x | θ)π(θ) dθ ].
But the last expression converges to zero if, and only if,

lim_{i→∞} ∫_{Θ_i} p(x | θ)π(θ) dθ = ∫_Θ p(x | θ)π(θ) dθ,

and this follows from the monotone convergence theorem.

It is well known that logarithmic convergence implies convergence in L_1, which implies uniform convergence of probabilities, so Theorem 1 could, at first sight, be invoked to justify the formal use of virtually any improper prior in Bayes theorem. As illustrated below, however, logarithmic convergence of the approximating posteriors is not necessarily good enough.

Example 1 (Fraser, Monette and Ng [21]). Consider the model, with both discrete data and parameter space,

M = {p(x | θ) = 1/3, x ∈ {[θ/2], 2θ, 2θ + 1}, θ ∈ {1, 2, ...}},

where [u] denotes the integer part of u, and [1/2] is separately defined as 1. Fraser, Monette and Ng [21] show that the naive improper prior π(θ) = 1 produces a posterior π(θ | x) ∝ p(x | θ) which is strongly inconsistent, leading to credible sets for θ given by {2x, 2x + 1} which have posterior probability 2/3 but frequentist coverage of only 1/3 for all θ values. Yet, choosing the natural approximating sequence of compact sets Θ_i = {1, ..., i}, it follows from Theorem 1 that the corresponding sequence of posteriors converges logarithmically to π(θ | x).

The difficulty shown by Example 1 lies in the fact that logarithmic convergence is only pointwise convergence for given x, which does not guarantee that the approximating posteriors are accurate in any global sense over x. For that we turn to a stronger notion of convergence.

Definition 4 (Expected logarithmic convergence of posteriors). Consider a parametric model M = {p(x | θ), x ∈ X, θ ∈ Θ}, a strictly positive continuous function π(θ), θ ∈ Θ, and an approximating compact sequence {Θ_i} of parameter spaces. The corresponding sequence of posteriors {π_i(θ | x)}_{i=1}^∞ is said to be expected logarithmically convergent to the formal posterior π(θ | x) if

(2.2)    lim_{i→∞} ∫_X κ{π(· | x) | π_i(· | x)} p_i(x) dx = 0,

where p_i(x) = ∫_{Θ_i} p(x | θ)π_i(θ) dθ.

This notion was first discussed (in the context of reference priors) in Berger and Bernardo [7], and achieves one of our original goals: A prior
distribution satisfying this condition will yield a posterior that, on average over x, is a good approximation to the proper posterior that would result from restriction to a large compact subset of the parameter space.

To some Bayesians, it might seem odd to worry about averaging the logarithmic discrepancy over the sample space but, as will be seen, reference priors are designed to be “noninformative” for a specified model, the notion being that repeated use of the prior with that model will be successful in practice.

Example 2 (Fraser, Monette and Ng [21] continued). In Example 1, the discrepancies κ{π(· | x) | π_i(· | x)} between π(θ | x) and the posteriors derived from the sequence of proper priors {π_i(θ)}_{i=1}^∞ converged to zero. However, Berger and Bernardo [7] show that ∫_X κ{π(· | x) | π_i(· | x)} p_i(x) dx → log 3 as i → ∞, so that the expected logarithmic discrepancy does not go to zero. Thus, the sequence of proper priors {π_i(θ) = 1/i, θ ∈ {1, ..., i}}_{i=1}^∞ does not provide a good global approximation to the formal prior π(θ) = 1, providing one explanation of the paradox found by Fraser, Monette and Ng [21].

Interestingly, for the improper prior π(θ) = 1/θ, the approximating compact sequence considered above can be shown to yield posterior distributions that expected logarithmically converge to π(θ | x) ∝ θ^{-1} p(x | θ), so that this is a good candidate objective prior for the problem. It is also shown in Berger and Bernardo [7] that this prior has posterior confidence intervals with the correct frequentist coverage.

Two potential generalizations are of interest. Definition 4 requires convergence only with respect to one approximating compact sequence of parameter spaces. It is natural to wonder what happens for other such approximating sequences. We suspect, but have been unable to prove in general, that convergence with respect to one sequence will guarantee convergence with respect to any sequence. If true, this makes expected logarithmic convergence an even more compelling property.

Related to this is the possibility of allowing not just an approximating sequence of priors based on truncation to compact parameter spaces, but instead allowing any approximating sequence of priors. Among the difficulties in dealing with this is the need for a better notion of divergence that is symmetric in its arguments. One possibility is the symmetrized form of the logarithmic divergence in Bernardo and Rueda [12], but the analysis is considerably more difficult.

2.2. Permissible priors. Based on the previous considerations, we restrict consideration of possibly objective priors to those that satisfy the expected logarithmic convergence condition, and formally define them as follows. (Recall that x represents the entire data vector.)
Definition 5. A strictly positive continuous function π(θ) is a permissible prior for model M = {p(x | θ), x ∈ X, θ ∈ Θ} if:

1. for all x ∈ X, π(θ | x) is proper, that is, ∫_Θ p(x | θ)π(θ) dθ < ∞;
2. for some approximating compact sequence, the corresponding sequence of posteriors is expected logarithmically convergent to π(θ | x) ∝ p(x | θ)π(θ).

The following theorem gives a simple tail condition under which the uniform prior is permissible in location models.

Theorem 2. Consider the location model M ≡ {p(x | θ) = f(x − θ), θ ∈ R}, where f(·) is a density function. If, for some ε > 0,

(2.3)    lim_{|t|→∞} |t|^{1+ε} f(t) = 0,

then π(θ) = 1 is a permissible prior for the location model M.

Example 3 (A nonpermissible constant prior in a location model). Consider the location model M ≡ {p(x | θ) = f(x − θ), θ ∈ R, x > θ + e}, where f(t) = t^{-1}(log t)^{-2}, t > e. It is shown in Appendix B that, if π(θ) = 1, then ∫_X κ{π(· | x) | π_0(· | x)} p_0(x) dx = ∞ for any compact set Θ_0 = [a, b] with b − a ≥ 1; thus, π(θ) = 1 is not a permissible prior for M. Note that this model does not satisfy (2.3).
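The failure of condition (2.3) in Example 3 is transparent: |t|^{1+ε} f(t) = t^ε (log t)^{−2} → ∞ for every ε > 0. A small numerical check, not from the paper (ε = 0.5 and the grid of t values are arbitrary illustrative choices):

import numpy as np

eps = 0.5  # any eps > 0 leads to the same conclusion

def f(t):
    # the density of Example 3 on its support t > e
    return 1.0 / (t * np.log(t) ** 2)

for t in [1e2, 1e4, 1e8, 1e16]:
    # |t|^(1+eps) f(t) = t**eps / (log t)**2, which grows without bound
    print(t, t ** (1 + eps) * f(t))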
This is an interesting example because we are still dealing with a location density, so that π(θ) = 1 is still the invariant (Haar) prior and, as such, satisfies numerous nice properties such as being exact frequentist matching (i.e., a Bayesian 100(1 − α)% credible set will also be a frequentist 100(1 − α)% confidence set; cf. equation (6.22) in Berger [2]). This is in stark contrast to the situation with the Fraser, Monette and Ng example. However, the basic fact remains that posteriors from uniform priors on large compact sets do not seem here to be well approximated (in terms of logarithmic divergence) by a uniform prior on the full parameter space. The suggestion is that this is a situation in which assessment of the “true” bounded parameter space is potentially needed.

Of course, a prior might be permissible for a larger sample size, even if it is not permissible for the minimal sample size. For instance, we suspect that π(θ) = 1 is permissible for any location model having two or more independent observations.

The condition in the definition of permissibility that the posterior must be proper is not vacuous, as the following example shows.

Example 4 (Mixture model). Let x = {x_1, ..., x_n} be a random sample from the mixture p(x_i | θ) = (1/2) N(x_i | θ, 1) + (1/2) N(x_i | 0, 1), and consider the uniform prior function π(θ) = 1. Since the likelihood function is bounded below by 2^{-n} ∏_{j=1}^n N(x_j | 0, 1) > 0, the integrated likelihood ∫_{-∞}^{∞} p(x | θ)π(θ) dθ = ∫_{-∞}^{∞} p(x | θ) dθ will diverge. Hence, the corresponding formal posterior is improper, and therefore the uniform prior is not a permissible prior function for this model. It can be shown that Jeffreys prior for this mixture model has the shape of an inverted bell, with a minimum value 1/2 at θ = 0; hence, it is also bounded from below and is, therefore, not a permissible prior for this model either.

Example 4 is noteworthy because it is very rare for the Jeffreys prior to yield an improper posterior in univariate problems. It is also of interest because there is no natural objective prior available for the problem. (There are data-dependent objective priors: see Wasserman [43].)

Theorem 2 can easily be modified to apply to models that can be transformed into a location model.

Corollary 1. Consider M ≡ {p(x | θ), θ ∈ Θ, x ∈ X}. If there are monotone functions y = y(x) and φ = φ(θ) such that p(y | φ) = f(y − φ) is a location model and there exists ε > 0 such that lim_{|t|→∞} |t|^{1+ε} f(t) = 0, then π(θ) = |φ′(θ)| is a permissible prior function for M.

The most frequent transformation is the log transformation, which converts a scale model into a location model. Indeed, this transformation yields the following direct analogue of Theorem 2.

Corollary 2. Consider M = {p(x | θ) = θ^{-1} f(|x|/θ), θ > 0, x ∈ R}, a scale model where f(s), s > 0, is a density function. If, for some ε > 0,

(2.4)    lim_{|t|→∞} |t|^{1+ε} e^t f(e^t) = 0,

then π(θ) = θ^{-1} is a permissible prior function for the scale model M.

Example 5 (Exponential data). If x is an observation from an exponential density, (2.4) becomes |t|^{1+ε} e^t exp(−e^t) → 0, as |t| → ∞, which is true. From Corollary 2, π(θ) = θ^{-1} is a permissible prior; indeed, π_i(θ) = (2i)^{-1} θ^{-1}, e^{-i} ≤ θ ≤ e^i, is expected logarithmically convergent to π(θ).

Example 6 (Uniform data). Let x be one observation from the uniform distribution M = {Un(x | 0, θ) = θ^{-1}, x ∈ [0, θ], θ > 0}. This is a scale density, and equation (2.4) becomes |t|^{1+ε} e^t 1_{{0 < e^t < 1}} → 0, as |t| → ∞, which is indeed true. Thus, π(θ) = θ^{-1} is a permissible prior function for M.
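Similarly, the limits claimed in Examples 5 and 6 are easy to verify numerically. In the sketch below (again not from the paper; ε = 0.5 and the t grid are illustrative), f is the standard exponential density for Example 5 and the Un(0, 1) density for Example 6:

import numpy as np

eps = 0.5

def f_exp(s):
    # exponential model of Example 5: f(s) = exp(-s), s > 0
    return np.exp(-s)

def f_unif(s):
    # uniform model of Example 6: f(s) = 1 on 0 < s < 1
    return 1.0 if 0.0 < s < 1.0 else 0.0

for t in [-30.0, -10.0, 10.0, 30.0]:
    s = np.exp(t)
    g_exp = abs(t) ** (1 + eps) * s * f_exp(s)
    g_unif = abs(t) ** (1 + eps) * s * f_unif(s)
    print(f"t = {t:6.1f}   exponential: {g_exp:.3e}   uniform: {g_unif:.3e}")

Both sequences decay to zero as |t| grows, consistent with condition (2.4) and the permissibility of π(θ) = θ^{-1} in both examples.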
The examples showing permissibility were for a single observation. Pleasantly, it is enough to establish permissibility for a single observation or, more generally, for the sample size necessary for posterior propriety of π(θ | x), because of the following theorem, which shows that expected logarithmic discrepancy is monotonically nonincreasing in sample size.

Theorem 3 (Monotone expected logarithmic discrepancy). Let M = {p(x_1, x_2 | θ) = p(x_1 | θ) p(x_2 | x_1, θ), x_1 ∈ X_1, x_2 ∈ X_2, θ ∈ Θ} be a parametric model. Consider a continuous improper prior π(θ) satisfying m(x_1) = ∫_Θ p(x_1 | θ)π(θ) dθ < ∞ and m(x_1, x_2) = ∫_Θ p(x_1, x_2 | θ)π(θ) dθ < ∞. For any compact set Θ_0 ⊂ Θ, let π_0(θ) = π(θ) 1_{Θ_0}(θ) / ∫_{Θ_0} π(θ) dθ. Then,

(2.5)    ∫∫_{X_1×X_2} κ{π(· | x_1, x_2) | π_0(· | x_1, x_2)} m_0(x_1, x_2) dx_1 dx_2 ≤ ∫_{X_1} κ{π(· | x_1) | π_0(· | x_1)} m_0(x_1) dx_1,

where, for θ ∈ Θ_0,

π_0(θ | x_1, x_2) = p(x_1, x_2 | θ)π(θ) / m_0(x_1, x_2),    m_0(x_1, x_2) = ∫_{Θ_0} p(x_1, x_2 | θ)π(θ) dθ,
π_0(θ | x_1) = p(x_1 | θ)π(θ) / m_0(x_1),    m_0(x_1) = ∫_{Θ_0} p(x_1 | θ)π(θ) dθ.

Proof. The proof of this theorem is given in Appendix C.

As an aside, the above result suggests that, as the sample size grows, the convergence of the posterior to normality given in Clarke [16] is monotone.

3. Reference priors.

3.1. Definition of reference priors. Key to the definition of reference priors is Shannon expected information (Shannon [38] and Lindley [36]).

Definition 6 (Expected information). The information to be expected from one observation from model M ≡ {p(x | θ), x ∈ X, θ ∈ Θ}, when the prior for θ is q(θ), is

I{q | M} = ∫∫_{X×Θ} p(x | θ) q(θ) log[p(θ | x)/q(θ)] dx dθ.
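The quantity I{q | M} is the mutual information between x and θ, and can be computed by simulation whenever the posterior p(θ | x) is available in closed form. The following minimal Monte Carlo sketch, not from the paper, assumes the conjugate pair x | θ ~ N(θ, 1) with q(θ) = N(θ | 0, τ²), for which the exact value (1/2) log(1 + τ²) is available for comparison; τ² = 4 and the simulation size are illustrative choices.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
tau2 = 4.0        # prior variance (illustrative)
n_sims = 200_000  # Monte Carlo sample size (illustrative)

# Draw (theta, x) from the joint distribution q(theta) p(x | theta).
theta = rng.normal(0.0, np.sqrt(tau2), n_sims)
x = rng.normal(theta, 1.0)

# Conjugacy: theta | x ~ N(tau2*x/(1+tau2), tau2/(1+tau2)).
post_mean = tau2 * x / (1.0 + tau2)
post_sd = np.sqrt(tau2 / (1.0 + tau2))

# I{q|M} = E[ log p(theta|x) - log q(theta) ] under the joint distribution.
mc_estimate = np.mean(norm.logpdf(theta, post_mean, post_sd)
                      - norm.logpdf(theta, 0.0, np.sqrt(tau2)))
print(mc_estimate, 0.5 * np.log(1.0 + tau2))  # estimate vs exact 0.8047

The Monte Carlo average of log p(θ | x) − log q(θ) over draws from the joint density p(x | θ)q(θ) estimates exactly the double integral in Definition 6.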