Notes on Probability

Remarks: For academic use only.

Probability theory

These notes will take for granted some familiarity with abstract (Lebesgue) integration theory. For further reading, see Chung (2001).

1 Probability spaces and random variables

A probability space is a triple (Ω, F, P) where Ω is a non-empty set, F is a σ-algebra of subsets of Ω, and P : F → [0, 1] is a (positive) measure such that P(Ω) = 1. An event is a set F ∈ F.
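As a concrete illustration of these definitions, the following Python sketch builds a small finite probability space for one roll of a fair die; the names Omega, F_events and P are purely illustrative and not part of the notes.

from itertools import chain, combinations

# Outcome space for one roll of a fair die.
Omega = {1, 2, 3, 4, 5, 6}

def power_set(s):
    # For a finite Omega, the power set is a sigma-algebra of subsets.
    s = list(s)
    return [frozenset(c) for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

F_events = power_set(Omega)           # the sigma-algebra F
P = lambda A: len(A) / len(Omega)     # the uniform probability measure, P(Omega) = 1

even = frozenset({2, 4, 6})           # an event F in F_events
assert even in F_events
print(P(Omega), P(even))              # 1.0 0.5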
A random variable X: Q-R is an F-measurable real-valued function on Q A random variable is said to have a property almost surely (P) if it has the property on a set F with P(F)=l. We often abbreviate and write as The expected value E[X] of a random variable X is defined via E[X]=/X(w)dP()=/X(w)P(dw) 1.1 Distribution measures and distribution functions The distribution measure of a random variable is a measure on the borel o algebra of subsets of R that tells you what the probability is that XUEBCR That is, for any borel set B C R H(B)=P(X(B)=P({u∈9:X(u)∈B}) Remark. We will sometimes write X B) when we mean X-(B) The distribution function F: R-0, 1 of a random variable X is defined via Fx(x)=(-∞,x]) That is, Fx(a)is the probability that X(w)sa If Fx is absolutely continuous, then it has a density f R-R+ such that F(x)=/f()d In particular, if Fx is differentiable everywhere, then fx(a)=Fx()
A random variable X : Ω → R is an F-measurable real-valued function on Ω. A random variable is said to have a property almost surely (P) if it has the property on a set F with P(F) = 1. We often abbreviate and write a.s.

The expected value E[X] of a random variable X is defined via

E[X] = ∫_Ω X(ω) dP(ω) = ∫_Ω X(ω) P(dω).

1.1 Distribution measures and distribution functions

The distribution measure of a random variable is a measure µ on the Borel σ-algebra of subsets of R that tells you what the probability is that X(ω) ∈ B ⊂ R. That is, for any Borel set B ⊂ R,

µ(B) = P(X⁻¹(B)) = P({ω ∈ Ω : X(ω) ∈ B}).

Remark. We will sometimes write {X ∈ B} when we mean X⁻¹(B).

The distribution function F_X : R → [0, 1] of a random variable X is defined via F_X(x) = µ((−∞, x]). That is, F_X(x) is the probability that X(ω) ≤ x. If F_X is absolutely continuous, then it has a density f_X : R → R+ such that

F_X(x) = ∫_{−∞}^{x} f_X(y) dy.

In particular, if F_X is differentiable everywhere, then f_X(x) = F′_X(x).
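To make the relationship between the density and the distribution function concrete, here is a small numerical sketch; the standard normal distribution is used purely as an example and is not singled out in the notes. Integrating the density up to x reproduces F_X(x), and a Monte Carlo average approximates E[X].

import numpy as np
from scipy import stats, integrate

X = stats.norm()                       # example distribution: standard normal

x = 1.0
# F_X(x) should equal the integral of the density f_X from -infinity to x.
cdf_from_density, _ = integrate.quad(X.pdf, -np.inf, x)
print(cdf_from_density, X.cdf(x))      # both approximately 0.8413

# E[X] approximated by the sample mean of many independent draws.
draws = X.rvs(size=100_000, random_state=0)
print(draws.mean())                    # close to the true mean, 0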
1.2 Information and σ-algebras

When considering σ-algebras G ⊂ F one may interpret G as the amount of available information. Intuitively, if our information is given by G, we can distinguish between the events in G in the sense that for any event G ∈ G we know with perfect certainty whether or not it has occurred. Given this, it makes sense to say that if G ⊂ H, then H contains no less information than G. Also, it is tempting to say that G = σ{singletons} corresponds to full information since it should enable us to tell exactly what ω has been drawn. But this turns out to be an awkward way of defining full information in general, although admittedly it makes perfect sense when Ω is a finite set. Instead, we will define full information as G = F, since then our information enables us to forecast perfectly the realized value of every random variable. Finally, we will say that G = {Ω, ∅} (the trivial σ-algebra) corresponds to no information.

Alternatively, we might tell the following story. Suppose our σ-algebra G is generated by a finite partition P. (i) Someone (Tyche, the norns, the dean, or whoever it is) chooses an outcome ω ∈ Ω without telling us which. (ii) While we don't know which ω ∈ Ω has been chosen, we are, however, told (by an oracle, Hugin & Munin, or the Gazette or whatever) in which component P_k ∈ P the outcome ω lies. In practice, this could be arranged by allowing us to observe a stochastic variable defined via

X(ω) = ∑_{k=1}^{n} k · I_{P_k}(ω).   (1)
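A minimal computational sketch of this story; the particular Ω and partition below are invented for illustration. Observing the variable X of equation (1) reveals the component of the partition that contains ω, and nothing more.

Omega = {1, 2, 3, 4, 5, 6}
partition = [frozenset({1, 2}), frozenset({3, 4, 5}), frozenset({6})]   # components P_1, P_2, P_3

def X(omega):
    # The variable in equation (1): the index k of the component containing omega.
    return sum(k for k, P_k in enumerate(partition, start=1) if omega in P_k)

print(X(4))   # 2: we learn only that omega lies in {3, 4, 5}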
To flesh this out a little bit more, you may want to think that getting ‘more information’ in this context would correspond to having a ‘finer’ partition, where a partition Q finer than P arises from chopping up the components of P. It follows, of course, that σ(P) ⊂ σ(Q), which was our original (and more general) definition of ‘more information’.

In any case, notice that the axioms that characterize a σ-algebra accord well with our intuitions about information. Obviously, we should know whether Ω, since it always occurs by definition. Also, if we know whether A, we should know whether not-A too. Similarly, if we know whether A and whether B, we should know whether A ∪ B. Countable unions are perhaps a little trickier to motivate intuitively; they are there essentially for technical reasons. In particular, they allow us to prove various limit theorems which are part of the point of the Lebesgue theory.

In economic modelling, it is plausible to allow decisions to depend only upon the available information. Mathematically, this means that if the agent's information is given by G, then her decision must be a G-measurable random variable. The interpretation of this is that the information in G suffices to give us perfect knowledge of the decision. Thus when it is time for the agent to act, she knows precisely what to do.

At this stage it is worth thinking about what it means for a stochastic variable X to be G-measurable. Intuitively, it means that the information in G suffices in order to know the value X(ω). To make this more concrete, suppose that G is generated by a partition P. Then for X to be G-measurable, X has to be constant on each element P_k ∈ P. It follows that knowing which element P_k has occurred is enough
to be able to tell what the value of X(ω) must be.

As a further illustration of the fact that σ-algebras do a good job in modelling information, we have the following result.

Definition. Let {X_α, α ∈ I} be a family of random variables. Then the σ-algebra generated by {X_α, α ∈ I}, denoted by σ{X_α, α ∈ I}, is the smallest σ-algebra G such that all the random variables in {X_α, α ∈ I} are G-measurable.

Remark. Such a σ-algebra exists. (Recall the proof: consider the intersection of all σ-algebras on Ω such that {X_α, α ∈ I} are measurable.)

Proposition. Let X = {X_1, X_2, ..., X_n} be a finite set of random variables. Let Z be a random variable. Then Z is σ{X}-measurable iff there exists a Borel measurable function f : R^n → R such that, for all ω ∈ Ω,

Z(ω) = f(X_1(ω), X_2(ω), ..., X_n(ω)).   (2)

Proof. The case when σ{X} is generated by a finite partition (i.e. when the mapping T : Ω → R^n defined via T(ω) = (X_1, X_2, ..., X_n) is F-simple) is not too hard and is left as an exercise. For the rest, see Williams (1991). □

2 The conditional expectation

Intuitively, the conditional expectation is the best predictor of the realization of a random variable given the available information. By “best” we will mean the one that minimizes the mean square error.
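The finite-partition case of the proposition can be checked by hand. The following sketch, with an invented Ω and a single generating variable X, tests σ{X}-measurability of a candidate Z by checking that Z is constant on each level set of X, i.e. that Z can be written as f(X).

Omega = [1, 2, 3, 4, 5, 6]
X = lambda w: 0 if w <= 3 else 1        # X generates the partition {1,2,3}, {4,5,6}

def is_sigma_X_measurable(Z):
    # Z is sigma{X}-measurable iff Z is constant on each level set of X.
    values_on_level = {}
    for w in Omega:
        values_on_level.setdefault(X(w), set()).add(Z(w))
    return all(len(v) == 1 for v in values_on_level.values())

Z_good = lambda w: 10 * X(w) + 1        # Z = f(X) with f(x) = 10x + 1
Z_bad = lambda w: w                     # depends on more than X reveals

print(is_sigma_X_measurable(Z_good))    # True
print(is_sigma_X_measurable(Z_bad))     # False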
A formal definition, which works for square-integrable random variables, is given by the following.

Definition. Let G ⊂ F be a σ-algebra and let X ∈ L²(Ω, F, P). Then the conditional expectation Y = E[X|G] is the projection of X onto L²(Ω, G, P).

Remark. By the Hilbert space projection theorem, the conditional expectation is the minimizer of the mean square error:

Y = arg min_{Z ∈ L²(Ω, G, P)} E[(X − Z)²].   (3)

Remark. The conditional expectation is itself a random variable. Its value is uncertain because it depends on precisely which events G ∈ G actually occur. In other words, it is a (contingent) forecasting rule whose output (the forecast) depends on the content of the information revealed. For example, suppose our information set is such that we know whether the president has been shot. Then our actions may depend on whether he is or is not shot.

The projection-based definition is intuitively the most appealing one, but unfortunately it only applies to square-integrable stochastic variables. One way to extend the definition to merely integrable stochastic variables is to note that L² is dense in L¹ and define E[X|G] as the limit of the sequence {E[X_n|G]} where X_n ∈ L² and X_n → X (in L¹). Another way is the following.

Proposition. Let G ⊂ F be a σ-algebra and let X ∈ L¹(Ω, F, P). Then there is an a.s. (P) unique integrable random variable Z such that

1. Z is G-measurable and
2. ∫_G X dP = ∫_G Z dP for each G ∈ G.

Using this result, we define E[X|G] = Z.

Proof. The Radon-Nikodym theorem.

Remark. Since the conditional expectation is only a.s. (P) unique, most of the equations below strictly speaking need a qualifying ‘a.s. (P)’ appended to them to be true. But since this is a bit tedious, we adopt instead the convention that the statement X = Y means P({ω ∈ Ω : X(ω) = Y(ω)}) = 1. If two random variables W and Z both qualify as the conditional expectation E[X|G], then we will sometimes call them versions of E[X|G].

This L¹-based definition can be intuitively motivated independently of the projection-based definition in the following way. On events such that we know whether they have occurred, our best guess of X should track X perfectly.

In any case, it had better be true that our two definitions of the conditional expectation coincide when they both apply, i.e. on L¹ ∩ L² = L². They do. You may want to try to prove this for yourself.

Having defined the conditional expectation with respect to a σ-algebra, we now define the conditional expectation with respect to a family of stochastic variables.

Definition. Let Y ∈ L¹(Ω, F, P) and let {X_α, α ∈ I} be a family of random variables. Then the conditional expectation E[Y | {X_α, α ∈ I}] is defined as E[Y | σ{X_α, α ∈ I}].

Since E[Y|X] is a σ(X)-measurable random variable, there is a Borel function f such that E[Y|X] = f(X). Sometimes we use the notation f(x) = E[Y|X = x]
where the expression on the right hand side is defined by the left hand side.

Definition. Let Y ∈ L¹(Ω, F, P) and let X be a stochastic variable. Then the function E[Y|X = x] is defined as any Borel function f : R → R with the property that f(X) is a version of E[Y|X]. Note that E[Y|X = x] is not always uniquely defined, but that this does not matter in practice.

Having defined the conditional expectation, we now note some of its properties. Let the given probability space be (Ω, F, P).

Proposition. Let G = {Ω, ∅}. Then E[X|G] = E[X].

Proof. Exercise. □

Proposition. Let X and Y be integrable random variables, let G ⊂ F be a σ-algebra and let α, β be scalars. Then

E[αX + βY | G] = αE[X|G] + βE[Y|G].   (4)

Proof. Exercise. □

Proposition [Law of iterated expectations]. Let X ∈ L¹(Ω, F, P) and let G ⊂ H ⊂ F be σ-algebras. Then

E[E[X|H] | G] = E[X|G].   (5)

Proof. We check that the left hand side satisfies the conditions for being the conditional expectation of X with respect to G. Clearly it is G-measurable. Now let G ∈ G and we have, since G ⊂ H and consequently G ∈ H,

∫_G E[E[X|H] | G] dP = ∫_G E[X|H] dP = ∫_G X dP.   (6)

□
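The law of iterated expectations is easy to verify by hand on a finite space. In the sketch below (the uniform space, the variable X and the nested partitions are all made up for illustration), conditioning on a partition-generated σ-algebra amounts to averaging over the blocks of the partition; conditioning first on the finer H and then on the coarser G gives the same answer as conditioning on G directly.

import numpy as np

X = np.array([3., 1., 4., 1., 5., 9., 2., 6.])    # one value per outcome of a uniform 8-point space

H_blocks = [[0, 1], [2, 3], [4, 5], [6, 7]]       # fine partition generating H
G_blocks = [[0, 1, 2, 3], [4, 5, 6, 7]]           # coarser partition generating G, so G ⊂ H

def cond_exp(values, blocks):
    # Conditional expectation given a partition: replace each value by its block average
    # (with uniform P the block average is exactly the conditional expectation).
    out = np.empty_like(values)
    for b in blocks:
        out[b] = values[b].mean()
    return out

E_X_given_H = cond_exp(X, H_blocks)
E_X_given_G = cond_exp(X, G_blocks)

# Law of iterated expectations: E[E[X|H] | G] = E[X|G].
print(np.allclose(cond_exp(E_X_given_H, G_blocks), E_X_given_G))   # True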
Corollary. Let X ∈ L¹(Ω, F, P) and let G ⊂ F be a σ-algebra. Then E[E[X|G]] = E[X].

Proposition. Let X and Y be random variables such that XY is integrable. Let G ⊂ F be a σ-algebra and suppose X is G-measurable. Then

1. E[X|G] = X and

2. E[XY|G] = X E[Y|G].

Proof. (1) is trivial. To prove (2), note first that the right hand side is G-measurable (why?). To show that the right hand side integrates to the right thing, suppose X = I_G where G ∈ G. Let F ∈ G. Then

∫_F X E[Y|G] dP = ∫_F I_G E[Y|G] dP = ∫_{G∩F} E[Y|G] dP
= {since (G ∩ F) ∈ G!}
= ∫_{G∩F} Y dP = ∫_F I_G Y dP
= ∫_F X Y dP.   (7)

To show the more general case, show it for simple functions and then use the Monotone Convergence Theorem.
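Property (2), ‘taking out what is known’, can be checked the same way. Reusing cond_exp and G_blocks from the sketch above (still a made-up example), take a variable that is constant on each block of the partition, hence G-measurable:

Y = np.array([2., 7., 1., 8., 2., 8., 1., 8.])          # another random variable
X_known = np.array([1., 1., 1., 1., 5., 5., 5., 5.])    # constant on each block of G_blocks

# Taking out what is known: E[XY | G] = X E[Y | G] when X is G-measurable.
print(np.allclose(cond_exp(X_known * Y, G_blocks), X_known * cond_exp(Y, G_blocks)))   # True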
We end the discussion of the conditional expectation by defining the conditional probability of an event. We then note with satisfaction that our formal definition substantiates our claim above that if our information is given by G, then we know, for all events G ∈ G, whether or not they have occurred.

Definition. Let (Ω, F, P) be a probability space and let G ⊂ F be a σ-algebra. Let A ∈ F. Then the conditional probability P(A|G) of A given G is defined via

P(A|G) = E[I_A | G].   (8)

It follows from this definition (why?) that if A ∈ G, then P(A|G) = 1 when A occurs and P(A|G) = 0 when it does not.

2.1 Stochastic processes

Let (Ω, F, P) be a probability space. A stochastic process in discrete time is a mapping X : Z+ × Ω → R such that, for each fixed t ∈ Z+, the mapping ω → X(t, ω) is a random variable. For each fixed ω ∈ Ω, the mapping t → X(t, ω) is called a trajectory.

The definition of a stochastic process in continuous time is the same, except that Z+ is replaced by R+.
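As a final illustration, the sketch below simulates a few trajectories of one of the simplest discrete-time stochastic processes, a symmetric random walk (chosen here only as an example); each printed row is the map t → X(t, ω) for one fixed ω.

import numpy as np

rng = np.random.default_rng(0)

# Symmetric random walk: X(0, w) = 0 and X(t, w) is the sum of t independent +/-1 steps.
n_paths, n_steps = 3, 10
steps = rng.choice([-1, 1], size=(n_paths, n_steps))
X = np.concatenate([np.zeros((n_paths, 1), dtype=int), steps.cumsum(axis=1)], axis=1)

for omega, trajectory in enumerate(X):
    print(f"trajectory for omega = {omega}: {trajectory.tolist()}")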