Chapter 7
Statistical Functionals and the Delta Method

1. Estimators as Functionals of Fn or Pn
2. Continuity of Functionals of F or P
3. Metrics for Distribution Functions F and Probability Distributions P
4. Differentiability of Functionals of F or P: Gateaux, Hadamard, and Fréchet Derivatives
5. Higher Order Derivatives
Chapter 7
Statistical Functionals and the Delta Method

1 Estimators as Functionals of Fn or Pn

Often the quantity we want to estimate can be viewed as a functional $T(F)$ or $T(P)$ of the underlying distribution function $F$ or probability measure $P$ generating the data. A natural nonparametric estimator is then simply $T(\mathbb{F}_n)$ or $T(\mathbb{P}_n)$, where $\mathbb{F}_n$ and $\mathbb{P}_n$ denote the empirical distribution function and empirical measure of the data.

Notation. Suppose that $X_1, \ldots, X_n$ are i.i.d. $P$ on $(\mathcal{X}, \mathcal{A})$. We let
$$\mathbb{P}_n \equiv \frac{1}{n} \sum_{i=1}^n \delta_{X_i} \equiv \text{the empirical measure of the sample},$$
where $\delta_x \equiv$ the measure with mass one at $x$ (so $\delta_x(A) = 1_A(x)$ for $A \in \mathcal{A}$). When $\mathcal{X} = \mathbb{R}^k$, especially when $k = 1$, we will write
$$\mathbb{F}_n(x) = \frac{1}{n} \sum_{i=1}^n 1_{(-\infty,x]}(X_i) = \mathbb{P}_n(-\infty, x], \qquad F(x) = P(-\infty, x].$$

Here is a list of examples.

Example 1.1 The mean: $T(F) = \int x \, dF(x)$; $T(\mathbb{F}_n) = \int x \, d\mathbb{F}_n(x)$.

Example 1.2 The $r$-th moment: for $r$ an integer, $T(F) = \int x^r \, dF(x)$, and $T(\mathbb{F}_n) = \int x^r \, d\mathbb{F}_n(x)$.

Example 1.3 The variance:
$$T(F) = \mathrm{Var}_F(X) = \int \Big( x - \int x \, dF(x) \Big)^2 dF(x) = \frac{1}{2} \int\!\!\int (x - y)^2 \, dF(x) \, dF(y),$$
$$T(\mathbb{F}_n) = \mathrm{Var}_{\mathbb{F}_n}(X) = \int \Big( x - \int x \, d\mathbb{F}_n(x) \Big)^2 d\mathbb{F}_n(x) = \frac{1}{2} \int\!\!\int (x - y)^2 \, d\mathbb{F}_n(x) \, d\mathbb{F}_n(y).$$

Example 1.4 The median: $T(F) = F^{-1}(1/2)$; $T(\mathbb{F}_n) = \mathbb{F}_n^{-1}(1/2)$.

Example 1.5 The $\alpha$-trimmed mean: $T(F) = (1 - 2\alpha)^{-1} \int_\alpha^{1-\alpha} F^{-1}(u) \, du$ for $0 < \alpha < 1/2$; $T(\mathbb{F}_n) = (1 - 2\alpha)^{-1} \int_\alpha^{1-\alpha} \mathbb{F}_n^{-1}(u) \, du$.
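As a concrete illustration of the plug-in principle, here is a minimal numerical sketch of Examples 1.1–1.5; the simulated sample, the moment order $r$, and the trimming level $\alpha$ are assumptions of the sketch, and the trimmed mean uses a simple grid approximation to the quantile integral.

```python
import numpy as np

rng = np.random.default_rng(0)               # illustrative sample, assumed
x = rng.normal(loc=1.0, scale=2.0, size=200)

# Example 1.1: T(F_n) = int x dF_n(x), the sample mean
mean_hat = x.mean()

# Example 1.2: T(F_n) = int x^r dF_n(x), the r-th sample moment
r = 3
moment_hat = np.mean(x ** r)

# Example 1.3: the V-statistic form (1/2) int int (x - y)^2 dF_n(x) dF_n(y)
var_hat = 0.5 * np.mean((x[:, None] - x[None, :]) ** 2)   # equals np.var(x)

# Example 1.4: T(F_n) = F_n^{-1}(1/2), the sample median
median_hat = np.median(x)

# Example 1.5: (1 - 2a)^{-1} int_a^{1-a} F_n^{-1}(u) du, via a grid in (0, 1)
alpha = 0.10
u = (np.arange(len(x)) + 0.5) / len(x)       # u-values paired with the order statistics
keep = (u >= alpha) & (u <= 1 - alpha)
trimmed_hat = np.sort(x)[keep].mean()
```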
Example 1.6 The Hodges–Lehmann functional: $T(F) = \frac{1}{2}\{F \star F\}^{-1}(1/2)$, where $\star$ denotes convolution. Then $T(\mathbb{F}_n) = \frac{1}{2}\{\mathbb{F}_n \star \mathbb{F}_n\}^{-1}(1/2) = \mathrm{median}\{(X_i + X_j)/2\}$.

Example 1.7 The Mann–Whitney functional: for $X, Y$ independent with distribution functions $F$ and $G$ respectively, $T(F, G) = \int F \, dG = P_{F,G}(X \le Y)$. Then $T(\mathbb{F}_m, \mathbb{G}_n) = \int \mathbb{F}_m \, d\mathbb{G}_n$ (based on two independent samples $X_1, \ldots, X_m$ i.i.d. $F$ with empirical df $\mathbb{F}_m$ and $Y_1, \ldots, Y_n$ i.i.d. $G$ with empirical df $\mathbb{G}_n$).

Example 1.8 The multivariate mean: for $P$ on $(\mathbb{R}^k, \mathcal{B}^k)$, $T(P) = \int x \, dP(x)$ (with values in $\mathbb{R}^k$), and $T(\mathbb{P}_n) = \int x \, d\mathbb{P}_n(x) = n^{-1} \sum_{i=1}^n X_i$.

Example 1.9 Multivariate cross second moments: for $P$ on $(\mathbb{R}^k, \mathcal{B}^k)$,
$$T(P) = \int x x^T \, dP(x) = \int x^{\otimes 2} \, dP(x);$$
$$T(\mathbb{P}_n) = \int x x^T \, d\mathbb{P}_n(x) = \int x^{\otimes 2} \, d\mathbb{P}_n(x) = n^{-1} \sum_{i=1}^n X_i X_i^T.$$
Note that $T(P)$ and $T(\mathbb{P}_n)$ take values in $\mathbb{R}^{k \times k}$.

Example 1.10 The multivariate covariance matrix: for $P$ on $(\mathbb{R}^k, \mathcal{B}^k)$,
$$T(P) = \int \Big( x - \int y \, dP(y) \Big)\Big( x - \int y \, dP(y) \Big)^T dP(x) = \frac{1}{2} \int\!\!\int (x - y)(x - y)^T \, dP(x) \, dP(y),$$
$$T(\mathbb{P}_n) = \int \Big( x - \int y \, d\mathbb{P}_n(y) \Big)\Big( x - \int y \, d\mathbb{P}_n(y) \Big)^T d\mathbb{P}_n(x) = \frac{1}{2} \int\!\!\int (x - y)(x - y)^T \, d\mathbb{P}_n(x) \, d\mathbb{P}_n(y) = n^{-1} \sum_{i=1}^n (X_i - \overline{X}_n)(X_i - \overline{X}_n)^T.$$

Example 1.11 The $k$-means clustering functional: $T(P) = (T_1(P), \ldots, T_k(P))$ where the $T_i(P)$'s minimize
$$\int |x - t_1|^2 \wedge \cdots \wedge |x - t_k|^2 \, dP(x) = \sum_{i=1}^k \int_{C_i} |x - t_i|^2 \, dP(x)$$
where $C_i = \{ x \in \mathbb{R}^m : t_i \text{ minimizes } |x - t|^2 \text{ over } t \in \{t_1, \ldots, t_k\} \}$. Then $T(\mathbb{P}_n) = (T_1(\mathbb{P}_n), \ldots, T_k(\mathbb{P}_n))$ where the $T_i(\mathbb{P}_n)$'s minimize
$$\int |x - t_1|^2 \wedge \cdots \wedge |x - t_k|^2 \, d\mathbb{P}_n(x).$$

Example 1.12 The simplicial depth function: for $P$ on $\mathbb{R}^k$ and $x \in \mathbb{R}^k$, set $T(P) \equiv T(P)(x) = \Pr_P(x \in S(X_1, \ldots, X_{k+1}))$ where $X_1, \ldots, X_{k+1}$ are i.i.d. $P$ and $S(x_1, \ldots, x_{k+1})$ is the simplex in $\mathbb{R}^k$ determined by $x_1, \ldots, x_{k+1}$; e.g. for $k = 2$, the simplex determined by $x_1, x_2, x_3$ is just a triangle. Then $T(\mathbb{P}_n) = \Pr_{\mathbb{P}_n}(x \in S(X_1, \ldots, X_{k+1}))$. Note that in this example $T(P)$ is a function from $\mathbb{R}^k$ to $\mathbb{R}$.
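Examples 1.6 and 1.7 reduce to counting over pairs, so the plug-in estimators can be computed exactly from the data; a small sketch (the function names are ours, and the all-pairs form, including $i = j$, matches the convolution $\mathbb{F}_n \star \mathbb{F}_n$):

```python
import numpy as np

def hodges_lehmann(x):
    """Example 1.6: T(F_n) = median of the pairwise averages (X_i + X_j)/2."""
    return np.median((x[:, None] + x[None, :]) / 2.0)

def mann_whitney(x, y):
    """Example 1.7: T(F_m, G_n) = int F_m dG_n = (1/(mn)) #{(i, j): X_i <= Y_j}."""
    return np.mean(x[:, None] <= y[None, :])
```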
Example 1.13 (Z-functional derived from likelihood). A maximum likelihood estimator: for $P$ on $(\mathcal{X}, \mathcal{A})$, suppose that $\mathcal{P} = \{P_\theta : \theta \in \Theta \subset \mathbb{R}^k\}$ is a regular parametric model with vector scores function $\dot{l}_\theta(\cdot; \theta)$. Then for general $P$, not necessarily in the model $\mathcal{P}$, consider $T$ defined by
$$(1) \qquad \int \dot{l}_\theta(x; T(P)) \, dP(x) = 0.$$
Then
$$\int \dot{l}_\theta(x; T(\mathbb{P}_n)) \, d\mathbb{P}_n(x) = 0$$
defines $T(\mathbb{P}_n)$. For estimation of location in one dimension with $\dot{l}(x; \theta) = \psi(x - \theta)$ and $\psi \equiv -f'/f$, these become
$$\int \psi(x - T(F)) \, dF(x) = 0 \quad \text{and} \quad \int \psi(x - T(\mathbb{F}_n)) \, d\mathbb{F}_n(x) = 0.$$
We expect that often the value $T(P) \in \Theta$ satisfying (1) also satisfies $T(P) = \operatorname{argmin}_{\theta \in \Theta} K(P, P_\theta)$. Here is a heuristic argument showing why this should be true. Note that in many cases we have
$$\hat\theta_n = \operatorname{argmax}_\theta \, n^{-1} l_n(\theta) = \operatorname{argmax}_\theta \, \mathbb{P}_n(\log p_\theta) \to_p \operatorname{argmax}_\theta \, P(\log p_\theta) = \operatorname{argmax}_\theta \int \log p_\theta(x) \, dP(x).$$
Now
$$P(\log p_\theta) = P(\log p) + P \log\Big(\frac{p_\theta}{p}\Big) = P(\log p) - P \log\Big(\frac{p}{p_\theta}\Big) = P(\log p) - K(P, P_\theta).$$
Thus
$$\operatorname{argmax}_\theta \int \log p_\theta(x) \, dP(x) = \operatorname{argmin}_\theta K(P, P_\theta) \equiv \theta(P).$$
If we can interchange differentiation and integration, it follows that
$$\nabla_\theta K(P, P_\theta) = -\int p(x) \dot{l}_\theta(x; \theta) \, d\mu(x) = -\int \dot{l}_\theta(x; \theta) \, dP(x),$$
so the relation (1) is obtained by setting this gradient vector equal to $0$.

Example 1.14 A bootstrap functional: let $T(F)$ be a functional with estimator $T(\mathbb{F}_n)$, and consider estimating the distribution function of $\sqrt{n}(T(\mathbb{F}_n) - T(F))$,
$$H_n(F; \cdot) = P_F\big(\sqrt{n}(T(\mathbb{F}_n) - T(F)) \le \cdot\big).$$
A natural estimator is $H_n(\mathbb{F}_n, \cdot)$.
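A numerical sketch of Examples 1.13 and 1.14. For the location Z-functional we swap in the Huber $\psi$ rather than an actual $-f'/f$; the tuning constant, the bracketing interval, and the number of bootstrap replications are all assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import brentq

def huber_psi(u, c=1.345):
    # a bounded, monotone psi standing in for psi = -f'/f of Example 1.13
    return np.clip(u, -c, c)

def z_estimate(x, c=1.345):
    """Solve the empirical version of (1): (1/n) sum_i psi(X_i - theta) = 0."""
    score = lambda theta: np.mean(huber_psi(x - theta, c))
    return brentq(score, x.min(), x.max())   # score is monotone in theta on this bracket

def bootstrap_H(x, T, B=1000, seed=0):
    """Example 1.14: estimate H_n(F, .) by H_n(F_n, .), i.e. resample from F_n."""
    rng = np.random.default_rng(seed)
    n, t_hat = len(x), T(x)
    return np.sort([np.sqrt(n) * (T(rng.choice(x, size=n)) - t_hat)
                    for _ in range(B)])      # sorted draws from H_n(F_n, .)
```

For instance, `bootstrap_H(x, np.median)` approximates the sampling distribution of the empirical median of Example 1.4.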
2 Continuity of Functionals of F or P

One of the basic properties of a functional $T$ is continuity (or lack thereof). The sense in which we will want our functionals $T$ to be continuous is the sense of weak convergence.

Definition 2.1 A. $T : \mathcal{F} \to \mathbb{R}$ is weakly continuous at $F_0$ if $F_n \Rightarrow F_0$ implies $T(F_n) \to T(F_0)$. $T : \mathcal{F} \to \mathbb{R}$ is weakly lower-semicontinuous at $F_0$ if $F_n \Rightarrow F_0$ implies $\liminf_{n\to\infty} T(F_n) \ge T(F_0)$.
B. $T : \mathcal{P} \to \mathbb{R}$ is weakly continuous at $P_0 \in \mathcal{P}$ if $P_n \Rightarrow P_0$ implies $T(P_n) \to T(P_0)$.

Example 2.1 The mean $T(F) = \int x \, dF(x)$ is discontinuous at every $F_0$: if $F_n = (1 - n^{-1})F_0 + n^{-1}\delta_{a_n}$, then $F_n \Rightarrow F_0$ since, for bounded and continuous $\psi$,
$$\int \psi \, dF_n = (1 - n^{-1}) \int \psi \, dF_0 + n^{-1}\psi(a_n) \to \int \psi \, dF_0.$$
But $T(F_n) = (1 - n^{-1})T(F_0) + n^{-1}a_n \to \infty$ if we choose $a_n$ so that $n^{-1}a_n \to \infty$.

Example 2.2 The $\alpha$-trimmed mean $T(F) = (1 - 2\alpha)^{-1} \int_\alpha^{1-\alpha} F^{-1}(u) \, du$ with $0 < \alpha < 1/2$ is continuous at every $F_0$: $F_n \Rightarrow F_0$ implies that $F_n^{-1}(t) \to F_0^{-1}(t)$ a.e. Lebesgue. Hence
$$T(F_n) = (1 - 2\alpha)^{-1} \int_\alpha^{1-\alpha} F_n^{-1}(u) \, du \to (1 - 2\alpha)^{-1} \int_\alpha^{1-\alpha} F_0^{-1}(u) \, du = T(F_0)$$
by the dominated convergence theorem.

Example 2.3 The median $T(F) = F^{-1}(1/2)$ is continuous at every $F_0$ such that $F_0^{-1}$ is continuous at $1/2$.

Example 2.4 (A lower-semicontinuous functional $T$). Let
$$T(F) = \mathrm{Var}_F(X) = \int (x - E_F X)^2 \, dF(x) = \frac{1}{2} E_F (X - X')^2$$
where $X, X' \sim F$ are independent; recall Example 1.3. If $F_n \to_d F$, then $\liminf_{n\to\infty} T(F_n) \ge T(F)$; this follows from Skorokhod and Fatou.
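Examples 2.1 and 2.2 can be watched side by side in simulation: contaminate a standard normal sample with a single point $a_n$ carrying mass $1/n$, and the sample mean stays off target while the trimmed mean does not. The choice $F_0 = N(0,1)$, $a_n = 10n$, and the sample sizes are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def trimmed_mean(x, alpha=0.1):
    """Plug-in alpha-trimmed mean of Example 2.2, dropping the outer tails."""
    xs = np.sort(x)
    k = int(alpha * len(xs))
    return xs[k:len(xs) - k].mean()

# F_n = (1 - 1/n) F_0 + (1/n) delta_{a_n} with a_n = 10 n, so n^{-1} a_n = 10
for n in [100, 1000, 10000]:
    x = np.append(rng.normal(size=n - 1), 10.0 * n)
    print(n, x.mean(), trimmed_mean(x))   # mean stays near 10; trimmed mean near 0
```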
Here is the basic fact about empirical measures that makes weak continuity of a functional $T$ useful:

Theorem 2.1 (Varadarajan). If $X_1, \ldots, X_n$ are i.i.d. $P$ on a separable metric space $(S, d)$, then $\Pr(\mathbb{P}_n \Rightarrow P) = 1$.

Proof. For each fixed bounded and continuous function $\psi$ we have
$$\mathbb{P}_n\psi \equiv \int \psi \, d\mathbb{P}_n = \frac{1}{n} \sum_{i=1}^n \psi(X_i) \to_{a.s.} P\psi \equiv \int \psi \, dP$$
by the ordinary strong law of large numbers. The proof is completed by noting that the collection of bounded continuous functions on a separable metric space $(S, d)$ is itself separable. See Dudley (1989), sections 11.2 and 11.4. □

Combining Varadarajan's theorem with weak continuity of $T$ yields the following simple result.

Proposition 2.1 Suppose that:
(i) $(\mathcal{X}, \mathcal{A}) = (S, \mathcal{B}_{Borel})$ where $(S, d)$ is a separable metric space and $\mathcal{B}_{Borel}$ denotes its usual Borel sigma-field.
(ii) $T : \mathcal{P} \to \mathbb{R}$ is weakly continuous at $P_0 \in \mathcal{P}$.
(iii) $X_1, \ldots, X_n$ are i.i.d. $P_0$.
Then $T_n \equiv T(\mathbb{P}_n) \to_{a.s.} T(P_0)$.

Proof. By Varadarajan's theorem 2.1, $\mathbb{P}_n \Rightarrow P_0$ a.s. Fix $\omega \in A$ with $\Pr(A) = 1$ so that $\mathbb{P}_n^\omega \Rightarrow P_0$. Then by weak continuity of $T$, $T(\mathbb{P}_n^\omega) \to T(P_0)$. □

A difficulty in using this proposition typically lies in verifying weak continuity of $T$. Weak continuity is a rather strong hypothesis, and many interesting functionals fail to have this type of continuity. The following approach is often useful.

Definition 2.2 Let $\mathcal{F} \subset L_1(P)$ be a collection of integrable functions. Say that $P_n \to P$ with respect to $\|\cdot\|_{\mathcal{F}}$ if $\|P_n - P\|_{\mathcal{F}} \equiv \sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \to 0$. Furthermore, we say that $T : \mathcal{P} \to \mathbb{R}$ is continuous with respect to $\|\cdot\|_{\mathcal{F}}$ if $\|P_n - P\|_{\mathcal{F}} \to 0$ implies that $T(P_n) \to T(P)$.

Definition 2.3 If $\mathcal{F} \subset L_1(P)$ is a collection of integrable functions with $\|\mathbb{P}_n - P\|^*_{\mathcal{F}} \to_{a.s.} 0$, we say that $\mathcal{F}$ is a Glivenko–Cantelli class for $P$ and write $\mathcal{F} \in GC(P)$.

Theorem 2.2 Suppose that:
(i) $\mathcal{F} \in GC(P)$; i.e. $\|\mathbb{P}_n - P\|^*_{\mathcal{F}} \to_{a.s.} 0$.
(ii) $T$ is continuous with respect to $\|\cdot\|_{\mathcal{F}}$.
Then $T(\mathbb{P}_n) \to_{a.s.} T(P)$.
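For the classical class $\mathcal{F} = \{1_{(-\infty, t]} : t \in \mathbb{R}\}$, $\|\mathbb{P}_n - P\|_{\mathcal{F}}$ is exactly the Kolmogorov distance $\|\mathbb{F}_n - F\|_\infty$, and the Glivenko–Cantelli property of Definition 2.3 is easy to check numerically; a sketch assuming $P = N(0,1)$:

```python
import numpy as np
from scipy.stats import norm

def kolmogorov_distance(x, cdf):
    """sup_t |F_n(t) - F(t)|; the supremum is attained at (or just before) a data point."""
    xs = np.sort(x)
    n = len(xs)
    F = cdf(xs)
    upper = np.arange(1, n + 1) / n - F      # F_n(x_(i)) - F(x_(i))
    lower = F - np.arange(0, n) / n          # F(x_(i)) - F_n(x_(i)-)
    return max(upper.max(), lower.max())

rng = np.random.default_rng(2)
for n in [100, 1000, 10000]:
    print(n, kolmogorov_distance(rng.normal(size=n), norm.cdf))   # decreases toward 0
```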
3 Metrics for Distribution Functions F and Probability Distributions P

We have already encountered the total variation and Hellinger metrics in the course of studying Scheffé's lemma, Bayes estimators, and tests of hypotheses. As we will see, as useful as these metrics are, they are too strong: the empirical measure $\mathbb{P}_n$ fails to converge to the true $P$ in either the total variation or Hellinger distance in general. In fact this fails to hold in general for the Prohorov and dual bounded Lipschitz metrics which we introduce below, as has been shown by Dudley (1969), Kersting (1978), and Bretagnolle and Huber-Carol (1977); see also the remarks in Huber (1981), page 39. Nonetheless, it will be helpful to have in mind some useful metrics for probability measures $P$ and df's $F$, and their properties.

Definition 3.1 The Kolmogorov or supremum metric between two distribution functions $F$ and $G$ is
$$d_K(F, G) \equiv \|F - G\|_\infty \equiv \sup_{x \in \mathbb{R}^k} |F(x) - G(x)|.$$

Definition 3.2 The Lévy metric between two distribution functions $F$ and $G$ is
$$d_L(F, G) \equiv \inf\{\epsilon > 0 : G(x - \epsilon) - \epsilon \le F(x) \le G(x + \epsilon) + \epsilon \text{ for all } x \in \mathbb{R}\}.$$

Definition 3.3 The Prohorov metric between two probability measures $P, Q$ on a metric space $(S, d)$ is
$$d_{Pr}(P, Q) \equiv \inf\{\epsilon > 0 : P(B) \le Q(B^\epsilon) + \epsilon \text{ for all Borel sets } B\}$$
where $B^\epsilon \equiv \{x : \inf_{y \in B} d(x, y) \le \epsilon\}$.

To define the next metric for $P, Q$ on a metric space $(S, d)$, for any real-valued function $f$ on $S$ set $\|f\|_L \equiv \sup_{x \ne y} |f(x) - f(y)|/d(x, y)$, and denote the usual supremum norm by $\|f\|_\infty \equiv \sup_x |f(x)|$. Finally, set $\|f\|_{BL} \equiv \|f\|_L + \|f\|_\infty$.

Definition 3.4 The dual bounded Lipschitz metric $d_{BL^*}$ is defined by
$$d_{BL^*}(P, Q) \equiv \sup\Big\{\Big|\int f \, dP - \int f \, dQ\Big| : \|f\|_{BL} \le 1\Big\}.$$

Definition 3.5 The total variation metric $d_{TV}$ is defined by
$$d_{TV}(P, Q) \equiv \sup\{|P(A) - Q(A)| : A \in \mathcal{A}\} = \frac{1}{2} \int |p - q| \, d\mu$$
where $p \equiv dP/d\mu$, $q \equiv dQ/d\mu$ for some measure $\mu$ dominating both $P$ and $Q$ (e.g. $\mu = P + Q$).

Definition 3.6 The Hellinger metric $H$ is defined by
$$H^2(P, Q) = \frac{1}{2} \int \{\sqrt{p} - \sqrt{q}\}^2 \, d\mu = 1 - \int \sqrt{pq} \, d\mu \equiv 1 - \rho(P, Q)$$
where $\mu$ is any measure dominating both $P$ and $Q$. The quantity $\rho(P, Q) \equiv \int \sqrt{pq} \, d\mu$ is called the affinity between $P$ and $Q$.
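Definitions 3.5 and 3.6 are immediate to compute when $P$ and $Q$ live on a common finite support, taking $\mu$ to be counting measure; the Binomial pair below is an assumed toy example, and the last line looks ahead to check inequality B of the theorem that follows.

```python
import numpy as np
from scipy.stats import binom

support = np.arange(0, 21)
p = binom.pmf(support, 20, 0.5)              # assumed toy pair P, Q
q = binom.pmf(support, 20, 0.6)

d_tv = 0.5 * np.abs(p - q).sum()                              # Definition 3.5
H = np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum())     # Definition 3.6
rho = np.sqrt(p * q).sum()                                    # affinity; H^2 = 1 - rho

print(H**2 <= d_tv <= H * np.sqrt(2 - H**2))   # Theorem 3.1 B: prints True
```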
The following basic theorem establishes relationships between these metrics.

Theorem 3.1
A. $d_{Pr}(P, Q)^2 \le d_{BL^*}(P, Q) \le 2 d_{Pr}(P, Q)$.
B. $H^2(P, Q) \le d_{TV}(P, Q) \le H(P, Q)\{2 - H^2(P, Q)\}^{1/2}$.
C. $d_{Pr}(P, Q) \le d_{TV}(P, Q)$.
D. For distributions $P, Q$ on the real line, $d_L \le d_K \le d_{TV}$.

Proof. We proved B in Chapter 2. For A, see Dudley (1989), section 11.3, problem 5, and section 11.6, corollary 11.6.5. Also see Huber (1981), corollary 2.4.3, page 33. Another useful reference is Whitt (1974). □

Theorem 3.2 (Strassen). The following are equivalent:
(a) $d_{Pr}(P, Q) \le \epsilon$.
(b) There exist $X \sim P$, $Y \sim Q$ defined on a common probability space $(\Omega, \mathcal{F}, \Pr)$ such that $\Pr(d(X, Y) \le \epsilon) \ge 1 - \epsilon$.

Proof. That (b) implies (a) is easy: for any Borel set $B$,
$$[X \in B] = [X \in B, d(X, Y) \le \epsilon] \cup [X \in B, d(X, Y) > \epsilon] \subset [Y \in B^\epsilon] \cup [d(X, Y) > \epsilon],$$
so that $P(B) \le Q(B^\epsilon) + \epsilon$. For the proof that (a) implies (b) see Strassen (1965), Dudley (1968), or Schay (1974). A nice treatment of Strassen's theorem is given by Dudley (1989). □
4 Differentiability of Functionals T of F or P

To be able to prove more than consistency, we will need stronger properties of the functional $T$, namely differentiability.

Definition 4.1 $T$ is Gateaux differentiable at $F$ if there exists a linear functional $\dot{T}(F; \cdot)$ such that for $F_t = (1 - t)F + tG$,
$$\lim_{t \to 0} \frac{T(F_t) - T(F)}{t} = \dot{T}(F; G - F) = \int \psi(x) \, d(G(x) - F(x)) = \int \psi_F(x) \, dG(x),$$
where $\psi_F(x) \equiv \psi(x) - \int \psi \, dF$ has mean zero under $F$. Similarly, $T : \mathcal{P} \to \mathbb{R}$ is Gateaux differentiable at $P$ if there exists $\dot{T}(P; \cdot)$ bounded and linear such that for $P_t \equiv (1 - t)P + tQ$,
$$\lim_{t \to 0} \frac{T(P_t) - T(P)}{t} = \dot{T}(P; Q - P) = \int \psi(x) \, d(Q(x) - P(x)) = \int \psi_P(x) \, dQ(x).$$

Definition 4.2 $T$ has the influence function or influence curve $IC(x; T, F)$ at $F$ if, with $F_t \equiv (1 - t)F + t\delta_x$,
$$\lim_{t \to 0} \frac{T(F_t) - T(F)}{t} = IC(x; T, F) = \psi_F(x).$$

Example 4.1 Probability of a set: suppose that $T(F) = F(A)$ for a fixed measurable set $A$. Then
$$\frac{T(F_t) - T(F)}{t} = \int \Big\{1_A(x) - \int 1_A(y) \, dF(y)\Big\} \, dG(x) = \int \psi_F(x) \, dG(x)$$
where $\psi_F(x) = 1_A(x) - F(A)$.

Example 4.2 The mean: $T(F) = \int x \, dF(x)$. Then
$$\frac{T(F_t) - T(F)}{t} = \int \{x - T(F)\} \, dG(x) = \int \psi_F(x) \, dG(x)$$
where $\psi_F(x) = x - T(F)$. Note that the influence function $\psi_F(x)$ for the probability functional is bounded, but the influence function $\psi_F(x)$ for the mean functional is unbounded.

Example 4.3 The variance: $T(F) = \mathrm{Var}_F(X) = \int (x - \mu(F))^2 \, dF(x)$. Now
$$\frac{d}{dt} T(F_t)\Big|_{t=0} = \frac{d}{dt} \int (x - \mu(F_t))^2 \, dF_t(x)\Big|_{t=0}$$
$$= \int (x - \mu(F))^2 \, d(G - F)(x) + 2 \int (x - \mu(F))(-1)\dot{\mu}(F; G - F) \, dF(x)$$
$$= \int (x - \mu(F))^2 \, d(G - F)(x),$$
since the second term vanishes because $\int (x - \mu(F)) \, dF(x) = 0$; and then, because $\int (x - \mu(F))^2 \, dF(x) = \sigma_F^2$,
$$\frac{d}{dt} T(F_t)\Big|_{t=0} = \int \{(x - \mu(F))^2 - \sigma_F^2\} \, dG(x).$$
Hence $IC(x; T, F) = \psi_F(x) = (x - \mu(F))^2 - \sigma_F^2$.
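Definition 4.2 can be checked numerically for Example 4.3: take $F$ to be the empirical df of a fixed sample, form $F_t = (1-t)F + t\delta_x$ for a small $t$, and compare the difference quotient with $(x - \mu(F))^2 - \sigma_F^2$. The sample playing the role of $F$, the step $t$, and the test points are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
sample = rng.normal(size=500)                # plays the role of F (assumed)
w0 = np.ones(len(sample)) / len(sample)      # uniform weights: the empirical df
mu, sig2 = sample.mean(), sample.var()

def variance_functional(weights, atoms):
    """T(F) = Var_F(X) for a discrete distribution with given atoms and weights."""
    m = np.sum(weights * atoms)
    return np.sum(weights * (atoms - m) ** 2)

def gateaux_quotient(x, t=1e-6):
    """(T((1 - t) F + t delta_x) - T(F)) / t, as in Definition 4.2."""
    atoms = np.append(sample, x)
    w = np.append((1 - t) * w0, t)
    return (variance_functional(w, atoms) - variance_functional(w0, sample)) / t

for x in (-2.0, 0.0, 3.0):
    print(gateaux_quotient(x), (x - mu) ** 2 - sig2)   # the two columns agree to O(t)
```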