A Probabilistic Proof of the Lindeberg-Feller Central Limit Theorem

Larry Goldstein

1 INTRODUCTION.

The Central Limit Theorem, one of the most striking and useful results in probability and statistics, explains why the normal distribution appears in areas as diverse as gambling, measurement error, sampling, and statistical mechanics. In essence, the Central Limit Theorem states that the normal distribution applies whenever one is approximating probabilities for a quantity which is a sum of many independent contributions, all of which are roughly the same size. It is the Lindeberg-Feller Theorem which makes this statement precise by providing the sufficient, and in some sense necessary, Lindeberg condition whose satisfaction accounts for the ubiquitous appearance of the bell shaped normal.

Generally the Lindeberg condition is handled using Fourier methods and is somewhat hard to interpret from the classical point of view. Here we provide a simpler, equivalent, and more easily interpretable probabilistic formulation of the Lindeberg condition and demonstrate its sufficiency and partial necessity in the Central Limit Theorem using more elementary means.

The seeds of the Central Limit Theorem, or CLT, lie in the work of Abraham de Moivre, who, in 1733, not being able to secure himself an academic appointment, supported himself consulting on problems of probability and gambling. He approximated the limiting probabilities of the binomial distribution, the one which governs the behavior of the number $S_n$ of successes in an experiment which consists of $n$ independent trials, each one having the same probability $p \in (0,1)$ of success. Each individual trial of the experiment can be modelled by $X$, a (Bernoulli) random variable which records one for each success and zero for each failure,
$$P(X=1)=p \quad \text{and} \quad P(X=0)=1-p,$$
and has mean $EX = p$ and variance $\mathrm{Var}(X) = p(1-p)$. The record of successes and failures in $n$ independent trials is then given by an independent sequence $X_1, X_2, \ldots, X_n$ of these Bernoulli variables, and the total number of successes $S_n$ by their sum
$$S_n = X_1 + \cdots + X_n. \qquad (1)$$
Exactly, $S_n$ has the binomial distribution, which specifies that
$$P(S_n = k) = \binom{n}{k} p^k (1-p)^{n-k} \quad \text{for } k = 0, 1, \ldots, n.$$
For even moderate values of $n$, managing the binomial coefficients $\binom{n}{k}$ becomes unwieldy, to say nothing of computing the sum which yields the cumulative probability
$$P(S_n \le m) = \sum_{k \le m} \binom{n}{k} p^k (1-p)^{n-k}$$
that there will be $m$ or fewer successes. The great utility of the CLT is in providing an easily computable approximation to such probabilities that can be quite accurate even for moderate values of $n$.

Standardizing the binomial $S_n$ by subtracting its mean and dividing by its standard deviation to obtain the mean zero, variance one random variable $W_n = (S_n - np)/\sqrt{np(1-p)}$, the CLT yields that
$$\forall x \quad \lim_{n \to \infty} P(W_n \le x) = P(Z \le x), \qquad (2)$$
where $Z$ is $\mathcal{N}(0,1)$, a standard, mean zero variance one, normal random variable, that is, the one with distribution function
$$\Phi(x) = \int_{-\infty}^x \varphi(u)\,du \quad \text{where} \quad \varphi(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2}u^2\right). \qquad (3)$$
We may therefore, for instance, approximate the cumbersome cumulative binomial probability $P(S_n \le m)$ by the simpler $\Phi((m - np)/\sqrt{np(1-p)})$.

It was only for the special case of the binomial that the normal approximation was first considered. Only many years later, with the work of Laplace around 1820, did it begin to be systematically realized that the same normal limit is obtained when the underlying Bernoulli variables are replaced by any variables with a finite variance. The result was the classical Central Limit Theorem, which states that (2) holds whenever
$$W_n = (S_n - n\mu)/\sqrt{n\sigma^2}$$
is the standardization of a sum $S_n$, as in (1), of independent and identically distributed random variables each with mean $\mu$ and variance $\sigma^2$.
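Returning to de Moivre's setting, here is a quick numerical check, our own and not part of the original article, of how well the normal approximation $\Phi((m - np)/\sqrt{np(1-p)})$ tracks the exact cumulative binomial probability; the parameter values below are arbitrary.

```python
from math import comb, erf, sqrt

def binom_cdf(m, n, p):
    """Exact cumulative probability P(S_n <= m) for the binomial distribution."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m + 1))

def Phi(x):
    """Standard normal distribution function, as in (3)."""
    return 0.5 * (1 + erf(x / sqrt(2)))

n, p, m = 100, 0.3, 35  # arbitrary illustration values
exact = binom_cdf(m, n, p)
approx = Phi((m - n * p) / sqrt(n * p * (1 - p)))
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

Replacing $m$ by $m + 1/2$, the usual continuity correction, typically brings the two values even closer.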
From this generalization it now becomes somewhat clearer why various distributions observed in nature, which may not be at all related to the binomial, such as the errors of measurement averages, or the heights of individuals in a sample, take on the bell shaped form: each observation is the result of summing many small independent factors.

A further extension of the classical CLT could yet come. In situations where the summands do not have identical distributions, can the normal curve still govern? For an example, consider the symmetric group $S_n$, the set of all permutations $\pi$ on the set $\{1, 2, \ldots, n\}$. We can represent $\pi \in S_7$, for example, by two line notation
$$\pi = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 4 & 3 & 7 & 6 & 5 & 1 & 2 \end{pmatrix},$$
from which one can read that $\pi(1) = 4$ and $\pi(4) = 6$. This permutation can also be represented in the cycle notation
$$\pi = (1, 4, 6)(2, 3, 7)(5),$$
with the meaning that $\pi$ maps 1 to 4, 4 to 6, 6 to 1, and so forth. From the cycle representation we see that $\pi$ has two cycles of length 3 and one of length 1, for a total of three cycles. In general, let $K_n(\pi)$ denote the total number of cycles in a permutation $\pi \in S_n$. If $\pi$ is chosen uniformly from all the $n!$ permutations in $S_n$, does the Central Limit Theorem imply that $K_n(\pi)$ is approximately normally distributed for large $n$?

To answer this question we will employ the Feller coupling [3], which constructs a random permutation $\pi$ uniformly from $S_n$ with the help of $n$ independent Bernoulli variables $X_1, \ldots, X_n$ with distributions
$$P(X_i = 0) = 1 - \frac{1}{i} \quad \text{and} \quad P(X_i = 1) = \frac{1}{i}, \qquad i = 1, \ldots, n. \qquad (4)$$
Begin the first cycle at stage 1 with the element 1. At stage $i$, $i = 1, \ldots, n$, if $X_{n-i+1} = 1$ close the current cycle and begin a new one starting with the smallest number not yet in any cycle, and otherwise choose an element uniformly from those yet unused and place it to the right of the last element in the current cycle. In this way at stage $i$ we complete a cycle with probability $1/(n-i+1)$, upon mapping the last element of the current cycle to the one which begins it.
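The staged description above translates directly into code. The following is a minimal sketch of the Feller coupling under our reading of the construction; the function name and return convention are ours.

```python
import random

def feller_coupling(n, rng=random):
    """Build a random permutation of {1, ..., n} cycle by cycle.

    At stage i the Bernoulli variable X_{n-i+1} of display (4) equals 1
    with probability 1/(n-i+1), in which case the current cycle closes.
    K_n(pi) is recovered as the number of cycles returned.
    """
    unused = list(range(2, n + 1))  # element 1 begins the first cycle
    cycles, current = [], [1]
    for i in range(1, n + 1):
        if rng.random() < 1.0 / (n - i + 1):  # X_{n-i+1} = 1: close the cycle
            cycles.append(current)
            if unused:  # begin a new cycle at the smallest unused element
                start = min(unused)
                unused.remove(start)
                current = [start]
            else:
                current = []
        else:  # X_{n-i+1} = 0: extend with a uniformly chosen unused element
            current.append(unused.pop(rng.randrange(len(unused))))
    return cycles

random.seed(0)
cycles = feller_coupling(7)
print(cycles, "K_7 =", len(cycles))  # a cycle decomposition of a permutation of {1,...,7}
```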
As the total number $K_n(\pi)$ of cycles of $\pi$ is exactly the number of times an element closes the loop upon completing its own cycle,
$$K_n(\pi) = X_1 + \cdots + X_n, \qquad (5)$$
a sum of independent, but not identically distributed, random variables. Hence, despite the similarity of (5) to (1), the hypotheses of the classical central limit theorem do not hold. Nevertheless, in 1922 Lindeberg [7] provided a general condition which can be applied in this case to show that $K_n(\pi)$ is asymptotically normal.

To explore Lindeberg's condition, first consider the proper standardization of $K_n(\pi)$ in our example. As any Bernoulli random variable with success probability $p$ has mean $p$ and variance $p(1-p)$, the Bernoulli variable $X_i$ in (4) has mean $i^{-1}$ and variance $i^{-1}(1 - i^{-1})$ for $i = 1, \ldots, n$. Thus,
$$h_n = \sum_{i=1}^n \frac{1}{i} \quad \text{and} \quad \sigma_n^2 = \sum_{i=1}^n \left( \frac{1}{i} - \frac{1}{i^2} \right) \qquad (6)$$
are the mean and variance of $K_n(\pi)$, respectively; the mean $h_n$ is known as the $n$th harmonic number. In particular, standardizing $K_n(\pi)$ to have mean zero and variance 1 results in
$$W_n = \frac{K_n(\pi) - h_n}{\sigma_n},$$
which, absorbing the scaling inside the sum, can be written as
$$W_n = \sum_{i=1}^n X_{i,n} \quad \text{where} \quad X_{i,n} = \frac{X_i - i^{-1}}{\sigma_n}. \qquad (7)$$
In general, it is both more convenient and more encompassing to deal not with a sequence of variables but rather with a triangular array as in (7) which satisfies the following condition.

Condition 1.1 For every $n = 1, 2, \ldots$, the random variables making up the collection $\mathcal{X}_n = \{X_{i,n} : 1 \le i \le n\}$ are independent with mean zero and finite variances $\sigma_{i,n}^2 = \mathrm{Var}(X_{i,n})$, standardized so that
$$W_n = \sum_{i=1}^n X_{i,n} \quad \text{has variance} \quad \mathrm{Var}(W_n) = \sum_{i=1}^n \sigma_{i,n}^2 = 1.$$
Of course, even under Condition 1.1, some further assumptions must be satisfied by the summand variables for the normal convergence (2) to take place. For instance, if the first variable accounts for some non-vanishing fraction of the total variability, it will strongly influence the limiting distribution, possibly resulting in non-normal convergence. The Lindeberg-Feller central limit theorem, see [4], says that normal convergence (2) holds upon ruling out such situations by imposing the Lindeberg Condition
$$\forall \epsilon > 0 \quad \lim_{n \to \infty} L_{n,\epsilon} = 0 \quad \text{where} \quad L_{n,\epsilon} = \sum_{i=1}^n E\{X_{i,n}^2 \mathbf{1}(|X_{i,n}| \ge \epsilon)\}, \qquad (8)$$
where for an event $A$, the 'indicator' random variable $\mathbf{1}(A)$ takes on the value 1 if $A$ occurs, and the value 0 otherwise. Once known to be sufficient, the Lindeberg condition was proved to be partially necessary by Feller and Lévy, independently; see [8] for the history. The appearance of the Lindeberg Condition is justified by explanations such as the one given by Feller [4], who roughly says that it requires the individual variances be due mainly to masses in an interval whose length is small in comparison to the overall variance. We present a probabilistic condition which is seemingly simpler, yet equivalent.

Our probabilistic approach to the CLT is through the so called zero bias transformation introduced in [6]. For every distribution with mean zero and finite non-zero variance $\sigma^2$ on a random variable $X$, the zero bias transformation returns the unique '$X$-zero biased distribution' on $X^*$ which satisfies
$$\sigma^2 E f'(X^*) = E[X f(X)] \qquad (9)$$
for all absolutely continuous functions $f$ for which these expectations exist.

The existence of a strong connection between the zero bias transformation and the normal distribution is made clear by the characterization of Stein [9], which implies that $X^*$ and $X$ have the same distribution if and only if $X$ has the $\mathcal{N}(0, \sigma^2)$ distribution; that is, the normal distribution is the zero bias transformation's unique fixed point.

One way to see the 'if' direction of Stein's characterization, that is, why the zero bias transformation fixes the normal, is to note that the density function $\varphi_{\sigma^2}(x) = \sigma^{-1}\varphi(\sigma^{-1}x)$ of a $\mathcal{N}(0, \sigma^2)$ variable, with $\varphi(x)$ given by (3), satisfies the differential equation with a form 'conjugate' to (9),
$$\sigma^2 \varphi_{\sigma^2}'(x) = -x \varphi_{\sigma^2}(x),$$
and now (9), with $X^* = X$, follows for a large class of functions $f$ by integration by parts.
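As a small Monte Carlo illustration, our own, of this 'if' direction, the characterizing identity (9) with $X^* = X = Z$ can be checked numerically for a particular smooth $f$; here $f(x) = \sin x$, an arbitrary choice, for which both sides equal $e^{-1/2}$.

```python
import math, random

# Check E f'(Z) = E[Z f(Z)] for Z ~ N(0,1) with f(x) = sin(x), f'(x) = cos(x).
# Both expectations equal exp(-1/2), approximately 0.6065.
random.seed(0)
zs = [random.gauss(0.0, 1.0) for _ in range(10**6)]
lhs = sum(math.cos(z) for z in zs) / len(zs)       # E f'(Z)
rhs = sum(z * math.sin(z) for z in zs) / len(zs)   # E[Z f(Z)]
print(f"E f'(Z) = {lhs:.4f}, E[Z f(Z)] = {rhs:.4f}")
```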
We can gain some additional intuition regarding the zero bias transformation by observing its action on non-normal distributions, which, in some sense, moves them closer to normality. Let $B$ be a Bernoulli random variable with success probability $p \in (0,1)$, and let $\mathcal{U}[a,b]$ denote the uniform distribution on the finite interval $[a,b]$. Centering $B$ to form the mean zero discrete random variable $X = B - p$ having variance $\sigma^2 = p(1-p)$, substitution into the right hand side of (9) yields
$$E[Xf(X)] = E[(B-p)f(B-p)] = p(1-p)f(1-p) - (1-p)p f(-p) = \sigma^2 [f(1-p) - f(-p)] = \sigma^2 \int_{-p}^{1-p} f'(u)\,du = \sigma^2 E f'(U),$$
for $U$ having uniform density over $[-p, 1-p]$. Hence, with $=_d$ indicating the equality of two random variables in distribution,
$$(B - p)^* =_d U \quad \text{where } U \text{ has distribution } \mathcal{U}[-p, 1-p]. \qquad (10)$$
This example highlights the general fact that the distribution of $X^*$ is always absolutely continuous, regardless of the nature of the distribution of $X$.

It is the uniqueness of the fixed point of the zero bias transformation, that is, the fact that $X^*$ has the same distribution as $X$ only when $X$ is normal, that provides the probabilistic reason behind the CLT. This 'only if' direction of Stein's characterization suggests that a distribution which gets mapped to one nearby is close to being a fixed point of the zero bias transformation, and therefore must be close to the transformation's only fixed point, the normal. Hence the normal approximation should apply whenever the distribution of a random variable is close to that of its zero bias transformation.
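The computation leading to (10) is easy to confirm numerically as well. This sketch, again our own, checks $\sigma^2 E f'(U) = E[Xf(X)]$ for the centered Bernoulli with the arbitrary test function $f(x) = x^3$.

```python
import random

# Verify (10) numerically: for X = B - p with B ~ Bernoulli(p), the identity
# sigma^2 E f'(U) = E[X f(X)] should hold with U ~ Uniform[-p, 1-p].
# Test function f(x) = x^3, so f'(x) = 3x^2.
p = 0.3
sigma2 = p * (1 - p)

# Exact two-point expectation E[X f(X)] = E[X^4] = p(1-p)^4 + (1-p)p^4
lhs = p * (1 - p) ** 4 + (1 - p) * p ** 4

# Monte Carlo estimate of sigma^2 E f'(U)
random.seed(1)
us = [random.uniform(-p, 1 - p) for _ in range(10**6)]
rhs = sigma2 * sum(3 * u * u for u in us) / len(us)
print(f"E[X f(X)] = {lhs:.5f}, sigma^2 E f'(U) = {rhs:.5f}")
```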
Moreover, the zero bias transformation has a special property that immediately shows why the distribution of a sum $W_n$ of comparably sized independent random variables is close to that of $W_n^*$: a sum of independent terms can be zero biased by choosing a single summand proportionally to its variance and replacing it with one of comparable size. Thus, by differing only in a single summand, the variables $W_n$ and $W_n^*$ are close, making $W_n$ an approximate fixed point of the zero bias transformation, and therefore approximately normal. This explanation, when given precisely, becomes a probabilistic proof of the Lindeberg-Feller central limit theorem under a condition equivalent to (8) which we call the 'small zero bias condition'.

We first consider more precisely this special property of the zero bias transformation on independent sums. Given $\mathcal{X}_n$ satisfying Condition 1.1, let $\mathcal{X}_n^* = \{X_{i,n}^* : 1 \le i \le n\}$ be a collection of random variables such that $X_{i,n}^*$ has the $X_{i,n}$ zero bias distribution and is independent of $\mathcal{X}_n$. Further, let $I_n$ be a random index, independent of $\mathcal{X}_n$ and $\mathcal{X}_n^*$, with distribution
$$P(I_n = i) = \sigma_{i,n}^2, \qquad (11)$$
and write the variable selected by $I_n$, that is, the mixture, using indicator functions as
$$X_{I_n,n} = \sum_{i=1}^n \mathbf{1}(I_n = i) X_{i,n} \quad \text{and} \quad X_{I_n,n}^* = \sum_{i=1}^n \mathbf{1}(I_n = i) X_{i,n}^*. \qquad (12)$$
Then
$$W_n^* = W_n - X_{I_n,n} + X_{I_n,n}^* \qquad (13)$$
has the $W_n$ zero bias distribution. For the simple proof of this fact, see [6].

From (13) we see that the CLT should hold when the random variables $X_{I_n,n}$ and $X_{I_n,n}^*$ are both small asymptotically, since then the distribution of $W_n$ is close to that of $W_n^*$, making $W_n$ an approximate fixed point of the zero bias transformation. The following theorem shows that properly formalizing the notion of smallness results in a condition equivalent to Lindeberg's. Recall that we say a sequence of random variables $Y_n$ converges in probability to $Y$, and write $Y_n \to_p Y$, if $\lim_{n \to \infty} P(|Y_n - Y| \ge \epsilon) = 0$ for all $\epsilon > 0$.

Theorem 1.1 For a collection of random variables $\mathcal{X}_n$, $n = 1, 2, \ldots$, satisfying Condition 1.1, the small zero bias condition
$$X_{I_n,n}^* \to_p 0 \qquad (14)$$
and the Lindeberg condition (8) are equivalent.
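Displays (11)-(13) give a concrete recipe for sampling $W_n^*$ alongside $W_n$. The sketch below, our own illustration on the cycle-count array (7), uses the fact, justified below, that $X_{i,n}^*$ has the distribution of $U_i/\sigma_n$ with $U_i$ uniform on $[-1/i, 1 - 1/i]$.

```python
import math, random

# Sample the pair (W_n, W_n^*) of (13) for the cycle-count array (7):
# X_{i,n} = (X_i - 1/i)/sigma_n with X_i ~ Bernoulli(1/i), and X*_{i,n}
# distributed as U_i/sigma_n with U_i ~ Uniform[-1/i, 1 - 1/i].
def sample_W_Wstar(n, rng=random):
    sigma2 = sum(1.0 / i - 1.0 / i**2 for i in range(1, n + 1))
    sigma = math.sqrt(sigma2)
    xs = [((1.0 if rng.random() < 1.0 / i else 0.0) - 1.0 / i) / sigma
          for i in range(1, n + 1)]
    W = sum(xs)
    # Random index I_n chosen proportionally to the summand variances, as in (11)
    weights = [1.0 / i - 1.0 / i**2 for i in range(1, n + 1)]
    I = rng.choices(range(1, n + 1), weights=weights)[0]
    x_star = rng.uniform(-1.0 / I, 1.0 - 1.0 / I) / sigma  # zero biased replacement
    return W, W - xs[I - 1] + x_star  # W_n^* of (13)

random.seed(2)
W, W_star = sample_W_Wstar(1000)
print(f"W_n = {W:.3f}, W_n^* = {W_star:.3f}")  # the pair differs by at most 2/sigma_n
```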
Our probabilistic proof of the Lindeberg-Feller Theorem develops by first showing that the small zero bias condition implies
$$X_{I_n,n} \to_p 0,$$
and hence that
$$W_n^* - W_n = X_{I_n,n}^* - X_{I_n,n} \to_p 0.$$
Theorem 1.2 confirms that this convergence in probability to zero, mandating that $W_n$ have its own zero bias distribution in the limit, is sufficient to guarantee normal convergence.

Theorem 1.2 If $\mathcal{X}_n$, $n = 1, 2, \ldots$, satisfies Condition 1.1 and the small zero bias condition (14), then
$$\forall x \quad \lim_{n \to \infty} P(W_n \le x) = P(Z \le x).$$

We return now to the number $K_n(\pi)$ of cycles of a random permutation in $S_n$, with mean $h_n$ and variance $\sigma_n^2$ given by (6). Since $\sum_{i \ge 1} 1/i^2 < \infty$, by upper and lower bounding the $n$th harmonic number $h_n$ by integrals of $1/x$ we have
$$\lim_{n \to \infty} \frac{h_n}{\log n} = 1 \quad \text{and therefore} \quad \lim_{n \to \infty} \frac{\sigma_n^2}{\log n} = 1. \qquad (15)$$
In view of (7) and (4) we note that in this case $W_n = \sum_{i=2}^n X_{i,n}$, as $X_1 = 1$ identically makes $X_{1,n} = 0$ for all $n$. Now by the linearity relation $(aX)^* =_d aX^*$ for all $a \ne 0$, which follows directly from (9), by (10) we have $X_{i,n}^* =_d U_i/\sigma_n$, where $U_i$ has distribution $\mathcal{U}[-1/i, 1 - 1/i]$, $i = 2, \ldots, n$. In particular, $|U_i| \le 1$ with probability one for all $i = 1, 2, \ldots$, and therefore
$$|X_{I_n,n}^*| \le 1/\sigma_n \to 0 \qquad (16)$$
by (15). Hence the small zero bias condition is satisfied, and Theorem 1.2 may be invoked to show that the number of cycles of a random permutation is asymptotically normal.
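Theorem 1.2 can be watched in action here: by (5), $K_n(\pi)$ has the law of a sum of independent Bernoulli$(1/i)$ variables, so $W_n$ can be sampled directly. The following simulation, our own illustration, compares its empirical distribution with $\Phi$.

```python
import math, random

# Compare the empirical distribution of W_n = (K_n - h_n)/sigma_n with Phi.
random.seed(3)
n, reps = 1000, 5000
h_n = sum(1.0 / i for i in range(1, n + 1))
sigma_n = math.sqrt(sum(1.0 / i - 1.0 / i**2 for i in range(1, n + 1)))

def sample_K():
    """Sample K_n(pi) as a sum of independent Bernoulli(1/i) variables, per (5)."""
    return sum(1 for i in range(1, n + 1) if random.random() < 1.0 / i)

ws = [(sample_K() - h_n) / sigma_n for _ in range(reps)]
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
for x in (-1.0, 0.0, 1.0):
    empirical = sum(w <= x for w in ws) / reps
    print(f"x = {x:+.1f}: empirical {empirical:.3f} vs Phi(x) = {Phi(x):.3f}")
```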
More generally, the small zero bias condition will hold for an array $\mathcal{X}_n$ with elements $X_{i,n} = X_i/\sigma_n$ whenever the independent mean zero summand variables $X_1, X_2, \ldots$ satisfy $|X_i| \le C$ with probability one for some constant $C$, and the variance $\sigma_n^2$ of their sum $S_n$ tends to infinity. In particular, from (9) one can verify that $|X_i| \le C$ with probability one implies $|X_i^*| \le C$ with probability one, and hence (16) holds with $C$ replacing 1. In such a case, the Lindeberg condition (8) is also not difficult to verify: for any $\epsilon > 0$ one has $C/\sigma_n < \epsilon$ for all $n$ sufficiently large, and for all such $n$ the indicators in (8), and with them $L_{n,\epsilon}$, vanish. The classical case of identically distributed summands with common finite variance, normalized say so that $\sigma_n = \sqrt{n}$, is handled just as easily: every $X_{i,n}^*$ has the distribution of $X_1^*/\sqrt{n}$, and hence for any $\epsilon > 0$,
$$\lim_{n \to \infty} P(|X_{I_n,n}^*| \ge \epsilon) = \lim_{n \to \infty} P(|X_1^*| \ge \epsilon \sqrt{n}) = 0, \quad \text{that is,} \quad X_{I_n,n}^* \to_p 0.$$
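Returning to the cycle-count array, the Lindeberg sum of (8) can even be computed exactly there, since each $X_{i,n}$ takes only two values; the computation below, our own, shows $L_{n,\epsilon}$ dropping to exactly zero essentially once $\sigma_n$ exceeds $1/\epsilon$, in line with the bound $|X_{i,n}| \le 1/\sigma_n$.

```python
import math

def lindeberg_sum(n, eps):
    """Exact L_{n,eps} of (8) for the array (7): each X_{i,n} equals
    (1 - 1/i)/sigma_n with probability 1/i and -1/(i*sigma_n) otherwise."""
    sigma = math.sqrt(sum(1.0 / i - 1.0 / i**2 for i in range(1, n + 1)))
    total = 0.0
    for i in range(1, n + 1):
        p = 1.0 / i
        for value, prob in (((1 - p) / sigma, p), (-p / sigma, 1 - p)):
            if abs(value) >= eps:
                total += prob * value**2
    return total

for n in (10, 100, 1000, 10000):
    print(n, lindeberg_sum(n, eps=0.5))
```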
It is easy to see, and well known, that the Lindeberg condition is not necessary for (2). In particular, consider the case where for all $n$ the first summand $X_{1,n}$ of $W_n$ has the mean zero normal distribution of $\sigma Z$ with variance $\sigma^2 \in (0,1)$, and the Lindeberg condition is satisfied for the remaining variables, that is, that the limit is zero when taking the sum in (8) over all $i \ne 1$. Since the sum of independent normal variables is again normal, $W_n$ will converge in distribution to $Z$, but (8) does not hold, since for all $\epsilon > 0$,
$$\lim_{n \to \infty} L_{n,\epsilon} = E\{X_{1,n}^2 \mathbf{1}(|X_{1,n}| \ge \epsilon)\} = \sigma^2 E\{Z^2 \mathbf{1}(\sigma|Z| \ge \epsilon)\} > 0.$$
Defining
$$m_n = \max_{1 \le i \le n} \sigma_{i,n}^2 \qquad (17)$$
to use for excluding such cases, we have the following partial converse to Theorem 1.2.

Theorem 1.3 If $\mathcal{X}_n$, $n = 1, 2, \ldots$, satisfies Condition 1.1 and
$$\lim_{n \to \infty} m_n = 0, \qquad (18)$$
then the small zero bias condition is necessary for $W_n \to_d Z$.

We prove Theorem 1.3 in Section 5 by showing that $W_n \to_d Z$ implies that $W_n^* \to_d Z$, and that (18) implies $X_{I_n,n} \to_p 0$. But then also
$$W_n + X_{I_n,n}^* = W_n^* + X_{I_n,n} \to_d Z,$$
and now $W_n \to_d Z$ and $W_n + X_{I_n,n}^* \to_d Z$ imply that $X_{I_n,n}^* \to_p 0$. These implications provide the probabilistic reason that the small zero bias condition, or Lindeberg condition, is necessary for normal convergence under (18).

Section 2 draws a parallel between the zero bias transformation and the one better known for size biasing, and there we consider its connection to the differential equation method of Stein using test functions. In Section 3 we prove the equivalence of the classical Lindeberg condition and the small zero bias condition and then, in Sections 4 and 5, its sufficiency and partial necessity for normal convergence.

Some pains have been taken to keep the treatment as elementary as possible, in particular by avoiding the use of characteristic functions. Though some technical argument is needed, only real functions are involved and the development remains at a level as basic as the material permits. To help keep the presentation self contained, two general type results appear in Section 6.