Statistical Science
1996, Vol. 11, No. 3, 189–228

Bootstrap Confidence Intervals

Thomas J. DiCiccio and Bradley Efron

Abstract. This article surveys bootstrap methods for producing good approximate confidence intervals. The goal is to improve by an order of magnitude upon the accuracy of the standard intervals θˆ ± z^(α)σˆ, in a way that allows routine application even to very complicated problems. Both theory and examples are used to show how this is done. The first seven sections provide a heuristic overview of four bootstrap confidence interval procedures: BCa, bootstrap-t, ABC and calibration. Sections 8 and 9 describe the theory behind these methods, and their close connection with the likelihood-based confidence interval theory developed by Barndorff-Nielsen, Cox and Reid and others.

Key words and phrases: Bootstrap-t, BCa and ABC methods, calibration, second-order accuracy.

Thomas J. DiCiccio is Associate Professor, Department of Social Statistics, 358 Ives Hall, Cornell University, Ithaca, New York 14853-3901 (e-mail: tjd9@cornell.edu). Bradley Efron is Professor, Department of Statistics and Department of Health Research and Policy, Stanford University, Stanford, California 94305-4065 (e-mail: brad@playfair.stanford.edu).

1. INTRODUCTION

Confidence intervals have become familiar friends in the applied statistician's collection of data-analytic tools. They combine point estimation and hypothesis testing into a single inferential statement of great intuitive appeal. Recent advances in statistical methodology allow the construction of highly accurate approximate confidence intervals, even for very complicated probability models and elaborate data structures. This article discusses bootstrap methods for constructing such intervals in a routine, automatic way.

Two distinct approaches have guided confidence interval construction since the 1930's. A small catalogue of exact intervals has been built up for special situations, like the ratio of normal means or a single binomial parameter. However, most confidence intervals are approximate, with by far the favorite approximation being the standard interval

(1.1)    θˆ ± z^(α)σˆ.

Here θˆ is a point estimate of the parameter of interest θ, σˆ is an estimate of θˆ's standard deviation, and z^(α) is the 100αth percentile of a normal deviate, z^(0.95) = 1.645 and so on. Often, and always in this paper, θˆ and σˆ are obtained by maximum likelihood theory.

The standard intervals, as implemented by maximum likelihood theory, are a remarkably useful tool. The method is completely automatic: the statistician inputs the data, the class of possible probability models and the parameter of interest; a computer algorithm outputs the intervals (1.1), with no further intervention required. This is in notable contrast to the construction of an exact interval, which requires clever thought on a problem-by-problem basis when it is possible at all.

The trouble with standard intervals is that they are based on an asymptotic approximation that can be quite inaccurate in practice. The example below illustrates what every applied statistician knows, that (1.1) can differ considerably from exact intervals in those cases where exact intervals exist. Over the years statisticians have developed tricks for improving (1.1), involving bias-corrections and parameter transformations. The bootstrap confidence intervals that we will discuss here can be thought of as automatic algorithms for carrying out these improvements without human intervention. Of course they apply as well to situations so complicated that they lie beyond the power of traditional analysis.

We begin with a simple example, where we can compare the bootstrap methods with an exact interval. Figure 1 shows the cd4 data: 20 HIV-positive subjects received an experimental antiviral drug; cd4 counts in hundreds were recorded for each subject at baseline and after one year of treatment, giving data, say, xi = (Bi, Ai) for i = 1, 2, ..., 20. The data is listed in Table 1. The two measurements are highly correlated, having sample correlation coefficient θˆ = 0.723.

Fig. 1. The cd4 data; cd4 counts in hundreds for 20 subjects, at baseline and after one year of treatment with an experimental anti-viral drug; numerical values appear in Table 1.

TABLE 1
The cd4 data, as plotted in Figure 1

Subject  Baseline  One year  |  Subject  Baseline  One year
   1       2.12      2.47    |    11       4.15      4.74
   2       4.35      4.61    |    12       3.56      3.29
   3       3.39      5.26    |    13       3.39      5.55
   4       2.51      3.02    |    14       1.88      2.82
   5       4.04      6.36    |    15       2.56      4.23
   6       5.10      5.93    |    16       2.96      3.23
   7       3.77      3.93    |    17       2.49      2.56
   8       3.35      4.09    |    18       3.03      4.31
   9       4.10      4.88    |    19       2.66      4.37
  10       3.35      3.81    |    20       3.00      2.40

What if we wish to construct a confidence interval for the true correlation θ? We can find an exact interval for θ if we are willing to assume bivariate normality for the (Bi, Ai) pairs,

(1.2)    (Bi, Ai) ~ i.i.d. N2(λ, Γ)    for i = 1, 2, ..., 20,

where λ and Γ are the unknown expectation vector and covariance matrix. The exact central 90% interval is

(1.3)    (θˆ_EXACT[0.05], θˆ_EXACT[0.95]) = (0.47, 0.86).

This notation emphasizes that a two-sided interval is intended to give correct coverage at both endpoints, two 0.05 noncoverage probabilities in this case, not just an overall 0.10 noncoverage probability.

The left panel of Table 2 shows the exact and standard intervals for the correlation coefficient of the cd4 data, assuming the normal model (1.2). Also shown are approximate confidence intervals based on three different (but closely related) bootstrap methods: ABC, BCa and bootstrap-t. The ABC and BCa methods match the exact interval to two decimal places, and all of the bootstrap intervals are more accurate than the standard. The examples and theory that follow are intended to show that this is no accident. The bootstrap methods make computer-based adjustments to the standard interval endpoints that are guaranteed to improve the coverage accuracy by an order of magnitude, at least asymptotically.

The exact interval endpoints [0.47, 0.86] are defined by the fact that they "cover" the observed value θˆ = 0.723 with the appropriate probabilities,

(1.4)    Prob_{θ=0.47}{θˆ > 0.723} = 0.05

and

(1.5)    Prob_{θ=0.86}{θˆ > 0.723} = 0.95.

Table 2 shows that the corresponding probabilities for the standard endpoints [0.55, 0.90] are 0.12 and 0.99. The standard interval is far too liberal at its lower endpoint and far too cautious at its upper endpoint. This kind of error is particularly pernicious if the confidence interval is used to test a parameter value of interest like θ = 0.

Table 2 describes the various confidence intervals in terms of their length and right–left asymmetry around the point estimate θˆ,

(1.6)    length = θˆ[0.95] − θˆ[0.05],    shape = (θˆ[0.95] − θˆ) / (θˆ − θˆ[0.05]).

The standard intervals always have shape equal to 1.00. It is in this way that they err most seriously. For example, the exact normal-theory interval for Corr has shape equal to 0.52, extending twice as far to the left of θˆ = 0.723 as to the right. The standard interval is much too optimistic about ruling out values of θ below θˆ, and much too pessimistic about ruling out values above θˆ. This kind of error is automatically identified and corrected by all the bootstrap confidence interval methods.

There is no compelling reason to assume bivariate normality for the data in Figure 1. A nonparametric version of (1.2) assumes that the pairs (Bi, Ai)
are a random sample ("i.i.d.") from some unknown bivariate distribution F,

(1.7)    (Bi, Ai) ~ i.i.d. F,    i = 1, 2, ..., n,

n = 20, without assuming that F belongs to any particular parametric family. Bootstrap-based confidence intervals such as ABC are available for nonparametric situations, as discussed in Section 6. In theory they enjoy the same second-order accuracy as in parametric problems. However, in some nonparametric confidence interval problems that have been examined carefully, the small-sample advantages of the bootstrap methods have been less striking than in parametric situations. Methods that give third-order accuracy, like the bootstrap calibration of an ABC interval, seem to be more worthwhile in the nonparametric framework (see Section 6).

TABLE 2
Exact and approximate confidence intervals for the correlation coefficient, cd4 data; θˆ = 0.723. The bootstrap methods ABC, BCa, bootstrap-t and calibrated ABC are explained in Sections 2–7; the ABC and BCa intervals are close to exact in the normal-theory situation (left panel); the standard interval errs badly at both endpoints, as can be seen from the coverage probabilities in the bottom rows

                        Normal theory                    |                  Nonparametric
        Exact   ABC    BCa   Bootstrap-t  Standard  |  ABC    BCa   Bootstrap-t  Calibrated  Standard
0.05     0.47   0.47   0.47     0.45        0.55    |  0.56   0.55     0.51        0.56        0.59
0.95     0.86   0.86   0.86     0.87        0.90    |  0.83   0.85     0.86        0.83        0.85
Length   0.39   0.39   0.39     0.42        0.35    |  0.27   0.30     0.35        0.27        0.26
Shape    0.52   0.52   0.54     0.52        1.00    |  0.67   0.70     0.63        0.67        1.00
Cov 05   0.05   0.05   0.05     0.04        0.12
Cov 95   0.95   0.95   0.95     0.97        0.99

In most problems and for most parameters there will not exist exact confidence intervals. This great gray area has been the province of the standard intervals for at least 70 years. Bootstrap confidence intervals provide a better approximation to exactness in most situations.
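Resampling under model (1.7) simply means drawing n pairs with replacement from the observed pairs. A minimal sketch of this nonparametric bootstrap for the correlation coefficient of Table 1; B = 2,000 replications follows the article, while the seed is an arbitrary choice of ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# cd4 data from Table 1
B = np.array([2.12, 4.35, 3.39, 2.51, 4.04, 5.10, 3.77, 3.35, 4.10, 3.35,
              4.15, 3.56, 3.39, 1.88, 2.56, 2.96, 2.49, 3.03, 2.66, 3.00])
A = np.array([2.47, 4.61, 5.26, 3.02, 6.36, 5.93, 3.93, 4.09, 4.88, 3.81,
              4.74, 3.29, 5.55, 2.82, 4.23, 3.23, 2.56, 4.31, 4.37, 2.40])
n, n_boot = len(B), 2000

theta_star = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)        # resample the pairs, keeping (B, A) together
    theta_star[b] = np.corrcoef(B[idx], A[idx])[0, 1]

# Crude percentile summary of the bootstrap distribution
lo, hi = np.quantile(theta_star, [0.05, 0.95])
```

The percentile summary is only a starting point; the BCa and ABC corrections of the following sections adjust these percentiles.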
Table 3 refers to the parameter θ defined as the maximum eigenvalue of the covariance matrix of (B, A) in the cd4 experiment,

(1.8)    θ = maximum eigenvalue {cov(B, A)}.

The maximum likelihood estimate (MLE) of θ, assuming either model (1.2) or (1.7), is θˆ = 1.68. The bootstrap intervals extend further to the right than to the left of θˆ in this case, more than 2.5 times as far under the normal model. Even though we have no exact endpoint to serve as a "gold standard" here, the theory that follows strongly suggests the superiority of the bootstrap intervals. Bootstrapping involves much more computation than the standard intervals, on the order of 1,000 times more, but the algorithms are completely automatic, requiring no more thought for the maximum eigenvalue than for the correlation coefficient, or for any other parameter.

One of the achievements of the theory discussed in Section 8 is to provide a reasonable theoretical gold standard for approximate confidence intervals. Comparison with this gold standard shows that the bootstrap intervals are not only asymptotically more accurate than the standard intervals, they are also more correct. "Accuracy" refers to coverage errors: a one-sided bootstrap interval of intended coverage α actually covers θ with probability α + O(1/n), where n is the sample size. This is second-order accuracy, compared to the slower first-order accuracy of the standard intervals, with coverage probabilities α + O(1/√n). However, confidence intervals are supposed to be inferentially correct as well as accurate. Correctness is a harder property to pin down, but it is easy to give examples of incorrectness: if x1, x2, ..., xn is a random sample from a normal distribution N(θ, 1), then (min(xi), max(xi)) is an exactly accurate two-sided confidence interval for θ of coverage probability 1 − 1/2^(n−1), but it is incorrect. The theory of Section 8 shows that all of our better confidence intervals are second-order correct as well as second-order accurate.
We can see this improvement over the standard intervals on the left side of Table 2. The theory says that this improvement exists also in those cases, like Table 3, where we cannot see it directly.

2. THE BCa INTERVALS

The next six sections give a heuristic overview of bootstrap confidence intervals. More examples are presented, showing how bootstrap intervals can be routinely constructed even in very complicated and messy situations. Section 8 derives the second-order properties of the bootstrap intervals in terms of asymptotic expansions. Comparisons with likelihood-based methods are made in Section 9. The bootstrap can be thought of as a convenient way of executing the likelihood calculations in parametric exponential family situations and even in nonparametric problems.

TABLE 3
Approximate 90% central confidence intervals for the maximum eigenvalue parameter (1.8), cd4 data; the bootstrap intervals extend much further to the right of the MLE θˆ = 1.68 than to the left

            Normal theory          |            Nonparametric
         ABC    BCa   Standard  |  ABC    BCa   Calibrated  Standard
0.05     1.11   1.10    0.80    |  1.15   1.14     1.16       1.01
0.95     3.25   3.18    2.55    |  2.56   2.55     3.08       2.35
Length   2.13   2.08    1.74    |  1.42   1.41     1.92       1.34
Shape    2.80   2.62    1.00    |  1.70   1.64     2.73       1.00

The bootstrap was introduced as a nonparametric device for estimating standard errors and biases. Confidence intervals are inherently more delicate inference tools. A considerable amount of effort has gone into upgrading bootstrap methods to the level of precision required for confidence intervals.

The BCa method is an automatic algorithm for producing highly accurate confidence limits from a bootstrap distribution. Its effectiveness was demonstrated in Table 2. References include Efron (1987), Hall (1988), DiCiccio (1984), DiCiccio and Romano (1995) and Efron and Tibshirani (1993). A program written in the language S is available [see the note in the second paragraph following (4.14)].

The goal of bootstrap confidence interval theory is to calculate dependable confidence limits for a parameter of interest θ from the bootstrap distribution of θˆ. Figure 2 shows two such bootstrap distributions relating to the maximum eigenvalue parameter θ for the cd4 data, (1.8). The nonparametric bootstrap distribution (on the right) will be discussed in Section 6.

The left panel is the histogram of 2,000 normal-theory bootstrap replications of θˆ. Each replication was obtained by drawing a bootstrap data set analogous to (1.2),

(2.1)    (B*i, A*i) ~ i.i.d. N2(λˆ, Γˆ),    i = 1, 2, ..., 20,

and then computing θˆ*, the maximum likelihood estimate (MLE) of θ based on the bootstrap data.
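A sketch of this parametric bootstrap: fit λˆ and Γˆ by maximum likelihood, draw 20 fresh pairs from N2(λˆ, Γˆ) as in (2.1), and take the maximum eigenvalue of each resample's MLE covariance matrix. The seed is arbitrary, and exact summaries will differ slightly from run to run and from the article's figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# cd4 data from Table 1
B = np.array([2.12, 4.35, 3.39, 2.51, 4.04, 5.10, 3.77, 3.35, 4.10, 3.35,
              4.15, 3.56, 3.39, 1.88, 2.56, 2.96, 2.49, 3.03, 2.66, 3.00])
A = np.array([2.47, 4.61, 5.26, 3.02, 6.36, 5.93, 3.93, 4.09, 4.88, 3.81,
              4.74, 3.29, 5.55, 2.82, 4.23, 3.23, 2.56, 4.31, 4.37, 2.40])
X = np.column_stack([B, A])
n, n_boot = len(X), 2000

lam_hat = X.mean(axis=0)                         # MLE mean vector
Gamma_hat = np.cov(X, rowvar=False, bias=True)   # MLE covariance (divisor n)
theta_hat = np.linalg.eigvalsh(Gamma_hat).max()  # ~1.68

theta_star = np.empty(n_boot)
for b in range(n_boot):
    Xs = rng.multivariate_normal(lam_hat, Gamma_hat, size=n)   # (2.1)
    Gs = np.cov(Xs, rowvar=False, bias=True)
    theta_star[b] = np.linalg.eigvalsh(Gs).max()               # replication of theta_hat

se_boot = theta_star.std()                # bootstrap standard error, ~0.5 here
bias = theta_star.mean() - theta_hat      # slightly negative: downward bias
```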
In other words, θˆ* was the maximum eigenvalue of the empirical covariance matrix of the 20 pairs (B*i, A*i). The mean vector λˆ and covariance matrix Γˆ in (2.1) were the usual maximum likelihood estimates for λ and Γ, based on the original data in Figure 1.

Relation (2.1) is a parametric bootstrap sample, obtained by sampling from a parametric MLE for the unknown distribution F. Section 6 discusses nonparametric bootstrap samples and confidence intervals.

The 2,000 bootstrap replications θˆ* had standard deviation 0.52. This is the bootstrap estimate of standard error for θˆ, generally a more dependable standard error estimate than the usual parametric delta-method value (see Efron, 1981). The mean of the 2,000 values was 1.61, compared to θˆ = 1.68, indicating a small downward bias in the Maxeig statistic. In this case it is easy to see that the downward bias comes from dividing by n instead of n − 1 in obtaining the MLE Γˆ of the covariance matrix.

Two thousand bootstrap replications is 10 times too many for estimating a standard error, but not too many for the more delicate task of setting confidence intervals. These bootstrap sample size calculations appear in Efron (1987, Section 9).

The BCa procedure is a method of setting approximate confidence intervals for θ from the percentiles of the bootstrap histogram. Suppose θ is a parameter of interest; θˆ(x) is an estimate of θ based on the observed data x; and θˆ* = θˆ(x*) is a bootstrap replication of θˆ obtained by resampling x* from an estimate of the distribution governing x. Let Gˆ(c) be the cumulative distribution function (c.d.f.) of B bootstrap replications θˆ*(b),

(2.2)    Gˆ(c) = #{θˆ*(b) < c} / B.

In our case B = 2,000. The upper endpoint θˆ_BCa[α] of a one-sided level-α BCa interval, θ ∈ (−∞, θˆ_BCa[α]), is defined in terms of Gˆ and two numerical parameters discussed below: the bias-correction z0 and the acceleration a (BCa stands for "bias-corrected and accelerated").
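The downward bias traced to the divisor is transparent in code: the MLE Γˆ equals (n − 1)/n times the unbiased covariance estimate, so every eigenvalue, including the maximum, is shrunk by exactly that factor. A quick illustration, using a small illustrative covariance-like matrix of our own choosing:

```python
import numpy as np

n = 20
S = np.array([[0.66, 0.68],      # any symmetric positive-definite matrix
              [0.68, 1.35]])     # (illustrative values, not fitted to data)
Gamma = (n - 1) / n * S          # the MLE divides by n instead of n - 1

# Eigenvalues of a scaled matrix scale by the same constant, so the maximum
# eigenvalue is pulled down by exactly (n - 1)/n = 0.95 relative to the
# unbiased-covariance version.
ratio = np.linalg.eigvalsh(Gamma).max() / np.linalg.eigvalsh(S).max()
```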
By definition the BCa endpoint is

(2.3)    θˆ_BCa[α] = Gˆ⁻¹( Φ( z0 + (z0 + z^(α)) / (1 − a(z0 + z^(α))) ) ).

Here Φ is the standard normal c.d.f., with z^(α) = Φ⁻¹(α) as before. The central 0.90 BCa interval is given by (θˆ_BCa[0.05], θˆ_BCa[0.95]). Formula (2.3) looks strange, but it is well motivated by the transformation and asymptotic arguments that follow.
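Given the replications, formula (2.3) is only a few lines of code; Gˆ⁻¹ is a percentile of the bootstrap histogram. A sketch, with z0 and a supplied by the caller (their estimation is discussed below and in Section 4):

```python
import numpy as np
from scipy.stats import norm

def bca_endpoint(theta_star, alpha, z0, a):
    """Upper endpoint of the one-sided level-alpha BCa interval, formula (2.3)."""
    z_alpha = norm.ppf(alpha)
    adj = z0 + (z0 + z_alpha) / (1 - a * (z0 + z_alpha))
    return np.quantile(theta_star, norm.cdf(adj))   # G-hat-inverse of Phi(adj)

# With z0 = a = 0 the BCa endpoint reduces to the plain bootstrap percentile:
reps = np.linspace(0.0, 1.0, 2001)
endpoint = bca_endpoint(reps, 0.95, 0.0, 0.0)       # equals the 95th percentile
```

A positive z0 or a shifts the endpoint to the right, which is exactly the correction needed for the long-right-tailed histogram of Figure 2.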
Fig. 2. Bootstrap distributions for the maximum eigenvalue of the covariance matrix, cd4 data: (left) 2,000 parametric bootstrap replications assuming a bivariate normal distribution; (right) 2,000 nonparametric bootstrap replications, discussed in Section 6. The solid lines indicate the limits of the BCa 0.90 central confidence intervals, compared to the standard intervals (dashed lines).

If a and z0 are zero, then θ̂_BCa[α] = Ĝ⁻¹(α), the 100αth percentile of the bootstrap replications. In this case the 0.90 BCa interval is the interval between the 5th and 95th percentiles of the bootstrap replications. If in addition Ĝ is perfectly normal, then θ̂_BCa[α] = θ̂ + z^(α)σ̂, the standard interval endpoint. In general, (2.3) makes three distinct corrections to the standard intervals, improving their coverage accuracy from first to second order.

The c.d.f. Ĝ is markedly long-tailed to the right, on the normal-theory side of Figure 2. Also a and z0 are both estimated to be positive, (â, ẑ0) = (0.105, 0.226), further shifting θ̂_BCa[α] to the right of θ̂_STAN[α] = θ̂ + z^(α)σ̂. The 0.90 BCa interval for θ is

(2.4) (Ĝ⁻¹(0.157), Ĝ⁻¹(0.995)) = (1.10, 3.18),

compared to the standard interval (0.80, 2.55).

The following argument motivates the BCa definition (2.3), as well as the parameters a and z0. Suppose that there exists a monotone increasing transformation φ = m(θ) such that φ̂ = m(θ̂) is normally distributed for every choice of θ, but possibly with a bias and a nonconstant variance,

(2.5) φ̂ ~ N(φ − z0σ_φ, σ_φ²), σ_φ = 1 + aφ.

Then (2.3) gives exactly accurate and correct confidence limits for θ having observed θ̂. The argument in Section 3 of Efron (1987) shows that in situation (2.5) there is another monotone transformation, say ξ = M(θ) and ξ̂ = M(θ̂), such that ξ̂ = ξ + W for all values of ξ, with W always having the same distribution.
This is a translation problem, so we know how to set confidence limits ξ̂[α] for ξ,

(2.6) ξ̂[α] = ξ̂ − W^(1−α),

where W^(1−α) is the 100(1−α)th percentile of W. The BCa interval (2.3) is exactly equivalent to the translation interval (2.6), and in this sense it is correct as well as accurate.

The bias-correction constant z0 is easy to interpret in (2.5) since

(2.7) Prob{φ̂ < φ} = Φ(z0).

Then Prob{θ̂ < θ} = Φ(z0) because of monotonicity. The BCa algorithm, in its simplest form, estimates z0 by

(2.8) ẑ0 = Φ⁻¹( #{θ̂*(b) < θ̂} / B ),

Φ⁻¹ of the proportion of the bootstrap replications less than θ̂. Of the 2,000 normal-theory bootstrap replications θ̂* shown in the left panel of Figure 2, 1,179 were less than θ̂ = 1.68. This gave ẑ0 = Φ⁻¹(0.590) = 0.226, a positive bias correction since θ̂* is biased downward relative to θ̂. An often more accurate method of estimating z0 is described in Section 4.

The acceleration a in (2.5) measures how quickly the standard error is changing on the normalized scale. The value â = 0.105 in (2.4), obtained from
formula (4.9) of Section 4, is moderately large. Suppose we think we have moved 1.645 standard errors to the right of φ̂, to

φ̃ = φ̂ + 1.645σ_φ̂.

Actually though, with a = 0.105,

σ_φ̃ = (1 + 1.645a)σ_φ̂ = 1.173σ_φ̂,

according to (2.5). For calculating a confidence level, φ̃ is really only 1.645/1.173 = 1.40 standard errors to the right of φ̂, considerably less than 1.645. Formula (2.3) automatically corrects for an accelerating standard error. The next section gives a geometrical interpretation of a, and also of the BCa formula (2.3).

The peculiar-looking formula (2.3) for the BCa endpoints is designed to give exactly the right answer in situation (2.5), and to give it automatically in terms of the bootstrap distribution of θ̂*. Notice, for instance, that the normalizing transformation φ̂ = m(θ̂) is not required in (2.3). By comparison, the standard interval works perfectly only under the more restrictive assumption that

(2.9) θ̂ ~ N(θ, σ²),

with σ² constant. In practice we do not expect either (2.9) or (2.5) to hold exactly, but the broader assumptions (2.5) are likely to be a better approximation to the truth. They produce intervals that are an order of magnitude more accurate, as shown in Section 8.

Formula (2.5) generalizes (2.9) in three ways, by allowing bias, nonconstant standard error and a normalizing transformation. These three extensions are necessary and sufficient to give second-order accuracy,

(2.10) Prob{θ < θ̂_BCa[α]} = α + O(1/n),

where n is the sample size in an i.i.d. sampling situation. This result is stated more carefully in Section 8, which also shows the second-order correctness of the BCa intervals. Hall (1988) was the first to establish (2.10).

The BCa intervals are transformation invariant. If we change the parameter of interest from θ to some monotone function of θ, φ = m(θ), likewise changing θ̂ to φ̂ = m(θ̂) and θ̂* to φ̂* = m(θ̂*), then the α-level BCa endpoints change in the same way,

(2.11) φ̂_BCa[α] = m(θ̂_BCa[α]).

The standard intervals are not transformation invariant, and this accounts for some of their practical difficulties. It is well known, for instance, that normal-theory standard intervals for the correlation coefficient are much more accurate if constructed on the scale ξ = tanh⁻¹(θ̂) and then transformed back to give an interval for θ itself. Transformation invariance means that the BCa intervals cannot be fooled by a bad choice of scale. To put it another way, the statistician does not have to search for a transformation like tanh⁻¹ in applying the BCa method.

In summary, BCa produces confidence intervals for θ from the bootstrap distribution of θ̂*, requiring on the order of 2,000 bootstrap replications. These intervals are transformation invariant and exactly correct under the normal transformation model (2.5); in general they are second-order accurate and correct.

3. THE ACCELERATION a

The acceleration parameter a appearing in the BCa formula (2.3) looks mysterious. Its definition in (2.5) involves an idealized transformation to normality which will not be known in practice. Fortunately a enjoys a simple relationship with Fisher's score function which makes it easy to estimate. This section describes the relationship in the context of one-parameter families. In doing so it also allows us better motivation for the peculiar-looking BCa formula (2.3).

Suppose then that we have a one-parameter family of c.d.f.'s G_θ(θ̂) on the real line, with θ̂ being an estimate of θ. In the relationships below we assume that θ̂ behaves asymptotically like a maximum likelihood estimator, with respect to a notional sample size n, as made explicit in (5.3) of Efron (1987). As a particular example, we will consider the case

(3.1) θ̂ ~ θ · Gamma_n/n, n = 10,

Gamma_n denoting a standard gamma variable with n degrees of freedom.

Having observed θ̂, we wonder with what confidence we can reject a trial value θ0 of the parameter θ. In the gamma example (3.1) we might have

(3.2) θ̂ = 1 and θ0 = 1.5.

The easy answer from the bootstrap point of view is given in terms of the bootstrap c.d.f. Ĝ(c) = G_θ̂(c). We can define the bootstrap confidence value to be

(3.3) α̃ = Ĝ(θ0) = G_θ̂(θ0).

However, this will usually not agree with the more familiar hypothesis-testing confidence level for a one-parameter problem, say

(3.4) α̂ = 1 − G_θ0(θ̂),
the probability under θ0 of getting a less extreme observation than θ̂. (For convenience these definitions assume θ̂ < θ0.) In the case of (3.1)–(3.2) we have α̃ = 0.930 while α̂ = 0.863.

The BCa formula (2.3) amounts to a rule for converting bootstrap confidence values α̃ into hypothesis-testing confidence levels α̂. This becomes crucial as soon as we try to use the bootstrap on problems more complicated than one-parameter families. Define

(3.5) z̃ = Φ⁻¹(α̃) and ẑ = Φ⁻¹(α̂).

For a given value of θ0 and α̂ above, let α = α̂ and θ̂_BCa[α] = θ0 in (2.3). If (2.3) works perfectly, then we have

(3.6) Φ⁻¹Ĝ(θ0) = z̃ = z0 + (z0 + ẑ)/(1 − a(z0 + ẑ)),

or

(3.7) ẑ = (z̃ − z0)/(1 + a(z̃ − z0)) − z0.

The fact that the BCa intervals are second-order accurate implies that the conversion formula (3.7) itself must be quite accurate.

To use (3.7), or (2.3), we first must estimate the two parameters z0 and a. The bias-correction z0 is estimated by

(3.8) ẑ0 = Φ⁻¹Ĝ(θ̂) = Φ⁻¹G_θ̂(θ̂),

as in (2.8). The acceleration a is estimated in terms of the skewness of the score function

(3.9) ℓ̇_θ(θ̂) = (∂/∂θ) log g_θ(θ̂),

where g_θ(θ̂) is the density ∂G_θ(θ̂)/∂θ̂. Section 10 of Efron (1987) shows that one-sixth the skewness of ℓ̇_θ(θ̂) evaluated at θ = θ̂,

(3.10) â = SKEW_{θ=θ̂}(ℓ̇_θ(θ̂))/6,

is an excellent estimate of a. Both z0 and a are of order O(1/√n), with the estimates ẑ0 and â erring by O(1/n). For the gamma problem (3.1) it is easy to calculate that

(3.11) ẑ0 = 0.106 and â = 0.105.

If θ̂ is the MLE in a one-parameter family (but not in general), then ẑ0 and â are nearly the same, as is the case here. The usable form of (3.7) is

(3.12) ẑ = (z̃ − ẑ0)/(1 + â(z̃ − ẑ0)) − ẑ0.

We can list three important properties of the (z̃, ẑ) curve (3.12) near z̃ = ẑ0:

(3.13) (z̃, ẑ) = (ẑ0, −ẑ0) at z̃ = ẑ0;

(3.14) dẑ/dz̃ = 1 at z̃ = ẑ0;

(3.15) d²ẑ/dz̃² = −2â at z̃ = ẑ0.

The last of these relationships is of special interest here.
It says that the curvature of the (z̃, ẑ) curve at z̃ = ẑ0 is directly proportional to the acceleration â.

In any given one-parameter problem we can find the actual (z̃, ẑ) curve, at least in theory. This is obtained by keeping θ̂ fixed and varying the trial point θ0 in (3.3)–(3.5). Figure 3 shows the (z̃, ẑ) curve for the gamma problem, with θ̂ any fixed value, say θ̂ = 1. In this case the BCa approximation formula (3.12) matches the actual (z̃, ẑ) curve to three decimal places over most of the range of the graph. At θ̂ = 1, θ0 = 1.5 for example, ẑ equals 1.092 both actually and from (3.12).

The fact that the BCa formula (2.3) is second-order accurate implies that the conversion formula (3.12) errs only by O(1/n). This means that relationships (3.13)–(3.15) must have the same order of accuracy, even in quite general problems. In particular, the curvature of the actual (z̃, ẑ) plot, if it were possible to compute it, would nearly equal −2â, with â given by the skewness definition (3.10).

None of this is special to one-parameter families except for the skewness definition (3.10), which does not allow for nuisance parameters.

Fig. 3. Plot of ẑ versus z̃ in the gamma problem (3.1); the BCa approximation (3.12), or (2.3), matches the actual curve to three decimal places. The central curvature of the (z̃, ẑ) plot is proportional to the acceleration â.
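The gamma-problem numbers quoted above (α̃ = 0.930, α̂ = 0.863 and ẑ = 1.092) can be verified directly. The sketch below is our own; it uses the fact that for integer shape n the gamma c.d.f. reduces to a Poisson tail sum:

```python
from math import exp, factorial
from statistics import NormalDist

def gamma_cdf(x, n=10):
    """P(Gamma_n <= x) for integer shape n, via P(Poisson(x) >= n)."""
    return 1.0 - sum(exp(-x) * x**i / factorial(i) for i in range(n))

n, theta_hat, theta0 = 10, 1.0, 1.5
# Under (3.1), theta-hat ~ theta * Gamma_n / n, so G_theta(c) = P(Gamma_n <= n*c/theta).
alpha_tilde = gamma_cdf(n * theta0 / theta_hat)      # (3.3): bootstrap confidence value, ~0.930
alpha_hat = 1.0 - gamma_cdf(n * theta_hat / theta0)  # (3.4): hypothesis-testing level, ~0.863
nd = NormalDist()
z_tilde, z_hat = nd.inv_cdf(alpha_tilde), nd.inv_cdf(alpha_hat)
# Conversion formula (3.12) with z0-hat = 0.106, a-hat = 0.105 nearly reproduces z-hat:
z0, a = 0.106, 0.105
z_312 = (z_tilde - z0) / (1 + a * (z_tilde - z0)) - z0   # ~1.092
```

Running this reproduces the three-decimal agreement between the actual ẑ and the (3.12) approximation that Figure 3 illustrates.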
The next section shows how to extend the skewness definition of â to multiparameter situations. This gives an estimate that is easy to evaluate, especially in exponential families, and that behaves well in practice. In fact a is usually easier to estimate than z0, despite the latter's simpler definition.

4. THE ABC METHOD

We now leave one-parameter families and return to the more complicated situations that bootstrap methods are intended to deal with. In many such situations it is possible to approximate the BCa interval endpoints analytically, entirely dispensing with Monte Carlo simulations. This reduces the computational burden by an enormous factor, and also makes it easier to understand how BCa improves upon the standard intervals. The ABC method ("ABC" standing for approximate bootstrap confidence intervals) is an analytic version of BCa applying to smoothly defined parameters in exponential families. It also applies to smoothly defined nonparametric problems, as shown in Section 6. DiCiccio and Efron (1992) introduced the ABC method, which is also discussed in Efron and Tibshirani (1993).

The BCa endpoints (2.3) depend on the bootstrap c.d.f. Ĝ and estimates of the two parameters a and z0. The ABC method requires one further estimate, of the nonlinearity parameter cq, but it does not involve Ĝ.

The standard interval (1.1) depends only on the two quantities (θ̂, σ̂). The ABC intervals depend on the five quantities (θ̂, σ̂, â, ẑ0, ĉq). Each of the three extra numbers (â, ẑ0, ĉq) corrects a deficiency of the standard method, making the ABC intervals second-order accurate as well as second-order correct.

The ABC system applies within multiparameter exponential families, which are briefly reviewed below.
This framework includes most familiar parametric situations: normal, binomial, Poisson, gamma, multinomial, ANOVA, logistic regression, contingency tables, log-linear models, multivariate normal problems, Markov chains and also nonparametric situations as discussed in Section 6. The density function for a p-parameter exponential family can be written as

(4.1) g_µ(x) = exp[η′y − ψ(η)],

where x is the observed data and y = Y(x) is a p-dimensional vector of sufficient statistics; η is the p-dimensional natural parameter vector; µ is the expectation parameter µ = E_µ{y}; and ψ(η), the cumulant generating function, is a normalizing factor that makes g_µ(x) integrate to 1.

The vectors µ and η are in one-to-one correspondence so that either can be used to index functions of interest. In (4.1), for example, we used µ to index the densities g, but η to index ψ. The ABC algorithm involves the mapping from η to µ, say

(4.2) µ = mu(η),

which, fortunately, has a simple form in all of the common exponential families. Section 3 of DiCiccio and Efron (1992) gives function (4.2) for several families, as well as specifying the other inputs necessary for using the ABC algorithm.

The MLE of µ in (4.1) is µ̂ = y, so that the MLE of a real-valued parameter of interest θ = t(µ) is

(4.3) θ̂ = t(µ̂) = t(y).

As an example consider the bivariate normal model (1.2). Here x = ((B1, A1), (B2, A2), …, (B20, A20)) and y = Σ_{i=1}^{20} (B_i, A_i, B_i², B_iA_i, A_i²)′/20. The bivariate normal is a five-parameter exponential family with

(4.4) µ = (λ1, λ2, λ1² + Γ11, λ1λ2 + Γ12, λ2² + Γ22)′.

Thus the correlation coefficient is the function t(µ) given by

(4.5) θ = (µ4 − µ1µ2) / [(µ3 − µ1²)(µ5 − µ2²)]^{1/2};

θ̂ = t(µ̂) is seen to be the usual sample correlation coefficient.

We denote the p × p covariance matrix of y by Σ(µ) = cov_µ{y}, and let Σ̂ = Σ(µ̂), the MLE of Σ. The delta-method estimate of standard error for θ̂ = t(µ̂) depends on Σ̂.
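As a quick check on (4.4)–(4.5), the sketch below (our own illustration, with simulated data) builds µ̂ = y from the five sufficient-statistic averages and confirms that t(µ̂) equals the usual sample correlation coefficient:

```python
import math
import random

def t(mu):
    """The correlation parameter (4.5) as a function of the expectation vector mu."""
    m1, m2, m3, m4, m5 = mu
    return (m4 - m1 * m2) / math.sqrt((m3 - m1**2) * (m5 - m2**2))

random.seed(1)
pairs = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(20)]
n = len(pairs)
# mu-hat = y: averages of the five sufficient statistics (B, A, B^2, BA, A^2)
mu_hat = (sum(b for b, a in pairs) / n,
          sum(a for b, a in pairs) / n,
          sum(b * b for b, a in pairs) / n,
          sum(b * a for b, a in pairs) / n,
          sum(a * a for b, a in pairs) / n)
theta_hat = t(mu_hat)  # equals the sample correlation coefficient
```

The factors of 1/n cancel in numerator and denominator of (4.5), which is why dividing by n rather than n − 1 leaves the correlation itself unchanged.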
Let ṫ denote the gradient vector of θ = t(µ) at µ = µ̂,

(4.6) ṫ = (…, ∂t/∂µ_i, …)′ evaluated at µ = µ̂.

Then

(4.7) σ̂ = (ṫ′Σ̂ṫ)^{1/2}

is the parametric delta-method estimate of standard error, and it is also the usual Fisher information standard error estimate. The σ̂ values for the standard intervals in Tables 2 and 3 were found by numerical differentiation, using

(4.8) ∂t/∂µ_i ≐ [t(µ̂ + εe_i) − t(µ̂ − εe_i)] / (2ε)

for a small value of ε, with e_i the ith coordinate vector. The covariance matrix Σ̂ is simple to calculate in most of the familiar examples, as shown in
DiCiccio and Efron (1992, Section 3), giving σ̂ from (4.7). This assumes that t(µ) is differentiable. In fact we need t(µ) to be twice differentiable in order to carry out the ABC computations.

The ABC algorithm begins by computing σ̂ from (4.7)–(4.8). Then the parameters (a, z0, cq) are estimated by computing p + 2 numerical second derivatives. The first of these is

(4.9) â = (∂²/∂ε²) ṫ′mu(η̂ + εṫ)|_{ε=0} / (6σ̂³),

where η̂ is the MLE of the natural parameter vector η. This turns out to be the same as the skewness definition of â, (3.10), in the one-parameter family obtained from Stein's least favorable family construction [see Efron, 1987, (6.7)]. Formula (4.9) uses exponential family relationships to compute the skewness from a second derivative.

The second ABC numerical derivative is

(4.10) ĉq = (∂²/∂ε²) t(µ̂ + εΣ̂ṫ/σ̂)|_{ε=0} / (2σ̂);

ĉq measures how nonlinear the parameter of interest θ is, as a function of µ.

The final p second derivatives are required for the bias-correction parameter z0. The parametric delta-method estimate of bias for θ̂ = t(µ̂) can be expressed as

(4.11) b̂ = (1/2) Σ_{i=1}^{p} (∂²/∂ε²) t(µ̂ + εd_i^{1/2}γ_i)|_{ε=0},

where d_i is the ith eigenvalue and γ_i is the ith eigenvector of Σ̂. Then

(4.12) ẑ0 = Φ⁻¹(2Φ(â)Φ(ĉq − b̂/σ̂)) ≐ â + ĉq − b̂/σ̂.

This involves terms other than b̂ because z0 relates to median bias. For the kind of smooth exponential family problems considered here, (4.12) is usually more accurate than the direct estimate (2.8).

The simplest form of the ABC intervals, called ABCquadratic or ABCq, gives the α-level endpoint directly as a function of the five numbers (θ̂, σ̂, â, ẑ0, ĉq):

(4.13) α → w ≡ ẑ0 + z^(α) → λ ≡ w/(1 − âw)² → ξ ≡ λ + ĉqλ² → θ̂_ABCq[α] = θ̂ + σ̂ξ.

The original ABC endpoint, denoted θ̂_ABC[α], requires one more recomputation of the function t(·):

(4.14) α → w = ẑ0 + z^(α) → λ = w/(1 − âw)² → θ̂_ABC[α] = t(µ̂ + λΣ̂ṫ/σ̂).

Notice that ĉq is still required here, to estimate ẑ0 in (4.12).
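The ABCq chain (4.13) is simple enough to state as code. The sketch below is our own, with Φ⁻¹ taken from the standard library:

```python
from math import sqrt
from statistics import NormalDist

def abcq_endpoint(theta_hat, sigma_hat, a, z0, cq, alpha):
    """ABCq endpoint (4.13): alpha -> w -> lambda -> xi -> theta-hat + sigma-hat * xi."""
    w = z0 + NormalDist().inv_cdf(alpha)   # w = z0-hat + z^(alpha)
    lam = w / (1 - a * w) ** 2             # lambda = w / (1 - a*w)^2
    xi = lam + cq * lam ** 2               # xi = lambda + cq * lambda^2
    return theta_hat + sigma_hat * xi
```

For instance, with (θ̂, σ̂, â, ẑ0, ĉq) = (7, √7, 1/(6√7), 1/(6√7), 0), the Poisson case treated below (there ĉq = 0 and t is linear, so ABC and ABCq coincide), the 0.90 interval comes out near (3.54, 12.67).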
Formula (4.14) is the one used in Tables 2 and 3. It has the advantage of being transformation invariant, (2.11), and is sometimes more accurate than (4.13). However, (4.13) is local, all of the recomputations of $t(\mu)$ involved in (4.8)–(4.13) taking place infinitesimally near $\hat\mu = y$. In this sense ABCq is like the standard method. Nonlocality occasionally causes computational difficulties with boundary violations. In fact (4.13) is a simple quadratic approximation to (4.14), so ABC and ABCq usually agree reasonably well.

The main point of this article is that highly accurate approximate confidence intervals can now be calculated on a routine basis. The ABC intervals are implemented by a short computer algorithm. [The ABC intervals in Tables 2 and 3 were produced by the parametric and nonparametric ABC algorithms "abcpar" and "abcnon." These and the BCa program are available in the language S: send electronic mail to statlib@lib.stat.cmu.edu with the one-line message: send bootstrap.funs from S.] There are five inputs to the algorithm: $\hat\mu$, $\hat\Sigma$, $\hat\eta$ and the functions $t(\cdot)$ and $\mu(\cdot)$. The outputs include $\hat\theta_{\mathrm{STAN}}[\alpha]$, $\hat\theta_{\mathrm{ABC}}[\alpha]$ and $\hat\theta_{\mathrm{ABCq}}[\alpha]$. Computational effort for the ABC intervals is two or three times that required for the standard intervals.

The ABC intervals can be useful even in very simple situations. Suppose that the data consist of a single observation $x$ from a Poisson distribution with unknown expectation $\theta$. In this case $\hat\theta = t(x) = x$ and $\hat\sigma = \sqrt{\hat\theta}$. Carrying through definitions (4.9)–(4.14) gives $\hat a = \hat z_0 = 1/(6\hat\theta^{1/2})$, $\hat c_q = 0$, and so

$\hat\theta_{\mathrm{ABC}}[\alpha] = \hat\theta + \dfrac{w}{(1-\hat a w)^2}\sqrt{\hat\theta}, \qquad w = \hat z_0 + z^{(\alpha)}.$

For $x = 7$, the interval $(\hat\theta_{\mathrm{ABC}}[0.05], \hat\theta_{\mathrm{ABC}}[0.95])$ equals $(3.54, 12.67)$. This compares with the exact interval $(3.57, 12.58)$ for $\theta$, splitting the atom of probability at $x = 7$, and with the standard interval $(2.65, 11.35)$.

Here is a more realistic example of the ABC algorithm, used in a logistic regression context. Table 4 shows the data from an experiment concerning mammalian cell growth.
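Before turning to the cell-growth example, the Poisson calculation above can be verified numerically. This sketch is ours, using scipy's normal quantile function, not code from the paper:

```python
from scipy.stats import norm

theta = 7.0                            # observed x = 7, so theta-hat = x
sigma = theta ** 0.5                   # sigma-hat = sqrt(theta-hat)
a = z0 = 1.0 / (6.0 * theta ** 0.5)    # a-hat = z0-hat = 1/(6 theta-hat^(1/2)); cq-hat = 0

def abc_endpoint(alpha):
    # theta_ABC[alpha] = theta-hat + w/(1 - a*w)^2 * sqrt(theta-hat), w = z0 + z^(alpha)
    w = z0 + norm.ppf(alpha)
    return theta + w / (1.0 - a * w) ** 2 * sigma

lo, hi = abc_endpoint(0.05), abc_endpoint(0.95)
print(round(lo, 2), round(hi, 2))      # 3.54 12.67, matching the interval in the text
```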
The goal of this experiment was to quantify the effects of two factors on the success of a culture. Factor "r" measures the ratio of two key constituents of the culture plate, while factor "d" measures how many days were allowed for culture maturation. A total of 1,843 independent cultures were prepared, investigating 25 different $(r_i, d_j)$ combinations. Table 4 lists $s_{ij}$ and $n_{ij}$ for each combination, the number of successful cultures and the number attempted.

Table 4
Cell data: 1,843 cell cultures were prepared, varying two factors, r (the ratio of two key constituents) and d (the number of days of culturing). Data shown are $s_{ij}/n_{ij}$, the number of successful cultures and the number of cultures attempted, at the ith level of r and the jth level of d

        d1       d2       d3       d4       d5       Total
r1      5/31     3/28     20/45    24/47    29/35    81/186
r2      15/77    36/78    43/71    56/71    66/74    216/371
r3      48/126   68/116   145/171  98/119   114/129  473/661
r4      29/92    35/52    57/85    38/50    72/77    231/356
r5      11/53    20/52    20/48    40/55    52/61    143/269
Total   108/379  162/326  285/420  256/342  333/376  1144/1843

We suppose that the number of successful cultures is a binomial variate,

(4.15) $s_{ij} \stackrel{\text{ind}}{\sim} \text{binomial}(n_{ij}, \pi_{ij}), \qquad i, j = 1, 2, 3, 4, 5,$

with an additive logistic regression model for the unknown probabilities $\pi_{ij}$,

(4.16) $\log\dfrac{\pi_{ij}}{1-\pi_{ij}} = \mu + \alpha_i + \beta_j, \qquad \sum_{i=1}^{5}\alpha_i = \sum_{j=1}^{5}\beta_j = 0.$

For the example here we take the parameter of interest to be

(4.17) $\theta = \pi_{15}/\pi_{51},$

the success probability for the lowest r and highest d divided by the success probability for the highest r and lowest d. This typifies the kind of problem traditionally handled by the standard method.

A logistic regression program calculated maximum likelihood estimates $(\hat\mu, \hat\alpha_i, \hat\beta_j)$, from which we obtained

(4.18) $\hat\theta = \dfrac{1 + \exp[-(\hat\mu + \hat\alpha_5 + \hat\beta_1)]}{1 + \exp[-(\hat\mu + \hat\alpha_1 + \hat\beta_5)]} = 4.16.$

The output of the logistic regression program provided $\hat\mu$, $\hat\Sigma$ and $\hat\eta$ for the ABC algorithm. Section 3 of DiCiccio and Efron (1992) gives the exact specification for an ABC analysis of a logistic regression problem. Applied here, the algorithm gave standard and ABC 0.90 central intervals for $\theta$,

(4.19) $(\hat\theta_{\mathrm{STAN}}[0.05], \hat\theta_{\mathrm{STAN}}[0.95]) = (3.06, 5.26),$
       $(\hat\theta_{\mathrm{ABC}}[0.05], \hat\theta_{\mathrm{ABC}}[0.95]) = (3.20, 5.43).$

The ABC limits are shifted moderately upwards relative to the standard limits, enough to make the shape (1.6) equal 1.32.
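The point estimate (4.18) can be reproduced from Table 4 alone. The following numpy sketch (our own code, not the logistic regression program used in the paper) fits the additive model (4.16) by Newton-Raphson and forms $\hat\theta = \hat\pi_{15}/\hat\pi_{51}$; since $\theta$ is invariant to the coding of the model, baseline dummy coding is used instead of the sum-to-zero constraints:

```python
import numpy as np

# Successes s_ij and trials n_ij from Table 4 (rows r1..r5, columns d1..d5).
s = np.array([[5, 3, 20, 24, 29],
              [15, 36, 43, 56, 66],
              [48, 68, 145, 98, 114],
              [29, 35, 57, 38, 72],
              [11, 20, 20, 40, 52]], float)
n = np.array([[31, 28, 45, 47, 35],
              [77, 78, 71, 71, 74],
              [126, 116, 171, 119, 129],
              [92, 52, 85, 50, 77],
              [53, 52, 48, 55, 61]], float)

# Design matrix for logit(pi_ij) = mu + alpha_i + beta_j, baselines r1 and d1.
rows, cols = np.indices(s.shape)
X = np.column_stack([np.ones(25)] +
                    [(rows.ravel() == i).astype(float) for i in range(1, 5)] +
                    [(cols.ravel() == j).astype(float) for j in range(1, 5)])

beta = np.zeros(X.shape[1])
for _ in range(25):                        # Newton-Raphson (IRLS) iterations
    p = 1.0 / (1.0 + np.exp(-X @ beta))    # fitted cell probabilities
    W = n.ravel() * p * (1.0 - p)          # binomial weights
    step = np.linalg.solve(X.T @ (W[:, None] * X),
                           X.T @ (s.ravel() - n.ravel() * p))
    beta += step
    if np.max(np.abs(step)) < 1e-10:
        break

pi_hat = (1.0 / (1.0 + np.exp(-X @ beta))).reshape(5, 5)
theta_hat = pi_hat[0, 4] / pi_hat[4, 0]    # theta-hat = pi_15 / pi_51
print(round(theta_hat, 2))
```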
The standard intervals are not too bad in this case, although better performance might have been expected with $n = 1{,}843$ data points. In fact it is very difficult to guess a priori what constitutes a large enough sample size for adequate standard-interval performance.

The ABC formulas (4.13)–(4.14) were derived as second-order approximations to the BCa endpoints by DiCiccio and Efron (1992). They showed that these formulas give second-order accuracy as in (2.10), and also second-order correctness. Section 8 reviews some of these results. There are many other expressions for ABC-like interval endpoints that enjoy equivalent second-order properties in theory, although they may be less dependable in practice. A particularly simple formula is

(4.20) $\hat\theta_{\mathrm{ABC}}[\alpha] \doteq \hat\theta_{\mathrm{STAN}}[\alpha] + \hat\sigma\bigl[\hat z_0 + (2\hat a + \hat c_q)\, z^{(\alpha)2}\bigr].$

This shows that the ABC endpoints are not just a translation of $\hat\theta_{\mathrm{STAN}}[\alpha]$.

In repeated sampling situations the estimated constants $(\hat a, \hat z_0, \hat c_q)$ are of stochastic order $1/\sqrt{n}$ in the sample size, the same as $\hat\sigma$. They multiply $\hat\sigma$ in (4.20), resulting in corrections of order $\hat\sigma/\sqrt{n}$ to $\hat\theta_{\mathrm{STAN}}[\alpha]$. If there were only 1/4 as much cell data, $n = 461$, but with the same proportion of successes in every cell of Table 4, then $(\hat a, \hat z_0, \hat c_q)$ would be twice as large. This would double the relative difference $(\hat\theta_{\mathrm{ABC}}[\alpha] - \hat\theta_{\mathrm{STAN}}[\alpha])/\hat\sigma$ according to (4.20), rendering $\hat\theta_{\mathrm{STAN}}[\alpha]$ quite inaccurate.

Both $\hat a$ and $\hat z_0$ are transformation invariant, retaining the same numerical value under monotone parameter transformations $\phi = m(\theta)$. The nonlinearity constant $\hat c_q$ is not invariant, and it can be reduced by transformations that make $\phi$ more linear as a function of $\mu$. Changing parameters from $\theta = \pi_{15}/\pi_{51}$ to $\phi = \log(\theta)$ changes $(\hat a, \hat z_0, \hat c_q)$ from $(-0.006, -0.025, 0.105)$ to $(-0.006, -0.025, 0.025)$ for the cell data. The standard intervals are nearly correct on the $\phi$ scale. The ABC and BCa methods automate this kind of data-analytic trick.
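Formula (4.20) makes the sample-size and transformation effects easy to quantify. A small sketch using the cell-data constants quoted above (the function name is ours):

```python
from scipy.stats import norm

def rel_correction(a, z0, cq, alpha):
    # Relative difference (theta_ABC[alpha] - theta_STAN[alpha]) / sigma-hat, from (4.20)
    z = norm.ppf(alpha)
    return z0 + (2.0 * a + cq) * z ** 2

full_data = rel_correction(-0.006, -0.025, 0.105, 0.95)   # theta scale, n = 1843
quarter   = rel_correction(-0.012, -0.050, 0.210, 0.95)   # constants doubled: n = 461
log_scale = rel_correction(-0.006, -0.025, 0.025, 0.95)   # phi = log(theta) scale
print(full_data, quarter, log_scale)
```

Since (4.20) is linear in the constants, quartering the sample size exactly doubles the relative correction, while on the log scale the correction nearly vanishes, which is why the standard intervals are nearly correct there.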
We can visualize the relationship between the BCa and ABC intervals in terms of Figure 3. The