
Conditional versus Unconditional Exact Tests for Comparing Two Binomials

Cyrus R. Mehta and Pralay Senchaudhuri

4 September 2003

Introduction

There are two fundamentally different exact tests for comparing two binomial probabilities: Fisher's exact test (Fisher, 1925) and Barnard's exact test (Barnard, 1945). Fisher's exact test is the more popular of the two. In fact, Fisher was bitterly critical of Barnard's proposal for esoteric reasons that we will not go into here. Notwithstanding Fisher's reservations, both tests are available in StatXact.

Consider the following example of a vaccine efficacy study (Chan, 1998). In a randomized clinical trial of 30 subjects, 15 were inoculated with a recombinant DNA influenza vaccine and the other 15 were inoculated with a placebo. Twelve of the 15 subjects in the placebo group (80%) eventually became infected with influenza, whereas only 7 of the 15 subjects in the vaccine group (47%) became infected. The data are tabulated as a 2 × 2 table (see Table 1).

Table 1: Results of Vaccine Efficacy Study

Infection Status    Vaccine     Placebo     Combined Response
Yes                 7 (47%)     12 (80%)    19
No                  8 (53%)     3 (20%)     11
Totals              15          15          30

Suppose $\pi_e$ is the probability of infection for the vaccine (or experimental) group and $\pi_c$ is the probability of infection for the placebo (or control) group. What is the exact one-sided p-value for testing the null hypothesis $H_0\colon \pi_e = \pi_c$? The answer depends on the type of exact test adopted. Fisher's exact test produces an exact p-value of 0.0641. Barnard's exact test, on the other hand, produces an exact p-value of 0.0341. The smaller, statistically significant, exact p-value produced by Barnard's method is no accident. For 2 × 2 tables, Barnard's test is more powerful than Fisher's, as Barnard noted in his 1945 paper, much to Fisher's chagrin.

The purpose of this short article is to explain why Barnard's method is more powerful in this setting and to discuss the limitations of the method.
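The Fisher p-value quoted in the Introduction can be checked with a few lines of Python. This is only a minimal sketch: it assumes the one-sided Fisher p-value is the lower hypergeometric tail for the vaccine-group infections, conditional on both margins of Table 1 (the usual definition, which this excerpt does not spell out). The function name `fisher_one_sided` is ours and is not part of StatXact.

```python
from math import comb

def fisher_one_sided(x_e, x_c, n_e=15, n_c=15):
    """Lower-tail conditional p-value P(X_e <= x_e | X_e + X_c = s).

    Assumption: both margins are treated as fixed, so X_e follows a
    hypergeometric distribution under the null hypothesis.
    """
    s = x_e + x_c                      # total number of infections
    lo = max(0, s - n_c)               # smallest x_e compatible with the margins
    total = comb(n_e + n_c, s)         # number of ways to place s infections
    tail = sum(comb(n_e, k) * comb(n_c, s - k) for k in range(lo, x_e + 1))
    return tail / total

p_fisher = fisher_one_sided(7, 12)
print(round(p_fisher, 4))  # 0.0641
```

Only four tables (7, 6, 5 or 4 infections in the vaccine group) contribute to this sum, which already hints at how coarse the conditional sample space is.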


What's the Difference Between the Two Methods?

In order to appreciate the difference between Fisher's and Barnard's exact tests, let us consider a generic 2 × 2 contingency table obtained by inoculating 15 subjects with vaccine and 15 subjects with placebo.

Table 2: A Generic 2 × 2 Table Generated by the Vaccine Efficacy Trial

Infection Status    Vaccine      Placebo      Combined Response
Yes                 x_e          x_c          x_c + x_e
No                  15 − x_e     15 − x_c     30 − x_c − x_e
Totals              15           15           30

Let us denote this generic 2 × 2 table (Table 2) by $X$ and the table that was actually observed (see Table 1) by $X_0$. Suppose that the common probability of infection under the null hypothesis is $\pi_e = \pi_c = \pi$. Then the probability of observing any generic table $X$ is a product of two binomials:

$$P(X \mid \pi) = \binom{15}{x_c}\binom{15}{x_e}\,\pi^{x_c + x_e}(1-\pi)^{30 - x_c - x_e}. \tag{1}$$

The exact p-value is then the sum of such probabilities over all tables $X$ that could have been observed and that are at least as extreme as the observed table $X_0$ under the null hypothesis. Specifically, suppose $T(X)$ is a "discrepancy measure" or test statistic that measures how discrepant any table $X$ is relative to the type of table one would expect under the null hypothesis. Then, for any given $\pi$, the exact p-value of the observed table $X_0$ is

$$p(\pi) = \sum_{T(X) \ge T(X_0)} P(X \mid \pi). \tag{2}$$

There exist many different test statistics for determining whether a table $X$ is more or less extreme than the observed table $X_0$ under the null hypothesis. Since our purpose is to explain the difference between the two types of exact tests, we will not discuss the merits of these different test statistics. For this example we will use the popular Wald statistic

$$T(X) = \frac{\hat\pi_c - \hat\pi_e}{\sqrt{\bar\pi(1-\bar\pi)\left(\frac{1}{15} + \frac{1}{15}\right)}}, \tag{3}$$

where $\hat\pi_e = x_e/15$, $\hat\pi_c = x_c/15$ and $\bar\pi = (x_c + x_e)/30$ is the pooled estimate from all 30 subjects. Similar conclusions would be reached with other test statistics as well.

Although the Wald test statistic $T(X)$ is well defined, equation (2) is not of much practical use for computing an exact p-value because it depends on knowing the value of $\pi$, the common probability of infection under the null hypothesis. One could of course use the data to estimate $\pi$ and then substitute this estimate into equation (2). But then the p-value would no longer be exact. The main difference between the Fisher and Barnard tests is the manner in which they eliminate this nuisance parameter from (2) without sacrificing exactness.
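Equations (1) to (3) are easy to evaluate numerically. The sketch below computes $p(\pi)$ for the observed table and then takes the largest value over a grid of $\pi$, which is the maximization over the nuisance parameter that Barnard's test requires. Several details are our assumptions rather than anything stated in the article: the grid of 999 equally spaced values, the convention $T = 0$ when every or no subject is infected (where the statistic is 0/0), and the small tolerance used to treat floating-point ties as ties. StatXact's actual algorithm is not described here.

```python
from math import comb, sqrt

N = 15  # subjects per arm

def wald(x_e, x_c):
    """Statistic of equation (3) with pooled estimate (x_c + x_e) / 30."""
    pooled = (x_e + x_c) / (2 * N)
    if pooled in (0.0, 1.0):
        return 0.0  # assumption: 0/0 tables are treated as "no discrepancy"
    return (x_c / N - x_e / N) / sqrt(pooled * (1 - pooled) * (2 / N))

T0 = wald(7, 12)  # observed table from Table 1

# All tables at least as extreme as the observed one. The 1e-12 tolerance
# keeps exact ties (e.g. x_e = 3, x_c = 8) despite rounding error.
REGION = [(xe, xc) for xe in range(N + 1) for xc in range(N + 1)
          if wald(xe, xc) >= T0 - 1e-12]

def p_of_pi(pi):
    """Equation (2): sum of the two-binomial probabilities (1) over REGION."""
    return sum(comb(N, xe) * comb(N, xc)
               * pi ** (xe + xc) * (1 - pi) ** (2 * N - xe - xc)
               for xe, xc in REGION)

# Assumed grid search over the nuisance parameter.
grid = [i / 1000 for i in range(1, 1000)]
p_barnard = max(p_of_pi(pi) for pi in grid)

print(round(T0, 3), round(p_barnard, 4))
```

Because both arms have 15 subjects, $p(\pi)$ is symmetric about $\pi = 0.5$, so a grid search may report the maximizer as either $\pi \approx 0.3365$ or its mirror image $1 - \pi$.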


[Figure 3: Distribution of T(X) for Barnard's Exact Test ($\pi^* = .3365$). Annotations on the plot: T = 1.894, p = 0.0341.]

... the sample space for Fisher's exact test much more discrete than it is for Barnard's exact test. Consequently, the number of distinct p-values that one could obtain with Fisher's exact test is smaller than the corresponding number for Barnard's exact test. This in turn implies that if we want to restrict the type-1 error to some upper limit, say 5%, Fisher's procedure will usually be more conservative than Barnard's, resulting in a loss of power. The power loss diminishes as the sample sizes get larger, since the discreteness of the Fisher statistic is then less pronounced.

When comparing Fisher's and Barnard's exact tests, the loss of power due to the greater discreteness of the Fisher statistic is somewhat offset by the requirement that Barnard's exact test must maximize over all possible p-values, by choice of the nuisance parameter $\pi$. For 2 × 2 tables the loss of power due to the discreteness dominates the loss of power due to the maximization, resulting in greater power for Barnard's exact test. But as the number of rows and columns of the observed table increases, the maximizing factor will tend to dominate, and Fisher's exact test will achieve greater power than Barnard's. For details of this investigation see Mehta and Hilton (1993). For additional discussion of this topic see Appendix G of the StatXact-5 User Manual.

References

Barnard GA (1945). A new test for 2 × 2 tables. Nature 156:177.

Chan I (1998). Exact tests of equivalence and efficacy with a non-zero lower bound for comparative studies. Statistics in Medicine 17:1403-1413.

Fisher RA (1925). Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh.

Mehta CR, Hilton JF (1993). Exact power of conditional and unconditional tests: going beyond the 2 × 2 contingency table. American Statistician 47:91-98.
Cytel Software Corporation, 675 Massachusetts Avenue, Cambridge, MA 02139, www.cytel.com