W. J. CONOVER, MARK E. JOHNSON, AND MYRLE M. JOHNSON

cedure appeared to be the best of the six procedures investigated by Hall (1972) in an extensive simulation study, while Keselman, Games, and Clinch (1979) conclude that the jackknife procedure (Mill) has unstable error rates (Type I error) when the sample sizes are unequal. They conclude from their study of 10 tests that "the current tests for variance heterogeneity are either sensitive to nonnormality or, if robust, lacking in power.
Therefore these tests cannot be recommended for the purpose of testing the validity of the ANOVA homogeneity assumption." The four tests studied by Levy (1978) all "were grossly affected by violations of the underlying assumption of normality."

The potential user of a test for equality of variances is thus presented with a confusing array of information concerning which test to use. As a result, many users default to Bartlett's (1937) modification of the likelihood ratio test, a modification that is well known to be nonrobust and that none of the comparative studies recommends except when the populations are known to be normal. The purpose of our study is to provide a list of tests that have a stable Type I error rate when the normality assumption may not be true, when sample sizes may be small and/or unequal, and when distributions may be skewed and heavy-tailed. The tests that show the desired robustness are compared on the basis of power. Further, we hope that our method of comparing tests may be useful in future studies for evaluating additional tests of variance.

The tests examined in this study are described briefly in Section 2. Fifty-six tests for equality of variances are compared, most of which are variations of the most popular and most useful parametric and nonparametric tests available for testing the equality of k variances (k > 2) in the presence of unknown means. Some tests not studied in detail are also mentioned in Section 2, along with the reason for their exclusion. This coverage is by far the most extensive that we are aware of and should provide valuable comparative information regarding tests for variances.

The simulation study is described in Section 3. Each test statistic is computed 1,000 times in each of 91 situations, representing various distributions, sample sizes, means, and variances.
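As a rough illustration of what one such simulated situation might look like, the sketch below estimates a rejection rate by repeated sampling from populations with equal variances. It is not the authors' code. The use of Bartlett's statistic, the four samples of size 10, the normal and double exponential parent distributions, the hard-coded 5% critical value, and every function name are assumptions chosen for this example only.

```python
import math
import random
import statistics

# Upper 5% point of the chi-squared distribution with 3 degrees of freedom
# (k - 1 for k = 4 samples); hard-coded to keep the sketch dependency-free.
CHI2_95_DF3 = 7.815


def bartlett_statistic(samples):
    """Bartlett's corrected likelihood ratio statistic for k samples."""
    k = len(samples)
    sizes = [len(s) for s in samples]
    total = sum(sizes)
    variances = [statistics.variance(s) for s in samples]
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (total - k)
    m = (total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    c = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (total - k)) / (3 * (k - 1))
    return m / c


def rejection_rate(draw, sizes, reps=1000, seed=0):
    """Fraction of `reps` simulated data sets in which the test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        samples = [[draw(rng) for _ in range(n)] for n in sizes]
        if bartlett_statistic(samples) > CHI2_95_DF3:
            rejections += 1
    return rejections / reps


def normal(rng):
    return rng.gauss(0.0, 1.0)


def double_exponential(rng):
    # The difference of two independent unit exponentials is double exponential.
    return rng.expovariate(1.0) - rng.expovariate(1.0)


# All populations have equal variances, so each rate estimates a Type I error.
sizes = [10, 10, 10, 10]  # hypothetical sample sizes
rate_normal = rejection_rate(normal, sizes)
rate_heavy = rejection_rate(double_exponential, sizes)
print(f"normal: {rate_normal:.3f}  double exponential: {rate_heavy:.3f}")
```

Because every population here has the same variance, each printed value is an estimated Type I error rate at a nominal 5% level. Replacing the draw functions with populations of unequal variance would turn the same loop into a power estimate.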
Nineteen of these sample situations have equal variances and are therefore studies of the Type I error rate, while the remaining 72 situations represent studies of the power.

The basic motivation for this study is described in Section 4. The lease production, and revenue (LPR) data base includes, among other data, the actual amount of each sealed bid submitted by oil and gas companies on individual tracts offered by the federal government in all of the sales of offshore oil and gas leases in the United States since 1954. The results of several tests for variances applied to those sales are described. A final section presents the summary and conclusions of this study.

2. A SURVEY OF k-SAMPLE TESTS FOR EQUALITY OF VARIANCES

For $i = 1, \ldots, k$, let $\{X_{ij}\}$ be random samples of size $n_i$ from populations with means $\mu_i$ and variances $\sigma_i^2$. To test the hypothesis of equal variances, one additional assumption is necessary (Moses 1963). One possible assumption is that the $X_{ij}$'s are normally distributed. This leads to a large number of tests, some with exact tables available and some with only asymptotic approximations available, for the distributions of the test statistics. Another possible assumption is that the $X_{ij}$'s are identically distributed when the null hypothesis is true. This assumption enables various nonparametric tests to be formulated. In practice, neither assumption is entirely true, so that all of these tests for variances are only approximate. It is appropriate to examine all of the available tests for their robustness to violations of the assumptions. In this section we present a (nearly) chronological listing of tests for equal variances and a summary of these tests in Tables 1 through 4. Most of the tests in Tables 1 through 3 are based on some modification of the likelihood ratio test statistic derived under the assumption of normality.
Tests that are essentially modifications of the likelihood ratio test or that otherwise rely on the assumption of normality are given in Table 1. Modifications to those tests, employing an estimate of the kurtosis, appear in Table 2. They are asymptotically distribution free for all parent populations, with only minor restrictions. Tests based on a modification of the F test for means are given in Table 3, along with the jackknife test, which does not seem to fit anywhere else. Finally, Table 4 presents modifications of nonparametric tests. The modification consists of using the sample mean or sample median instead of the population mean when computing the test statistic. Only nonparametric tests in the class of linear rank tests are included here, because this class of tests includes all locally most powerful rank tests (Hajek and Sidak 1967). Therefore, in Table 4, only the scores, $a_{N,i}$, for these tests are presented. From these scores, chi squared tests may be formulated based on the statistic

$$X^2 = \sum_{i=1}^{k} n_i (\bar{A}_i - \bar{a})^2 / V^2, \tag{2.1}$$

where $\bar{A}_i$ = mean score in the $i$th sample, $\bar{a}$ = overall mean score $= (1/N) \sum_{i=1}^{N} a_{N,i}$, and $V^2 = [1/(N - 1)] \sum_{i=1}^{N} (a_{N,i} - \bar{a})^2$, which is compared with quantiles from a chi squared distribution with $k - 1$ degrees of freedom. Alternatively, the statistic

TECHNOMETRICS ©, VOL. 23, NO. 4, NOVEMBER 1981 352