A Comparative Study of Tests for Homogeneity of Variances, with Applications to the Outer Continental Shelf Bidding Data
Author(s): W. J. Conover, Mark E. Johnson and Myrle M. Johnson
Source: Technometrics, Vol. 23, No. 4 (Nov., 1981), pp. 351-361
Published by: American Statistical Association and American Society for Quality
Stable URL: http://www.jstor.org/stable/1268225
TECHNOMETRICS ©, VOL. 23, NO. 4, NOVEMBER 1981

This paper was presented at the TECHNOMETRICS Session of the 25th Annual Fall Technical Conference of the Chemical Division of the American Society for Quality Control and the Section on Physical and Engineering Sciences of the American Statistical Association in Gatlinburg, Tennessee, October 29-30, 1981.

A Comparative Study of Tests for Homogeneity of Variances, with Applications to the Outer Continental Shelf Bidding Data

W. J. Conover
College of Business Administration
Texas Tech University
Lubbock, TX 79409

Mark E. Johnson and Myrle M. Johnson
Statistics Group, S-1
Los Alamos National Laboratory
Los Alamos, NM 87545

Many of the existing parametric and nonparametric tests for homogeneity of variances, and some variations of these tests, are examined in this paper. Comparisons are made under the null hypothesis (for robustness) and under the alternative (for power). Monte Carlo simulations of various symmetric and asymmetric distributions, for various sample sizes, reveal a few tests that are robust and have good power. These tests are further compared using data from outer continental shelf bidding on oil and gas leases.

KEY WORDS: Test for homogeneity of variances; Bartlett's test; Robustness; Power; Nonparametric tests; Monte Carlo.

1. INTRODUCTION

Tests for homogeneity of variances are often of interest as a preliminary to other analyses such as analysis of variance or a pooling of data from different sources to yield an improved estimated variance. For example, in the data base described in Section 4, if the variance of the logs of the bids on each offshore lease is homogeneous within a sale, then the scale parameter of the lognormal distribution can be estimated using all the bids in the sale. In quality control work, tests for homogeneity of variances are often a useful endpoint in an analysis.
The classical approach to hypothesis testing usually begins with the likelihood ratio test under the assumption of normal distributions. However, the distribution of the statistic in the likelihood ratio test for equality of variances in normal populations depends on the kurtosis of the distribution (Box 1953), which helps to explain why that test is so sensitive to departures from normality. This nonrobust (sometimes called "puny") property of the likelihood ratio test has prompted the invention of many alternative tests for variances. Some of these are modifications of the likelihood ratio test. Others are adaptations of the F test to test variances rather than means. Many are based on nonparametric methods, although their modification for the case in which the means are unknown often makes these tests distributionally dependent.

Among the many possible tests for equality of variances, one would hope that at least one is robust to variations in the underlying distribution and yet sensitive to departures from the equal variance hypothesis. However, recent comparative studies are not reassuring in this regard. For example, Gartside (1972) studied eight tests and concluded that the only robust procedure was a log-anova test that not only has poor power, but also depends on the unpleasant process of dividing each sample at random into smaller subsamples. Layard (1973) reached a similar conclusion regarding the log-anova test, but indicated that two other tests in his study of four tests, Miller's jackknife procedure and Scheffé's chi squared test, did not suffer greatly from lack of robustness and had considerably more power, at least when sample sizes were equal. These tests are included in our study as Mill and Sch2. Layard indicated a reluctance to use these tests when sample sizes are less than 10, and yet this is the case of interest to us, as we explain later.
The jackknife procedure appeared to be the best of the six procedures investigated by Hall (1972) in an extensive simulation study, while Keselman, Games, and Clinch (1979) conclude that the jackknife procedure (Mill) has unstable error rates (Type I error) when the sample sizes are unequal. They conclude from their study of 10 tests that "the current tests for variance heterogeneity are either sensitive to nonnormality or, if robust, lacking in power. Therefore these tests cannot be recommended for the purpose of testing the validity of the ANOVA homogeneity assumption." The four tests studied by Levy (1978) all "were grossly affected by violations of the underlying assumption of normality."

The potential user of a test for equality of variances is thus presented with a confusing array of information concerning which test to use. As a result, many users default to Bartlett's (1937) modification of the likelihood ratio test, a modification that is well known to be nonrobust and that none of the comparative studies recommends except when the populations are known to be normal. The purpose of our study is to provide a list of tests that have a stable Type I error rate when the normality assumption may not be true, when sample sizes may be small and/or unequal, and when distributions may be skewed and heavy-tailed. The tests that show the desired robustness are compared on the basis of power. Further, we hope that our method of comparing tests may be useful in future studies for evaluating additional tests of variance.

The tests examined in this study are described briefly in Section 2. Fifty-six tests for equality of variances are compared, most of which are variations of the most popular and most useful parametric and nonparametric tests available for testing the equality of k variances (k ≥ 2) in the presence of unknown means.
Some tests not studied in detail are also mentioned in Section 2, along with the reason for their exclusion. This coverage is by far the most extensive that we are aware of and should provide valuable comparative information regarding tests for variances.

The simulation study is described in Section 3. Each test statistic is computed 1,000 times in each of 91 situations, representing various distributions, sample sizes, means, and variances. Nineteen of these sample situations have equal variances and are therefore studies of the Type I error rate, while the remaining 72 situations represent studies of the power.

The basic motivation for this study is described in Section 4. The lease production and revenue (LPR) data base includes, among other data, the actual amount of each sealed bid submitted by oil and gas companies on individual tracts offered by the federal government in all of the sales of offshore oil and gas leases in the United States since 1954. The results of several tests for variances applied to those sales are described. A final section presents the summary and conclusions of this study.

2. A SURVEY OF k-SAMPLE TESTS FOR EQUALITY OF VARIANCES

For $i = 1, \ldots, k$, let $\{X_{ij}\}$ be random samples of size $n_i$ from populations with means $\mu_i$ and variances $\sigma_i^2$. To test the hypothesis of equal variances, one additional assumption is necessary (Moses 1963). One possible assumption is that the $X_{ij}$'s are normally distributed. This leads to a large number of tests, some with exact tables available and some with only asymptotic approximations available, for the distributions of the test statistics. Another possible assumption is that the $X_{ij}$'s are identically distributed when the null hypothesis is true. This assumption enables various nonparametric tests to be formulated. In practice, neither assumption is entirely true, so that all of these tests for variances are only approximate.
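A robustness check of the kind outlined above (repeated computation of a test statistic when the variances are in fact equal) can be sketched as follows. This is a minimal illustration and not the authors' simulation code: it assumes Bartlett's statistic in its usual textbook form, three samples of size 10, and 2,000 replicates instead of the 1,000 used in the study. The function names, the seed, and the two parent distributions are choices made for this sketch. With three samples the chi squared reference distribution has 2 degrees of freedom, so its upper tail probability is exactly exp(-x/2) and no statistical library is needed.

```python
import math
import random
import statistics


def bartlett_statistic(samples):
    """Bartlett's corrected likelihood ratio statistic (textbook form)."""
    k = len(samples)
    dfs = [len(s) - 1 for s in samples]
    total_df = sum(dfs)  # N - k
    variances = [statistics.variance(s) for s in samples]
    pooled = sum(d * v for d, v in zip(dfs, variances)) / total_df
    t2 = total_df * math.log(pooled) - sum(
        d * math.log(v) for d, v in zip(dfs, variances)
    )
    c = 1 + (sum(1 / d for d in dfs) - 1 / total_df) / (3 * (k - 1))
    return t2 / c


def rejection_rate(draw, sizes=(10, 10, 10), reps=2000, alpha=0.05, seed=1):
    """Fraction of equal-variance replicates in which the test rejects."""
    assert len(sizes) == 3, "closed-form p-value below assumes k = 3"
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        samples = [[draw(rng) for _ in range(n)] for n in sizes]
        # Chi squared with 2 degrees of freedom: P(X > x) = exp(-x / 2).
        p_value = math.exp(-bartlett_statistic(samples) / 2)
        rejections += p_value < alpha
    return rejections / reps


def normal(rng):
    return rng.gauss(0.0, 1.0)


def double_exponential(rng):
    # Symmetric and heavy-tailed: difference of two unit exponentials.
    return rng.expovariate(1.0) - rng.expovariate(1.0)


rate_normal = rejection_rate(normal)
rate_heavy = rejection_rate(double_exponential)
print(f"normal: {rate_normal:.3f}  double exponential: {rate_heavy:.3f}")
```

Under normal parents the estimated rate should sit near the nominal 0.05, while the heavy-tailed parent should push it well above that level, which is the kind of instability the study screens for.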
It is appropriate to examine all of the available tests for their robustness to violations of the assumptions. In this section we present a (nearly) chronological listing of tests for equal variances and a summary of these tests in Tables 1 through 4. Most of the tests in Tables 1 through 3 are based on some modification of the likelihood ratio test statistic derived under the assumption of normality. Tests that are essentially modifications of the likelihood ratio test or that otherwise rely on the assumption of normality are given in Table 1. Modifications to those tests, employing an estimate of the kurtosis, appear in Table 2. They are asymptotically distribution free for all parent populations, with only minor restrictions. Tests based on a modification of the F test for means are given in Table 3, along with the jackknife test, which does not seem to fit anywhere else. Finally, Table 4 presents modifications of nonparametric tests. The modification consists of using the sample mean or sample median instead of the population mean when computing the test statistic. Only nonparametric tests in the class of linear rank tests are included here, because this class of tests includes all locally most powerful rank tests (Hajek and Sidak 1967). Therefore, in Table 4, only the scores, $a_{N,i}$, for these tests are presented. From these scores, chi squared tests may be formulated based on the statistic

$$X^2 = \sum_{i=1}^{k} n_i (\bar{A}_i - \bar{a})^2 / V^2, \tag{2.1}$$

where $\bar{A}_i$ = mean score in the $i$th sample, $\bar{a}$ = overall mean score $= (1/N) \sum_{i=1}^{N} a_{N,i}$, and $V^2 = [1/(N-1)] \sum_{i=1}^{N} (a_{N,i} - \bar{a})^2$, which is compared with quantiles from a chi squared distribution with $k - 1$ degrees of freedom. Alternatively, the statistic

$$F = \frac{X^2/(k-1)}{(N - 1 - X^2)/(N - k)} \tag{2.2}$$

may be compared with quantiles from the F distribution with $k - 1$, $N - k$ degrees of freedom.

In the following descriptions of the tests, we let $\bar{X}_i$, $\tilde{X}_i$, and $r_i$ denote the $i$th sample mean, median, and range, respectively, while $\bar{X}$ denotes the overall mean. The $i$th sample variance, with divisor $n_i - 1$, is $s_i^2$. In addition, $N = \sum n_i$, $s^2 = \sum (n_i - 1) s_i^2 / (N - k)$, and

$$F(X_{ij}) = \frac{\sum_i n_i (\bar{X}_i - \bar{X})^2 / (k-1)}{\sum_i \sum_j (X_{ij} - \bar{X}_i)^2 / (N-k)} \tag{2.3}$$

is the usual one-way analysis of variance test statistic. In tests for equal variances, F is computed on some transformation of the $X_{ij}$'s rather than on the $X_{ij}$'s themselves.

Table 1. Tests That Are Classically Based on an Estimate of Sampling Fluctuation Assuming Normality

Abbreviation of Test: Test Statistic and Distribution

N-P: $\chi^2_{k-1} = T_1 = N \ln\left(\frac{N-k}{N} s^2\right) - \sum n_i \ln\left(\frac{n_i - 1}{n_i} s_i^2\right)$

Bar: $\chi^2_{k-1} = T_2 / C$, where $T_2 = (N-k) \ln s^2 - \sum (n_i - 1) \ln s_i^2$ and $C = 1 + \frac{1}{3(k-1)} \left[\sum \frac{1}{n_i - 1} - \frac{1}{N-k}\right]$

Coch: $\max s_i^2 / \sum s_i^2$. See Pearson and Hartley (1970), p. 203 for special tables.

B-K: $\ln(\max s_i^2) - \ln(\min s_i^2)$, normalized by a factor in $n$ that is illegible in the source ($n$ = average sample size). See Pearson and Hartley (1970), p. 177 for special tables.

Hart: $\max s_i^2 / \min s_i^2$. See Pearson and Hartley (1970), p. 202 for special tables.

Cad: $\max r_i / \min r_i$. See Pearson and Hartley (1970), p. 264 for special tables.

Bar3: $F_{k-1,w} = \frac{w T_2}{(k-1)(b - T_2)}$, where $w = (k+1)/(C-1)^2$ and $b = w/(2 - C + 2/w)$. (See Bar for $C$ and $T_2$.)

Sam: $\chi^2_{k-1} = \sum (m_i - m)^2 / a_i^2$, where $m_i = \left(1 - \frac{2}{9(n_i - 1)}\right) s_i^{-2/3}$, $a_i^2 = 2/[9(n_i - 1) s_i^{4/3}]$, and $m = \sum (m_i / a_i^2) / \sum (1 / a_i^2)$

Bar:range: $\chi^2_{k-1} = \left[(N-k) \ln\left(\frac{1}{N-k} \sum (n_i - 1)(r_i/d_i)^2\right) - \sum (n_i - 1) \ln (r_i/d_i)^2\right] / C$. (See Bar for $C$.) See Pearson and Hartley (1970), p. 201 for special tables.

Leh1: $\chi^2_{k-1} = T_3 / 2$, where $T_3 = \sum_i (n_i - 1)\left(P_i - \frac{1}{N-k} \sum_j (n_j - 1) P_j\right)^2$ and $P_i = \ln s_i^2$

Leh2: $\chi^2_{k-1} = (N-k) T_3 / (2N - 4k)$. (See Leh1 for $T_3$.)

Comments on the various tests are now presented. The notation med refers to the replacement of $\bar{X}_i$ with $\tilde{X}_i$ in the test statistic in an attempt to improve the robustness of the test.
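The three statistics in equations (2.1) through (2.3) can be sketched in a few lines. This is an illustrative sketch only: the function names are invented here, the input is assumed to be a list of samples with each observation already replaced by its score, and the example scores are made up for demonstration. One consequence of the definitions is that the F form (2.2) built from the chi squared statistic (2.1) equals the analysis of variance statistic (2.3) applied directly to the scores, which the example checks numerically.

```python
import statistics


def anova_f(groups):
    """One-way analysis of variance F statistic, as in equation (2.3)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups
    )
    within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    return (between / (k - 1)) / (within / (n_total - k))


def score_chi2(score_groups):
    """Chi squared form (2.1), computed from the scores in each sample."""
    all_scores = [a for g in score_groups for a in g]
    n_total = len(all_scores)
    a_bar = sum(all_scores) / n_total
    v2 = sum((a - a_bar) ** 2 for a in all_scores) / (n_total - 1)
    return sum(
        len(g) * (statistics.fmean(g) - a_bar) ** 2 for g in score_groups
    ) / v2


def score_f(score_groups):
    """F form (2.2), obtained from the chi squared statistic (2.1)."""
    k = len(score_groups)
    n_total = sum(len(g) for g in score_groups)
    x2 = score_chi2(score_groups)
    return (x2 / (k - 1)) / ((n_total - 1 - x2) / (n_total - k))


# Hypothetical scores for k = 3 samples (values chosen only for illustration).
scores = [[1.0, 4.0, 9.0], [0.0, 1.0, 1.0, 4.0], [16.0, 9.0, 25.0]]
print(score_chi2(scores), score_f(scores), anova_f(scores))
```

The agreement between `score_f` and `anova_f` follows because $V^2$ is the total sum of squares of the scores divided by $N - 1$, so $X^2$ is $N - 1$ times the between-sample share of that total.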
N-P. The test proposed by Neyman and Pearson (1931) is the likelihood ratio test under normality. We also examine the modification N-P:med.

Bar. Bartlett (1937) modified N-P to "correct for bias." The resulting test is probably the most commonly used test for equality of variances. It is well known to be sensitive to departures from normality. Recent papers by Glaser (1976), Chao and Glaser (1978), and Dyer and Keating (1980) give methods for finding the exact distribution of the test statistic. We also examine Bar:med.

Coch. The test introduced by Cochran (1941) was considerably easier to compute than the tests up to that time. With today's computers the difference in computation time is slight, however. We also look at Coch:med.

B-K. Another attempt to simplify calculations resulted in this test by Bartlett and Kendall (1946), which relies on the fact that $\ln s^2$ is approximately normal and uses tables for the normalized range in normal samples. We do not examine this test because of its equivalence to the following test.

Hart. Four years after B-K this test by Hartley (1950) was presented. Well known as the "F-max" test, it is merely an exponential transformation of B-K. An advantage of this test is the exact tables available for equal sample sizes (David 1952). We also examine Hart:med.

Table 2. Tests That Attempt To Estimate Kurtosis

Abbreviation of Test: Test Statistic and Distribution

Bar1: $\chi^2_{k-1} = T_2 / [C(1 + \gamma/2)]$, with $\gamma$ estimated from the sample moments (the estimator is illegible in the source). (See Bar for $T_2$ and $C$.)

Bar2: $\chi^2_{k-1} = T_2 / [C(1 + \gamma/2)]$, where $\gamma = \frac{N \sum_i \sum_j (X_{ij} - \bar{X}_i)^4}{\left[\sum_i (n_i - 1) s_i^2\right]^2} - 3$

Sch1: $\chi^2_{k-1} = T_3 / [2 + (1 - k/N)\gamma]$. (See Leh1 for $T_3$, Bar1 for $\gamma$.)

Sch2: $\chi^2_{k-1} = T_3 / [2 + (1 - k/N)\gamma]$. (See Leh1 for $T_3$, Bar2 for $\gamma$.)
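For readers who want to compute the normal-theory statistics described above, the following sketch implements N-P, Bar, Coch, and Hart in their standard textbook forms. It is not taken from the paper: the function names and the two small data sets are assumptions made for illustration, and critical values would still have to come from the chi squared distribution or the special tables cited in Table 1.

```python
import math
import statistics


def variance_summaries(samples):
    """Degrees of freedom n_i - 1 and sample variances s_i^2 (divisor n_i - 1)."""
    dfs = [len(s) - 1 for s in samples]
    variances = [statistics.variance(s) for s in samples]
    return dfs, variances


def pooled_variance(dfs, variances):
    return sum(d * v for d, v in zip(dfs, variances)) / sum(dfs)


def neyman_pearson(samples):
    """N-P: likelihood ratio statistic T1 under normality."""
    dfs, variances = variance_summaries(samples)
    n_total = sum(len(s) for s in samples)
    s2 = pooled_variance(dfs, variances)
    return n_total * math.log(sum(dfs) / n_total * s2) - sum(
        len(s) * math.log(d / len(s) * v)
        for s, d, v in zip(samples, dfs, variances)
    )


def bartlett(samples):
    """Bar: T2 / C, the bias-corrected form of the likelihood ratio statistic."""
    k = len(samples)
    dfs, variances = variance_summaries(samples)
    total_df = sum(dfs)  # N - k
    t2 = total_df * math.log(pooled_variance(dfs, variances)) - sum(
        d * math.log(v) for d, v in zip(dfs, variances)
    )
    c = 1 + (sum(1 / d for d in dfs) - 1 / total_df) / (3 * (k - 1))
    return t2 / c


def cochran(samples):
    """Coch: largest sample variance over the sum of sample variances."""
    _, variances = variance_summaries(samples)
    return max(variances) / sum(variances)


def hartley(samples):
    """Hart: the F-max ratio of largest to smallest sample variance."""
    _, variances = variance_summaries(samples)
    return max(variances) / min(variances)


# Illustrative data: equal sample variances, then one inflated variance.
equal = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0], [5.0, 6.0, 7.0]]
unequal = [[1.0, 2.0, 3.0], [0.0, 10.0, 20.0], [5.0, 6.0, 7.0]]
for name, data in (("equal", equal), ("unequal", unequal)):
    print(name, neyman_pearson(data), bartlett(data), cochran(data), hartley(data))
```

When all sample variances coincide, N-P and Bar are zero, Coch is 1/k, and Hart is 1; any heterogeneity moves each statistic away from those values.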
Table 3. Tests Based on a Modification of the F Test for Means (see equation (2.3) for $F(\cdot)$)

Abbreviation of Test: Test Statistic and Distribution

Lev1: $F_{k-1,N-k} = F(|X_{ij} - \bar{X}_i|)$

Lev2: $F_{k-1,N-k} = F((X_{ij} - \bar{X}_i)^2)$

Lev3: $F_{k-1,N-k} = F(\ln (X_{ij} - \bar{X}_i)^2)$

Lev4: $F_{k-1,N-k} = F(|X_{ij} - \bar{X}_i|^{1/2})$

Mill: $F_{k-1,N-k} = F(U_{ij})$, where $U_{ij} = n_i \ln s_i^2 - (n_i - 1) \ln s_{i(j)}^2$ and $s_{i(j)}^2 = \frac{1}{n_i - 2}\left[(n_i - 1) s_i^2 - n_i (X_{ij} - \bar{X}_i)^2 / (n_i - 1)\right]$

Cad. A desire for simplification led to replacing the variance in Hart with the sample range in a paper by Cadwell (1953). Exact tables for equal sample sizes are given by Harter (1963) for $k = 2$ and Leslie and Brown (1966) for $k \le 12$. We do not examine this test because we feel that the computational advantages are no longer real with present-day software.

Bar1. Box (1953) showed that the asymptotic distribution of Bar was dependent on the common kurtosis of the sampled distributions and that by dividing Bar by $(1 + \gamma/2)$, where $\gamma = E(X_{ij} - \mu_i)^4 / \sigma_i^4 - 3$, the test would be asymptotically distribution free, provided the assumption of common kurtosis was met. Our form for this modification of Bar involves estimating $\gamma$ with the sample moments, a suggestion that Layard (1973) attributes to Scheffé (1959). We also examine Bar1:med. Bar2 and Bar2:med result from a different estimator for $\gamma$ as given by Layard.

Box. An interesting approach to obtaining a more robust test for variance involves using the one-way layout F statistic, which is known to be quite robust. A concept suggested by Bartlett and Kendall (1946) was developed by Box (1953) into a test known as the log-anova test. For a preselected, arbitrary integer $m > 2$, each sample is divided into subsamples of size $m$ in some random manner. (See Martin and Games 1975, 1977 and Martin 1976 for suggestions on the size of $m$.) Remaining observations either are not used or are included in the final subsample. The sample variance $s_{ij}^2$ is computed for each subsample, $i = 1, \ldots, k$, $j = 1, \ldots, [n_i/m] = J_i$. A log transformation $Y_{ij} = \ln s_{ij}^2$ then makes the variables more nearly normal, and $F(Y_{ij})$ is used as a test statistic. Subsequent studies by Gartside (1972), Layard (1973), and Levy (1975) confirmed the robustness of this method, but also revealed a lack of power as compared with other tests that have the same robustness. A modification that leads to a more nearly normal sample is attributed to Bargmann by Gartside (1972). It uses $W_{ij} = w_i(\ln s_{ij}^2 + c_i)$, where $w_i$ and $c_i$ are normalizing constants. However, the random method of subdividing samples and the possibility of not using all of the observations make these procedures unattractive to the practitioner. For this reason we do not include these tests in our study. A Monte Carlo comparison of these methods with the jackknife methods (see Mill) is presented by Martin and Games (1977).

Mood. The first nonparametric test for the variance problem was presented by Mood (1954). It, like all of the nonparametric tests, assumes identical distributions under the null hypothesis.
In particular, this requires equal means, or a known transformation to achieve equal means, which is often not met in applications. Therefore, we adapt the Mood test and all of the nonparametric tests as follows. Instead of letting $R_{ij}$ be the rank of $X_{ij}$ when the means are equal or of $(X_{ij} - \mu_i)$ when the means are unequal but known, we let $R_{ij}$ be the rank of $(X_{ij} - \bar{X}_i)$. Each $X_{ij}$ is then replaced by the score $a_{N,R_{ij}}$ based on this rank. The result is a test that is not nonparametric but may be as robust and powerful as some of its parametric competitors. The use of $\tilde{X}_i$ instead of $\bar{X}_i$ results in Mood:med, which we also examine. The chi squared approximation and the F approximation for each test lead to four variations, which are studied.

F-A-B. Although the Mood test is a quadratic function of $R_{ij}$, this test introduced by Freund and Ansari (1957) and further developed by Ansari and Bradley (1960) is a linear function of $R_{ij}$. Again, we let $R_{ij}$ be the rank of $(X_{ij} - \bar{X}_i)$. We examine four variations of F-A-B (see Mood).

Table 4. Linear Rank Tests (scores may be used in equations (2.1), (2.2), or (2.3)). The score $a_{N,R_{ij}}$ is a function of $R_{ij}$.

  Test    Score function $a_{N,i}$                           $R_{ij}$ is the rank of
  Mood    $(i - (N+1)/2)^2$                                  $(X_{ij} - \bar{X}_i)$
  F-A-B   $(N+1)/2 - |i - (N+1)/2|$: 1, 2, 3, ..., 3, 2, 1   $(X_{ij} - \bar{X}_i)$
  B-D     ..., 3, 2, 1, 1, 2, 3, ...                         $(X_{ij} - \bar{X}_i)$
  S-T     1, 4, 5, ..., 6, 3, 2                              $(X_{ij} - \bar{X}_i)$
  Capon   $[E(Z_{N,i})]^2$                                   $(X_{ij} - \bar{X}_i)$
  Klotz   $[\Phi^{-1}(i/(N+1))]^2$                           $(X_{ij} - \bar{X}_i)$
  T-G     $i$                                                $|X_{ij} - \bar{X}_i|$
  S-R     $i^2$                                              $|X_{ij} - \bar{X}_i|$
  F-K     $\Phi^{-1}(1/2 + i/(2(N+1)))$                      $|X_{ij} - \bar{X}_i|$

  Here $Z_{N,i}$ is the ith order statistic from a standard normal random sample of size N, and $\Phi(x)$ is the standard normal distribution function.

The B-D test was introduced by Barton and David (1958) shortly after the F-A-B test and is similar to the F-A-B test in principle. Whereas the F-A-B scores are triangular in shape, the B-D scores follow a V shape with the large scores at the extremes and the small scores at the grand median. The result is a test with the same robustness and power as F-A-B. The same can be said for the S-T test,
introduced by Siegel and Tukey (1960) at about the same time. The only advantage of the S-T test is that tables for the Mann-Whitney test may be used; no special exact tables are required. We do not examine the B-D and S-T tests here because the results would be essentially the same as those found for F-A-B.

Sch1. The test statistic of this parametric procedure, attributed by Layard (1973) to Scheffe (1959), resembles in some respects the numerator of an F statistic computed on $s_i^2$, weighted by the degrees of freedom $n_i - 1$. The denominator is a function of the (assumed) common kurtosis, which in practice must be estimated. We use the sample kurtosis for $\gamma$, and also examine Sch1:med. The variations Sch2 and Sch2:med arise when Layard's estimator for $\gamma$ is used.

Leh1. Lehmann's (1959) suggested procedure is the same as Sch1, but with $\gamma = 0$ as in normal distributions. Ghosh (1972) shows that multiplication by $(N - k)/(N - 2k)$ gives a distribution closer to the chi square. We call this variation Leh2 and examine Leh1:med and Leh2:med also.

Lev1. Levene (1960) suggested using the one-way analysis of variance on the variables $Z_{ij} = |X_{ij} - \bar{X}_i|$ as a method of incorporating the robustness of that test into a test for variance. Further variations suggested by Levene involve $Z_{ij}^2$ (Lev2), $\ln Z_{ij}$ (Lev3), and $Z_{ij}^{1/2}$ (Lev4). We also consider Lev1:med, recommended by Brown and Forsythe (1974), and Lev4:med, but do not examine Lev3:med because $\ln 0 = -\infty$ occurs with odd sample sizes. We also do not consider use of the trimmed mean as Brown and Forsythe did, largely because their results indicated no advantages in using this variation.

Capon. Instead of using scores that are a quadratic function of the ranks as Mood had done, Capon (1961) suggested choosing scores that give optimum power in some sense. The result is this normal scores test, which is locally most powerful among rank tests against the normal-type alternatives, and asymptotically locally most powerful among all tests for this alternative.

Klotz. Shortly thereafter, Klotz (1962) introduced another normal scores test that used the more convenient normal quantiles. The result has possibly less power locally for small sample sizes, but has the same asymptotic properties as Capon. Because of its convenience, we examine the Klotz test, but not the very similar Capon. As in Mood, four variations of Klotz are considered.

Bar:range. Implicit in the literature since Patnaik's (1950) paper on the use of the range instead of the variance, but not explicitly mentioned until Gartside (1972), is this variation of Bar that uses the standardized range instead of the variance. The standardizing constants $d_i$ are available from Pearson and Hartley (1970, p. 201). The number of degrees of freedom of the resulting chi squared test is adjusted from $(k - 1)$ to $\nu_i$, where $\nu_i$ is available in the same reference. We do not examine this test because in general the range is less efficient than the sample variance.

Mill. The innovative jackknife procedure was applied to variance testing by Miller (1968). The jackknife procedure relies on partitioning the samples into subsamples of some predetermined size m. We take m = 1, to remove the chance variation involved with m > 1. We do not examine Mill:med.

Bar3. Dixon and Massey (1969) reported a variation of Bar that uses the F distribution. We also examine Bar3:med.

Sam. The cube root of $s^2$ is more nearly normal than $s^2$, which leads to this test by Samiuddin (1976). We also examined Sam:med.

F-K. Fligner and Killeen (1976) suggest ranking $|X_{ij}|$ and assigning increasing scores $a_{N,i} = i$, $a_{N,i} = i^2$, and $a_{N,i} = \Phi^{-1}(1/2 + i/(2(N + 1)))$ based on those ranks.
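As a minimal sketch of the three score functions just listed (not the authors' code), the following Python computes $a_{N,i}$ for $i = 1, \ldots, N$. The function name `rank_scores` and the labels "linear", "squared", and "normal" are hypothetical, introduced only for illustration; `NormalDist().inv_cdf` from the Python standard library stands in for $\Phi^{-1}$.

```python
from statistics import NormalDist


def rank_scores(n_total, kind):
    """Return the scores a_{N,i} for i = 1..N (hypothetical helper).

    kind = "linear"  -> a_{N,i} = i
    kind = "squared" -> a_{N,i} = i^2
    kind = "normal"  -> a_{N,i} = Phi^{-1}(1/2 + i / (2(N + 1)))
    """
    ranks = range(1, n_total + 1)
    if kind == "linear":
        return [float(i) for i in ranks]
    if kind == "squared":
        return [float(i * i) for i in ranks]
    if kind == "normal":
        phi_inv = NormalDist().inv_cdf  # standard normal quantile function
        # The argument stays strictly between 1/2 and 1 for 1 <= i <= N.
        return [phi_inv(0.5 + i / (2 * (n_total + 1))) for i in ranks]
    raise ValueError(f"unknown score type: {kind}")
```

For N = 4 the normal-type scores are the standard normal quantiles at .6, .7, .8, and .9, so all three choices assign larger scores to larger ranks.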
We suggest using the ranks of $|X_{ij} - \bar{X}_i|$ and call the first test T-G after Talwar and Gentle (1977), who used a trimmed mean instead of $\bar{X}_i$. The second test, called the squared ranks test S-R, was discussed by Conover and Iman (1978), but has roots in earlier papers by Shorack (1965), Duran and Mielke (1968), and others. We denote the third test by F-K, even though we have taken liberties with their suggestion. We also examine, as with Mood, the four variations associated with each test. We do not examine Fligner and Killeen's suggestion of using the grand median in place of $\tilde{X}_i$.

This list of tests does not include others such as one by Moses (1963) that relies on a random pairing within samples or one by Sukhatme (1958) that is closely related to some of the linear rank tests already included. Also, the Box-Anderson (1955) permutation test for two samples, which Shorack (1965) highly recommends, was found by Hall (1972) to have Type I error rates as high as 27 percent in the multisample case with normal populations at $\alpha = .05$, so it is not included in our study. However, the list is extensive enough for our purposes, namely, to obtain a listing of tests for variances that appear to have well-controlled Type I error rates, and to compare the power of the tests. This is accomplished in the next section.

3. THE RESULTS OF A SIMULATION STUDY

In the search for one or more tests that are robust as well as powerful, it became necessary to obtain pseudorandom samples from several distributions, using several sample sizes and various combinations of variances. The simulation study is described in this section. The results in terms of percent of times the null hypothesis was rejected are summarized in Tables 5 and 6.

For symmetric distributions we chose the uniform,
normal, and double exponential distributions. Uniform random numbers were simulated using CDC's uniform generator RANNUM, which is a multiplicative congruential generator type. The normal and double exponential variates were obtained from the respective inverse cumulative distribution functions. Four samples were drawn with respective sample sizes $(n_1, n_2, n_3, n_4)$ = (5, 5, 5, 5), (10, 10, 10, 10), (20, 20, 20, 20), and (5, 5, 20, 20). The null hypothesis of equal variances (all equal to 1) was examined along with the four alternatives $(\sigma_1^2, \sigma_2^2, \sigma_3^2, \sigma_4^2)$ = (1, 1, 1, 2), (1, 1, 1, 4), (1, 1, 1, 8), and (1, 2, 4, 8). The mean was set equal to the standard deviation in each population under the alternative hypothesis. Zero means were used for $H_0$.

Each of these 60 combinations of distribution type, sample size, and variances was repeated 1,000 times, so that the 56 test statistics mentioned in Section 2 were computed and compared with their 5 percent and 1 percent nominal critical values 60,000 times each. The observed frequency of rejection of the null hypothesis is reported in Table 5 for normal distributions and in Table 6 for double exponential distributions. The figures in parentheses in those tables represent the averages over the four variance combinations under the alternative hypothesis. The standard errors of all entries in Tables 5 and 6 are less than .016. The results for the uniform distribution are not reported here to save space. A table with the results for the uniform distribution is available from the authors on request.

The corresponding figures for the asymmetric case were obtained by squaring the random variables obtained in the symmetric case to obtain highly skewed and extremely leptokurtic distributions.
To be more specific, we used $\sigma_i X_i^2 + \mu_i$ rather than $(\sigma_i X_i + \mu_i)^2$, where $X_i$ represents the null distributed random variable, because the latter transformation does not allow as much control over means and variances as does the former. The three distributions (uniform)², (normal)², and (double exponential)², in combination with two sample sizes (10, 10, 10, 10) and (5, 5, 20, 20) and the five variance combinations (the null case and four alternatives, as before) gave a total of 30 combinations. For each combination, 1,000 repetitions were run for each of the 56 test statistics. The average frequency of rejection, averaged over the four variance combinations under the alternative, is presented in Tables 5 and 6 also.

The columns in Tables 5 and 6 represent the various sample sizes under symmetric and asymmetric distributions. For convenience, the nonsymmetric distributions are simply called asymmetric, although this is not meant to imply that the simulation results are attributable to the skewness of those distributions rather than to the extreme leptokurtic nature of those same asymmetric distributions. The seventh column in Table 5 represents a special study chosen to resemble the application situation described in Section 4. In brief, 13 samples in which the sample sizes were 2 (7 samples), 3 (2 samples), 4, 7 (2 samples), and 13, were drawn from standard normal distributions. This was repeated 1,000 times and 55 test statistics (Mill cannot be computed for $n_i = 2$) were computed each time. This case was investigated to see how the tests might behave under conditions typically encountered in oil-lease-bidding data.

There are many different ways of interpreting the results of Tables 5 and 6, just as there are many ways of defining what is a "good" test as opposed to a "bad" test. We will define a test to be robust if the maximum Type I error rate is less than .10 for a 5 percent test.
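One way to read this threshold against the Monte Carlo noise is sketched below, under the assumption (ours, not stated in this form by the authors) that each estimated test size $\hat{p}$ is a binomial proportion from 1,000 independent replications. The first line recovers the bound of .016 on the standard errors quoted earlier in this section; the remaining lines suggest that an estimated size of .10 lies several standard errors above the nominal .05.

```latex
\begin{align*}
\mathrm{SE}(\hat{p}) &= \sqrt{\frac{p(1-p)}{1000}}
  \le \sqrt{\frac{0.25}{1000}} \approx 0.0158 < 0.016, \\
\mathrm{SE}(\hat{p})\Big|_{p = 0.05} &= \sqrt{\frac{0.05 \times 0.95}{1000}} \approx 0.0069, \\
\frac{0.10 - 0.05}{0.0069} &\approx 7.3 .
\end{align*}
```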
The four tests that qualify under this criterion, and their maximum estimated test size in parentheses, are Bar2:med (.071), Lev1:med (.060), Lev2:med (.078), and F-K:med χ² (.099). We include F-K:med F (.112) in this group of robust tests also, because in 18 of the 19 null cases examined the estimated test size was less than .084, which is well under control. Of these five tests the second, fourth, and fifth tests appear to have slightly more power than the other two. It is interesting to note that if the qualifications for robustness are loosened somewhat to max test size ≤ .15, only one new test is included, Lev4:med (.145). Two additional tests have max test size ≤ .20. These are Lev2 (.163) and Bar2 (.172). The increase in the Type I error rates of Lev2 and Bar2 over Lev2:med and Bar2:med is accompanied by only a 40 percent relative increase in power. The other test, Lev4:med, has less power. Therefore, a reasonable conclusion seems to be that the five tests with max test size ≤ .112 qualify as robust tests for variances, with the tests Lev1:med, F-K:med χ², and its sister test F-K:med F having slightly more power than the other two. Notice the resemblance among these three tests. The first uses an analysis of variance on $|X_{ij} - \tilde{X}_i|$, while the second and third convert $|X_{ij} - \tilde{X}_i|$ to ranks and then to normal-type scores, which are then subjected to either a chi squared test or an analysis of variance F test.

Similar conclusions were drawn using $\alpha = .01$. The only tests with a reasonably well-controlled test size are the same five tests that were selected using $\alpha = .05$. On the basis of demonstrated power at $\alpha = .01$, the same three tests mentioned for $\alpha = .05$ again appear to be the best. Therefore, the number of rejections for each test at $\alpha = .01$ is not reported.

If we consider only those five cases that have symmetric distributions, there are many additional tests that qualify as robust under the above definition.
The five that show the most power, in order of decreasing power, are Bar2, Klotz:med F, Klotz:med χ², Lev4:med, and S-R:med F. However, the power of these five tests for symmetric distributions is about the same
Table 5. For Normal and (Normal)² Distributions, Proportion of Times the Null Hypothesis of Equal Variances Was Rejected by the Various Tests, Under the Null Hypothesis (test size) and (in parentheses) Under the Alternative Hypothesis (power), at α = .05

               Normal distribution: symmetric                                             (Normal)²: asymmetric
Test           (5,5,5,5)      (10,10,10,10)  (20,20,20,20)  (5,5,20,20)    Special study  (10,10,10,10)  (5,5,20,20)

TABLE 1 TESTS
N-P            .103 (.455)    .069 (.662)    .071 (.812)    .104 (.759)    .625           .674 (.826)    .663 (.865)
N-P:med        .115 (.454)    .081 (.662)    .077 (.814)    .098 (.750)    .639           .687 (.830)    .669 (.864)
Bar            .033 (.298)    .051 (.600)    .060 (.796)    .049 (.646)    .032           .614 (.788)    .567 (.797)
Bar:med        .034 (.299)    .052 (.596)    .064 (.798)    .049 (.630)    .034           .629 (.795)    .577 (.798)
Coch           .040 (.356)    .045 (.602)    .043 (.791)    .138 (.706)    .234           .480 (.663)    .576 (.768)
Coch:med       .041 (.353)    .042 (.604)    .045 (.792)    .151 (.684)    .223           .493 (.669)    .592 (.762)
Hart           .028 (.231)    .055 (.554)    .052 (.774)    .218 (.739)    .625           .604 (.772)    .720 (.882)
Hart:med       .029 (.235)    .058 (.552)    .056 (.776)    .213 (.720)    .627           .613 (.777)    .725 (.879)
Bar3           .034 (.303)    .051 (.600)    .060 (.796)    .049 (.648)    .040           .614 (.788)    .570 (.799)
Bar3:med       .037 (.306)    .053 (.597)    .064 (.798)    .049 (.633)    .046           .629 (.796)    .579 (.799)
Sam            .022 (.269)    .046 (.587)    .058 (.794)    .045 (.607)    .008           .606 (.781)    .538 (.764)
Sam:med        .019 (.274)    .048 (.582)    .064 (.795)    .054 (.594)    .006           .616 (.790)    .547 (.766)
Leh1           .094 (.377)    .082 (.618)    .069 (.792)    .102 (.731)    .498           .664 (.814)    .634 (.858)
Leh1:med       .104 (.381)    .085 (.615)    .078 (.794)    .099 (.722)    .511           .676 (.819)    .648 (.854)
Leh2           .179 (.514)    .108 (.665)    .079 (.806)    .119 (.761)    .745           .697 (.837)    .673 (.873)
Leh2:med       .198 (.515)    .106 (.665)    .087 (.807)    .112 (.750)    .748           .717 (.844)    .680 (.870)

TABLE 2 TESTS
Bar1           .273 (.612)    .154 (.709)    .105 (.822)    .123 (.729)    .648           .487 (.689)    .301 (.545)
Bar1:med       .121 (.435)    .087 (.638)    .082 (.807)    .092 (.667)    .397           .365 (.549)    .182 (.414)
Bar2           .047 (.132)    .053 (.383)    .051 (.734)    .048 (.505)    .050           .143 (.249)    .083 (.206)
Bar2:med       .007 (.039)    .024 (.281)    .033 (.696)    .029 (.431)    .014           .043 (.100)    .021 (.090)
Sch1           .272 (.603)    .163 (.710)    .119 (.819)    .176 (.790)    .808           .558 (.742)    .421 (.706)
Sch1:med       .170 (.477)    .114 (.649)    .090 (.802)    .140 (.737)    .722           .443 (.630)    .321 (.605)
Sch2           .112 (.242)    .079 (.419)    .063 (.720)    .103 (.645)    .510           .247 (.380)    .206 (.447)
Sch2:med       .056 (.136)    .048 (.322)    .049 (.682)    .072 (.577)    .402           .137 (.228)    .122 (.312)

TABLE 3 TESTS
Lev1           .083 (.303)    .064 (.543)    .058 (.768)    .060 (.583)    .263           .349 (.561)    .293 (.489)
Lev1:med       .002 (.065)    .025 (.437)    .039 (.732)    .032 (.521)    .057           .054 (.184)    .043 (.142)
Lev2           .057 (.235)    .047 (.489)    .048 (.774)    .055 (.456)    .163           .097 (.208)    .116 (.107)
Lev2:med       .011 (.080)    .015 (.388)    .033 (.749)    .035 (.383)    .048           .014 (.061)    .044 (.029)
Lev3           .069 (.192)    .062 (.337)    .057 (.554)    .069 (.461)    .403           .461 (.637)    .471 (.699)
Lev4           .091 (.283)    .069 (.493)    .060 (.716)    .070 (.571)    .372           .491 (.688)    .477 (.710)
Lev4:med       .000 (.004)    .037 (.383)    .034 (.659)    .049 (.552)    †              .144 (.297)    .104 (.337)
Mill           .030 (.134)    .040 (.435)    .054 (.752)    .077 (.550)    †              .153 (.254)    .172 (.324)

TABLE 4 TESTS
Mood χ²        .070 (.247)    .069 (.472)    .060 (.711)    .066 (.562)    .215           .752 (.862)    .684 (.827)
Mood F         .091 (.296)    .077 (.494)    .063 (.716)    .076 (.578)    .317           .768 (.874)    .702 (.837)
Mood:med χ²    .002 (.033)    .036 (.342)    .038 (.657)    .032 (.491)    .059           .410 (.577)    .370 (.623)
Mood:med F     .009 (.063)    .041 (.367)    .038 (.663)    .036 (.506)    .094           .433 (.595)    .381 (.636)
F-A-B χ²       .070 (.193)    .058 (.395)    .056 (.634)    .060 (.516)    .269           .728 (.838)    .638 (.803)
F-A-B F        .094 (.240)    .068 (.415)    .058 (.643)    .065 (.532)    .380           .741 (.852)    .648 (.811)
F-A-B:med χ²   .000 (.000)    .034 (.276)    .029 (.566)    .040 (.486)    .033           .395 (.550)    .340 (.643)
F-A-B:med F    .000 (.000)    .037 (.294)    .030 (.575)    .043 (.504)    .065           .418 (.572)    .357 (.660)
Klotz χ²       .057 (.265)    .053 (.526)    .058 (.772)    .062 (.538)    .152           .713 (.841)    .678 (.802)
Klotz F        .078 (.311)    .064 (.547)    .062 (.777)    .072 (.551)    .222           .741 (.855)    .704 (.815)
Klotz:med χ²   .011 (.078)    .031 (.407)    .034 (.734)    .033 (.472)    .050           .352 (.554)    .328 (.570)
Klotz:med F    .015 (.104)    .039 (.424)    .036 (.740)    .033 (.490)    .084           .387 (.574)    .348 (.588)
S-R χ²         .060 (.248)    .062 (.474)    .057 (.709)    .060 (.566)    .228           .613 (.770)    .589 (.789)
S-R F          .093 (.296)    .074 (.494)    .059 (.714)    .068 (.586)    .322           .630 (.783)    .611 (.802)
S-R:med χ²     .000 (.015)    .023 (.332)    .032 (.658)    .026 (.491)    .054           .171 (.322)    .105 (.347)
S-R:med F      .003 (.038)    .029 (.353)    .035 (.664)    .032 (.509)    .092           .183 (.340)    .119 (.364)
F-K χ²         .044 (.248)    .043 (.521)    .051 (.776)    .050 (.528)    .127           .422 (.623)    .361 (.576)
F-K F          .061 (.296)    .052 (.540)    .053 (.782)    .054 (.544)    .174           .442 (.646)    .383 (.596)
F-K:med χ²     .004 (.058)    .018 (.413)    .033 (.746)    .030 (.470)    .034           .066 (.218)    .052 (.197)
F-K:med F      .009 (.081)    .020 (.436)    .033 (.751)    .032 (.489)    .054           .084 (.235)    .057 (.211)
T-G χ²         .068 (.203)    .058 (.397)    .056 (.636)    .058 (.525)    .305           .610 (.753)    .608 (.809)
T-G F          .089 (.247)    .067 (.420)    .059 (.643)    .065 (.540)    .418           .621 (.770)    .623 (.818)
T-G:med χ²     .000 (.000)    .027 (.268)    .025 (.564)    .035 (.472)    .027           .256 (.390)    .172 (.444)
T-G:med F      .000 (.000)    .033 (.288)    .026 (.573)    .038 (.491)    .053           .272 (.413)    .189 (.458)

† Only one Special study value, .020, is legible in the source for these two rows; which of the two rows it belongs to is unclear.
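The rejection proportions in Table 5 are Monte Carlo estimates. As a rough sketch of how one such entry could be produced, the Python code below estimates the size and power of Lev1:med for four normal samples of size 5. It assumes, since this part of the paper does not restate the definition, that Lev1:med is the one-way ANOVA F statistic computed on absolute deviations from each sample median. The function names, the seed, the number of replications, and the alternative variances (1, 2, 4, 8) are illustrative assumptions, not settings taken from the study.

```python
import random
import statistics


def lev1_med_statistic(groups):
    """One-way ANOVA F statistic on absolute deviations from group medians.

    Assumed form of Lev1:med; the paper's exact definition is given elsewhere.
    """
    z = [[abs(x - statistics.median(g)) for x in g] for g in groups]
    k = len(z)
    n_total = sum(len(g) for g in z)
    means = [sum(g) / len(g) for g in z]
    grand_mean = sum(sum(g) for g in z) / n_total
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(z, means))
    within = sum((v - m) ** 2 for g, m in zip(z, means) for v in g)
    if within == 0.0:
        return 0.0  # degenerate case: no within-group spread
    return (between / (k - 1)) / (within / (n_total - k))


# Upper 5 percent point of F(3, 16), valid only for 4 groups of 5 observations.
F_CRIT_3_16 = 3.2389


def rejection_rate(sigmas, n=5, reps=2000, seed=1, crit=F_CRIT_3_16):
    """Proportion of simulated normal data sets in which the test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        groups = [[rng.gauss(0.0, s) for _ in range(n)] for s in sigmas]
        if lev1_med_statistic(groups) > crit:
            rejections += 1
    return rejections / reps


# Equal variances: the rejection rate estimates the test size.
size = rejection_rate([1.0, 1.0, 1.0, 1.0])
# Hypothetical alternative with variances 1, 2, 4, 8: estimates the power.
power = rejection_rate([1.0, 2.0 ** 0.5, 2.0, 8.0 ** 0.5])
print(f"estimated size  = {size:.3f}")
print(f"estimated power = {power:.3f}")
```

Under these assumptions the estimated size should fall well below the nominal .05, in line with the conservative behavior that Table 5 shows for Lev1:med at n = (5,5,5,5). The exact figures depend on the seed and the number of replications.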
as the power of the three tests mentioned previously for those same symmetric distributions. Therefore, the three tests, Lev1:med, F-K:med χ², and F-K:med F, again appear to be the best tests to use on the basis of robustness and power.

Table 6. For Double Exponential and (Dbl. Exp.)² Distributions, Proportion of Times the Null Hypothesis of Equal Variances Was Rejected by the Various Tests, Under the Null Hypothesis (test size) and (in parentheses) Under the Alternative Hypothesis (power), at α = .05

               Dbl. Exp. distribution: symmetric                           (Dbl. Exp.)²: asymmetric
Test           (5,5,5,5)      (10,10,10,10)  (20,20,20,20)  (5,5,20,20)    (10,10,10,10)  (5,5,20,20)

TABLE 1 TESTS
N-P            .316 (.553)    .339 (.713)    .316 (.836)    .333 (.801)    .876 (.912)    .883 (.933)
N-P:med        .322 (.556)    .340 (.713)    .317 (.835)    .333 (.795)    .881 (.914)    .886 (.934)
Bar            .157 (.395)    .273 (.661)    .288 (.821)    .233 (.710)    .856 (.892)    .832 (.897)
Bar:med        .164 (.397)    .275 (.659)    .292 (.822)    .240 (.697)    .860 (.891)    .830 (.898)
Coch           .154 (.386)    .232 (.592)    .214 (.762)    .324 (.721)    .723 (.773)    .798 (.856)
Coch:med       .164 (.381)    .236 (.593)    .212 (.764)    .340 (.712)    .724 (.777)    .799 (.855)
Hart           .134 (.345)    .248 (.624)    .264 (.806)    .460 (.815)    .845 (.888)    .908 (.950)
Hart:med       .139 (.348)    .252 (.629)    .267 (.806)    .457 (.807)    .849 (.888)    .906 (.950)
Bar3           .161 (.402)    .275 (.661)    .288 (.821)    .236 (.711)    .856 (.892)    .833 (.897)
Bar3:med       .170 (.401)    .275 (.659)    .293 (.822)    .243 (.699)    .862 (.892)    .834 (.898)
Sam            .135 (.364)    .261 (.653)    .284 (.819)    .213 (.670)    .853 (.888)    .808 (.884)
Sam:med        .145 (.365)    .265 (.650)    .285 (.820)    .231 (.663)    .853 (.887)    .805 (.885)
Leh1           .275 (.504)    .315 (.687)    .314 (.831)    .317 (.790)    .868 (.907)    .866 (.929)
Leh1:med       .278 (.507)    .313 (.689)    .315 (.828)    .314 (.779)    .874 (.910)    .872 (.930)
Leh2           .404 (.620)    .361 (.728)    .334 (.841)    .357 (.809)    .888 (.921)    .883 (.939)
Leh2:med       .401 (.627)    .366 (.727)    .341 (.840)    .356 (.803)    .889 (.924)    .886 (.939)

TABLE 2 TESTS
Bar1           .450 (.553)    .273 (.641)    .169 (.727)    .179 (.605)    .696 (.766)    .439 (.557)
Bar1:med       .238 (.470)    .199 (.563)    .144 (.701)    .133 (.546)    .551 (.654)    .332 (.454)
Bar2           .047 (.129)    .054 (.232)    .050 (.492)    .046 (.289)    .172 (.230)    .100 (.153)
Bar2:med       .010 (.041)    .016 (.165)    .033 (.440)    .020 (.229)    .071 (.099)    .024 (.073)
Sch1           .470 (.671)    .313 (.668)    .190 (.739)    .254 (.717)    .758 (.819)    .598 (.759)
Sch1:med       .325 (.548)    .250 (.603)    .170 (.718)    .210 (.663)    .652 (.741)    .514 (.693)
Sch2           .167 (.298)    .101 (.322)    .079 (.511)    .116 (.486)    .361 (.453)    .284 (.468)
Sch2:med       .087 (.176)    .069 (.243)    .058 (.460)    .082 (.416)    .249 (.310)    .188 (.355)

TABLE 3 TESTS
Lev1           .097 (.268)    .077 (.415)    .068 (.645)    .087 (.396)    .473 (.579)    .384 (.420)
Lev1:med       .008 (.051)    .033 (.291)    .039 (.591)    .035 (.325)    .048 (.092)    .060 (.057)
Lev2           .057 (.155)    .048 (.266)    .040 (.524)    .079 (.194)    .074 (.115)    .149 (.077)
Lev2:med       .010 (.051)    .024 (.184)    .027 (.473)    .048 (.143)    .012 (.024)    .078 (.027)
Lev3           .098 (.229)    .077 (.326)    .078 (.498)    .087 (.404)    .741 (.805)    .729 (.836)
Lev4           .121 (.290)    .093 (.419)    .082 (.630)    .092 (.458)    .715 (.803)    .688 (.797)
Lev4:med       .000 (.008)    .045 (.306)    .041 (.562)    .047 (.413)    .145 (.226)    .099 (.199)
Mill           .046 (.136)    .067 (.319)    .087 (.537)    .107 (.419)    .195 (.240)    .214 (.291)

Table 6 (Continued)

               Dbl. Exp. distribution: symmetric                           (Dbl. Exp.)²: asymmetric
Test           (5,5,5,5)      (10,10,10,10)  (20,20,20,20)  (5,5,20,20)    (10,10,10,10)  (5,5,20,20)

TABLE 4 TESTS
Mood χ²        .080 (.221)    .087 (.372)    .065 (.592)    .082 (.424)    .919 (.936)    .855 (.892)
Mood F         .121 (.275)    .095 (.388)    .069 (.598)    .090 (.440)    .928 (.945)    .863 (.899)
Mood:med χ²    .003 (.027)    .036 (.262)    .041 (.523)    .036 (.356)    .573 (.670)    .505 (.668)
Mood:med F     .009 (.048)    .041 (.275)    .045 (.533)    .039 (.373)    .597 (.687)    .528 (.682)
F-A-B χ²       .091 (.195)    .082 (.325)    .068 (.536)    .071 (.403)    .908 (.925)    .829 (.862)
F-A-B F        .112 (.240)    .094 (.349)    .077 (.546)    .082 (.420)    .913 (.935)    .834 (.869)
F-A-B:med χ²   .000 (.000)    .045 (.228)    .035 (.447)    .043 (.368)    .562 (.660)    .474 (.679)
F-A-B:med F    .000 (.000)    .053 (.245)    .036 (.454)    .048 (.385)    .575 (.680)    .494 (.694)
Klotz χ²       .072 (.223)    .077 (.388)    .070 (.629)    .079 (.390)    .907 (.923)    .838 (.865)
Klotz F        .105 (.273)    .082 (.410)    .075 (.637)    .085 (.408)    .923 (.934)    .849 (.877)
Klotz:med χ²   .012 (.061)    .039 (.286)    .037 (.575)    .045 (.330)    .516 (.615)    .483 (.595)
Klotz:med F    .016 (.081)    .044 (.303)    .039 (.584)    .050 (.345)    .537 (.641)    .512 (.616)
S-R χ²         .087 (.241)    .086 (.386)    .069 (.599)    .081 (.445)    .842 (.891)    .837 (.909)
S-R F          .115 (.289)    .097 (.408)    .071 (.607)    .087 (.460)    .851 (.902)    .846 (.915)
S-R:med χ²     .000 (.010)    .031 (.250)    .042 (.526)    .029 (.355)    .254 (.342)    .145 (.312)
S-R:med F      .003 (.029)    .034 (.269)    .042 (.536)    .032 (.371)    .262 (.365)    .153 (.326)
F-K χ²         .058 (.214)    .067 (.387)    .063 (.632)    .074 (.383)    .660 (.755)    .632 (.711)
F-K F          .086 (.263)    .076 (.405)    .067 (.639)    .077 (.401)    .677 (.768)    .651 (.729)
F-K:med χ²     .005 (.040)    .026 (.274)    .033 (.581)    .032 (.317)    .099 (.195)    .076 (.152)
F-K:med F      .011 (.063)    .030 (.293)    .036 (.588)    .037 (.331)    .112 (.210)    .080 (.160)
T-G χ²         .095 (.217)    .095 (.342)    .070 (.545)    .076 (.429)    .847 (.890)    .845 (.920)
T-G F          .122 (.264)    .099 (.364)    .072 (.554)    .082 (.446)    .863 (.899)    .858 (.924)
T-G:med χ²     .000 (.000)    .039 (.222)    .033 (.447)    .037 (.358)    .364 (.458)    .251 (.439)
T-G:med F      .000 (.000)    .047 (.237)    .034 (.457)    .043 (.376)    .382 (.480)    .266 (.451)

4. APPLICATION TO THE LPR DATA BASE

Since 1954 the United States government has periodically held sales in which offshore leases have been offered to the highest bidder for the production of oil and gas. The lease, production, and revenue (LPR) data base includes detailed information on the bids submitted, as well as the yearly production and revenue data on each lease. Our interest is in the bids submitted on the various leases within each sale. Often, the lognormal distribution is used to model these bids (Dougherty and Lohrenz 1976). If it is reasonable to assume that the variance of the log of the bids on each lease is constant within a sale, then the scale parameter of the lognormal distribution can be estimated using all the bids in the sale.

The bids in 40 sales were examined. These included all the sales held from October 13, 1954 to October 27, 1977, which is the date of the last sale recorded in the data base at the time of this study. We considered only leases within a sale receiving two or more bids on the lease. The 40 sales averaged about 50 leases per sale, with a range from 5 to 133. Although some of the leases have as many as 12 or 13 bids, small numbers of bids are the general rule, with about half of the leases examined having only two bids submitted on them.

For example, the sale held on July 21, 1970 was the 20th sale in chronological sequence. It had 13 leases that received two or more bids apiece. A special simulation study for this number of leases, with the same sample sizes, was reported in Table 5 and mentioned in Section 3. Some of the tests for variances rejected the null hypothesis over 70 percent of the time even though the normal distribution was used in the simulation and H0 was true. It is useless to consider such tests for real data, since the results of such tests would be meaningless. Therefore, the results of only those tests that had well-controlled Type I error rates in the simulation study are examined in this section. This includes the five tests that had estimated test sizes less than .112 in all cases described in Section 3. For each of the five test statistics in each of the 40 sales, the P values were obtained by referring to the appropriate chi-squared or F distribution. If H0 is true these P values should be uniform on (0, 1), but if H0 is false they should tend to be smaller. For each test, the 40 P values were summed and normalized by subtracting 20 and dividing by √(40/12), which are the mean and standard deviation of a sum of 40 independent uniform (0, 1) variables. The results appear in Table 7, column (2). Column (3) in Table 7 is simply the overall P value obtained by comparing the statistic in column (2) with the standard normal distribution.

For all five tests the overall P value is well above 5 percent, clearly indicating that the null hypothesis of equal variances should be accepted. In fact, for the two tests Bar2:med and Lev2:med, the overall P value is in the opposite tail of the distribution, suggesting that the asymptotic approximations used in those tests may be too conservative. This could also explain the well-controlled Type I error rate and the low power in the simulation study of Section 3 for those two tests. The three tests, Lev1:med, F-K:med χ², and F-K:med F, do not exhibit this weakness. They all have overall P values that do in fact resemble observations on a uniformly distributed random variable. Again, the same three tests show the same desirable properties.

It was mentioned previously that if H0 is true, the P values should be uniform on (0, 1). A Kolmogorov goodness-of-fit test was used on the 40 P values to see how well they agreed with the uniform distribution. The test statistics for Lev1:med, F-K:med χ², and

Table 7. Summary of P Values for 5 Tests, 40 Applications Each

(1) Test        (2) Standardized P-value sum    (3) P value of col. (2)
Bar2:med         6.109                          1.000
Lev1:med        -0.530                           .298
Lev2:med         2.998                           .999
F-K:med χ²       0.766                           .778
F-K:med F        1.034                           .849
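The normalization of the summed P values described in Section 4, and the Kolmogorov statistic for agreement with the uniform distribution, can be sketched in a few lines of Python. The function names are hypothetical, and the only inputs taken from the paper are the column (2) values of Table 7. The sketch assumes that the overall P value is the lower-tail standard normal probability, which is consistent with the statement that P values tend to be smaller when H0 is false.

```python
import math


def standardized_sum(p_values):
    """Sum of m P values, centered at m/2 and scaled by sqrt(m/12)."""
    m = len(p_values)
    return (sum(p_values) - m / 2) / math.sqrt(m / 12)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kolmogorov_statistic(p_values):
    """Largest gap between the empirical CDF and the uniform (0, 1) CDF."""
    xs = sorted(p_values)
    m = len(xs)
    d_plus = max((i + 1) / m - x for i, x in enumerate(xs))
    d_minus = max(x - i / m for i, x in enumerate(xs))
    return max(d_plus, d_minus)


# Column (2) of Table 7; column (3) is assumed to be the lower-tail probability.
table7_col2 = {
    "Bar2:med": 6.109,
    "Lev1:med": -0.530,
    "Lev2:med": 2.998,
    "F-K:med chi-squared": 0.766,
    "F-K:med F": 1.034,
}
overall_p = {name: normal_cdf(z) for name, z in table7_col2.items()}
for name, p in overall_p.items():
    print(f"{name:20s} {p:.3f}")
```

Applied to column (2) of Table 7, `normal_cdf` gives values that agree with column (3) to within rounding. For a sale-by-sale analysis, `standardized_sum` and `kolmogorov_statistic` would each take the list of 40 P values for one test.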