356 W. J. CONOVER, MARK E. JOHNSON, AND MYRLE M. JOHNSON

normal, and double exponential distributions. Uniform random numbers were simulated using CDC's uniform generator RANNUM, which is a multiplicative congruential generator type. The normal and double exponential variates were obtained from the respective inverse cumulative distribution functions. Four samples were drawn with respective sample sizes $(n_1, n_2, n_3, n_4)$ = (5, 5, 5, 5), (10, 10, 10, 10), (20, 20, 20, 20), and (5, 5, 20, 20). The null hypothesis of equal variances (all equal to 1) was examined along with the four alternatives $(\sigma_1^2, \sigma_2^2, \sigma_3^2, \sigma_4^2)$ = (1, 1, 1, 2), (1, 1, 1, 4), (1, 1, 1, 8), and (1, 2, 4, 8). The mean was set equal to the standard deviation in each population under the alternative hypothesis. Zero means were used for $H_0$. Each of these 60 combinations of distribution type, sample size, and variances was repeated 1,000 times, so that the 56 test statistics mentioned in Section 2 were computed and compared with their 5 percent and 1 percent nominal critical values 60,000 times each.
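As a rough illustration of the design just described, the following Python sketch enumerates the 60 combinations and draws one set of four samples by inverse-CDF sampling. It is not the authors' code: Python's `random` module stands in for RANNUM, all function and variable names are hypothetical, and the standardization of each parent distribution to zero mean and unit variance before scaling is an assumption of this sketch.

```python
import itertools
import math
import random
from statistics import NormalDist

DISTRIBUTIONS = ("uniform", "normal", "double_exponential")
SAMPLE_SIZES = [(5, 5, 5, 5), (10, 10, 10, 10), (20, 20, 20, 20), (5, 5, 20, 20)]
# Variances: the null case followed by the four alternatives from the text.
VARIANCES = [(1, 1, 1, 1), (1, 1, 1, 2), (1, 1, 1, 4), (1, 1, 1, 8), (1, 2, 4, 8)]
N_REPETITIONS = 1000


def standard_variate(dist, u):
    """Map u in (0, 1) to a zero-mean, unit-variance variate via the inverse CDF."""
    if dist == "uniform":
        return math.sqrt(3.0) * (2.0 * u - 1.0)  # U(-sqrt(3), sqrt(3)), assumed scaling
    if dist == "normal":
        return NormalDist().inv_cdf(u)
    if dist == "double_exponential":
        b = 1.0 / math.sqrt(2.0)  # Laplace scale b gives variance 2 * b**2 = 1
        return -b * math.copysign(1.0, u - 0.5) * math.log(1.0 - 2.0 * abs(u - 0.5))
    raise ValueError(f"unknown distribution: {dist}")


def open_uniform(rng):
    """Uniform draw on the open interval (0, 1), so the inverse CDFs stay finite."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def draw_samples(dist, sizes, variances, rng):
    """Draw one set of samples for a given distribution, sample sizes, and variances."""
    is_null = len(set(variances)) == 1
    samples = []
    for n, var in zip(sizes, variances):
        sigma = math.sqrt(var)
        # Zero means under the null; mean equal to the standard deviation otherwise.
        mu = 0.0 if is_null else sigma
        samples.append(
            [mu + sigma * standard_variate(dist, open_uniform(rng)) for _ in range(n)]
        )
    return samples


combinations = list(itertools.product(DISTRIBUTIONS, SAMPLE_SIZES, VARIANCES))
total_evaluations = len(combinations) * N_REPETITIONS  # per test statistic

example = draw_samples("double_exponential", (5, 5, 20, 20), (1, 2, 4, 8), random.Random(0))
```

Here `len(combinations)` is 60 and `total_evaluations` is 60,000, matching the counts in the text; `example` holds four lists of lengths 5, 5, 20, and 20.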
The observed frequency of rejection of the null hypothesis is reported in Table 5 for normal distributions and in Table 6 for double exponential distributions. The figures in parentheses in those tables represent the averages over the four variance combinations under the alternative hypothesis. The standard errors of all entries in Tables 5 and 6 are less than .016. The results for the uniform distribution are not reported here to save space. A table with the results for the uniform distribution is available from the authors on request.

The corresponding figures for the asymmetric case were obtained by squaring the random variables obtained in the symmetric case to obtain highly skewed and extremely leptokurtic distributions. To be more specific, we used $\sigma_i X_{ij}^2 + \mu_i$ rather than $(\sigma_i X_{ij} + \mu_i)^2$, where $X_{ij}$ represents the null distributed random variable, because the latter transformation does not allow as much control over means and variances as does the former. The three distributions (uniform)$^2$, (normal)$^2$, and (double exponential)$^2$, in combination with two sample sizes (10, 10, 10, 10) and (5, 5, 20, 20) and the five variance combinations (the null case and four alternatives, as before) gave a total of 30 combinations. For each combination, 1,000 repetitions were run for each of the 56 test statistics. The average frequency of rejection, averaged over the four variance combinations under the alternative, is also presented in Tables 5 and 6.

The columns in Tables 5 and 6 represent the various sample sizes under symmetric and asymmetric distributions. For convenience, the nonsymmetric distributions are simply called asymmetric, although this is not meant to imply that the simulation results are attributable to the skewness of those distributions rather than to the extreme leptokurtic nature of those same asymmetric distributions. The seventh column in Table 5 represents a special study chosen to resemble the application situation described in Section 4.
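Returning to the asymmetric transformation described above, the short Python sketch below gives one possible reading of why the form "sigma times X squared, plus mu" offers more control than squaring the shifted and scaled variable. Under the first form the variance is sigma squared times Var(X squared) whatever the value of mu, whereas under the second form the variance also changes with mu. The function names and the sample values are hypothetical and serve only as an illustration; they are not data from the study.

```python
import math
from statistics import pvariance


def skew_used(xs, sigma, mu):
    """sigma * X**2 + mu: the form the text says was used."""
    return [sigma * x * x + mu for x in xs]


def skew_rejected(xs, sigma, mu):
    """(sigma * X + mu)**2: the form the text says was not used."""
    return [(sigma * x + mu) ** 2 for x in xs]


# Arbitrary illustrative values standing in for null-distributed draws.
xs = [-1.5, -0.5, 0.0, 0.25, 1.0, 2.0]
base = pvariance([x * x for x in xs])  # variance of X**2

# Form used: the variance is sigma**2 * Var(X**2), independent of mu.
var_used_mu0 = pvariance(skew_used(xs, 2.0, 0.0))
var_used_mu5 = pvariance(skew_used(xs, 2.0, 5.0))
used_is_mu_free = math.isclose(var_used_mu0, var_used_mu5)

# Form rejected: the variance changes when mu changes.
var_rej_mu0 = pvariance(skew_rejected(xs, 2.0, 0.0))
var_rej_mu5 = pvariance(skew_rejected(xs, 2.0, 5.0))
rejected_is_mu_free = math.isclose(var_rej_mu0, var_rej_mu5)
```

With these values `used_is_mu_free` is true and `rejected_is_mu_free` is false, so in the first form the scale parameter alone sets the spread and the additive constant only shifts the mean.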
In brief, the special study drew 13 samples, with sample sizes 2 (7 samples), 3 (2 samples), 4, 7 (2 samples), and 13, from standard normal distributions. This was repeated 1,000 times and 55 test statistics (Mill cannot be computed for $n_i = 2$) were computed each time. This case was investigated to see how the tests might behave under conditions typically encountered in oil-lease-bidding data.

There are many different ways of interpreting the results of Tables 5 and 6, just as there are many ways of defining what is a "good" test as opposed to a "bad" test. We will define a test to be robust if the maximum Type I error rate is less than .10 for a 5 percent test. The four tests that qualify under this criterion, and their maximum estimated test size in parentheses, are Bar2:med (.071), Lev1:med (.060), Lev2:med (.078), and F-K:med $\chi^2$ (.099). We include F-K:med F (.112) in this group of robust tests also, because in 18 of the 19 null cases examined the estimated test size was less than .084, which is well under control. Of these five tests the second, fourth, and fifth tests appear to have slightly more power than the other two. It is interesting to note that if the qualifications for robustness are loosened somewhat to max test size $\le$ .15, only one new test is included, Lev4:med (.145). Two additional tests have max test size $\le$ .20. These are Lev2 (.163) and Bar2 (.172). The increase in the Type I error rates of Lev2 and Bar2 over Lev2:med and Bar2:med is accompanied by only a 40 percent relative increase in power. The other test, Lev4:med, has less power. Therefore, a reasonable conclusion seems to be that the five tests with max test size $\le$ .112 qualify as robust tests for variances, with the tests Lev1:med, F-K:med $\chi^2$, and its sister test F-K:med F having slightly more power than the other two. Notice the resemblance among these three tests.
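The robustness screening described above amounts to comparing each test's maximum estimated size with a threshold. The following Python sketch applies that rule to the eight maximum sizes quoted in the text; the remaining tests are omitted because their values are not given here. The function names are hypothetical, and treating the loosened limits of .15 and .20 as inclusive is an assumption that makes no difference for the quoted values.

```python
# Maximum estimated test sizes quoted in the text for a nominal 5 percent test.
MAX_SIZE = {
    "Bar2:med": 0.071,
    "Lev1:med": 0.060,
    "Lev2:med": 0.078,
    "F-K:med chi2": 0.099,
    "F-K:med F": 0.112,
    "Lev4:med": 0.145,
    "Lev2": 0.163,
    "Bar2": 0.172,
}


def strictly_robust(limit=0.10):
    """Tests whose maximum estimated size is below the limit (the stated definition)."""
    return [name for name, size in MAX_SIZE.items() if size < limit]


def newly_admitted(old_limit, new_limit):
    """Tests with old_limit < max size <= new_limit (inclusive bound is assumed)."""
    return [name for name, size in MAX_SIZE.items() if old_limit < size <= new_limit]


robust = strictly_robust()                      # the four tests under .10
admitted_at_15 = newly_admitted(0.112, 0.15)    # beyond the five already accepted
admitted_at_20 = newly_admitted(0.15, 0.20)
```

F-K:med F (.112) fails the strict rule and enters the robust group only through the exception made in the text, which is why the loosened screening starts from .112 rather than .10.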
The first, Lev1:med, uses an analysis of variance on $|X_{ij} - \tilde{X}_i|$, while the second and third, F-K:med $\chi^2$ and F-K:med F, convert $|X_{ij} - \tilde{X}_i|$ to ranks and then to normal type scores, which are then subjected to either a chi-squared test or an analysis of variance F test.

Similar conclusions were drawn using $\alpha = .01$. The only tests with a reasonably well-controlled test size are the same five tests that were selected using $\alpha = .05$. On the basis of demonstrated power at $\alpha = .01$, the same three tests mentioned for $\alpha = .05$ again appear to be the best. Therefore, the number of rejections for each test at $\alpha = .01$ is not reported.

If we consider only those five cases that have symmetric distributions, there are many additional tests that qualify as robust under the above definition. The five that show the most power, in order of decreasing power, are Bar2, Klotz:med F, Klotz:med $\chi^2$, Lev4:med, and S-R:med F. However, the power of these five tests for symmetric distributions is about the same

TECHNOMETRICS©, VOL. 23, NO. 4, NOVEMBER 1981