PROCEEDINGS of the HUMAN FACTORS AND ERGONOMICS SOCIETY 49th ANNUAL MEETING-2005 2100

ESTIMATING COMPLETION RATES FROM SMALL SAMPLES USING BINOMIAL CONFIDENCE INTERVALS: COMPARISONS AND RECOMMENDATIONS

Jeff Sauro
Oracle
Denver, CO USA
jeff.sauro@oracle.com

James R. Lewis
IBM
Boca Raton, FL
jimlewis@us.ibm.com

The completion rate, the proportion of participants who successfully complete a task, is a common usability measurement. As is true for any point measurement, practitioners should compute appropriate confidence intervals for completion rate data. For proportions such as the completion rate, the appropriate interval is a binomial confidence interval. The most widely taught method for calculating binomial confidence intervals (the “Wald Method,” discussed both in introductory statistics texts and in the human factors literature) grossly understates the width of the true interval when sample sizes are small. Alternative “exact” methods over-correct the problem by providing intervals that are too conservative. This can result in practitioners unintentionally accepting interfaces that are unusable or rejecting interfaces that are usable. We examined alternative methods for building confidence intervals from small-sample completion rates, using Monte Carlo methods to sample data from a number of real, large-sample usability tests. It appears that the best method for practitioners to compute 95% confidence intervals for small-sample completion rates is to add two successes and two failures to the observed counts, then compute the confidence interval using the Wald method (the “Adjusted Wald Method”). This simple approach provides the best coverage, is fairly easy to compute, and agrees with other analyses in the statistics literature.

Introduction

Estimating completion rates with small samples is an important and challenging task. Confidence intervals are taught as an appropriate way to qualify results from small samples. The addition of confidence intervals to completion rate estimates helps both the engineer and readers of usability reports understand the variability inherent in small samples. While the importance of adding confidence intervals is widely agreed upon, the best method for computing them is not.

Most practitioners interpret a 95% confidence interval to indicate that in 95 out of 100 experiments, the interval constructed from the sample will contain the true value for the population. The extent to which this is the case for any given method of computing intervals is the “coverage” for that method.

The Wald method is the most commonly presented formula for calculating binomial confidence intervals (see Figure 1 below).

Figure 1: Wald Confidence Interval

$$\hat{p} \pm z_{\alpha/2} \sqrt{\hat{p}(1-\hat{p})/n}$$

Task completion rates are often modeled using a binomial distribution because the outcome of a task attempt is usually a binomial value (complete / didn’t complete). The Wald interval is simple to compute, has been around for some time (Laplace, 1812) and is presented in most introductory statistics texts and some writings in the human factors literature (e.g., Landauer, 1988). Unfortunately, it produces intervals that are too narrow when samples are small, especially when the completion rate is not near 50%. Under these conditions its average coverage is approximately 60%, not 95% (Agresti and Coull, 1998). This is a real problem considering that HF practitioners rely on confidence intervals to have true coverage that is equal to nominal coverage in the long run.

To improve the poor average coverage of the Wald interval, advanced statistics texts often present a more complicated method called the Clopper-Pearson or “Exact” method (see Figure 2 below).
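The Wald formula of Figure 1 can be sketched in a few lines of Python. This is only an illustration: the function name `wald_interval`, the default critical value of 1.96 for a 95% interval, and the clipping of the end points to the range 0 to 1 are assumptions of this sketch rather than details given in the paper.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Wald interval for `successes` out of `n` task attempts.

    z is the normal critical value (1.96 for a nominal 95% interval).
    End points are clipped to [0, 1], an assumption of this sketch.
    """
    p = successes / n                      # observed completion rate
    half = z * math.sqrt(p * (1 - p) / n)  # half-width of the interval
    return max(0.0, p - half), min(1.0, p + half)


# Four successes out of five users (an 80% observed completion rate).
low, high = wald_interval(4, 5)
print(f"{low:.3f} to {high:.3f}")  # prints "0.449 to 1.000"
```

Note that when every user succeeds or every user fails, the half-width is zero and the interval collapses to a single point, which is one way to see why the method behaves poorly for small samples with extreme completion rates.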
Figure 2: “Exact” / “Clopper-Pearson” Interval

$$\left[1 + \frac{n-x+1}{x\,F_{2x,\,2(n-x+1),\,1-\alpha/2}}\right]^{-1} < p < \left[1 + \frac{n-x}{(x+1)\,F_{2(x+1),\,2(n-x),\,\alpha/2}}\right]^{-1}$$

The Exact method provides more reliable confidence intervals with small samples (Clopper and Pearson, 1934) and has also been discussed in the HF literature (e.g., Lewis, 1996, and Sauro, 2004). In actual practice, however, the Exact interval produces overly conservative confidence intervals with true coverage closer to 99% when the nominal confidence is 95%. It is especially vulnerable to this overly conservative nature when sample sizes are small (n < 15) (Agresti and Coull, 1998). Thus, Exact intervals are too wide and Wald intervals are too narrow.

A third method called the “Score” interval (Wilson, 1927) is not overly conservative, and provides average coverage near 95% for nominal 95% intervals. Unfortunately, its computation is as cumbersome as the Exact method (see Figure 3 below), and it has some serious coverage problems for certain values when the completion rate is near 0 or 1 (Agresti and Coull, 1998).

Figure 3: “Score” / Approximate Interval

$$\left(\hat{p} + \frac{z_{\alpha/2}^{2}}{2n} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p}) + z_{\alpha/2}^{2}/(4n)}{n}}\right) \Big/ \left(1 + \frac{z_{\alpha/2}^{2}}{n}\right)$$

Another alternative method, named the Adjusted Wald method by Agresti and Coull (1998, based on work originally reported by Wilson, 1927), simply requires, for 95% confidence intervals, the addition of two successes and two failures to the observed counts, then uses the Wald formula to compute the 95% binomial confidence interval. Its coverage is as good as the Score method for most values of p, and is usually better when the completion rate approaches 0 or 1. The method is astonishingly simple, and has been recommended in the statistical literature (Agresti and Coull, 1998). The “add two successes and two failures” rule (that is, adding 2 to the numerator and 4 to the denominator) is derived from the critical value of the normal distribution for 95% intervals (1.96, which is approximately 2). Squaring this critical value provides the 4 for the denominator, and half of that square (about 1.92, also approximately 2) is the amount added to the numerator. For example, an observed completion rate of 80% with 10 users (8 successes and 2 failures) would be converted to 10 successes and 4 failures, and these values would then be used in the Wald formula.

Table 1 displays the four differing results for each of the interval methods for a sample of five users with four successes and one failure (80% completion rate).

Table 1: 95% confidence intervals by method for an 80% completion rate (4 successes, 1 failure)

CI Method    Low %   High %   CI Width
Exact        28.4    99.5     71.1
Score        37.6    96.4     58.8
Adj. Wald    36.5    98.3     61.8
Wald         44.9    100      55.1

As can be seen from Table 1, the different methods provide different end points and differing confidence interval widths. While one would like a narrower confidence interval (which provides less uncertainty), the interval should not be so narrow as to exclude more completion rates than expected from the stated or nominal rate. That is, a nominal 95% confidence interval should have a likelihood of 95% of containing the population parameter. The implication is clear: depending on which method the HF practitioner chooses, the boundaries presented with a completion rate can lead to different conclusions about the usability of an interface.

The Wald and Exact methods are by far the most popular ways of calculating confidence intervals. Depending on which method practitioners are using to calculate their intervals, they will either work with intervals that provide a false sense of precision (Wald method) or work with intervals that are consistently less precise than their nominal precision (Exact method). If the Adjusted Wald method can provide the best average coverage while still being relatively simple to compute (as suggested in the statistical literature, Agresti and Coull, 1998), it will provide the HF practitioner with the easiest and most precise way of computing binomial confidence intervals for small samples.
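As a rough illustration of the Adjusted Wald recipe and of the Score and Exact intervals it is compared with, the following Python sketch implements each one with the standard library only. Several things here are assumptions of the sketch and not details from the study: the function names, the clipping of end points to the range 0 to 1, and the use of a bisection search on the binomial distribution in place of F-distribution quantiles for the Exact interval. The Adjusted Wald function follows the literal "add two successes and two failures" rule, so its output may not match the Adj. Wald row of Table 1 digit for digit.

```python
import math

Z95 = 1.96  # normal critical value for nominal 95% intervals


def wald(x, n, z=Z95):
    """Wald interval for x successes in n attempts, clipped to [0, 1]."""
    p = x / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


def adjusted_wald(x, n, z=Z95):
    """Add two successes and two failures, then apply the Wald formula."""
    return wald(x + 2, n + 4, z)


def score(x, n, z=Z95):
    """Score (Wilson) interval."""
    p = x / n
    center = p + z * z / (2 * n)
    half = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    denom = 1 + z * z / n
    return (center - half) / denom, (center + half) / denom


def _binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve(f, target, increasing):
    """Bisection on [0, 1] for f(p) == target, where f is monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact(x, n, alpha=0.05):
    """Exact (Clopper-Pearson) interval, found numerically."""
    # Lower bound: the p at which P(X >= x) equals alpha / 2.
    low = 0.0 if x == 0 else _solve(
        lambda p: 1 - _binom_cdf(x - 1, n, p), alpha / 2, True)
    # Upper bound: the p at which P(X <= x) equals alpha / 2.
    high = 1.0 if x == n else _solve(
        lambda p: _binom_cdf(x, n, p), alpha / 2, False)
    return low, high


# The worked example from the text: 8 of 10 users succeed, so the
# Wald formula is applied to 10 successes out of 14 attempts.
print(adjusted_wald(8, 10))

# The Table 1 scenario: 4 of 5 users succeed.
for name, method in [("Exact", exact), ("Score", score),
                     ("Adj. Wald", adjusted_wald), ("Wald", wald)]:
    low, high = method(4, 5)
    print(f"{name:10s} {100 * low:5.1f} {100 * high:5.1f}")
```

Keeping every method behind the same `(x, n)` signature makes it straightforward to swap one interval for another when comparing their end points or their coverage.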
Method

One way to test the effectiveness of a confidence interval calculation is to take a sample many times from a larger data set and see how often the calculated confidence interval contains the actual completion rate of the data set. We took data from several tasks across five usability evaluations with completion rates between 20% and 97%. The usability analyses were performed on commercially available desktop and web-based software applications in the accounting industry. Each task had at least 49 participants, and we used these completion rates as the best estimate of the population completion rate.
Using a Monte Carlo simulation method written in Minitab, we took 10,000 unique random samples of 5, 10 and 15 completion rates to test each of the confidence interval methods (Wald, Exact, Score and Adjusted Wald). We then counted, for each of the methods, how many of the 10,000 calculated intervals failed to contain the known completion rate. For example, on one sample of 5 users from a dataset with a population completion rate of 65.3%, we observed one success and four failures (a 20% completion rate). The Exact method provided a 95% confidence interval from 0.5% to 71.6%, so it did contain the true population completion rate of 65.3%. The Score method provided an interval from 3.6% to 62.5%, so it did not contain the true rate. Since we calculated nominal 95% confidence intervals, we expect coverage of 95%. In other words, about 9,500 of the 10,000 intervals computed during a Monte Carlo simulation should contain the true value.

A Note on the Methodology

We could have chosen any hypothetical completion rates to test the confidence intervals (as is often the case in the statistical literature) but we used values from a known large-sample usability study so as to focus our analysis on likely completion rates for commercially available software. While the HF practitioner usually doesn’t know ahead of time what the population completion rate is, this exercise allowed us to work backwards to see how well the smaller samples predicted the known completion rates. We were in essence running 10,000 usability evaluations with small samples, calculating the confidence interval with the different methods, and seeing how many times the known completion rate was contained within the intervals. While a sample size of 49 may not seem large enough to test 10,000 combinations of completion rates, even this modest sample size contains about 2 million unique combinations of five users.
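The sampling procedure described above can be sketched in Python. This is not the Minitab program used in the study: the population of 49 hypothetical users with 32 successes (a completion rate of about 65.3%), sampling without replacement, the fixed seed, and all function names are assumptions made for illustration, and only the Wald and Adjusted Wald intervals are compared to keep the example short.

```python
import math
import random


def wald(x, n, z=1.96):
    """Wald interval for x successes in n attempts, clipped to [0, 1]."""
    p = x / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


def adjusted_wald(x, n, z=1.96):
    """Add two successes and two failures, then apply the Wald formula."""
    return wald(x + 2, n + 4, z)


def coverage(interval, population, sample_size, trials=10_000, seed=1):
    """Share of small-sample intervals that contain the known rate."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    true_rate = sum(population) / len(population)
    hits = 0
    for _ in range(trials):
        # Draw a small sample of users without replacement and count
        # how many of them completed the task.
        x = sum(rng.sample(population, sample_size))
        low, high = interval(x, sample_size)
        hits += low <= true_rate <= high
    return hits / trials


# Hypothetical large-sample task: 49 users, 32 completed it (1 = success).
population = [1] * 32 + [0] * 17

wald_cov = coverage(wald, population, 5)
adj_cov = coverage(adjusted_wald, population, 5)
print(f"Wald coverage:          {100 * wald_cov:.1f}%")
print(f"Adjusted Wald coverage: {100 * adj_cov:.1f}%")
```

Under these assumptions the Wald coverage should fall well short of the nominal 95% while the Adjusted Wald coverage should land close to it, which is the kind of comparison the simulation was designed to make.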
Results

Table 2 contains the results of Monte Carlo simulations for nine tasks with varying completion rates (e.g., 91.8%, 93.8%, etc.) for sample sizes of 5, 10 or 15. As expected, the Wald interval provided the worst coverage, only containing the actual proportion 10% of the time for the task with a 97.8% completion rate and 5 users. To find this value, start with the Wald rows at the bottom of Table 2. Next, find the intersection with the completion rate of 97.8% (the rightmost column). The first value (10.06, in the 5-user row) means that 10.06% of the calculated intervals contained the true values using the Wald method with a sample of 5 users (the second and third values, in the rows below it, are for 10 and 15 user samples respectively). For the Wald method to be a legitimate method to apply to these types of data, one would expect this value to be approximately 95%. Even at the less extreme completion rate of 85.7%, the Wald interval only contained the true value about half of the time (53.75%), a far cry from the 95% many practitioners would have expected from a nominal 95% confidence interval calculation.

The Exact interval showed the expected conservative coverage with many of the nominally 95% confidence intervals capturing over 99% of the 10,000 completion rates (see especially the completion rates above 90% in Table 2). The Adjusted Wald and Score methods provided average coverage closest to the 95% nominal level, which confirms earlier recommendations in the statistical literature (Agresti and Coull, 1998).

Table 2: Percent coverage for nine task completion rates by confidence interval method and number of users. Expected coverage is 95.0. Values are derived from sampling 5, 10 or 15 completion rates (or hypothetical users) 10,000 times.

                         Observed Task Completion Rate
CI Method       Users   20.4%   42.9%   61.2%   65.3%   77.6%   85.7%   91.8%   93.8%   97.8%
Exact             5     99.5    98.74   99.11   99.73   99.34   98.55   99.78   99.88   100
                 10     99.72   98.93   98.96   97.73   99.60   99.81   99.86   99.35   100
                 15     97.73   99.02   99.68   99.81   98.88   99.70   100     100     100
Adjusted Wald     5     94.98   98.74   99.11   96.05   93.48   98.55   95.40   97.50   89.94
                 10     98.23   98.93   96.54   97.73   96.89   97.46   97.50   99.35   100
                 15     99.36   99.02   98.92   97.89   97.96   97.88   99.43   97.38   100
Score             5     94.98   93.50   91.47   96.05   93.48   98.55   95.40   97.50   89.94
                 10     98.23   96.87   96.54   97.73   91.17   97.46   97.50   99.35   100
                 15     97.73   99.02   97.70   97.89   97.96   97.88   99.43   97.38   100
Wald              5     69.35   84.93   85.70   84.84   73.10   53.75   35.93   28.30   10.06
                 10     92.01   96.87   93.26   91.66   93.88   81.80   60.20   51.77   20.74
                 15     88.11   96.46   97.70   94.82   92.04   92.87   77.61   67.15   30.53

The
mean and standard deviation of the coverage for each of the methods appear in Table 3.

Table 3: Average coverage by confidence interval method (n = 27 for each cell). Expected mean is 95.00.

CI Method    Mean %   SD
Exact        99.39    0.64
Score        97.56    2.17
Adj. Wald    96.69    2.68
Wald         72.06    26.43

Discussion

The Monte Carlo simulations show that the Adjusted Wald method provides the coverage closest to 95%. An additional advantage of the Adjusted Wald method is its ease of calculation. Thus, HF practitioners should use the Adjusted Wald method to calculate confidence intervals for small-sample completion rates. This can be accomplished by simply adding two successes and two failures to their observed sample, then computing a 95% confidence interval using the standard Wald method. If a practitioner needs a higher level of confidence than 95%, then he or she should replace the 2 and the 4 with values based on the appropriate Z-critical value. In the general form given by Agresti and Coull (1998), half of the squared critical value is added to the number of successes and the full squared critical value is added to the number of trials; for 1.96 these are about 1.92 and 3.84, which round to 2 and 4. For example, a 99% confidence interval would use the Z-critical value of 2.58. The confidence interval would then be calculated by adding about 3.32 to the successes and 6.63 to the trials (that is, about 3.32 successes and 3.32 failures), and using 2.58 in the Wald formula.

The Score method provided coverage better than the Exact and Wald methods but fell short of the Adjusted Wald method. Additionally, its drawback is its computational difficulty and its poor coverage for some values when the population completion rate is around 98% or 2%, regardless of sample size (Agresti and Coull, 1998). The only advantage in using the Score method is that it provides more precise endpoints when the ends of the intervals are close to 0 or 1. For some values (e.g., 9/10) the Adjusted Wald’s crude intervals go beyond 1 and a substitution of >.999 is used. For the Score method, however, the upper interval is calculated as a more precise .9975.

The Exact method was designed to guarantee at least 95% coverage, whereas approximate methods (such as the Adjusted Wald) provide an average coverage of 95% in the long run.
HF practitioners should use the Exact method when they need to be sure they are calculating a 95% or greater interval, erring on the conservative side. For example, at the population completion rate of 97.8%, both the Score and Adjusted Wald methods had actual coverage that fell to 89% (see Table 2 above). When the risk of this level of actual coverage is unacceptable for an application, the Exact method provides the necessary coverage guarantee.

The Wald method should be avoided when calculating confidence intervals for completion rates with sample sizes less than 100. Its coverage is too far from the nominal level to provide a reliable estimate of the population completion rate. As the sample size increases above 100, all four methods converge to similar intervals. A calculator for all four methods is available online at http://www.measuringusability.com/wald.htm.

When All Users Pass or Fail

With small sample sizes, it is common for all users in the sample to complete a task (100% completion rate) or for all to fail the task (0% completion rate). For these scenarios, it is often unpalatable to report 100% or 0%. After all, how likely is it that the true population parameter is as extreme as 100% or 0%? One alternative is to use the midpoint of the binomial confidence interval derived from the Adjusted Wald method as the point estimate (called the Wilson Point Estimator). For example, if 15 out of 15 users complete a task, the mid-point of the Adjusted Wald method provides a 94.01% completion rate. While this value may seem too far from the observed 100%, its attractiveness is that it is a function of the sample size: the greater the sample size, the closer this value will be to 100%. Whether this method provides a consistent advantage in improving the accuracy of point estimates is a topic for future research.
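
The gap between nominal and actual coverage discussed above can be explored for any single combination of sample size and population completion rate. The following Python sketch is not the Monte Carlo procedure used in this study; as a simpler stand-in, it computes coverage exactly by summing the binomial probabilities of every outcome whose interval contains the population rate. The function names and the example values (n = 10, population rate of 90%) are illustrative assumptions, and the adjustment uses z²/2 rather than the rounded value of 2.

```python
from math import comb, sqrt
from statistics import NormalDist


def wald(x, n, z):
    """Standard Wald interval for x successes in n trials."""
    p = x / n
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin


def adjusted_wald(x, n, z):
    """Wald interval after adding z^2/2 successes and z^2/2 failures."""
    return wald(x + z**2 / 2, n + z**2, z)


def coverage(method, n, p, confidence=0.95):
    """Probability that the interval built by `method` contains the true rate p.

    Enumerates every possible number of successes (0..n) and adds up the
    binomial probability of those outcomes whose interval covers p.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    covered = 0.0
    for x in range(n + 1):
        low, high = method(x, n, z)
        if low <= p <= high:
            covered += comb(n, x) * p**x * (1 - p) ** (n - x)
    return covered


# Illustrative values only; these are not taken from the study data.
for name, method in (("Wald", wald), ("Adjusted Wald", adjusted_wald)):
    print(f"{name}: {coverage(method, n=10, p=0.9):.3f}")
```

Because the calculation enumerates outcomes instead of sampling, repeated runs give identical results, which makes it convenient for checking a single cell of a coverage table before running a full simulation.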
Conclusion

There is a strong need to continue to encourage HF practitioners to include confidence intervals when reporting estimates of completion rates. Because the Adjusted Wald method is just a slight modification to the widely-taught Wald method, it should be easy to teach with other basic statistics without overwhelming students. Confidence intervals are a way to build a reasonable boundary to capture unknown population completion rates. For a 95% confidence interval, "reasonable boundary" means a 5% chance of not containing the population completion rate over repeated samples. "Reasonable boundary" is not a 1% chance and certainly not a 40% chance, the typical rates obtained when using the Exact or Wald methods, respectively, to generate binomial confidence intervals. To use the Adjusted Wald interval, the HF practitioner can use their own software, a spreadsheet calculation, or the calculator at http://www.measuringusability.com/wald.htm, which also computes the Exact, Score and Wald intervals for comparison.
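
For practitioners who prefer their own software, the Adjusted Wald calculation might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code behind the online calculator: the function name `adjusted_wald_interval` is hypothetical, the adjustment uses z²/2 (about 1.92 at 95%) instead of the rounded value of 2, and endpoints that fall outside the 0 to 1 range are simply clipped to 0 or 1.

```python
from math import sqrt
from statistics import NormalDist


def adjusted_wald_interval(successes, n, confidence=0.95):
    """Adjusted Wald interval for a completion rate.

    Adds z^2/2 successes and z^2/2 failures to the observed sample, then
    applies the standard Wald formula to the adjusted counts.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_adj = n + z**2                        # adjusted sample size
    p_adj = (successes + z**2 / 2) / n_adj  # adjusted completion rate
    margin = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    # Assumption: clip endpoints that the crude interval pushes past 0 or 1.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


# Example: 9 of 10 participants completed the task.
low, high = adjusted_wald_interval(9, 10)
print(f"95% interval: {low:.3f} to {high:.3f}")

low99, high99 = adjusted_wald_interval(9, 10, confidence=0.99)
print(f"99% interval: {low99:.3f} to {high99:.3f}")
```

Passing a different `confidence` value changes the Z-critical value and therefore the amount added to the sample, so the same function covers confidence levels other than 95%.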
Acknowledgements

We'd like to thank Lynda Finn of Statistical Insight for providing the Monte Carlo macro in Minitab and assistance with interpreting the statistical literature. We'd also like to thank Erika Kindlund of Intuit for providing the large sample completion rates.

References

Agresti, A., and Coull, B. (1998). Approximate is better than 'exact' for interval estimation of binomial proportions. The American Statistician, 52, 119-126.

Clopper, C. J., and Pearson, E. (1934). The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26, 404-413.

Landauer, T. K. (1997). Behavioral research methods in human-computer interaction. In M. Helander, T. K. Landauer, and P. Prabhu (Eds.), Handbook of Human-Computer Interaction (pp. 203-227). Amsterdam, Netherlands: North Holland.

Laplace, P. S. (1812). Théorie analytique des probabilités. Paris, France: Courcier.

Lewis, J. R. (1996). Binomial confidence intervals for small sample usability studies. In G. Salvendy and A. Ozok (Eds.), Advances in Applied Ergonomics: Proceedings of the 1st International Conference on Applied Ergonomics, ICAE '96 (pp. 732-737). Istanbul, Turkey: USA Publishing.

Sauro, J. (2004). Restoring confidence in usability results. From Measuring Usability, article retrieved Jan 2005 from http://www.measuringusability.com/conf_intervals.htm

Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22, 209-212.