Figure 8: Classification performance comparison for the “B” and the “B+C” model in terms of ERA0.2

the “B” model in terms of the median CE0.2. On average, the “B+C” model leads to about 50.0% improvement over the “B” model in terms of the median CE0.2. The Wilcoxon signed-rank test p-values are very significant (< 0.001).
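The comparisons above rest on two pieces of statistical machinery: a paired Wilcoxon signed-rank test for the p-values, and, judging by the |δ| columns in the tables (an assumption on our part, since the metric is defined outside this excerpt), Cliff's delta for effect size. The following is a minimal pure-Python sketch of both on hypothetical per-run scores, not the study's actual analysis code:

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided paired Wilcoxon signed-rank test (normal approximation).

    Returns (W+, p). Zero differences are dropped and tied |d| values
    share their average rank, as in the standard formulation.
    """
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j < n and abs(d[order[j]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j + 1) / 2.0          # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = avg
        i = j
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    mu = n * (n + 1) / 4.0               # mean of W+ under the null hypothesis
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mu) / sigma
    return w_plus, math.erfc(abs(z) / math.sqrt(2.0))

def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-pairs; range [-1, 1]."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def magnitude(delta):
    """Conventional |delta| thresholds (Romano et al.)."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"

# Hypothetical per-run scores for one system (NOT the study's data):
b_scores = [0.10 + 0.002 * i for i in range(30)]   # baseline "B" model
bc_scores = [s + 0.07 for s in b_scores]           # "B+C" uniformly better
w, p = wilcoxon_signed_rank(bc_scores, b_scores)
print(p < 0.001)                                   # → True: consistent improvement
print(magnitude(cliffs_delta(bc_scores, b_scores)))  # → large (complete separation)
```

With a consistent improvement across all runs, W+ equals its maximum n(n+1)/2 and the p-value is far below 0.001, mirroring the pattern reported for the “B+C” model.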
Furthermore, the effect sizes are moderate to large except in GLIB, where the effect size is small. The core observation is that, from the viewpoint of practical application, the “B+C” model has a substantially better ranking performance than the “B” model.

(2) Classification performance comparison

Figure 8 employs box-plots to describe the distributions of ERA0.2 obtained from 30 times 3-fold cross-validation for the “B” and the “B+C” models with respect to each of the subject systems. From Figure 8, we can find that the “B+C” models are also substantially better than the “B” model.

Table 10: Classification comparison in terms of ERA0.2: the “B” model vs the “B+C” model

System   B      B+C    %↑        |δ|
BASH     0.098  0.211  115.20%   0.644 √
GCC      0.148  0.210   41.80%   0.816 √
GIMP     0.059  0.124  112.60%   0.940 √
GLIB     0.110  0.135   22.80%   0.462 √
GSTR     0.132  0.189   43.30%   0.551 √
Average  0.109  0.174   67.10%   0.683

Table 10 presents the classification performance for the “B” and the “B+C” models in terms of ERA0.2. For all systems, the “B+C” model has a larger median ERA0.2 than the “B” model. On average, the “B+C” model leads to about 67.1% improvement over the “B” model. The p-values are very significant (< 0.001). Besides, the effect sizes are moderate to large. The core observation is that, from the viewpoint of practical application, the “B+C” model has a substantially better classification performance than the “B” model.

Overall, the above observations suggest that the “B+C” model outperforms the “B” model in effort-aware fault-proneness prediction under both the ranking and classification scenarios. This indicates that dependence clusters are actually useful in effort-aware fault-proneness prediction.

6. DISCUSSION

In this section, we further discuss our findings. First, we analyze whether our conclusions will change if the potentially confounding effect of module size is excluded for the “B” and the “B+C” models.
Then, we analyze whether we have similar conclusions if the multiplicity of dependencies is not considered.

6.1 Will our conclusions change if the potentially confounding effect of module size is excluded?

In our study, when building a fault-proneness prediction model, we did not take into account the potentially confounding effect of function size on the associations between those metrics and fault-proneness [14, 39]. Therefore, it is not readily known whether our conclusions will change if the potentially confounding effect of module size is excluded. In the following, we use the method proposed by Zhou et al. [39] to remove the confounding effect of module size and then rerun the analyses for RQ4.

Table 11: Ranking comparison in terms of CE0.2 after excluding the potentially confounding effect of module size: the “B” model vs the “B+C” model

System   B      B+C    %↑        |δ|
BASH     0.117  0.094  -19.40%   0.224
GCC      0.150  0.174   16.00%   0.520 √
GIMP     0.073  0.131   79.90%   0.928 √
GLIB     0.183  0.188    2.50%   0.041
GSTR     0.155  0.187   20.90%   0.399 √
Average  0.136  0.155   20.00%   0.333

Table 12: Classification comparison in terms of ERA0.2 after excluding the potentially confounding effect of module size: the “B” model vs the “B+C” model

System   B      B+C    %↑        |δ|
BASH     0.128  0.109  -15.40%   0.136
GCC      0.148  0.196   31.80%   0.728 √
GIMP     0.059  0.128  118.80%   0.948 √
GLIB     0.112  0.135   20.70%   0.446 √
GSTR     0.128  0.171   33.40%   0.444 √
Average  0.115  0.148   37.90%   0.486

Table 11 and Table 12 respectively present the median CE0.2 and ERA0.2 for the “B” and the “B+C” models after excluding the potentially confounding effect of module size. From Table 11 and Table 12, we find that the “B+C” models have both a larger median CE0.2 and a larger median ERA0.2 than the “B” model in all the five subject systems except BASH. This indicates that our proposed model still performs better than the baseline model in both the ranking and classification scenarios in most cases.
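Zhou et al.'s procedure itself is detailed in [39]; as a hedged illustration only (all data and names below are hypothetical, and this residual-based approach is a common way to linearize out a size confound rather than necessarily Zhou et al.'s exact method), one can regress each metric on log(size) and keep the residuals:

```python
import math

def remove_size_confound(metric, size):
    """Return residuals of a simple OLS regression of metric on log(size).

    The residual is the part of the metric not linearly explained by module
    size, so any remaining association with fault-proneness cannot be a pure
    size effect. A common residual-based approach, not necessarily the exact
    method of Zhou et al. [39].
    """
    xs = [math.log(s) for s in size]
    n = len(xs)
    mx, my = sum(xs) / n, sum(metric) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, metric))
    beta = sxy / sxx                      # OLS slope
    alpha = my - beta * mx                # OLS intercept
    return [y - (alpha + beta * x) for x, y in zip(xs, metric)]

# Hypothetical data: a metric that grows exactly with log(size) leaves
# residuals of ~0, i.e. it carries no information beyond size itself.
sizes = [10, 20, 40, 80, 160]
metric = [2 * math.log(s) + 1 for s in sizes]
print(max(abs(r) for r in remove_size_confound(metric, sizes)) < 1e-9)  # → True
```

After this transformation, the prediction models are rebuilt on the residualized metrics and the CE0.2/ERA0.2 analyses rerun, which is what Tables 11 and 12 summarize.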
Overall, after excluding the potentially confounding effect of function size, our conclusion on RQ4 is mostly the same.

6.2 Will our conclusions change if the multiplicity of dependencies is ignored?

As mentioned before, in our study we take into account the multiplicity of dependencies between functions. The multiplicity information is used as the weight of dependencies in the SDG. However, prior studies [23, 31, 41] ignored this information. Therefore, it is not readily known whether our conclusions will change if the multiplicity of dependencies is ignored. Next, we ignore the multiplicity of dependencies and rerun the analysis for RQ4.

Table 13 and Table 14 respectively summarize the median CE0.2 and the median ERA0.2 for the “B” and the “B+C” models when the multiplicity of dependencies is not considered. From Table 13 and Table 14, we observe that the “B+C” models have substantially larger median CE0.2 and median ERA0.2 than the “B” model in all the five subject systems. This indicates that our proposed model still performs substantially better than the baseline model in both the ranking and classification scenarios.

Overall, the above observations show that our conclusions