amount of available resource for inspecting functions. As aforementioned, practitioners are more interested in the ranking performance of a prediction model at the top fraction. In this study, we use the CE at the cut-off π = 0.2 (indicated as CE0.2) to evaluate the effort-aware ranking performance of a model.

Classification. We use Effort Reduction in Amount (ERA), a classification performance indicator adapted from the “ER” measure used by Zhou et al. [40], to evaluate the effort-aware classification effectiveness of a fault-proneness prediction model. In the classification scenario, only those functions predicted to be high-risk will be inspected or tested for software quality enhancement. The ERA measure denotes the amount of reduced SLOC (i.e., the amount of effort reduction) to be inspected by a model m compared with the random model that achieves the same recall of faults. Therefore, the effort-aware classification effectiveness of the prediction model m can be formally defined as follows:

ERA(m) = Effort(random) − Effort(m)

Here, Effort(m) is the ratio of the total SLOC in the predicted faulty functions to the total SLOC in the system. Effort(random) is the ratio of SLOC to inspect or test to the total SLOC in the system that a random selection model needs to achieve the same recall of faults as the prediction model m. In this paper, for the sake of simplicity, we use ERA0.2 to evaluate the effort-aware classification performance. In order to compute ERA0.2, we first use the fault-proneness predicted by the model to rank the modules in descending order. Then, we classify the top 20% of modules into the fault-prone category and the remaining 80% into the defect-free category. Finally, we compute the resulting effort-aware classification performance ERA, denoted as ERA0.2. Here, we use 20% as the cut-off value because many studies show that the distribution of fault data in a system generally follows the Pareto principle [1, 15]. The Pareto principle, also known as the 20-80 rule, states that for many phenomena, 80 percent of the consequences stem from 20 percent of the causes [22]. In our context, this means that by inspecting the 20% of functions predicted to be fault-prone, we expect that almost 80% of the faulty modules in a system will be found.
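To make the ERA0.2 computation concrete, the following is a minimal sketch of the steps just described, assuming that functions are ranked by predicted fault-proneness, that the top 20% (by number of functions) are flagged fault-prone, that Effort(m) is their SLOC share, and that Effort(random) equals, in expectation, the recall achieved by the model, since a random-selection model must inspect that share of SLOC to find the same fraction of faults. The function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def era_at_cutoff(score, sloc, is_faulty, cutoff=0.2):
    """Sketch of ERA at a given cut-off (ERA_0.2 by default)."""
    score = np.asarray(score, dtype=float)        # predicted fault-proneness
    sloc = np.asarray(sloc, dtype=float)          # SLOC per function
    is_faulty = np.asarray(is_faulty, dtype=int)  # 1 if the function is faulty

    # Rank functions by predicted fault-proneness in descending order
    order = np.argsort(-score)
    n_top = int(np.ceil(cutoff * len(score)))
    top = order[:n_top]                           # top 20% -> fault-prone class

    # Effort(m): SLOC share of the functions predicted to be fault-prone
    effort_m = sloc[top].sum() / sloc.sum()

    # Recall of faults achieved by the model at this cut-off
    recall = is_faulty[top].sum() / max(is_faulty.sum(), 1)

    # Effort(random): expected SLOC share a random selection needs to
    # reach the same recall (assumed equal to the recall itself)
    effort_random = recall

    return effort_random - effort_m               # ERA = effort reduction
```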
(2) Prediction settings. To obtain a realistic comparison, we evaluate the prediction performance under 30 times 3-fold cross-validation. We choose 3-fold cross-validation rather than 10-fold cross-validation due to the small percentage of faulty functions in the data sets. In each 3-fold cross-validation, we randomize and then divide the data set into 3 parts of approximately equal size. Then, we test each part with the prediction model built on the remainder of the data set. This process is repeated 30 times to alleviate potential sampling bias.

Note that, for each fold of the 30 times 3-fold cross-validation, we use the same training/test set to train/test our segmented model (i.e., the “B+C” model) and the baseline model (i.e., the “B” model). On each fold, we first divide the training set into two groups: functions inside dependence clusters and functions outside dependence clusters. Then, we train the “B+Cin” model and the “B+Cout” model, respectively. We also divide the test set into two groups and subsequently use the “B+Cin” model and the “B+Cout” model to predict the probability that those functions contain faults. After that, we combine the predicted values to derive the final predicted values used to compute the performance indicators.
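The cross-validation harness described above can be sketched as follows with scikit-learn, assuming X, y, and sloc are NumPy arrays and each test fold is scored with a measure such as the ERA0.2 sketch above; the logistic-regression learner is a placeholder, since this section does not restate which modeling technique is used.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import LogisticRegression  # placeholder learner

def repeated_cv_scores(X, y, sloc, n_splits=3, n_repeats=30, seed=0):
    """30 times 3-fold cross-validation; one score per test fold (90 in total)."""
    rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    scores = []
    for train_idx, test_idx in rkf.split(X):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        prob = model.predict_proba(X[test_idx])[:, 1]  # predicted fault-proneness
        # era_at_cutoff is the ERA_0.2 sketch given earlier
        scores.append(era_at_cutoff(prob, sloc[test_idx], y[test_idx]))
    return np.array(scores)
```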
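For the segmented “B+C” model, the per-fold procedure in the paragraph above can be sketched as below: the training functions are split by dependence-cluster membership, separate “B+Cin” and “B+Cout” models are fitted, and their test-set predictions are recombined into a single vector before the performance indicators are computed. The boolean in_cluster masks and the learner are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # placeholder learner

def predict_segmented(X_train, y_train, X_test, in_cluster_train, in_cluster_test):
    """One fold of a hypothetical 'B+C' scheme: separate models for functions
    inside and outside dependence clusters, combined into one prediction vector.
    Assumes both groups contain faulty and non-faulty training functions."""
    prob = np.empty(len(X_test))
    for mask_tr, mask_te in [(in_cluster_train, in_cluster_test),      # "B+Cin"
                             (~in_cluster_train, ~in_cluster_test)]:   # "B+Cout"
        sub_model = LogisticRegression(max_iter=1000)
        sub_model.fit(X_train[mask_tr], y_train[mask_tr])
        if mask_te.any():
            prob[mask_te] = sub_model.predict_proba(X_test[mask_te])[:, 1]
    return prob  # combined predictions for the whole test fold
```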
Based on these predictive effectiveness values, we use the Wilcoxon signed-rank test to examine whether two models have a significant difference in their predictive effectiveness. Then, we use the Bonferroni correction method to adjust the p-values and examine whether a difference is significant at the significance level of 0.05 [4]. Furthermore, we use Cliff's δ to examine whether the magnitude of the difference between the prediction performances of two models is important from the viewpoint of practical application [2]. Cliff's δ is widely used for median comparison. By convention, the magnitude of the difference is considered trivial (|δ| < 0.147), small (0.147–0.33), moderate (0.33–0.474), or large (> 0.474) [35].
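A minimal sketch of this comparison procedure, assuming the paired per-fold scores of the two models (e.g., CE0.2 over the 90 folds) are held in NumPy arrays and that the Bonferroni adjustment multiplies each p-value by the number of comparisons (here taken to be five, one per subject system, which is an assumption rather than a detail restated in this section):

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_models(scores_b, scores_bc, n_comparisons=5):
    """Paired comparison of the 'B' and 'B+C' per-fold scores."""
    # Wilcoxon signed-rank test on the paired fold-level scores
    _, p = wilcoxon(scores_b, scores_bc)
    p_bonf = min(p * n_comparisons, 1.0)          # Bonferroni-adjusted p-value

    # Cliff's delta: P(x > y) - P(x < y) over all score pairs
    diff = scores_bc[:, None] - scores_b[None, :]
    delta = (np.sum(diff > 0) - np.sum(diff < 0)) / diff.size

    # Conventional magnitude labels for |delta|
    magnitude = ("trivial" if abs(delta) < 0.147 else
                 "small" if abs(delta) < 0.33 else
                 "moderate" if abs(delta) < 0.474 else "large")
    return p_bonf, delta, magnitude
```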
Figure 7: Ranking performance comparison for the “B” and the “B+C” model in terms of CE0.2 (box-plots of CE0.2 for BASH, GCC, GIMP, GLIB, and GSTR).

5.4.2 Experimental result
This section presents the results with respect to the ranking and classification scenarios to answer RQ4.

(1) Ranking performance comparison
Figure 7 employs box-plots to describe the distributions of CE0.2 obtained from 30 times 3-fold cross-validation for the “B” and the “B+C” models with respect to each of the subject systems. For each model, the box-plot shows the median (the horizontal line within the box), the 25th percentile (the lower side of the box), and the 75th percentile (the upper side of the box). In Figure 7, a blue box indicates that (1) the corresponding “B+C” model performs significantly better than the “B” model according to the p-values from the Wilcoxon signed-rank test; and (2) the magnitude of the difference between the corresponding “B+C” model and the “B” model is not trivial according to Cliff's δ (i.e., |δ| ≥ 0.147).

Table 9: Ranking comparison in terms of CE0.2: the “B” model vs. the “B+C” model
System    B      B+C    %↑        |δ|
BASH      0.098  0.201  104.90%   0.688 √
GCC       0.148  0.197  33.00%    0.714 √
GIMP      0.073  0.130  78.70%    0.938 √
GLIB      0.172  0.188  9.40%     0.194 √
GSTR      0.160  0.198  24.00%    0.426 √
Average   0.130  0.183  50.00%    0.592

From Figure 7, it is obvious that the “B+C” model performs substantially better than the “B” model in each of the subject systems.
Table 9 presents the median CE0.2 for the “B” and the “B+C” models. In Table 9, the second and the third columns present the median CE0.2 for the “B” and the “B+C” model, respectively. The fourth and the fifth columns give, respectively, the percentage improvement of the “B+C” model over the “B” model and the effect size in terms of Cliff's δ. In the last column, “√” indicates that the “B+C” model has a significantly larger median CE0.2 than the “B” model according to the Wilcoxon signed-rank test. The last row in Table 9 shows the average values for the five projects.
From Table 9, we have the following observations. For all systems, the “B+C” model has a larger median CE0.2 than