of changes (around 5% ∼ 36% of all changes).

Figure 2: Overview of the three prediction settings. Panels: 10 times 10-fold cross-validation (randomized dataset, 10 folds; for i from 1 to 10, train on all folds except i and test on fold i); time-wise cross-validation (ordered dataset, n months; for i from 1 to n-5, train on folds i and i+1 and test on folds i+4 and i+5); across-project prediction (train on project i and test on project j). Each panel shows a supervised model and an unsupervised model being evaluated.

Table 3: Summarization of studied data sets

Project  Period           #change  %defect-inducing  mean LOC    #modified files
                                   change            per change  per change
BUG      08/1998-12/2006     4620  36%                37.5       2.3
COL      11/2002-07/2006     4455  14%               149.4       6.2
JDT      05/2001-12/2007    35386  14%                71.4       4.3
PLA      20/2001-12/2007    64250   5%                72.2       4.3
MOZ      01/2000-12/2006    98275  25%               106.5       5.3
POS      07/1996-05/2010    20431  20%               101.3       4.5

5. EXPERIMENTAL RESULTS

In this section, we report the experimental results.
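The time-wise setting in Figure 2 can be sketched as a small index generator. This is only an illustration of the loop the figure describes (i from 1 to n-5, train on folds i and i+1, test on folds i+4 and i+5); the function name `time_wise_splits` and the use of 1-based month indices are assumptions made here, not names from the paper.

```python
def time_wise_splits(n_months):
    """Yield (train_months, test_months) pairs of 1-based month indices.

    Hypothetical helper: follows the loop in Figure 2, where i runs from
    1 to n-5, training uses months i and i+1, and testing uses months
    i+4 and i+5.
    """
    for i in range(1, n_months - 5 + 1):
        yield (i, i + 1), (i + 4, i + 5)


# Example with an 8-month project (an arbitrary illustrative size).
splits = list(time_wise_splits(8))
for train, test in splits:
    print("train on months", train, "-> test on months", test)
```

Because the test months always come later than the training months, no future change is used to predict a past one.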
5.1 10 Times 10-fold Cross-validation

Figure 3 and Figure 4 use box-plots to describe the distributions of Popt and ACC, respectively, obtained from 10 times 10-fold cross-validation for the supervised models and the simple unsupervised models over the six data sets. For each model, the box-plot shows the median (the horizontal line within the box), the 25th percentile (the lower side of the box), and the 75th percentile (the upper side of the box). In each figure, the horizontal dotted line represents the median performance of the best supervised model.

The box-plots are drawn in blue, red, or black. A blue box indicates that: (1) the corresponding simple model performs significantly better than the best supervised model according to the Wilcoxon signed-rank test (i.e. the BH corrected p-value is less than 0.05); and (2) the magnitude of the difference between the corresponding simple model and the best supervised model is not trivial according to Cliff’s δ (i.e. |δ| ≥ 0.147). A red box indicates that: (1) the corresponding simple model performs significantly worse than the best supervised model; and (2) the magnitude of the difference between the corresponding simple model and the best supervised model is not trivial. A black box indicates that: (1) the difference between the corresponding simple model and the best supervised model is not significant; or (2) the magnitude of the difference between the corresponding simple model and the best supervised model is trivial.

From Figure 3 and Figure 4, we have the following observations. First, according to Popt, the best supervised model is the EALR model, which performs significantly better than all the other supervised models. However, the ND/EXP/REXP/SEXP unsupervised models perform similarly to the EALR model, and the NF/Entropy/LT/NDEV/AGE/NUC unsupervised models perform significantly better than the EALR model.
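The box colour rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the raw Wilcoxon signed-rank p-values are assumed to be computed elsewhere, the helper names (`bh_adjust`, `cliffs_delta`, `box_colour`) are hypothetical, and the sign of Cliff's δ is used here to decide whether the simple model is better or worse.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (x, y)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def box_colour(adj_p, delta, alpha=0.05, trivial=0.147):
    """Classify a simple model against the best supervised model.

    delta is assumed to be cliffs_delta(simple_scores, supervised_scores),
    so a positive value means the simple model tends to score higher.
    """
    if adj_p < alpha and abs(delta) >= trivial:
        return "blue" if delta > 0 else "red"
    return "black"


# Illustrative numbers only; these are not results from the paper.
adjusted = bh_adjust([0.01, 0.04, 0.03])
delta = cliffs_delta([3, 4, 5], [1, 2, 3])
print(box_colour(adjusted[0], delta))
```

A box is black whenever either condition fails, which matches the "or" in the definition of the black boxes.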
Second, according to ACC, the best supervised model is also the EALR model, which performs significantly better than the other supervised models except the RBFN model. However, the NDEV/NUC unsupervised models perform similarly to the EALR model, and the NF/Entropy/LT/AGE unsupervised models perform significantly better than the EALR model.

Figure 5 and Figure 6 present the results of the Scott-Knott test for Popt and ACC, respectively. In both figures, the y-axis is the average performance, the blue labels indicate simple unsupervised models, and the dotted lines separate the groups divided by the Scott-Knott test. All models are ordered by their mean ranks over the six projects. As can be seen, all models in the first group are unsupervised models. The best supervised model (i.e. the EALR model) is in the second group. This indicates that the unsupervised models in the first group significantly outperform the best supervised model.

Table 4 (a) and Table 5 (b) respectively summarize the median Popt and ACC for the best supervised model (i.e. the EALR model) and the best four simple unsupervised models obtained from 10 times 10-fold cross-validation. In each table, for each simple unsupervised model, we show how often it performs significantly better (denoted by “√”) or worse (denoted by “×”) than the best supervised model according to the Wilcoxon signed-rank test. The row “AVG” reports the average median over the six data sets. The row “W/T/L” reports