Here, Area(optimal) and Area(worst) are the areas under the curve corresponding to the best and the worst model, respectively. Note that both ACC and Popt are applicable to supervised models as well as unsupervised models.

3.6 Data Analysis Method

Figure 2 provides an overview of our data analysis method. As can be seen, in order to obtain an adequate and realistic assessment, we examine the three RQs under the following three prediction settings: 10 times 10-fold cross-validation, time-wise cross-validation, and across-project prediction.

10 times 10-fold cross-validation is performed within the same project. In each 10-fold cross-validation, we first randomize the data set. Then, we divide the data set into 10 parts of approximately equal size. After that, each part is used in turn as the testing data set to evaluate the effectiveness of the prediction model built on the remainder of the data set (i.e., the training data set). The entire process is repeated 10 times to alleviate possible sampling bias in the random splits. Consequently, each model has 10 × 10 = 100 prediction effectiveness values.

Time-wise cross-validation is also performed within the same project, but it takes the chronological order of changes into account. This is the method followed in [37]. For each project, we first rank the changes in chronological order according to the commit date. Then, all changes committed within the same month are grouped into the same part. Assume the changes in a project are grouped into n parts. We then build a prediction model m on the combination of part i and part i+1 and apply m to predict the changes in part i+4 and part i+5 (1 ≤ i ≤ n − 5). As such, each training set and each test set contain changes committed within two consecutive months. The reasons for this setting are four-fold. First, the release cycle of most projects is typically 6∼8 weeks [5]. Second, it ensures that each training set and the corresponding test set are separated by a gap of two months. Third, using two consecutive months ensures that each training set has enough instances, which is important for the supervised models. Fourth, it allows us to have enough runs for each project. If a project has changes spanning n months, this method produces n − 5 prediction effectiveness values for each model.
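To make this splitting scheme concrete, the following is a minimal sketch in Python. It assumes only that each change record exposes its commit date; the function and variable names are illustrative and not taken from our implementation.

```python
from collections import defaultdict

def time_wise_splits(changes, date_of):
    """Generate (train, test) index pairs for time-wise cross-validation.

    changes : list of change records (one per code change)
    date_of : function mapping a change record to its commit datetime

    Changes are grouped by calendar month; the model is trained on
    months i and i+1 and tested on months i+4 and i+5, so that every
    train/test pair is separated by a two-month gap.
    """
    # Group change indices by (year, month) of the commit date.
    by_month = defaultdict(list)
    for idx, change in enumerate(changes):
        d = date_of(change)
        by_month[(d.year, d.month)].append(idx)

    # Order the monthly parts chronologically: part 0, 1, ..., n-1.
    months = sorted(by_month)
    parts = [by_month[m] for m in months]

    n = len(parts)
    for i in range(n - 5):                    # 1 <= i <= n-5 in 1-based terms
        train = parts[i] + parts[i + 1]       # months i and i+1
        test = parts[i + 4] + parts[i + 5]    # months i+4 and i+5
        yield train, test
```

Each yielded pair corresponds to one run, so a project whose changes span n months yields n − 5 runs, matching the count stated above.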
Across-project prediction is performed across different projects: we use a model trained on one project (i.e., the training data set) to predict defect-proneness in another project (i.e., the testing data set) [33, 43]. Given n projects, this method produces n × (n − 1) prediction effectiveness values for each model. In this study, we use six subject projects, so each prediction model produces 6 × (6 − 1) = 30 prediction effectiveness values.

Note that the unsupervised models only use the change metrics in the testing data to build the prediction models. In this study, we apply the cross-validation, time-wise cross-validation, and across-project prediction settings to the unsupervised models as well. This allows the unsupervised models to use the same testing data as the supervised models, thus enabling a fair comparison of their prediction performance.

When investigating RQ1, RQ2, and RQ3, we use the Benjamini-Hochberg (BH) corrected p-values from the Wilcoxon signed-rank test to examine whether there is a significant difference in prediction effectiveness between the unsupervised and supervised models, at the significance level of 0.05 [2]. If the statistical test shows a significant difference, we then use Cliff's δ to examine whether the magnitude of the difference is important from a practical viewpoint [1]. By convention, the magnitude of the difference is considered trivial (|δ| < 0.147), small (0.147 ≤ |δ| < 0.33), moderate (0.33 ≤ |δ| < 0.474), or large (|δ| ≥ 0.474) [35].
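This comparison procedure can be sketched as follows, assuming the per-run ACC or Popt values of the models are available as paired arrays. The cliffs_delta helper is our own illustrative implementation (SciPy provides no standard routine for it), and multipletests from statsmodels performs the BH correction.

```python
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (illustrative O(n*m) version)."""
    x, y = np.asarray(x), np.asarray(y)
    greater = sum(np.sum(xi > y) for xi in x)
    less = sum(np.sum(xi < y) for xi in x)
    return (greater - less) / (len(x) * len(y))

def magnitude(delta):
    """Map |delta| to the conventional effect-size labels used above."""
    d = abs(delta)
    if d < 0.147:
        return "trivial"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "moderate"
    return "large"

def compare_models(unsup_scores, sup_scores_by_model, alpha=0.05):
    """Compare one unsupervised model against several supervised models.

    unsup_scores        : array of effectiveness values (ACC or Popt), one per run
    sup_scores_by_model : dict mapping a supervised model name to its paired scores
    """
    names = list(sup_scores_by_model)
    # Paired, two-sided Wilcoxon signed-rank test for each supervised model.
    pvals = [wilcoxon(unsup_scores, sup_scores_by_model[m]).pvalue for m in names]
    # Benjamini-Hochberg correction across the family of comparisons.
    reject, p_bh, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")

    results = {}
    for name, p, sig in zip(names, p_bh, reject):
        delta = cliffs_delta(unsup_scores, sup_scores_by_model[name])
        results[name] = (p, sig, delta, magnitude(delta) if sig else "n.s.")
    return results
```

A comparison is assigned an effect-size magnitude only when its BH-corrected p-value is below 0.05, mirroring the procedure described above.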
Furthermore, similar to Ghotra et al. [8], we use the Scott-Knott test [14, 27] to group the supervised and unsupervised prediction models and examine whether some models outperform others. The Scott-Knott test uses hierarchical cluster analysis to divide the prediction models into two groups according to their mean performance (i.e., the Popt and the ACC over the different runs for each model). If the difference between the two groups is statistically significant, Scott-Knott recursively divides each group into two further groups. The test terminates when the groups can no longer be divided into statistically distinct groups.

4. EXPERIMENTAL SETUP

In this section, we first introduce the subject projects and then describe the data sets collected from these projects.

4.1 Subject Projects

In this study, we use the same open-source subject projects as used in Kamei et al.'s study [13]. More specifically, we use the following six projects to investigate the predictive power of simple unsupervised models in effort-aware JIT defect prediction: Bugzilla (BUG), Columba (COL), Eclipse JDT (JDT), Eclipse Platform (PLA), Mozilla (MOZ), and PostgreSQL (POS). Bugzilla is a well-known web-based bug tracking system. Columba is a powerful mail management tool. Eclipse JDT is the Eclipse Java Development Tools, a set of plug-ins that adds the capabilities of a full-featured Java IDE to the Eclipse platform. Mozilla is a well-known and widely used open-source web browser. PostgreSQL is a powerful, open-source object-relational database system. As stated by Kamei et al. [13], these six projects are large, well-known, and long-lived projects that cover a wide range of domains and sizes. In this sense, it is appropriate to use these projects to investigate simple unsupervised models in JIT defect prediction.

4.2 Data Sets

The data sets from these six projects are shared by Kamei et al. and are available online. As mentioned by Kamei et al. [13], these data were gathered by combining the change information mined from the CVS repositories of these projects with the corresponding bug reports. More specifically, the data for Bugzilla and Mozilla were gathered from the data provided by the MSR 2007 Mining Challenge. The data for Eclipse JDT and Platform were gathered from the data provided by the MSR 2008 Mining Challenge. The data for Columba and PostgreSQL were gathered from the official CVS repositories.

Table 3 summarizes the six data sets used in this study. The first and second columns report the subject data set name and the period of time over which the changes were collected, respectively. The third to sixth columns report the total number of changes, the percentage of defect-inducing changes, the average LOC per change, and the number of files modified per change, respectively. As can be seen, for each data set, defects are concentrated in a small percentage of the changes.
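For readers who wish to recompute these summary statistics from the shared data, the following is a minimal sketch. It assumes one CSV file per project with columns named bug (defect-inducing flag), la and ld (lines added and deleted), and nf (number of modified files); the file names and column names here are assumptions and may differ from those in the released data.

```python
import pandas as pd

# Hypothetical file layout; adjust paths and column names to the released data.
PROJECTS = {"BUG": "bugzilla.csv", "COL": "columba.csv", "JDT": "jdt.csv",
            "PLA": "platform.csv", "MOZ": "mozilla.csv", "POS": "postgres.csv"}

def summarize(path):
    """Compute the per-project statistics reported in Table 3."""
    df = pd.read_csv(path)
    return {
        "changes": len(df),
        "% defect-inducing": 100.0 * df["bug"].mean(),
        "avg LOC/change": (df["la"] + df["ld"]).mean(),   # lines added + deleted
        "avg files/change": df["nf"].mean(),
    }

if __name__ == "__main__":
    for name, path in PROJECTS.items():
        print(name, summarize(path))
```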