tively represent the rule-based and the decision-tree-based supervised models. "Ensemble" denotes the supervised ensemble models that are built with multiple base learners. Naive Bayes is a probability-based technique. In [13], Kamei et al. used a linear regression model to build the effort-aware JIT defect prediction model (i.e. the EALR model). The EALR model is the state-of-the-art supervised model in effort-aware JIT defect prediction. Besides, we also include the other supervised techniques (i.e. the models in Table 2 except the EALR model) as baseline models. The reasons are twofold. First, they are the most commonly used supervised techniques in defect prediction studies [8, 10, 20, 23, 25]. Second, a recent study [8] used most of them (except for the Random Forest) to revisit their impact on the performance of defect prediction.
Table 2: Overview of the supervised models

Family    Model                            Abbreviation
Function  Linear Regression                EALR
          Simple Logistic                  SL
          Radial basis functions network   RBFNet
          Sequential Minimal Optimization  SMO
Lazy      K-Nearest Neighbour              IBk
Rule      Propositional rule               JRip
          Ripple down rules                Ridor
Bayes     Naive Bayes                      NB
Tree      J48                              J48
          Logistic Model Tree              LMT
          Random Forest                    RF
Ensemble  Bagging                          BG+LMT, BG+NB, BG+SL, BG+SMO, and BG+J48
          Adaboost                         AB+LMT, AB+NB, AB+SL, AB+SMO, and AB+J48
          Rotation Forest                  RF+LMT, RF+NB, RF+SL, RF+SMO, and RF+J48
          Random Subspace                  RS+LMT, RS+NB, RS+SL, RS+SMO, and RS+J48

In this study, we use the same method as Kamei et al. [13] to build the EALR model. As stated in Section 2.2, Y(x)/Effort(x) was used as the dependent variable in the EALR model. For the other supervised models, we use the same method as Ghotra et al. [8]. More specifically, Y(x) was used as the dependent variable for these supervised models. Consistent with Ghotra et al. [8], we use the same parameters to build these supervised models. For example, the K-Nearest Neighbor technique requires the K most similar training examples to classify an instance. In [8], Ghotra et al. found that K = 8 performed better than the other options (i.e. 2, 4, 6, and 16). As such, we also use K = 8 to build the K-Nearest Neighbor model. For the EALR model, Kamei et al. used under-sampling to deal with the imbalanced data set and then removed the most highly correlated factors to deal with collinearity. Consistent with Kamei et al.'s study [13], we use exactly the same method to deal with the imbalanced data set and collinearity.
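To make this setup concrete, the following sketch outlines how the EALR model and one baseline (the K-Nearest Neighbor model with K = 8) could be built with under-sampling followed by correlation-based factor removal. It is a minimal illustration in Python using scikit-learn analogues of the learners above, not the scripts used in [13] or [8]; the input file changes.csv, the column names buggy and churn, and the 0.7 correlation cut-off are assumptions made only for this example.

# Illustrative sketch (not the scripts used in [13] or [8]): building the EALR model
# and a K-Nearest Neighbor baseline with the settings described above. The file name
# "changes.csv", the column names "buggy" and "churn", and the 0.7 correlation
# cut-off are assumptions made for this example.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

def undersample(df, label="buggy", seed=1):
    """Randomly drop not-buggy changes until both classes have the same size."""
    buggy, clean = df[df[label] == 1], df[df[label] == 0]
    return pd.concat([buggy, clean.sample(n=len(buggy), random_state=seed)])

def drop_correlated(X, cutoff=0.7):
    """Greedily remove one factor from each highly correlated pair of factors."""
    corr, keep = X.corr().abs(), list(X.columns)
    for i, a in enumerate(X.columns):
        for b in X.columns[i + 1:]:
            if a in keep and b in keep and corr.loc[a, b] > cutoff:
                keep.remove(b)
    return X[keep]

changes = pd.read_csv("changes.csv")          # one row per change (hypothetical file)
train = undersample(changes)                  # class balancing, as in Kamei et al. [13]
features = drop_correlated(train.drop(columns=["buggy", "churn"]))

# EALR: linear regression with Y(x)/Effort(x) as the dependent variable (Section 2.2).
ealr_target = train["buggy"] / train["churn"].clip(lower=1)
ealr = LinearRegression().fit(features, ealr_target)

# Baseline example: K-Nearest Neighbor (IBk) with K = 8 and Y(x) as the dependent variable.
ibk = KNeighborsClassifier(n_neighbors=8).fit(features, train["buggy"])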
3.4 Research Questions

We investigate the following three research questions to determine the practical value of simple unsupervised models:

RQ1: How well do simple unsupervised models predict defect-inducing changes when compared with the state-of-the-art supervised models in cross-validation?
RQ2: How well do simple unsupervised models predict defect-inducing changes when compared with the state-of-the-art supervised models in time-wise cross-validation?
RQ3: How well do simple unsupervised models predict defect-inducing changes when compared with the state-of-the-art supervised models in across-project prediction?

The purposes of RQ1, RQ2, and RQ3 are to compare simple unsupervised models with the state-of-the-art supervised models under three different prediction settings (i.e. cross-validation, time-wise cross-validation, and across-project prediction) to determine how well they predict defect-inducing changes. Since unsupervised models do not leverage the buggy or not-buggy label information to build the prediction models, they are not expected to perform better than the supervised models. However, if unsupervised models are not much worse than the supervised models, they are still a good choice for practitioners because they have a lower building cost, a wider application range, and a higher efficiency. To the best of our knowledge, little is currently known about these research questions from the viewpoint of unsupervised models in the literature. Our study attempts to fill this gap with an in-depth investigation of simple unsupervised models in the context of effort-aware JIT defect prediction.

3.5 Performance Indicators

When evaluating the predictive effectiveness of a JIT defect prediction model, we take into account the effort required to inspect the changes predicted as defect-prone in order to determine whether they are defect-inducing changes. Consistent with Kamei et al. [13], we use the code churn (i.e. the total number of lines added and deleted) of a change as a proxy for the effort required to inspect the change. In [13], Kamei et al. used ACC and Popt to evaluate the effort-aware performance of the EALR model. ACC denotes the recall of defect-inducing changes when the top-ranked changes accounting for 20% of the entire effort (i.e. the effort required to inspect all changes) are inspected. Popt is the normalized version of the effort-aware performance indicator originally introduced by Mende and Koschke [24]. Popt is based on the concept of the "code-churn-based" Alberg diagram.

Figure 1: Code-churn-based Alberg diagram (x-axis: % code churn; curves: prediction model m, random model, optimal model, and worst model)

Figure 1 is an example "code-churn-based" Alberg diagram showing the performance of a prediction model m. In this diagram, the x-axis and y-axis are, respectively, the cumulative percentage of code churn of the changes (i.e. the percentage of effort) and the cumulative percentage of defect-inducing changes found in the selected changes. To compute Popt, two additional curves are included: the "optimal" model and the "worst" model. In the "optimal" model and the "worst" model, changes are sorted in decreasing and ascending order, respectively, according to their actual defect densities. According to [13], Popt can be formally defined as:

Popt(m) = 1 - (Area(optimal) - Area(m)) / (Area(optimal) - Area(worst))
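For concreteness, the sketch below computes ACC (recall within the top-ranked changes covering 20% of the total code churn) and Popt from per-change churn values, model scores, and buggy labels, following the definitions above. It is a minimal illustration with made-up toy data, not the evaluation code used by Kamei et al. [13].

# Minimal sketch of the two indicators defined above; the toy arrays at the end are
# made-up data, and this is not the evaluation code used by Kamei et al. [13].
import numpy as np

def alberg_area(churn, buggy, order):
    """Area under the code-churn-based Alberg curve for a given inspection order."""
    x = np.concatenate(([0.0], np.cumsum(churn[order]) / churn.sum()))  # cumulative % of effort
    y = np.concatenate(([0.0], np.cumsum(buggy[order]) / buggy.sum()))  # cumulative % of defects found
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2))       # trapezoidal area

def acc_at_20_percent(churn, score, buggy):
    """Recall of defect-inducing changes within the top-ranked changes covering 20% of total churn."""
    order = np.argsort(-score)                  # inspect the highest-scored changes first
    effort = np.cumsum(churn[order]) / churn.sum()
    return buggy[order[effort <= 0.2]].sum() / buggy.sum()

def p_opt(churn, score, buggy):
    """Popt(m) = 1 - (Area(optimal) - Area(m)) / (Area(optimal) - Area(worst))."""
    density = buggy / np.maximum(churn, 1)      # actual defect density of each change
    area_m = alberg_area(churn, buggy, np.argsort(-score))
    area_optimal = alberg_area(churn, buggy, np.argsort(-density))
    area_worst = alberg_area(churn, buggy, np.argsort(density))
    return 1 - (area_optimal - area_m) / (area_optimal - area_worst)

# Toy example: five changes with their code churn, model scores, and buggy labels.
churn = np.array([10, 200, 5, 50, 120])
score = np.array([0.9, 0.2, 0.8, 0.4, 0.6])
buggy = np.array([1, 0, 1, 0, 1])
print(acc_at_20_percent(churn, score, buggy), p_opt(churn, score, buggy))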