Menzies et al. did find that the ManualUp model had a good effort-aware prediction performance. Their results were further confirmed by Zhou et al.'s study [42], in which the ManualUp model was even found to be competitive with the regular supervised logistic regression model. All these studies show that, in traditional defect prediction, unsupervised models perform well under effort-aware evaluation.

3. RESEARCH METHODOLOGY

In this section, we first introduce the investigated independent and dependent variables. Then, we describe the simple unsupervised models under study and present the supervised models that will be used as the baseline models against which they are compared. Next, we give the research questions. After that, we provide the performance indicators for evaluating the effectiveness of defect prediction models in effort-aware JIT defect prediction. Finally, we give the data analysis method used in this study.

3.1 Dependent and Independent Variables

The dependent variable in this study is a binary variable: if a code change is a defect-inducing change, the dependent variable is set to 1, and to 0 otherwise.

The independent variables used in this study consist of fourteen change metrics. Table 1 summarizes these change metrics, including the metric name and the description.

Table 1: Summarization of change metrics

Metric   Description
NS       Number of subsystems touched by the current change
ND       Number of directories touched by the current change
NF       Number of files touched by the current change
Entropy  Distribution of the change across the touched files, i.e. −Σ_{k=1}^{n} p_k log2 p_k, where n is the number of files touched by the change and p_k is the ratio of the touched code in the k-th file to the total touched code
LA       Lines of code added by the current change
LD       Lines of code deleted by the current change
LT       Lines of code in a file before the current change
FIX      Whether or not the current change is a defect fix
NDEV     The number of developers that changed the touched files
AGE      The average time interval (in days) between the last change and the current change over the touched files
NUC      The number of unique last changes to the touched files
EXP      Developer experience, i.e. the number of changes made by the developer
REXP     Recent developer experience, i.e. the total experience of the developer in terms of changes, weighted by their age
SEXP     Developer experience on a subsystem, i.e. the number of changes the developer made in the past to the touched subsystems
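The Entropy metric above is the standard Shannon entropy of the per-file distribution of a change. As a minimal illustration (our own sketch; the function name and input format are not from the study), the following computes it from the number of touched lines per file:

```python
from math import log2

def change_entropy(lines_touched_per_file):
    """Entropy of a change: -sum(p_k * log2(p_k)) over the touched files,
    where p_k is the fraction of the change's touched lines in file k."""
    total = sum(lines_touched_per_file)
    if total == 0:
        return 0.0
    probs = [n / total for n in lines_touched_per_file if n > 0]
    return -sum(p * log2(p) for p in probs)

# Example: a change touching 3 files with 10, 30, and 60 modified lines.
print(change_entropy([10, 30, 60]))  # ~1.295 bits; the maximum for 3 files is log2(3) ~ 1.585
```

A change that spreads its modifications evenly across many files thus receives a higher Entropy value than a change concentrated in a single file.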
These fourteen metrics can be classified into the following five dimensions: diffusion, size, purpose, history, and experience. The diffusion dimension consists of NS, ND, NF, and Entropy, which characterize the distribution of a change. As stated by Kamei et al. [13], it is believed that a highly distributed change is more likely to be a defect-inducing change. The size dimension leverages LA, LD, and LT to characterize the size of a change, in which a larger change is expected to have a higher likelihood of being a defect-inducing change [30, 36]. The purpose dimension consists of only FIX. In the literature, there is a belief that a defect-fixing change is more likely to introduce a new defect [40]. The history dimension consists of NDEV, AGE, and NUC. It is believed that a defect is more likely to be introduced by a change if the touched files have been modified by more developers, by more recent changes, or by more unique last changes [4, 9, 11, 22]. The experience dimension consists of EXP, REXP, and SEXP, in which the experience of the developer making a change is expected to be negatively correlated with the likelihood that the change introduces a defect into the code. In other words, if the current change is made by a more experienced developer, it is less likely that a defect will be introduced. Note that all these change metrics are the same as those used in Kamei et al.'s study [13].

3.2 Simple Unsupervised Models

In this study, we leverage change metrics to build simple unsupervised models. As stated by Monden et al. [29], to adopt defect prediction models, one needs to consider not only their prediction effectiveness but also the significant cost required for metric collection and model building. A recent investigation from Google developers further shows that a prerequisite for deploying a defect prediction model in a large company such as Google is that it must be able to scale to large source repositories [21]. Therefore, we only take into account unsupervised defect prediction models that have a low application cost (including the metric collection cost and the modeling cost) and good scalability. More specifically, our study investigates the following unsupervised defect prediction models.

For each of the change metrics (except LA and LD), we build an unsupervised model that ranks changes in descending order according to the reciprocal of their corresponding raw metric values. This idea is inspired by the findings of Koru et al. and Menzies et al. that smaller modules are proportionally more defect-prone and hence should be inspected first [19, 25]. In our study, we expect that "smaller" changes tend to be proportionally more defect-prone. More formally, for each change metric M, the corresponding model is R(c) = 1/M(c), where c represents a change and R is the predicted risk value. For a given system, the changes will be ranked in descending order according to the predicted risk value R. In this context, changes with smaller change metric values will be ranked higher.

Note that, under each of the above-mentioned simple unsupervised models, it is possible that two changes have the same predicted risk value, i.e. they have a tied rank. In our study, if there is a tied rank according to the predicted risk values, the change with a lower defect density will be ranked higher. Furthermore, if there is still a tied rank according to the defect densities, the change with a larger change size will be ranked higher.
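To make the ranking rule concrete, here is a minimal sketch (ours, not the authors' implementation) of one such unsupervised model: it ranks changes by R(c) = 1/M(c) in descending order and applies the two pessimistic tie-breakers described above. The Change fields and the treatment of a zero metric value are our assumptions.

```python
from dataclasses import dataclass

@dataclass
class Change:
    metric_value: float    # raw value of the chosen change metric M(c)
    defect_density: float  # defects per changed line, used only for pessimistic tie-breaking
    size: int              # total changed lines, the second tie-breaker

def rank_changes(changes):
    """Rank changes by predicted risk R(c) = 1/M(c), highest risk first.

    Ties on R are broken by lower defect density first, then by larger change
    size, which yields the theoretically 'worst' ordering for the model."""
    def sort_key(c):
        # Assumption: a metric value of 0 is treated as maximally risky,
        # since smaller metric values mean higher predicted risk.
        risk = float("inf") if c.metric_value == 0 else 1.0 / c.metric_value
        # sorted() is ascending, so negate the components we want descending.
        return (-risk, c.defect_density, -c.size)
    return sorted(changes, key=sort_key)
```

For instance, for the NF-based model, metric_value would hold the number of files touched by each change, so changes touching fewer files are inspected first.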
In this way, we will obtain simple unsupervised models that have the theoretically "worst" predictive performance in effort-aware just-in-time defect prediction [14]. In our study, we investigate the predictive power of those "worst" simple unsupervised models. If our experimental results show that those "worst" simple unsupervised models are competitive with the supervised models, we will have confidence that simple unsupervised models are of practical value for practitioners in effort-aware just-in-time defect prediction.

As can be seen, there are 12 simple unsupervised models (one for each of the fourteen change metrics except LA and LD), which involve a low application cost and can be efficiently applied to large source repositories.

3.3 The Supervised Models

The supervised models are summarized in Table 2. These supervised models are categorized into six groups: "Function", "Lazy", "Rule", "Bayes", "Tree", and "Ensemble". The supervised models in the "Function" group are the regression models and the neural networks. "Lazy" consists of the supervised models based on lazy learning. "Rule" and "Tree" respec-