with the state-of-the-art supervised models in effort-aware JIT defect prediction (indicated by Popt and ACC). The above finding is very surprising, as it contrasts with our original expectation that the state-of-the-art supervised models should perform better. The reason for this expectation is that the state-of-the-art supervised models exploit the defect data (i.e., the label information) to build the prediction model. Given the above finding, we believe that applying simple unsupervised models is a good choice for practitioners, due to their low building cost and wide application range.

6. THREATS TO VALIDITY

In this section, we analyze the most important threats to the construct, internal, and external validity of our study.

6.1 Construct Validity

The dependent variable used in this study is a binary variable indicating whether a change is defect-inducing.
Our study used the data sets provided online by Kamei et al. [13]. As stated by Kamei et al. [13], they used the commonly used SZZ algorithm [37] to discover defect-inducing changes. However, the discovered defect-inducing changes may be incomplete, which is a potential threat to the construct validity of the dependent variable. Thus, approaches to recovering missing links [31, 38] are required to improve the accuracy of the SZZ algorithm. Indeed, this is an inherent problem in most, if not all, studies that discover defect-inducing changes by mining software repositories, not unique to us. Nonetheless, this threat should be mitigated by using complete defect data in future work.

The independent variables used in this study are the commonly used change metrics. With respect to their construct validity, previous research has investigated the degree to which they accurately measure the concepts they purport to measure [13]. In particular, each change metric has a clear definition and can be easily collected. In this sense, the construct validity of the independent variables should be acceptable.

6.2 Internal Validity

There are two potential threats to the internal validity. The first potential threat is from the specific cut-off value used for the performance indicator (i.e., ACC). In our study, 0.2 is used as the cut-off value in computing the recall of defect-inducing changes. The reasons are two-fold. First, 0.2 was used in Kamei et al.'s study [13], which enables us to directly compare our results with theirs. Second, 0.2 is the most commonly used cut-off value in the literature. However, it is unknown whether our conclusion depends on the chosen cut-off value. To eliminate this potential threat, we re-ran all the analyses using the following typical cut-off values: 0.05, 0.10, and 0.15. We found that our conclusion remained unchanged.

The second potential threat is from the gap between the training set and the test set in time-wise cross-validation.
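The ACC indicator with an effort cut-off, as discussed above, can be sketched as follows. This is a minimal illustration with hypothetical data; the exact treatment of the change that straddles the effort budget varies across studies, and here we simply stop before it.

```python
# Sketch of the effort-aware ACC indicator at a given effort cut-off.
# Hypothetical input: each change is an (effort, is_defect_inducing)
# pair, already sorted in the prediction model's inspection order.

def acc_at_cutoff(changes, cutoff=0.2):
    """Recall of defect-inducing changes found when inspecting changes,
    in ranked order, until `cutoff` of the total effort is spent."""
    total_effort = sum(e for e, _ in changes)
    total_defects = sum(1 for _, d in changes if d)
    budget = cutoff * total_effort
    spent, found = 0.0, 0
    for effort, is_defective in changes:
        if spent + effort > budget:
            break  # next change would exceed the effort budget
        spent += effort
        found += is_defective
    return found / total_defects if total_defects else 0.0

# Toy example: 4 ranked changes whose efforts sum to 100; a 0.2 cut-off
# allows inspecting the first two changes, catching 2 of 3 defects.
ranked = [(10, True), (10, True), (30, False), (50, True)]
print(acc_at_cutoff(ranked, 0.2))  # -> 0.666...
```

Re-running such a computation with cut-offs 0.05, 0.10, and 0.15, as done in the sensitivity analysis above, only changes the `cutoff` argument.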
In this study, we use two months as the gap. However, it is unknown whether our results depend on this gap. In order to eliminate this potential threat, we re-ran all the analyses using the other gaps (2, 4, 6, and 12 months). We found that our conclusion remained unchanged.

6.3 External Validity

The most important threat to the external validity of this study is that our results may not generalize to other systems. In our experiments, we use six long-lived and widely used open-source software systems as the subject systems. The experimental results drawn from these subject systems are quite consistent. Furthermore, the data sets from these systems are large enough to draw statistically meaningful conclusions. We believe that our study makes a significant contribution to the software engineering body of empirical knowledge about effort-aware JIT defect prediction. Nonetheless, we do not claim that our findings can be generalized to all systems, as the subject systems under study might not be representative of systems in general. To mitigate this threat, there is a need to replicate our study with a wide variety of systems in the future.

7. CONCLUSIONS AND FUTURE WORK

In this paper, we perform an empirical study to investigate the predictive power of simple unsupervised models in effort-aware JIT defect prediction. Our experimental results from six industrial-size systems show that many simple unsupervised models perform well in predicting defect-inducing changes. In particular, contrary to the general expectation, we find that several simple unsupervised models perform better than the state-of-the-art supervised model reported by Kamei et al. [13]. The experimental results from the six investigated subject systems are quite consistent, regardless of whether 10 times 10-fold cross-validation, time-wise cross-validation, or across-project prediction is considered. Our findings have important implications.
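To make the flavor of these models concrete: a simple unsupervised model of the kind studied ranks changes by a single change metric, with no defect labels needed to build it, so that changes with smaller metric values are inspected first. The sketch below illustrates this idea; the metric name and data are hypothetical.

```python
# Minimal sketch of a simple unsupervised ranking model: changes are
# ordered by ascending value of a single change metric (equivalently,
# by descending 1/M), so smaller changes are inspected first.
# No defect labels are used. The metric name "lt" is illustrative.

def rank_changes(changes, metric="lt"):
    """Return changes sorted by ascending metric value; Python's sort
    is stable, so ties keep their original relative order."""
    return sorted(changes, key=lambda c: c[metric])

changes = [
    {"id": "c1", "lt": 1200},
    {"id": "c2", "lt": 90},
    {"id": "c3", "lt": 450},
]
print([c["id"] for c in rank_changes(changes)])  # -> ['c2', 'c3', 'c1']
```

Because no model fitting is involved, such a ranking can be produced for any project whose change metrics are available, which is what gives these models their low building cost and wide application range.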
For practitioners, simple unsupervised models are an attractive alternative to supervised models in the context of effort-aware JIT defect prediction, as they have a lower building cost and a wider application range. This is especially true for those projects whose defect data are expensive to collect or even unavailable. For researchers, we strongly suggest that future JIT defect prediction research use our simple unsupervised models as the baseline models for comparison when a novel prediction model is proposed.

Our study only investigates the actual usefulness of simple unsupervised models in effort-aware JIT defect prediction for open-source software systems. It is unclear whether they can be applied to closed-source software systems. In the future, an interesting direction is hence to extend our current study to closed-source software systems.

8. REPEATABILITY

We provide all data sets and R scripts used to conduct this study at http://ise.nju.edu.cn/yangyibiao/jit.html

9. ACKNOWLEDGMENTS

We are very grateful to Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha, and N. Ubayashi for sharing their data sets, which enabled us to conduct this study. This work is supported by the National Key Basic Research and Development Program of China (2014CB340702), the National Natural Science Foundation of China (61432001, 91418202, 61272082, 61300051, 61321491, 61472178, and 91318301), the Natural Science Foundation of Jiangsu Province (BK20130014), the Hong Kong Competitive Earmarked Research Grant (PolyU5219/06E), the Hong Kong PolyU Grant (4-6934), and Program A for Outstanding PhD Candidates of Nanjing University.