How Far We Have Progressed in the Journey? An Examination of Cross-Project Defect Prediction

YUMING ZHOU, YIBIAO YANG, HONGMIN LU, LIN CHEN, YANHUI LI, and YANGYANG ZHAO, Nanjing University
JUNYAN QIAN, Guilin University of Electronic Technology
BAOWEN XU, Nanjing University

Background. Recent years have seen an increasing interest in cross-project defect prediction (CPDP), which aims to apply defect prediction models built on source projects to a target project. Currently, a variety of (complex) CPDP models have been proposed with a promising prediction performance.
Problem. Most, if not all, of the existing CPDP models are not compared against those simple module size models that are easy to implement and have shown a good performance in defect prediction in the literature.
Objective. We aim to investigate how far we have really progressed in the journey by comparing the performance in defect prediction between the existing CPDP models and simple module size models.
Method. We first use module size in the target project to build two simple defect prediction models, ManualDown and ManualUp, which do not require any training data from source projects. ManualDown considers a larger module as more defect-prone, while ManualUp considers a smaller module as more defect-prone. Then, we take the following measures to ensure a fair comparison of the defect prediction performance between the existing CPDP models and the simple module size models: using the same publicly available data sets, using the same performance indicators, and using the prediction performance reported in the original cross-project defect prediction studies.
Result. The simple module size models have a prediction performance comparable or even superior to most of the existing CPDP models in the literature, including many newly proposed models.
Conclusion. The results caution us that, if the prediction performance is the goal, the real progress in CPDP is not being achieved as it might have been envisaged. We hence recommend that future studies should include ManualDown/ManualUp as the baseline models for comparison when developing new CPDP models to predict defects in a complete target project.

CCS Concepts: • Software and its engineering → Software evolution; Maintaining software;

Additional Key Words and Phrases: Defect prediction, cross-project, supervised, unsupervised, model

This work is partially supported by the National Key Basic Research and Development Program of China (2014CB340702) and the National Natural Science Foundation of China (61432001, 61772259, 61472175, 61472178, 61702256, 61562015, 61403187).
Authors' addresses: Y. Zhou (corresponding author), Y. Yang, H. Lu, L. Chen, Y. Li, Y. Zhao, and B. Xu (corresponding author), State Key Laboratory for Novel Software Technology, Nanjing University, No. 163, Xianlin Road, Nanjing, 210023, Jiangsu Province, P.R. China; emails: {zhouyuming, yangyibiao, hmlu, lchen, yanhuili, bwxu}@nju.edu.cn, csurjzhyy@163.com; Y. Qian, Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin, 541004, Guangxi Province, P.R. China; email: qjy2000@gmail.com.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2018 ACM 1049-331X/2018/04-ART1 $15.00
https://doi.org/10.1145/3183339
ACM Reference format:
Yuming Zhou, Yibiao Yang, Hongmin Lu, Lin Chen, Yanhui Li, Yangyang Zhao, Junyan Qian, and Baowen Xu. 2018. How Far We Have Progressed in the Journey? An Examination of Cross-Project Defect Prediction. ACM Trans. Softw. Eng. Methodol. 27, 1, Article 1 (April 2018), 51 pages.
https://doi.org/10.1145/3183339

1 INTRODUCTION
A defect prediction model can predict the defect-proneness of the modules (e.g., files, classes, or functions) in a software project. Given the prediction result, a project manager can (1) classify the modules into two categories, high defect-prone or low defect-prone, or (2) rank the modules from the highest to lowest defect-proneness. In both scenarios, the project manager could allocate more resources to inspect or test those high defect-prone modules. This will help to find more defects in the project if the model can accurately predict defect-proneness. For a given project, it is common to use the historical project data (e.g., module-level complexity metric data and defect data in the previous releases of the project) to train the model. Prior studies have shown that the model predicts defects well in the test data if it is trained using a sufficiently large amount of data [23]. However, in practice, it might be difficult to obtain sufficient training data [129]. This is especially true for a new type of project or for projects with little collected historical data.

One way to deal with the shortage of training data is to leverage the data from other projects (i.e., source projects) to build the model and apply it to predict the defects in the current project (i.e., the target project) [9]. However, in practice, it is challenging to achieve accurate cross-project defect prediction (CPDP) [129]. The main reason is that the source and target project data usually exhibit significantly different distributions, which violates the similar distribution assumption of the training and test data required by most modeling techniques [61, 80, 123]. Furthermore, in many cases, the source and target project data even consist of different metric sets, which makes it difficult to use the regular modeling techniques to build and apply the prediction model [17, 27, 41, 79]. In recent years, various techniques have been proposed to address these challenges and a large number of CPDP models have hence been developed (see Section 2.4). In particular, it has been reported that these CPDP models produce a promising prediction performance [27, 41, 61, 79, 80].

Yet, most, if not all, of the existing CPDP models are not compared against those simple module size models that are easy to implement and have shown a good performance in defect prediction in the literature. In the past decades, many studies have reported that simple models based on the module size (e.g., SLOC, source lines of code) in a project can in general predict the defects in the project well [2, 49–51, 65, 102, 127, 128]. In the context of CPDP, we can use module size in a target project to build two simple module size models, ManualDown and ManualUp [49–51, 65]. The former considers a larger module as more defect-prone, while the latter considers a smaller module as more defect-prone. Since ManualDown and ManualUp do not require any data from the source projects to build the models, they are free of the challenges posed by the different data distributions/metric sets of the source and target project data. In particular, they have a low computation cost and are easy to implement.
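To make the simplicity of these baselines concrete, the following is a minimal sketch of ManualDown and ManualUp in Python. The function names, the 50% effort cutoff used for classification, and the use of SLOC as the size metric are illustrative assumptions made for this sketch, not the exact configuration used later in the article.

```python
# A minimal sketch of the ManualDown/ManualUp baselines.
# Assumptions (not prescribed by the article at this point): modules are given as
# (module_id, sloc) pairs, and a module is labeled defect-prone if it starts before
# 50% of the total SLOC has been consumed when traversing the ranking.

def manual_rank(modules, ascending=False):
    """Rank modules by size: descending = ManualDown, ascending = ManualUp."""
    return sorted(modules, key=lambda m: m[1], reverse=not ascending)

def manual_classify(modules, ascending=False, effort_cutoff=0.5):
    """Label a module as defect-prone (True) if it starts before the SLOC cutoff."""
    ranked = manual_rank(modules, ascending)
    total_sloc = sum(sloc for _, sloc in ranked)
    labels, inspected = {}, 0
    for module_id, sloc in ranked:
        labels[module_id] = inspected < effort_cutoff * total_sloc
        inspected += sloc
    return labels

# Example: ManualDown ranks f3 first (largest), ManualUp ranks f2 first (smallest).
modules = [("f1", 120), ("f2", 30), ("f3", 560), ("f4", 200)]
print(manual_rank(modules))                      # ManualDown ranking
print(manual_classify(modules, ascending=True))  # ManualUp classification
```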
In contrast, due to the use of complex modeling techniques, many existing CPDP models not only have a high computation cost but also involve a large number of parameters that need to be carefully tuned. This imposes substantial barriers to applying them in practice, especially for large projects. Furthermore, previous studies show that module size has a strong confounding effect on the associations between code metrics and defect-proneness [20, 128]. Emam et al. even reported that these associations disappeared after controlling for module size [20].
This indicates that module size may have a capability in defect prediction similar to code metrics, the most commonly used predictors in the existing CPDP models. An interesting question hence arises: “If the prediction performance is the goal, then how do the existing CPDP models perform compared with ManualDown and ManualUp?” This question is important for both practitioners and researchers. For practitioners, it will help to determine whether it is worthwhile to apply the existing CPDP models in practice. If simple module size models perform similarly or even better, then there is no practical reason to apply complex CPDP models. For researchers, if simple module size models perform similarly or even better, then there is a strong need to improve the prediction power of the existing CPDP models. Otherwise, the motivation of applying those CPDP models could not be well justified. To the best of our knowledge, of the existing CPDP studies, only two studies compared the proposed CPDP models against simple module size models [9, 12]. In other words, for most of the existing CPDP models, it remains unclear whether they have a superior prediction performance compared with simple module size models.

In this article, we attempt to investigate how far we have really progressed in the journey by conducting a comparison in defect prediction performance between the existing CPDP models and simple module size models. We want to know not only whether the difference in defect prediction performance is statistically significant but also whether it is of practical importance. In our study, we take the following two measures to ensure a fair comparison. On the one hand, we use the same publicly available datasets and the same performance indicators to collect the performance values for the simple module size models as those used in the existing CPDP studies. On the other hand, we do not attempt to re-implement the CPDP models to collect their prediction performance values. Instead, we use the prediction performance values reported in the original CPDP studies to conduct the comparison. Therefore, the implementation bias can be avoided. These measures ensure that we can draw a reliable conclusion on the benefits of the existing CPDP models w.r.t. simple module size models in defect prediction performance in practice.

Under the above experimental settings, we perform an extensive comparison between the existing CPDP models and simple module size models for defect prediction. Surprisingly, our experimental results show that simple module size models have a prediction performance comparable or even superior to most of the existing CPDP models in the literature, including many newly proposed models. Consequently, for practitioners, it would be better to apply simple module size models rather than the existing CPDP models to predict defects in a target project. This is especially true when taking into account the application cost (including metrics collection cost and model building cost). The results reveal that, if the prediction performance is the goal, the current progress in CPDP studies is not being achieved as it might have been envisaged. We hence strongly recommend that future CPDP studies should consider simple module size models as the baseline models to be compared against. The benefits of using a baseline model are two-fold [112].
On the one hand, this would ensure that the predictive power of a newly proposed CPDP model could be adequately compared and assessed. On the other hand, “the ongoing use of a baseline model in the literature would give a single point of comparison”. This will allow a meaningful assessment of any new CPDP model against previous CPDP models.

The rest of this article is organized as follows. Section 2 introduces the background on cross-project defect prediction, including the problem studied, the general framework, the performance evaluation indicators, and the state of progress. Section 3 describes the experimental design, including the simple module size model, the research questions, the datasets, and the data analysis method. Section 4 presents in detail the experimental results. Section 5 compares the simple module size models with related work. Section 6 discusses the results and implications. Section 7 analyzes the threats to the validity of our study. Section 8 concludes the article and outlines directions for future work.
2 CROSS-PROJECT DEFECT PREDICTION
In this section, we first describe the problem that cross-project defect prediction aims to address. Then, we present a general framework that depicts the key components in a supervised cross-project defect prediction. After that, we introduce the commonly used performance evaluation indicators involved in the existing CPDP studies. Finally, we give a literature overview of current developments in this area and provide a comparison of the main CPDP studies under the general framework.

2.1 Problem Statement
The purpose of cross-project defect prediction is to address the shortage problem of training data that are used for building a model to predict defects in a target release of a target project. To predict defects in the target release, it is common to use a two-phase process (i.e., model building and application). At the model-building phase, the metric data and the defect data are first collected from the modules in historical releases (i.e., the releases before the target release in the target project). Then, based on the collected data, a specific modeling technique is used to train a model to capture the relationships between the metrics and defect-proneness. At the model application phase, the same metrics are first collected from the modules in the target release. Then, the predicted defect-proneness of each module in the target release is obtained by substituting the corresponding metric data into the model. After that, the prediction performance is evaluated by comparing the predicted defect-proneness against the actual defect information in the target release. It has been shown that, if there is a sufficiently large amount of training data at the model-building phase, the resulting model in general predicts defects well at the model application phase [23]. In practice, however, sufficiently large training data are often unavailable, especially for those projects with few or even no historical releases. To mitigate this problem, researchers propose to use the training data from other projects (i.e., source projects) to build the model for the target project, which is called cross-project defect prediction [9, 106, 129].

In the literature, it has been highlighted that there are two major challenges that cross-project defect prediction has to deal with. The first challenge is that the source and target project data usually exhibit significantly different distributions [13, 41, 42, 61, 80, 123]. Since the source and target projects might be from different corporations, there might be a large difference between their development environments. Consequently, the same metrics in the source and target project data might exhibit very different distributions. However, regular modeling techniques are based on the assumption that the training and test data are drawn from the same distribution. Therefore, it is difficult to achieve a good prediction performance by directly applying regular modeling techniques to build the prediction models. The second challenge is that the source and target project data might have no common metrics [27, 41, 79]. In practice, since the source and target projects might be from different corporations, it is likely that the source and target project data consist of completely different metric sets. However, regular modeling techniques are based on the assumption that the training and test data have common metric sets. Consequently, it is difficult, if not impossible, to predict defects in the target project using the models built with regular modeling techniques on the source project data. In recent years, much effort has been devoted to dealing with these challenges.
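As a concrete illustration of these two challenges, the sketch below checks, for a given source/target dataset pair, (1) whether the two datasets share a common metric set and (2) whether the shared metrics follow noticeably different distributions, here via a two-sample Kolmogorov–Smirnov test. The column names, the "bug" label column, and the 0.05 significance threshold are assumptions made for this sketch, not datasets or settings prescribed by the article.

```python
# A sketch that makes the two CPDP challenges visible for a given source/target pair.
# Assumed inputs: pandas DataFrames whose columns are software metrics plus a
# hypothetical binary "bug" label column; the 0.05 threshold is illustrative.
import pandas as pd
from scipy.stats import ks_2samp

def diagnose_pair(source: pd.DataFrame, target: pd.DataFrame, label_col: str = "bug"):
    src_metrics = set(source.columns) - {label_col}
    tgt_metrics = set(target.columns) - {label_col}

    # Challenge 2: heterogeneous metric sets between source and target data.
    common = sorted(src_metrics & tgt_metrics)
    print(f"common metrics: {len(common)}, source-only: {len(src_metrics - tgt_metrics)}, "
          f"target-only: {len(tgt_metrics - src_metrics)}")

    # Challenge 1: different distributions of the shared metrics.
    shifted = []
    for metric in common:
        stat, p_value = ks_2samp(source[metric].dropna(), target[metric].dropna())
        if p_value < 0.05:  # distributions differ significantly (illustrative threshold)
            shifted.append((metric, round(stat, 3)))
    print(f"{len(shifted)}/{len(common)} shared metrics show a significant distribution shift")
    return shifted

# Usage (hypothetical file names):
# shifted = diagnose_pair(pd.read_csv("source_release.csv"), pd.read_csv("target_release.csv"))
```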
2.2 General Framework
In the field of cross-project defect prediction, supervised models are the mainstream models. Figure 1 presents a general framework that applies supervised techniques to cross-project defect prediction. As can be seen, the target project consists of k releases, among which the kth release is the target release and the 1st, ..., (k−1)th releases are the historical releases.
Fig. 1. The general framework that applies supervised modeling techniques to cross-project defect prediction.
The source projects consist of n projects, each having a number of releases. Each release in the target and source projects has an associated dataset, which consists of the metric data and the defect label data at the module level. The modules can be packages, files, classes, or functions, depending on the defect prediction context. The metric sets are assumed to be the same for all releases within a single project but can be different for different projects. In each dataset, one instance represents a module and one feature represents a software metric extracted from the module.

From Figure 1, on the one hand, we can see that the training data consist of the external training data and the internal training data. The external training data are indeed the labeled source project data. The internal data consist of two parts: the historical training data (i.e., the labeled historical release data) and the target training data (i.e., a small amount of the labeled target release data of the target project). On the other hand, we can see that the test data consist of the labeled target release data excluding the target training data. In other words, the labeled target release data are divided into two parts: a small amount of data used as the target training data and the remaining data used as the test data. For the simplicity of presentation, we use the test metric data and the defect oracle data to respectively denote the metric data and the defect label data in the test data. The test metric data will be introduced into the model built with the training data to compute the predicted defect-proneness. Once the predicted defect-proneness of each module in the test data is obtained, it will be compared against the defect oracle data to compute the prediction performance.

At a high level, the application of a supervised CPDP model to a target project in practice involves the following three phases: data preparation, model training, and model testing. At the data preparation phase, the training data collected from different sources are preprocessed to make them appropriate for building a CPDP model for the target project. On the one hand, there is a need to address the privacy concerns of (source project) data owners, i.e., preventing the disclosure of specific sensitive metric values of the original source project data [84, 85]. On the other hand, there is a need to address the utility of the privatized source project data in cross-project defect prediction. This includes dealing with heterogeneous feature sets (i.e., metric sets) between source and target project data [27, 41, 79], filtering out irrelevant training data [39, 86, 94, 106], handling class-imbalanced training data [42, 93], making the source and target projects have similar data distributions [13, 41, 42, 61, 80, 123], and removing irrelevant/redundant features [5, 9, 26, 44]. At the model training phase, a supervised modeling technique is used to build a CPDP model [59, 96, 108, 116, 126]. In other words, this aims to use a specific supervised learning algorithm to build a model to capture the relationship between the metrics (i.e., the independent variables) and defect-proneness (i.e., the dependent variable) in the training data. At the model testing phase, the CPDP model is applied to predict defects in the target release and its prediction performance is evaluated under the classification or ranking scenario. In the former scenario, the modules in the test data are classified as defective or not defective. In the latter scenario, the modules in the test data are ranked from the highest to the lowest predicted defect-proneness. Under each scenario, the performance report is generated by comparing the predicted defect-proneness with the defect oracle data. Through the above phases, it is expected that the knowledge about the effect of the metrics on defects can be learned from the source projects by the supervised model to predict defects in the target release of the target project.
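As a rough illustration of these three phases, the sketch below trains a model on labeled source project data and applies it to a target release. The choice of logistic regression, the z-score standardization step, and the column names (including the hypothetical "bug" label) are assumptions made for this sketch; they are not the specific techniques compared in this article.

```python
# A minimal sketch of the data preparation / model training / model testing phases
# of a supervised CPDP pipeline. Assumed inputs: source and target DataFrames that
# share a common metric set plus a hypothetical binary "bug" label column.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def simple_cpdp(source: pd.DataFrame, target: pd.DataFrame, label_col: str = "bug"):
    # Data preparation: keep the common metrics only (a crude stand-in for the much
    # richer preprocessing steps discussed above).
    metrics = sorted((set(source.columns) & set(target.columns)) - {label_col})
    scaler = StandardScaler().fit(source[metrics])
    X_src = scaler.transform(source[metrics])
    X_tgt = scaler.transform(target[metrics])

    # Model training: fit a supervised classifier on the labeled source data.
    model = LogisticRegression(max_iter=1000).fit(X_src, source[label_col])

    # Model testing: predicted defect-proneness scores for the target release,
    # usable for classification (apply a threshold) or ranking (sort descending).
    scores = model.predict_proba(X_tgt)[:, 1]
    return pd.Series(scores, index=target.index, name="predicted_defect_proneness")

# Usage (hypothetical files): the returned scores can then be compared against the
# defect oracle data using the indicators listed in Table 1.
# scores = simple_cpdp(pd.read_csv("source.csv"), pd.read_csv("target.csv"))
```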
2.3 Performance Evaluation
Table 1 summarizes the prediction performance indicators involved in the existing cross-project defect prediction literature. The first column reports the scenario in which a specific indicator is used. The second, third, and fourth columns, respectively, show the name, the definition, and the informal interpretation for each specific indicator. The last column indicates the expected direction for a better prediction performance.
Table 1. The Prediction Performance Indicators Involved in the Existing Cross-Project Defect Prediction Studies

Scenario | Name | Definition | Interpretation | Better
Classification | Recall | TP / (TP + FN) | The fraction of defective modules that are predicted as defective | High
Classification | Precision | TP / (TP + FP) | The fraction of predicted defective modules that are defective | High
Classification | PD | TP / (TP + FN) | Probability of Detection. Same as Recall | High
Classification | PF | FP / (FP + TN) | Probability of False alarm. The fraction of not defective modules that are predicted as defective | Low
Classification | Correctness | TP / (TP + FP) [9] | Same as Precision | High
Classification | Completeness | Defects(TP) / Defects(TP + FN) [9] | A variant of Recall. The percentage of defects found | High
Classification | Fβ | (1 + β²) × Recall × Precision / (Recall + β² × Precision) | The harmonic mean of precision and recall (β = 1 or 2) | High
Classification | G1 | [2 × PD × (1 − PF)] / [PD + (1 − PF)] | The harmonic mean of PD and 1 − PF | High
Classification | G2 | √(Recall × Precision) | The geometric mean of recall and precision | High
Classification | G3 | √(Recall × (1 − PF)) | The geometric mean of recall and 1 − PF | High
Classification | Balance | 1 − √((0 − PF)² + (1 − PD)²) / √2 [107] | The balance between PF and PD | High
Classification | ED | √(θ × (1 − PD)² + (1 − θ) × PF²) [52] | The distance between (PD, PF) and the ideal point on the ROC space (1, 0), weighted by cost function θ (θ = 0.6 in default) | Low
Classification | MCC | (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)) [123] | A correlation coefficient between the actual and predicted binary classifications | High
Classification | AUC | The area under ROC curve [24] | The probability that a model will rank a randomly chosen defective module higher than a randomly chosen not defective one | High
Classification | NECM | (FP + (C_II/C_I) × FN) / (TP + FP + TN + FN) [48] | The normalized expected cost of misclassification | Low
Classification | Z* | (o − e) × √n / √(e × (n − e)) [10] | A statistic for assessing the statistical significance of the overall classification accuracy. Here, o is the total number of correct classifications (i.e., o = TP + TN), e is the expected number of correct classifications due to chance, and n is the total number of instances (i.e., n = TP + FP + TN + FN) | High
Ranking | AUCEC | Area(m): the area under the cost-effectiveness curve corresponding to the model m [90] | The cost-effectiveness of the overall ranking | High
Ranking | NofB20 | The number of defective modules in the top 20% SLOC | The cost-effectiveness of the top ranking | High
Ranking | PofB20 | The percentage of defects in the top 20% SLOC | The cost-effectiveness of the top ranking | High
Ranking | FPA | The average, over all values of k, of the percentage of defects contained in the top k modules [111] | The effectiveness of the overall ranking | High
(Continued)
Table 1. Continued

Scenario | Name | Definition | Interpretation | Better
Ranking | E1(R) | The percentage of modules having R% defects | The effectiveness of the top ranking | Low
Ranking | E2(R) | The percentage of code having R% defects | The cost-effectiveness of the top ranking | Low
Ranking | Prec@k | Top k precision | The fraction of the top k ranked modules that are defective | High

As can be seen, many performance indicators have a similar or even identical meaning. For example, PD has the same meaning as Recall, while Completeness is a variant of Recall. In the next section, we will compare the existing CPDP studies, including the performance indicators used. For the performance indicators, we use the exact same names as used in the original studies. This will help the reader understand the comparison, especially for those readers who are familiar with the existing CPDP studies. Therefore, we include these indicators in Table 1, even if they express a similar or even identical meaning.

Classification performance indicators. When applying a prediction model to classify modules in practice, we first need to determine a classification threshold for the model. A module will be classified as defective if the predicted defect-proneness is larger than the threshold; otherwise, it will be classified as not defective. The four outcomes of the classification using the threshold are as follows: TP (the number of modules correctly classified as defective), TN (the number of modules correctly classified as not defective), FP (the number of modules incorrectly classified as defective), and FN (the number of modules incorrectly classified as not defective). As shown in Table 1, all classification performance indicators except AUC (Area Under ROC Curve) are based on TP, TN, FP, and FN, i.e., they depend on the threshold. AUC is a threshold-independent indicator, which denotes the area under a ROC (Relative Operating Characteristic) curve. A ROC curve is a graphical plot of the PD (the y-axis) vs. PF (the x-axis) for a binary classification model as its decision threshold is varied. In particular, all classification performance indicators except NECM do not explicitly take into account the cost of misclassification caused by the prediction model.

Ranking performance indicators. In the ranking scenario, the performance indicators can be classified into two categories: effort-aware and non-effort-aware. The former includes AUCEC, NofB20, PofB20, and E2(R), while the latter includes FPA, E1(R), and Prec@k. The effort-aware indicators take into account the effort required to inspect or test those modules predicted as defective to find whether they contain defects. In these indicators, the size of a module measured by SLOC (source lines of code) is used as a proxy for the effort required to inspect or test the module. In particular, AUCEC is based on the SLOC-based Alberg diagram [6]. In such an Alberg diagram, each defect-proneness prediction model corresponds to a curve constructed as follows. First, the modules in the target release are sorted in decreasing order according to the defect-proneness predicted by the model. Then, the cumulative percentage of SLOC of the top modules selected from the module ranking (the x-axis) is plotted against the cumulative percentage of defects found in the selected top modules (the y-axis). AUCEC is the area under the curve corresponding to the model.
Note that NofB20, PofB20, and E2(R) quantify the performance of the top ranking, while AUCEC quantifies the performance of the overall ranking. The non-effort-aware performance indicators do not take into account the inspection or testing effort. Of the non-effort-aware indicators, E1(R) and Prec@k quantify the performance of the top ranking, while FPA quantifies the performance of the overall ranking.
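To make these definitions concrete, the following sketch computes a few of the classification indicators from Table 1 together with the effort-aware PofB20 for a ranking. The helper names and the example data are invented for illustration; the formulas follow the definitions in Table 1 (with θ = 0.6 for ED), and the PofB20 cutoff handling is one common variant rather than a prescribed procedure.

```python
# A sketch of selected indicators from Table 1; inputs are plain Python numbers/lists.
import math

def classification_indicators(tp, fp, tn, fn, theta=0.6):
    recall = tp / (tp + fn)                      # PD / Recall
    precision = tp / (tp + fp)
    pf = fp / (fp + tn)                          # probability of false alarm
    f1 = 2 * recall * precision / (recall + precision)
    balance = 1 - math.sqrt((0 - pf) ** 2 + (1 - recall) ** 2) / math.sqrt(2)
    ed = math.sqrt(theta * (1 - recall) ** 2 + (1 - theta) * pf ** 2)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Recall": recall, "Precision": precision, "PF": pf,
            "F1": f1, "Balance": balance, "ED": ed, "MCC": mcc}

def pofb20(sloc, defects, scores):
    """Percentage of defects found in the top 20% SLOC of the predicted ranking."""
    ranking = sorted(zip(sloc, defects, scores), key=lambda x: x[2], reverse=True)
    budget, inspected, found = 0.2 * sum(sloc), 0, 0
    for size, defect_count, _ in ranking:
        if inspected + size > budget:
            break
        inspected += size
        found += defect_count
    return found / sum(defects)

print(classification_indicators(tp=30, fp=20, tn=120, fn=10))
print(pofb20(sloc=[500, 120, 80, 300], defects=[3, 1, 2, 0], scores=[0.2, 0.9, 0.8, 0.4]))
```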
Fig. 2. The process to select supervised CPDP studies.

2.4 State of Progress

To understand the progress in supervised cross-project defect prediction, we conducted a search of the literature published between 2002 and 2017 (the literature search was conducted on April 21, 2017). The starting year of the search was set to 2002, as it is the year that the first CPDP article was published [9] (abbreviated as "BMW 2002 TSE paper" in the following). Before the search, we set up the following inclusion criteria: (1) the study was a supervised CPDP study; (2) the study was written in English; (3) the full text was available; (4) only the journal version was included if the study had both conference and journal versions; and (5) the prediction scenario was classification or ranking.

As shown in Figure 2, we searched for articles from three sources: Google Scholar, the existing systematic literature reviews, and literature reading. We used Google Scholar as the main search source, as it "provides a simple way to broadly search for scholarly literature" (https://scholar.google.com/intl/en/scholar/about.html). First, we did a forward snowballing search [114] by recursively examining the citations of the "BMW 2002 TSE paper". Consequently, 46 relevant articles were identified [1, 3, 5, 7, 9, 10, 12, 13, 15, 17, 26–30, 37, 39, 41, 42, 44, 46, 47, 52, 59, 66, 79, 80, 82, 85–87, 89, 90, 92–96, 107, 109, 116, 118, 120, 122, 123, 126]. Then, we used "cross project" + "defect prediction" as the search terms. As a result, 13 additional relevant articles [14, 31, 32, 43, 45, 61, 71, 84, 88, 99, 101, 108, 129] were found with respect to the identified 46 articles. Next, we used "cross company" + "defect prediction" as the search terms, identifying 3 additional relevant articles [104, 106, 110] with respect to the 59 (=46 + 13) articles. After that, we used "cross project" + "fault prediction" as the search terms, identifying 1 additional relevant article [100] with respect to the 62 (=59 + 3) articles. Finally, we used the terms "cross company" + "fault prediction" and did not find any additional relevant article. Through the above steps, we identified a total of 63 (=46 + 13 + 3 + 1 + 0) supervised CPDP articles from Google Scholar. In addition to Google Scholar, we identified 6 additional relevant articles from a systematic literature review by Hosseini et al. [40]. Hosseini et al. identified 46 primary CPDP studies from three electronic databases (the ACM Digital Library, IEEE Xplore, and the ISI Web of Science) and two search engines (Google Scholar and Scopus). After applying our inclusion criteria to these 46 CPDP studies, we found that 6 relevant articles [60, 73, 74, 103, 105, 119] were not present in the 63 CPDP articles identified from Google Scholar. The last source was literature reading, through which we found 3 additional relevant articles [48, 72, 121] with respect to the 69 (=63 + 6) CPDP articles. Therefore, the overall search process resulted in 72 (=63 + 6 + 3) supervised CPDP articles found in the literature.
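The forward snowballing step described above amounts to a breadth-first traversal of the "is cited by" relation starting from the seed paper, with each newly encountered citing article then screened against the inclusion criteria. The sketch below only illustrates that traversal: the citation map is entirely hypothetical, and the actual search relied on Google Scholar's citation listings rather than any programmatic interface.

from collections import deque

# Hypothetical citation map: paper -> papers that cite it (forward direction).
CITED_BY = {
    "BMW 2002 TSE paper": ["P1", "P2"],
    "P1": ["P3"],
    "P2": ["P3", "P4"],
    "P3": [],
    "P4": [],
}

def forward_snowball(seed, cited_by):
    """Recursively collect articles reachable via the 'is cited by' relation."""
    seen, queue, candidates = {seed}, deque([seed]), []
    while queue:
        paper = queue.popleft()
        for citing in cited_by.get(paper, []):
            if citing not in seen:
                seen.add(citing)
                candidates.append(citing)  # to be screened against the inclusion criteria
                queue.append(citing)
    return candidates

print(forward_snowball("BMW 2002 TSE paper", CITED_BY))  # ['P1', 'P2', 'P3', 'P4']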
Table 2 provides an overview of the main literature concerning supervised cross-project defect prediction. For each study, the 2nd and 3rd columns, respectively, list the year of publication and the topic involved. The 4th to 6th columns list the source and target project characteristics, including the number of source/target projects (releases) and the programming languages involved. For simplicity of presentation, we do not report the number of releases in parentheses if it is equal to the number of projects. Note that the entries with a gray background under the 4th and 5th columns indicate that the source and target projects are different projects. The 7th to 12th columns list the key modeling components (challenges) covered. A "Yes" entry indicates that the study explicitly considers the corresponding component, and a blank entry indicates that it does not. The 13th to 16th columns list the performance evaluation context, including the use (or not) of target training data (i.e., a small amount of labeled target release data used as training data), the application scenarios investigated, the main performance indicators employed, and the public availability of the test data. An entry with a gray background under the 15th column indicates that the study only graphically visualizes the performance and does not report numerical results. The 17th column reports the use (or not) of simple module size models, which classify or rank modules in the target release by their size, as the baseline models. The last column indicates whether the CPDP model proposed in the study will be compared against the simple module size models described in Section 3 of our study.

From Table 2, we have the following observations. First, the earliest CPDP study appears to be Briand et al.'s work [9]. In that work, Briand et al. investigated the applicability of CPDP to object-oriented software projects. Their results showed that a model built on one project produced a poor classification performance but a good ranking performance on another project. This indicates that, if used appropriately, CPDP could be helpful for practitioners. Second, CPDP has attracted rapidly increasing interest over the past few years. After Briand et al.'s pioneering work, more than 70 CPDP studies have been published. The overall results show that CPDP has a promising prediction performance, comparable to or even better than WPDP (within-project defect prediction). Third, the existing CPDP studies cover a wide range of topics. These topics include validating CPDP on different development environments (open-source and proprietary), different development stages (design and implementation), different programming languages (C, C++, C#, Java, JS, Pascal, Perl, and Ruby), different module granularities (change, function, class, and file), and different defect predictors (semantic, text, and structural). These studies make a significant contribution to our knowledge about the wide application range of CPDP. Fourth, the number of studies shows a highly unbalanced distribution across the key CPDP components. Of the key components, "filter instances" and "transform distributions" are the two most studied, while "privatize data" and "homogenize features" are the two least studied.
In particular, none of the existing CPDP studies covers all of the key components, and most of them take into account fewer than three key components. Fifth, most studies evaluate CPDP models on the complete target release data, which are publicly available, under the classification scenario. In a few studies, only part of the target release data is used as the test data. The reason is either that the target training data are needed when building a CPDP model or that a CPDP model is evaluated on the same test data as the WPDP models. Furthermore, most studies explicitly report the prediction results in numerical form. This allows us to make a direct comparison of the prediction performance of simple module size models against most of the existing CPDP models without a re-implementation. Finally, of the 72 studies, only two (i.e., Briand et al.'s study [9] and Canfora et al.'s study [12]) used simple module size models in the target release as the baseline models. Both studies reported that the proposed