2 CROSS-PROJECT DEFECT PREDICTION

In this section, we first describe the problem that cross-project defect prediction (CPDP) aims to address. Then, we present a general framework that depicts the key components of supervised cross-project defect prediction. After that, we introduce the performance evaluation indicators commonly used in existing CPDP studies. Finally, we give a literature overview of current developments in this area and provide a comparison of the main CPDP studies under the general framework.

2.1 Problem Statement

The purpose of cross-project defect prediction is to address the shortage of training data for building a model to predict defects in a target release of a target project. To predict defects in the target release, it is common to use a two-phase process (i.e., model building and application). At the model-building phase, the metric data and the defect data are first collected from the modules in historical releases (i.e., the releases before the target release in the target project). Then, based on the collected data, a specific modeling technique is used to train a model to capture the relationships between the metrics and defect-proneness. At the model application phase, the same metrics are first collected from the modules in the target release. Then, the predicted defect-proneness of each module in the target release is obtained by substituting the corresponding metric data into the model. After that, the prediction performance is evaluated by comparing the predicted defect-proneness against the actual defect information in the target release.
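To make the two-phase process concrete, the following is a minimal sketch in Python (not from the paper). It assumes module-level data held in pandas DataFrames with hypothetical metric columns and a binary defect label; logistic regression stands in for the "specific modeling technique" mentioned above.

# A minimal sketch of the two-phase process described above, assuming
# module-level data in pandas DataFrames with hypothetical column names:
# metric columns (e.g., "loc", "cbo", "wmc") and a binary "defective" label.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

METRICS = ["loc", "cbo", "wmc"]  # hypothetical metric set

def build_model(historical: pd.DataFrame) -> LogisticRegression:
    """Model-building phase: fit a classifier on the historical releases."""
    model = LogisticRegression(max_iter=1000)
    model.fit(historical[METRICS], historical["defective"])
    return model

def apply_model(model: LogisticRegression, target: pd.DataFrame) -> float:
    """Model application phase: predict defect-proneness for the target
    release and evaluate it against the actual defect labels."""
    scores = model.predict_proba(target[METRICS])[:, 1]
    return roc_auc_score(target["defective"], scores)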
It has been shown that, if there is a sufficiently large amount of training data at the model-building phase, the resulting model in general predicts defects well at the model application phase [23]. In practice, however, sufficiently large training data are often unavailable, especially for projects with few or even no historical releases. To mitigate this problem, researchers propose to use the training data from other projects (i.e., source projects) to build the model for the target project, which is called cross-project defect prediction [9, 106, 129].

In the literature, it has been highlighted that there are two major challenges that cross-project defect prediction has to deal with. The first challenge is that the source and target project data usually exhibit significantly different distributions [13, 41, 42, 61, 80, 123]. Since the source and target projects might be from different corporations, there might be a large difference between their development environments. Consequently, the distributions of the same metrics in the source and target project data might differ substantially. However, regular modeling techniques are based on the assumption that the training and test data are drawn from the same distribution. Therefore, it is difficult to achieve a good prediction performance by directly applying regular modeling techniques to build the prediction models. The second challenge is that the source and target project data might have no common metrics [27, 41, 79]. In practice, since the source and target projects might be from different corporations, it is likely that the source and target project data consist of completely different metric sets. However, regular modeling techniques are based on the assumption that the training and test data share a common metric set. Consequently, it is difficult, if not impossible, to predict defects in the target project using models built with regular modeling techniques on the source project data. In recent years, much effort has been devoted to dealing with these challenges.

2.2 General Framework

In the field of cross-project defect prediction, supervised models are the mainstream. Figure 1 presents a general framework that applies supervised techniques to cross-project defect prediction. As can be seen, the target project consists of k releases, among which the kth release is the target release.
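As an illustration of both the basic supervised CPDP setup in Figure 1 and the distribution-shift challenge noted above, here is a hedged sketch (not from the paper): it trains on a hypothetical source project's data, applies the model directly to the target release, and uses a two-sample Kolmogorov-Smirnov test to quantify how differently each metric is distributed across the two projects. DataFrames and column names are assumptions.

# A sketch of naive supervised CPDP: build the model on source-project data
# and apply it directly to the target release, while quantifying the
# source/target distribution difference per metric with a two-sample
# Kolmogorov-Smirnov test. DataFrames and column names are hypothetical.
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

METRICS = ["loc", "cbo", "wmc"]  # hypothetical common metric set

def distribution_gap(source: pd.DataFrame, target: pd.DataFrame) -> dict:
    """KS statistic per metric; larger values indicate more dissimilar
    distributions, i.e., a harder cross-project setting."""
    return {m: ks_2samp(source[m], target[m]).statistic for m in METRICS}

def naive_cpdp(source: pd.DataFrame, target: pd.DataFrame) -> float:
    """Train on the source project, predict the target release, and
    report AUC against the target release's actual defect labels."""
    model = LogisticRegression(max_iter=1000)
    model.fit(source[METRICS], source["defective"])
    scores = model.predict_proba(target[METRICS])[:, 1]
    return roc_auc_score(target["defective"], scores)

A large KS statistic on the shared metrics signals exactly the setting in which such a directly transferred model tends to perform poorly, which is what motivates the transfer and harmonization techniques surveyed later.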