the target release and the 1st, ..., (k − 1)th releases are the historical releases. The source projects consist of n projects, each having a number of releases. Each release in the target and source projects has an associated dataset, which consists of the metric data and the defect label data at the module level. The modules can be packages, files, classes, or functions, depending on the defect prediction context. The metric sets are assumed to be the same for all releases within a single project but can differ across projects. In each dataset, one instance represents a module and one feature represents a software metric extracted from the module.

From Figure 1, on the one hand, we can see that the training data consist of the external training data and the internal training data. The external training data are simply the labeled source project data. The internal training data consist of two parts: the historical training data (i.e., the labeled historical release data) and the target training data (i.e., a small amount of the labeled target release data of the target project). On the other hand, we can see that the test data consist of the labeled target release data excluding the target training data. In other words, the labeled target release data are divided into two parts: a small amount of data used as the target training data and the remaining data used as the test data. For simplicity of presentation, we use the terms test metric data and defect oracle data to denote, respectively, the metric data and the defect label data in the test data. The test metric data are fed into the model built with the training data to compute the predicted defect-proneness. Once the predicted defect-proneness of each module in the test data is obtained, it is compared against the defect oracle data to compute the prediction performance.
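The following sketch illustrates this partition of the labeled data, assuming each release's dataset is available as a pandas DataFrame whose columns are module-level metrics plus a binary defect label. The function and parameter names (assemble_training_and_test_data, target_train_fraction, and so on) are hypothetical placeholders for illustration, not names taken from the paper.

# A minimal sketch of the data layout described above (Figure 1), assuming
# pandas DataFrames. All names are hypothetical.
import pandas as pd

def assemble_training_and_test_data(source_releases, historical_releases,
                                    target_release, target_train_fraction=0.1,
                                    random_state=0):
    """Split the labeled data into training data (external + internal) and test data."""
    # External training data: the labeled data of the source projects.
    external = pd.concat(source_releases, ignore_index=True)

    # Internal training data: the labeled historical releases of the target
    # project plus a small labeled sample of the target release itself.
    target_train = target_release.sample(frac=target_train_fraction,
                                         random_state=random_state)
    internal = pd.concat(historical_releases + [target_train], ignore_index=True)

    training = pd.concat([external, internal], ignore_index=True)

    # Test data: the remaining labeled target-release data. Its metric columns
    # are the "test metric data"; its label column is the "defect oracle data".
    test = target_release.drop(target_train.index)
    return training, test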
At a high level, the application of a supervised CPDP model to a target project in practice involves three phases: data preparation, model training, and model testing. At the data preparation phase, the training data collected from different sources are preprocessed to make them appropriate for building a CPDP model for the target project. On the one hand, there is a need to address the privacy concerns of (source project) data owners, i.e., to prevent the disclosure of specific sensitive metric values in the original source project data [84, 85]. On the other hand, there is a need to preserve the utility of the privatized source project data for cross-project defect prediction. This includes dealing with heterogeneous feature sets (i.e., metric sets) between the source and target project data [27, 41, 79], filtering out irrelevant training data [39, 86, 94, 106], handling class-imbalanced training data [42, 93], making the source and target projects have a similar data distribution [13, 41, 42, 61, 80, 123], and removing irrelevant/redundant features [5, 9, 26, 44]. At the model training phase, a supervised modeling technique is used to build a CPDP model [59, 96, 108, 116, 126]; that is, a specific supervised learning algorithm is used to build a model that captures the relationship between the metrics (i.e., the independent variables) and defect-proneness (i.e., the dependent variable) in the training data. At the model testing phase, the CPDP model is applied to predict defects in the target release, and its prediction performance is evaluated under the classification or ranking scenario. In the former scenario, the modules in the test data are classified as defective or not defective. In the latter scenario, the modules in the test data are ranked from the highest to the lowest predicted defect-proneness. Under each scenario, the performance report is generated by comparing the predicted defect-proneness with the defect oracle data. Through these phases, it is expected that the supervised model can learn from the source projects the knowledge about the effect of the metrics on defects and use it to predict defects in the target release of the target project.
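As an illustration of the three phases, the following sketch builds a CPDP model with scikit-learn on the training data assembled above and evaluates it on the test data under both scenarios. The choice of z-score standardization as the only preparation step and logistic regression as the learner is purely illustrative; the surveyed approaches use many different preprocessing techniques and learners, and the indicators computed here (AUC, a simple ranking) are examples rather than the specific indicators listed in Table 1.

# A minimal sketch of the data preparation, model training, and model testing
# phases, assuming scikit-learn and the training/test DataFrames from the
# previous sketch. The preprocessing and learner are illustrative choices only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

def run_cpdp(training, test, label_col="defective"):
    X_train = training.drop(columns=[label_col]).to_numpy()
    y_train = training[label_col].to_numpy()
    X_test = test.drop(columns=[label_col]).to_numpy()   # test metric data
    y_oracle = test[label_col].to_numpy()                 # defect oracle data

    # Data preparation (illustrative): standardize the metrics so that source
    # and target data are on comparable scales; real CPDP methods also handle
    # privacy, heterogeneous metric sets, class imbalance, and feature selection.
    scaler = StandardScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    # Model training: fit a supervised learner on the prepared training data.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Model testing: predicted defect-proneness for each module in the test data.
    proneness = model.predict_proba(X_test)[:, 1]

    # Classification scenario: threshold the predicted defect-proneness.
    predicted_labels = proneness >= 0.5
    auc = roc_auc_score(y_oracle, proneness)

    # Ranking scenario: order modules from highest to lowest defect-proneness.
    ranking = np.argsort(-proneness)
    return predicted_labels, ranking, auc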
2.3 Performance Evaluation

Table 1 summarizes the prediction performance indicators involved in the existing cross-project defect prediction literature. The first column reports the scenario in which a specific indicator is used. The second, third, and fourth columns, respectively, show the name, the definition, and the