ACM Reference format:
Yuming Zhou, Yibiao Yang, Hongmin Lu, Lin Chen, Yanhui Li, Yangyang Zhao, Junyan Qian, and Baowen Xu. 2018. How Far We Have Progressed in the Journey? An Examination of Cross-Project Defect Prediction. ACM Trans. Softw. Eng. Methodol. 27, 1, Article 1 (April 2018), 51 pages. https://doi.org/10.1145/3183339

1 INTRODUCTION

A defect prediction model predicts the defect-proneness of the modules (e.g., files, classes, or functions) in a software project. Given the prediction result, a project manager can (1) classify the modules into two categories, high defect-prone or low defect-prone, or (2) rank the modules from the highest to the lowest defect-proneness. In both scenarios, the project manager can allocate more resources to inspecting or testing the high defect-prone modules, which helps to find more defects in the project if the model predicts defect-proneness accurately. For a given project, it is common to train the model on historical project data (e.g., module-level complexity metric data and defect data from previous releases of the project). Prior studies have shown that such a model predicts defects well on test data if it is trained on a sufficiently large amount of data [23]. In practice, however, sufficient training data can be difficult to obtain [129]. This is especially true for new types of projects and for projects with little collected historical data.

One way to deal with the shortage of training data is to leverage data from other projects (i.e., source projects) to build the model and then apply it to predict defects in the current project (i.e., the target project) [9]. In practice, however, accurate cross-project defect prediction (CPDP) is challenging [129]. The main reason is that the source and target project data usually exhibit significantly different distributions, which violates the assumption, required by most modeling techniques, that the training and test data follow similar distributions [61, 80, 123]. Furthermore, in many cases, the source and target project data consist of different metric sets altogether, which makes it difficult to build and apply a prediction model with regular modeling techniques [17, 27, 41, 79]. In recent years, various techniques have been proposed to address these challenges, and a large number of CPDP models have hence been developed (see Section 2.4). In particular, these CPDP models have been reported to produce promising prediction performance [27, 41, 61, 79, 80].
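To make the cross-project setting concrete, the sketch below trains a classifier on one project's module metrics and applies it to another project. It is a minimal illustration, not the paper's method: the choice of scikit-learn's logistic regression and the toy metric tables (`source_X`, `source_y`, `target_X`) are assumptions for exposition only.

```python
# A minimal sketch of the CPDP workflow described above, assuming
# scikit-learn is available; the metric tables are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical module-level data: rows are modules, columns are code
# metrics (e.g., SLOC, complexity); labels mark defective modules.
source_X = np.array([[120, 4], [300, 9], [80, 2], [560, 14]])  # source project
source_y = np.array([0, 1, 0, 1])                              # defect labels
target_X = np.array([[200, 6], [95, 3], [410, 11]])            # target project

# Train on the source project, then predict defect-proneness in the
# target project -- the basic cross-project setup.
model = LogisticRegression().fit(source_X, source_y)
defect_prob = model.predict_proba(target_X)[:, 1]

# Rank target modules from highest to lowest predicted defect-proneness,
# so inspection effort can go to the riskiest modules first.
ranking = np.argsort(-defect_prob)
print(ranking, defect_prob[ranking])
```

Note that this sketch silently assumes both projects share the same metric set and similarly distributed metric values; relaxing exactly these two assumptions is what the CPDP techniques surveyed in Section 2.4 aim to do.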
Yet most, if not all, of the existing CPDP models have not been compared against simple module size models, which are easy to implement and have shown good defect prediction performance in the literature. Over the past decades, many studies have reported that simple models based on module size (e.g., SLOC, source lines of code) can in general predict the defects in a project well [2, 49–51, 65, 102, 127, 128]. In the context of CPDP, module size in a target project can be used to build two simple module size models, ManualDown and ManualUp [49–51, 65]. The former considers a larger module more defect-prone, while the latter considers a smaller module more defect-prone. Since ManualDown and ManualUp do not require any data from source projects to build the models, they are free of the challenges posed by differing data distributions or metric sets between the source and target project data. In particular, they have a low computation cost and are easy to implement. In contrast, due to their complex modeling techniques, many existing CPDP models not only have a high computation cost but also involve a large number of parameters that must be carefully tuned. This imposes substantial barriers to applying them in practice, especially for large projects. Furthermore, previous studies show that module size has a strong confounding effect on the associations between code metrics and defect-proneness [20, 128]. El Emam et al. even reported that these associations disappeared after controlling for module size.
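Because ManualDown and ManualUp serve as the simple baselines at the heart of this study, a minimal sketch of both may be useful. It assumes only that the size (e.g., SLOC) of each target-project module is known; the function and module names are illustrative and not taken from the paper.

```python
# Minimal sketches of the two size-based baselines described above,
# assuming each target module is a (name, sloc) pair; no source-project
# data, training, or parameter tuning is needed at all.

def manual_down(modules):
    """ManualDown: larger modules are ranked as more defect-prone."""
    return sorted(modules, key=lambda m: m[1], reverse=True)

def manual_up(modules):
    """ManualUp: smaller modules are ranked as more defect-prone."""
    return sorted(modules, key=lambda m: m[1])

# Hypothetical target-project modules with their SLOC counts.
target = [("Parser.java", 560), ("Util.java", 80), ("Cache.java", 300)]
print(manual_down(target))  # inspect Parser.java first
print(manual_up(target))    # inspect Util.java first
```

Both baselines run in O(n log n) time on the target project alone, with no training data and no parameters to tune, which is what makes them such a cheap yet demanding sanity check for far more complex CPDP models.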