After building the above models, we compare the prediction effectiveness of the following two pairs of models: "S" vs. "B" and "B+S" vs. "B". To obtain an adequate and realistic comparison, we use the prediction effectiveness data generated from the following three methods:

- Cross-validation. Cross-validation is performed within the same version of a project, i.e., predicting faults in one subset using a model trained on the other complementary subsets. In our study, for a given project, we use 30 times three-fold cross-validation to evaluate the effectiveness of the prediction models. More specifically, at each three-fold cross-validation, we randomly divide the data set into three parts of approximately equal size. Each part is used to compute the effectiveness of the prediction models built on the remainder of the data set. The entire process is then repeated 30 times to alleviate possible sampling bias in random splits. Consequently, each model has 30 × 3 = 90 prediction effectiveness values. Note that we choose to perform three-fold rather than 10-fold cross-validation due to the small percentage of post-release faulty functions in the data sets (a minimal sketch of this protocol is given after this list).

- Across-version prediction. Across-version prediction uses a model trained on earlier versions to predict faults in later versions within the same project. There are two kinds of approaches for across-version prediction [50]. The first approach is next-version prediction, i.e., building a prediction model on a version i and then only applying the model to predict faults in the next version i + 1 of the same project. The second approach is follow-up-version prediction, i.e., building a prediction model on a version i and then applying the model to predict faults in any follow-up version j (i.e., j > i) of the same project. In our study, we adopt both approaches. If a project has m versions, the first approach will produce m − 1 prediction effectiveness values for each model, while the second approach will produce m × (m − 1)/2 prediction effectiveness values for each model.

- Across-project prediction. Across-project prediction uses a model trained on one project to predict faults in another project [50]. Given n projects, this prediction method will produce n × (n − 1) prediction effectiveness values for each model.
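To make the cross-validation protocol concrete, the following Python sketch shows how the 30 × 3 = 90 effectiveness values for one model could be collected. It assumes scikit-learn is available; `build_model` and `effectiveness` are hypothetical placeholders (not from the paper) standing in for fitting a fault-proneness model ("B", "S", or "B+S") and scoring it on the held-out fold.

```python
# Minimal sketch of 30-times three-fold cross-validation (assumptions noted above).
import numpy as np
from sklearn.model_selection import KFold

def repeated_cv_effectiveness(X, y, build_model, effectiveness,
                              n_repeats=30, n_folds=3, seed=0):
    """Collect the 30 x 3 = 90 prediction effectiveness values for one model."""
    scores = []
    for rep in range(n_repeats):
        # Randomly divide the data set into three parts of roughly equal size.
        kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed + rep)
        for train_idx, test_idx in kf.split(X):
            # Train on two folds, evaluate on the held-out fold.
            model = build_model(X[train_idx], y[train_idx])
            scores.append(effectiveness(model, X[test_idx], y[test_idx]))
    return np.asarray(scores)  # length: n_repeats * n_folds
```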
In each of the above-mentioned three prediction settings, all models use the same training data and the same testing data. Based on these setups, we employ the Wilcoxon signed-rank test to examine whether two models have a significant difference in prediction effectiveness. In particular, we use the Benjamini-Hochberg (BH) corrected p-values to examine whether a difference is significant at the significance level of 0.10. The null hypothesis H3₀ corresponding to RQ3 will be rejected when the comparison shows that the "S" model outperforms the "B" model and the difference is significant. The null hypothesis H4₀ corresponding to RQ4 will be rejected when the comparison shows that the "B+S" model outperforms the "B" model and the difference is significant. Furthermore, we use Cliff's δ, which is used for median comparison, to examine whether the magnitude of the difference between the prediction performances of two models is important from the viewpoint of practical application [34]. By convention, the magnitude of the difference is considered either trivial (|δ| < 0.147), small (0.147-0.33), moderate (0.33-0.474), or large (> 0.474) [58].
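The paired comparison between two models can be sketched as follows, assuming SciPy is available. The BH correction is applied across all such comparisons (e.g., via statsmodels' multipletests) and is omitted here; `cliffs_delta` is a simple helper written for illustration, not code from the paper.

```python
# Sketch of the paired comparison between two models' effectiveness values
# (e.g., "B+S" vs. "B" evaluated on the same splits).
import numpy as np
from scipy.stats import wilcoxon

def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all value pairs."""
    a, b = np.asarray(a), np.asarray(b)
    greater = (a[:, None] > b[None, :]).sum()
    less = (a[:, None] < b[None, :]).sum()
    return (greater - less) / (a.size * b.size)

def compare_models(eff_new, eff_base):
    # Paired Wilcoxon signed-rank test on effectiveness values from the same splits.
    _, p_value = wilcoxon(eff_new, eff_base)
    delta = abs(cliffs_delta(eff_new, eff_base))
    magnitude = ("trivial" if delta < 0.147 else
                 "small" if delta < 0.33 else
                 "moderate" if delta < 0.474 else "large")
    return p_value, delta, magnitude
```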
We test the null hypotheses H3₀ and H4₀ in the following two typical application scenarios: ranking and classification. In the ranking scenario, functions are ranked in order from the most to the least predicted relative risk. With this ranking list in hand, software practitioners can simply select as many high-risk functions targeted for software quality enhancement as available resources will allow. In the classification scenario, functions are first classified into two categories in terms of their predicted relative risk: high-risk and low-risk. After that, those functions classified as high-risk are targeted for software quality enhancement. In both scenarios, we take into account the effort to test or inspect those functions predicted as high-risk when evaluating the prediction effectiveness of a model. Following previous work [34], we use the source lines of code in a function f as a proxy to estimate the effort required to test or inspect the function. In particular, we define the relative risk of the function f as R(f) = Pr/SLOC(f), where Pr is the probability that the function f is faulty, as predicted by the logistic regression model. In other words, R(f) can be regarded as the predicted fault-proneness per SLOC. In the context of effort-aware fault-proneness prediction, prior studies used defect density [35], [36], [37], i.e., #Error(f)/SLOC(f), as the dependent variable to build the prediction model. In this study, we first use the binary dependent variable to build the logistic regression model and then use R(f) to estimate the relative risk of a given function f. Next, we describe the effort-aware prediction performance indicators used in this study for ranking and classification.
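As an illustration, R(f) = Pr/SLOC(f) could be computed from a fitted classifier as sketched below. The use of scikit-learn's predict_proba interface is an assumption made for the sketch, not a detail taken from the paper.

```python
# Sketch of ranking functions by predicted fault-proneness per SLOC.
import numpy as np

def relative_risk_ranking(model, X, sloc):
    """Return R(f) = Pr(faulty)/SLOC(f) and the ranking from most to least risky."""
    pr_faulty = model.predict_proba(X)[:, 1]   # predicted probability of being faulty
    risk = pr_faulty / np.maximum(sloc, 1)     # guard against zero-SLOC entries
    order = np.argsort(-risk)                  # indices sorted by decreasing risk
    return risk, order
```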
(1) Effort-aware ranking performance evaluation. We use the cost-effectiveness measure CE proposed by Arisholm et al. [34] to evaluate the effort-aware ranking effectiveness of a fault-proneness prediction model. The CE measure is based on the concept of the "SLOC-based" Alberg diagram. In this diagram, the x-axis is the cumulative percentage of SLOC of the functions selected from the function ranking and the y-axis is the cumulative percentage of post-release faults found in the selected functions. Consequently, each fault-proneness prediction model corresponds to a curve in the diagram. Fig. 1 is an example "SLOC-based" Alberg diagram showing the ranking performance of a prediction model m (in our context, the prediction model m could be the "B" model, the "S" model, or the "B+S" model). To compute CE, we also consider two additional curves, which respectively correspond to the "random" model and the "optimal" model. In the "random" model, functions are randomly selected to test or inspect. In the "optimal" model, functions are sorted in decreasing order according to their actual post-release fault densities. Based on this diagram, the effort-aware ranking effectiveness of the prediction model m is defined as follows [34]:

CE_π(m) = [Area_π(m) − Area_π(random model)] / [Area_π(optimal model) − Area_π(random model)]

Here, Area_π(m) is the area under the curve corresponding to model m for a given top π × 100% of SLOC.
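A minimal sketch of computing CE_π(m) from this definition is given below. It assumes the random model's curve is the diagonal y = x in expectation and uses a simple cut-off at π, so it is an approximation for illustration rather than the paper's exact implementation.

```python
# Sketch of CE_pi(m) computed from the SLOC-based Alberg diagram.
import numpy as np

def area_under_alberg(order, faults, sloc, pi):
    """Area under the cumulative-%-faults vs. cumulative-%-SLOC curve up to pi."""
    x = np.cumsum(sloc[order]) / sloc.sum()      # cumulative percentage of SLOC
    y = np.cumsum(faults[order]) / faults.sum()  # cumulative percentage of faults
    x, y = np.insert(x, 0, 0.0), np.insert(y, 0, 0.0)
    keep = x <= pi                               # simple cut-off at the top pi of SLOC
    return np.trapz(y[keep], x[keep])

def cost_effectiveness(risk, faults, sloc, pi=0.2):
    faults, sloc = np.asarray(faults, float), np.asarray(sloc, float)
    area_model = area_under_alberg(np.argsort(-np.asarray(risk)), faults, sloc, pi)
    # Optimal model: sort by actual post-release fault density.
    density = faults / np.maximum(sloc, 1.0)
    area_optimal = area_under_alberg(np.argsort(-density), faults, sloc, pi)
    area_random = pi ** 2 / 2.0  # expected curve of the random model is y = x
    return (area_model - area_random) / (area_optimal - area_random)
```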