Are Slice-Based Cohesion Metrics Actually Useful in Effort-Aware Post-Release Fault-Proneness Prediction? An Empirical Study

Yibiao Yang, Yuming Zhou, Hongmin Lu, Lin Chen, Zhenyu Chen, Member, IEEE, Baowen Xu, Hareton Leung, Member, IEEE, and Zhenyu Zhang, Member, IEEE

Abstract—Background. Slice-based cohesion metrics leverage program slices with respect to the output variables of a module to quantify the strength of functional relatedness of the elements within the module. Although slice-based cohesion metrics have been proposed for many years, few empirical studies have been conducted to examine their actual usefulness in predicting fault-proneness. Objective. We aim to provide an in-depth understanding of the ability of slice-based cohesion metrics in effort-aware post-release fault-proneness prediction, i.e. their effectiveness in helping practitioners find post-release faults when taking into account the effort needed to test or inspect the code. Method. We use the most commonly used code and process metrics, including size, structural complexity, Halstead's software science, and code churn metrics, as the baseline metrics. First, we employ principal component analysis to analyze the relationships between slice-based cohesion metrics and the baseline metrics. Then, we use univariate prediction models to investigate the correlations between slice-based cohesion metrics and post-release fault-proneness. Finally, we build multivariate prediction models to examine the effectiveness of slice-based cohesion metrics in effort-aware post-release fault-proneness prediction when used alone or together with the baseline code and process metrics. Results. Based on open-source software systems, our results show that: 1) slice-based cohesion metrics are not redundant with respect to the baseline code and process metrics; 2) most slice-based cohesion metrics are significantly negatively related to post-release fault-proneness; 3) slice-based cohesion metrics in general do not outperform the baseline metrics when predicting post-release fault-proneness; and 4) when used together with the baseline metrics, however, slice-based cohesion metrics can produce a statistically significant and practically important improvement of the effectiveness in effort-aware post-release fault-proneness prediction. Conclusion. Slice-based cohesion metrics are complementary to the most commonly used code and process metrics and are of practical value in the context of effort-aware post-release fault-proneness prediction.

Index Terms—Cohesion, metrics, slice-based, fault-proneness, prediction, effort-aware

Y. Yang, Y. Zhou, H. Lu, L. Chen, and B. Xu are with the State Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China. E-mail: yangyibiao@smail.nju.edu.cn, {zhouyuming, hmlu, lchen, bwxu}@nju.edu.cn.
Z. Chen is with the State Key Laboratory for Novel Software Technology, School of Software, Nanjing University, Nanjing 210023, China. E-mail: zychen@software.nju.edu.cn.
H. Leung is with the Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong, China. E-mail: cshleung@inet.polyu.edu.hk.
Z. Zhang is with the State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China. E-mail: zhangzy@ios.ac.cn.

Manuscript received 16 Feb. 2013; revised 24 Oct. 2014; accepted 29 Oct. 2014. Date of publication 11 Nov. 2014; date of current version 17 Apr. 2015. Recommended for acceptance by T. Menzies. Digital Object Identifier no. 10.1109/TSE.2014.2370048.

1 INTRODUCTION

COHESION refers to the relatedness of the elements within a module [1], [2]. A highly cohesive module is one in which all elements work together towards a single function. Highly cohesive modules are desirable in a system as they are easier to develop, maintain, and reuse, and hence are less fault-prone [1], [2]. For software developers, it is therefore desirable to automatically identify modules with low cohesion as targets for software quality enhancement. However, cohesion is a subjective concept and hence is difficult to use in practice [14]. To address this problem, program slicing is applied to develop quantitative cohesion metrics, as it provides a means of accurately quantifying the interactions between the elements within a module [12].
In the last three decades, many slice-based cohesion metrics have been developed to quantify the degree of cohesion in a module at the function level of granularity [3], [4], [5], [6], [7], [8], [9], [10]. For a given function, the computation of a slice-based cohesion metric consists of two steps. In the first step, a program reduction technique called program slicing is employed to obtain the set of program statements (i.e. the program slice) that may affect each output variable of the function [9], [11]. The output variables include the function return value, modified global variables, modified reference parameters, and variables printed or otherwise output by the function [12]. In the second step, cohesion is computed by leveraging the commonality among the slices with respect to the different output variables.

Previous studies showed that slice-based cohesion metrics provide an excellent quantitative measure of cohesion [3], [13], [14]. Hence, there is reason to believe that they should be useful predictors for fault-proneness. However, few empirical studies have so far been conducted to examine the actual usefulness of slice-based cohesion metrics for predicting fault-proneness, especially compared with
the most commonly used code and process metrics [5], [15], [16], [17], [18].

In this paper, we perform a thorough empirical investigation into the ability of slice-based cohesion metrics in the context of effort-aware post-release fault-proneness prediction, i.e. their effectiveness in helping practitioners find post-release faults when taking into account the effort needed to test or inspect the code [35]. In our study, we use the most commonly used code and process metrics, including size, structural complexity, Halstead's software science, and code churn metrics, as the baseline metrics. We first employ principal component analysis (PCA) to analyze the relationships between slice-based cohesion metrics and the baseline code and process metrics. Then, we build univariate prediction models to investigate the correlations between slice-based cohesion metrics and post-release fault-proneness. Finally, we build multivariate prediction models to examine the effectiveness of slice-based cohesion metrics in effort-aware post-release fault-proneness prediction when used alone or together with the baseline code and process metrics. In order to obtain comprehensive performance evaluations, we evaluate the effectiveness of effort-aware post-release fault-proneness prediction under the following three prediction settings: cross-validation, across-version prediction, and across-project prediction. More specifically, cross-validation is performed within the same version of a project, i.e. predicting faults in one subset using a model trained on the other complementary subsets. Across-version prediction uses a model trained on earlier versions to predict faults in later versions within the same project, while across-project prediction uses a model trained on one project to predict faults in another project. The subject projects in our study consist of five well-known open-source C projects: Bash, Gcc-core, Gimp, Subversion, and Vim. We use a mature commercial tool called Understand (www.scitools.com) to collect the baseline code and process metrics and use a powerful source code analysis tool called Frama-C to collect slice-based cohesion metrics [57]. Based on the data collected from these five projects, we attempt to answer the following four research questions:

RQ1. Are slice-based cohesion metrics redundant with respect to the most commonly used code and process metrics?
RQ2. Are slice-based cohesion metrics statistically significantly correlated to post-release fault-proneness?
RQ3. Are slice-based cohesion metrics more effective than the most commonly used code and process metrics in effort-aware post-release fault-proneness prediction?
RQ4. When used together with the most commonly used code and process metrics, can slice-based cohesion metrics significantly improve the effectiveness of effort-aware post-release fault-proneness prediction?

The purpose of RQ1 and RQ2 is to investigate whether slice-based cohesion metrics are potentially useful post-release fault-proneness predictors. The purpose of RQ3 and RQ4 is to investigate whether slice-based cohesion metrics can lead to significant improvements in effort-aware post-release fault-proneness prediction. These research questions are critically important to both software researchers and practitioners, as they help to answer whether slice-based cohesion metrics are of practical value in view of the extra cost involved in data collection. However, little is currently known on this subject.
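To make the PCA step concrete, the sketch below shows one way to examine whether the cohesion metrics load on principal components distinct from those dominated by the baseline metrics. It is a minimal illustration only, not the paper's actual analysis pipeline: the input file name and column names are hypothetical placeholders, and details such as rotation and component-retention criteria may differ from the study's.

    # A minimal PCA sketch (illustrative; file and column names are
    # hypothetical). Each row of the table is one function with its
    # slice-based cohesion metrics and baseline code/process metrics.
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    metrics = pd.read_csv("function_metrics.csv")
    cohesion_cols = ["Coverage", "MaxCoverage", "MinCoverage", "Overlap",
                     "Tightness", "WFC", "NHD", "SBFC"]
    baseline_cols = ["SLOC", "CyclomaticComplexity", "n1", "n2", "N1", "N2",
                     "AddedLOC", "DeletedLOC", "ModifiedLOC"]

    # Standardize first: the metrics mix [0, 1] ratios with raw counts.
    X = StandardScaler().fit_transform(metrics[cohesion_cols + baseline_cols])
    pca = PCA(n_components=0.95)  # keep components explaining 95% of variance
    pca.fit(X)

    # If cohesion metrics dominate components on which the baseline metrics
    # load weakly, the two metric families capture distinct dimensions.
    loadings = pd.DataFrame(pca.components_.T,
                            index=cohesion_cols + baseline_cols)
    print(loadings.round(2))

Standardizing matters here because the metrics live on very different scales; without it, large-variance counts such as SLOC would dominate every component.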
Our study attempts to fill this gap with a comprehensive investigation into the actual usefulness of slice-based cohesion metrics in the context of effort-aware post-release fault-proneness prediction.

The contributions of this paper are as follows. First, we compare slice-based cohesion metrics with the most commonly used code and process metrics, including size, structural complexity, Halstead's software science metrics, and code churn metrics. The results show that slice-based cohesion metrics measure essentially different quality information than the baseline code and process metrics do. This indicates that slice-based cohesion metrics are not redundant with respect to the most commonly used code and process metrics. Second, we validate the correlations between slice-based cohesion metrics and fault-proneness. The results show that most slice-based cohesion metrics are statistically related to fault-proneness in the expected direction. Third, we analyze the effectiveness of slice-based cohesion metrics in effort-aware post-release fault-proneness prediction compared with the most commonly used code and process metrics. The results, somewhat surprisingly, show that slice-based cohesion metrics in general do not outperform the most commonly used code and process metrics. Fourth, we investigate whether combining slice-based cohesion metrics with the most commonly used code and process metrics provides better results in predicting fault-proneness. The results show that the inclusion of slice-based cohesion metrics can produce a statistically significant improvement of the effectiveness in effort-aware post-release fault-proneness prediction under any of the three prediction settings. In particular, in the ranking scenario, when testing or inspecting 20 percent of the code of a system, slice-based cohesion metrics lead to a moderate to large improvement (Cliff's δ: 0.33-1.00), regardless of which prediction setting is considered. In the classification scenario, they lead to a moderate to large improvement (Cliff's δ: 0.31-0.77) in most systems under cross-validation and to a large improvement (Cliff's δ: 0.55-0.72) under across-version prediction. In summary, these results reveal that the improvement is practically important for practitioners and is worth the relatively high time cost of collecting slice-based cohesion metrics. In other words, for practitioners, slice-based cohesion metrics are of practical value in the context of effort-aware post-release fault-proneness prediction. Our study provides valuable data in an important area for which there is otherwise limited experimental data available.

The rest of this paper is organized as follows. Section 2 introduces slice-based cohesion metrics and the most commonly used code and process metrics that we will investigate. Section 3 gives the research hypotheses on slice-based cohesion metrics, introduces the investigated dependent and independent variables, presents the employed modeling technique, and describes the data analysis methods. Section 4 describes the experimental setup in our study, including the data sources and the method we used to
collect the experimental data sets. Section 5 reports the experimental results in detail. Section 6 examines the threats to the validity of our study. Section 7 discusses the related work. Section 8 concludes the paper and outlines directions for future work.

2 THE METRICS

In this section, we first describe the slice-based cohesion metrics investigated in this study. Then, we describe the most commonly used code and process metrics against which slice-based cohesion metrics will be compared when analyzing their actual usefulness in effort-aware post-release fault-proneness prediction.

2.1 Slice-Based Cohesion Metrics

The origin of slice-based cohesion metrics can be traced back to Weiser, who used backward slicing to describe the concepts of coverage, overlap, and tightness [9], [19]. A backward slice of a module at statement n with respect to variable v is the sequence of all statements and predicates that might affect the value of v at n [9], [19]. For a given module, Weiser first sliced on every variable where it occurred in the module. Then, Weiser computed Coverage as the ratio of the average slice size to the program size, Overlap as the average ratio of non-unique to unique statements in each slice, and Tightness as the percentage of statements common to all slices. As stated by Ott and Bieman [10], however, Weiser "did not identify actual software attributes these metrics might meaningfully measure", although such metrics were helpful for observing the structuring of a module.

Longworth [7] demonstrated that Coverage, a modified definition of Overlap (i.e. the average ratio of the size of non-unique statements to slice size), and Tightness could be used as cohesion metrics of a module. In particular, Longworth sliced on every variable once at the end point of the module to obtain end slices (i.e. backward slices computed from the end of a module) and then used them to compute these metrics. Later, Ott and Thuss [3] improved the behavior of slice-based cohesion metrics through the use of metric slices on output variables. A metric slice takes into account both the "uses" and "used by" data relationships [3]. More specifically, a metric slice with respect to variable v is the union of the backward slice with respect to v at the end point of the module and the forward slice computed from the definitions of v in the backward slice. A forward slice of a module at statement n with respect to variable v is the sequence of all statements and predicates that might be affected by the value of v at n. Ott and Thuss argued that the purpose of executing a module is indicated by its output variables, including function return values, modified global variables, printed variables, and modified reference parameters. Furthermore, the slices on the output variables of a module capture the specific computations for the tasks that the module performs. Therefore, the relationships among the slices on output variables can be used to investigate whether the module's tasks are related, i.e. whether the module is cohesive. They redefined Overlap as the average ratio of the slice interaction size to slice size and added MinCoverage and MaxCoverage to the metrics suite. MinCoverage and MaxCoverage are respectively the ratio of the size of the smallest slice to the module size and the ratio of the size of the largest slice to the module size. Consequently, the slice-based cohesion metrics suite proposed by Ott and Thuss consists of five metrics: Coverage, Overlap, Tightness, MinCoverage, and MaxCoverage.
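The five Ott-Thuss metrics follow directly from the slice sets. The sketch below (our illustration, not the authors' tooling) models each slice as a set of statement numbers and computes the statement-level metrics exactly as defined above; the toy module and its slices are hypothetical.

    # Minimal sketch of the statement-level Ott-Thuss metrics. Each slice
    # is a set of statement numbers; `length` is the number of executable
    # statements in the module. The toy data below is illustrative.
    def slice_metrics(slices, length):
        sizes = [len(s) for s in slices]
        cohesive = set.intersection(*slices)  # the "cohesive section"
        return {
            "Coverage":    sum(sizes) / (len(slices) * length),
            "MaxCoverage": max(sizes) / length,
            "MinCoverage": min(sizes) / length,
            "Overlap":     sum(len(cohesive) / n for n in sizes) / len(slices),
            "Tightness":   len(cohesive) / length,
        }

    # Toy module with 10 statements and slices for two output variables.
    slice_a = {1, 2, 3, 4, 5, 6, 7}
    slice_b = {1, 2, 3, 8, 9, 10}
    print(slice_metrics([slice_a, slice_b], length=10))

On the toy slices the "cohesive section" is {1, 2, 3}, giving, for example, Tightness = 3/10 = 0.3 and Coverage = (7 + 6)/(2 × 10) = 0.65.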
Note that these metrics are computed at the statement level, i.e. statements are the basic unit of metric slices. Ott and Bieman [20] refined the concept of metric slices to use data tokens (i.e. the definitions of and references to variables and constants) rather than statements as the basic unit of which slices are composed. They called such slices data slices. More specifically, a data slice for a variable v is the sequence of all data tokens in the statements that comprise the metric slice of v. This leads to five slice-based data-token-level cohesion metrics.

Bieman and Ott [4] used data slices to develop three cohesion metrics: SFC (strong functional cohesion), WFC (weak functional cohesion), and A (Adhesiveness). They defined the slice abstraction of a module as the set of data slices with respect to its output variables. In particular, a data token is called a "glue token" if it lies on more than one data slice, and is called a "super-glue token" if it lies on all data slices in the slice abstraction. As such, SFC is defined as the ratio of the number of super-glue tokens to the total number of data tokens in the module. WFC is defined as the ratio of the number of glue tokens to the total number of data tokens in the module. A is defined as the average adhesiveness of the data tokens in the module. The adhesiveness of a data token is the relative number of slices that it glues together. If a data token is a glue token, its adhesiveness is the ratio of the number of slices that it appears in to the total number of slices; otherwise, its adhesiveness is zero. Indeed, SFC is equivalent to the data-token-level Tightness metric and A is equivalent to the data-token-level Coverage metric proposed by Ott and Bieman [20].

Counsell et al. [5] proposed a cohesion metric called normalized Hamming distance (NHD) based on the concept of a slice occurrence matrix. For a given module, the slice occurrence matrix has columns indexed by its output variables and rows indexed by its statements. The (i, j)th entry of the matrix has a value of 1 if the ith statement is in the end slice with respect to the jth output variable, and 0 otherwise. In this matrix, each row is called a slice occurrence vector. NHD is defined as the ratio of the total actual slice agreement between rows to the total possible agreement between rows in the matrix. The slice agreement between two rows is the number of places in which the slice occurrence vectors of the two rows are equal.

Dallal [8] used a data-token-level slice occurrence matrix to develop a cohesion metric called the similarity-based functional cohesion metric (SBFC). For a given module, the data-token-level slice occurrence matrix has columns indexed by its output variables and rows indexed by its data tokens. The (i, j)th entry of the matrix has a value of 1 if the ith data token is in the end slice with respect to the jth output variable, and 0 otherwise. SBFC is defined as the average degree of the normalized similarity between columns. The normalized similarity between a pair of columns is the ratio of the number of entries where both columns have a value of 1 to the total number of rows in the matrix.

Table 1 summarizes the formal definitions, descriptions, and sources of the slice-based cohesion metrics that will be investigated in this study. In this table, for a given module
M, V_o denotes the set of its output variables, length(M) denotes its size, SA(M) denotes its slice abstraction, and tokens(M) denotes the set of its data tokens. SL_i is the slice obtained for v_i ∈ V_o, and SL_int (called the "cohesive section" by Harman et al. [6]) is the intersection of the SL_i over all v_i ∈ V_o. In particular, G(SA(M)) and SG(SA(M)) are respectively the set of glue tokens and the set of super-glue tokens. In the definition of NHD, k is the number of statements, l is the number of output variables, and c_j is the number of 1s in the jth column of the statement-level slice occurrence matrix. In the definition of SBFC, x_i is the number of 1s in the ith row of the data-token-level slice occurrence matrix.

Note that all the slice-based cohesion metrics can be computed at either the statement level or the data-token level, although each of them was originally defined at one of the two levels. The data-token level is at a finer granularity than the statement level, since a statement might contain a number of data tokens. We next use the example function fun shown in Table 2 to illustrate the computations of the slice-based cohesion metrics at the data-token level. In Table 2: (1) the first column lists the statement number (excluding non-executable statements such as blank statements, "{", and "}"); (2) the second column lists the code of the example function; (3) the third to fifth columns respectively list the end slices for the largest, smallest, and range variables; (4) the sixth to eighth columns respectively list the forward slices from the definitions of the largest, smallest, and range variables in the backward slices; and (5) the ninth to eleventh columns list the metric slices for the largest, smallest, and range variables. Here, a vertical bar "|" in the last nine columns denotes that the indicated statement is part of the corresponding slice for the named output variable. This example function determines the smallest, the largest, and the range of an array; it is a modified version of the example module used by Longworth [7]. For this example, V_o consists of largest, smallest, and range. The former two variables are modified reference parameters and the latter is the function return value. Table 3 shows the data-token-level slice occurrence matrix of the fun function under end slices and metric slices, where T_i denotes the ith data token for T in the function.

Table 4 shows the computations of the twenty data-token-level slice-based cohesion metrics. In this table, the second to eleventh rows show the computations for the end-slice-based cohesion metrics and the 12th to 21st rows show the computations for the metric-slice-based cohesion metrics. As can be seen, the end-slice-based metrics have typical values around 0.5 or 0.6, while the metric-slice-based metrics have typical values around 0.7 or 0.8. In particular, for each cohesion metric (except MaxCoverage), the metric-slice-based version has a considerably larger value than the corresponding end-slice-based version.
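To make the occurrence-matrix metrics concrete, the sketch below implements NHD and SBFC directly from their definitions (see Table 1); the toy matrix is illustrative and is not the actual data of Table 3.

    # Minimal sketch of NHD and SBFC over a slice occurrence matrix.
    # For NHD the rows index statements; for SBFC they index data tokens.
    # Columns index output variables. The toy matrix is illustrative.
    def nhd(matrix):
        k, l = len(matrix), len(matrix[0])  # statements, output variables
        col_sums = [sum(row[j] for row in matrix) for j in range(l)]  # c_j
        return 1 - 2 * sum(c * (k - c) for c in col_sums) / (l * k * (k - 1))

    def sbfc(matrix):
        n, v = len(matrix), len(matrix[0])  # data tokens, output variables
        if v == 1:
            return 1.0
        row_sums = [sum(row) for row in matrix]  # x_i
        return sum(x * (x - 1) for x in row_sums) / (n * v * (v - 1))

    occurrence = [
        [1, 1, 1],  # token (or statement) lying on all three slices
        [1, 0, 1],
        [0, 1, 1],
        [1, 1, 1],
    ]
    print(nhd(occurrence), sbfc(occurrence))  # both evaluate to 0.667 here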
TABLE 1
Definitions of Slice-Based Cohesion Metrics

Coverage. Definition: $\mathit{Coverage} = \frac{1}{|V_o|} \sum_{i=1}^{|V_o|} \frac{|SL_i|}{\mathit{length}(M)}$. Description: the extent to which the slices cover the module (measured as the ratio of the mean slice size to the module size). Source: [3], [20].

MaxCoverage. Definition: $\mathit{MaxCoverage} = \frac{1}{\mathit{length}(M)} \max_i |SL_i|$. Description: the extent to which the largest slice covers the module (measured as the ratio of the size of the largest slice to the module size). Source: [3], [20].

MinCoverage. Definition: $\mathit{MinCoverage} = \frac{1}{\mathit{length}(M)} \min_i |SL_i|$. Description: the extent to which the smallest slice covers the module (measured as the ratio of the size of the smallest slice to the module size). Source: [3], [20].

Overlap. Definition: $\mathit{Overlap} = \frac{1}{|V_o|} \sum_{i=1}^{|V_o|} \frac{|SL_{int}|}{|SL_i|}$. Description: the extent to which slices are interdependent (measured as the average ratio of the size of the "cohesive section" to the size of each slice). Source: [3], [20].

Tightness. Definition: $\mathit{Tightness} = \frac{|SL_{int}|}{\mathit{length}(M)}$. Description: the extent to which all the slices in the module belong together (measured as the ratio of the size of the "cohesive section" to the module size). Source: [3], [20].

SFC. Definition: $\mathit{SFC} = \frac{|SG(SA(M))|}{|\mathit{tokens}(M)|}$. Description: the extent to which all the slices in the module belong together (measured as the ratio of the number of super-glue tokens to the total number of data tokens of the module). Source: [4].

WFC. Definition: $\mathit{WFC} = \frac{|G(SA(M))|}{|\mathit{tokens}(M)|}$. Description: the extent to which the slices in the module belong together (measured as the ratio of the number of glue tokens to the total number of data tokens of the module). Source: [4].

A. Definition: $A = \frac{\sum_{t \in G(SA(M))} |\text{slices containing } t|}{|\mathit{tokens}(M)| \times |SA(M)|}$. Description: the extent to which the glue tokens in the module are adhesive (measured as the ratio of the amount of adhesiveness to the total possible adhesiveness). Source: [4].

NHD. Definition: $\mathit{NHD} = 1 - \frac{2}{lk(k-1)} \sum_{j=1}^{l} c_j (k - c_j)$. Description: the extent to which the statements in the slices are the same (measured as the ratio of the total slice agreement between rows to the total possible agreement between rows in the statement-level slice occurrence matrix of the module). Source: [5].

SBFC. Definition: $\mathit{SBFC} = 1$ if $|V_o| = 1$, and $\mathit{SBFC} = \frac{\sum_{i=1}^{|\mathit{tokens}(M)|} x_i (x_i - 1)}{|\mathit{tokens}(M)| \, |V_o| \, (|V_o| - 1)}$ otherwise. Description: the extent to which the slices are similar (measured as the average degree of the normalized similarity between columns in the data-token-level slice occurrence matrix of the module). Source: [8].
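The glue-token metrics of Bieman and Ott can be sketched in the same style. The snippet below (illustrative only; the token names are made up) computes SFC, WFC, and A from a slice abstraction modeled as one set of data tokens per output variable.

    # Minimal sketch of SFC, WFC, and A as defined in Table 1.
    # `all_tokens` is tokens(M) and may include tokens lying on no slice;
    # the slice abstraction SA(M) is a list of data-token sets.
    def glue_token_metrics(all_tokens, slice_abstraction):
        n_slices = len(slice_abstraction)                  # |SA(M)|
        counts = {t: sum(t in s for s in slice_abstraction)
                  for t in all_tokens}                     # slices containing t
        glue = [t for t in all_tokens if counts[t] > 1]    # G(SA(M))
        super_glue = [t for t in all_tokens if counts[t] == n_slices]
        return {
            "SFC": len(super_glue) / len(all_tokens),
            "WFC": len(glue) / len(all_tokens),
            "A": sum(counts[t] for t in glue) / (len(all_tokens) * n_slices),
        }

    tokens = {"a1", "b1", "c1", "d1", "e1", "f1"}          # f1 lies on no slice
    slices = [{"a1", "b1", "c1"}, {"a1", "b1", "d1"}, {"a1", "e1"}]
    print(glue_token_metrics(tokens, slices))

Here a1 is a super-glue token and b1 a glue token, so SFC = 1/6, WFC = 2/6, and A = (3 + 2)/(6 × 3) ≈ 0.28.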
When looking at the example function fun shown in Table 2, we find that, except for an unnecessary initialization statement (statement 8 in Table 2: range = 0;), all the remaining statements are related to the computation of the final outputs. In other words, intuitively, this function has high cohesion. In this sense, when measuring its cohesion, it appears that metric-slice-based cohesion metrics are more accurate than end-slice-based cohesion metrics.

As mentioned above, Coverage, MaxCoverage, MinCoverage, Overlap, Tightness, SFC, WFC, and A are originally based on metric slices [4], [6], [13], [15], [16], [17], [18], [21]. However, NHD and SBFC are originally based on end slices. In this study, we will use metric slices to compute all the cohesion metrics. In particular, we will use metric-slice-based cohesion metrics at the data-token level to investigate the actual usefulness of slice-based cohesion metrics in effort-aware post-release fault-proneness prediction. The reason for choosing the data-token level rather than the statement level is that the former is at a finer granularity. Previous studies suggested that software metrics at a finer granularity accordingly have a higher discriminative power and hence may be more useful for fault-proneness prediction [62], [63]. Note that, at the data-token level, SFC is equivalent to Tightness and A is equivalent to Coverage. Therefore, in the subsequent analysis, only the following eight metric-slice-based cohesion metrics will be examined: Coverage, MaxCoverage, MinCoverage, Overlap, Tightness, WFC, NHD, and SBFC. During our analysis, a function is regarded as a module, and the output variables of a function consist of the function return value, modified global variables, modified reference parameters, and standard outputs by the function.

2.2 The Most Commonly Used Code and Process Metrics

In this study, we employ the most commonly used code and process metrics as the baseline metrics to analyze the actual usefulness of slice-based cohesion metrics in effort-aware post-release fault-proneness prediction. As shown in Table 5, the baseline code and process metrics cover 16 product metrics and three process metrics. The 16 product metrics consist of one size metric, 11 structural complexity metrics, and four software science metrics. The size metric SLOC simply counts the non-blank, non-commentary source lines of code in a function. There is a common belief that a function with a larger size tends to be more fault-prone [22], [23], [24], [25]. The structural complexity metrics, including the well-known McCabe's Cyclomatic complexity metrics, assume that a function with a complex control flow structure is likely to be fault-prone [26], [27], [28], [29]. The Halstead's software science metrics estimate reading complexity based on the counts of operators and operands, in which a function that is hard to read is assumed to be fault-prone [30].
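As a rough illustration of the four Halstead base counts used in this study (n1, n2, N1, and N2, i.e. the numbers of distinct and total operators and operands), the sketch below derives them from a naive token scan. It is only a sketch under simplifying assumptions (a tiny operator set and a crude tokenizer); in the study itself these metrics were collected with the Understand tool.

    # Rough sketch of the Halstead base counts: n1/n2 are the numbers of
    # distinct operators/operands, N1/N2 their total occurrences. The
    # tokenizer and operator set are deliberately simplistic.
    import re

    OPERATORS = {"+", "-", "*", "/", "=", "<", ">", "==", "!=", "++",
                 "if", "while", "return"}

    def halstead_counts(code):
        tokens = re.findall(r"[A-Za-z_]\w*|[-+*/=<>!]+|\d+", code)
        operators = [t for t in tokens if t in OPERATORS]
        operands = [t for t in tokens if t not in OPERATORS]
        return {"n1": len(set(operators)), "n2": len(set(operands)),
                "N1": len(operators), "N2": len(operands)}

    print(halstead_counts("range = *largest - *smallest; return range;"))

The derived Halstead metrics (e.g., vocabulary n = n1 + n2 and length N = N1 + N2) follow arithmetically from these four counts, which is why, as discussed below, the study keeps only the base counts.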
TABLE 2
End Slice Profile and Metric Slice Profile for Function fun

For each numbered statement of fun, the table marks (with a vertical bar) membership in the end slice, the forward slice, and the metric slice for each of the output variables largest, smallest, and range; data tokens included in the end slice for the variable smallest are indicated by an underline. [slice-membership marks not reproduced] The code of fun is:

        int fun(
    1       int A[],
    2       int size,
    3       int *largest,
    4       int *smallest)
        {
    5       int i;
    6       int range;
    7       i = 1;
    8       range = 0;
    9       *smallest = A[0];
    10      *largest = *smallest;
    11      while (i < size) {
    12          if (*smallest > A[i])
    13              *smallest = A[i];
    14          if (*largest < A[i])
    15              *largest = A[i];
    16          i++;
            }
    17      range = *largest - *smallest;
    18      return range;
        }
do not include the other Halstead's software science metrics such as N, n, V, D, and E [30]. The reason is that these metrics are fully based on n1, n2, N1, and N2 (for example, n = n1 + n2). Consequently, they are highly correlated with n1, n2, N1, and N2. When building multivariate prediction models, highly correlated predictors will lead to a high multicollinearity and hence might lead to inaccurate coefficient estimates [61]. Therefore, our study only takes into account n1, n2, N1, and N2. The process metrics consist of three relative code churn metrics, i.e. the normalized number of added, deleted, and modified source lines of code. These code churn metrics assume that a function with more added, deleted, or modified code would have a higher possibility of being fault-prone. The reasons for choosing these baseline code and process metrics in this study are three-fold. First, they are widely used product and process metrics in both industry and academic research [22], [23], [24], [25], [26], [27], [28], [29], [50], [52], [54], [55], [64]. Second, they can be automatically and cheaply collected from source code even for very large software systems. Third, many studies show that they are useful indicators for fault-proneness prediction [26], [27], [28], [50], [52], [54], [55], [64]. In the context of effort-aware post-release fault-proneness prediction, we believe that slice-based cohesion metrics are of practical value only if: (1) they have a significantly better fault-proneness prediction ability than the baseline code and process metrics; or (2) they can significantly improve the performance of fault-proneness prediction when used together with the baseline code and process metrics. This is especially true when considering the expenses for collecting slice-based cohesion metrics. As Meyers and Binkley stated [13], slicing techniques and tools are now mature enough to allow an intensive empirical investigation. In our study, we use Frama-C, a well-known open-source static analysis tool for C programs [57], to collect slice-based cohesion metrics. Frama-C provides scalable and sound software analyses for C programs, thus allowing accurate collection of slice-based cohesion metrics on industrial-size systems [57].

3 RESEARCH METHODOLOGY

In this section, we first give the research hypotheses relating slice-based cohesion metrics to the most commonly used code and process metrics and to fault-proneness. Then, we describe the investigated dependent and independent variables, the employed modeling technique, and the data analysis methods.

3.1 Research Hypotheses

The first research question (RQ1) of this study investigates whether slice-based cohesion metrics are redundant when compared with the most commonly used code and process metrics. It is widely believed that software quality cannot be measured using only a single dimension [29]. As stated in Section 2.2, the most commonly used code and process metrics measure software quality from size, control flow structure, and cognitive psychology perspectives. However, slice-based cohesion metrics measure software quality from the perspective of cohesion, which is based on control-/data-flow dependence information among statements. Our conjecture is that, given the nature of the information and counting mechanism employed by slice-based cohesion metrics, they should capture different underlying dimensions of software quality than the most commonly used code and process metrics capture.
TABLE 3
Data-Token Level Slice Occurrence Matrix with Respect to End Slice Profile and Metric Slice Profile

                         End slice                Metric slice
Line  Token         largest smallest range   largest smallest range
1     A1               1       1       1        1       1       1
2     size1            1       1       1        1       1       1
3     largest1         1       0       1        1       0       1
4     smallest1        1       1       1        1       1       1
5     i1               1       1       1        1       1       1
6     range1           0       0       1        0       0       1
7     i2               1       1       1        1       1       1
7     11               1       1       1        1       1       1
8     range2           0       0       0        0       0       0
8     01               0       0       0        0       0       0
9     smallest2        1       1       1        1       1       1
9     A2               1       1       1        1       1       1
9     02               1       1       1        1       1       1
10    largest2         1       0       1        1       1       1
10    smallest3        1       0       1        1       1       1
11    i3               1       1       1        1       1       1
11    size2            1       1       1        1       1       1
12    smallest4        0       1       1        0       1       1
12    A3               0       1       1        0       1       1
12    i4               0       1       1        0       1       1
13    smallest5        0       1       1        0       1       1
13    A4               0       1       1        0       1       1
13    i5               0       1       1        0       1       1
14    largest3         1       0       1        1       1       1
14    A5               1       0       1        1       1       1
14    i6               1       0       1        1       1       1
15    largest4         1       0       1        1       1       1
15    A6               1       0       1        1       1       1
15    i7               1       0       1        1       1       1
16    i8               1       1       1        1       1       1
17    range3           0       0       1        1       1       1
17    largest5         0       0       1        1       1       1
17    smallest6        0       0       1        1       1       1
18    range4           0       0       1        1       1       1

From this reasoning, we set up the following null hypothesis H10 and alternative hypothesis H1A for RQ1:

H10. Slice-based cohesion metrics do not capture additional dimensions of software quality compared with the most commonly used code and process metrics.
H1A. Slice-based cohesion metrics capture additional dimensions of software quality compared with the most commonly used code and process metrics.

The second research question (RQ2) of this study investigates whether slice-based cohesion metrics are statistically related to post-release fault-proneness. In the software engineering literature, there is a common belief that low cohesion indicates an inappropriate design [1], [2]. Consequently, a function with low cohesion is more likely to be fault-prone than a function with high cohesion [1], [2]. From Section 2.1, we can see that slice-based cohesion metrics leverage the commonality among the slices with respect to different output variables of a function to quantify its cohesion. Existing studies showed that they provided an
excellent quantitative measure of function cohesion [5], [13]. In particular, for each of the investigated slice-based cohesion metrics, a large value indicates a high cohesion. From this reasoning, we set up the following null hypothesis H20 and alternative hypothesis H2A for RQ2:

H20. There is no significant correlation between slice-based cohesion metrics and post-release fault-proneness.
H2A. There is a significant correlation between slice-based cohesion metrics and post-release fault-proneness.

The third research question (RQ3) of this study investigates whether slice-based cohesion metrics predict post-release fault-prone functions more accurately than the most commonly used code and process metrics do. From Table 5, we can see that the most commonly used code and process metrics are based on either simple syntactic information or control flow structure information among statements in a function. In contrast, slice-based cohesion metrics make use of the semantic dependence information among the statements in a function. In other words, they are based on program behaviors as captured by program slices. In this sense, slice-based cohesion metrics provide a higher level quantification of software quality than the most commonly used code and process metrics.

TABLE 4
Example Metrics Computations at the Data-Token Level

Type          Metric        Computation                                             Value
End slice     Coverage      = 1/3 × (21/34 + 18/34 + 32/34)                         = 0.696
              MaxCoverage   = 32/34                                                 = 0.941
              MinCoverage   = 18/34                                                 = 0.529
              Overlap       = 1/3 × (12/21 + 12/18 + 12/32)                         = 0.538
              Tightness     = 12/34                                                 = 0.353
              SFC           = 12/34                                                 = 0.353
              WFC           = 27/34                                                 = 0.794
              A             = (21 + 18 + 32)/(3 × 34)                               = 0.696
              SBFC          = (12 × 3 × 2 + 15 × 2 × 1)/(34 × 3 × 2)                = 0.500
              NHD           = 1 − 2/(3 × 34 × 33) × (21 × 13 + 18 × 16 + 32 × 2)    = 0.629
Metric slice  Coverage      = 1/3 × (25/34 + 30/34 + 32/34)                         = 0.853
              MaxCoverage   = 32/34                                                 = 0.941
              MinCoverage   = 25/34                                                 = 0.735
              Overlap       = 1/3 × (24/25 + 24/30 + 24/32)                         = 0.837
              Tightness     = 24/34                                                 = 0.706
              SFC           = 24/34                                                 = 0.706
              WFC           = 31/34                                                 = 0.912
              A             = (25 + 30 + 32)/(3 × 34)                               = 0.853
              SBFC          = (24 × 3 × 2 + 7 × 2 × 1)/(34 × 3 × 2)                 = 0.775
              NHD           = 1 − 2/(3 × 34 × 33) × (25 × 9 + 30 × 4 + 32 × 2)      = 0.757

TABLE 5
The Most Commonly Used Code and Process Metrics (i.e. the Baseline Metrics in This Study)

Category  Characteristic         Metric               Description
Product   Size                   SLOC                 Source lines of code in a function (excluding blank lines and comment lines)
          Structural complexity  FANIN                Number of calling functions plus global variables read
                                 FANOUT               Number of called functions plus global variables set
                                 NPATH                Number of possible paths, not counting abnormal exits or gotos
                                 Cyclomatic           Cyclomatic complexity
                                 CyclomaticModified   Modified cyclomatic complexity
                                 CyclomaticStrict     Strict cyclomatic complexity
                                 Essential            Essential complexity
                                 Knots                Measure of overlapping jumps
                                 Nesting              Maximum nesting level of control constructs
                                 MaxEssentialKnots    Maximum Knots after structured programming constructs have been removed
                                 MinEssentialKnots    Minimum Knots after structured programming constructs have been removed
          Software science       n1                   Total number of distinct operators of a function
                                 n2                   Total number of distinct operands of a function
                                 N1                   Total number of operators of a function
                                 N2                   Total number of operands of a function
Process   Code churn             Added                Added source lines of code, normalized by function size
                                 Deleted              Deleted source lines of code, normalized by function size
                                 Modified             Modified source lines of code, normalized by function size
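The computations in Table 4 follow mechanically from the occurrence matrix in Table 3. As an illustration, the following sketch derives the metrics from a binary data-token slice occurrence matrix; the function name slice_cohesion and the matrix layout (rows are data tokens, columns are output-variable slices) are our own choices for exposition, not part of the original tooling.

    import numpy as np

    def slice_cohesion(M):
        # M: binary matrix, rows = data tokens, columns = output-variable
        # slices (e.g., the three end-slice columns of Table 3).
        T, k = M.shape
        col = M.sum(axis=0)          # tokens in each slice (21, 18, 32 for end slices)
        g = M.sum(axis=1)            # number of slices containing each token
        inter = int((g == k).sum())  # tokens common to all slices
        return {
            "Coverage":    col.mean() / T,
            "MaxCoverage": col.max() / T,
            "MinCoverage": col.min() / T,
            "Overlap":     (inter / col).mean(),
            "Tightness":   inter / T,
            "SFC":         inter / T,            # equals Tightness at the data-token level
            "WFC":         (g >= 2).sum() / T,   # tokens lying in more than one slice
            "SBFC":        (g * (g - 1)).sum() / (T * k * (k - 1)),
            "NHD":         1 - 2 * (col * (T - col)).sum() / (k * T * (T - 1)),
        }

Applied to the end-slice columns of Table 3 (T = 34 tokens, k = 3 slices), this reproduces the end-slice values in Table 4, e.g. Coverage ≈ 0.696 and WFC ≈ 0.794.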
Consequently, it is reasonable to expect that slice-based cohesion metrics are more closely related to fault-proneness than the most commonly used code and process metrics. From this expectation, we set up the following null hypothesis H30 and alternative hypothesis H3A for RQ3:

H30. Slice-based cohesion metrics are not more effective in effort-aware post-release fault-proneness prediction than the most commonly used code and process metrics.
H3A. Slice-based cohesion metrics are more effective in effort-aware post-release fault-proneness prediction than the most commonly used code and process metrics.

The fourth research question (RQ4) of this study investigates whether the model built with slice-based cohesion metrics and the most commonly used code and process metrics together has a better ability to predict post-release fault-proneness than the model built with the most commonly used code and process metrics alone. This issue is indeed raised by the null hypothesis H10. If the null hypothesis H10 is rejected, it means that slice-based cohesion metrics capture different underlying dimensions of software quality that are not captured by the most commonly used code and process metrics. In this case, we will naturally conjecture that combining slice-based cohesion metrics with the most commonly used code and process metrics should give a more complete indication of software quality. Consequently, the combination of slice-based cohesion metrics with the most commonly used code and process metrics will form a better indicator of post-release fault-proneness than the most commonly used code and process metrics alone. From this reasoning, we set up the following null hypothesis H40 and alternative hypothesis H4A for RQ4:

H40. The combination of slice-based cohesion metrics with the most commonly used code and process metrics is not more effective in effort-aware post-release fault-proneness prediction than the most commonly used code and process metrics alone.
H4A. The combination of slice-based cohesion metrics with the most commonly used code and process metrics is more effective in effort-aware post-release fault-proneness prediction than the most commonly used code and process metrics alone.

3.2 Variable Description

The dependent variable in this study is a binary variable Y that can take on only one of two different values. In the following, let the values be 0 and 1. Here, Y = 1 represents that the corresponding function has at least one post-release fault and Y = 0 represents that the corresponding function has no post-release fault. In this paper, we use a modeling technique called logistic regression (described in Section 3.3) to predict the probability of Y = 1. The probability of Y = 1 indeed indicates post-release fault-proneness, i.e. the extent of a function being post-release faulty. As stated by Nagappan et al. [66], for the users, only post-release failures matter. It is hence essential in practice to predict the post-release fault-proneness of functions in a system, as it enables developers to take focused preventive actions to improve quality in a cost-effective way. Indeed, much effort has been devoted to post-release fault-proneness prediction [27], [34], [36], [42], [54], [60], [64], [65], [66]. The independent variables in this study consist of two categories of metrics: (i) the 19 most commonly used code and process metrics, and (ii) eight slice-based cohesion metrics. All these metrics are collected at the function level.
The objective of this study is to empirically investigate the actual usefulness of slice-based cohesion metrics in the context of effort-aware post-release fault-proneness prediction, especially when compared with the most commonly used code and process metrics. With these independent variables, we are able to test the four null hypotheses described in Section 3.1.

3.3 Modeling Technique

Logistic regression is a standard statistical modeling technique in which the dependent variable can take two different values [28]. It is suitable for building fault-proneness prediction models because the functions under consideration are divided into two categories: faulty and not faulty. Let Pr(Y = 1 | X_1, X_2, ..., X_n) represent the probability that the dependent variable Y = 1 given the independent variables X_1, X_2, ..., and X_n (i.e. the metrics in this study). Then, a multivariate logistic regression model assumes that Pr(Y = 1 | X_1, X_2, ..., X_n) is related to X_1, X_2, ..., X_n by the following equation:

Pr(Y = 1 | X_1, X_2, ..., X_n) = e^{α + β_1 X_1 + β_2 X_2 + ... + β_n X_n} / (1 + e^{α + β_1 X_1 + β_2 X_2 + ... + β_n X_n}),

where α and the β_i's are the regression coefficients and can be estimated through the maximization of a log-likelihood. Odds ratio is the most commonly used measure to quantify the magnitude of the correlation between the independent and dependent variables in a logistic regression model. For a given independent variable X_i, the odds that Y = 1 at X_i = x denote the ratio of the probability that Y = 1 to the probability that Y = 0 at X_i = x, i.e.

Odds(Y = 1 | X_i = x) = Pr(Y = 1 | ..., X_i = x, ...) / (1 − Pr(Y = 1 | ..., X_i = x, ...)).

In this study, similar to [33], we use ΔOR, the odds ratio associated with one standard deviation increase, to provide an intuitive insight into the impact of the independent variable X_i:

ΔOR(X_i) = Odds(Y = 1 | X_i = x + σ_i) / Odds(Y = 1 | X_i = x) = e^{β_i σ_i},

where β_i and σ_i are respectively the regression coefficient and the standard deviation of the variable X_i. ΔOR(X_i) can be used to compare the relative magnitude of the effects of different independent variables, as the same unit is used [42]. ΔOR(X_i) > 1 indicates that the independent variable is positively associated with the dependent variable. ΔOR(X_i) = 1 indicates that there is no such correlation. ΔOR(X_i) < 1 indicates that there is a negative correlation. The univariate logistic regression model is a special case of the multivariate logistic regression model, where there is only one independent variable.
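To make the use of ΔOR concrete, the following sketch fits a univariate logistic regression with statsmodels and computes e^{βσ}; the synthetic data and variable names are purely illustrative assumptions, not data from the studied systems.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, 500)                  # a cohesion metric, e.g. Tightness (synthetic)
    p = 1 / (1 + np.exp(-(1.0 - 3.0 * x)))      # lower cohesion -> higher fault risk
    y = rng.binomial(1, p)                      # binary post-release fault label

    fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)   # maximum-likelihood fit
    beta = fit.params[1]                                # regression coefficient
    delta_or = np.exp(beta * x.std())                   # odds-ratio change per +1 SD
    print(delta_or)                                     # < 1: negative association

A ΔOR below 1 here mirrors the expectation in H2A that higher cohesion goes with lower post-release fault-proneness.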
3.4 Data Analysis Method

In the following, we describe the data analysis method for testing the four null research hypotheses.

3.4.1 Principal Component Analysis for RQ1

In order to answer RQ1, we use principal component analysis to determine whether slice-based cohesion metrics capture different underlying dimensions of software quality than the most commonly used code and process metrics. PCA is a powerful statistical technique used to identify the underlying, orthogonal dimensions that explain the relations among the independent variables in a data set. These dimensions are called principal components (PCs), which are linear combinations of the standardized independent variables. In our study, for each data set, we use the following method to determine the corresponding number of PCs. First, the stopping criterion for PCA is that all the eigenvalues for each new component are greater than zero. Second, we apply the varimax rotation to the PCs to make the mapping of the independent variables to the components clearer, such that each variable has either a very low or a very high loading. This helps identify the variables that are strongly correlated and indeed measure the same property, though they may purport to capture different properties. Third, after obtaining the rotated component matrix, we map each independent variable to the component having the maximum loading. Fourth, we only retain the components to which at least one independent variable is mapped. In our context, the null hypothesis H10 corresponding to RQ1 will be rejected when the result of PCA shows that slice-based cohesion metrics define new PCs of their own compared with the most commonly used code and process metrics.
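This four-step procedure can be sketched as follows; varimax is a standard textbook rotation, pc_mapping assigns each metric to the rotated component on which it loads most heavily, and the function names are our own. The eigenvalue cut-off is taken verbatim from the stopping criterion above.

    import numpy as np

    def varimax(loadings, gamma=1.0, max_iter=50, tol=1e-6):
        # Classic varimax rotation of a loading matrix.
        p, k = loadings.shape
        R, var = np.eye(k), 0.0
        for _ in range(max_iter):
            L = loadings @ R
            u, s, vt = np.linalg.svd(
                loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0))))
            R = u @ vt
            if s.sum() < var * (1 + tol):
                break
            var = s.sum()
        return loadings @ R

    def pc_mapping(X):
        # X: functions x metrics. Standardize, keep PCs with positive
        # eigenvalues, rotate, then map each metric to its dominant PC.
        Z = (X - X.mean(axis=0)) / X.std(axis=0)
        eigval, eigvec = np.linalg.eigh(np.cov(Z, rowvar=False))
        keep = eigval > 0
        loadings = eigvec[:, keep] * np.sqrt(eigval[keep])
        return np.argmax(np.abs(varimax(loadings)), axis=1)

If, say, all eight cohesion metrics map to components to which no baseline metric maps, H10 is rejected.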
3.4.2 Univariate Logistic Regression Analysis for RQ2

In order to answer RQ2, we use univariate logistic regression to examine whether each slice-based cohesion metric is negatively related to post-release fault-proneness at the significance level α of 0.10. From a scientific perspective, it is often suggested to work at the α level of 0.05 or 0.01. However, the choice of a particular level of significance is ultimately a subjective decision and other levels such as α = 0.10 are also common [51]. In this paper, the minimum significance level for rejecting a null hypothesis is set at α = 0.10, as we are aggressively interested in revealing undisclosed correlations between metrics and fault-proneness. When performing univariate analysis, we employ the Cook's distance to identify influential observations. For an observation, its Cook's distance is a measure of how far apart the regression coefficients are with and without this observation included. If an observation has a Cook's distance equal to or larger than 1, it is regarded as an influential observation and is hence excluded from the analysis [32]. Furthermore, for each metric, we use ΔOR, the odds ratio associated with one standard deviation increase in the metric, to quantify its effect on fault-proneness [33]. This allows us to compare the relative magnitude of the effects of individual metrics on post-release fault-proneness. Note that previous studies reported that module size (i.e. function size in this study) might have a potential confounding effect on the relationships between software metrics and fault-proneness [43], [53]. In other words, module size may falsely obscure or accentuate the true correlations between software metrics and fault-proneness. Therefore, there is a need to remove the potentially confounding effect of module size in order to understand the essence that a metric measures [53]. In this study, we first apply the linear regression method proposed by Zhou et al. [53] to remove the potentially confounding effect of function size. After that, we use univariate logistic regression to examine the correlations between the cleaned metrics and fault-proneness. For each metric, the null hypothesis H20 corresponding to RQ2 will be rejected if the result of univariate logistic regression is statistically significant at the significance level of 0.10.

3.4.3 Multivariate Logistic Regression Analysis for RQ3 and RQ4

In order to answer RQ3 and RQ4, we perform a stepwise variable selection procedure to build three types of multivariate logistic regression models: (1) the "B" model (using only the most commonly used code and process metrics); (2) the "S" model (using only slice-based cohesion metrics); and (3) the "B+S" model (using all the metrics). As suggested by Zhou et al. [53], before building the multivariate logistic regression models, we remove the confounding effect of function size (measured by SLOC). In addition, many metrics used in this study are defined similarly to each other. For example, CyclomaticModified and CyclomaticStrict are revised versions of the Cyclomatic complexity. These highly correlated predictors may lead to a high multicollinearity and hence inaccurate coefficient estimates in a logistic regression model [61]. Variance inflation factor (VIF) is a widely used indicator of multicollinearity. In this study, we use the recommended cut-off value 10 to deal with multicollinearity in a regression model [59]. If an independent variable has a VIF value larger than 10, it will be removed from the multivariate regression model. More specifically, we use the following algorithm BUILD-MODEL to build the multivariate logistic regression models. As can be seen, when building a multivariate model, our algorithm takes into account: (1) the confounding effect of function size; (2) the multicollinearity among the independent variables; and (3) the influential observations.

Algorithm 1. BUILD-MODEL
Input: dataset D (X: set of independent variables, Y: dependent variable)
1: Remove the confounding effect of function size from each independent variable in X for D [53].
2: Use the backward stepwise variable selection method to build the logistic regression model M on D.
3: Calculate the variance inflation factors for all independent variables in the model M.
4: If all the VIFs are less than or equal to 10, goto step 6; otherwise, goto step 5.
5: Remove the variable xi with the largest VIF from X, and goto step 2.
6: Calculate the Cook's distance for all the observations in D. If the maximum Cook's distance is less than or equal to 1, then goto step 8; otherwise, goto step 7.
7: Update D by removing the observations whose Cook's distances are equal to or larger than 1. Goto step 2.
8: Return the model M.
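A minimal Python sketch of BUILD-MODEL follows, assuming statsmodels for the logistic fits, VIFs, and influence diagnostics; remove_size_effect stands in for the linear-regression cleaning of Zhou et al. [53], and the p-value-based backward elimination is a simple stand-in for the stepwise selection.

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    def remove_size_effect(x, sloc):
        # Step 1 (sketch): keep the residual of the metric after regressing on log(SLOC).
        return sm.OLS(x, sm.add_constant(np.log(sloc))).fit().resid

    def backward_stepwise(X, y, alpha=0.10):
        # Step 2 (sketch): drop the least significant variable until all
        # remaining p-values fall below alpha.
        cols = list(range(X.shape[1]))
        while cols:
            fit = sm.Logit(y, sm.add_constant(X[:, cols])).fit(disp=0)
            pvals = fit.pvalues[1:]                   # skip the intercept
            worst = int(np.argmax(pvals))
            if pvals[worst] < alpha:
                return cols, fit
            cols.pop(worst)
        return cols, None

    def build_model(X, y, sloc):
        X = np.column_stack([remove_size_effect(X[:, j], sloc)
                             for j in range(X.shape[1])])
        while True:
            cols, fit = backward_stepwise(X, y)
            exog = sm.add_constant(X[:, cols])
            vifs = [variance_inflation_factor(exog, i + 1)     # step 3
                    for i in range(len(cols))]
            if max(vifs) > 10:                                 # steps 4-5
                X = np.delete(X, cols[int(np.argmax(vifs))], axis=1)
                continue
            glm = sm.GLM(y, exog, family=sm.families.Binomial()).fit()
            cooks = glm.get_influence().cooks_distance[0]      # step 6
            if cooks.max() <= 1:
                return fit                                     # step 8
            keep = cooks < 1                                   # step 7
            X, y, sloc = X[keep], y[keep], sloc[keep]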
After building the above models, we compare the prediction effectiveness of the following two pairs of models: "S" vs. "B" and "B+S" vs. "B". To obtain an adequate and realistic comparison, we use the prediction effectiveness data generated from the following three methods:

Cross-validation. Cross-validation is performed within the same version of a project, i.e. predicting faults in one subset using a model trained on the other complementary subsets. In our study, for a given project, we use 30 times three-fold cross-validation to evaluate the effectiveness of the prediction models. More specifically, at each three-fold cross-validation, we randomly divide the data set into three parts of approximately equal size. Each part is used to compute the effectiveness for the prediction models built on the remainder of the data set. The entire process is then repeated 30 times to alleviate possible sampling bias in random splits. Consequently, each model has 30 × 3 = 90 prediction effectiveness values. Note that we choose to perform three-fold cross-validation rather than 10-fold cross-validation due to the small percentage of post-release faulty functions in the data sets.

Across-version prediction. Across-version prediction uses a model trained on earlier versions to predict faults in later versions within the same project. There are two kinds of approaches for across-version prediction [50]. The first approach is next-version prediction, i.e. building a prediction model on a version i and then only applying the model to predict faults in the next version i + 1 of the same project. The second approach is follow-up-version prediction, i.e. building a prediction model on a version i and then applying the model to predict faults in any follow-up version j (i.e. j > i) of the same project. In our study, we adopt both approaches. If a project has m versions, the first approach will produce m − 1 prediction effectiveness values for each model, while the second approach will produce m × (m − 1)/2 prediction effectiveness values for each model.

Across-project prediction. Across-project prediction uses a model trained on one project to predict faults in another project [50]. Given n projects, this prediction method will produce n × (n − 1) prediction effectiveness values for each model.

In each of the above-mentioned three prediction settings, all models use the same training data and the same testing data. Based on these setups, we employ the Wilcoxon signed-rank test to examine whether two models have a significant difference in prediction effectiveness. In particular, we use the Benjamini-Hochberg (BH) corrected p-values to examine whether a difference is significant at the significance level of 0.10. The null hypothesis H30 corresponding to RQ3 will be rejected when the comparison shows that the "S" model outperforms the "B" model and the difference is significant. The null hypothesis H40 corresponding to RQ4 will be rejected when the comparison shows that the "B+S" model outperforms the "B" model and the difference is significant. Furthermore, we use the Cliff's δ, which is used for median comparison, to examine whether the magnitude of the difference between the prediction performances of two models is important from the viewpoint of practical application [34]. By convention, the magnitude of the difference is considered either trivial (|δ| < 0.147), small (0.147 ≤ |δ| < 0.33), moderate (0.33 ≤ |δ| < 0.474), or large (|δ| ≥ 0.474) [58].
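In code, one such comparison of two models' per-split effectiveness values might look as follows; scipy's wilcoxon and statsmodels' multipletests supply the signed-rank test and the BH correction, while cliffs_delta is our own small helper and the CE arrays are synthetic stand-ins.

    import numpy as np
    from scipy.stats import wilcoxon
    from statsmodels.stats.multitest import multipletests

    def cliffs_delta(a, b):
        # Cliff's delta: P(a > b) - P(a < b) over all cross pairs.
        diff = np.asarray(a)[:, None] - np.asarray(b)[None, :]
        return ((diff > 0).sum() - (diff < 0).sum()) / diff.size

    rng = np.random.default_rng(1)
    ce_b = rng.normal(0.30, 0.05, 90)           # 30 x 3-fold CE values, "B" model (synthetic)
    ce_bs = ce_b + rng.normal(0.03, 0.02, 90)   # "B+S" model (synthetic)

    stat, p = wilcoxon(ce_bs, ce_b)             # paired signed-rank test
    # BH correction is applied across all comparisons of the study; a single
    # p-value is passed here only to show the call.
    reject = multipletests([p], alpha=0.10, method="fdr_bh")[0][0]
    print(reject, cliffs_delta(ce_bs, ce_b))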
We test the null hypotheses H30 and H40 in the following two typical application scenarios: ranking and classification. In the ranking scenario, functions are ranked in order from the most to the least predicted relative risk. With this ranking list in hand, software practitioners can simply select as many high-risk functions targeted for software quality enhancement as available resources will allow. In the classification scenario, functions are first classified into two categories in terms of their predicted relative risk: high-risk and low-risk. After that, those functions classified as high-risk are targeted for software quality enhancement. In both scenarios, we take into account the effort to test or inspect those functions predicted as high-risk when evaluating the prediction effectiveness of a model. Following previous work [34], we use the source lines of code in a function f as a proxy to estimate the effort required to test or inspect the function. In particular, we define the relative risk of the function f as R(f) = Pr/SLOC(f), where Pr is the probability, predicted by the logistic regression model, that the function f is faulty. In other words, R(f) can be regarded as the predicted fault-proneness per SLOC. In the context of effort-aware fault-proneness prediction, prior studies used defect density [35], [36], [37], i.e. #Error(f)/SLOC(f), as the dependent variable to build the prediction model. In this study, we first use the binary dependent variable to build the logistic regression model and then use R(f) to estimate the relative risk of a given function f. Next, we describe the effort-aware prediction performance indicators used in this study for ranking and classification.

(1) Effort-aware ranking performance evaluation. We use the cost-effectiveness measure CE proposed by Arisholm et al. [34] to evaluate the effort-aware ranking effectiveness of a fault-proneness prediction model. The CE measure is based on the concept of the "SLOC-based" Alberg diagram. In this diagram, the x-axis is the cumulative percentage of SLOC of the functions selected from the function ranking and the y-axis is the cumulative percentage of post-release faults found in the selected functions. Consequently, each fault-proneness prediction model corresponds to a curve in the diagram. Fig. 1 is an example "SLOC-based" Alberg diagram showing the ranking performance of a prediction model m (in our context, the prediction model m could be the "B" model, the "S" model, or the "B+S" model). To compute CE, we also consider two additional curves, which respectively correspond to the "random" model and the "optimal" model. In the "random" model, functions are randomly selected to test or inspect. In the "optimal" model, functions are sorted in decreasing order according to their actual post-release fault densities. Based on this diagram, the effort-aware ranking effectiveness of the prediction model m is defined as follows [34]:

CE_π(m) = (Area_π(m) − Area_π(random model)) / (Area_π(optimal model) − Area_π(random model)).

Here, Area_π(m) is the area under the curve corresponding to model m for a given top π × 100% of SLOC.
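Concretely, CE_π can be computed by trapezoidal integration of the Alberg curves up to the cut-off π; the helper names below and the example cut-off π = 0.2 are our own assumptions for illustration.

    import numpy as np

    def area(order, sloc, faults, pi):
        # Area under cumulative %SLOC (x) vs. cumulative %faults (y) up to x = pi.
        x = np.concatenate([[0.0], np.cumsum(sloc[order]) / sloc.sum()])
        y = np.concatenate([[0.0], np.cumsum(faults[order]) / faults.sum()])
        keep = x <= pi
        xs = np.append(x[keep], pi)
        ys = np.append(y[keep], np.interp(pi, x, y))
        return 0.5 * np.sum((xs[1:] - xs[:-1]) * (ys[1:] + ys[:-1]))  # trapezoids

    def ce(risk, sloc, faults, pi=0.2):
        # risk: predicted R(f) = Pr/SLOC(f) per function.
        model = area(np.argsort(-risk), sloc, faults, pi)
        optimal = area(np.argsort(-(faults / sloc)), sloc, faults, pi)
        random = pi ** 2 / 2            # under random selection, y = x
        return (model - random) / (optimal - random)

A CE_π close to 1 means the ranking approaches the optimal ordering within the top π × 100% of SLOC, 0 means no better than random selection, and negative values mean worse than random.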