procedural slicing to compute slice-based cohesion metrics for each function. In other words, metric slices are computed within a single function. In our study, calls to other functions are handled conservatively.

More specifically, we developed two Frama-C plug-ins named INFER CONtract (INFERCON) and SLIce-Based COhesion Metrics (SLIBCOM) to compute slice-based cohesion metrics for each function. The INFERCON plug-in is used to infer the contract for a called function conservatively. For a function, if it has a return value, INFERCON assumes that the returned value is data-dependent on all the arguments passed to the function. For any pointer argument p of the function, INFERCON assumes that the value pointed to by p will have been changed by the end of the function and hence p is data-dependent on all the arguments passed to the function. In addition, INFERCON assumes that every function terminates.
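For illustration, consider a hypothetical called function int scale(int k, int *buf). The kind of conservative contract described above could be written as the following ACSL specification (a minimal sketch: the function and its signature are our own example, and we do not claim that INFERCON emits ACSL text in exactly this form):

```c
/* Hypothetical prototype of a function called by the function under
 * analysis. Its body is not inspected; instead, the conservative
 * contract below is assumed: the location *buf may be rewritten, both
 * the return value and *buf depend on all arguments, and the call
 * always terminates. */
/*@ terminates \true;
  @ assigns *buf \from k, *buf;
  @ assigns \result \from k, *buf;
  @*/
int scale(int k, int *buf);
```

Such a contract lets the value analysis step over a call without analyzing the callee's body, at the cost of extra conservatism: spurious dependences may enlarge the computed slices.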
The SLIBCOM plug-in is used to collect slice-based cohesion metrics for each function in each system. It was built on the INFERCON plug-in and four plug-ins provided by Frama-C, namely "Value analysis", "Outputs", "Slicing", and "Impact analysis". For each function m in the system, INFERCON first inferred contracts for the functions called by m. Next, SLIBCOM employed the "Value analysis" plug-in to perform the value analysis in a context-insensitive way, using the inferred contracts for the called functions. Then, based on the results of the value analysis, SLIBCOM used the "Outputs" plug-in to obtain the output variables of m. After that, SLIBCOM leveraged the "Slicing" plug-in to obtain the end slice for each output variable. Based on those end slices, SLIBCOM used the "Impact analysis" plug-in to obtain the corresponding "forward slices" and then combined them to obtain the metric slice for each output variable. Finally, SLIBCOM used the metric-slice information to calculate the slice-based cohesion metrics of m.

Note that the cohesion metric values of a function were set to undefined if either of the following two conditions was satisfied: (1) the execution time of the value analysis was very long; or (2) the "Outputs" plug-in did not find any output variable. The reason for the former case is unknown; in our study, we terminated the value analysis when its execution time exceeded 30 minutes. The latter case occurred when the function under analysis neither returned a value nor had any side effect, so the "Outputs" plug-in was unable to identify any output variable for it.
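To make the pipeline concrete, the toy function below has two output variables: the return value and the pointer parameter *sum. The comments mark, for each statement, membership in the metric slice of each output variable (a minimal sketch: the function and the slice memberships are worked out by hand for illustration and are not output from the plug-ins):

```c
#include <stdio.h>

/* Output variables of min_and_sum: the return value ("min") and *sum.
 * S_min / S_sum mark hand-derived membership in the metric slice of
 * each output variable; an actual run of the plug-ins may delimit
 * statements slightly differently. */
int min_and_sum(const int a[], int n, int *sum)
{
    int min = a[0];                  /* S_min        */
    int total = 0;                   /*        S_sum */
    for (int i = 0; i < n; i++) {    /* S_min  S_sum */
        if (a[i] < min)              /* S_min        */
            min = a[i];              /* S_min        */
        total += a[i];               /*        S_sum */
    }
    *sum = total;                    /*        S_sum */
    return min;                      /* S_min        */
}

/* A function that neither returns a value nor has a side effect: the
 * "Outputs" plug-in finds no output variable for it, so its cohesion
 * metrics are undefined and it is removed from the data sets. */
static void local_only(int n)
{
    int x = 0;                       /* purely local; nothing escapes */
    while (x < n)
        x++;
}

int main(void)
{
    int a[] = { 3, 1, 4, 1, 5 };
    int sum = 0;
    int min = min_and_sum(a, 5, &sum);
    local_only(10);
    printf("min=%d sum=%d\n", min, sum);   /* prints min=1 sum=14 */
    return 0;
}
```

Of the eight statements in min_and_sum, only the loop header lies in both metric slices, so overlap-based cohesion values for this function would be low.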
Table 6 summarizes the projects studied in this study (the time cost of collecting the slice-based cohesion metrics is shown in Table 12 in Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TSE.2014.2370048). The second to the fourth columns are the version number, the release date, and the total source lines of code of the subject release, respectively. The fifth and the sixth columns are respectively the total number of functions and the total number of faulty functions that can be identified by both Understand and Frama-C. The seventh and the eighth columns are respectively the total number of functions and the total number of faulty functions after removing the functions that have an "undefined" metric value. As can be seen, for each system, the faulty functions detected during the post-release phase were concentrated in a very small number of functions (only around 1.234~16.994 percent of all functions). In all the subsequent analyses (in Section 5), we use only the pre-processed data sets.

The last two columns respectively provide the version number and the release date of the previous releases, which are used for computing the code churn metrics for each system (described in Step 2 in Section 4.2). We chose these five previous versions as the baseline versions for computing the code churn metrics because each is the immediately preceding minor version of its system. On average, the previous release was published 17 months before the subject version.

4.3 Data Distribution

Table 7 presents the descriptive statistics for each data set. Columns "25%", "50%", and "75%" give, for each metric, the first quartile, the median, and the third quartile, respectively. From Table 7, we have the following observations. First, for the code metrics, we can see that Gcc-core 3.4.0 has the largest function size, the highest cyclomatic complexity, and the maximum depth of nesting. This indicates that this compiler collection has more complex control flow than the other systems. Second, for the process metrics, we can see that the functions in Gcc-core 3.4.0 undergo more code changes, probably because Gcc 3.4.0 includes many improvements in the C++ front end.8 Third, for the slice-based cohesion metrics, we can see that Vim 6.2 in general has smaller cohesion values than the other systems. In other words, its functions are less cohesive than those in the other four systems. From Table 6, we observe that, of the five systems, Vim 6.2 has the largest percentage of faulty functions. One possible explanation is that less cohesive functions are more likely to be faulty, which is consistent with our intuition. Fourth, for most metrics, there are large differences between the 25th percentile, the median, and the 75th percentile, showing strong variation across functions.

All of the metrics have more than five nonzero observations and hence are considered for further analysis [33].

5 EXPERIMENTAL RESULTS

In this section, we elaborate on the experimental results for slice-based cohesion metrics. In Section 5.1, we present the results from examining their redundancy with respect to the most commonly used code and process metrics (RQ1). In Section 5.2, we give the results from examining their correlations with post-release fault-proneness (RQ2). In Section 5.3, we show the results from examining their ability to predict post-release fault-proneness compared with the most commonly used code and process metrics (RQ3). In Section 5.4, we report the results from examining the usefulness of their combination with the most commonly used code and process metrics (RQ4).

8. http://www.gnu.org/software/gcc/gcc-3.4/changes.html