3.4 Data Analysis Method

In the following, we describe the data analysis method for testing the four null research hypotheses.

3.4.1 Principal Component Analysis for RQ1

In order to answer RQ1, we use principal component analysis (PCA) to determine whether slice-based cohesion metrics capture different underlying dimensions of software quality than the most commonly used code and process metrics. PCA is a powerful statistical technique used to identify the underlying, orthogonal dimensions that explain the relations among the independent variables in a data set. These dimensions are called principal components (PCs), which are linear combinations of the standardized independent variables. In our study, for each data set, we use the following method to determine the corresponding number of PCs. First, the stopping criterion for PCA is that all the eigenvalues for each new component are greater than zero. Second, we apply the varimax rotation to the PCs to make the mapping of the independent variables to components clearer, so that each variable has either a very low or a very high loading on a component. This helps identify the variables that are strongly correlated and indeed measure the same property, though they may purport to capture different properties. Third, after obtaining the rotated component matrix, we map each independent variable to the component on which it has the maximum loading. Fourth, we retain only the components to which at least one independent variable is mapped. In our context, the null hypothesis H1₀ corresponding to RQ1 will be rejected when the result of PCA shows that slice-based cohesion metrics define new PCs of their own compared with the most commonly used code and process metrics.
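To make this procedure concrete, the following Python sketch outlines one possible realization, using scikit-learn for the PCA step and a hand-rolled varimax rotation (scikit-learn does not provide one). The DataFrame name metrics, the function names, and the parameter defaults are illustrative assumptions, not artifacts of the study.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    # Classic varimax rotation of a (variables x components) loading matrix.
    p, k = loadings.shape
    rotation = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        d_old = d
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3
                          - (gamma / p) * rotated @ np.diag(np.diag(rotated.T @ rotated))))
        rotation = u @ vt
        d = s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return loadings @ rotation

def assign_metrics_to_components(metrics: pd.DataFrame) -> pd.Series:
    # Standardize the metrics, extract the PCs, rotate them, and map each metric
    # to the rotated component on which it has its maximum absolute loading.
    X = StandardScaler().fit_transform(metrics.values)
    pca = PCA().fit(X)
    keep = pca.explained_variance_ > 0.0  # stopping criterion stated in the text
    loadings = pca.components_[keep].T * np.sqrt(pca.explained_variance_[keep])
    rotated = varimax(loadings)
    return pd.Series(np.abs(rotated).argmax(axis=1), index=metrics.columns)

The components of interest (those to which at least one metric is mapped) are then simply the distinct values of the returned assignment, so whether the slice-based cohesion metrics land on components of their own can be read off directly.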
3.4.2 Univariate Logistic Regression Analysis for RQ2

In order to answer RQ2, we use univariate logistic regression to examine whether each slice-based cohesion metric is negatively related to post-release fault-proneness at the significance level α of 0.10. From a scientific perspective, it is often suggested to work at the α level of 0.05 or 0.01. However, the choice of a particular significance level is ultimately a subjective decision, and other levels such as α = 0.10 are also common [51]. In this paper, the minimum significance level for rejecting a null hypothesis is set at α = 0.10, as we are aggressively interested in revealing undisclosed correlations between metrics and fault-proneness. When performing the univariate analysis, we employ Cook's distance to identify influential observations. For an observation, its Cook's distance is a measure of how far apart the regression coefficients are with and without this observation included. If an observation has a Cook's distance equal to or larger than 1, it is regarded as an influential observation and is hence excluded from the analysis [32]. Furthermore, for each metric, we use ΔOR, the odds ratio associated with one standard deviation increase in the metric, to quantify its effect on fault-proneness [33]. This allows us to compare the relative magnitudes of the effects of individual metrics on post-release fault-proneness. Note that previous studies reported that module size (i.e., function size in this study) might have a potential confounding effect on the relationships between software metrics and fault-proneness [43], [53]. In other words, module size may falsely obscure or accentuate the true correlations between software metrics and fault-proneness. Therefore, there is a need to remove the potentially confounding effect of module size in order to understand the essence of what a metric measures [53]. In this study, we first apply the linear regression method proposed by Zhou et al. [53] to remove the potentially confounding effect of function size. After that, we use univariate logistic regression to examine the correlations between the cleaned metrics and fault-proneness. For each metric, the null hypothesis H2₀ corresponding to RQ2 will be rejected if the result of univariate logistic regression is statistically significant at the significance level of 0.10.
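The univariate procedure can be sketched with statsmodels as follows. The residual-based clean_metric helper is a simplified stand-in for the linear-regression cleaning of Zhou et al. [53], whose exact formulation is given in that reference; the variable names (metric, size, faulty) and the single exclude-and-refit pass are illustrative assumptions.

import numpy as np
import pandas as pd
import statsmodels.api as sm

def clean_metric(metric: pd.Series, size: pd.Series) -> pd.Series:
    # Residualize the metric against function size; a simplified stand-in for
    # the linear-regression cleaning method of Zhou et al. [53].
    return sm.OLS(metric, sm.add_constant(size)).fit().resid

def univariate_analysis(metric: pd.Series, size: pd.Series, faulty: pd.Series,
                        alpha: float = 0.10) -> dict:
    x = clean_metric(metric, size)
    y = faulty.astype(float)
    # First fit: locate influential observations via Cook's distance.
    fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial()).fit()
    cooks_d = fit.get_influence().cooks_distance[0]
    keep = cooks_d < 1.0  # drop observations with Cook's distance >= 1
    # Refit on the remaining observations.
    fit = sm.GLM(y[keep], sm.add_constant(x[keep]),
                 family=sm.families.Binomial()).fit()
    coef, p_value = fit.params.iloc[1], fit.pvalues.iloc[1]
    delta_or = float(np.exp(coef * x[keep].std()))  # odds ratio per one-std-dev increase
    return {"significant": p_value < alpha, "p_value": float(p_value), "delta_or": delta_or}

A negative coefficient (ΔOR below 1) together with a p-value under 0.10 would lead to rejecting H2₀ for that metric.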
3.4.3 Multivariate Logistic Regression Analysis for RQ3 and RQ4

In order to answer RQ3 and RQ4, we perform a stepwise variable selection procedure to build three types of multivariate logistic regression models: (1) the "B" model (using only the most commonly used code and process metrics); (2) the "S" model (using only the slice-based cohesion metrics); and (3) the "B+S" model (using all the metrics). As suggested by Zhou et al. [53], before building the multivariate logistic regression models, we remove the confounding effect of function size (measured by SLOC). In addition, many metrics used in this study are defined similarly to each other. For example, CyclomaticModified and CyclomaticStrict are revised versions of the Cyclomatic complexity metric. Such highly correlated predictors may lead to high multicollinearity and hence inaccurate coefficient estimates in a logistic regression model [61]. The variance inflation factor (VIF) is a widely used indicator of multicollinearity. In this study, we use the recommended cut-off value of 10 to deal with multicollinearity in a regression model [59]. If an independent variable has a VIF value larger than 10, it will be removed from the multivariate regression model.
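As an illustration of the VIF cut-off, the snippet below computes a per-predictor VIF with statsmodels; the DataFrame name and the helper are our assumptions, and in Algorithm 1 below this check is interleaved with the stepwise selection rather than applied once up front.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(predictors: pd.DataFrame) -> pd.Series:
    # VIF of each predictor; values above 10 flag problematic multicollinearity.
    design = sm.add_constant(predictors)  # intercept occupies column 0
    return pd.Series(
        [variance_inflation_factor(design.values, i) for i in range(1, design.shape[1])],
        index=predictors.columns)

# Example (code_metrics is a hypothetical DataFrame of predictors):
# vifs = vif_table(code_metrics); vifs[vifs > 10]  -> predictors that would be dropped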
More specifically, we use the following algorithm, BUILD-MODEL, to build the multivariate logistic regression models. As can be seen, when building a multivariate model, our algorithm takes into account: (1) the confounding effect of function size; (2) the multicollinearity among the independent variables; and (3) the influential observations.

Algorithm 1. BUILD-MODEL
Input: dataset D (X: set of independent variables, Y: dependent variable)
Steps:
1: Remove the confounding effect of function size from each independent variable in X for D [53].
2: Use the backward stepwise variable selection method to build the logistic regression model M on D.
3: Calculate the variance inflation factors for all independent variables in the model M.
4: If all the VIFs are less than or equal to 10, go to step 6; otherwise, go to step 5.
5: Remove the variable x_i with the largest VIF from X, and go to step 2.
6: Calculate the Cook's distance for all the observations in D. If the maximum Cook's distance is less than or equal to 1, go to step 8; otherwise, go to step 7.
7: Update D by removing the observations whose Cook's distances are equal to or larger than 1. Go to step 2.
8: Return the model M.
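One possible Python realization of BUILD-MODEL is sketched below. It assumes a DataFrame X of candidate metrics, a binary Series y marking post-release faulty functions, and a Series size holding SLOC; the residual-based size cleaning and the p-value based backward elimination are our simplifications of steps 1 and 2, whose exact procedures follow [53] and a standard backward stepwise method, respectively.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

P_REMOVE = 0.10    # removal threshold for the backward elimination (our choice)
VIF_CUTOFF = 10.0  # multicollinearity cut-off used in the paper
COOK_CUTOFF = 1.0  # influential-observation cut-off used in the paper

def remove_size_confounding(X: pd.DataFrame, size: pd.Series) -> pd.DataFrame:
    # Step 1: replace each metric by its residual after regressing it on function
    # size (a simplified stand-in for the cleaning method of Zhou et al. [53]).
    design = sm.add_constant(size)
    return X.apply(lambda col: sm.OLS(col, design).fit().resid)

def backward_stepwise(X: pd.DataFrame, y: pd.Series):
    # Step 2: p-value based backward elimination, one common realization of
    # backward stepwise selection (the exact criterion is not fixed in this section).
    selected = list(X.columns)
    while selected:
        model = sm.GLM(y, sm.add_constant(X[selected]),
                       family=sm.families.Binomial()).fit()
        pvals = model.pvalues.drop("const")
        if pvals.max() <= P_REMOVE:
            return model, selected
        selected.remove(pvals.idxmax())
    raise ValueError("no variable survived the stepwise selection")

def build_model(X: pd.DataFrame, y: pd.Series, size: pd.Series):
    X = remove_size_confounding(X, size)                       # step 1
    while True:
        model, selected = backward_stepwise(X, y)              # step 2
        design = sm.add_constant(X[selected])
        vifs = pd.Series([variance_inflation_factor(design.values, i)
                          for i in range(1, design.shape[1])],
                         index=selected)                       # step 3
        if vifs.max() > VIF_CUTOFF:                            # steps 4-5
            X = X.drop(columns=[vifs.idxmax()])
            continue
        cooks_d = model.get_influence().cooks_distance[0]      # step 6
        if cooks_d.max() <= COOK_CUTOFF:                       # step 8
            return model
        keep = cooks_d < COOK_CUTOFF                           # step 7
        X, y = X[keep], y[keep]

Under these assumptions, the "B", "S", and "B+S" models correspond to calling build_model with the respective column subsets of X.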