Figure 4: Overview of the analysis method for RQ3 (inputs: metric, fault label; output: ∆OR from univariate logistic regression)

Table 5: Summarization of the importance metrics

Metric         Description
Betweenness    # shortest paths through the vertex
Centr_betw     Centrality score according to betweenness
Centr_clo      Centrality score according to the closeness
Centr_degree   Centrality score according to the degrees
Centr_eigen    Centrality score according to eigenvector
Closeness      How close to other vertices
Constraint     The Burt's constraint
Degree         # v's adjacent edges
Eccentricity   Maximum graph distance to other vertices
Page rank      Google page rank score

coefficient from the univariate logistic regression and the standard deviation of the variable. ∆OR > 1 indicates that the corresponding metric is positively associated with fault-proneness, while ∆OR < 1 indicates a negative association.

5.3.2 Experimental result

Table 6 summarizes the ∆ORs from the univariate logistic regression analysis for the metrics of functions inside dependence clusters. In Table 6, the second and third rows show, respectively, the number of functions and the number of faulty functions inside dependence clusters for each subject system. A "×" after a ∆OR indicates that the ∆OR is not statistically significant at a significance level of α = 0.05. Note that all p-values are corrected with the Bonferroni correction method.
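The univariate analysis described above can be illustrated with a minimal sketch. The definition of ∆OR is only partly visible in this section, so the code assumes the form ∆OR = exp(β × σ), where β is the coefficient from the univariate logistic regression and σ is the sample standard deviation of the metric. The function names (`fit_univariate_logit`, `delta_or`, `bonferroni`) are hypothetical and are not taken from the study's tooling.

```python
import math
import statistics


def fit_univariate_logit(x, y, max_iter=50, tol=1e-10):
    """Fit logit(P(y=1)) = b0 + b1*x by Newton-Raphson and return (b0, b1).

    Assumes the data are not perfectly separable, so the estimate is finite.
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        # Score vector (g) and information matrix (h) of the log-likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 system h * d = g in closed form.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


def delta_or(x, y):
    """Assumed definition: exp(coefficient * sample standard deviation of x)."""
    _, b1 = fit_univariate_logit(x, y)
    return math.exp(b1 * statistics.stdev(x))


def bonferroni(p_values):
    """Bonferroni correction: multiply each p-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

Under this assumed definition, a metric whose larger values accompany more faulty functions yields ∆OR > 1, and negating the metric yields the reciprocal value. With ten importance metrics tested per system, `bonferroni` would multiply each raw p-value by 10 before it is compared with α = 0.05.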
Table 6: Results from univariate analysis for the importance metrics of functions inside dependence clusters in terms of ∆OR (RQ3)

Metric               BASH       GCC        GIMP       GLIB       GSTR
N                    900        4752       2839       688        598
# faulty functions   49         289        127        100        61
Betweenness          1.394      1.159      1.097 ×    0.876 ×    1.079 ×
Centr_betw           1.431      1.194      1.108 ×    0.949 ×    1.080 ×
Centr_clo            1.034 ×    1.257      0.957 ×    1.315      1.101 ×
Centr_degree         1.425      1.227      1.051 ×    1.314      1.223
Centr_eigen          1.013 ×    1.004 ×    1.106 ×    0.340 ×    0.947 ×
Closeness            1.035 ×    1.277      0.958 ×    1.310      1.102 ×
Constraint           0.716      0.705      1.039 ×    0.779      0.775 ×
Degree               1.425      1.227      1.051 ×    1.314      1.223
Eccentricity         0.901 ×    1.068 ×    0.998 ×    1.030 ×    0.963 ×
Page rank            1.264 ×    1.037 ×    1.246      0.845 ×    1.127 ×

In Table 6, we see that the ∆ORs of the Centr_degree and Degree metrics are larger than 1.0 in all systems. For other metrics, the ∆ORs are larger than 1.0 in most systems. This indicates that they are positively associated with fault-proneness. Overall, this result indicates that functions that play a more important role in dependence clusters tend to be more fault-prone.

5.4 RQ4. Are dependence clusters useful in fault-proneness prediction?

In the following, we describe the research method and present the experimental results for RQ4.

5.4.1 Research method

Figure 5 provides an overview of the analysis method for RQ4. To address RQ4, we use AIC as the criterion in a forward stepwise variable selection procedure to build the following two types of multivariate logistic regression models: (1) the "B" model and (2) the "B+C" model. Logistic regression is a standard statistical modeling technique in which the dependent variable can take on only one of two different values [3]. It is suitable and widely used for building fault-proneness prediction models [34, 33]. We choose the forward rather than the backward variant because the former is less time-consuming for stepwise variable selection, especially with a large number of independent metrics.
AIC is a widely used variable selection criterion [33].

Table 7: The most commonly used product, process, and network metrics in this study

Category   Description
Product    SLOC, FANIN, FANOUT, NPATH, Cyclomatic, CyclomaticModified, CyclomaticStrict, Essential, Knots, Nesting, MaxEssentialKnots, MinEssentialKnots, n1, n2, N1, N2
Process    Added, Deleted, Modified
Network    Size, Ties, Pairs, Density, nWeakComp, pWeakComp, 2StepReach, ReachEffic, Broker, nBroker, EgoBetw, nEgoBetw, effsize, efficiency, constraint, Degree, Closeness, dwReach, Eigenvector, Betweenness, Power

Table 8: Description of the studied network metrics

Metric        Description
Size          # alters that ego is directly connected to
Ties          # ties in the ego network
Pairs         # pairs of alters in the ego network
Density       % possible ties that are actually present
nWeakComp     # weak components in the ego network
pWeakComp     # weak components normalized by size
2StepReach    # nodes ego can reach within two steps
ReachEffic    2StepReach normalized by sum of alters' size
Broker        # pairs not directly connected to each other
nBroker       Broker normalized by the number of pairs
EgoBetw       % all shortest paths across ego
nEgoBetw      normalized EgoBetween (by ego size)
Effsize       # alters minus the average degree of alters
Efficiency    effsize divided by number of alters
Constraint    The extent to which ego is constrained
Degree        # nodes adjacent to a given node
Closeness     sum of the shortest paths to all other nodes
dwReach       # nodes that can be reached
Eigenvector   The influence of node in the network
Betweenness   # shortest paths through the vertex
Power         The connections of nodes in one's neighbors

(1) The "B" model. The "B" model is used as the baseline model, which is built with the most commonly used product, process, and network metrics. In this study, the product metrics consist of 16 metrics, including one code size metric, 11 complexity metrics, and 4 software science metrics. The process metrics consist of 3 code churn metrics [28].
Descriptions of the product and process metrics can be found in [38]. The network metrics consist of 21 metrics, which are described in Table 8. We choose these metrics as the baseline metrics for the following reasons. First, the network analysis metrics are also computed from dependence graphs [41]. Second, these metrics are widely used and considered useful indicators for fault-proneness prediction [25, 26, 28, 41]. Third, they can be cheaply collected from source code for large software systems.
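The forward stepwise selection with AIC described in Section 5.4.1, which is used to build models such as the "B" model from a pool of candidate metrics, can be sketched as follows. This is a minimal illustration and not the study's implementation: the helper names (`solve`, `fit_logit`, `aic`, `forward_stepwise`) are hypothetical, the fit uses plain Newton-Raphson without safeguards for separable or collinear data, and AIC is taken in its standard form 2k − 2 ln L, where k counts the intercept and the selected metrics.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def fit_logit(rows, y, iters=25):
    """Newton-Raphson logistic fit; rows must include the intercept column.

    Returns (beta, log-likelihood). No safeguards for separable data.
    """
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for r, yi in zip(rows, y):
            z = sum(b * v for b, v in zip(beta, r))
            p = 1.0 / (1.0 + math.exp(-z))
            for a in range(k):
                grad[a] += (yi - p) * r[a]
                for c in range(k):
                    info[a][c] += p * (1.0 - p) * r[a] * r[c]
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    loglik = 0.0
    for r, yi in zip(rows, y):
        z = sum(b * v for b, v in zip(beta, r))
        loglik += yi * z - math.log1p(math.exp(z))
    return beta, loglik


def aic(columns, y, selected):
    """AIC = 2k - 2 ln L for a model with an intercept plus the selected metrics."""
    rows = [[1.0] + [columns[name][i] for name in selected] for i in range(len(y))]
    _, loglik = fit_logit(rows, y)
    return 2 * (len(selected) + 1) - 2 * loglik


def forward_stepwise(columns, y):
    """Start from the intercept-only model; repeatedly add the metric that
    lowers AIC the most, and stop when no addition lowers it."""
    selected = []
    best = aic(columns, y, selected)
    while True:
        trials = [(aic(columns, y, selected + [n]), n)
                  for n in columns if n not in selected]
        if not trials:
            break
        score, name = min(trials)
        if score >= best:
            break
        selected.append(name)
        best = score
    return selected, best
```

Here `columns` maps a metric name to its list of per-function values and `y` holds the 0/1 fault labels. Each step of the forward variant fits one small model per remaining candidate, whereas the backward variant starts from the full model with all candidate metrics, which is consistent with the cost argument given above for preferring the forward variant.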