Figure 3: Overview of the analysis method for RQ2

5.1.2 Experimental result

In the following, we describe the empirical results used to answer RQ1. Table 3 summarizes the Spearman correlation coefficients relating the size metrics to the fault density of dependence clusters. In Table 3, the second column is the number of dependence clusters in each subject system. The third and the fifth columns respectively present the correlation coefficients for the Size and the Ties metrics from Spearman’s rank correlation. The correlation coefficients that are not statistically significant at the significance level of α = 0.05 are marked in gray.

In Table 3, we see that all the absolute values of the correlation coefficients are less than 0.5. This indicates that there is only a weak correlation between these two size metrics (i.e., Size and Ties) and the fault density of dependence clusters.
However, all the correlation coefficients are larger than 0 and most of them are statistically significant at the significance level of α = 0.05. This indicates that these size metrics are positively correlated with fault density. In other words, larger dependence clusters tend to be more fault-prone. Thus, large dependence clusters are likely more harmful and hence should be avoided, advice that is consistent with prior studies [8, 18].

5.2 RQ2. Are functions inside dependence clusters more fault-prone than functions outside dependence clusters?

In the following, we describe the research method and the experimental result answering RQ2.

5.2.1 Research method

Figure 3 provides an overview of the data analysis method for addressing RQ2. As can be seen, in order to answer RQ2, we use Fisher’s exact test and the odds ratio (OR) to examine whether the proportion of faulty functions inside dependence clusters is statistically significantly different from the proportion of faulty functions outside dependence clusters.

Fisher’s exact test is a statistical significance test used in the analysis of contingency tables [36]. The contingency table is a matrix that displays the frequency distribution of variables. In our study, the contingency table has four types of functions: (1) functions inside dependence clusters that have faults; (2) functions inside dependence clusters that have no faults; (3) functions outside dependence clusters that have faults; and (4) functions outside dependence clusters that have no faults.

The OR indicates the likelihood that an event (e.g., that a function is faulty) occurs [36]. Assume p is the proportion of faulty functions inside dependence clusters and q is the proportion of faulty functions outside dependence clusters. Then, OR is defined as

$$\mathrm{OR} = \frac{p/(1-p)}{q/(1-q)}.$$

Thus OR > 1 indicates that faults are more likely to occur inside dependence clusters. OR = 1 indicates an equal probability.
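As an illustration of the two quantities just defined, the following is a minimal sketch in Python, not the authors' tooling. The function names and the counts in the 2x2 table are hypothetical and chosen only for demonstration. The two-sided p-value uses the common convention of summing the probabilities of all tables that are no more likely than the observed one; this convention is an assumption, since the paper does not state which variant it uses.

```python
from math import comb


def odds_ratio(p: float, q: float) -> float:
    """OR = (p / (1 - p)) / (q / (1 - q)), with p the faulty proportion
    inside clusters and q the faulty proportion outside clusters."""
    return (p / (1 - p)) / (q / (1 - q))


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Rows: inside / outside dependence clusters.
    Columns: faulty / not faulty.
    With all margins fixed, the top-left cell follows a hypergeometric
    distribution; we sum the probabilities of every table that is at
    most as likely as the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical counts (not taken from the paper).
inside_faulty, inside_clean = 20, 180
outside_faulty, outside_clean = 10, 390

p = inside_faulty / (inside_faulty + inside_clean)
q = outside_faulty / (outside_faulty + outside_clean)

print(f"OR = {odds_ratio(p, q):.3f}")
print(f"p-value = {fisher_exact_two_sided(inside_faulty, inside_clean, outside_faulty, outside_clean):.2e}")
```

Note that, for a 2x2 table of counts [[a, b], [c, d]], the OR computed from the proportions equals the cross-product ratio (a × d) / (b × c), so either form can be used.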
5.2.2 Experimental result

Table 4 summarizes the results of the comparison of the proportions of faulty functions inside and outside dependence clusters. In Table 4, the second and the third columns respectively represent the proportion of faulty functions inside and outside dependence clusters. The fourth and the fifth columns respectively show the Bonferroni adjusted p-value from Fisher’s exact test and the OR.

Table 4: The proportion of faulty functions inside vs. outside dependence clusters (RQ2)

System   % of functions that are faulty   Fisher’s exact test    OR
         inside       outside             (adjusted p-value)
BASH      5.44%       1.82%               < 0.001                3.115
GCC       6.08%       1.59%               < 0.001                4.004
GIMP      4.47%       4.03%               1.000                  1.115
GLIB     14.54%       6.06%               < 0.001                2.638
GSTR     10.20%       2.54%               < 0.001                4.361

From Table 4, we can see that the proportion of faulty functions inside dependence clusters is larger than the proportion of faulty functions outside dependence clusters in all cases, and significantly larger in all but one case. All the p-values except that of GIMP are less than 0.05, which indicates statistical significance at the significance level of α = 0.05. That is, in four of the five systems the proportions of faulty functions in the two groups are significantly different. Meanwhile, all the ORs are greater than 1 (substantially so except for GIMP), and two are even greater than 4, which is consistent with the results from Fisher’s exact test.

Overall, Fisher’s exact test and the ORs consistently indicate that functions inside dependence clusters are more fault-prone than functions outside dependence clusters.

5.3 RQ3. Are functions playing more important roles inside dependence clusters more fault-prone?

In the following, we describe the corresponding research method and the experimental results that address RQ3.

5.3.1 Research method

Figure 4 provides an overview of the analysis method for RQ3. Functions inside dependence clusters form an independent dependence graph (e.g., SubGin in Figure 1).
In order to answer RQ3, we first use this graph to compute the importance metrics described in Table 5 for the functions inside dependence clusters in the sub-dependence graph. The metrics in Table 5 are widely used network metrics [37] that measure the extent to which these functions contribute to the sub-dependence graph. For example, the Betweenness metric for a vertex measures how many shortest paths pass through the vertex for all pairs of vertices of the subgraph. Thus, a vertex with a large Betweenness value is of high importance. Note that some of these importance metrics can be computed by one of the following three methods: “IN”, “OUT”, and “ALL”. The “IN” method concerns all incoming edges, the “OUT” method concerns all outgoing edges, and the “ALL” method treats the graph as an undirected graph. In this study, we only compute the metrics using the “OUT” method.

After that, we build univariate logistic regression models for each of these metrics with fault-proneness. Similar to prior studies [11, 38], we use ∆OR, the odds ratio associated with one standard deviation increase, to quantify the effect of these metrics on fault-proneness. ∆OR is defined as follows: $\Delta\mathrm{OR} = e^{\beta \times \sigma}$. Here, β and σ are respectively the regression