An Empirical Study on Dependence Clusters for Effort-Aware Fault-Proneness Prediction

Yibiao Yang¹, Mark Harman², Jens Krinke², Syed Islam³, David Binkley⁴, Yuming Zhou¹∗, and Baowen Xu¹
¹ Department of Computer Science and Technology, Nanjing University, China
² Department of Computer Science, University College London, UK
³ School of Architecture, Computing and Engineering, University of East London, UK
⁴ Department of Computer Science, Loyola University Maryland, USA

ABSTRACT
A dependence cluster is a set of mutually inter-dependent program elements. Prior studies have found that large dependence clusters are prevalent in software systems. It has been suggested that dependence clusters have potentially harmful effects on software quality. However, little empirical evidence has been provided to support this claim. The study presented in this paper investigates the relationship between dependence clusters and software quality at the function-level with a focus on effort-aware fault-proneness prediction. The investigation first analyzes whether or not larger dependence clusters tend to be more fault-prone. Second, it investigates whether the proportion of faulty functions inside dependence clusters is significantly different from the proportion of faulty functions outside dependence clusters. Third, it examines whether or not functions inside dependence clusters playing a more important role than others are more fault-prone. Finally, based on two groups of functions (i.e., functions inside and outside dependence clusters), the investigation considers a segmented fault-proneness prediction model.
Our experimental results, based on five well-known open-source systems, show that (1) larger dependence clusters tend to be more fault-prone; (2) the proportion of faulty functions inside dependence clusters is significantly larger than the proportion of faulty functions outside dependence clusters; (3) functions inside dependence clusters that play more important roles are more fault-prone; (4) our segmented prediction model can significantly improve the effectiveness of effort-aware fault-proneness prediction in both ranking and classification scenarios. These findings help us better understand how dependence clusters influence software quality.

CCS Concepts
• Software and its engineering → Abstraction, modeling and modularity; Software development process management;

Keywords
Dependence clusters, fault-proneness, fault prediction, network analysis

∗ Corresponding author: zhouyuming@nju.edu.cn.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ASE’16, September 03-07, 2016, Singapore, Singapore
© 2016 ACM. ISBN 978-1-4503-3845-5/16/09...$15.00
DOI: http://dx.doi.org/10.1145/2970276.2970353

1. INTRODUCTION
A dependence cluster is a set of program elements that all directly or transitively depend upon one another [8, 18]. Prior empirical studies found that large dependence clusters are highly prevalent in software systems and further complicate many software activities such as software maintenance, testing, and comprehension [8, 18].
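To make the definition concrete, the following minimal sketch treats functions as nodes and dependencies as directed edges, and reports the maximal sets of nodes that can all reach one another (the strongly connected components with more than one member). It is an illustrative reading of the definition under that assumption, not the tooling used in the study; the function name `dependence_clusters` and the toy edge list are hypothetical.

```python
from collections import defaultdict


def dependence_clusters(edges):
    """Return the maximal sets of mutually dependent nodes (size > 1).

    `edges` is an iterable of (source, target) pairs meaning
    "source depends on target". Uses Kosaraju's two-pass algorithm.
    """
    graph = defaultdict(set)
    reverse = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        graph[src].add(dst)
        reverse[dst].add(src)
        nodes.update((src, dst))

    # Pass 1: record nodes in order of DFS completion on the original graph.
    order, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(sorted(graph[start])))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(sorted(graph[nxt]))))
                    break
            else:
                # All successors explored: this node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finish order.
    clusters, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        component, todo = set(), [root]
        assigned.add(root)
        while todo:
            node = todo.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        if len(component) > 1:  # a single node is not a cluster
            clusters.append(component)
    return clusters


# Hypothetical toy program: parse, eval and apply depend on one another.
toy_edges = [
    ("main", "parse"),
    ("parse", "eval"),
    ("eval", "apply"),
    ("apply", "parse"),
    ("apply", "log"),
]
toy_clusters = dependence_clusters(toy_edges)
```

In this toy graph, `parse`, `eval`, and `apply` form one cluster because each reaches the others, while `main` and `log` fall outside any cluster: nothing depends on `main`, and `log` depends on nothing.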
In the presence of a (large) dependence cluster, an issue or a code change in one element is likely to have significant ripple effects involving the other elements of the cluster [8, 18]. Hence, there is reason to believe that dependence clusters have potentially harmful effects on software quality. This suggests that the elements inside dependence clusters have relatively lower quality when compared to elements outside any dependence cluster. Given this observation, dependence clusters should be useful in fault prediction. However, few empirical studies have investigated the effect of dependence clusters on fault-proneness prediction.

This paper presents an empirical study of the relationships between dependence clusters and fault-proneness. The concept of a dependence cluster was originally introduced by Binkley and Harman [8]. They treat program statements as basic units; however, they note that dependence clusters can also be defined at coarser granularities, such as the function level [7]. For a given program, the identification of function-level dependence clusters consists of two steps. The first step generates a function-level System Dependence Graph for all functions of the program. In general, these graphs involve two types of dependencies between functions: call dependency (i.e., one function calls another function) and data dependency (e.g., a global variable defined in one function is used in another function). In the System Dependence Graphs used in our study, nodes denote functions and directed edges denote the dependencies between these functions. In the second step, a clustering algorithm is used