Hunting for Bugs in Code Coverage Tools via Randomized Differential Testing

Yibiao Yang∗, Yuming Zhou∗, Hao Sun†, Zhendong Su‡§, Zhiqiang Zuo∗, Lei Xu∗, and Baowen Xu∗
∗State Key Lab. for Novel Software Technology, Nanjing University, Nanjing, China
†Unaffiliated
‡Department of Computer Science, ETH Zurich, Switzerland
§Computer Science Department, UC Davis, USA

Abstract—Reliable code coverage tools are critically important, as they are heavily used to facilitate many quality assurance activities, such as software testing, fuzzing, and debugging. However, little attention has been devoted to assessing the reliability of code coverage tools. In this study, we propose a randomized differential testing approach to hunting for bugs in the most widely used C code coverage tools. Specifically, by generating random input programs, our approach seeks inconsistencies in the code coverage reports produced by different code coverage tools, and then identifies the inconsistencies as potential code coverage bugs. To effectively report code coverage bugs, we addressed three specific challenges: (1) how to filter out duplicate test programs, as many of them trigger the same bugs in the code coverage tools; (2) how to automatically reduce large test programs to much smaller ones that preserve the same properties; and (3) how to determine which code coverage tools have bugs. The extensive evaluations validate the effectiveness of our approach, resulting in 42 and 28 confirmed/fixed bugs for gcov and llvm-cov, respectively. This case study indicates that code coverage tools are not as reliable as might have been envisaged. It not only demonstrates the effectiveness of our approach, but also highlights the need to continue improving the reliability of code coverage tools. This work opens up a new direction in code coverage validation, which calls for more attention in this area.

Index Terms—Code Coverage; Differential Testing; Coverage Tools; Bug Detection.

I. INTRODUCTION

Code coverage [1] refers to which code in a program is executed, and how many times, when the program runs on particular test cases. The code coverage information produced by code coverage tools is widely used to facilitate many quality assurance activities, such as software testing, fuzzing, and debugging [1]–[15]. For example, researchers recently introduced an EMI ("Equivalence Modulo Inputs") based compiler testing technique [7]. The equivalent programs are obtained by stochastically pruning the unexecuted code of a given program according to the code coverage information given by code coverage tools (e.g., llvm-cov, gcov). Therefore, the correctness of "equivalence" relies on the reliability of the code coverage tools.

In spite of their prevalent adoption in practice and extensive testing, a variety of defects still remain in code coverage tools. Fig. 1(a) shows a buggy code coverage report produced by llvm-cov [16], a C code coverage tool of Clang [17]. Note that all the test cases have been reformatted for presentation in this study. The coverage report is an annotated version of the source code, where the first and second columns list the line number and the execution frequency, respectively. We can see that the code at line 5 is marked incorrectly as unexecuted by llvm-cov.

(a) llvm-cov coverage report:

    1 |   | #include <stdio.h>
    2 |   | int main()
    3 | 1 | {
    4 | 1 |   int g = 0, v = 1;
    5 | 0 |   g = v || !v;
    6 | 1 |   printf("%d\n", g);
    7 | 1 | }

(b) the "equivalent" program:

    #include <stdio.h>
    int main()
    {
      int g = 0, v = 1;
      // g = v || !v;
      printf("%d\n", g);
    }

Fig. 1. (a) Bug #33465 of llvm-cov and (b) the "equivalent" program obtained by pruning the unexecuted code (line 5) of the program in (a)

Given the program p and its corresponding code coverage as shown in Fig. 1(a), EMI compiler testing [7] generates its "equivalent" program p' as shown in Fig. 1(b) by removing the unexecuted code (the statement at line 5). The programs p and p' will be compiled by a compiler under testing and then executed to obtain two different outputs, i.e., 1 and 0, resulting in a bug reported by the EMI approach. However, this is obviously not a real compiler bug: the incorrect code coverage report leads to the false positive in compiler testing. As code coverage tools offer fundamental information used throughout the whole process of software development, it is essential to validate the correctness of code coverage. Unfortunately, to the best of our knowledge, little attention has been devoted to assessing the reliability of code coverage tools.

This work makes the first attempt in this direction. We devise a practical randomized differential testing approach to discovering bugs in code coverage tools. Our approach first leverages programs generated by a random generator to seek inconsistencies in the code coverage reports produced by different code coverage tools, and identifies the inconsistencies as potential code coverage bugs. Second, because many inconsistency-triggering test programs are reported and a large portion of the code within these test programs is irrelevant, reporting the inconsistency-triggering tests directly is hardly beneficial to debugging. Before reporting them, those test programs need to be reduced [18]. However, it is usually very costly and thus unrealistic to reduce each and every test program [18]. We observe that many test programs trigger the same coverage bugs, so many duplicate test programs can be filtered out. Note that 'duplicate test programs' in this study means multiple test programs triggering the same code coverage bug. Overall, to effectively report coverage bugs, we need to address the following key challenges:

Challenge 1: Filtering Out Test Programs. To filter out potential test programs triggering the same code coverage bugs, the most intuitive way is to calculate similarities between
programs using the whole text. However, we use Csmith [19] as the random program generator, and two Csmith-generated programs are not meaningfully comparable as they diverge in many ways [20]. In addition, calculating similarities between programs using the whole text is expensive. To tackle this challenge, only the lines of code triggering inconsistencies are used for computing similarities between programs.

Challenge 2: Reducing Test Programs. Reducing test programs for code coverage bugs is much more complex than reducing test programs for compiler bugs, as the latter only requires testing the behavior of the compiled executables or the exit code of compilers [18]. In contrast, reducing test programs for code coverage bugs involves processing textual code coverage reports and identifying inconsistencies. After each iteration of reduction, we need to specify the inconsistency of interest we would like to preserve. In this study, we design a set of inconsistency types over coverage reports as the interest.

Challenge 3: Inspecting Coverage Bugs. With the reduced test program, we need to inspect which code coverage tools have bugs before reporting a bug. In practice, this is usually done manually [21]. In other words, developers manually inspect the coverage reports to determine which coverage tools are buggy. To relieve the burden of manual intervention, we summarize a number of rules that code coverage reports must follow. With those rules, we develop a tool to examine part of the inconsistent coverage reports and determine automatically which tools have bugs.

We implemented a differential testing prototype called C2V ("Code Coverage Validation") for code coverage tools. To evaluate the effectiveness of our approach, we applied C2V to gcov and llvm-cov, two widely used C code coverage tools integrated respectively in the production compilers GCC [22] and Clang [17]. Our evaluation confirms that C2V is very effective in finding code coverage bugs: 46 bugs were found (42 bugs confirmed/fixed) for gcov, while 37 bugs were found (28 bugs confirmed/fixed) for llvm-cov.

Contributions. We made the following contributions:
• We introduce an effective testing approach to validating code coverage tools, and have implemented it as a practical tool C2V for testing C coverage tools. C2V mainly consists of a random program generator, a comparer to identify inconsistencies between coverage reports, a filter to remove test programs triggering the same coverage bugs, a test program reducer, and an inspector to automatically determine which coverage tools have bugs for bug reporting.
• We adopted C2V to uncover 46 and 37 bugs for gcov and llvm-cov, respectively, both of which are widely used and extensively tested C code coverage tools. Specifically, for gcov, 42 bugs have already been confirmed/fixed; for llvm-cov, 28 bugs have been confirmed/fixed.
• Our evaluation indicates that code coverage tools are not as reliable as might have been envisaged. It opens up a new research direction to improve the reliability of code coverage tools, which calls for more attention in this area. Besides, there is a need to examine the influence of those bugs on other techniques that depend on code coverage.

Organization. The rest of this paper is structured as follows. Section II introduces the background on code coverage. Section III describes our approach for code coverage validation. Section IV reports the experimental results in detail. Section V surveys related work. Section VI concludes the paper and outlines directions for future work.

II. CODE COVERAGE

In this section, we introduce the preliminary knowledge, the importance, and the bug categories of code coverage.

A. Preliminaries

Code coverage is a quality assurance metric used to describe the degree to which the code of a given program is executed when a particular test suite runs [1]. It is suggested that a program with high test coverage has a lower chance of containing undetected software bugs than a program with low test coverage.

The code coverage analysis process is generally divided into three tasks: code instrumentation, data gathering, and coverage analysis. Specifically, code instrumentation inserts additional code instructions or probes to monitor whether a specific program chunk is executed at runtime. The instrumentation can be done at the source level in a separate pre-processing phase or at runtime by instrumenting byte code. Data gathering collects the coverage data produced at test runtime. Finally, coverage analysis analyzes the collected results and provides test strategy recommendations in order to reduce, feed, or modify the relevant test suite.

Currently, many code coverage tools are available [16], [23]–[31], which support different languages (e.g., C/C++ and Java), instrumentation levels (e.g., source code and byte code level), or coverage criteria. Coverage criteria are the rules or requirements that a test suite needs to satisfy [32]. In the literature, many coverage criteria have been proposed, including statement coverage, branch coverage, path coverage, condition coverage, decision coverage, and data-flow coverage [33]. These criteria can be used to guide the generation of a test suite or to evaluate the effectiveness of a test suite [33]. The short example below illustrates the difference between two common criteria.
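As an illustration (ours, not from the paper), consider the function below. A single test, abs_val(-5), executes every statement, yet only the true outcome of the branch is ever taken, so the same test suite achieves 100% statement coverage but only 50% branch coverage.

    /* Illustrative example: with the single test abs_val(-5), every
     * statement executes (100% statement coverage), but the false
     * outcome of the branch is never taken (50% branch coverage). */
    int abs_val(int x) {
        int r = x;
        if (x < 0)   /* only the true outcome is exercised */
            r = -x;
        return r;
    }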
B. Importance of Code Coverage

Code coverage is widely used in, but not limited to, the following software techniques:
• Coverage-based regression testing. In the context of regression testing, test case prioritization and test suite augmentation are the two widely used techniques [2]–[4], [10], [15], [34]–[36]. The former aims to improve the ability of test cases to find bugs by scheduling test cases in a specific order [10], [15], [37]. One common practice is to achieve high code coverage as fast as possible [38]. The latter generates new test cases to strengthen the ability of a test suite to find bugs [6], [14], [36], [39]. In practice, new test cases are often generated to cover the source code affected by code changes.
• Coverage-based compiler testing. Recent years have seen an increasing interest in compiler testing, which aims to validate
the correctness of compilers. One of the most attractive compiler testing techniques is based on the code coverage of a program's execution: equivalence modulo inputs are generated by stochastically pruning the program's unexecuted code [7], [40], [41]. With the equivalence modulo inputs, we can differentially test compilers.
• Coverage-based fuzzing. Fuzzing is one of the most effective testing techniques for finding vulnerabilities in software, and coverage-based fuzzing has been widely used in practice [8], [42]. Based on the code coverage information of test cases, it determines which test cases should be retained for fuzzing. In general, a test case is retained if it covers a new basic block.
• Coverage-based debugging. Debugging is a common activity in software development which aims to find the root cause of a fault. Spectrum-Based Fault Localization (SBFL) is one of the most extensively studied debugging techniques, and it relies heavily on code coverage [5], [43]–[47]. Under a specific test suite, SBFL leverages the code coverage and the corresponding failed/passed information to infer which code is the root cause of a fault.

As can be seen, the above-mentioned software techniques heavily depend on the code coverage information produced by code coverage tools. Therefore, it is of great importance to guarantee the correctness of code coverage reports. Otherwise, these techniques might produce incorrect results or lead to suboptimal decisions.

C. Categories of Code Coverage Bugs

Code coverage bugs can be categorized into the following three general classes:
• Spurious Marking. Spurious marking denotes that a program chunk is unexecuted at runtime but is wrongly marked as executed by a code coverage tool. Figure 2 gives such an example. In the main function, the two return-statements at line 9 and line 12 cannot both be executed in the same run. Thus, one of them must be wrongly marked by llvm-cov. At runtime, function main returns "1", indicating that line 9 was wrongly marked. A code coverage tool with spurious marking bugs causes non-executed code to be wrongly marked as executed. Consequently, for a program under test, some elements are in fact untested and hence may have latent bugs undetected.
• Missing Marking. Missing marking denotes that a program chunk is actually executed at runtime but is wrongly marked as unexecuted by a code coverage tool. The coverage bug shown in Figure 1 belongs to this class. A code coverage tool with missing marking bugs causes a predefined coverage goal to be never achieved, regardless of how much testing time and resource are allocated.
• Wrong Frequency. Wrong frequency denotes that a program chunk is executed m times at runtime but is wrongly marked as executed n times by a code coverage tool, where (m ≠ n) ∧ (m > 0) ∧ (n > 0). Figure 3 shows a gcov coverage report, in which the first column lists the execution frequency and the second column lists the line number. As can be seen, the code at line 11 is wrongly marked as executed twice, but it was actually executed only once. A code coverage tool with wrong frequency bugs might lead to suboptimal decisions in many coverage-based software engineering activities.

     1 |   | #include <stdlib.h>
     2 |   |
     3 |   | int main(void)
     4 | 1 | {
     5 | 1 |   static int *p;
     6 | 1 |
     7 | 1 |   p = malloc(sizeof(int));
     8 | 1 |   if (p == NULL)
     9 | 1 |     return 0;
    10 | 1 |
    11 | 1 |   *p = 7;
    12 | 1 |   return 1;
    13 | 1 | }

Fig. 2. Spurious marking (Bug #37092 of llvm-cov)

        1:    1: int func(int i)
        -:    2: {
        1:    3:   if (i > 0)
        1:    4:     return 1;
    #####:    5:   return 0;
        -:    6: }
        -:    7:
        1:    8: int main()
        -:    9: {
        1:   10:   int i = 1, j = 2;
        2:   11:   func((i == 0) || (i && j, 1));
        1:   12:   return 0;
        -:   13: }

Fig. 3. Wrong frequency (Bug #85163 of gcov)
III. METHODOLOGY

Figure 4 presents our framework for code coverage validation. In the following, we describe the key steps in this framework. In particular, we use gcov and llvm-cov as the subject code coverage tools to illustrate our approach to hunting for code coverage bugs where necessary.

A. Generating Test Programs

Our approach starts with a random program generator. The Generator randomly generates a large program set P. Each program p ∈ P will be used as a test program for differentially testing the two code coverage tools under examination. In other words, each program will be fed to the two code coverage tools to obtain two raw coverage reports.

In our study, test programs refer to a collection of compilable C programs with the corresponding inputs. We choose Csmith to generate the test programs for the following reasons: (1) it supports many C language features and can avoid generating programs with undefined behaviors [48], thus outperforming some random C program generators [49], [50]; (2) each generated program is one single file with its input self-contained, requiring no extra inputs; and (3) it is fast enough to generate programs with tens of thousands of lines within a few seconds.
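The following is a minimal sketch (ours, not the actual C2V implementation) of the overall differential-testing loop. It assumes csmith, gcc, and clang are on the PATH and that CSMITH_HOME points at the Csmith headers; the coverage commands follow Section IV-A, and the parsing and comparison steps are sketched in Sections III-B and III-C.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        for (long i = 0; i < 1000000; i++) {
            /* 1. Generate a random test program t.c. */
            if (system("csmith > t.c") != 0) continue;

            /* 2. Produce the gcov report t.c.gcov (see Section IV-A). */
            system("gcc -O0 --coverage -I$CSMITH_HOME/include t.c"
                   " && timeout 10 ./a.out; gcov t.c");

            /* 3. Produce the llvm-cov report t.c.lcov (see Section IV-A). */
            system("clang -fprofile-instr-generate -fcoverage-mapping -O0"
                   " -I$CSMITH_HOME/include t.c && timeout 10 ./a.out;"
                   " llvm-profdata merge default.profraw -o t.profdata;"
                   " llvm-cov show -instr-profile=t.profdata ./a.out t.c"
                   " > t.c.lcov");

            /* 4. Parse both reports into the unified format, compare the
             *    common instrumentation sites (Sections III-B and III-C),
             *    and save t.c whenever the unified reports disagree. */
        }
        return 0;
    }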
[Fig. 4. The framework for code coverage validation. The Generator produces a program set P; each program p ∈ P is fed to the coverage tools t1 and t2 to obtain reports r1 and r2; the Parser turns them into unified reports ur1 and ur2; the Comparer checks their consistency; inconsistency-triggering programs pass through the Filter (duplicate check) into IPS, are reduced by the Reducer into R-IPS, and are finally classified by the Inspector for bug reporting. IPS: Inconsistency-triggering Program Set; R-IPS: Reduced IPS.]

B. Parsing Code Coverage Reports

For each test program p in the generated program set P, we obtain the raw code coverage reports r1 and r2 emitted by the coverage tools t1 and t2, respectively. However, the raw coverage reports cannot be directly compared since they are presented in different formats. We thus develop a parser to transform ri into uri in a unified format (1 ≤ i ≤ 2). Specifically, uri is a sequence of two-tuples, each consisting of a monitored program chunk and the corresponding execution frequency. Given a program p with N lines of source code, uri is a sequence of N two-tuples (nj, fj) in ascending order of nj, 1 ≤ j ≤ N. Here, nj is the line number and fj is the corresponding execution frequency. If line nj is not an instrumentation site for a particular coverage tool, fj is assigned −1.

In our study, gcov and llvm-cov are selected as the subject code coverage tools, which are integrated with the mature production compilers GCC and Clang, respectively. In the following, we give an example to demonstrate how the parser works in practice. Figure 5(a) shows the code coverage report r1 produced by gcov for a program, and Figure 5(b) shows the code coverage report r2 produced by llvm-cov for the same program. r1 and r2 are two annotated versions of the source file. However, there are three differences between r1 and r2. First, they have different instrumentation sites. On the one hand, lines 4, 6∼8, 12, 16, 19, and 22 are treated as instrumentation sites by llvm-cov but not by gcov. On the other hand, lines 3 and 18 are treated as instrumentation sites by gcov but not by llvm-cov. Hence, only the common instrumentation sites used by both gcov and llvm-cov need to be compared later. Second, gcov and llvm-cov have different annotation formats for execution frequency. A non-instrumentation site is noted as a hyphen "-" in r1 but as a null string in r2 (e.g., line 3). Besides, an unexecuted line is noted as hashes "#####" in r1 (e.g., line 15), while it is noted as "0" in r2 (e.g., line 9). Third, coverage statistics are included in r1 but are not available in r2. Figure 5 also lists the unified coverage reports produced by our parser for gcov and llvm-cov, respectively. As can be seen, there are 9 common instrumentation sites between them: lines 5, 9, 10, 11, 13, 14, 15, 20, and 21. A minimal sketch of such a parser is given below.
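The following sketch (ours, not C2V's actual parser) shows how one line of a gcov report is turned into such a two-tuple; it assumes the "count: lineno: source" layout shown in Fig. 3 and Fig. 5(a).

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* One element of a unified report: (line number, frequency),
     * where frequency -1 marks a non-instrumentation site. */
    typedef struct { int line; long freq; } Tuple;

    /* Parse one gcov report line, e.g. "        6:    5: for (...)",
     * "        -:    2:", or "    #####:   15: return 0;". */
    Tuple parse_gcov_line(const char *s) {
        Tuple t = { 0, -1 };
        char count[32] = { 0 };
        sscanf(s, " %31[^:]: %d:", count, &t.line);
        if (strcmp(count, "-") == 0)
            t.freq = -1;            /* not an instrumentation site */
        else if (strcmp(count, "#####") == 0)
            t.freq = 0;             /* instrumented but unexecuted */
        else
            t.freq = atol(count);   /* executed 'count' times */
        return t;
    }

An analogous routine for llvm-cov's "lineno | count | source" layout, followed by a merge by line number, yields the (nj, fj) sequences that are compared next.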
C. Comparing Unified Coverage Reports

After obtaining the unified coverage reports ur1 and ur2, we use a tool Comparer to determine whether they are consistent. If not, the corresponding p and its associated coverage reports are added to a set named IPS (Inconsistency-triggering test Program Set). When comparing the unified coverage reports, Comparer only takes into account the execution frequencies of the instrumentation sites common to the n coverage tools. During the comparison of ur1 and ur2, the following types of cases might occur:
• Type A: a line of code is marked as executed by t1 but as unexecuted by t2;
• Type B: a line of code is marked as unexecuted by t1 but as executed by t2;
• Type C: a line of code is marked as executed k times by t1 and as executed l times by t2, where k ≠ l.

Consequently, the inconsistent unified reports can be divided into seven categories (C001: Type C; C010: Type B; C100: Type A; C011: Type B + C; C101: Type A + C; C110: Type A + B; C111: Type A + B + C). If the unified coverage reports corresponding to the test program p are found to have an inconsistency category ic, it will be handled by the filter.

Consider the two unified coverage reports shown in Fig. 5. Comparer compares the execution frequencies of the 9 common instrumentation sites to determine whether ur1 and ur2 are consistent. As can be seen, there are four inconsistent execution frequencies in the unified coverage reports between gcov and llvm-cov: Type C at line 5, Type C at line 13, Type A at line 9, and Type B at line 15. Consequently, the inconsistency category of ur1 and ur2 is C111. In other words, the inconsistency introduced by the test program p belongs to the C111 category. The sketch below shows one way to compute this classification.
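A minimal sketch (ours) of the category computation: the presence of Type A, B, and C inconsistencies is encoded as three bits, so the categories C001 through C111 fall out of a bitmask. It assumes the Tuple type from the parser sketch above.

    /* Classify the inconsistency between two unified reports of length n.
     * Returns a 3-bit mask: bit 2 = Type A, bit 1 = Type B, bit 0 = Type C;
     * e.g. a result of 7 corresponds to category C111, and 0 means the
     * reports are consistent. */
    int classify(const Tuple *ur1, const Tuple *ur2, int n) {
        int mask = 0;
        for (int j = 0; j < n; j++) {
            long k = ur1[j].freq, l = ur2[j].freq;
            if (k < 0 || l < 0) continue;                 /* not a common site */
            if (k > 0 && l == 0) mask |= 4;               /* Type A */
            else if (k == 0 && l > 0) mask |= 2;          /* Type B */
            else if (k > 0 && l > 0 && k != l) mask |= 1; /* Type C */
        }
        return mask;
    }

For the reports of Fig. 5, lines 5 and 13 set the Type C bit, line 9 the Type A bit, and line 15 the Type B bit, giving the mask 7, i.e., category C111.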
    line | gcov | llvm-cov | source
       1 |    - |        - | int g = 1;
       2 |    - |        - |
       3 |    2 |        - | int f(int arg)
       4 |    - |        2 | {
       5 |    6 |        3 |   for (int i = 0; i != 1; ++i) {
       6 |    - |        3 |
       7 |    - |        2 |     int f[1];
       8 |    - |        2 |     if (0)
       9 |    1 |        0 |       break;
      10 |    2 |        2 |     if (arg)
      11 |    1 |        1 |       break;
      12 |    - |        2 |   }
      13 |    4 |        2 |   for (;g;)
      14 |    2 |        2 |     return 0;
      15 |    0 |        2 |   return 0;
      16 |    - |        2 | }
      17 |    - |        - |
      18 |    1 |        - | int main()
      19 |    - |        1 | {
      20 |    1 |        1 |   f(0); f(1);
      21 |    1 |        1 |   return 0;
      22 |    - |        1 | }

Fig. 5. An illustrating example (merged view): the per-line execution frequencies reported by (a) gcov 7.3.0 (unified report ur1) and (b) llvm-cov 6.0.0 (unified report ur2). A "-" marks a non-instrumentation site (fj = −1 in the unified report); in its raw report, gcov prints "#####" for an instrumented but unexecuted line such as line 15.

D. Filtering out Test Programs

Intuitively, a test program set P with a larger size has a higher chance of discovering code coverage bugs. Therefore, in practice, we prefer to generate a very large number of test programs, which in turn results in a large number of inconsistency-triggering test programs. This leads to two problems. On the one hand, it may be unrealistic to inspect all the inconsistency-triggering test programs, as the reduction is very costly [18] and inspection resources are often limited. On the other hand, we observe that many test programs trigger the same code coverage bugs. Therefore, we can filter out duplicate test programs.

In Fig. 4, we use Filter to determine whether an inconsistency-triggering test program is a "duplicate", i.e., whether it triggers the same code coverage bug as other test programs. To filter out potential "duplicate" test programs, the most intuitive way is to calculate similarities between programs using the whole text. However, we use Csmith [19] as the random program generator, and two Csmith-generated programs are not meaningfully comparable as they diverge in many ways [20]. In addition, calculating similarities between programs using the whole text is expensive. To tackle this challenge, only the lines of code triggering inconsistencies are used for computing similarities between programs. More specifically, we first transform the inconsistency-triggering lines of code in each program into a token-based string that is insensitive to variable names, and then use a token-based similarity algorithm to calculate similarities between programs. We make the similarity insensitive to variable names because the variable names in Csmith-generated programs have no specific meaning. For example, assume "int i = 0; i = 1;" are the inconsistency-triggering lines of code in program p. We first transform these two statements into "int identifier equal numeric_constant semi identifier equal numeric_constant semi" with the help of the clang compiler. After that, we calculate its similarities with other programs. In this study, we use the Jaccard index as it is the most widely used token-based similarity algorithm [51]. If all the similarity coefficients between p and each of the programs in IPS are less than 0.8, a tuple (p, ic, ) will be added to the set IPS. Otherwise, p will be filtered out. A sketch of this similarity computation is given below.
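A minimal sketch (ours, not C2V's filter) of the Jaccard computation over such token strings: tokens are treated as whitespace-separated words, and the index is |A ∩ B| / |A ∪ B| over the distinct tokens of the two strings.

    #include <stdio.h>
    #include <string.h>

    #define MAX_TOK 256

    /* Split s into whitespace-separated tokens (destructive). */
    static int tokenize(char *s, char *tok[], int max) {
        int n = 0;
        for (char *t = strtok(s, " \t\n"); t && n < max;
             t = strtok(NULL, " \t\n"))
            tok[n++] = t;
        return n;
    }

    static int contains(char *tok[], int n, const char *t) {
        for (int i = 0; i < n; i++)
            if (strcmp(tok[i], t) == 0) return 1;
        return 0;
    }

    /* Jaccard index over the distinct tokens of two token strings. */
    double jaccard(const char *a, const char *b) {
        char ba[1024], bb[1024];
        char *ta[MAX_TOK], *tb[MAX_TOK], *u[2 * MAX_TOK];
        strncpy(ba, a, sizeof ba - 1); ba[sizeof ba - 1] = '\0';
        strncpy(bb, b, sizeof bb - 1); bb[sizeof bb - 1] = '\0';
        int na = tokenize(ba, ta, MAX_TOK);
        int nb = tokenize(bb, tb, MAX_TOK);
        int nu = 0, ni = 0;
        for (int i = 0; i < na; i++)        /* distinct tokens of A */
            if (!contains(u, nu, ta[i])) u[nu++] = ta[i];
        int only_a = nu;
        for (int i = 0; i < nb; i++)        /* plus distinct tokens of B */
            if (!contains(u, nu, tb[i])) u[nu++] = tb[i];
        for (int i = 0; i < only_a; i++)    /* tokens of A also in B */
            if (contains(tb, nb, u[i])) ni++;
        return nu ? (double)ni / nu : 1.0;
    }

A new inconsistency-triggering program is kept only if jaccard() stays below the 0.8 threshold against the token string of every program already in IPS.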
E. Reducing Test Programs

An inconsistency-triggering test program only indicates that at least one of the two code coverage tools is buggy; it does not tell which one. Therefore, the inconsistency-triggering test program needs to be inspected to obtain the correct coverage report (as the coverage oracle). After that, for each code coverage tool, we can easily determine whether it is buggy by comparing the corresponding coverage report with the coverage oracle. Since a test program may have a large code size, it is time-consuming and error-prone to determine the corresponding coverage oracle. Furthermore, if a large test program is submitted as a bug report, it is also difficult for developers to locate the bugs in the code coverage tool(s). To tackle this problem, we use Reducer to shrink each inconsistency-triggering test program in IPS by removing the code irrelevant to the inconsistency.

For each inconsistency-triggering test program p in IPS, Reducer takes the following steps to obtain the reduced test program. At step 1, the tool C-Reduce [18] is applied to p to obtain the reduced version p'. If p' has a smaller size than p, go to step 2; otherwise, stop the reduction. At step 2, feed p' to the n coverage tools, collect the coverage reports, and determine the inconsistency category. If the inconsistency category triggered by p' is the same as the inconsistency category triggered by p, assign p' to p. The above process is repeated until p cannot be further reduced. Consequently, we reduce IPS to obtain R-IPS, the set of reduced test programs. The sketch below illustrates this fixed-point loop.
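A minimal sketch (ours) of the reduction loop. classify_file() is a hypothetical helper assumed to rerun both coverage tools and return the inconsistency mask from the Comparer sketch, and interesting.sh is an assumed C-Reduce interestingness script that exits 0 iff the current variant still triggers the category passed via the IC environment variable; the exact C-Reduce integration in C2V may differ.

    #include <stdio.h>
    #include <stdlib.h>

    int classify_file(const char *path);   /* hypothetical, see above */

    static long file_size(const char *path) {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;
        fseek(f, 0, SEEK_END);
        long n = ftell(f);
        fclose(f);
        return n;
    }

    /* Reduce the program at 'path' while preserving its inconsistency
     * category; stop once a round of C-Reduce no longer shrinks it. */
    void reduce(const char *path) {
        int ic = classify_file(path);
        char cmd[256];
        snprintf(cmd, sizeof cmd, "IC=%d creduce ./interesting.sh %s", ic, path);
        long before, after = file_size(path);
        do {
            before = after;
            system(cmd);              /* one round of reduction */
            after = file_size(path);
        } while (after >= 0 && after < before);
    }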
F. Inspecting Reduced Test Programs

With the reduced test program, we need to inspect which code coverage tools have bugs before reporting. In practice, this is usually done manually [21]. In other words, developers manually inspect the coverage reports to determine which coverage tools are buggy. To relieve the burden of manual intervention, we summarize the following rules that code coverage reports must comply with (a sketch of a mechanical check of these rules is given at the end of this subsection):

• Identical Coverage: Assuming statements s1 and s2 are in the same block: {s1; s2;}. If s1 is not a jump statement (i.e., a break, goto, return, exit, or abort statement) and s2 is neither a label statement nor a loop (for or while) statement, s1 and s2 should have identical coverage.

• Unexecuted Coverage: Assuming statements s1 and s2 are in the same block: {s1; s2;}. If s1 is a return, break, goto, or exit statement and s2 is not a labeled statement, s2 should never be executed.

• Ordered Coverage: Assuming statements s1 and s2 form: s1; if (...) {s2; ...}. If s2 is not a labeled statement, the execution count of s1 should be no less than that of s2.

With the above rules, we developed the tool Inspector to examine the inconsistent coverage reports and determine automatically which tools have bugs. There are still some inconsistent coverage reports in R-IPS that cannot be inspected automatically by our tool. We inspect those coverage reports manually. This process does not require too much human inspection effort, as the reduced test programs only have a few lines (usually fewer than 13 lines in our study).
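These rules lend themselves to mechanical checking. The sketch below is a simplified illustration rather than Inspector's actual implementation (which is not shown in this paper): it assumes each statement occupies its own source line, that a parsed report is available as a line-to-count map, and that each statement's syntactic kind has already been classified; all of these are assumptions made for illustration.

    # Minimal sketch of the three rules, assuming a parsed report 'cov'
    # mapping line numbers to execution counts, and a hypothetical
    # classification of each statement's kind ("break", "label", "for", ...).

    JUMP_KINDS = {"break", "goto", "return", "exit", "abort"}

    def identical_coverage_ok(cov, s1, s2, kind1, kind2):
        """{s1; s2;}: unless s1 jumps away or s2 is a label or loop header,
        both statements must have the same count."""
        if kind1 in JUMP_KINDS or kind2 in {"label", "for", "while"}:
            return True          # rule does not apply
        return cov.get(s1) == cov.get(s2)

    def unexecuted_coverage_ok(cov, s1, s2, kind1, kind2):
        """{s1; s2;}: code right after return/break/goto/exit is dead,
        so a non-label s2 must have count 0."""
        if kind1 in {"return", "break", "goto", "exit"} and kind2 != "label":
            return cov.get(s2, 0) == 0
        return True

    def ordered_coverage_ok(cov, s1, s2, kind2):
        """s1; if (...) { s2; ... }: a non-label s2 cannot run more often
        than the statement preceding its guard."""
        if kind2 == "label":
            return True
        return cov.get(s1, 0) >= cov.get(s2, 0)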
G. Reporting Test Programs to Developers

For each test program in RS-IPS, this step simply generates bug reports for the corresponding buggy tool(s). A bug report mainly consists of the reduced test program and the affected versions. If a test program triggers multiple bugs, multiple separate bug reports are generated.

IV. EVALUATION

In this section, we first present the subject coverage tools and the testing environment. Then, we describe our experimental results in detail.

A. Subject Code Coverage Tools

In this study, we select gcov and llvm-cov as our subject code coverage tools. We choose these two code coverage tools since: (1) they have been widely used in the software engineering community; and (2) they have been integrated with the most widely-used production compilers, i.e., GCC and Clang. More specifically, we chose gcov in the latest development trunk of GCC and llvm-cov in the latest LLVM development trunk.

For gcov, the commands we used to compile a given source file, e.g., t.c, and produce the corresponding coverage report are as follows:

# gcc -O0 --coverage t.c; ./a.out; gcov t.c

The gcov-generated coverage report is stored in a file named t.c.gcov. For llvm-cov, we use the following commands:

# clang -fprofile-instr-generate -fcoverage-mapping -O0 t.c; ./a.out; llvm-profdata merge default.profraw -o t.profdata; llvm-cov show -instr-profile=t.profdata ./a.out t.c > t.c.lcov

The llvm-cov-generated coverage report is stored in a file named t.c.lcov. The two coverage reports are then parsed into a unified format and compared to detect inconsistencies. It is worth noting that we use the -O0 option to turn off compiler optimizations, so that it makes sense to compare the coverage reports produced in this way.
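To make this comparison step concrete, the following is a minimal sketch of how the two report formats could be parsed into a unified line-to-count map and diffed. It is written for this presentation and is not C2V's actual code; the handling of report layouts is simplified, and abbreviated llvm-cov counts (e.g., "1.2k") are ignored.

    import sys

    def parse_gcov(path):
        """Parse a t.c.gcov report into {line: count}. gcov rows look like
        '        7:    5:  g = v;' (count:line:source); '-' marks
        non-executable lines and '#####' marks unexecuted ones."""
        cov = {}
        for row in open(path):
            fields = row.split(":", 2)
            if len(fields) < 3:
                continue
            count, line = fields[0].strip(), fields[1].strip()
            if count == "-" or not line.isdigit():
                continue
            cov[int(line)] = 0 if count.startswith("#") else int(count.rstrip("*"))
        return cov

    def parse_llvm_cov(path):
        """Parse a t.c.lcov report (llvm-cov show) into {line: count}. Rows
        look like '    5|      7|  g = v;' (line|count|source); an empty
        count field marks a non-executable line."""
        cov = {}
        for row in open(path):
            fields = row.split("|", 2)
            if len(fields) < 3:
                continue
            line, count = fields[0].strip(), fields[1].strip()
            if line.isdigit() and count.isdigit():
                cov[int(line)] = int(count)
        return cov

    def diff(gcov, lcov):
        """Lines that both tools consider executable but count differently.
        (This sketch skips lines that only one tool marks executable.)"""
        return {n: (gcov[n], lcov[n])
                for n in sorted(set(gcov) & set(lcov)) if gcov[n] != lcov[n]}

    if __name__ == "__main__":
        print(diff(parse_gcov(sys.argv[1]), parse_llvm_cov(sys.argv[2])))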
B. Testing Environment Setup

Our evaluation was conducted on a Linux server with an Intel(R) Xeon(R) CPU@2.00GHz (60 cores) and 32GB RAM. The server runs Ubuntu 17.10 (x86_64). We spent four non-continuous months, of which over one month was devoted to developing various tools. The rest of the time was spent on testing the two code coverage tools, filtering out test programs, reducing test programs, and inspecting test programs. Initially, we only used Csmith-generated programs as the test programs. Later, we also made use of programs selected from GCC's and Clang's test-suites, since they may cover C semantics that Csmith does not. This is further confirmed in Section IV-C, as a number of bugs were detected by programs from these test-suites.

C. Experimental Results

Inconsistent Coverage Reports. Table I shows the statistics of inconsistency-triggering test programs over Csmith-generated programs, GCC's test-suite, and Clang's test-suite. Column 2 shows the total number of test programs, and Column 3 shows the number of test programs that ran out of time (10 seconds in our experiment). We used 1 million Csmith-generated programs and collected 2,756 and 106 compilable C programs from GCC's and Clang's test-suites, respectively. Note that there are tens of thousands of test programs in GCC's and Clang's test-suites; only those C files that can be compiled independently are considered here. Among all these programs, 182,927 executed for more than 10 seconds and hence were excluded from further analysis. The remaining test programs were fed to C2V. Column 4 is in the form of 'a / b', where 'a' is the total number of test programs that lead to inconsistencies and 'b' is the percentage of 'a' over the number of test programs C2V analyzed (i.e., Column 2 − Column 3). We found 261,347 programs leading to inconsistent coverage reports (261,065, 262, and 20 respectively from Csmith-generated programs, GCC's test-suite, and Clang's test-suite). About 31.95% of the Csmith-generated programs caused inconsistent coverage reports, a much higher proportion than for GCC's and Clang's test-suites. Columns 5 to 11 display the distribution of inconsistency-triggering test programs over the 7 categories. In the Csmith-generated row, the numbers in parentheses indicate the number of test programs remaining after filtering out potential test programs that trigger the same code coverage bugs.

TABLE I
STATISTICS OF INCONSISTENCY-TRIGGERING TEST PROGRAMS

Source             | #Test Programs | #Time out | #Num (after filtering) / %P | C001        | C010          | C100       | C011        | C101    | C110         | C111
Csmith-generated   | 1,000,000      | 182,927   | 261,065 (758) / 31.95%      | 3,331 (119) | 238,554 (251) | 1,625 (36) | 4,547 (115) | 87 (33) | 12,470 (151) | 451 (53)
GCC's test-suite   | 2,756          | 15        | 262 / 9.56%                 | 81          | 124           | 15         | 30          | 8       | 1            | 3
Clang's test-suite | 106            | 15        | 20 / 21.98%                 | 9           | 5             | 4          | 1           | 0       | 1            | 0

Most of the inconsistent reports fell into the C010 category, indicating that the majority of inconsistencies belong to the cases where some code is marked as unexecuted by gcov but as executed by llvm-cov. The C101 category has the smallest number of inconsistent reports, indicating that inconsistencies of Type A and Type C rarely occur simultaneously. In addition, we found that our method is very effective at filtering out potential "duplicate" test programs.
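Section IV-E notes that program similarity is computed over inconsistency-triggering lines only. The exact metric is not reproduced in this paper; the sketch below shows one plausible realization of such a filter, in which the normalized text of the disagreeing lines serves as a duplicate-detection fingerprint. The fingerprint function and its normalization are illustrative assumptions, not C2V's actual strategy.

    def fingerprint(program_lines, inconsistent_linenos):
        """Key a test program by the normalized text of only its
        inconsistency-triggering lines; whole-program similarity would be
        dominated by the large amount of irrelevant random code."""
        return tuple(program_lines[n - 1].strip()
                     for n in sorted(inconsistent_linenos))

    def filter_duplicates(candidates):
        """candidates: iterable of (program_lines, inconsistent_linenos).
        Keep one representative per fingerprint."""
        seen, kept = set(), []
        for lines, linenos in candidates:
            key = fingerprint(lines, linenos)
            if key not in seen:
                seen.add(key)
                kept.append((lines, linenos))
        return kept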
For the Csmith-generated inconsistency-triggering programs, our filtering strategy left 758 test programs for reduction and inspection. Table II lists their code size before and after reduction. The mean lines of code drops from 210 to 10, which helps report code coverage bugs effectively. For the inconsistency-triggering test programs from GCC's and Clang's test-suites, we intended to inspect all of them. The reasons are two-fold. First, programs in these test-suites are usually unique in program structure (unlike the randomly generated programs produced by Csmith). Second, their total number is small (262 from GCC's test-suite and 20 from Clang's test-suite). In summary, we analyzed about 1,000 inconsistency-triggering test programs.

TABLE II
STATISTICS OF LINES OF CODE FOR THE ORIGINAL AND REDUCED VERSIONS OF CSMITH-GENERATED TEST PROGRAMS

         | min | mean  | median | max
Original | 41  | 210   | 167    | 874
Reduced  | 2   | 10    | 8      | 27
Relative | –   | 4.76% | 4.79%  | –
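Reduction to the sizes in Table II can be automated by a test-case reducer that repeatedly shrinks a program while an "interestingness" script confirms the coverage disagreement still holds. The sketch below illustrates such a script; it reuses the hypothetical parse/diff helpers from the earlier sketch (imported here as covdiff) and is not the paper's actual reduction setup.

    #!/usr/bin/env python3
    """Interestingness test (sketch): a shrunken t.c stays 'interesting'
    only if both tool chains still succeed and their coverage reports still
    disagree. A production test would additionally reject reduced programs
    that introduce undefined behavior."""
    import subprocess, sys
    from covdiff import parse_gcov, parse_llvm_cov, diff  # hypothetical module

    GCOV = "gcc -O0 --coverage t.c && ./a.out && gcov t.c"
    LCOV = ("clang -fprofile-instr-generate -fcoverage-mapping -O0 t.c"
            " && ./a.out && llvm-profdata merge default.profraw -o t.profdata"
            " && llvm-cov show -instr-profile=t.profdata ./a.out t.c > t.c.lcov")

    def ok(cmd):
        try:  # reducers invoke this script many times, so time out fast
            return subprocess.run(cmd, shell=True, capture_output=True,
                                  timeout=10).returncode == 0
        except subprocess.TimeoutExpired:
            return False

    if __name__ == "__main__":
        interesting = (ok(GCOV) and ok(LCOV) and
                       bool(diff(parse_gcov("t.c.gcov"),
                                 parse_llvm_cov("t.c.lcov"))))
        sys.exit(0 if interesting else 1)  # exit 0 == keep this variant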
Bug Count. We filed a total of 83 bug reports for gcov and llvm-cov during our testing period. They can be found under "yangyibiao@nju.edu.cn" in GCC's and LLVM's Bugzilla databases. Table III summarizes all the bugs we have reported so far. As of April 16, 2018, we had reported 83 bugs, of which 70 were confirmed by developers. Of the 70 confirmed bugs, 11 are resolved as "fixed", 17 as "won't fix", and 2 as "works for me" in the latest version. It is worth noting that "won't fix" indicates a confirmed bug that will nevertheless not be fixed by developers. In total, 6 bugs are pending developers' responses; these are not listed in Table V. Consistent with the studies of C. Sun et al. [21] and V. Le et al. [52], because the bug management process of LLVM is not as organized as that of GCC, we label an llvm-cov bug as confirmed if the bug report has been CCed by Clang developers and there is no objection in the comments. In addition, as stated by developers, if someone does not close a reported bug as "invalid", then the bug is real in LLVM Bugzilla. Another 4 reported bugs were marked as duplicates by developers, since similar test programs trigger the same coverage bug inside gcov or llvm-cov. The remaining 3 reports were rejected by GCC developers: two of them concern GCC's default optimization strategy, and the other contains invalid code.

TABLE III
INFORMATION OF ALL REPORTED BUGS

          | gcov | llvm-cov | Total
Confirmed | 42   | 28       | 70
Pending   | 1    | 5        | 6
Duplicate | 0    | 4        | 4
Rejected  | 3    | 0        | 3
Reported  | 46   | 37       | 83

Table V lists all the confirmed bugs in detail, including the identity, priority, current report status, bug type, the origin of the bug-triggering test program (i.e., Csmith-generated, GCC's test-suite, or Clang's test-suite), and the affected versions. Note that 'New' indicates confirmed in GCC's Bugzilla, and 'Assigned' means that the confirmed bug is in the process of being fixed.

Bug type. We categorize coverage bugs into three classes, as mentioned in Section II-C: Spurious Marking, Missing Marking, and Wrong Frequency. Table IV shows the breakdown of the bug types of all the confirmed bugs. Most of the bugs are spurious marking bugs, i.e., unexecuted code is wrongly marked as executed.

TABLE IV
BUG TYPES OF CONFIRMED BUGS

                 | gcov | llvm-cov | Total
Spurious Marking | 18   | 16       | 34
Missing Marking  | 13   | 5        | 18
Wrong Frequency  | 11   | 7        | 18
Total            | 42   | 28       | 70

Bug importance. As shown in Column 4 of Table V, all our confirmed bugs have the default priority P3, except 14 of them, which were reset to P5 by developers. Besides, 13 of our reported coverage bugs have been fixed by developers. Note that 2 bugs are resolved as 'Works', which means that they are fixed in the developers' version; thus, we consider these two bugs as fixed as well. Together, this shows that our reported bugs are important and worth the effort.

Program source. As shown in Column 7 of Table V, test programs from all three main sources (i.e., Csmith-generated, GCC's test-suite, and Clang's test-suite) can trigger coverage bugs. Two programs from Clang's test-suite trigger coverage bugs of gcov, and a number of programs from GCC's test-suite help find coverage bugs for llvm-cov as well. It is worth noting that test programs from different sources may induce the same coverage bug. This indeed happened in our experiment, and we reported each such bug only once.

Affected versions. We only tested gcov and llvm-cov in the latest development trunks of GCC and Clang, respectively. When we find a test case that reveals a bug in gcov or llvm-cov, we also check the corresponding compiler's stable releases against the same test case. Column 8 of Table V shows the affected versions of GCC and Clang for each bug. Note that for GCC, we select GCC-4.8.0, GCC-5.4.1, GCC-6.4.0, GCC-7.2.0, and the latest trunk version (GCC-8.0); for Clang, we select Clang-3.8, Clang-4.0, Clang-5.0, and the latest trunk version (Clang-7.0). As can be seen, 42 of the bugs affect stable releases, and a considerable number of bugs have been latent in old stable releases (such as GCC-4.8.0 and Clang-3.8) for many years.
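The release check just described can be mechanized by replaying the reduced test program under each installed release and testing whether the buggy coverage pattern reappears. A rough sketch follows; the binary names, install layout, and the deliberately naive comparison of whole report texts are all assumptions for illustration.

    import subprocess

    # Hypothetical compiler driver names; in practice these would point at
    # separately installed releases (e.g., /usr/local/gcc-4.8/bin/gcc).
    GCC_RELEASES = ["gcc-4.8", "gcc-5.4", "gcc-6.4", "gcc-7.2", "gcc-trunk"]

    def gcov_report(cc, src="t.c"):
        """Build src with coverage under compiler cc, run it, and return
        the gcov report text ('' if any step fails)."""
        for cmd in (f"{cc} -O0 --coverage {src}", "./a.out", f"gcov {src}"):
            if subprocess.run(cmd, shell=True, capture_output=True).returncode:
                return ""
        return open(src + ".gcov").read()

    def affected_releases(buggy_report):
        """Releases reproducing the known-buggy report. A real check would
        compare only the counts of the bug-triggering lines, since headers
        and unrelated lines may differ across gcov versions."""
        return [cc for cc in GCC_RELEASES if gcov_report(cc) == buggy_report]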
TABLE V
CONFIRMED BUGS

#  | Tool     | ID    | Prio. | Status    | Type        | Source | Affected Ver.
1  | gcov     | 83434 | P4    | New       | Missing     | Csmith | v7,trunk
2  | gcov     | 83465 | P3    | Won't fix | Missing     | Csmith | v4∼7,trunk
3  | gcov     | 83486 | P3    | Won't fix | Missing     | Csmith | v4∼7,trunk
4  | gcov     | 83505 | P5    | New       | Spurious    | Csmith | v4∼7,trunk
5  | gcov     | 83587 | P5    | New       | Missing     | Csmith | v7,trunk
6  | gcov     | 83616 | P4    | Assigned  | Missing     | Csmith | v7,trunk
7  | gcov     | 83617 | P3    | Works     | Wrong Freq. | Csmith | v4∼7
8  | gcov     | 83678 | P3    | Won't fix | Spurious    | Csmith | v4∼7,trunk
9  | gcov     | 83813 | P3    | Fixed     | Missing     | Csmith | trunk
10 | gcov     | 85163 | P5    | New       | Wrong Freq. | Csmith | trunk
11 | gcov     | 85178 | P3    | Won't fix | Missing     | Csmith | v4∼7,trunk
12 | gcov     | 85179 | P4    | Assigned  | Missing     | Csmith | trunk
13 | gcov     | 85188 | P3    | Assigned  | Spurious    | Csmith | v4∼7,trunk
14 | gcov     | 85197 | P3    | Won't fix | Wrong Freq. | Csmith | v4∼7,trunk
15 | gcov     | 85199 | P4    | Assigned  | Missing     | Csmith | v4∼7,trunk
16 | gcov     | 85201 | P5    | Assigned  | Wrong Freq. | Csmith | trunk
17 | gcov     | 85202 | P3    | Won't fix | Spurious    | Csmith | v4∼7,trunk
18 | gcov     | 85217 | P3    | Fixed     | Spurious    | Csmith | v4∼7,trunk
19 | gcov     | 85218 | P3    | Won't fix | Spurious    | Csmith | v4∼7,trunk
20 | gcov     | 85219 | P3    | Won't fix | Spurious    | Csmith | v4∼7,trunk
21 | gcov     | 85225 | P5    | Assigned  | Wrong Freq. | Csmith | v4∼7,trunk
22 | gcov     | 85243 | P3    | Won't fix | Spurious    | Csmith | v4∼7,trunk
23 | gcov     | 85245 | P3    | Won't fix | Spurious    | Csmith | v4∼7,trunk
24 | gcov     | 85272 | P3    | Won't fix | Spurious    | Csmith | v4∼7,trunk
25 | gcov     | 85273 | P3    | Won't fix | Spurious    | Csmith | v4∼7,trunk
26 | gcov     | 85274 | P3    | Won't fix | Spurious    | Csmith | v4∼7,trunk
27 | gcov     | 85276 | P5    | New       | Wrong Freq. | Csmith | trunk
28 | gcov     | 85294 | P3    | Won't fix | Spurious    | Csmith | v4∼7,trunk
29 | gcov     | 85297 | P3    | Won't fix | Spurious    | Csmith | v4∼7,trunk
30 | gcov     | 85299 | P3    | Won't fix | Spurious    | Csmith | v4∼7,trunk
31 | gcov     | 85332 | P3    | Fixed     | Wrong Freq. | GCC    | v7,trunk
32 | gcov     | 85333 | P3    | Won't fix | Missing     | GCC    | v4∼7,trunk
33 | gcov     | 85336 | P4    | New       | Wrong Freq. | GCC    | v4∼7,trunk
34 | gcov     | 85337 | P5    | New       | Wrong Freq. | Clang  | v7,trunk
35 | gcov     | 85338 | P3    | Fixed     | Wrong Freq. | Clang  | v4∼7,trunk
36 | gcov     | 85349 | P5    | New       | Spurious    | GCC    | v6∼7,trunk
37 | gcov     | 85350 | P3    | Fixed     | Spurious    | GCC    | v4∼7,trunk
38 | gcov     | 85351 | P5    | Assigned  | Missing     | GCC    | v4∼7,trunk
39 | gcov     | 85367 | P3    | Works     | Wrong Freq. | GCC    | v5∼7,trunk
40 | gcov     | 85370 | P3    | Fixed     | Spurious    | GCC    | trunk
41 | gcov     | 85372 | P3    | Fixed     | Missing     | GCC    | trunk
42 | gcov     | 85377 | P4    | New       | Missing     | GCC    | v4∼7,trunk
43 | llvm-cov | 33465 | P3    | Fixed     | Missing     | Csmith | v3∼5,trunk
44 | llvm-cov | 35404 | P3    | Confirmed | Spurious    | Csmith | trunk
45 | llvm-cov | 35426 | P3    | Fixed     | Spurious    | Csmith | trunk
46 | llvm-cov | 35495 | P3    | Fixed     | Missing     | Csmith | trunk
47 | llvm-cov | 35556 | P3    | Fixed     | Missing     | Csmith | trunk
48 | llvm-cov | 37001 | P3    | Confirmed | Spurious    | Csmith | trunk
49 | llvm-cov | 37003 | P3    | Confirmed | Spurious    | Csmith | trunk
50 | llvm-cov | 37012 | P3    | Confirmed | Spurious    | Csmith | trunk
51 | llvm-cov | 37017 | P3    | Confirmed | Spurious    | Csmith | trunk
52 | llvm-cov | 37043 | P3    | Confirmed | Spurious    | Csmith | trunk
53 | llvm-cov | 37046 | P3    | Confirmed | Spurious    | Csmith | trunk
54 | llvm-cov | 37070 | P3    | Confirmed | Missing     | Clang  | v3∼5,trunk
55 | llvm-cov | 37071 | P3    | Confirmed | Spurious    | Clang  | trunk
56 | llvm-cov | 37072 | P3    | Confirmed | Wrong Freq. | GCC    | trunk
57 | llvm-cov | 37081 | P3    | Confirmed | Missing     | Clang  | v3∼5,trunk
58 | llvm-cov | 37083 | P3    | Confirmed | Wrong Freq. | GCC    | trunk
59 | llvm-cov | 37084 | P3    | Confirmed | Wrong Freq. | GCC    | trunk
60 | llvm-cov | 37085 | P3    | Confirmed | Spurious    | GCC    | trunk
61 | llvm-cov | 37090 | P3    | Confirmed | Spurious    | GCC    | trunk
62 | llvm-cov | 37092 | P3    | Confirmed | Spurious    | GCC    | trunk
63 | llvm-cov | 37099 | P3    | Confirmed | Spurious    | GCC    | trunk
64 | llvm-cov | 37102 | P3    | Confirmed | Spurious    | GCC    | trunk
65 | llvm-cov | 37103 | P3    | Confirmed | Wrong Freq. | GCC    | v3∼5,trunk
66 | llvm-cov | 37105 | P3    | Confirmed | Spurious    | GCC    | trunk
67 | llvm-cov | 37107 | P3    | Confirmed | Wrong Freq. | GCC    | v3∼5,trunk
68 | llvm-cov | 37124 | P3    | Confirmed | Wrong Freq. | GCC    | trunk
69 | llvm-cov | 37125 | P3    | Confirmed | Wrong Freq. | Clang  | v3∼5,trunk
70 | llvm-cov | 37126 | P3    | Confirmed | Spurious    | GCC    | v3∼5,trunk

D. Interesting Bugs Found

One program triggering multiple bugs. In Figure 6, Column 4 lists a code snippet and Column 3 shows the line numbers; Columns 1 and 2 display the coverage results reported by gcov and llvm-cov, respectively. This code snippet is a simple switch-statement inside a for-loop in the caller. The code at lines 3 and 4 actually executes only one time; however, both lines are wrongly marked as executed 10 times by llvm-cov, and line 4 is also wrongly marked as executed 9 times by gcov. We have reported this case as Bug #85337 of gcov and Bug #37124 of llvm-cov. This case is quite simple and common in the real world. Unfortunately, this single case triggers coverage bugs in both gcov and llvm-cov, indicating that coverage bugs are subtle and prevalent.

gcov | llvm-cov | #N | Source Code
 10  |    10    |  1 | void func(int i) {
 10  |    10    |  2 |   switch (i) {
  1  |    10    |  3 |     case 1: break;
  9  |    10    |  4 |     case 2: ;
  9  |     9    |  5 |     default: break;
  -  |    10    |  6 |   }
 10  |    10    |  7 | }
  -  |          |  8 |
  1  |     1    |  9 | int main() {
 11  |    11    | 10 |   for (int i = 0; i < 10; ++i)
 10  |    10    | 11 |     func(i);
  1  |     1    | 12 |   return 0;
  -  |     1    | 13 | }
Fig. 6. One program triggering multiple bugs

Coverage bugs or compiler bugs? Figure 7 shows the coverage report for Bug #37082 of Clang. A definition of the strcmp function is first given at line 2, and strcmp is then redefined as the libc built-in function at line 3. As a result, the call to strcmp at line 6 should invoke the libc function, instead of the self-defined function with internal linkage (line 2). However, in our test, GCC outputs "-1" whereas Clang outputs "2" for this case. That is, GCC invokes the correct libc function, but Clang still invokes the self-defined function at line 2. This is why line 2 is reported as executed one time by llvm-cov but as not executed by gcov. For this case, the root cause of the incorrect coverage report is not a bug inside llvm-cov, but a bug inside Clang.

llvm-cov | #N | Source Code
    -    |  1 | #include <stdio.h>
    1    |  2 | static int strcmp() { return 2; }
    1    |  3 | #define strcmp __builtin_strcmp
    -    |  4 |
    1    |  5 | int main() {
    1    |  6 |   int ret = strcmp("a", "b");
    1    |  7 |   printf("%d\n", ret);
    1    |  8 |   return 0;
    1    |  9 | }
Fig. 7. Coverage or compiler bug (Bug #37082 of llvm-cov)
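Cases like Figure 7 suggest a cheap sanity check before blaming a coverage tool: compile the test program with both compilers at -O0 and compare the programs' runtime outputs, since a mismatch points toward a compiler bug rather than a coverage bug. A minimal sketch follows; the helper names are illustrative, not part of C2V.

    import subprocess

    def program_output(cc, src="t.c"):
        """Compile src with compiler cc at -O0, run it, and capture stdout."""
        binary = f"./a.out.{cc}"
        subprocess.run(f"{cc} -O0 {src} -o {binary}", shell=True, check=True)
        return subprocess.run(binary, shell=True,
                              capture_output=True, text=True).stdout

    def suspect_compiler_bug(src="t.c"):
        """True when the GCC- and Clang-built binaries disagree at runtime,
        as they do for the program in Fig. 7 ('-1' vs '2')."""
        return program_output("gcc", src) != program_output("clang", src)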
Code formatting problem. Figure 8 shows the coverage report for Bug #37102 of Clang. Line 8 is marked as executed twice by gcov, but wrongly marked as executed only once by llvm-cov. Along the execution, each of the four statements at line 8 (i.e., foo(); goto L2; L1: bar();) actually executes only one time before the main function finishes. However, from the perspective of line coverage, the code at line 8 executes twice. The first time, the first two statements, i.e., foo() and goto L2, get executed, and the control flow jumps to line 10. After executing the goto-statement at line 12, the control flow jumps back to line 8, and the last two statements, i.e., L1: bar();, get executed. Note that the coverage result would be correct if we put the four statements at line 8 on four separate lines.

llvm-cov | #N | Source Code
    -    |  1 |
    -    |  2 | int a = 0;
    1    |  3 | void foo() { a++; }
    1    |  4 | void bar() { a++; }
    -    |  5 |
    -    |  6 | int main()
    1    |  7 | {
    1    |  8 |   foo(); goto L2; L1: bar();
    1    |  9 |
    2    | 10 |   L2:
    2    | 11 |   if (a == 1)
    1    | 12 |     goto L1;
    1    | 13 | }
Fig. 8. Code formatting problem (Bug #37102 of llvm-cov)

Non-trivial inspection. Figure 9 follows the same notation as Figure 6. Lines 18 and 19 are marked as executed by gcov but as unexecuted by llvm-cov. Human inspection is conducted to determine whether gcov or llvm-cov produces the incorrect coverage report, but this process is non-trivial. Intuitively, the two branches of the if-statement at line 15 should not both be executed, implying that the coverage report by gcov is incorrect. Besides, this code outputs "0" instead of "1" at runtime, further supporting this implication. However, it is actually llvm-cov that produces the incorrect report (Bug #37081 of Clang). The functions setjmp and longjmp are provided to perform complex flow-of-control in C. Due to their presence, the execution flow for this code is: (1) the if-statement at line 11 takes the false branch; (2) the if-statement at line 15 then also takes the false branch, assigning variable ret to 1 and calling function foo (defined at line 4); (3) the longjmp inside foo restores the program state saved by the setjmp at line 15 and makes it return 1, hence the true branch at line 15 is taken this time and variable ret is assigned 0; and (4) the main function returns after printing the value of variable ret.

gcov  | llvm-cov | #N | Source Code
  -   |     -    |  1 | #include <stdio.h>
  -   |     -    |  2 | #include <setjmp.h>
  -   |     -    |  3 |
  1   |     1    |  4 | int foo(jmp_buf b) { longjmp(b, 1); }
  -   |     -    |  5 |
  1   |     -    |  6 | int main()
  -   |     1    |  7 | {
  -   |     1    |  8 |   int ret;
  -   |     1    |  9 |   jmp_buf buf;
  -   |     1    | 10 |
  1   |     1    | 11 |   if (setjmp(buf)) {
##### |     0    | 12 |     foo(buf);
  -   |     0    | 13 |   }
  -   |     1    | 14 |
  1   |     1    | 15 |   if (setjmp(buf) != 0) {
  1   |     1    | 16 |     ret = 0;
  -   |     1    | 17 |   } else {
  1   |     0    | 18 |     ret = 1;
  1   |     0    | 19 |     foo(buf);
  -   |     0    | 20 |   }
  -   |     1    | 21 |
  1   |     1    | 22 |   printf("%d", ret);
  -   |     1    | 23 | }
Fig. 9. Non-trivial inspection (Bug #37081 of Clang)

E. Limitations

In this study, we assess the reliability of code coverage tools via differential testing. This is a first effort in this direction, and our technique has a number of limitations. First, most of the test programs we used were generated by Csmith. Csmith-generated programs cover only a subset of C semantics, which might cause C2V to miss a number of code coverage defects. As a complement, we collected 2,862 C programs from GCC's and Clang's test-suites and fed them to C2V, which mitigates this problem to some extent. Second, we only take into account inconsistency-triggering lines of code when computing program similarities to filter out test programs that potentially trigger the same code coverage bugs.
As a result, a large number of inconsistency-triggering test programs are filtered out, which may discard a number of valuable test programs. If we could inspect all Csmith-generated inconsistency-triggering test programs, it is reasonable to expect that more code coverage bugs would be found. Third, it is possible that both code coverage tools have the same bugs; in other words, the two coverage tools might produce the same but incorrect code coverage reports for a given program. Our approach cannot identify any inconsistencies from such paired coverage reports and thus misses this kind of bug. Therefore, in the future, more research effort should be devoted to this area to improve the quality of code coverage tools. Fourth, different code coverage tools have different implementations, which may make their coverage reports difficult to compare. To mitigate this problem, we have taken the following steps: (1) we reformatted the generated test programs before feeding them to the coverage tools, which led to formatted and comparable coverage reports; (2) before comparing coverage reports, we identified and excluded specific behavioral differences between the coverage tools; and (3) before reporting bugs, we inspected inconsistent coverage reports to determine which tools are buggy. During inspection, false alarms are manually identified. Thus, we have taken careful steps to reduce the false positives resulting from variability among different tools. Besides, it would also be interesting to develop more accurate techniques for comparing coverage reports in the future.

V. RELATED WORK

This section introduces related work on randomized differential testing, coverage-directed differential testing, and testing via equivalence modulo inputs.
A. Randomized Differential Testing

Differential testing was originally introduced by McKeeman [53]; it attempts to detect bugs by checking for inconsistent behaviors across different comparable software systems or different versions of the same software. Randomized differential testing is a widely-used black-box differential testing technique in which the inputs are randomly generated [19], [54]. Yang et al. [19] developed Csmith, a randomized test case generation tool that supports a large subset of C features while avoiding undefined and unspecified behaviors, to find C compiler bugs. Lidbury et al. [40] developed CLsmith, a tool built on top of Csmith, to validate OpenCL compilers based on differential testing and testing via equivalence modulo inputs (EMI). They presented several strategies for the random generation of OpenCL kernels and an injection mechanism that allows EMI testing to be applied to kernels containing little or no dynamically dead code. Their study revealed a significant number of OpenCL compiler bugs in commercial implementations. Sun et al. [21] applied randomized differential testing to find and analyze compiler warning defects across GCC and LLVM. In less than six months, they successfully found 52 confirmed/fixed bugs. Different from prior studies, we apply randomized differential testing to find code coverage bugs, which we believe is an important topic.

B. Coverage-based Differential Testing

A number of recent studies leverage coverage to improve the effectiveness of differential testing. Chen et al. [55] proposed a coverage-directed fuzzing approach to detecting inconsistencies between different implementations of the Java Virtual Machine (JVM). They mutated seeding classfiles, executed the mutants on a reference JVM implementation, and used coverage uniqueness as a discipline for accepting representative mutants. The accepted mutants were then used as inputs to differentially test different JVM implementations. Pei et al. [56] proposed DeepXplore, a whitebox coverage-directed differential testing technique for detecting inconsistencies between multiple DNNs. They first introduced neuron coverage as a systematic metric for measuring how much of the internal logic of a DNN has been tested, and then used this information to guide the testing process. As can be seen, the prerequisite of the above techniques is to obtain correct coverage. Our work provides a general and practical approach to finding coverage bugs, thus helping improve the quality of code coverage tools.

C. Testing via Equivalence Modulo Inputs

Testing via equivalence modulo inputs is a testing technique proposed in recent years. In nature, EMI testing is a kind of metamorphic testing, which modifies a program to generate variants with the same outputs as the original program [57], [58]. Le et al. [7] proposed to generate equivalent versions of a program by profiling the program's execution and pruning unexecuted code. Once a program and its equivalent variant are constructed, both are used as inputs to the compiler under test, checking for inconsistencies in their results. So far, this method has been used to detect 147 confirmed bugs in two open-source C compilers, GCC and LLVM. Based on this idea, Athena [59] and Hermes [60] were developed subsequently. Athena [59] generates EMI variants by randomly inserting code into and removing statements from dead code regions. Hermes [60] complements these mutation strategies by operating on live code regions, which overcomes the limitations of mutating dead code regions.
Le et al. [52] first used Csmith to generate single-file test programs and transformed each single-file test program into multiple compilation units. Then, they stochastically assigned each unit an optimization level to thoroughly exercise link-time optimizers. They discovered and reported 37 LTO bugs for GCC and LLVM in 11 months. These techniques heavily depend on code coverage information.

VI. CONCLUSION AND FUTURE WORK

We proposed a randomized differential testing approach to hunting for code coverage bugs and implemented a tool named C2V to test two C code coverage tools, gcov and llvm-cov. Our evaluation, in which 42 and 28 bugs were confirmed for gcov and llvm-cov respectively within a few months, provides strong evidence that code coverage tools are not as reliable as they might have been envisaged. Overall, our approach has the following main advantages: (1) it reduces the difficult code coverage validation problem to a simple comparison problem; (2) the comparison between code coverage reports checks not only whether a program chunk gets executed, but also the exact execution frequencies, and any discrepancy in these dimensions signals a potential bug, which helps find subtle but deep semantic bugs in code coverage tools; and (3) our approach is simple, straightforward, and general. It can easily be applied to validate different code coverage tools, under various programming languages and coverage criteria. In the future, more effort should be devoted to this area, and there is a need to examine the influence of such bugs on other techniques that depend on code coverage.

ACKNOWLEDGMENT

We thank Yanyan Jiang, Zhaogui Xu, and the anonymous reviewers for their constructive comments. We also thank the GCC and LLVM developers, especially Martin Liška, for analyzing and fixing our reported bugs. This work is supported by the National Natural Science Foundation of China (61702256, 61772259, 61432001, 61832009, 61772263, 61802168, 61872177), the Natural Science Foundation of Jiangsu Province (BK20170652), the China Postdoctoral Science Foundation (2018T110481), the Fundamental Research Funds for the Central Universities (020214380032, 02021430047), and the National Key R&D Program of China (2018YFB1003901). Zhendong Su was supported by United States NSF Grants 1528133 and 1618158, and Google and Mozilla Faculty Research awards. Yuming Zhou (zhouyuming@nju.edu.cn) and Baowen Xu (bwxu@nju.edu.cn) are the corresponding authors.