     1  #include <stdlib.h>
     2  int main (void)
     3  {
     4
     5      static int *p;
     6
     7      p = malloc (sizeof (int));
     8      if (p == NULL)
     9          return 0;
    10
    11      *p = 7;
    12      return 1;
    13  }

Fig. 2. Spurious marking (Bug #37092 of llvm-cov)

        1:    1:int func (int i)
        -:    2:{
        1:    3:    if (i > 0)
        1:    4:        return 1;
    #####:    5:    return 0;
        -:    6:}
        -:    7:
        1:    8:int main ()
        -:    9:{
        1:   10:    int i = 1, j = 2;
        2:   11:    func ((i == 0) || (i && j, 1));
        1:   12:    return 0;
        -:   13:}

Fig. 3. Wrong frequency (Bug #85163 of gcov)
the correctness of compilers. One of the most attractive compiler testing techniques is based on the code coverage of a program's execution: it generates equivalence modulo inputs by stochastically pruning the program's unexecuted code [7], [40], [41]. With the equivalence modulo inputs, we can differentially test compilers.

• Coverage-based fuzzing. Fuzzing is one of the most effective techniques for finding vulnerabilities in software, and coverage-based fuzzing is widely used in practice [8], [42]. Based on the code coverage achieved by test cases, it determines which test cases should be retained for further fuzzing; in general, a test case is retained if it covers a new basic block.

• Coverage-based debugging. Debugging is a common activity in software development that aims to find the root cause of a fault. Spectrum-Based Fault Localization (SBFL) is one of the most extensively studied debugging techniques and relies heavily on code coverage [5], [43]-[47]. Given a specific test suite, SBFL leverages the code coverage of each test and the corresponding failed/passed outcome to infer which code is the root cause of the fault; a minimal sketch of this use of coverage follows this list.
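To make this dependence on coverage concrete, the sketch below is a hypothetical illustration (not code from any cited tool) that ranks statements with one widely used SBFL formula, Ochiai; the number of statements and the per-statement coverage counts are made up for the example.

    /*
     * Minimal SBFL sketch (hypothetical illustration): rank statements by the
     * Ochiai suspiciousness formula, given how many failing and passing runs
     * covered each statement. The counts would normally come from a code
     * coverage tool such as gcov or llvm-cov.
     */
    #include <math.h>
    #include <stdio.h>

    #define N_STMTS 5

    /* Ochiai: ef / sqrt(total_failed * (ef + ep)) */
    static double ochiai(int ef, int ep, int total_failed)
    {
        double denom = sqrt((double) total_failed * (double) (ef + ep));
        return denom > 0.0 ? ef / denom : 0.0;
    }

    int main(void)
    {
        /* Hypothetical coverage spectrum: 2 failing runs, 3 passing runs. */
        int ef[N_STMTS] = {2, 2, 1, 0, 2};   /* failing runs covering statement s */
        int ep[N_STMTS] = {3, 1, 3, 2, 0};   /* passing runs covering statement s */
        int total_failed = 2;

        for (int s = 0; s < N_STMTS; s++)
            printf("stmt %d: suspiciousness = %.3f\n",
                   s + 1, ochiai(ef[s], ep[s], total_failed));
        return 0;
    }

If the coverage tool marks lines incorrectly, as in Figures 2 and 3, the counts ef and ep are wrong and the resulting ranking can point away from the real fault.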
As can be seen, the above-mentioned software techniques heavily depend on the code coverage information produced by code coverage tools. Therefore, it is of great importance to guarantee the correctness of the code coverage reports. Otherwise, these techniques might produce incorrect results or lead to suboptimal decisions.

C. Categories of Code Coverage Bugs

Code coverage bugs can be categorized into the following three general classes:

• Spurious Marking. Spurious marking denotes that a program chunk is not executed at runtime but is wrongly marked as executed by a code coverage tool. Figure 2 gives such an example. In the main function, the two return statements at line 9 and line 12 cannot both be executed in a single run, so one of them must be wrongly marked by llvm-cov. At runtime, main returns "1", indicating that line 9 was wrongly marked. A coverage tool with spurious marking bugs reports non-executed code as executed; consequently, some elements of the program under test are in fact untested and may harbor latent, undetected bugs.

• Missing Marking. Missing marking denotes that a program chunk is actually executed at runtime but is wrongly marked as unexecuted by a code coverage tool. The coverage bug shown in Figure 1 belongs to this class. A coverage tool with missing marking bugs can make a predefined coverage goal impossible to reach, regardless of how much testing time and resources are allocated.

• Wrong Frequency. Wrong frequency denotes that a program chunk is executed m times at runtime but is wrongly marked as executed n times by a code coverage tool, where (m ≠ n) ∧ (m > 0) ∧ (n > 0). Figure 3 shows a gcov coverage report, in which the first column lists the execution frequency and the second column lists the line number. As can be seen, the code at line 11 is wrongly marked as executed twice, although it was actually executed only once. A coverage tool with wrong frequency bugs may lead to suboptimal decisions in many coverage-based software engineering activities.

III. METHODOLOGY

Figure 4 presents our framework for code coverage validation. In the following, we describe the key steps in this framework. In particular, we will use gcov and llvm-cov as the subject code coverage tools to illustrate our approach to hunting for code coverage bugs where necessary.

A. Generating Test Programs

Our approach starts with a random program generator. The Generator randomly generates a large program set P. Each program p ∈ P will be used as a test program for differentially testing the two code coverage tools under examination. In other words, each program will be fed to the two code coverage tools to obtain two raw coverage reports.

In our study, test programs refer to a collection of compilable C programs together with their inputs. We choose Csmith to generate the test programs for the following reasons: (1) it supports many C language features and avoids generating programs with undefined behavior [48], thus outperforming some other random C program generators [49], [50]; (2) each generated program is a single, self-contained file that requires no extra inputs; and (3) it is fast enough to generate programs with tens of thousands of lines within a few seconds.
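As a concrete illustration of this step, the sketch below shows how a single Csmith-generated program might be run through both tools to obtain the two raw coverage reports. It is a simplified, hypothetical driver, not the paper's actual implementation: the file names (test.c, test.profraw, and so on) are placeholders, the include path needed for Csmith's csmith.h is omitted, and the comparison of the resulting reports is only indicated by a comment.

    /*
     * Illustrative differential-testing driver (a sketch, not the tool used in
     * this paper). Assumes gcc/gcov and clang/llvm-cov are on PATH and that
     * test.c is a compilable, self-contained Csmith-generated program (the
     * include path for csmith.h is omitted here).
     */
    #include <stdio.h>
    #include <stdlib.h>

    static void run(const char *cmd)
    {
        printf("$ %s\n", cmd);
        if (system(cmd) != 0) {
            fprintf(stderr, "command failed: %s\n", cmd);
            exit(EXIT_FAILURE);
        }
    }

    int main(void)
    {
        /* gcov: instrument, execute, and dump the line coverage report. */
        run("gcc --coverage -c test.c");           /* emits test.o and test.gcno */
        run("gcc --coverage test.o -o prog_gcc");
        run("./prog_gcc");                         /* emits test.gcda */
        run("gcov test.c");                        /* emits test.c.gcov */

        /* llvm-cov: instrument, execute, merge the profile, and dump the report. */
        run("clang -fprofile-instr-generate -fcoverage-mapping test.c -o prog_llvm");
        run("LLVM_PROFILE_FILE=test.profraw ./prog_llvm");
        run("llvm-profdata merge -sparse test.profraw -o test.profdata");
        run("llvm-cov show ./prog_llvm -instr-profile=test.profdata > test.llvm-cov");

        /* The two raw reports (test.c.gcov and test.llvm-cov) are then parsed,
         * normalized into a common form, and compared; any inconsistency in the
         * marking or execution frequency of a line is a potential coverage bug. */
        return 0;
    }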