F. Inspecting Reduced Test Programs

Given a reduced test program, we need to determine which code coverage tool is buggy before reporting. In practice, this is usually done manually [21]: developers inspect the coverage reports themselves to decide which tool is at fault. To reduce this manual effort, we summarize the following rules that code coverage reports must comply with:

• Identical Coverage: Assume statements s1 and s2 are in the same block: {s1; s2;}. If s1 is not a jump statement (i.e., a break, goto, return, exit, or abort statement) and s2 is neither a labeled statement nor a loop (for or while) statement, s1 and s2 should have identical coverage.
• Unexecuted Coverage: Assume statements s1 and s2 are in the same block: {s1; s2;}. If s1 is a return, break, goto, or exit statement and s2 is not a labeled statement, s2 should never be executed.
• Ordered Coverage: Assume statements s1 and s2 have the form: s1; if (...) {s2; ...}. If s2 is not a labeled statement, the number of times s1 is executed should be no less than that of s2.
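The three rules above can be read as simple predicates over pairs of statements and their reported execution counts. The following is a minimal sketch of that reading, not the authors' Inspector implementation; the `Stmt` record, the kind strings, and the function names are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass

# Statement kinds named in the rules (illustrative string tags).
JUMP_KINDS = {"break", "goto", "return", "exit", "abort"}
# The Unexecuted Coverage rule lists these four (it does not mention abort).
TERMINATING_KINDS = {"return", "break", "goto", "exit"}
LOOP_KINDS = {"for", "while"}


@dataclass
class Stmt:
    kind: str              # e.g. "expr", "return", "for"
    count: int             # execution count reported by one coverage tool
    labeled: bool = False  # True if the statement carries a label


def check_adjacent(s1: Stmt, s2: Stmt) -> list:
    """Check consecutive statements {s1; s2;} from the same block."""
    violations = []
    # Identical Coverage: s1 is not a jump, s2 is neither labeled nor a loop.
    if s1.kind not in JUMP_KINDS and not s2.labeled and s2.kind not in LOOP_KINDS:
        if s1.count != s2.count:
            violations.append("identical")
    # Unexecuted Coverage: nothing unlabeled may run after return/break/goto/exit.
    if s1.kind in TERMINATING_KINDS and not s2.labeled and s2.count != 0:
        violations.append("unexecuted")
    return violations


def check_guarded(s1: Stmt, s2: Stmt) -> list:
    """Check the shape s1; if (...) {s2; ...} (Ordered Coverage)."""
    if not s2.labeled and s1.count < s2.count:
        return ["ordered"]
    return []
```

For example, `check_adjacent(Stmt("return", 1), Stmt("expr", 1))` returns `["unexecuted"]`: a report claiming that an unlabeled statement ran once after a `return` breaks the second rule, so the tool that produced it is the suspect.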
With the above rules, we developed the tool Inspector to examine the inconsistent coverage reports and automatically determine which tools have bugs. There are still some inconsistent coverage reports in R-IPS that cannot be inspected automatically by our tool; we inspect those reports manually. This process does not require much human effort, as the reduced test programs have only a few lines (usually fewer than 13 lines in our study).

G. Reporting Test Programs to Developers

For each test program in RS-IPS, this step simply generates bug reports for the corresponding buggy tool(s). A bug report mainly consists of the reduced test program and the affected versions. If a test program triggers multiple bugs, multiple separate bug reports will be generated.

IV. EVALUATION

In this section, we first present the subject coverage tools and the testing environment. Then, we describe our experimental results in detail.

A. Subject Code Coverage Tools

In this study, we select gcov and llvm-cov as our subject code coverage tools. We choose these two tools because (1) they are widely used in the software engineering community; and (2) they are integrated with the most widely used production compilers, i.e., GCC and Clang. More specifically, we chose gcov in the latest development trunk of GCC and llvm-cov in the latest LLVM development trunk.

For gcov, the commands we used to compile a given source file, e.g., t.c, and produce the corresponding coverage report are as follows:

# gcc -O0 --coverage t.c; ./a.out; gcov t.c

The gcov-generated coverage report is stored in a file named t.c.gcov. For llvm-cov, we use the following commands:

# clang -fprofile-instr-generate -fcoverage-mapping -O0 t.c; ./a.out; llvm-profdata merge default.profraw -o t.profdata; llvm-cov show -instr-profile=t.profdata ./a.out t.c > t.c.lcov

The llvm-cov-generated coverage report is stored in a file named t.c.lcov.
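Both report files annotate each source line with an execution count, but in different textual layouts. The sketch below shows one way to read the two files into a common line-to-count map and list the lines on which the tools disagree. It is an illustration under assumptions, not the parser used in this work: it assumes the default `count:lineno:source` layout of t.c.gcov and the default `lineno|count|source` layout of `llvm-cov show`, both of which can vary across tool versions and options, and the function names and the notion of "disagreement" used here are hypothetical.

```python
import re

# Assumed layouts: gcov "    count:  lineno:source", llvm-cov "  lineno|  count|source".
GCOV_LINE = re.compile(r"^\s*([^:\s]+):\s*(\d+):")
LLVM_LINE = re.compile(r"^\s*(\d+)\|\s*([^|\s]*)\|")


def parse_gcov(text):
    """Map line number -> count; None marks a line gcov treats as non-executable."""
    counts = {}
    for line in text.splitlines():
        m = GCOV_LINE.match(line)
        if not m or m.group(2) == "0":  # line number 0 carries header metadata
            continue
        tag, lineno = m.group(1), int(m.group(2))
        if tag == "-":
            counts[lineno] = None
        elif tag in ("#####", "====="):  # markers for never-executed lines
            counts[lineno] = 0
        else:
            counts[lineno] = int(tag.rstrip("*"))  # a trailing '*' may follow the count
    return counts


def parse_llvm_cov(text):
    """Map line number -> count; None marks a line with an empty count column."""
    counts = {}
    for line in text.splitlines():
        m = LLVM_LINE.match(line)
        if not m:
            continue
        lineno, tag = int(m.group(1)), m.group(2)
        # Plain integers only: abbreviated counts (e.g. "1.50k") raise ValueError here.
        counts[lineno] = int(tag) if tag else None
    return counts


def inconsistent_lines(gcov_counts, llvm_counts):
    """Lines both tools treat as executable but report with different counts."""
    diffs = {}
    for lineno in sorted(set(gcov_counts) & set(llvm_counts)):
        g, l = gcov_counts[lineno], llvm_counts[lineno]
        if g is not None and l is not None and g != l:
            diffs[lineno] = (g, l)
    return diffs
```

This sketch flags only count mismatches on lines that both tools mark as executable. Lines that one tool marks as executable and the other does not are deliberately ignored here, although they may also be of interest.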
Then these two coverage reports are parsed into a unified format and compared to detect inconsistencies. It is worth noting that we use the -O0 option to turn off compiler optimizations; with optimizations disabled, it makes sense to compare the coverage reports produced in this way.

B. Testing Environment Setup

Our evaluation was conducted on a Linux server with an Intel(R) Xeon(R) CPU @ 2.00GHz (60 cores) and 32GB RAM. The server runs Ubuntu 17.10 (x86_64). We spent four non-continuous months on this work, of which over one month was devoted to developing various tools. The rest of the time was spent testing the two code coverage tools, filtering out test programs, reducing test programs, and inspecting test programs. Initially, we only used Csmith-generated programs as the test programs. Later, we also used programs selected from GCC’s and Clang’s test-suites, since they may cover many C semantics that Csmith does not cover. This is further confirmed in Section IV-C, as a number of bugs are detected by programs from the test-suites.

C. Experimental Results

Inconsistent Coverage Reports. Table I shows the statistics of inconsistency-triggering test programs over Csmith-generated programs, GCC’s test-suite, and Clang’s test-suite. Column 2 shows the total number of test programs and Column 3 shows the number of test programs that ran out of time (10 seconds in our experiment). We used 1 million Csmith-generated programs and collected 2,756 and 106 compilable C programs from GCC’s and Clang’s test-suites, respectively. Note that there are tens of thousands of test programs in GCC’s and Clang’s test-suites; only those C files that can be compiled independently are considered here. Among them, 182,927 programs ran for more than 10 seconds and hence were excluded from further analysis. The remaining test programs were fed to C2V.
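The 10-second limit described above can be enforced when each compiled test program is run. A minimal sketch, assuming the program is launched as a child process; the function name and its default are illustrative and not taken from C2V:

```python
import subprocess


def finishes_in_time(cmd, limit_seconds=10.0):
    """Return True if cmd exits within the limit, False if it times out."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=limit_seconds,
            check=False,  # a non-zero exit status is not treated as a timeout
        )
    except subprocess.TimeoutExpired:
        return False
    return True
```

A call such as `finishes_in_time(["./a.out"])` would then decide whether a test program is kept for further analysis or counted in Column 3.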
Column 4 is in the form of ‘a/b’, where ‘a’ is the total number of test programs that lead to inconsistencies and ‘b’ is the percentage of ‘a’ over the number of all test programs C2V analyzed (i.e., Column 2 − Column 3). We found 261,347 programs leading to inconsistent coverage reports (261,065, 262, and 20 from Csmith-generated programs, GCC’s test-suite, and Clang’s test-suite, respectively). About 31.95% of the Csmith-generated programs caused inconsistent coverage reports, a much higher rate than for GCC’s and Clang’s test-suites. Columns 5 to 11 display the distribution of inconsistency-triggering test programs over 7 different categories. In the third row of these columns, the number in parentheses indicates the number of test programs after filtering out test programs that potentially trigger the same code coverage bugs. Most of the inconsistent reports fell into the C010 category, indicating that the majority of inconsistencies belong