code at line 8 executes twice. The first time, the first two statements, i.e., foo() and goto L2, are executed and control jumps to line 10. After the goto statement at line 12 executes, control jumps back to line 8, where the last two statements, i.e., L1: bar();, are executed. Note that the coverage result is correct if the four statements at line 8 are placed on four separate lines.

    #N | Count | Source Code
     1 |       | #include <stdio.h>
     2 |     1 | static int strcmp() { return 2; }
     3 |     1 | #define strcmp __builtin_strcmp
     4 |       | int main()
     5 |     1 | {
     6 |     1 |   int ret = strcmp("a", "b");
     7 |     1 |   printf("%d\n", ret);
     8 |     1 |   return 0;
     9 |     1 | }

    Fig. 7. Coverage or compiler bug (Bug #37082 of llvm-cov)

    #N | Count | Source Code
     1 |       | int a = 0;
     2 |       |
     3 |     1 | void foo() { a++; }
     4 |     1 | void bar() { a++; }
     5 |       |
     6 |       | int main()
     7 |     1 | {
     8 |     1 |   foo(); goto L2; L1: bar();
     9 |     1 |
    10 |     2 | L2:
    11 |     2 |   if (a == 1)
    12 |     1 |     goto L1;
    13 |     1 | }

    Fig. 8. Code formatting problem (Bug #37102 of llvm-cov)

Non-trivial inspection. Figure 9 uses the same notation as Figure 6. Lines 18 and 19 are marked as executed by gcov but as unexecuted by llvm-cov. Human inspection is needed to determine whether gcov or llvm-cov produces the incorrect coverage report, and this process is non-trivial. Intuitively, the two branches of the if-statement at line 15 should not both be executed, implying that the coverage report by gcov is incorrect. Besides, this code outputs "0" instead of "1" at runtime, further supporting the implication.
However, it is actually llvm-cov that produces the incorrect report (Bug #37081 of Clang). The functions setjmp and longjmp provide non-local control flow in C. Because of them, this code executes as follows: (1) the if-statement at line 11 takes the false branch; (2) the if-statement at line 15 also takes the false branch, assigning 1 to variable ret and calling function foo, defined at line 4; (3) function longjmp restores the program state saved when setjmp at line 15 was called and makes that setjmp return 1, so the true branch at line 15 is taken and variable ret is assigned 0; (4) the main function returns after printing the value of variable ret.

     gcov | llvm-cov | #N | Source Code
        - |          |  1 | #include <stdio.h>
        - |          |  2 | #include <setjmp.h>
        - |          |  3 |
        1 |        1 |  4 | int foo(jmp_buf b) { longjmp(b, 1); }
        - |          |  5 |
        1 |          |  6 | int main()
        - |        1 |  7 | {
        - |        1 |  8 |   int ret;
        - |        1 |  9 |   jmp_buf buf;
        - |        1 | 10 |
        1 |        1 | 11 |   if (setjmp(buf)) {
    ##### |        0 | 12 |     foo(buf);
        - |        0 | 13 |   }
        - |        1 | 14 |
        1 |        1 | 15 |   if (setjmp(buf) != 0) {
        1 |        1 | 16 |     ret = 0;
        - |        1 | 17 |   } else {
        1 |        0 | 18 |     ret = 1;
        1 |        0 | 19 |     foo(buf);
        - |        0 | 20 |   }
        - |        1 | 21 |
        1 |        1 | 22 |   printf("%d", ret);
        - |        1 | 23 | }

    Fig. 9. Non-trivial inspection (Bug #37081 of Clang)

E. Limitations

In this study, we assess the reliability of code coverage tools via differential testing. This is a first effort in this direction. However, our technique has a number of limitations.

First, most of the test programs we used were generated by Csmith. Csmith-generated programs cover only a subset of C semantics, which might cause C2V to miss a number of code coverage defects. As a complement, we collected 2862 C programs from GCC's and Clang's test-suites and fed them to C2V, which mitigates this problem to some extent.
Second, we only take into account the inconsistency-triggering lines of code when computing program similarities, which we use to filter out test programs that potentially trigger the same code coverage bugs. In this study, a large number of inconsistency-triggering test programs are filtered out, so some valuable test programs may be discarded. If we could inspect all Csmith-generated inconsistency-triggering test programs, it is reasonable to expect that more code coverage bugs would be found.

Third, it is possible that both code coverage tools have the same bugs. In other words, the two coverage tools might produce the same but incorrect code coverage reports for a given program. Our approach cannot identify any inconsistency in such paired coverage reports and therefore misses this kind of bug. More research effort should be devoted to this area in the future to improve the quality of code coverage tools.

Fourth, code coverage tools with different implementations may produce coverage reports that are difficult to compare. To mitigate this problem, we have taken the following steps: (1) we reformatted the generated test programs before feeding them to the coverage tools, which led to formatted and comparable coverage reports; (2) before comparing coverage reports, we identified and excluded specific behavioral differences of the coverage tools; and (3) before reporting bugs, we inspected inconsistent coverage reports to determine which tools are buggy. During inspection, false alarms were identified manually. These steps reduce the false positives that result from the variability among different tools. Developing more accurate techniques for comparing coverage reports is also an interesting direction for future work.

V. RELATED WORK

This section introduces the related work on randomized differential testing, coverage-directed differential testing, and testing via equivalence modulo inputs.
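
As a small illustration of the differential comparison of coverage reports that this study relies on, the following C sketch compares two per-line reports. It is not the C2V implementation: the function name diff_coverage, the NOT_INSTRUMENTED encoding, and the rule "both tools report a count and the counts differ" are assumptions made only for this example. Skipping lines that only one tool instruments is one simple way to exclude a behavioral difference between tools before comparing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding for this sketch: a value >= 0 is an execution
 * count; NOT_INSTRUMENTED marks a line for which the tool reports no
 * count ("-" in a gcov column, an empty cell in an llvm-cov column). */
#define NOT_INSTRUMENTED (-1L)

/* Compares two per-line coverage reports of the same n-line program.
 * A line is treated as inconsistent only if both tools report a count
 * for it and the two counts differ. The 1-based numbers of the first
 * `cap` inconsistent lines are stored in `out`; the return value is
 * the total number of inconsistent lines found. */
size_t diff_coverage(const long *a, const long *b, size_t n,
                     size_t *out, size_t cap)
{
    size_t found = 0;

    assert(a != NULL && b != NULL && out != NULL);

    for (size_t i = 0; i < n; i++) {
        /* Skip lines that only one tool (or neither) instruments. */
        if (a[i] == NOT_INSTRUMENTED || b[i] == NOT_INSTRUMENTED)
            continue;
        if (a[i] != b[i]) {
            if (found < cap)
                out[found] = i + 1;   /* report 1-based line numbers */
            found++;
        }
    }
    return found;
}
```

Under this encoding, feeding the function the gcov and llvm-cov columns of Figure 9 (with ##### read as a count of 0) flags lines 18 and 19, the two lines discussed in the text. Deciding which tool is wrong on a flagged line still requires the manual inspection described above.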