Hunting for Bugs in Code Coverage Tools via Randomized Differential Testing

Yibiao Yang∗, Yuming Zhou∗, Hao Sun†, Zhendong Su‡§, Zhiqiang Zuo∗, Lei Xu∗, and Baowen Xu∗
∗State Key Lab. for Novel Software Technology, Nanjing University, Nanjing, China
†Unaffiliated
‡Department of Computer Science, ETH Zurich, Switzerland
§Computer Science Department, UC Davis, USA

Abstract—Reliable code coverage tools are critically important, as they are heavily used to facilitate many quality assurance activities, such as software testing, fuzzing, and debugging. However, little attention has been devoted to assessing the reliability of code coverage tools. In this study, we propose a randomized differential testing approach to hunting for bugs in the most widely used C code coverage tools.
Specifically, by generating random input programs, our approach seeks inconsistencies in the code coverage reports produced by different code coverage tools, and then identifies inconsistencies as potential code coverage bugs. To effectively report code coverage bugs, we address three specific challenges: (1) how to filter out duplicate test programs, as many of them trigger the same bugs in code coverage tools; (2) how to automatically reduce large test programs to much smaller ones that have the same properties; and (3) how to determine which code coverage tools have bugs. The extensive evaluations validate the effectiveness of our approach, resulting in 42 and 28 confirmed/fixed bugs for gcov and llvm-cov, respectively. This case study indicates that code coverage tools are not as reliable as might have been envisaged. It not only demonstrates the effectiveness of our approach, but also highlights the need to continue improving the reliability of code coverage tools. This work opens up a new direction in code coverage validation, which calls for more attention in this area.

Index Terms—Code Coverage; Differential Testing; Coverage Tools; Bug Detection.

I. INTRODUCTION

Code coverage [1] refers to which code in a program is executed, and how many times, when the program is run on particular test cases. The code coverage information produced by code coverage tools is widely used to facilitate many quality assurance activities, such as software testing, fuzzing, and debugging [1]–[15]. For example, researchers recently introduced an EMI ("Equivalence Modulo Inputs") based compiler testing technique [7]. The equivalent programs are obtained by stochastically pruning the unexecuted code of a given program according to the code coverage information given by code coverage tools (e.g., llvm-cov, gcov). Therefore, the correctness of "equivalence" relies on the reliability of code coverage tools.

In spite of the prevalent adoption in practice and extensive testing of code coverage tools, a variety of defects still remain. Fig. 1(a) shows a buggy code coverage report produced by llvm-cov [16], a C code coverage tool of Clang [17]. Note that all the test cases have been reformatted for presentation in this study. The coverage report is an annotated version of the source code, where the first and second columns list the line number and the execution frequency, respectively. We can see that the code at line 5 is marked incorrectly as unexecuted by llvm-cov. Given the program p and its corresponding code coverage as shown in Fig. 1(a), EMI compiler testing [7] generates its "equivalent" program p' as shown in Fig. 1(b) by removing the unexecuted code (the statement at line 5). The programs p and p' will be compiled by a compiler under test and then executed to obtain two different outputs, i.e., 1 and 0, resulting in a bug reported by the EMI approach. However, this is obviously not a real compiler bug. The incorrect code coverage report leads to the false positive in compiler testing.

    1 |   | #include <stdio.h>
    2 |   | int main()
    3 | 1 | {
    4 | 1 |   int g = 0, v = 1;
    5 | 0 |   g = v || !v;
    6 | 1 |   printf("%d\n", g);
    7 | 1 | }
    (a)

    #include <stdio.h>
    int main()
    {
      int g = 0, v = 1;
      // g = v || !v;
      printf("%d\n", g);
    }
    (b)

Fig. 1. (a) Bug #33465 of llvm-cov and (b) the "equivalent" program obtained by pruning the unexecuted code (Line #5) of the program in (a).
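For concreteness, the kind of line-by-line coverage report shown in Fig. 1(a) can be obtained with the standard gcov and llvm-cov workflows. The sketch below is only illustrative and is not the tool developed in this work: it compiles a single-file test program (assumed here to be saved as test.c), runs the two instrumented binaries, parses the per-line execution counts from both reports in a deliberately simplified way, and prints the lines on which the two tools disagree. The script name, file names, and helper functions are assumptions made for this example; the compiler flags and tool invocations follow the documented gcov and llvm-cov usage.

    #!/usr/bin/env python3
    # diff_coverage.py -- an illustrative sketch only, not the tool built by the authors.
    # Assumes gcc/gcov and clang/llvm-profdata/llvm-cov are installed and on PATH,
    # and that the test program is a single self-contained file named test.c
    # (e.g., the program of Fig. 1(a)).
    import re
    import subprocess

    SRC = "test.c"

    def run(cmd):
        subprocess.run(cmd, shell=True, check=True)

    def gcov_counts(src):
        # gcc --coverage instruments the program; running it writes the .gcda data,
        # and gcov emits an annotated report "<src>.gcov" ("count:lineno:source").
        run(f"gcc --coverage -O0 {src} -o test_gcc")
        run("./test_gcc")
        run(f"gcov {src}")
        counts = {}
        for line in open(src + ".gcov"):
            cnt, lineno, _ = line.split(":", 2)
            cnt, lineno = cnt.strip(), lineno.strip()
            if lineno == "0" or cnt == "-":          # header or non-executable line
                continue
            counts[int(lineno)] = 0 if cnt.startswith(("#", "=")) else int(cnt.rstrip("*"))
        return counts

    def llvm_cov_counts(src):
        # clang's source-based coverage: instrument, run, merge the raw profile,
        # then "llvm-cov show" prints "lineno|count|source" for each line.
        run(f"clang -fprofile-instr-generate -fcoverage-mapping -O0 {src} -o test_clang")
        run("LLVM_PROFILE_FILE=test.profraw ./test_clang")
        run("llvm-profdata merge test.profraw -o test.profdata")
        report = subprocess.run(
            f"llvm-cov show ./test_clang -instr-profile=test.profdata {src}",
            shell=True, check=True, capture_output=True, text=True).stdout
        counts = {}
        for line in report.splitlines():
            m = re.match(r"\s*(\d+)\|\s*(\d+)\|", line)  # lines without a count are skipped
            if m:
                counts[int(m.group(1))] = int(m.group(2))
        return counts

    if __name__ == "__main__":
        gc, lc = gcov_counts(SRC), llvm_cov_counts(SRC)
        for n in sorted(set(gc) | set(lc)):
            if gc.get(n) != lc.get(n):
                print(f"line {n}: gcov={gc.get(n)}  llvm-cov={lc.get(n)}")

Note that a raw per-line mismatch printed by such a script is only a candidate bug: the two tools may legitimately differ on which lines they treat as executable, which is one reason the filtering and reduction challenges discussed below matter.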
As code coverage tools provide fundamental information used throughout the software development process, it is essential to validate the correctness of code coverage. Unfortunately, to the best of our knowledge, little attention has been devoted to assessing the reliability of code coverage tools.

This work makes the first attempt in this direction. We devise a practical randomized differential testing approach to discovering bugs in code coverage tools. Our approach first leverages programs generated by a random program generator to seek inconsistencies in the code coverage reports produced by different code coverage tools, and identifies inconsistencies as potential code coverage bugs. Second, because a large number of inconsistency-triggering test programs are reported and much of the code within these test programs is irrelevant, reporting these inconsistency-triggering tests directly is of little help for debugging. Before reporting them, those test programs need to be reduced [18]. However, reducing each and every test program is usually very costly and thus unrealistic [18]. We observe that many test programs trigger the same coverage bugs. Thus, we can filter out many duplicate test programs. Note that 'duplicate test programs' in this study refers to multiple test programs triggering the same code coverage bug. Overall, to effectively report coverage bugs, we need to address the following key challenges:

Challenge 1: Filtering Out Test Programs. To filter out potential test programs triggering the same code coverage bugs, the most intuitive way is to calculate similarities between