Automatic Self-Validation for Code Coverage Profilers

Yibiao Yang∗†, Yanyan Jiang∗, Zhiqiang Zuo∗, Yang Wang∗, Hao Sun∗, Hongmin Lu∗, Yuming Zhou∗, and Baowen Xu∗
∗State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
†School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China
{yangyibiao, jyy, zqzuo}@nju.edu.cn, dz1933028@smail.nju.edu.cn, shqking@gmail.com, {hmlu, zhouyuming, bwxu}@nju.edu.cn

Abstract—Code coverage, as the primitive form of dynamic program behavior information, is widely adopted to facilitate a rich spectrum of software engineering tasks, such as testing, fuzzing, debugging, fault detection, reverse engineering, and program understanding. Given these widespread applications, it is crucial to ensure the reliability of code coverage profilers. Unfortunately, due to the lack of research attention and the existence of the test oracle problem, coverage profilers are far from being tested sufficiently. Bugs are still regularly seen in the widely deployed profilers gcov and llvm-cov, shipped with gcc and llvm, respectively.

This paper proposes Cod, an automated self-validator for effectively uncovering bugs in coverage profilers. Starting from a test program (either from a compiler's test suite or generated randomly), Cod detects profiler bugs with zero false positives using a metamorphic relation that bridges the coverage statistics of that program and a mutated variant. We evaluated Cod on two of the most well-known code coverage profilers, namely gcov and llvm-cov. Within a four-month testing period, a total of 196 potential bugs (123 for gcov, 73 for llvm-cov) were found, among which 23 are confirmed by the developers.

Index Terms—Code coverage, Metamorphic testing, Coverage profilers, Bug detection.

I. INTRODUCTION

Profiling the code coverage data [1] (e.g., executed branches, paths, functions, etc.) of instrumented subject programs is the cornerstone of a rich spectrum of software engineering practices, such as testing [2], fuzzing [3], debugging [4]–[6], specification mining [7], [8], fault detection [9], reverse engineering, and program understanding [10]. Incorrect coverage information would severely mislead developers in their software engineering practices.

Unfortunately, coverage profilers themselves (e.g., gcov and llvm-cov) are prone to errors. Even a simple randomized differential testing technique exposed more than 70 bugs in coverage profilers [11]. The reasons are two-fold. Firstly, neither application-end developers nor academic researchers have paid sufficient attention to the testing of code coverage profilers. Secondly, automatic testing of coverage profilers is still challenging due to the lack of test oracles. For code coverage testing, the oracle is supposed to capture rich execution information, e.g., the execution frequency of each code statement in the program under a given test case. Unlike a functional oracle, which can usually be obtained from the given specification, a complete code coverage oracle is extremely challenging to achieve. Even though programming experts can specify the oracle precisely, doing so requires enormous human intervention, making it impractical.

A simple differential testing approach, C2V, tried to uncover coverage bugs by comparing the coverage profiling results of the same input program over two different profiler implementations (e.g., gcov and llvm-cov) [11].
For instance, if gcov and llvm-cov provide different coverage information for the same statement of the profiled program, a bug is reported. Due to the inconsistency of coverage semantics defined by different profiler implementations, it is rather common that independently implemented coverage profilers exhibit different opinions on the line-based statistics (e.g., the case in Figure 1). This essentially contradicts the fundamental assumption of differential testing that distinct coverage profilers should output identical coverage statistics for the same input program.

Approach To tackle the flaws of the existing approach, this paper presents Cod, a fully automated self-validator for coverage profilers, based on the metamorphic testing formulation [12]. Instead of comparing outputs from two independent profilers, Cod takes a single profiler and a program P (either from a compiler's test suite or generated randomly) as input and uncovers bugs by identifying inconsistencies between the coverage results of P and its equivalent mutated variants, whose coverage statistics are expected to be identical. The equivalent program variants are generated based on the assumption that modifying unexecuted code blocks should not affect the coverage statistics of executed blocks under the identical profiler, which should generally hold in a non-optimized setting1. This idea originates from EMI [2], a metamorphic testing approach targeted at compiler optimization bugs.

Specifically, assuming that the compiler is correct2, Cod takes a deterministic program P under profiling (either from a compiler's test suite or generated randomly), fixates its input, and obtains a reference program P′ by removing the unexecuted statements in P. P′ should strictly follow the same execution path as P, as long as the coverage profiling data of P is correct. Therefore, Cod asserts that the coverage statistics should be exactly the same over all unchanged statements in P and P′.

1According to the developers [13], coverage statistics are only stable under the zero optimization level.
2We assume this because mis-compilations are rare.
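To make this metamorphic relation concrete, consider the following toy program, a sketch of our own rather than one of the paper's test programs. With the input fixed, the else-branch is dead, so pruning it must leave the coverage of every remaining line unchanged:

/* P: with x fixed to 1, the else-branch can never execute. */
#include <stdio.h>

int main(void) {
    int x = 1;
    if (x)
        printf("taken\n");   /* a correct profiler reports count 1 */
    else
        printf("dead\n");    /* reported count 0: removable; the pruner
                                may replace this statement with ";" to
                                obtain the variant P' */
    return 0;
}

Profiling P and its variant P′ with the same profiler must yield identical counts on all unchanged lines; any disagreement exposes a profiler bug, without requiring a second profiler implementation.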
in P and P'.However,consider a line of code s in P and II.BACKGROUND AND MOTIVATION P,for which the same profiler reported different coverage A.Coverage Profilers results,i.e..Cp(s)Cp(s)where Cp(s)refers to the profiled runtime execution count of statement s in program P. Code coverage profiling data (each line of code's execution count in a program execution)is the foundation of a broad The execution count Cp(s)is usually a nonnegative number spectrum of software engineering practices.Code coverage except for a special value-1 indicating the unknown coverage is the most widely adopted criteria for measuring testing information.This could happen when the coverage profiler failed to obtain the coverage information of statement s due thoroughness,and is also widely used in the automation of software engineering tasks.For example,test input generation to the information loss caused by the abstraction gap between techniques leverage code coverage to guide search iterations the source code and the intermediate code transformed during [3];fault localization techniques use code coverage to isolate compilation.Given Cp(s)Cp(s),either of the two cases applies: potentially faulty branches [4]-[6]. To obtain code coverage statistics,a code coverage profiler maintains each source code line an execution counter and I)(Strong Inconsistency)Cp(s)≥0ACp(s)≥0, updates them along with the program execution.Specifically, meaning that the coverage profiler reports the inconsis- given a program P,a coverage profiler runs P and outputs tent coverage information.Definitely there is a bug in each line of code s EP a number Cp(s)=n,indicating that the profiler because p'should follow exactly the same s was executed n times.A special value n =-1 indicates that execution path as P assuming that the coverage statistics the profiler provides no coverage information for this line. of program P are correct. Where should a profiler report a coverage statistics for a line 2)(Weak Inconsistency)Cp(s)=-1 V Cp(s)=-1, of code s (i.e.,whether Cp(s)=-1)is not well defined.Code indicating an inaccurate statistics because that a non- transformations (e.g.,expansion of macros,or compilation instrumented line is actually executed in its equivalent. from source code to intermediate code)and optimizations This is also for sure a bug because non-optimized cov- may lead to Cp(s)=-1 for a line,and different profilers erage statistics should faithfully reflect the program's generally have different opinions upon which lines would have execution path. Cp(s)>0.Later we see that this is a major limitation of existing techniques for validating coverage profilers. The self-validator Cod fully exploits the inconsistencies be- tween path-equivalent programs with zero false positive.Cod B.Validating Coverage Profilers addresses the limitation of C2V in Section II-C and handles Validating the correctness a coverage profiler is challenging weak inconsistencies whereas C2V has to ignore all weak because it is labor-intensive to obtain the ground truth of inconsistencies between independent profiler implementations coverage statistics.Though we have large number of test inputs to avoid being flooded by false positives.It is worth noting (any program used to test a compiler also works in testing a that such a technique of obtaining path-equivalent programs, profiler),lacking of a test oracle became the problem. 
firstly known as EMI,was proposed to validate the correctness The only known technique to uncover coverage profiler bugs of compiler optimizations [2].We found that this idea is also is C2V which is based on differential testing [11].Given a powerful in the validation of coverage profilers.Nevertheless, program P under profiling,C2V profiles it using two indepen- Cod differs from EMI as we proposed the specialized mutation dently implemented profilers to obtain for each statement s the strategies to acquire the program variants,and adopted the coverage statistics Cp(s)and cp(s).When Cp(s)Cp(s). results verification criterion in particular for testing coverage an inconsistency is found.When Cp(s)>0ACp(s)>0 profilers.We defer to Section V for the comparison details. (a strong inconsistency),C2V reports it as a bug candidate and uses clustering to filter out potential false positives and duplicates. Results We implemented Cod as a prototype and evaluated it on two popular coverage profilers.namely gcov and llvm-cov C.Limitations of Differential Testing integrated with the compiler gcc and llvm,respectively.As of Though being effective in uncovering coverage profiler the submission deadline,a total of 196 potential bugs (123 for bugs,differential testing also has the following major limi- gcov,73 for llvm-cov)are uncovered within 4 months,among tations: which 23 bugs have already been confirmed by the developers. First,differential testing cannot be applied when there is Promisingly,all the detected bugs are new bugs according to only a single coverage profiler.This is the case for many the developers'feedback. mainstream programming languages (e.g,Python and Perl). Second,differential testing requires heavy human efforts on Outline The rest of the paper is organized as follows.We analyzing the reports because it is hard to determine which introduce the necessary background and a brief motivation in one is faulty when two profilers disagree on the statistics of a Section II.Section III elaborates on the detailed approach, line of code. followed by the evaluation in Section IV.We discuss related Third,differential testing miss many potential bugs on weak work in Section V and conclude the paper in Section VI. inconsistencies.Two profilers can have inconsistent (but both 80
However, consider a line of code s in P and P′ for which the same profiler reported different coverage results, i.e., CP(s) ≠ CP′(s), where CP(s) refers to the profiled runtime execution count of statement s in program P. The execution count CP(s) is usually a nonnegative number, except for a special value −1 indicating unknown coverage information. This can happen when the coverage profiler fails to obtain the coverage information of statement s due to the information loss caused by the abstraction gap between the source code and the intermediate code produced during compilation. Given CP(s) ≠ CP′(s), one of two cases applies:

1) (Strong Inconsistency) CP(s) ≥ 0 ∧ CP′(s) ≥ 0, meaning that the coverage profiler reports inconsistent coverage information. This is definitely a bug in the profiler, because P′ should follow exactly the same execution path as P, assuming that the coverage statistics of program P are correct.

2) (Weak Inconsistency) CP(s) = −1 ∨ CP′(s) = −1, indicating inaccurate statistics, because a non-instrumented line is actually executed in its equivalent program. This too is certainly a bug, because non-optimized coverage statistics should faithfully reflect the program's execution path.

The self-validator Cod fully exploits the inconsistencies between path-equivalent programs with zero false positives. Cod addresses the limitations of C2V discussed in Section II-C and handles weak inconsistencies, whereas C2V has to ignore all weak inconsistencies between independent profiler implementations to avoid being flooded by false positives. It is worth noting that the technique of obtaining path-equivalent programs, first known as EMI, was proposed to validate the correctness of compiler optimizations [2]. We found that this idea is also powerful for the validation of coverage profilers. Nevertheless, Cod differs from EMI in that we propose specialized mutation strategies to acquire the program variants, and adopt a results verification criterion tailored to testing coverage profilers. We defer the detailed comparison to Section V.

Results We implemented Cod as a prototype and evaluated it on two popular coverage profilers, namely gcov and llvm-cov, integrated with the compilers gcc and llvm, respectively. As of the submission deadline, a total of 196 potential bugs (123 for gcov, 73 for llvm-cov) were uncovered within 4 months, among which 23 bugs have already been confirmed by the developers. Promisingly, all the detected bugs are new bugs according to the developers' feedback.

Outline The rest of the paper is organized as follows. We introduce the necessary background and a brief motivation in Section II. Section III elaborates on the detailed approach, followed by the evaluation in Section IV. We discuss related work in Section V and conclude the paper in Section VI.

II. BACKGROUND AND MOTIVATION

A. Coverage Profilers

Code coverage profiling data (each line of code's execution count in a program execution) is the foundation of a broad spectrum of software engineering practices. Code coverage is the most widely adopted criterion for measuring testing thoroughness, and is also widely used in the automation of software engineering tasks. For example, test input generation techniques leverage code coverage to guide search iterations [3]; fault localization techniques use code coverage to isolate potentially faulty branches [4]–[6].
To obtain code coverage statistics, a code coverage profiler maintains an execution counter for each source code line and updates the counters along with the program execution. Specifically, given a program P, a coverage profiler runs P and outputs for each line of code s ∈ P a number CP(s) = n, indicating that s was executed n times. A special value n = −1 indicates that the profiler provides no coverage information for this line.

Whether a profiler should report coverage statistics for a line of code s (i.e., whether CP(s) = −1) is not well defined. Code transformations (e.g., expansion of macros, or compilation from source code to intermediate code) and optimizations may lead to CP(s) = −1 for a line, and different profilers generally have different opinions on which lines should have CP(s) ≥ 0. Later we will see that this is a major limitation of existing techniques for validating coverage profilers.

B. Validating Coverage Profilers

Validating the correctness of a coverage profiler is challenging because it is labor-intensive to obtain the ground truth of coverage statistics. Though we have a large number of test inputs (any program used to test a compiler also works for testing a profiler), the lack of a test oracle remains the problem.

The only known technique to uncover coverage profiler bugs is C2V, which is based on differential testing [11]. Given a program P under profiling, C2V profiles it using two independently implemented profilers to obtain for each statement s the coverage statistics CP(s) and C′P(s). When CP(s) ≠ C′P(s), an inconsistency is found. When CP(s) ≥ 0 ∧ C′P(s) ≥ 0 (a strong inconsistency), C2V reports it as a bug candidate and uses clustering to filter out potential false positives and duplicates.

C. Limitations of Differential Testing

Though effective in uncovering coverage profiler bugs, differential testing has the following major limitations:

First, differential testing cannot be applied when there is only a single coverage profiler. This is the case for many mainstream programming languages (e.g., Python and Perl).

Second, differential testing requires heavy human effort for analyzing the reports, because it is hard to determine which profiler is faulty when the two disagree on the statistics of a line of code.

Third, differential testing misses many potential bugs on weak inconsistencies. Two profilers can have inconsistent (but both correct) interpretations of the coverage statistics.
Section IV reveals that 92.9% of the inconsistencies found by differential testing are weak, i.e., CP(s) ≠ C′P(s) with CP(s) = −1 ∨ C′P(s) = −1. Our motivating example in Figures 1 (a)–(b) shows that for 10/12 lines, exactly one of gcov and llvm-cov provides no coverage information. Unfortunately, C2V has to ignore all such weak inconsistencies (i.e., not report any of them as a bug) because the vast majority of them are attributable to the design of independently implemented profilers.

Finally, differential testing reports false positives even when two profilers report strongly inconsistent coverage statistics, because two profilers may disagree over the definition of the execution count of a line, e.g., whether the initialization code of a for loop counts as one execution.

 1:  1:int main()
-1:  2:{
-1:  3: switch (8)
-1:  4: {
-1:  5: case 8:
 1:  6: break;
-1:  7: default:
-1:  8: abort ();
-1:  9: break;
-1: 10: }
 1: 11: return 0;
-1: 12:}

 1| -1|int main()
 2|  1|{
 3|  1| switch (8)
 4|  1| {
 5|  1| case 8:
 6|  1| break;
 7|  1| default:
 8| ×0| abort ();
 9|  1| break;
10|  1| }
11|  1| return 0;
12|  1|}

 1| -1|int main()
 2|  1|{
 3|  1| switch (8)
 4|  1| {
 5|  1| case 8:
 6|  1| break;
 7|  1| default:
 8| ×0| ; // abort ();
 9| ×0| break;
10|  1| }
11|  1| return 0;
12|  1|}

(a) CP (gcov)    (b) CP (llvm-cov)    (c) CP\{s8}∪{s′8} (llvm-cov)

Fig. 1. The bug case of LLVM #41821. llvm-cov incorrectly reported that the break in Line 9 is executed. This bug cannot be detected by differential testing [11] because gcov does not provide coverage statistics for Line 9. Visual conventions of coverage statistics: for a line of code s, gcov and llvm-cov output the coverage statistics CP(s) in the first and second column, respectively. A -1 denotes that the profiler does not provide coverage information for s. A check mark or cross mark followed by a number n denotes that CP(s) = n.

D. Motivation

The key observation leading to automatic self-validation of a single profiler is that changing unexecuted statements in a program should not affect the coverage statistics. Take the program in Figure 1 (a)–(b) as an example. Suppose that we comment out the function call in Line 8 of P and obtain P′ = P\{s8}∪{s′8}, as shown in Figure 1 (c). We assert that unchanged statements should have identical coverage statistics, i.e.,

∀s ∈ P ∩ P′. CP(s) = CP′(s),

reasonably assuming that:
1) P is deterministic, contains no undefined behavior, and does not depend on the external environment;
2) the coverage statistics are correct; and
3) the executions of P and P′ are consistent with their semantics.

Since we only remove "unexecuted" statements reported by a coverage profiler, P and P′ should be semantically equivalent. Furthermore, a profiler (particularly under minimal optimization) should be self-consistent in terms of which statements have coverage statistics. Therefore, if there is an inconsistency CP(s) ≠ CP′(s), no matter whether it is a strong or weak inconsistency, one of the above assumptions is violated. It turns out that we should blame the profiler, because we have full control over P (thus it is easy to guarantee that it is deterministic) and there is little chance that the compiler/hardware is defective.

Fig. 2. The framework of Cod. (Flowchart: a program pruner derives P′ from program P given its coverage report; both programs are compiled and profiled; the outputs O and O′ and the coverage reports C and C′ are compared, and a bug is reported on any inconsistency.)
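The per-line check behind this argument can be written down directly. The following C sketch, our own illustration with names not taken from Cod, classifies a pair of counts for the same unchanged line according to the strong/weak distinction defined above:

#include <stdio.h>

/* A count is a nonnegative execution frequency, or -1 when the
   profiler reports no information for the line. */
typedef enum { CONSISTENT, STRONG_INCONSISTENCY, WEAK_INCONSISTENCY } verdict;

/* c1 and c2 are C_P(s) and C_P'(s) for an unchanged line s. */
verdict classify(int c1, int c2) {
    if (c1 == c2) return CONSISTENT;
    if (c1 >= 0 && c2 >= 0) return STRONG_INCONSISTENCY;
    return WEAK_INCONSISTENCY;   /* at least one side is -1 */
}

int main(void) {
    /* Line 9 of Figure 1: llvm-cov reports 1 for P but 0 for P'. */
    printf("%d\n", classify(1, 0));   /* prints 1: STRONG_INCONSISTENCY */
    return 0;
}

Any verdict other than CONSISTENT is reported as a bug; no second profiler implementation is consulted.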
In the motivating example, CP(s9) ≠ CP′(s9) revealed a previously unknown profiler bug in which llvm-cov incorrectly reported an unexecuted break as being executed once. This bug case is missed by differential testing (particularly, C2V) because all inconsistencies between CP (gcov) and CP (llvm-cov) are weak (Lines 1–5, 7–10, and 12). Generally, weak inconsistencies between different compiler implementations indicate different compilation strategies (and thus do not indicate a profiler bug) and should not be reported by C2V.

III. APPROACH

A. Metamorphic Testing

A test oracle is a mechanism for determining whether a test has passed or failed. Under certain circumstances, however, the oracle is not available or is too expensive to obtain.
This is known as the oracle problem [14]. For example, in compiler testing, it is not easy to verify whether the executable code generated by a compiler is functionally equivalent to the given source code. Even if the oracle is available, manually checking the oracle results is tedious and error-prone [15], [16]. As a matter of fact, the oracle problem has been "one of the most difficult tasks in software testing" [16].

Metamorphic testing (MT) was coined by T.Y. Chen in 1998 [12] and can be exploited to alleviate the oracle problem. Based on existing successful test cases (those that have not revealed any failure, such as running too long or returning abnormal values), MT generates follow-up test cases by making reference to metamorphic relations (MRs), which are necessary properties of the target function or algorithm in terms of multiple inputs and their expected outputs. Let us consider a program P implementing function F on domain D. Let t be an initial successful test case, i.e., t ∈ D and the execution output P(t) equals the expected value F(t). MT can be applied to generate a follow-up test case t′ ∈ D based on t and a pre-defined MR (i.e., metamorphic relation). For program P, an MR is a property of its target function F. For instance, suppose F(x) = sin(x); then the property sin(π−x) = sin(x) is a typical MR with respect to F. Hence, given a successful test case, say t = 1.2, MT generates its follow-up test case t′ = π − 1.2, and then runs the program over t′. Finally, the two outputs (i.e., P(t) and P(t′)) are checked to see whether they satisfy the expected relation P(t) = P(t′). If the identity does not hold, a failure manifests.
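This sin MR can be turned into a tiny self-checking test that needs no oracle for the exact value of sin(t). The sketch below is our own illustration; the tolerance and the input are arbitrary choices:

#include <math.h>
#include <stdio.h>

int main(void) {
    const double pi = 3.141592653589793;
    double t = 1.2;              /* initial successful test case */
    double t2 = pi - t;          /* follow-up test case from the MR */

    /* The MR sin(pi - t) = sin(t) must hold up to floating-point
       tolerance; a violation signals a bug in the sin under test. */
    if (fabs(sin(t) - sin(t2)) > 1e-9)
        printf("MR violated: sin is likely buggy\n");
    else
        printf("MR holds for t = %g\n", t);
    return 0;
}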
In our work, we apply MT to the validation of code coverage profilers. Each program becomes a test case fed to the profiler. Given a program P and the code coverage profiler under test, running P with an input i produces the execution output O and a coverage report C from the coverage profiler. The coverage report C records which lines of code are executed (or unexecuted) and exactly how many times they are executed. Note that this program P is the initial test case according to the notions of MT. A follow-up test program P′ can be generated based on the following MR: given a program, the unexecuted code can be eliminated, since this code has no impact on the execution output or on the coverage report of the executed part.

In other words, by taking advantage of the coverage information C, we generate P′, a functionally equivalent variant of P, by removing the unexecuted code statements of P. We run P′ on the same input i and obtain the output O′ and coverage results C′, accordingly. A bug is then discovered if 1) the execution output O is not equal to the new output O′, or 2) there exists an inconsistency between the coverage information for executed code inside C and C′. Figure 2 shows our framework for the self-validation of code coverage profilers.

B. Our Algorithm

Based on this formulation, we implemented a tool Cod for detecting bugs in C code coverage profilers. Cod consists of three main steps: (1) extracting the output and coverage information of a given test program, (2) generating equivalent variants based on the code coverage report, and (3) comparing the outputs and coverage results to uncover bugs.

Algorithm 1: Cod's process for coverage tool validation
Data: The profiler T under test, the program P, the input i
Result: reported bugs
 1 begin
     /* Step 1: Extract output and coverage information */
 2   Pexe ← compile(P)
 3   O ← getOutput(execute(Pexe, i))
 4   C ← T.extractCoverage(execute(Pexe, i))
     /* Step 2: Generate variants via transformation */
 5   P′ ← genVariant(P, C)
 6   P′exe ← compile(P′)
 7   O′ ← getOutput(execute(P′exe, i))
 8   C′ ← T.extractCoverage(execute(P′exe, i))
     /* Step 3: Compare outputs and reports */
     // First stage
 9   if O ≠ O′ then
10     reportBug()
     // Second stage
11   else if inconsistent(C, C′) then
12     reportBug()

/* Generate a variant for program P under coverage C */
13 Function genVariant(Program P, Coverage C)
14   P′ ← P
15   foreach s ∈ getStmts(P) ∧ CP(s) = 0 do
16     if randomChoice() then P′.delete(s)
17   if isCompilable(P′) then
18     return P′
19   else
20     return genVariant(P, C)

/* Check whether the coverage reports are inconsistent */
21 Function inconsistent(Coverage C, Coverage C′)
22   foreach s ∈ C ∧ s ∈ C′ do
23     if CP(s) ≠ CP′(s) then
24       return True
25   return False

Algorithm 1 is the main process of Cod. First, Cod compiles the program P and profiles its execution with input i to collect (1) the output O (which may correspond to a return value or an exit code) and (2) the code coverage information C (Lines 2–4). It then generates the variant P′ with respect to program P (Line 5) and collects the respective output O′ and code coverage C′ (Lines 6–8). Finally, it compares the outputs together with the code coverage reports to validate the coverage profiler. A potential bug in T is reported if either is inconsistent (Lines 9–12). We discuss each step in detail below.
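Before detailing the steps, note that they map onto ordinary compiler and profiler invocations. The following C sketch is ours, for illustration only; ./prune stands in for a hypothetical helper implementing genVariant, and Cod's actual implementation may differ:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Step 1: compile P (test.c) with instrumentation, run it, and
       extract its coverage report (test.c.gcov). */
    system("gcc -O0 --coverage -o test test.c");
    int o = system("./test");        /* output O (raw exit status) */
    system("gcov test.c");           /* coverage report C */

    /* Step 2: derive the variant P' (variant.c) by pruning a random
       subset of unexecuted lines; ./prune is a hypothetical helper. */
    system("./prune test.c test.c.gcov variant.c");
    system("gcc -O0 --coverage -o variant variant.c");
    int o2 = system("./variant");    /* output O' */
    system("gcov variant.c");        /* coverage report C' */

    /* Step 3, first stage: compare the outputs; the second stage
       diffs the two .gcov reports over their common lines. */
    if (o != o2)
        printf("potential profiler bug: outputs differ\n");
    return 0;
}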
Extracting Coverage Information For each test program P, we first compile it with particular options to generate the executable binary Pexe. These options instruct the compiler to integrate the necessary instrumentation code into the executable. While executing Pexe with input i, we obtain the output O of the program. Meanwhile, the code coverage report C can be readily extracted by the coverage profiler T. Each code coverage report records the lines of code executed and unexecuted in the test program P under the input i. The statements marked as unexecuted will be randomly pruned to generate P's equivalent variants, as discussed shortly. Cod implements support for both gcov and llvm-cov. Taking gcov as an example, Cod extracts coverage information by compiling the program P with the flags "-O0 --coverage" under gcc. These flags tell the compiler to instrument additional code in the object files for generating the extra profiling information at runtime. Cod then runs the executable binary under input i to produce the coverage report for the program P.

Generating Variants via Transformation Based on the coverage report C for the original program P, its variants are generated (Line 5). Cod produces the variants by stochastically removing unexecuted program statements from the original program P. Specifically, for each of these removable lines of code, we make a random choice. As such, we obtain a number of variants P′ that should be equivalent to the original program. The function genVariant (Lines 13–20) describes Cod's process for generating equivalent mutants via transformation. Note that stochastically removing unexecuted program statements can lead to many uncompilable mutants. Only the compilable ones are returned by genVariant (Line 17).

Comparing the Outputs and Coverage Reports Having the outputs and the coverage reports for the original program P and its variant P′, we detect bugs in the code coverage tool T by checking for inconsistencies. More specifically, we first compare the outputs of P and P′. If they are not identical, a potential bug in the code coverage tool is reported. Otherwise, the code coverage reports are further compared to seek inconsistencies. Note that only the code coverage of the common lines of code between the programs P and P′ (i.e., those lines of code left in the variant program) is considered for comparison. If the code coverage reports are not consistent over the common lines, a potential bug is reported as well (Lines 9–12).
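The second-stage comparison over common lines can be sketched as a small report differ. The program below is our illustration rather than Cod's code, and assumes each report has been preprocessed into "lineno count" pairs, where -1 means the profiler provides no information for that line:

#include <stdio.h>

#define MAXLINE 4096

/* cov[i] holds C(s_i); -2 marks lines absent from the report, which
   are later skipped as non-common lines. */
static int read_report(const char *path, int cov[MAXLINE]) {
    for (int i = 0; i < MAXLINE; i++) cov[i] = -2;
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    int line, count;
    while (fscanf(f, "%d %d", &line, &count) == 2)
        if (line >= 0 && line < MAXLINE) cov[line] = count;
    fclose(f);
    return 0;
}

int main(int argc, char **argv) {
    static int c[MAXLINE], c2[MAXLINE];
    if (argc != 3 || read_report(argv[1], c) || read_report(argv[2], c2))
        return 1;
    for (int i = 0; i < MAXLINE; i++) {
        if (c[i] == -2 || c2[i] == -2) continue;  /* not a common line */
        if (c[i] != c2[i])
            printf("line %d: %d vs %d (%s inconsistency)\n", i, c[i],
                   c2[i], (c[i] >= 0 && c2[i] >= 0) ? "strong" : "weak");
    }
    return 0;
}

Any printed line corresponds to a bug report in the second stage of Algorithm 1.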
C. Illustrative Examples

In the following, we use three of our reported bugs as concrete examples to illustrate how Cod works. All three bugs were newly discovered by Cod and confirmed by the GCC developers.

Bug Example Exposed by Different Outputs Figure 3 shows a real bug example exposed via the different outputs of two "equivalent" programs under gcov [17], a C code coverage tool integrated in GCC [18]. Figures 3 (a) and (b) are the code coverage reports produced by gcov for the original program P and its equivalent program P′ (obtained by removing the unexecuted Line 6), respectively. Note that all the test programs are reformatted for presentation. As can be seen, a code coverage report is an annotated version of the source code augmented with the execution frequency of each line. The first and second columns list the execution frequency and the line number. The frequency number "-1" in the first column indicates that the coverage information is unknown.

In this example, we first use gcc to compile the program P and then execute it to produce the output and coverage report (shown in Figure 3 (a)). Note that the output in this case is 0. According to the original code coverage report of P, Cod decides to remove the statement in Line 6 from the original program, resulting in an equivalent program P′, shown in Figure 3 (b). Next, we compile and execute P′ to get the new output and coverage report. Here, the output turns out to be 1.

Since the outputs of these two programs are not equal, P and P′ are in fact not equivalent, meaning that we actually deleted some executed code: the code coverage tool wrongly marked some executed statements as not executed. A potential bug is identified. We reported this bug to Bugzilla, and the gcov developers quickly confirmed and fixed it.

Bug Example Exposed by Strongly Inconsistent Coverage Figure 4 illustrates another real bug example uncovered by strongly inconsistent code coverage reports between the program and its "equivalent" variant. Figure 4 (a) shows the coverage report for P. We can read from it that Line 10 is not executed at all (i.e., the execution count is 0). Cod prunes Line 10 to generate the equivalent program P′. After compiling and executing P′, the coverage report shown in Figure 4 (b) is produced. As can be seen, there exists a strong inconsistency in terms of the execution frequency of Line 6, indicating a potential bug. This bug has been submitted and already confirmed by the gcov developers.

Bug Example Exposed by Weakly Inconsistent Coverage Figure 5 presents another confirmed real bug example found via weakly inconsistent code coverage reports between the program and its equivalent variant. In Figure 5 (a), Line 6 of P is not executed (i.e., the execution count is 0). Cod removes Line 6 to generate the equivalent program P′. Upon compiling and executing P′, the coverage report shown in Figure 5 (b) is generated. A weak inconsistency with respect to the execution frequency of Line 5 appears, indicating a potential bug.

IV. EVALUATION

This section presents our evaluation of Cod. We evaluated Cod using the most popular practical code coverage profilers, gcov and llvm-cov, together with a set of programs used for testing compilers, and compared the results with the existing differential testing technique C2V [11].

A. Evaluation Setup

Profilers for Validation We evaluated Cod using the latest versions of gcov and llvm-cov, the two most popular code coverage profilers for C programs, as our experimental subjects. Both profilers are:
1) popular in the software engineering community;
2) integrated in the most widely used production compilers, i.e., GCC and Clang;
3) extensively validated by existing research, both for the compilers and the profilers.
-1:  1:#include <stdio.h>
-1:  2:int *p=0, a=0, b=2;
 1:  3:int *foo() {
 1:  4:  int *r=(int *)1;
-1:  5:  while (1) {
×0:  6:    r = (int)(a+p) & ~1;
 1:  7:    if (a < b) return r;
-1:  8:  }
-1:  9:  return r;
-1: 10:}
 1: 11:void main() {
 1: 12:  int *r = foo();
 1: 13:  printf("%d\n", r);
 1: 14:}

-1:  1:#include <stdio.h>
-1:  2:int *p=0, a=0, b=2;
 1:  3:int *foo() {
 1:  4:  int *r=(int *)1;
-1:  5:  while (1) {
-1:  6:    // r = (int)(a+p) & ~1;
 1:  7:    if (a < b) return r;
-1:  8:  }
-1:  9:  return r;
-1: 10:}
 1: 11:void main() {
 1: 12:  int *r = foo();
 1: 13:  printf("%d\n", r);
 1: 14:}

(a) CP (gcov, output: 0)    (b) CP′ (gcov, output: 1)

Fig. 3. A real bug example exposed by Cod via different outputs. This is Bug #89675 of gcov 8.2.0. In (a), Line #6 is marked as not executed; (b) is the "equivalent" program obtained by deleting Line #6 from the original program in (a). The outputs of these two "equivalent" programs are not identical, indicating a bug in gcov 8.2.0.

 1:  1:int foo() {
 1:  2:  int h=2, f=1, k=0;
 1:  3:  int y=18481, x=y;
 1:  4:  if(y!=0 && (k>4)) {
1*:  5:    h=y>0 ? 2:1;
 2:  6:    if (f) {
 1:  7:      h^=3;
-1:  8:    }
-1:  9:  } else {
×0: 10:    h = 0;
-1: 11:  }
 1: 12:  return h;
-1: 13:}
 1: 14:void main() { foo(); }

 1:  1:int foo() {
 1:  2:  int h=2, f=1, k=0;
 1:  3:  int y=18481, x=y;
 1:  4:  if(y!=0 && (k>4)) {
1*:  5:    h=y>0 ? 2:1;
 1:  6:    if (f) {
 1:  7:      h^=3;
-1:  8:    }
-1:  9:  } else {
-1: 10:    // h = 0;
-1: 11:  }
 1: 12:  return h;
-1: 13:}
 1: 14:void main() { foo(); }

(a) CP (gcov)    (b) CP\{s10}∪{s′10} (gcov)

Fig. 4. A real bug example discovered by Cod, with confirmed bug id #89470 of gcov 8.2.0. When the unexecuted Line #10 is pruned from the original program in (a), the code coverage of Line #6 is inconsistent between that of the original program and the new program in (b), which indicates a bug. A star after a number in Line #5 denotes that this number may be inaccurate.

 1:  1:void foo(int x, unsigned u) {
 1:  2:  if ((1U << x) != 64
 1:  3:      || (2 << x) != u
-1:  4:      || (1 << x) == 14
 1:  5:      || (3 << 2) != 12)
×0:  6:    __builtin_abort ();
 1:  7:}
 1:  8:int main() {
 1:  9:  foo(6, 128U);
 1: 10:  return 0;
-1: 11:}

 1:  1:void foo(int x, unsigned u) {
 1:  2:  if ((1U << x) != 64
 1:  3:      || (2 << x) != u
-1:  4:      || (1 << x) == 14
-1:  5:      || (3 << 2) != 12)
-1:  6:    ; // __builtin_abort ();
 1:  7:}
 1:  8:int main() {
 1:  9:  foo(6, 128U);
 1: 10:  return 0;
-1: 11:}

(a) CP (gcov)    (b) CP\{s6}∪{s′6} (gcov)

Fig. 5. A real bug example discovered by Cod, with confirmed bug id #90439 of gcov 9.0. When the unexecuted Line #6 is pruned from the original program in (a), the code coverage of Line #5 is weakly inconsistent between that of the original program and the new program in (b).

Following the existing research [11], we use the default compiler flags to obtain coverage reports for gcov and llvm-cov under the zero optimization level. Given a piece of source code test.c, the following commands are used to produce the coverage report test.c.gcov:

gcc -O0 --coverage -o test test.c
./test
gcov test.c

For llvm-cov, we use the following commands to produce the coverage report test.c.lcov:

clang -O0 -fcoverage-mapping -fprofile-instr-generate -o test test.c
./test
llvm-profdata merge default.profraw -o test.pd
llvm-cov show test -instr-profile=test.pd test.c > test.c.lcov
Evaluation Steps  To run either differential testing or Cod, we obtain code coverage statistics for the 26,530 test programs in the test-suite shipped with the latest gcc release (7.4.0) and for 5,000 random programs generated by Csmith [19]. All evaluated programs contain neither external environmental dependencies nor undefined behavior. We run Cod over all the test programs and collect all reported inconsistencies for manual inspection. We also compare these results with the state-of-the-art differential testing technique C2V [11].

Testing Environment  We evaluated the latest development versions of gcov (up to gcc 9.0.1-20190414) and llvm-cov (up to llvm 9.0.0-svn358899). All experiments were conducted on a hexa-core Intel(R) Core(TM) CPU@3.20GHz virtual machine with 10 GiB of RAM running Ubuntu Linux 18.04.

B. Experimental Results

Inconsistent Reports  For each test case in our testbed, Cod generated exactly one variant for validation, by removing from the original test case all statements that the coverage profilers reported as unexecuted. Generating more variants per test program would likely trigger more inconsistencies and probably detect more bugs in the coverage profilers. Table I shows the statistics of bug-triggering test programs for the two coverage profilers under test, i.e., gcov and llvm-cov. Column 2 gives the number of program/variant pairs that lead to different execution outputs, and the last two columns give the numbers of pairs that yield strongly and weakly inconsistent coverage reports.

TABLE I
STATISTICS OF BUG-TRIGGERING TEST PROGRAMS.

                                 Inconsistent Reports
Profilers    Different Outputs    Strong    Weak
gcov                 1              69       54
llvm-cov             0              62       11

The single case in which the variant outputs a different value (Figure 3) is due to incorrect coverage statistics causing Cod to create a functionally different "equivalent" mutated variant. The other inconsistencies are also due to profiler bugs, which are discussed as follows.

Bugs Found  We manually inspected all cases and found that all reported (strong and weak) inconsistencies revealed defects in the profilers. So far, we have reported a total of 26 bugs to the developers of gcov and llvm-cov. The manual classification and reporting of profiler bugs is still ongoing; we believe that more bugs will be reported in the future.

23 of the 26 bugs are confirmed³ by the developers, as listed in Table II. Of the remaining three, one is still pending confirmation, one was marked as a duplicate, and only one was rejected by the developer (gcov #90438). The rejected case is controversial because gcc performs optimization even under the zero optimization level (as shown in Figure 6), which may mislead a developer or an automated tool that relies on the branch information in the coverage statistics.

TABLE II
LIST OF CONFIRMED OR FIXED BUGS. PN DENOTES A NORMAL PRIORITY. A "×" IN THE DIFFTEST COLUMN MARKS A BUG THAT CANNOT BE FOUND BY DIFFERENTIAL TESTING.

ID   Profiler   Bugzilla ID   Priority   Status   Type          DiffTest
 1   gcov         88913         P3       Fixed    Wrong Freq.
 2   gcov         88914         P3       Fixed    Wrong Freq.
 3   gcov         88924         P5       New      Wrong Freq.
 4   gcov         88930         P3       Fixed    Wrong Freq.
 5   gcov         89465         P3       Fixed    Missing          ×
 6   gcov         89467         P3       Fixed    Wrong Freq.
 7   gcov         89468         P5       New      Wrong Freq.      ×
 8   gcov         89469         P5       New      Wrong Freq.
 9   gcov         89470         P5       New      Wrong Freq.
10   gcov         89673         P5       New      Spurious         ×
11   gcov         89674         P5       New      Spurious         ×
12   gcov         89675         P3       Fixed    Missing          ×
13   gcov         90023         P5       New      Spurious         ×
14   gcov         90054         P3       Fixed    Missing
15   gcov         90057         P3       Fixed    Wrong Freq.
16   gcov         90066         P5       New      Wrong Freq.      ×
17   gcov         90091         P3       New      Wrong Freq.
18   gcov         90104         P3       New      Wrong Freq.      ×
19   gcov         90425         P5       New      Wrong Freq.      ×
20   gcov         90439         P3       New      Missing          ×
21   llvm-cov     41051         PN       New      Wrong Freq.
22   llvm-cov     41821         PN       New      Spurious         ×
23   llvm-cov     41849         PN       New      Missing          ×

³ Consistent with C. Sun et al.'s [20] and V. Le et al.'s [21] studies: because the bug management process of LLVM is not as organized as that of GCC, if an llvm-cov bug report has been CCed by Clang developers and no objection appears in the comments, we label the bug as confirmed. In addition, as stated by the developers, if someone does not close a reported bug as "invalid", then the bug is real in LLVM Bugzilla.

Following the notions from C2V, code coverage bugs inside coverage profilers can be categorized as Spurious Marking, Missing Marking, and Wrong Frequency. As shown in Column 6 of Table II, Cod is able to detect all three types of bugs in coverage profilers: 14 bugs are Wrong Frequency, 5 are Missing Marking, and the remaining 4 are Spurious Marking. Thus, most of the bugs are Wrong Frequency bugs, i.e., execution frequencies are wrongly reported.

Among all these bugs, nearly half (12/26) cannot be manifested by differential testing. Considering that differential testing leverages the coverage statistics of an independent profiler implementation (which produces correct coverage information in all these cases, so differential testing essentially compares against a golden version) whereas Cod performs mere self-validation, we expect Cod to be effective and useful in finding code coverage profiler bugs.
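The strong and weak inconsistencies counted in Table I can be checked mechanically once both reports are parsed into line-to-count maps. The sketch below encodes our reading of the two notions (their formal definitions are given earlier in the paper): on a line kept in the variant, two recorded but different counts form a strong inconsistency, while a count recorded in exactly one of the two reports forms a weak one.

def classify_inconsistencies(orig, variant, pruned):
    # orig, variant: {line_no: count} maps for P and P' (-1 means the
    # line is not instrumented); pruned: lines removed to derive P'.
    strong, weak = [], []
    for line_no in sorted(set(orig) | set(variant)):
        if line_no in pruned:
            continue                 # pruning legitimately changes these
        a, b = orig.get(line_no, -1), variant.get(line_no, -1)
        if a == b:
            continue
        (strong if a != -1 and b != -1 else weak).append(line_no)
    return strong, weak

For the pair in Figure 4, for example, Line #6 (count 2 versus 1) would be flagged as a strong inconsistency, while in Figure 6 Line #3 (count 1 versus −1) would be flagged as a weak one.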
(a) P (gcov)

 1:  1:int f(int i) {
-1:  2:  int res;
 1:  3:  switch (i) {
×0:  4:  case 5:
×0:  5:    res = i - i;
×0:  6:    break;
 1:  7:  default:
 1:  8:    res = i * 2;
 1:  9:    break;
-1: 10:  }
 1: 11:  return res;
-1: 12:}
 1: 13:int main(void) {
 1: 14:  f(2);
 1: 15:  return 0;
-1: 16:}

(b) P′ = P\{s5, s6}∪{s′5, s′6} (gcov)

 1:  1:int f(int i) {
-1:  2:  int res;
-1:  3:  switch (i) {
-1:  4:  case 5:
-1:  5:    // res = i - i;
-1:  6:    // break;
-1:  7:  default:
 1:  8:    res = i * 2;
 1:  9:    break;
-1: 10:  }
 1: 11:  return res;
-1: 12:}
 1: 13:int main(void) {
 1: 14:  f(2);
 1: 15:  return 0;
-1: 16:}

Fig. 6. In the case of gcov #90438, gcov refuses to report coverage information for the case statement after removing Lines #5-6, but reports the execution of its default branch, which may mislead a developer or an automated tool. Note that although Line #4 is not covered, it is not removed; removing it would result in a compilation error.

C. Discussions

Statistics of Inconsistencies  Table III summarizes the test programs for which inconsistencies are identified by Cod but cannot be identified by C2V. All these inconsistencies are true positives: each one either is a bug or may mislead a developer or an automated tool.

TABLE III
SUMMARY OF THE TEST PROGRAMS WITH INCONSISTENT COVERAGE REPORTS BY COD AMONG THE TEST PROGRAMS THAT ARE CONSISTENT UNDER C2V.

                                  # inconsistent in terms of Cod
# weakly consistent under C2V      gcov              llvm-cov
                                 Strong   Weak     Strong   Weak
            3745                   10      23        19       9

Using these test programs to test gcov with Cod, we identified 23 weak and 10 strong inconsistencies. For llvm-cov, 28 weak and 19 strong inconsistencies are identified. This indicates that Cod has the unique ability to identify many inconsistencies that C2V cannot, given the same test programs. We thus believe that Cod is more powerful and useful than C2V.

Weak Inconsistencies Between Independently Implemented Coverage Profilers  As aforementioned, independently implemented code coverage profilers may have different interpretations of the same code. This is the major source of weak inconsistencies that C2V cannot recognize as bugs.

To further understand weak inconsistencies among profilers, we collected the common instrumentation sites between gcov 9.0.0 and llvm-cov 9.0 for the test programs in GCC testsuite 7.4.0. A code line s is a common instrumentation site (s ∈ C) if C_P^G(s) ≠ −1 ∧ C_P^L(s) ≠ −1, where C_P^G(s) and C_P^L(s) refer to the profiled runtime execution count of code line s in program P by gcov and llvm-cov, respectively. When C_P^G(s) ≠ C_P^L(s) ∧ (C_P^G(s) = −1 ∨ C_P^L(s) = −1), s is a non-common instrumentation site (s ∈ C̄).

Only 5,036 test programs in GCC testsuite 7.4.0 can be successfully compiled and further processed by both gcov and llvm-cov. Table IV summarizes the total number and percentage of common and non-common instrumentation sites: the second and fourth columns show the totals for C and C̄, and the third and fifth columns show the per-program averages. From Table IV, we find that about 46% of all code lines are in C and that each test program contains on average about 16 such lines.

TABLE IV
SUMMARY OF THE COMMON AND NON-COMMON INSTRUMENTATION SITES BETWEEN GCOV AND LLVM-COV FOR THE TEST PROGRAMS IN GCC TESTSUITE 7.4.0. C / C̄: NUMBER OF COMMON / NON-COMMON INSTRUMENTATION SITES.

            C                  C̄
       Total    Avg.      Total    Avg.
#      83026    16.49     98523    19.56
%     45.73%     -       54.27%     -

Table V summarizes the distribution of the proportion of non-common sites over these 5,036 test programs. We calculate the proportion as p = |C̄| / (|C| + |C̄|) for each test program; the second row of Table V lists how many test programs fall into each interval.

TABLE V
DISTRIBUTION OF THE PERCENTAGE OF NON-COMMON INSTRUMENTATION SITES. p = |C̄|/(|C| + |C̄|): THE PROPORTION OF NON-COMMON INSTRUMENTATION SITES.

p    0       0~10%   10~20%   20~30%   30~40%   40~50%   50~60%   60~70%   70~80%   80~90%   90~100%
#    61      21      91       228      703      1722     1433     678      276      84       58
%    1.14%   0.39%   1.70%    4.26%    13.13%   32.16%   26.76%   12.66%   5.15%    1.57%    1.08%

From Table V, we find that in most test programs about 40%~70% of the code lines are in C̄. This indicates that most code lines of each program are instrumented by only one of the two coverage profilers. Besides, only 1.14% of the test programs have exactly the same instrumentation sites under the two profilers.

Overall, our core observation is that different coverage profilers indeed have quite different interpretations of the same piece of code.
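The definitions of C and C̄ translate directly into code. Below is a minimal sketch over the same line-to-count map representation, with -1 again denoting a non-instrumented line:

def site_partition(gcov_counts, llvm_counts):
    # A line is a common instrumentation site (in C) if both profilers
    # record a count for it, and a non-common site (in C-bar) if
    # exactly one of them does.
    common, non_common = set(), set()
    for line_no in set(gcov_counts) | set(llvm_counts):
        g = gcov_counts.get(line_no, -1)
        v = llvm_counts.get(line_no, -1)
        if g != -1 and v != -1:
            common.add(line_no)
        elif g != -1 or v != -1:
            non_common.add(line_no)
    return common, non_common

def non_common_proportion(gcov_counts, llvm_counts):
    # p = |C-bar| / (|C| + |C-bar|), the quantity tabulated in Table V.
    c, cbar = site_partition(gcov_counts, llvm_counts)
    return len(cbar) / (len(c) + len(cbar)) if (c or cbar) else 0.0

Evaluating non_common_proportion on each test program gives the per-program proportions whose distribution Table V reports.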
Reliability of Code Coverage Profilers Under Compiler Optimizations  Finally, even though coverage profilers provide faithful statistics only under the zero optimization level, we still wonder whether inconsistencies reported by Cod for optimized binaries may reveal bugs in a profiler. Therefore, we also conducted our experiments under optimized compiler settings; the results are summarized in Table VI.

TABLE VI
SUMMARY OF INCONSISTENT REPORTS UNDER DIFFERENT OPTIMIZATION LEVELS.

                          Inconsistent lines
                      gcov                  llvm-cov
Optimization level    Strongly   Weakly     Strongly   Weakly
-O0                      69        54          62        11
-O1                    1115       635          62        11
-O2                     678       936          63        11
-O3                     679       937          63        11
-Os                     799       977          63        11
-Ofast                  677       927          61        11

After manually inspecting a few cases, we found that gcov generally does not provide comprehensive coverage statistics for optimized code (sometimes the statistics are even obviously wrong), whereas llvm-cov is much more reliable: its coverage statistics barely change across optimization levels.

We attempted to report one such obviously incorrect coverage statistic of gcov to the developers, as shown in Figure 7 (gcov #90420, under the -O3 optimization level). Line #11 cannot be executed 11 times under any circumstance; however, the developer rejected the bug report and argued that "it's the nature of any optimizing compiler. If you want to have the best results, then don't use -O3, or any other optimization level." This case is also controversial; nevertheless, it reveals that guaranteeing coverage statistics under compiler optimizations would be a worthwhile future direction.

(a) P (gcov)

×0:   1:int func (int *p) {
11*:  2:  int x = 0;
×0:   3:  for (int i = 0; i < 10; i++)
10*:  4:    x += p[i];
 1*:  5:  return x;
-1:   6:}
 1:   7:int main() {
 1:   8:  int a[10];
11:   9:  for (int i = 0; i < 10; i++)
10:  10:    a[i] = 1;
11:  11:  if (func(a) != 10)
×0:  12:    return 1;
-1:  13:  return 0;
-1:  14:}

(b) P′ = P\{s12}∪{s′12} (gcov)

×0:   1:int func (int *p) {
 1*:  2:  int x = 0;
×0:   3:  for (int i = 0; i < 10; i++)
×0:   4:    x += p[i];
 1*:  5:  return x;
-1:   6:}
 1:   7:int main() {
 1:   8:  int a[10];
 1:   9:  for (int i = 0; i < 10; i++)
-1:  10:    a[i] = 1;
 1:  11:  if (func(a) != 10)
-1:  12:    ; // return 1;
 1:  13:  return 0;
-1:  14:}

Fig. 7. The bug case of GCC #90420. gcov incorrectly reported that the if (func(a) != 10) in Line #11 was executed 11 times. Deleting Line #12 revealed this bug.

V. RELATED WORK

This section surveys related work on coverage profiler testing, metamorphic testing, testing via equivalence modulo inputs, and techniques relying on code coverage.

A. Code Coverage Profiler Testing

To the best of our knowledge, C2V [11] is the first and also the state-of-the-art work on hunting bugs in code coverage profilers. It feeds a randomly generated program to both gcov and llvm-cov, and then reports a bug if there exist inconsistencies between the produced coverage reports. Within four non-continuous months of testing, C2V uncovered 83 bugs, among which 42 and 28 bugs were confirmed or fixed by the gcov and llvm-cov developers, respectively. In essence, C2V is a randomized differential testing approach; as stated in Section II-C, it suffers from a number of drawbacks. The work presented in this paper attempts to fill the gap.

B. Metamorphic Testing

As a simple but effective approach to alleviating the oracle problem, metamorphic testing (MT) exploits a metamorphic relation (MR) among multiple inputs and their expected outputs to generate follow-up test cases from existing ones, and verifies the corresponding outputs against the MR. Since its first publication in 1998, MT has been successfully applied in a variety of domains, including bioinformatics [22], [23], web services [24], [25], embedded systems [26], components [27], databases [28], machine learning classifiers [29], online search functions and search engines [30], [31], and security [32].
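To make the MR idea concrete before surveying its applications, consider a toy relation (our illustration, not drawn from the cited studies): a correct sine implementation must satisfy sin(x) = sin(π − x), so every seed input x yields a follow-up input π − x whose output must match, without ever knowing the true value of sin(x).

import math
import random

def mr_check_sine(sin_impl, trials=1000, tol=1e-9):
    # Metamorphic test: no oracle for sin(x) itself is needed, only
    # the relation between the seed input and the follow-up input.
    for _ in range(trials):
        x = random.uniform(-10.0, 10.0)
        if abs(sin_impl(x) - sin_impl(math.pi - x)) > tol:
            return False        # the metamorphic relation is violated
    return True

assert mr_check_sine(math.sin)

Cod's MR has exactly this shape: the seed input is a program P, the follow-up input is its pruned variant, and the relation is equality of coverage statistics on the unchanged statements.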
Several representative works are listed below. Chan et al. [24], [25] presented a metamorphic testing methodology for Service-Oriented Applications (SOA); their method relies on so-called metamorphic services to encapsulate the services under test, executes the seed and follow-up test cases, and finally checks their results. Zhou et al. [30], [31] employed metamorphic testing to detect inconsistencies in online web search applications; several metamorphic relations are proposed and exercised in a number of experiments with web search engines such as Google, Yahoo!, and Live Search. Jiang et al. [26] presented several metamorphic relations for fault detection in Central Processing Unit (CPU) scheduling algorithms; two real bugs were found in one of the simulators under test. Beydeda [27] proposed a self-testing method for commercial off-the-shelf components via metamorphic testing. Zhou et al. [33] applied metamorphic testing to self-driving cars and detected fatal software bugs in the LiDAR obstacle-perception module. Chen et al. [22] presented several metamorphic relations for detecting faults in two open-source bioinformatics programs for gene regulatory network simulation and short-sequence mapping.

In this paper, we apply MT to a new domain, i.e., validating the correctness of coverage profilers.

C. Testing via Equivalence Modulo Inputs

Testing via equivalence modulo inputs (EMI) [2], [34], [35] is a testing technique proposed in recent years, targeted at discovering compiler optimization bugs. The basic idea of EMI is to modify a program to generate variants with the same outputs as the original program.
Initially, Le et al. [2] proposed to generate equivalent versions of a program by profiling the program's execution and pruning the unexecuted code inside. Once a program and its equivalent variant are constructed, both are fed to the compiler under test, and inconsistencies in the outputs are checked. Following this work, Athena [34] and Hermes [35] were developed. Athena [34] generates EMI variants by randomly inserting code into, and removing statements from, dead code regions. Hermes [35] complements these mutation strategies by operating on live code regions, which overcomes the limitations of mutating only dead code.

In Cod, we follow a similar way of generating program variants as EMI does, but focus on validating the correctness of coverage profilers rather than on finding optimization bugs in compilers. As such, during result verification, Cod checks inconsistencies not only in the outputs but, more importantly, in the coverage reports. Our evaluation also shows that only a few bugs (1 of the 23 confirmed bugs) can be discovered by looking at the outputs alone. Moreover, unlike EMI, which performs random modifications, Cod mutates the original program by aggressive statement pruning, thus triggering as many different coverage behaviors as possible.

D. Techniques Relying on Code Coverage

Code coverage is widely adopted in practice and extensively used to facilitate many software engineering tasks, such as coverage-based regression testing, coverage-based compiler testing, and coverage-based debugging. In the context of regression testing, test case prioritization and test suite augmentation are two widely used techniques [36]-[43]. The former aims to improve the fault-finding ability of test cases by scheduling them in a specific order [41], [43], [44]; achieving high code coverage as fast as possible is a common practice [45]. The latter generates new test cases to strengthen the fault-finding ability of a test suite [42], [46], [47]; in practice, new test cases are often generated to cover the source code affected by code changes.

Recent years have seen increasing interest in compiler testing, which aims to validate the correctness of compilers. One of the most attractive compiler testing techniques generates equivalence modulo inputs by stochastically pruning the unexecuted code of a program, based on the code coverage of the program's execution [2], [48]. With equivalence modulo inputs, compilers can be tested differentially. Obviously, the correctness of this "equivalence" relies on the reliability of code coverage.

Debugging is a common activity in software development that aims at locating the root cause of a fault. Spectrum-Based Fault Localization (SBFL) is one of the most extensively studied debugging techniques and is heavily based on code coverage [4], [49]-[51]. Given a specific test suite, SBFL leverages the code coverage and the corresponding failed/passed information to statistically infer which code is the root cause of a fault.
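As a concrete instance of such a coverage-based technique, the sketch below computes the widely used Ochiai suspiciousness score (a standard SBFL formula, chosen here for illustration; it is not specific to this paper). A line covered mostly by failing tests ranks highest, which is sound only if the underlying coverage is correct.

import math

def ochiai(coverage, failed):
    # coverage: test name -> set of covered lines; failed: set of
    # failing test names. Returns {line: suspiciousness score}.
    scores = {}
    for line in set().union(*coverage.values()):
        ef = sum(1 for t in failed if line in coverage[t])
        ep = sum(1 for t in coverage
                 if t not in failed and line in coverage[t])
        denom = math.sqrt(len(failed) * (ef + ep))
        scores[line] = ef / denom if denom else 0.0
    return scores

# Toy usage: line 3 is covered only by the failing test, so it ranks first.
cov = {"t1": {1, 2, 3}, "t2": {1, 2}}
print(ochiai(cov, failed={"t1"}))   # line 3 scores 1.0

If the profiler misreports the coverage of even a single line, ef and ep shift for that line and the ranking silently degrades, which is exactly why correct coverage information matters for SBFL.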
As we can see, correct code coverage information is a prerequisite for all the techniques above, which underlines the importance of our work.

VI. CONCLUSION

This paper presents Cod, an automated self-validator for code coverage profilers based on metamorphic testing. Cod addresses the limitations of the state-of-the-art differential testing approach and, encouragingly, found many previously unknown bugs that cannot be revealed by existing approaches.

ACKNOWLEDGMENT

We thank the anonymous reviewers for their constructive comments. We also thank the GCC and LLVM developers, especially Martin Liška, for analyzing and fixing our reported bugs. This work is supported by the National Key R&D Program of China (2018YFB1003901), the National Natural Science Foundation of China (61832009, 61432001, 61932021, 61690204, 61772259, 61772263, 61802165, 61802168, 61702256), the Natural Science Foundation of Jiangsu Province (BK20191247, BK20170652), the China Postdoctoral Science Foundation (2018T110481), and the Fundamental Research Funds for the Central Universities (020214380032, 02021430047). We would also like to thank the support from the Collaborative Innovation Center of Novel Software Technology and Industrialization, Jiangsu, China. Yuming Zhou and Baowen Xu are the corresponding authors.