Program Tailoring:Slicing by Sequential Criteria rtifac Yue Li*,Tian Tan*,Yifei Zhang,and Jingling Xue School of Computer Science and Engineering,UNSW Australia -Abstract Protocol and typestate analyses often report some sequences of statements ending at a program point P that needs to be scrutinized,since P may be erroneous or imprecisely analyzed.Program slicing focuses only on the behavior at P by computing a slice of the program affecting the values at P.In this paper,we propose to restrict our attention to the subset of that behavior at P affected by one or several statement sequences,called a sequential criterion (SC).By leveraging the ordering information in a SC,e.g.,the temporal order in a few valid/invalid API method invocation sequences,we introduce a new technique,program tailoring,to compute a tailored program that comprises the statements in all possible execution paths passing through at least one sequence in SC in the given order.With a prototyping implementation,TAILOR,we show why tailoring is practically useful by conducting two case studies on seven large real-world Java applications.For program debugging and understanding,TAILOR can complement program slicing by removing SCirrelevant statements.For program analysis,TAILOR can enable a pointer analysis,which is unscalable to a program,to perform a more focused and therefore potentially scalable analysis to its SC-relevant parts containing hard language features such as reflection. 1998 ACM Subject Classification F.3.2 Semantics of Programming Languages-Program Ana- lysis,D.2.5 Testing and Debugging-Code inspections and Debugging aids Keywords and phrases Program Slicing,Program Analysis,API Protocol Analysis Digital Object Identifier XXX/LIPIcs.ECOOP.2016. 1 Introduction Program slicing,supported by industry-strength tools,such as WALA [52]and CodeSurfer [18], has found many diverse applications,such as program debugging,comprehension,analysis testing,verification and optimization 8,20,43,49.Given a slicing criterion consisting of a program point P and several variables used at P [53],program slicing computes a slice of the program that may affect their values at P in terms of data and control dependences.In the past three decades,several variations on this theme of program slicing have been proposed, including static vs.dynamic,backward vs.forward,and closure vs.executable 43,49. In practice,API protocol analysis [7,37]and typestate analysis [9,16,34]often report some statement sequences ending at a program point P that needs to be scrutinized,since P may be erroneous or imprecisely analyzed.Each sequence can represent a valid or invalid API usage call sequence.Such protocol specifications or violations can also be provided manually or mined automatically [1,3,17,36,42,60.As such analyses are either conservative or unsound,the temporal order specified in a sequence may or may not be feasible.However, program slicing focuses on P by ignoring this order,which is often essential for analyzing P. As illustrated in Figure 1,we introduce a new technique,program tailoring to reap additional benefits missed by program slicing at a point P(e.g.,file.write())by leveraging the temporal order specified in a new criterion,called a sequential criterion (SC)for P.A These authors contributed equally to this work Yue Li,Tian Tan,Vifei Zhang.and Jingling Xue licensed under Creative Commons License CC-BY 30th European Conference on Object Oriented Programming (ECOOP 2016). Leibniz International Proceedings in Informatics LIPICS Schloss Dagstuhl-Leibniz-Zentrum fuir Informatik,Dagstuhl Publishing,Germany
C onsistent * Complete * Well D ocumen et d t ysaE * o Reuse * * Eva ul det a * EC O O P * Artifact * AE C Program Tailoring: Slicing by Sequential Criteria Yue Li∗ , Tian Tan∗ , Yifei Zhang, and Jingling Xue School of Computer Science and Engineering, UNSW Australia Abstract Protocol and typestate analyses often report some sequences of statements ending at a program point P that needs to be scrutinized, since P may be erroneous or imprecisely analyzed. Program slicing focuses only on the behavior at P by computing a slice of the program affecting the values at P. In this paper, we propose to restrict our attention to the subset of that behavior at P affected by one or several statement sequences, called a sequential criterion (SC). By leveraging the ordering information in a SC, e.g., the temporal order in a few valid/invalid API method invocation sequences, we introduce a new technique, program tailoring, to compute a tailored program that comprises the statements in all possible execution paths passing through at least one sequence in SC in the given order. With a prototyping implementation, Tailor, we show why tailoring is practically useful by conducting two case studies on seven large real-world Java applications. For program debugging and understanding, Tailor can complement program slicing by removing SC-irrelevant statements. For program analysis, Tailor can enable a pointer analysis, which is unscalable to a program, to perform a more focused and therefore potentially scalable analysis to its SC-relevant parts containing hard language features such as reflection. 1998 ACM Subject Classification F.3.2 Semantics of Programming Languages - Program Analysis, D.2.5 Testing and Debugging - Code inspections and Debugging aids Keywords and phrases Program Slicing, Program Analysis, API Protocol Analysis Digital Object Identifier XXX/LIPIcs.ECOOP.2016. 1 Introduction Program slicing, supported by industry-strength tools, such as WALA [52] and CodeSurfer [18], has found many diverse applications, such as program debugging, comprehension, analysis, testing, verification and optimization [8, 20, 43, 49]. Given a slicing criterion consisting of a program point P and several variables used at P [53], program slicing computes a slice of the program that may affect their values at P in terms of data and control dependences. In the past three decades, several variations on this theme of program slicing have been proposed, including static vs. dynamic, backward vs. forward, and closure vs. executable [43, 49]. In practice, API protocol analysis [7, 37] and typestate analysis [9, 16, 34] often report some statement sequences ending at a program point P that needs to be scrutinized, since P may be erroneous or imprecisely analyzed. Each sequence can represent a valid or invalid API usage call sequence. Such protocol specifications or violations can also be provided manually or mined automatically [1, 3, 17, 36, 42, 60]. As such analyses are either conservative or unsound, the temporal order specified in a sequence may or may not be feasible. However, program slicing focuses on P by ignoring this order, which is often essential for analyzing P. As illustrated in Figure 1, we introduce a new technique, program tailoring to reap additional benefits missed by program slicing at a point P (e.g., file.write()) by leveraging the temporal order specified in a new criterion, called a sequential criterion (SC) for P. A ∗ These authors contributed equally to this work © Yue Li, Tian Tan, Yifei Zhang, and Jingling Xue; licensed under Creative Commons License CC-BY 30th European Conference on Object Oriented Programming (ECOOP 2016). Leibniz International Proceedings in Informatics Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany
XX:2 Program Tailoring:Slicing by Sequential Criteria Sliced Program [write] SC-Irrelevant Statements Traditiona Slicing-Related Criterion lechnigues Analysis Program Debugging Outputs Tools Program Understanding Program aoen close Criterion(SC) Tailoring E.g.,API Protocol write (Error) Analysis 【open+close-+write】 Open→write Tailored Program Program Analysis Figure 1 Program tailoring with some of its potential applications highlighted SCp consists of one or several statement sequences ending at P,with each representing,e.g., a valid API call sequence like file.open()-file.write()or an invalid API call sequence like file.open()-file.close()->file.write().In what follows,we will drop P from SCp when the context is clear or when we are not interested in it.Given a SCp,tailoring aims to obtain a tailored program,denoted T(SCp),that comprises the statements in all the possible execution paths passing through at least one statement sequence in SCp in the given order.By construction,all data and control dependences needed for understanding the behavior at P affected by SCp are included.Any statement that is not in T(SCp)is irrelevant to SCp,i.e.,SCp-irrelevant.For illustration purposes,we write S(P)to represent the(backward)slice affecting P obtained by program slicing.Note that slicing all the points in SCp independently still fails to capture their ordering constraint(and is unscalable,too). Like slicing,tailoring enables software developers or client applications to inspect only the interesting parts of a program.Unlike slicing,which focuses on understanding the program behavior at P,tailoring restricts our attention to the subset of that behavior affected by SCp only.Due to incompatible criteria used,tailoring can be used either as a complementary technique to slicing or in cases where slicing is ineffective,as discussed below. 1.1 Goals and Motivations Given a SCp,we have developed a prototyping implementation of program tailoring,denoted TAILOR,for Java programs,with the following three goals in mind: Precision TAILOR is designed to sharpen the precision of many client applications,as highlighted in Figure 1,by exploiting the temporal order in a SCp.One significant class of client applications includes many slicing-related techniques,such as thin slicing [47], program chopping 22 and value slicing 26.For a given program,T(SCp)is shown as the blue circle and S(P)as the white circle.The statements in Z(SCp)=S(P)-S(P)n T(SCp)are SCp-irrelevant and can thus be pruned away to facilitate program debugging and understanding by a human.If Z(SCp)=,then TAILOR is ineffective at P but no harm is done.If I(SCp),then TAILOR can make slicing more precise,by leveraging the otherwise wasted SCp information that is widely available.In Section 6.1,we show that TAILOR can improve the precision of thin slicing [47],a state-of-the-art practical but unsound slicing technique,for Java programs.In Section 6.2,we show that TAILOR can enable a sophisticated pointer analysis for Java,S-2OBJ [24],which is unscalable for a program,to perform a focused analysis on its SC-relevant parts containing hard language features such as refection,where existing slicing techniques are ineffective. Scalability TAILOR is designed to work efficiently for large object-oriented programs,for which traditional slicing [21]is unscalable(even with industry-strength implementations [18,521), with the key bottleneck coming from handling of the heap [47].Like any slicing tool, TAILOR is not always scalable.However,TAILOR is designed to scale significantly better
XX:2 Program Tailoring: Slicing by Sequential Criteria Analysis E.g., API Protocol Analysis Outputs open close write (Error) Tailoring Sequential Program Criterion (SC) [open close write] Traditional Criterion [write] Techniques Slicing-Related Program Debugging Program Understanding … ... Program Analysis SC-Irrelevant Statements open write [open write] Sliced Program Tailored Program Tools Figure 1 Program tailoring with some of its potential applications highlighted. SC P consists of one or several statement sequences ending at P, with each representing, e.g., a valid API call sequence like file.open() → file.write() or an invalid API call sequence like file.open() → file.close() → file.write(). In what follows, we will drop P from SC P when the context is clear or when we are not interested in it. Given a SC P , tailoring aims to obtain a tailored program, denoted T (SC P ), that comprises the statements in all the possible execution paths passing through at least one statement sequence in SC P in the given order. By construction, all data and control dependences needed for understanding the behavior at P affected by SC P are included. Any statement that is not in T (SC P ) is irrelevant to SC P , i.e., SCP -irrelevant. For illustration purposes, we write S(P) to represent the (backward) slice affecting P obtained by program slicing. Note that slicing all the points in SC P independently still fails to capture their ordering constraint (and is unscalable, too). Like slicing, tailoring enables software developers or client applications to inspect only the interesting parts of a program. Unlike slicing, which focuses on understanding the program behavior at P, tailoring restricts our attention to the subset of that behavior affected by SC P only. Due to incompatible criteria used, tailoring can be used either as a complementary technique to slicing or in cases where slicing is ineffective, as discussed below. 1.1 Goals and Motivations Given a SC P , we have developed a prototyping implementation of program tailoring, denoted Tailor, for Java programs, with the following three goals in mind: Precision Tailor is designed to sharpen the precision of many client applications, as highlighted in Figure 1, by exploiting the temporal order in a SC P . One significant class of client applications includes many slicing-related techniques, such as thin slicing [47], program chopping [22] and value slicing [26]. For a given program, T (SC P ) is shown as the blue circle and S(P) as the white circle. The statements in I(SC P ) = S(P) − S(P) ∩ T (SC P ) are SC P -irrelevant and can thus be pruned away to facilitate program debugging and understanding by a human. If I(SC P ) = ∅, then Tailor is ineffective at P but no harm is done. If I(SC P ) 6= ∅, then Tailor can make slicing more precise, by leveraging the otherwise wasted SC P information that is widely available. In Section 6.1, we show that Tailor can improve the precision of thin slicing [47], a state-of-the-art practical but unsound slicing technique, for Java programs. In Section 6.2, we show that Tailor can enable a sophisticated pointer analysis for Java, S-2Obj [24], which is unscalable for a program, to perform a focused analysis on its SC-relevant parts containing hard language features such as reflection, where existing slicing techniques are ineffective. Scalability Tailor is designed to work efficiently for large object-oriented programs, for which traditional slicing [21] is unscalable (even with industry-strength implementations [18,52]), with the key bottleneck coming from handling of the heap [47]. Like any slicing tool, Tailor is not always scalable. However, Tailor is designed to scale significantly better
Y.Li,T.Tan,Y.Zhang,and J.Xue XX:3 than traditional slicing,as TAILOR avoids handling of the heap by reasoning about essentially the (un)reachability of a statement towards the statement sequences in a SCp. Soundiness TAILOR is designed to be a practical tool to facilitate programming debugging and understanding as well as program analysis (among others)for large object-oriented programs.In this setting,any sound static analysis would be either unscalable or imprecise due to the presence of many hard language features,such as native code,dynamic class loading,reflection and multi-threading [30].Therefore,TAILOR is designed to be sound whenever a sound graph representation is available for capturing all the control flows in a (single-or multi-threaded)program.In other words,TAILOR is always sound with respect to part of the program behavior modeled.According to [30],"a soundy analysis aims to be as sound as possible without excessively compromising precision and/or scalability." Therefore,TAILOR represents one such soundy analysis. 1.2 Challenges and Solutions We examine some challenges faced in achieving our three goals and describe our solutions: Precision/Scalability Tradeoffs How do we compute T(SC)efficiently and precisely for large object-oriented programs?Due to exponential blowup of program paths,both DFS and BFS are out of question.We formulate the problem of computing T(SC)by solving an IFDS(Interprocedural Finite Distributive Subset)data-flow analysis problem [38], efficiently on its interprocedural control-flow graph(ICFG)representation of the program. To avoid unrealizable paths with mismatched calls and returns,our analysis must be (fully)context-sensitive in order to achieve useful precision for Java programs.Otherwise, many unrealizable paths,which go through the constructor of java.lang.Object,cannot be filtered out,causing T(SC)to be severely over-approximated.However,distinguishing calling contexts fully with call strings,object-sensitivity [33],and method cloning 54]are known to be unscalable for large programs [12,44].We achieve (full)context sensitivity by solving a CFL-reachability problem over a balanced-parentheses language by matching call and return edges also in the IFDS framework as described in 38. How do we ensure that TAILOR works effectively for a SC of any given length,which is defined to be the number of statements in its longest statement sequence?In general, the longer a SC is,the more SC-irrelevant statements will be removed.We propose to lengthen any given SCby leveraging the concept of object-sensitivity [33,44]developed by the pointer analysis community for Java.As a result,some infeasible paths that would otherwise be introduced are avoided.We will also try to avoid making SC extensions that make our analysis run longer but contribute nothing to precision improvement. Soundiness How do we make TAILOR as soundly as possible?We decompose the problem of tailoring a program for a given SC into two sub-problems:(1)building an ICFG,GICFG, for the program and(2)computing T(SC)from GICFG.TAILOR is sound if GICFG is sound(representation of all control flows in the program).In fact,TAILOR is designed to be practically useful for analyzing the program behavior modeled by GICFG even if GICFG is unsound.This paper solves(2)while resorting to the state-of-the-art for(1). 1.3 Contributions We introduce program tailoring,a new technique for trimming a program based on SCs, which are widely available from other analyses but never exploited by program slicing. We describe how to extend a given SC for object-oriented programs in order to improve the precision of program tailoring (by making tailored programs smaller). EC00P2016
Y. Li, T. Tan, Y. Zhang, and J. Xue XX:3 than traditional slicing, as Tailor avoids handling of the heap by reasoning about essentially the (un)reachability of a statement towards the statement sequences in a SC P . Soundiness Tailor is designed to be a practical tool to facilitate programming debugging and understanding as well as program analysis (among others) for large object-oriented programs. In this setting, any sound static analysis would be either unscalable or imprecise due to the presence of many hard language features, such as native code, dynamic class loading, reflection and multi-threading [30]. Therefore, Tailor is designed to be sound whenever a sound graph representation is available for capturing all the control flows in a (single- or multi-threaded) program. In other words, Tailor is always sound with respect to part of the program behavior modeled. According to [30], “a soundy analysis aims to be as sound as possible without excessively compromising precision and/or scalability.” Therefore, Tailor represents one such soundy analysis. 1.2 Challenges and Solutions We examine some challenges faced in achieving our three goals and describe our solutions: Precision/Scalability Tradeoffs How do we compute T (SC) efficiently and precisely for large object-oriented programs? Due to exponential blowup of program paths, both DFS and BFS are out of question. We formulate the problem of computing T (SC) by solving an IFDS (Interprocedural Finite Distributive Subset) data-flow analysis problem [38], efficiently on its interprocedural control-flow graph (ICFG) representation of the program. To avoid unrealizable paths with mismatched calls and returns, our analysis must be (fully) context-sensitive in order to achieve useful precision for Java programs. Otherwise, many unrealizable paths, which go through the constructor of java.lang.Object, cannot be filtered out, causing T (SC) to be severely over-approximated. However, distinguishing calling contexts fully with call strings, object-sensitivity [33], and method cloning [54] are known to be unscalable for large programs [12, 44]. We achieve (full) context sensitivity by solving a CFL-reachability problem over a balanced-parentheses language by matching call and return edges also in the IFDS framework as described in [38]. How do we ensure that Tailor works effectively for a SC of any given length, which is defined to be the number of statements in its longest statement sequence? In general, the longer a SC is, the more SC-irrelevant statements will be removed. We propose to lengthen any given SC by leveraging the concept of object-sensitivity [33, 44] developed by the pointer analysis community for Java. As a result, some infeasible paths that would otherwise be introduced are avoided. We will also try to avoid making SC extensions that make our analysis run longer but contribute nothing to precision improvement. Soundiness How do we make Tailor as soundly as possible? We decompose the problem of tailoring a program for a given SC into two sub-problems: (1) building an ICFG, GICFG, for the program and (2) computing T (SC) from GICFG. Tailor is sound if GICFG is sound (representation of all control flows in the program). In fact, Tailor is designed to be practically useful for analyzing the program behavior modeled by GICFG even if GICFG is unsound. This paper solves (2) while resorting to the state-of-the-art for (1). 1.3 Contributions We introduce program tailoring, a new technique for trimming a program based on SCs, which are widely available from other analyses but never exploited by program slicing. We describe how to extend a given SC for object-oriented programs in order to improve the precision of program tailoring (by making tailored programs smaller). E COO P 2 0 1 6
XX:4 Program Tailoring:Slicing by Sequential Criteria We formulate the problem of computing a tailored program as one of solving a data-flow problem efficiently in the IFDS framework.TAILOR,which is implemented in Soor [51], is released as an open-source tool at http://www.cse.unsw.edu.au/~corg/tailor We describe two case studies to demonstrate why TAILOR is practically useful on a set of seven large real-world Java programs,by(1)assisting program slicing with program debugging and understanding tasks,and(2)enabling a focused pointer analysis on the parts of a program containing hard language features such as reflection. 2 A Motivating Example We use an example to describe how TAILOR can assist slicing tools to simplify program debugging and understanding tasks through exploiting the temporal ordering information in a given SCthat is otherwise ignored by program slicing.In Section 6.2,we provide additional motivations on why TAILOR can be useful in simplifying program analysis tasks. Large object-oriented programs are very difficult to debug and understand,due to the pervasive use of heap-allocated data,nested data structures,and large libraries with complex dependences and configurations.Tracing the flow of values via multiple levels of pointer indirection through the heap across many classes in both the application and libraries is unworkable.A practical tool is needed to pinpoint relevant statements for the task at hand. Our Java example is given in Figure 2.The Driver class is used to create and initialize a Driver object according to some user input(lines 33-36)or by default (lines 37-41).Then the corresponding initialization information stored in info is dumped to a log at line 42. 1 class Driver{ void main(String[]args){ Writer fw new FileWriter(..): 32 Driver d:String info; Driver (String s){...) if (args[o].equals(...)){ Driver0{】 3 d new Driver(args[o]): void config(String[]args){ d.config(args); Writer bw=new BufferedWriter(fw): info getConfiglnfo(args): > for(int i=1:i else{ 8 bw.write(args间+"n: 3 d=new Driver(; bw.close(): File f getSystemFile(...): 10 } 4 info=getSystemlnfo(f): void log(String info){ fw.write(info); 2 d.log(info); 13 433} 4} :15 class FileWriter{ 23 class BufferedWriter{ 16 boolean isopen tre Writer out: 17 void close isOpen false; BufferedWriter(Writer w)fout w.} :18 void write(...){ 26 void close({ 19 if(isOpen) 27 out.close(): :20 throw new IOException(); 28 21 29} 221 30 Figure 2 An example showing how TAILOR removes SCline 12-irrelevant statements. This example has a typical error found in Java programs caused by ignoring the fact that closing a wrapper file handler will also close its internal file handler.The internal file handler,fw is passed as an argument at line 6 and assigned to out at line 25.Later,closing its wrapper file handler,bw,at line 9 will also close out (i.e.,fw)at line 27.Then any further access to a closed fw(e.g.,at line 12)will trigger a runtime exception at line 20. Now we use a static typestate analysis tool CLARA [9]to analyze this program.Some
XX:4 Program Tailoring: Slicing by Sequential Criteria We formulate the problem of computing a tailored program as one of solving a data-flow problem efficiently in the IFDS framework. Tailor, which is implemented in Soot [51], is released as an open-source tool at http://www.cse.unsw.edu.au/~corg/tailor. We describe two case studies to demonstrate why Tailor is practically useful on a set of seven large real-world Java programs, by (1) assisting program slicing with program debugging and understanding tasks, and (2) enabling a focused pointer analysis on the parts of a program containing hard language features such as reflection. 2 A Motivating Example We use an example to describe how Tailor can assist slicing tools to simplify program debugging and understanding tasks through exploiting the temporal ordering information in a given SC that is otherwise ignored by program slicing. In Section 6.2, we provide additional motivations on why Tailor can be useful in simplifying program analysis tasks. Large object-oriented programs are very difficult to debug and understand, due to the pervasive use of heap-allocated data, nested data structures, and large libraries with complex dependences and configurations. Tracing the flow of values via multiple levels of pointer indirection through the heap across many classes in both the application and libraries is unworkable. A practical tool is needed to pinpoint relevant statements for the task at hand. Our Java example is given in Figure 2. The Driver class is used to create and initialize a Driver object according to some user input (lines 33 – 36) or by default (lines 37 – 41). Then the corresponding initialization information stored in info is dumped to a log at line 42. class Driver { Writer fw = new FileWriter(...); Driver (String s) {…} Driver () {…} void config(String[] args) { Writer bw = new BufferedWriter(fw); for(int i = 1; i < args.length; i++) bw.write(args[i] + "\n"); bw.close(); } void log(String info) { fw.write(info); } } 1 2 3 4 5 6 7 8 9 10 11 12 13 14 class FileWriter { boolean isOpen = true; void close() { isOpen = false; } if (!isOpen) throw new IOException(); } } void write(...) { 15 16 17 18 19 20 21 22 class BufferedWriter { Writer out; void close() { } } out.close(); BufferedWriter(Writer w) {out = w;} 23 24 25 26 27 28 29 30 void main(String[] args) { Driver d; String info; if (args[0].equals(…)) { d = new Driver(args[0]); } else { } d.config(args); info = getConfigInfo(args); d = new Driver(); File f = getSystemFile(...); info = getSystemInfo(f); d.log(info); } 40 41 42 43 31 32 33 34 35 36 37 38 39 Figure 2 An example showing how Tailor removes SCline 12-irrelevant statements. This example has a typical error found in Java programs caused by ignoring the fact that closing a wrapper file handler will also close its internal file handler. The internal file handler, fw is passed as an argument at line 6 and assigned to out at line 25. Later, closing its wrapper file handler, bw, at line 9 will also close out (i.e., fw) at line 27. Then any further access to a closed fw (e.g., at line 12) will trigger a runtime exception at line 20. Now we use a static typestate analysis tool Clara [9] to analyze this program. Some
Y.Li,T.Tan,Y.Zhang,and J.Xue XX:5 Potential Point of Failure typestate specifications regarding file operations used will in- Statement:fw.write(info)at line 12 clude "accessing a closed file leads to an error state".For our Related Program Points: Statement out.close()at line 27 example,CLARA produces an error report shown on the left Statement:new FileWriter(..)at line 2 marking line 12 as a"potential point of failure",together with a sequence of two method calls leading to line 12.We have one SCline 12:line 2-line 27-line 12.As static analyses like CLARA are either conservative or unsound,there may or may not be an error at line 12.Now our debugging task begins. The error at line 12 happens only when a Driver object is created in the if branch of main().Therefore,a tool that instructs the developer to examine this if branch only would enable its cause to be identified quickly.In contrast,marking some lines in its matching else branch as also being relevant can increase human analysis effort significantly(especially if the developer has to trace the flow of values across many nested heap structures). Traditional Slicing For large object-oriented programs,traditional(sound)slicing [21,53] is unscalable or yields slices that are too large for human consumption 47.Given a virtual call fw.write(info)at line 12,the set of variables of interest consists of(1)the receiver reference and the arguments of the call [52],and(2)some relevant variables selected manually or recognized automatically [2.13.In our example,these variables form the set {fw,info,fw.isOpen}.Then,a(backward)slice that affects their values comprises all the statements except lines 7-8 and 19-20.This slice,which includes everything in main(), contains too many statements that are not all directly relevant to the task at line 12. Thin Slicing Thin slicing [47]is introduced to facilitate program debugging and understand- ing for object-oriented programs by trading soundness for scalability and(direct)relevance. All control dependences and all base pointer data dependences are excluded.Given a point of interest,thin slicing includes only so-called producer statements that affect directly the values at the point.Statements that serve to explain why producer statements affect the point are ignored.For example,given x=p.f and q.f=y,where p and q may be aliased, q.f=y is a producer statement for r=p.f,because there may be a direct value flow from y to z.All statements that help explain or establish why p and q are aliases are ignored. If we adopt the same slicing criterion at line 12 as above,thin slicing will include only seven statements at lines 2,12,16.17,36,39 and 40.Compared with the traditional slice obtained,this smaller slice still retains line 17,a statement for explaining an immediate cause of the error at line 12.However,two SCline 12-irrelevant statements at lines 39-40(in the else branch of main())are also present,which can cost human analysis effort unnecessarily. Program Tailoring Given SCline 12:line 2-line 27-line 12,TAILOR produces a tailored program consisting of all the statements in all execution paths passing through lines 2,27 and 12 in that order.As line 27 is only reachable from line 9,which resides in the config method invoked at line 35,the tailored program includes all the lines in the example except lines 37-41 that appear in the else branch of main()and lines 19-20. Let us revisit the two slices obtained above,the traditional slice,Ptrad and the thin slice,Pthin.Let our tailored program be identified as Ptal.Despite Pthin<Ptall<Ptradl, tailoring brings several benefits,obtained from exploiting the temporal order of invocation sequences in SCline 12.First,Ptal is the only one that includes nothing from the else branch of main(),revealing more clearly to a human debugger that the potential error at line 12 is triggered by a Driver object created in the if branch of main()("according to some user input")rather than in its matching else branch ("according to some default configuration").Such contextual information enables the developer to locate the cause for the error at line 12 more quickly.In the case of Ptrad and Pthin,the developer may end up wasting a lot of analysis effort on navigating through a lot of SCline 12-irrelevant code, EC00P2016
Y. Li, T. Tan, Y. Zhang, and J. Xue XX:5 Potential Point of Failure : Statement: fw.write(info) at line 12 Related Program Points : Statement: out.close() at line 27 Statement: new FileWriter(...) at line 2 typestate specifications regarding file operations used will include “accessing a closed file leads to an error state”. For our example, Clara produces an error report shown on the left, marking line 12 as a “potential point of failure”, together with a sequence of two method calls leading to line 12. We have one SCline 12 : line 2 → line 27 → line 12. As static analyses like Clara are either conservative or unsound, there may or may not be an error at line 12. Now our debugging task begins. The error at line 12 happens only when a Driver object is created in the if branch of main(). Therefore, a tool that instructs the developer to examine this if branch only would enable its cause to be identified quickly. In contrast, marking some lines in its matching else branch as also being relevant can increase human analysis effort significantly (especially if the developer has to trace the flow of values across many nested heap structures). Traditional Slicing For large object-oriented programs, traditional (sound) slicing [21, 53] is unscalable or yields slices that are too large for human consumption [47]. Given a virtual call fw.write(info) at line 12, the set of variables of interest consists of (1) the receiver reference and the arguments of the call [52], and (2) some relevant variables selected manually or recognized automatically [2, 13]. In our example, these variables form the set {fw, info, fw.isOpen}. Then, a (backward) slice that affects their values comprises all the statements except lines 7 – 8 and 19 – 20. This slice, which includes everything in main(), contains too many statements that are not all directly relevant to the task at line 12. Thin Slicing Thin slicing [47] is introduced to facilitate program debugging and understanding for object-oriented programs by trading soundness for scalability and (direct) relevance. All control dependences and all base pointer data dependences are excluded. Given a point of interest, thin slicing includes only so-called producer statements that affect directly the values at the point. Statements that serve to explain why producer statements affect the point are ignored. For example, given x = p.f and q.f = y, where p and q may be aliased, q.f = y is a producer statement for x = p.f, because there may be a direct value flow from y to x. All statements that help explain or establish why p and q are aliases are ignored. If we adopt the same slicing criterion at line 12 as above, thin slicing will include only seven statements at lines 2, 12, 16, 17, 36, 39 and 40. Compared with the traditional slice obtained, this smaller slice still retains line 17, a statement for explaining an immediate cause of the error at line 12. However, two SCline 12-irrelevant statements at lines 39 – 40 (in the else branch of main()) are also present, which can cost human analysis effort unnecessarily. Program Tailoring Given SCline 12 : line 2→line 27→line 12, Tailor produces a tailored program consisting of all the statements in all execution paths passing through lines 2, 27 and 12 in that order. As line 27 is only reachable from line 9, which resides in the config method invoked at line 35, the tailored program includes all the lines in the example except lines 37 – 41 that appear in the else branch of main() and lines 19 – 20. Let us revisit the two slices obtained above, the traditional slice, Ptrad and the thin slice, Pthin. Let our tailored program be identified as Ptal. Despite |Pthin| < |Ptal| < |Ptrad|, tailoring brings several benefits, obtained from exploiting the temporal order of invocation sequences in SCline 12. First, Ptal is the only one that includes nothing from the else branch of main(), revealing more clearly to a human debugger that the potential error at line 12 is triggered by a Driver object created in the if branch of main() (“according to some user input”) rather than in its matching else branch (“according to some default configuration”). Such contextual information enables the developer to locate the cause for the error at line 12 more quickly. In the case of Ptrad and Pthin, the developer may end up wasting a lot of analysis effort on navigating through a lot of SCline 12-irrelevant code, E COO P 2 0 1 6
XX:6 Program Tailoring:Slicing by Sequential Criteria highlighted by getSystemInfo()and getSystemFile(),in the else branch.Second,any statement that is not included in Ptal is SCline 12-irrelevant for understanding the SC-specific behavior at line 12.Thus,TAILOR can make,e.g.,thin slicing more effective,by removing the SCline 12-irrelevant statements Pthin-Pthin nPtal,i.e.,lines 39 and 40,from Pthin.Then Pthin is down to only five statements at lines 2,12,16,17 and 36.Starting from line 17 (an immediate cause for the error at line 12),the developer can trace the flow of isOpen backwards to find the original cause.Finally,Ptal includes all data and control dependences reaching line 12 affected by SCline 12,enabling it to be analyzed further by other analyses, e.g.,a pointer analysis,as will be discussed in Section 6.However,Ptrad and Pthin will not be applicable since Ptrad is unobtainable scalably for large programs and Pthin is unsound. Note that Ptal still contains lines 7-8 that do not affect SCline 12.Removing all such irrelevant statements for large programs may be neither necessary (due to the first two points made earlier)nor practical,as a sound slicing tool would be unscalable.Thus,we have designed TAILOR based on the precision/scalability tradeoffs described in Section 1. Figure 3 recaps our insight behind tailoring.Given a SC:A-B-C.we focus on the behavior at C affected ntra-proco by a sequential execution of A,B and C.If one point in SC is reached from only one branch (e.g.,the one containing A) Figure 3 Leveraging the order- of a multi-way branching statement,then the statements in ing information SC:A-BC the other branches are SC-irrelevant and can be trimmed to trim irrelevant statements away.away (x).This is particularly suitable for object-oriented languages,since a virtual call site behaves as a multi-way branch switching to its target methods.For example,B can be regarded as residing in a target method invoked at the marked virtual call on a receiver object created only at the allocation site at A.Thus,the other target methods unreachable to C are trimmed away (x). 3 Methodology Figure 4 gives an overview of TAILOR,with all the components implemented in this paper highlighted in blue.Given a program,we rely on the state-of-the-art (shown as "ICFG Construction")to build an inter-procedural control flow graph(ICFG),denoted GICFG, to represent all the possible control flows in the program.A SC is simply a set of one or more statement sequences ending at the same statement,with all statements identified by their line numbers,i.e.,program points only.The length of a SC is the number of statements in its longest sequence.SCs can be deduced from the results returned by many analysis tools such as API protocol analysis [7,37]and typestate analysis [9,16,34].For example,a typestate analysis may report a potential error at line C,f.write(),together with two invocation sequences of related methods,A:f.open()B1:f1.close()and A:f.open()B2:f2.close(),leading to line C.Therefore,we may choose to tailor the program at C to investigate its behavior affected by SC={A→B1→C,A→B2→C}. Given a SC,a tailored program,T(SC),consists of the statements on all possible execution paths in GIcFc passing through at least one statement sequence in SC.By exploiting the temporal ordering information in SC,it is possible to scale tailoring significantly better than slicing for large object-oriented programs and makes it a practically useful technique. TAILOR is sound as it computes T(SC)over-approximately with respect to GICFG.In contrast,traditional(sound)slicing [21]is unscalable when GIcFG is large and thin slicing [47]
XX:6 Program Tailoring: Slicing by Sequential Criteria highlighted by getSystemInfo() and getSystemFile(), in the else branch. Second, any statement that is not included in Ptal is SCline 12-irrelevant for understanding the SC-specific behavior at line 12. Thus, Tailor can make, e.g., thin slicing more effective, by removing the SCline 12-irrelevant statements Pthin − Pthin ∩Ptal, i.e., lines 39 and 40, from Pthin. Then Pthin is down to only five statements at lines 2, 12, 16, 17 and 36. Starting from line 17 (an immediate cause for the error at line 12), the developer can trace the flow of isOpen backwards to find the original cause. Finally, Ptal includes all data and control dependences reaching line 12 affected by SCline 12, enabling it to be analyzed further by other analyses, e.g., a pointer analysis, as will be discussed in Section 6. However, Ptrad and Pthin will not be applicable since Ptrad is unobtainable scalably for large programs and Pthin is unsound. A … B C X X X Virtual Call Branch Intra-procedural Inter-procedural control flow control flow Figure 3 Leveraging the ordering information SC : A → B → C to trim irrelevant statements away. Note that Ptal still contains lines 7 – 8 that do not affect SCline 12. Removing all such irrelevant statements for large programs may be neither necessary (due to the first two points made earlier) nor practical, as a sound slicing tool would be unscalable. Thus, we have designed Tailor based on the precision/scalability tradeoffs described in Section 1. Figure 3 recaps our insight behind tailoring. Given a SC : A → B → C, we focus on the behavior at C affected by a sequential execution of A, B and C. If one point in SC is reached from only one branch (e.g., the one containing A) of a multi-way branching statement, then the statements in the other branches are SC-irrelevant and can be trimmed away ( ). This is particularly suitable for object-oriented languages, since a virtual call site behaves as a multi-way branch switching to its target methods. For example, B can be regarded as residing in a target method invoked at the marked virtual call on a receiver object created only at the allocation site at A. Thus, the other target methods unreachable to C are trimmed away ( ). 3 Methodology Figure 4 gives an overview of Tailor, with all the components implemented in this paper highlighted in blue. Given a program, we rely on the state-of-the-art (shown as “ICFG Construction”) to build an inter-procedural control flow graph (ICFG), denoted GICFG, to represent all the possible control flows in the program. A SC is simply a set of one or more statement sequences ending at the same statement, with all statements identified by their line numbers, i.e., program points only. The length of a SC is the number of statements in its longest sequence. SCs can be deduced from the results returned by many analysis tools such as API protocol analysis [7, 37] and typestate analysis [9, 16, 34]. For example, a typestate analysis may report a potential error at line C, f.write(), together with two invocation sequences of related methods, A : f.open() → B1 : f1.close() and A : f.open() → B2 : f2.close(), leading to line C. Therefore, we may choose to tailor the program at C to investigate its behavior affected by SC = {A → B1 → C, A → B2 → C}. Given a SC, a tailored program, T (SC), consists of the statements on all possible execution paths in GICFG passing through at least one statement sequence in SC. By exploiting the temporal ordering information in SC, it is possible to scale tailoring significantly better than slicing for large object-oriented programs and makes it a practically useful technique. Tailor is sound as it computes T (SC) over-approximately with respect to GICFG. In contrast, traditional (sound) slicing [21] is unscalable when GICFG is large and thin slicing [47]
Y.Li,T.Tan,Y.Zhang,and J.Xue XX:7 TAILOR Program rogram Analysis Tools SCEXT) SC SC-Based Data-Flow Analysis (SCDFA) Figure 4 Overview of TAILOR(with all the components implemented in this paper in blue). is unsound for GICFG (as it is designed for program debugging and understanding only). TAILOR computes T(SC)in two stages,SC Extension and SC-based Data-Flow Analysis. In the first stage(Section 3.2),we exploit"branch correlations"in object-oriented programs to lengthen a given SC in order to avoid some infeasible paths that would otherwise be introduced in the second stage.In the second stage(Section 3.1),we compute T(SC)by solving a data-flow problem in order to avoid unrealizable paths efficiently.This design allows TAILOR to achieve good precision and scale well to large object-oriented programs. 3.1 SC-Based Data-Flow Analysis We compute T(SC)by solving flow-and context-sensitively an IFDS(Interprocedural Finite Distributive Subset)data-flow problem [38],efficiently on GIcrG,via graph reachability. This formulation of our SC-based data-flow analysis,denoted SCDFA,is significant for three reasons.(1)With flow-sensitivity,SCDFA can filter out imprecisely ordered statement sequences in a SC,as many static analyses from which SCs are deduced are flow-insensitive in order to be scalable.Conversely,precisely ordered statement sequences in SC also enable SCDFA to filter out irrelevant control flows in tailoring a program.This mutually beneficial process improves the precision of both parts,therefore avoiding the unnecessary complexity faced for solving both as one problem.(2)With context-sensitivity,we avoid introducing unrealizable paths with mismatched calls and returns,which is critical for achieving precision in computing T(C).(3)With an IFDS formulation,SCDFA can scale well to reasonably large object-oriented programs.In particular,(full)context-sensitivity can be realized more efficiently by solving a CFL-reachability problem over a simplified balanced-parentheses language.As an IFDS problem,SCDFA has a time complexity of O(ED3),where E is the number of edges in GIcFG and D is the size of the set of SCrelated data-flow facts used(i.e., the set of suffixes of sequences in SC,as will be clear below and defined in Section 4). SCDFA starts with a bottom-up pass(BU)and finishes with a top-down pass(TD),with both performed(fully)context-sensitively.By using the example program from Figure 2, we first explain the functionalities of two passes and then examine briefy how context- sensitivity is realized efficiently in the IFDS framework.Suppose we are given a SC as line 2:fw new FileWriter()-line 27:out.close()-line 12:fw.write().For convenience,we write it as SC:n-cw.In its GIcFG,the allocation site for Filewriter at line 2 is replicated in the two constructors of Driver.We identify the one in the single-arg constructor as line 2(denoted by n)and the one in the non-arg constructor as line 2'(denoted byn.As a result,we finally have a two-sequence SCo={n→c→w,n'→c→w}. BU and TD Passes Both passes operate on GICFG,as shown in Figure 5,for our example. As in any data-flow analysis,all control flow edges in GICFG are treated non-deterministically executable.Let ENTRY be the entry node of GICFG and EXIT the node marking the point of interest w at line 12 in SC.Conceptually,if a sequence SSC,which is ncw or n'cw, lies on a path,then BU must see a suffix of S at every node n on the path backwards from EXIT and TD must see the corresponding prefix of S at n forwards from ENTRY. EC00P2016
Y. Li, T. Tan, Y. Zhang, and J. Xue XX:7 Extended SC Bottom-Up Pass Top-Down Pass TAILOR Tailored Program E.g., API Protocol Analysis Analysis Tools SC Extension Program ICFG Construction GICFG SC-Based Data-Flow Analysis (SCDFA) SC (SCEXT) Figure 4 Overview of Tailor (with all the components implemented in this paper in blue). is unsound for GICFG (as it is designed for program debugging and understanding only). Tailor computes T (SC) in two stages, SC Extension and SC-based Data-Flow Analysis. In the first stage (Section 3.2), we exploit “branch correlations” in object-oriented programs to lengthen a given SC in order to avoid some infeasible paths that would otherwise be introduced in the second stage. In the second stage (Section 3.1), we compute T (SC) by solving a data-flow problem in order to avoid unrealizable paths efficiently. This design allows Tailor to achieve good precision and scale well to large object-oriented programs. 3.1 SC-Based Data-Flow Analysis We compute T (SC) by solving flow- and context-sensitively an IFDS (Interprocedural Finite Distributive Subset) data-flow problem [38], efficiently on GICFG, via graph reachability. This formulation of our SC-based data-flow analysis, denoted SCDFA, is significant for three reasons. (1) With flow-sensitivity, SCDFA can filter out imprecisely ordered statement sequences in a SC, as many static analyses from which SCs are deduced are flow-insensitive in order to be scalable. Conversely, precisely ordered statement sequences in SC also enable SCDFA to filter out irrelevant control flows in tailoring a program. This mutually beneficial process improves the precision of both parts, therefore avoiding the unnecessary complexity faced for solving both as one problem. (2) With context-sensitivity, we avoid introducing unrealizable paths with mismatched calls and returns, which is critical for achieving precision in computing T (SC). (3) With an IFDS formulation, SCDFA can scale well to reasonably large object-oriented programs. In particular, (full) context-sensitivity can be realized more efficiently by solving a CFL-reachability problem over a simplified balanced-parentheses language. As an IFDS problem, SCDFA has a time complexity of O(ED3 ), where E is the number of edges in GICFG and D is the size of the set of SC-related data-flow facts used (i.e., the set of suffixes of sequences in SC, as will be clear below and defined in Section 4). SCDFA starts with a bottom-up pass (BU) and finishes with a top-down pass (TD), with both performed (fully) context-sensitively. By using the example program from Figure 2, we first explain the functionalities of two passes and then examine briefly how contextsensitivity is realized efficiently in the IFDS framework. Suppose we are given a SC as line 2 : fw = new FileWriter() → line 27 : out.close() → line 12 : fw.write(). For convenience, we write it as SC : n→c→w. In its GICFG, the allocation site for FileWriter at line 2 is replicated in the two constructors of Driver. We identify the one in the single-arg constructor as line 2 (denoted by n) and the one in the non-arg constructor as line 2 0 (denoted by n 0 ). As a result, we finally have a two-sequence SC w = {n→c→w, n0→c→w}. BU and TD Passes Both passes operate on GICFG, as shown in Figure 5, for our example. As in any data-flow analysis, all control flow edges in GICFG are treated non-deterministically executable. Let ENTRY be the entry node of GICFG and EXIT the node marking the point of interest w at line 12 in SC w. Conceptually, if a sequence S ∈ SC w, which is ncw or n 0 cw, lies on a path, then BU must see a suffix of S at every node n on the path backwards from EXIT and TD must see the corresponding prefix of S at n forwards from ENTRY. E COO P 2 0 1 6
XX:8 Program Tailoring:Slicing by Sequential Criteria TD BU BU Entry Node ......ncw.cw,w){ncw} 33) ncw .cwcw w.......... …{w}…{ncw】 ncw )--(ncw.cw.w) 34 new Driver new Driver 38 dw《ncw) n 2 new() new Filewiter 2 n {Gw…Gww} 35 dconfig0☐ getSystemFale 3w)(ncw) {cw…{cww} c 27 outclose() …w}《nCw {w}…{w}… 3 getConfiglngetSystemlnlo40 {w}w …《W}…{ncw} ao902- …《w}…{nCww) Non-SC Statame fw.wrRe(12 w -Inter-prooedural control fiow End Node☐ Figure 5 A simplified GIcFG of the program given in Figure 2 for illustrating the bottom-up (BU)and top-down(TD)passes of SCDFA with SC={n→c→w,n'→c→w} BU computes a global property,PANTI,for every node n in GICFc,backwards against the control flow,starting at EXIT.PANTIout(n)represents the set of suffixes of some sequences in SCw that are partially anticipable at the entry of n,i.e.,appear on some outgoing path of n ending at EXIT.In our example,PANTIout (ENTRY)={ncw,cw,w}.As ncw is partially anticipable(but n'cw is not)at ENTRY,GICFG contains some paths passing through ncw. TD computes a global property,PAVAlL,for every node n in GICFG,forwards along the control flow,starting from ENTRY.PAVAlLi(n)specifies the set of suffixes of some sequences in PANTIot(ENTRY)to represent implicitly (for efficiency reasons)the fact that their corresponding prefixes are partially available at the entry of n,i.e.,appear on some incoming paths of n starting from ENTRY.For example,PAVAlLi(12)={ncw,w},indicating that the prefixes e and nc of ncw are partially available at the entry of node 12. For this example,a node n is included in the tailored program if PAVAlLin(n)nPANTIout(n) since a suffix s E PAVAlLin(n)nPANTIout (n)is partially anticipable at the entry of n and some sequence in SC with s removed is partially available at the entry of n.In our example,the tailored program consists of all the lines except lines 38-40. Context Sensitivity Figure 6 illustrates as an example how the TD pass shown in Figure 5 is performed for FileWriter()context-sensitively by solving a CFL-reachability-based balanced-parentheses problem,efficiently call-o-start edge.( in the IFDS framework.For technical new Filewriter(2 details,we refer to 38.There are four 士 data-flow facts,ncw,cw,w,and 0 (for NTER FleWrer.cnnib the empty set).There are two call sites to FileWriter()at lines 2 and 2',which are identified as nodes n and n',respectively, to-return cda with their call-to-start and exit-to-return ●1 35 d.config0 edges labeled as(n:)n,(n and )n appro- ● priately.ENTER and EXIT are the start Figure 6 Context-sensitivity via CFL-reachability. and exit nodes of FileWriter(),whose CFG is irrelevant and thus elided.In SCDFA,the call-to-return edge always serves as a kill edge(to stop the data-flow facts from bypassing a callee).For each node n,PAVAlLin(n)is the set of facts,i.e.,suffixes of ncw shown in black dots.According to Figure 5,PAVAlLin(2)=PAVAlLin(2')={ncw).If FileWriter()
XX:8 Program Tailoring: Slicing by Sequential Criteria TD BU BU TD 36 getConfigInfo() new Driver() 38 33 if (...) getSystemFile() 39 getSystemInfo() 40 End Node 34 new Driver(...) Entry Node 35 d.config() fw.write() 12 2 new FileWriter() new FileWriter() 2' 27 out.close() d.log() 42 { } w { } w { } w { } w { } w { } w { } w { } w { } w { } w { } cw { } cw { } ncw { } ncw { } ncw { } ncw, cw, w { } ncw, cw, w { } ncw, cw, w Non-SC Statement Inter-procedural control flow SC Statement Intra-procedural control flow { } ncw { } ncw { } ncw { } ncw { } ncw { } ncw, w { } cw, w { } cw, w n n' c w Figure 5 A simplified GICFG of the program given in Figure 2 for illustrating the bottom-up (BU) and top-down (TD) passes of SCDFA with SC w = {n→c→w, n0→c→w}. BU computes a global property, PANTI, for every node n in GICFG, backwards against the control flow, starting at EXIT. PANTIout(n) represents the set of suffixes of some sequences in SC w that are partially anticipable at the entry of n, i.e., appear on some outgoing path of n ending at EXIT. In our example, PANTIout(ENTRY) = {ncw, cw, w}. As ncw is partially anticipable (but n 0 cw is not) at ENTRY, GICFG contains some paths passing through ncw. TD computes a global property, PAVAIL, for every node n in GICFG, forwards along the control flow, starting from ENTRY. PAVAILin(n) specifies the set of suffixes of some sequences in PANTIout(ENTRY) to represent implicitly (for efficiency reasons) the fact that their corresponding prefixes are partially available at the entry of n, i.e., appear on some incoming paths of n starting from ENTRY. For example, PAVAILin(12) = {ncw, w}, indicating that the prefixes and nc of ncw are partially available at the entry of node 12. For this example, a node n is included in the tailored program if PAVAILin(n)∩PANTIout(n) 6= ∅, since a suffix s ∈ PAVAILin(n) ∩ PANTIout(n) is partially anticipable at the entry of n and some sequence in SC w with s removed is partially available at the entry of n. In our example, the tailored program consists of all the lines except lines 38 – 40. Context Sensitivity Figure 6 illustrates as an example how the TD pass shown in Figure 5 is performed for FileWriter() context-sensitively by solving a CFL-reachability-based ( n ( n' ) n' ) n 35 d.config() getSystemFile() 39 2 new FileWriter() new FileWriter() 2' FileWriter. ENTER 0 ncw cw w FileWriter. EXIT 0 ncw cw w 0 ncw cw w 0 ncw cw w 0 ncw cw w 0 ncw cw w call-to-return edge call-to-start edge exit-to-return edge … … n n' Figure 6 Context-sensitivity via CFL-reachability. balanced-parentheses problem, efficiently in the IFDS framework. For technical details, we refer to [38]. There are four data-flow facts, ncw, cw, w, and 0 (for the empty set). There are two call sites to FileWriter() at lines 2 and 2 0 , which are identified as nodes n and n 0 , respectively, with their call-to-start and exit-to-return edges labeled as (n, )n, (n0 and )n0 appropriately. ENTER and EXIT are the start and exit nodes of FileWriter(), whose CFG is irrelevant and thus elided. In SCDFA, the call-to-return edge always serves as a kill edge (to stop the data-flow facts from bypassing a callee). For each node n, PAVAILin(n) is the set of facts, i.e., suffixes of ncw shown in black dots. According to Figure 5, PAVAILin(2) = PAVAILin(20 ) = {ncw}. If FileWriter()
Y.Li,T.Tan,Y.Zhang,and J.Xue XX:9 is entered from(n and exited from )n,then PAVAILin(ENTER)=PAVAILin(EXIT)={cw}. Hence,PAVAILiM(35)={cw}.However,if FileWriter()is entered from(n and exited from )n,then PAVAILiM(ENTER)=PAVAILi(EXIT)=(ncw}.Hence,PAVAILin(39)={ncw} 3.2 SC Extension TAILOR is designed to work effectively with any SC.While an API specification mining tool [17]may generate SCs longer than 10,an assertion verifier may produce only single-point SCs.In general,the longer a SCis,the more SC-irrelevant statements will be eliminated. To improve the precision of SCDFA,we perform first a SC extension pass,denoted SCEXT, as shown in Figure 4.Given a sequence SE SC,with F being its first point,we look for a set of n extension points E,..,En,that collectively dominate F,such that any program execution that passes through F must pass through one Ei.We do so effectively by leveraging the concept of object-sensitivity [33,44]developed by the pointer analysis community for Java. For F,its extension points are chosen as the object allocation sites used for representing the object-sensitive calling contexts for the method containing F.As a result,some infeasible paths are avoided by exploiting branch correlations in object-oriented programs(Figure 3). To make SC extensions,we use the points-to information obtained by,e.g.,the pointer analysis performed earlier for building GICFG.Consider an ICFG fragment discussed earlier in Figure 3.Suppose we are given SC:B-C such that B resides in a target method m for the virtual call site shown and m is only invokable on the objects created at A,which represents an object allocation site.Then we can prepend A to SC to obtain ESC:A-B-C. According to SCDFA,T(SC)will include both branches of "Branch"shown in Figure 3 since both reach SC but T(ESC)will include only the branch containing A as the other(infeasible) branch does not reach ESC.However,lengthening a SC increases the number of SCrelated data-flow facts used (Figure 5).So a precision/scalability tradeoff needs to be made. 】class PMD( void main(Stringll args){ 1ass CMLOpon Renderer createRenderer ( 3 Ext 3 CMLOptions opts=new CMLOptions(args): f(…) Renderer renderer opts.createRenderer(; eturn new EmacsRendererO; renderer..end0:于 17 else if(...) 18 return new HTMLRenderer(); 6 class AbstractRenderer...{ else if (... void end0【t 20 Ext 2 return new SummaryHTMLRenderer(): render(writer,.:十 122 9 class SummaryHTMLRenderer...{ 10 void render (Writer writer)f 23 dass HTMLRenderer 11 Ext 1 renderer=new HTMLRer voi render (Writer writer....){ 12 renderer..renderBody(writer,.-于十 5 riter.write():) Figure 7 A code snippet from PMD for illustrating SCEXT in a program understanding task. We use a real program understanding task to show why SCEXT is useful for real code.PMD is a static analyzer for analyzing Java programs and can print a range of source code flaws in different formats such as Emacs,CSV,and HTML.In this task,we want to understand how PMD renders its outputs in the commonly used HTML format.Figure 7 shows the simplified code.The only knowledge we have initially is that the HTMLRenderer class is responsible for writing to the file at line 25,but how it is done is unknown.At this stage,we have a single-point,SCline25:line 25,representing just the write statement at line 25. If we apply SCDFA to SCline 25,T(SCline 25)will include all the lines in the code given. Below we show how to extend this to ESCline 25 line 20line 11-line 25 object- sensitively [33,44].If we apply SCDFA to ESCline 25,T(ESCline25)will now be smaller, consisting of all the lines except lines 16,18 and 22 in method createRenderer().In other words,all the branches except the one enclosing line 20 in method createRenderer()are EC00P2016
Y. Li, T. Tan, Y. Zhang, and J. Xue XX:9 is entered from (n and exited from )n, then PAVAILin(ENTER) = PAVAILin(EXIT) = {cw}. Hence, PAVAILin(35) = {cw}. However, if FileWriter() is entered from (n0 and exited from )n0 , then PAVAILin(ENTER) = PAVAILin(EXIT) = {ncw}. Hence, PAVAILin(39) = {ncw}. 3.2 SC Extension Tailor is designed to work effectively with any SC. While an API specification mining tool [17] may generate SCs longer than 10, an assertion verifier may produce only single-point SCs. In general, the longer a SC is, the more SC-irrelevant statements will be eliminated. To improve the precision of SCDFA, we perform first a SC extension pass, denoted SCEXT, as shown in Figure 4. Given a sequence S ∈ SC, with F being its first point, we look for a set of n extension points E1, · · · , En, that collectively dominate F, such that any program execution that passes through F must pass through one Ei . We do so effectively by leveraging the concept of object-sensitivity [33,44] developed by the pointer analysis community for Java. For F, its extension points are chosen as the object allocation sites used for representing the object-sensitive calling contexts for the method containing F. As a result, some infeasible paths are avoided by exploiting branch correlations in object-oriented programs (Figure 3). To make SC extensions, we use the points-to information obtained by, e.g., the pointer analysis performed earlier for building GICFG. Consider an ICFG fragment discussed earlier in Figure 3. Suppose we are given SC : B → C such that B resides in a target method m for the virtual call site shown and m is only invokable on the objects created at A, which represents an object allocation site. Then we can prepend A to SC to obtain ESC : A → B → C. According to SCDFA, T (SC) will include both branches of “Branch” shown in Figure 3 since both reach SC but T (ESC) will include only the branch containing A as the other (infeasible) branch does not reach ESC. However, lengthening a SC increases the number of SC-related data-flow facts used (Figure 5). So a precision/scalability tradeoff needs to be made. class PMD { CMLOptions opts = new CMLOptions(args); void main (String[] args) { Renderer renderer = opts.createRenderer(); } } 1 2 3 4 5 9 23 24 25 13 14 … … renderer.end(); class CMLOptions { if ( … ) { Renderer createRenderer () { return new EmacsRenderer(); } else if ( … ) return new HTMLRenderer(); } else if ( … ) return new SummaryHTMLRenderer(); } else if ( … ) } class SummaryHTMLRenderer ... { renderer = new HTMLRenderer(...); void render (Writer writer, … ) { renderer.renderBody(writer, ...); class HTMLRenderer ... { writer.write(...); void renderBody (Writer writer, … ) { } } } } 15 16 17 18 19 20 21 22 } } class AbstractRenderer ... { render(writer, ... ); void end () { } } 6 7 8 10 11 12 Ext 1 Ext 2 Ext 3 Figure 7 A code snippet from PMD for illustrating SCEXT in a program understanding task. We use a real program understanding task to show why SCEXT is useful for real code. PMD is a static analyzer for analyzing Java programs and can print a range of source code flaws in different formats such as Emacs, CSV, and HTML. In this task, we want to understand how PMD renders its outputs in the commonly used HTML format. Figure 7 shows the simplified code. The only knowledge we have initially is that the HTMLRenderer class is responsible for writing to the file at line 25, but how it is done is unknown. At this stage, we have a single-point, SCline 25: line 25, representing just the write statement at line 25. If we apply SCDFA to SCline 25, T (SCline 25) will include all the lines in the code given. Below we show how to extend this to ESCline 25 : line 20 → line 11 → line 25 objectsensitively [33, 44]. If we apply SCDFA to ESCline 25, T (ESCline 25) will now be smaller, consisting of all the lines except lines 16, 18 and 22 in method createRenderer(). In other words, all the branches except the one enclosing line 20 in method createRenderer() are E COO P 2 0 1 6
XX:10 Program Tailoring:Slicing by Seguential Criteria infeasible for SCline 25,according to the points-to information provided for this program. Let us see how SCEXT works in growing SCline 25 to become ESCline 25.Initially,SCline25 has one statement at line 25.which resides in method renderBody().The (abstract)receiver object pointed to by renderer at line 12 is allocated only at the allocation site at line 11 and is considered as the calling context for line 25 in an object-sensitive pointer analysis. There is another allocation site at line 18 for the same type,HTMLRenderer,but this is not considered,since the abstract object created at line 18 does not flow to line 12(according to the points-to information available).At this stage,we have SCline 25:line 11line 25. Similarly,we now look for the allocation sites that can be used as the calling contexts for line 11 contained in method render()at line 10,which is called from this.render()at line 8 in method end()defined in the AbstractRenderer class.Thus,the receiver obiects on which method render()(line 10)is invoked are the ones pointed to by renderer at line 5.These objects are returned from the call to createRenderer()at line 4.Thus,renderer points to the objects created by the allocation sites in all different branches in createRenderer(). According to the points-to information,the target method end()invoked at the virtual call site at line 5 can only be made on an receiver object of type SummaryHTMLRenderer.Thus, we now have an even longer SCline 25 line 20-line 11-line 25. The allocation site at line 3 is the object-sensitive context for the target method createRenderer(),which contains line 20,invoked at line 4.No further extension is possible, since main()has been reached.Now,we have SCline 25:line 3-line 20-line 11-line 25. If we apply SCDFA to this four-point SCline 25,T(SCline 25)will consist of all the lines except lines 16,18 and 22.However,including line 3 does not help as it dominates line 20 in GICFG.By removing it,we obtain ESCline 25:line 20-line 11-line 25 as desired.Applying SCDFA to ESCline25 yields the same tailored program obtained for the three-point SCline25. When lengthening SCs,we aim to reduce infeasible paths by exploiting branch correlations. However,with longer SCs,SCDFA will end up using more data-flow facts,as seen in Figure 5, making it less efficient than before.So a precision/scalability tradeoff has to be made.Our key observation is to keep a SC extension point found if it is inside one branch in a multi-way branching statement or a target method invoked at a polymorphic call site (Figure 3), provided that the point is not in a cycle.This way,the other branches (or target methods) are infeasible (with respect to a given SC)and will not show up in the tailored program. Let us return to the program understanding task at hand.From the tailored program T(ESCline 25)obtained for ESCline 25,we can see clearly that the allocation site at line 20 in method createRenderer()is responsible for the rendering-related write at line 25. 4 Formalism We first formalize SCDFA and SCEXT and then prove some properties about TAILOR. 4.1 SC-based Data-Flow Analysis We provide a context-insensitive formalization of SCDFA as context sensitivity is achieved on top of this formalization via CFL-reachability,as shown in Figure 6.Essentially,we describe the data-flow equations needed for solving its BU and TD passes in the IFDS framework [38] The IFDS problem framework is precise and efficient for solving data-flow problems with three properties.First,the set D of data-flow facts is finite.Second,the domain and range of each flow,i.e.,transfer function is the power set of D,denoted 2,the lattice ordering relation E on 2 is or 2,and the meet operator nis nor U.Finally,each fow function f must be distributive over 2:VS1,S2 E2D:f(S1nS2)=f(S1)nf(S2)
XX:10 Program Tailoring: Slicing by Sequential Criteria infeasible for SCline 25, according to the points-to information provided for this program. Let us see how SCEXT works in growing SCline 25 to become ESCline 25. Initially, SCline25 has one statement at line 25, which resides in method renderBody(). The (abstract) receiver object pointed to by renderer at line 12 is allocated only at the allocation site at line 11 and is considered as the calling context for line 25 in an object-sensitive pointer analysis. There is another allocation site at line 18 for the same type, HTMLRenderer, but this is not considered, since the abstract object created at line 18 does not flow to line 12 (according to the points-to information available). At this stage, we have SCline 25 : line 11 → line 25. Similarly, we now look for the allocation sites that can be used as the calling contexts for line 11 contained in method render() at line 10, which is called from this.render() at line 8 in method end() defined in the AbstractRenderer class. Thus, the receiver objects on which method render() (line 10) is invoked are the ones pointed to by renderer at line 5. These objects are returned from the call to createRenderer() at line 4. Thus, renderer points to the objects created by the allocation sites in all different branches in createRenderer(). According to the points-to information, the target method end() invoked at the virtual call site at line 5 can only be made on an receiver object of type SummaryHTMLRenderer. Thus, we now have an even longer SCline 25 : line 20 → line 11 → line 25. The allocation site at line 3 is the object-sensitive context for the target method createRenderer(), which contains line 20, invoked at line 4. No further extension is possible, since main() has been reached. Now, we have SCline 25 : line 3 → line 20 → line 11 → line 25. If we apply SCDFA to this four-point SCline 25, T (SCline 25) will consist of all the lines except lines 16, 18 and 22. However, including line 3 does not help as it dominates line 20 in GICFG. By removing it, we obtain ESCline 25 : line 20→line 11→line 25 as desired. Applying SCDFA to ESCline 25 yields the same tailored program obtained for the three-point SCline 25. When lengthening SCs, we aim to reduce infeasible paths by exploiting branch correlations. However, with longer SCs, SCDFA will end up using more data-flow facts, as seen in Figure 5, making it less efficient than before. So a precision/scalability tradeoff has to be made. Our key observation is to keep a SC extension point found if it is inside one branch in a multi-way branching statement or a target method invoked at a polymorphic call site (Figure 3), provided that the point is not in a cycle. This way, the other branches (or target methods) are infeasible (with respect to a given SC) and will not show up in the tailored program. Let us return to the program understanding task at hand. From the tailored program T (ESCline 25) obtained for ESCline 25, we can see clearly that the allocation site at line 20 in method createRenderer() is responsible for the rendering-related write at line 25. 4 Formalism We first formalize SCDFA and SCEXT and then prove some properties about Tailor. 4.1 SC-based Data-Flow Analysis We provide a context-insensitive formalization of SCDFA as context sensitivity is achieved on top of this formalization via CFL-reachability, as shown in Figure 6. Essentially, we describe the data-flow equations needed for solving its BU and TD passes in the IFDS framework [38]. The IFDS problem framework is precise and efficient for solving data-flow problems with three properties. First, the set D of data-flow facts is finite. Second, the domain and range of each flow, i.e., transfer function is the power set of D, denoted 2 D, the lattice ordering relation v on 2 D is ⊆ or ⊇, and the meet operator u is ∩ or ∪. Finally, each fow function f must be distributive over 2 D: ∀S1, S2 ∈ 2 D: f(S1 u S2) = f(S1) u f(S2)