4.2 Assumptions

When analysing real code under-approximately, we accommodate Assumptions 1–4 as follows. For Assumption 1, we rely on Doop’s pointer analysis to simulate the behaviors of Java native methods. Dynamic class loading is assumed to be resolved separately [25]. To simulate its effect, we create a closed world for a program by locating the classes referenced with Doop’s fact generator and adding further ones found through program runs under TamiFlex [22]. For the DaCapo benchmarks, avrora, and checkstyle, their associated inputs are used. For findbugs, one Java program is developed as its input. For freecs, a server requiring user interactions, we only initialize it as the input in order to ensure repeatability.
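The closed-world construction described above amounts to a set union. The following is a minimal Java sketch of that idea only, assuming both inputs are already available as sets of fully qualified class names. The class `ClosedWorldSketch`, the method `closedWorld`, and all class names in it are hypothetical, and neither Doop's fact generator nor TamiFlex is actually invoked.

```java
import java.util.Set;
import java.util.TreeSet;

public class ClosedWorldSketch {
    // Union of the classes found statically and the classes observed in
    // recorded program runs. Both inputs are assumed to be given.
    static Set<String> closedWorld(Set<String> staticallyReferenced,
                                   Set<String> observedAtRunTime) {
        Set<String> world = new TreeSet<>(staticallyReferenced);
        world.addAll(observedAtRunTime);
        return world;
    }

    public static void main(String[] args) {
        // Hypothetical class names, for illustration only.
        Set<String> fromStaticFacts = Set.of("app.Main", "app.Config");
        Set<String> fromRecordedRuns = Set.of("app.Config", "app.plugins.Loader");
        System.out.println(closedWorld(fromStaticFacts, fromRecordedRuns));
    }
}
```

Classes that are loaded only dynamically, such as the hypothetical `app.plugins.Loader`, enter the closed world solely through the recorded runs, which is why the choice of inputs matters.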
Assumptions 2 and 3 are taken for granted. As for Assumption 4, we validate it for all reflective allocation sites where $o_i^u$ is created in the application code of the 10 programs that can be analyzed scalably. A simple intraprocedural analysis automatically confirms that this assumption holds at 75% of these sites. We have inspected the remaining 25% interprocedurally and found only two violating sites (in eclipse and checkstyle), where $o_i^u$ is never used. In the other sites inspected, $o_i^u$ flows through only local variables, with all call chains having length at most 2.

4.3 RQ1: Full Automation

Fig. 10 compares Solar with existing reflection analyses [5, 17, 18, 20–22], denoted by “Others”, by the degree of automation achieved. For an analysis, this is measured by the number of annotations required to improve the soundness of the reflective calls identified as potentially unsoundly resolved.

Fig. 10. The number of annotations required for improving the soundness of unsoundly resolved reflective calls. (Bar chart for Solar and Others over chart, eclipse, fop, hsqldb, pmd, xalan, avrora, checkstyle, findbugs and freecs; axis from 0 to 45.)

Solar analyzes 7 out of the 11 programs scalably with full automation. For hsqldb, xalan and checkstyle, Solar is initially unscalable (within 3 hours). With Probe, 13 reflective calls are flagged as being potentially unsoundly resolved. After 7 annotations (2 in hsqldb, 2 in xalan and 3 in checkstyle), Solar becomes scalable, as discussed in Section 4.4. However, Solar, like Doop and Elf, is unscalable (within 3 hours) for jython, an interpreter for Python in which the Java libraries and application code are invoked reflectively from the Python code.

“Others” cannot identify which reflective calls may be unsoundly resolved. However, they may improve soundness by requiring users to annotate the string arguments of calls to, e.g., Class.forName() and getMethod(), as suggested in [20]. As shown in Fig. 10, “Others” will require 338 annotations initially and possibly more in the subsequent iterations (when more code is discovered). As discussed in Section 2.3, Solar’s annotation approach is also iterative. However, for these programs, Solar requires only 7 annotations in one iteration.

Solar outperforms “Others” due to its powerful inference system for performing reflection resolution and its effective mechanism for identifying unsoundness.
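To illustrate why the string arguments of Class.forName() and getMethod() are natural annotation points, here is a minimal Java sketch of a reflective call whose targets are known only at run time. The class `ReflectiveCallDemo`, the method `callReflectively`, and the property names `demo.class` and `demo.method` are hypothetical and are not taken from any of the benchmarks.

```java
import java.lang.reflect.Method;

public class ReflectiveCallDemo {
    // The class and method names arrive as plain strings, so a static
    // analysis that cannot determine their values cannot tell which class
    // is instantiated or which method is invoked.
    static Object callReflectively(String className, String methodName) {
        try {
            Class<?> clazz = Class.forName(className);   // class chosen by a string
            Object receiver = clazz.getDeclaredConstructor().newInstance();
            Method method = clazz.getMethod(methodName); // method chosen by a string
            return method.invoke(receiver);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("reflective call failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical configuration: the names come from system properties,
        // with defaults so that the sketch runs as is.
        String className = System.getProperty("demo.class", "java.util.ArrayList");
        String methodName = System.getProperty("demo.method", "size");
        System.out.println(callReflectively(className, methodName));
    }
}
```

An annotation in the sense of [20] would supply the possible values of `className` and `methodName` at such a site. Every site of this kind is a candidate for annotation under that approach, which is consistent with the much larger annotation count reported for “Others”.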