There are three minor differences.First,we do not need to Algorithm 4:EQUIV-CHECKER handle (non-existent)e-transitions.Second.we can find the next states of a DFA state g more efficiently.In the general Input DFA1=(Q1,1,d1,q,T1,1) DFA2=(Q2,∑2,02,92,T2,2) case,all input symbols must be examined.In our case (lines Output:TRUE or FALSE (Are DFA and DFA2 equivalent?) 7-9),we only need to iterate over the fields (input symbols) 1Q=Q1UQ2 of an arbitrarily picked object(an NFA state)in g to find its 2∑=∑1U∑2 next states.Due to SINGLETYPE-CHECK in lines 6-7 of J6[4,f月ifq∈Q1 Algorithm 1,the objects grouped in a DFA state g must have 3la,月={24,月ifg∈Q2 the same type.Finally,we need to compute I(set of output 4T=TiUr2 symbols)and (output map)at lines 14-16, s Yg= ∫h[gifq∈Q1 U2[gifq∈Q2 Algorithm 3:DFA-CONVERTER 6DFA=(Q,2,6,91,T,Y) nput:NFA=(Q,∑,6,o,T,Y) 7 Let V be a new set Output:DFA =(',,,o.) 8 foreach g∈Qdo 190={90} Add {9}tov 2∑=∑ 10 V.UNION(g1,g2) 3 Let O'and I'be two new sets 11 Push(,g2)to a new stack,STACK 4 Let and be two new maps 12 while STACK is not empty do s Add go as an unmarked state to Q' 13 Pop (p1,p2)from STACK 6 while there is an unmarked state qQ'do 14 foreach f∈∑do Mark g 15 r1=V.FIND(6[p1,f]),r2 =V.FIND([p2,f]) 7 8 Pick any oi from q 16 ifri≠r2then foreach f E FIELDSOF(o;)do 17 V.UNION(r1.r2) Push(r1.r2)to STACK 10 q={6[o,f|o∈9} ifg走Q'then 12 Add 'as an unmarked state to TRUE 19 return ifs∈V:p,q∈s:p=y[g间 FALSE otherwise 6'[g,f月=q l4 foreach q∈Q'do 15 Y[g]={TYPEOF(O:)1o:Eg} 16 LT=TUYlg] to work with mainstream allocation-site-based points-to 17 return DFA (Q','6',go,I',) analysis frameworks such as CHORD [10],WALA [50], SOoT [49]and Doop [14].To demonstrate its effective- ness,we have integrated MAHJONG with DOOP [9,14]. 4.4 The Automata Equivalence Checker a state-of-the-art whole-program points-to analysis frame- Algorithm 4 (EQUIV-CHECKER)tests the equivalence of work for Java.MAHJONG is released as an open-source tool two 6-tuple DFAs,by applying a Hopcroft-Karp algorithm at http://www.cse.unsw.edu.au/-corg/mahjong.Be- low we discuss three major optimizations. that was proposed for two 5-tuple DFAs [18]with minor modifications at line 19 on testing whether all states in s EV Disjoint-Set Forest In Algorithms 1 and 4,disjoint sets are have the same type.As discussed in Section 2.2.2,a 5-tuple used.For efficiency,we have implemented a set of disjoint DFA can be modeled as a special case of a 6-tuple DFA sets as a disjoint-set forest,by representing each disjoint set EQUIV-CHECKER iterates over all fields f >(line 14) as a tree with its root being its representative.Thus,UNION and queries the transition map 6 to obtain the next states(line amounts to linking the roots of different trees while FIND 15).By convention,if 6[q,f]is not defined,since the objects returns the root of a tree.To improve the efficiency further, in g do not have the field f,we assume that 6[q,f]=qerror. we have also implemented two heuristics,union by rank and In addition,gerror]returns a special type for qerror. path compression [11].As a result,the average execution 4.5 The Heap Modeler time of each UNION/FIND operation over a disjoint-set for- After Algorithm 1 has terminated,we have W=H/=in est can be reduced to nearly O(1)[11]. its line 16.Then MOM specifies the new heap abstraction Shared Sequential Automata In Algorithms 2 and 3,new given in Definition 2.2,as discussed in Section 3.5. automata are frequently created.However,different au- tomata can be partly identical,since their common parts 5.Implementation correspond to the same objects.Instead of always creating We have implemented MAHJONG as a standalone tool in a new automata,we allow different automata to share their total of only 1500 LOC in Java to build a new heap abstrac- common parts.This optimization reduces significantly both tion by merging equivalent automata.MAHJONG is designed the time and space costs of the overall algorithm. 285There are three minor differences. First, we do not need to handle (non-existent) ǫ-transitions. Second, we can find the next states of a DFA state q more efficiently. In the general case, all input symbols must be examined. In our case (lines 7 – 9), we only need to iterate over the fields (input symbols) of an arbitrarily picked object (an NFA state) in q to find its next states. Due to SINGLETYPE-CHECK in lines 6 – 7 of Algorithm 1, the objects grouped in a DFA state q must have the same type. Finally, we need to compute Γ ′ (set of output symbols) and γ ′ (output map) at lines 14 – 16, Algorithm 3: DFA-CONVERTER Input : NFA = (Q, Σ, δ, q0, Γ, γ) Output: DFA = (Q ′ , Σ ′ , δ′ , q′ 0, Γ ′ , γ′ ) 1 q ′ 0 = {q0} 2 Σ ′ = Σ 3 Let Q ′ and Γ ′ be two new sets 4 Let δ ′ and γ ′ be two new maps 5 Add q ′ 0 as an unmarked state to Q ′ 6 while there is an unmarked state q ∈ Q ′ do 7 Mark q 8 Pick any oi from q 9 foreach f ∈ FIELDSOF(oi) do 10 q ′ = { δ[oj , f] | oj ∈ q } 11 if q ′ ∈/ Q ′ then 12 Add q ′ as an unmarked state to Q ′ 13 δ ′ [q, f] = q ′ 14 foreach q ∈ Q ′ do 15 γ ′ [q] = { TYPEOF(oi) | oi ∈ q } 16 Γ ′ = Γ′ ∪ γ ′ [q] 17 return DFA = (Q ′ , Σ ′ , δ′ , q′ 0, Γ ′ , γ′ ) 4.4 The Automata Equivalence Checker Algorithm 4 (EQUIV-CHECKER) tests the equivalence of two 6-tuple DFAs, by applying a Hopcroft-Karp algorithm that was proposed for two 5-tuple DFAs [18] with minor modifications at line 19 on testing whether all states in s ∈ V have the same type. As discussed in Section 2.2.2, a 5-tuple DFA can be modeled as a special case of a 6-tuple DFA. EQUIV-CHECKER iterates over all fields f ∈ Σ (line 14) and queries the transition map δ to obtain the next states (line 15). By convention, if δ[q, f] is not defined, since the objects in q do not have the field f, we assume that δ[q, f] = qerror. In addition, γ[qerror] returns a special type for qerror. 4.5 The Heap Modeler After Algorithm 1 has terminated, we have W = H / ≡ in its line 16. Then MOM specifies the new heap abstraction given in Definition 2.2, as discussed in Section 3.5. 5. Implementation We have implemented MAHJONG as a standalone tool in a total of only 1500 LOC in Java to build a new heap abstraction by merging equivalent automata. MAHJONG is designed Algorithm 4: EQUIV-CHECKER Input : DFA1 = (Q1, Σ1, δ1, q1, Γ1, γ1) DFA2 = (Q2, Σ2, δ2, q2, Γ2, γ2) Output: TRUE or FALSE (Are DFA1 and DFA2 equivalent?) 1 Q = Q1 ∪ Q2 2 Σ = Σ1 ∪ Σ2 3 δ[q, f] = δ1[q, f] if q ∈ Q1 δ2[q, f] if q ∈ Q2 4 Γ = Γ1 ∪ Γ2 5 γ[q] = γ1[q] if q ∈ Q1 γ2[q] if q ∈ Q2 6 DFA = (Q, Σ, δ, q1, Γ, γ) 7 Let V be a new set 8 foreach q ∈ Q do 9 Add {q} to V 10 V .UNION(q1, q2) 11 Push (q1, q2) to a new stack, STACK 12 while STACK is not empty do 13 Pop (p1, p2) from STACK 14 foreach f ∈ Σ do 15 r1 =V.FIND(δ[p1, f]), r2 =V.FIND(δ[p2, f]) 16 if r1 6= r2 then 17 V .UNION(r1, r2) 18 Push (r1, r2) to STACK 19 return TRUE if ∀s ∈ V : ∀p, q ∈ s : γ[p] = γ[q] FALSE otherwise to work with mainstream allocation-site-based points-to analysis frameworks such as CHORD [10], WALA [50], SOOT [49] and DOOP [14]. To demonstrate its effectiveness, we have integrated MAHJONG with DOOP [9, 14], a state-of-the-art whole-program points-to analysis framework for Java. MAHJONG is released as an open-source tool at http://www.cse.unsw.edu.au/~corg/mahjong. Below we discuss three major optimizations. Disjoint-Set Forest In Algorithms 1 and 4, disjoint sets are used. For efficiency, we have implemented a set of disjoint sets as a disjoint-set forest, by representing each disjoint set as a tree with its root being its representative. Thus, UNION amounts to linking the roots of different trees while FIND returns the root of a tree. To improve the efficiency further, we have also implemented two heuristics, union by rank and path compression [11]. As a result, the average execution time of each UNION/FIND operation over a disjoint-set forest can be reduced to nearly O(1) [11]. Shared Sequential Automata In Algorithms 2 and 3, new automata are frequently created. However, different automata can be partly identical, since their common parts correspond to the same objects. Instead of always creating new automata, we allow different automata to share their common parts. This optimization reduces significantly both the time and space costs of the overall algorithm. 285