Data Structures and Algorithm Xiaoqing Zheng Zhengxq@fudan.edu.cn
Data Structures and Algorithm Xiaoqing Zheng zhengxq@fudan.edu.cn
String matching problem Pattern P occurs with shift s in text T(or, equivalently that pattern P occurs beginning at position S I in text T)if Ts+1….s+m]=P1….m」and0≤s≤n-m.If P occurs with shift s in T, the we call s a valid shift The string-matching problem is the problem of finding all valid shifts with which a given pattern P occurs in a given text 1. Text T b c lab b c a b Pattern p albl a
String matching problem Pattern P occurs with shift s in text T (or , equivalently, that pattern P occurs beginning at position s + 1 in text T) if T[ s + 1 ... s + m] = P[1 ... m ] and 0 ≤ s ≤ n − m. If P occurs with shift s in T, the we call s a valid shift. The string-matching problem is the problem of finding all valid shifts with which a given pattern P occurs in a given text T. a b c a b a a b c a b a b a a Text T Pattern P s = 3
Notation and terminology ∑ finite alphabet 2-fa 2 set of all finite-length strings formed using characters from the alphabet 2 zero-length empty string belongs to 2 I x length of a string x concatenation of two string x and y WDx w is a prefix of a string x, if x= wy for some string y∈2 wax w is a suffix of a string x, ifx= yw for some string y∈2 k-character prefix P[l... k] of the pattern P
Notation and terminology finite alphabet. ∑ = { a, b, ..., z }. set of all finite-length strings formed using characters from the alphabet ∑. zero-length empty string belongs to ∑ *. * Σ Σ ε | | x length of a string x. xy concatenation of two string x and y. w is a prefix of a string x, if x = wy for some string y ∈ ∑ *. w x w is a suffix of a string x, if x = yw for some string y ∈ ∑ *. w x k-character prefix P[1 ... k] of the pattern P. Pk
Naive string-matching algorithm NAIVE-STRING-MATCHER(T P 1.n← length[T 2.m← length[P 3.fors←0ton-m 4. do if Pll.m=T[s+l.S+m then print "Pattern occurs with shift" s Running time is O((n-m+ I)m)
Naive string-matching algorithm NAIVE-STRING-MATCHER ( T, P ) 1. n ← length [ T ] 2. m ← length [ P ] 3. for s ← 0 to n − m 4. do if P[1 ... m] = T[ s +1 ... s + m ] 5. then print "Pattern occurs with shift" s Running time is O(( n − m + 1) m )
Idea of rabin-Karp algorithm Input characters and string can be represented by y graphical symbols or digits Given a pattern Pll.ml, let p denote its corresponding decimal value p=P[m]+10(P[m-1)+ 10(P[m-2]+….+10([2]+10P[]).) We also let t denote the decimal value of the length-m substring T[s+1….s+ml,fors=0,12…,n-m Then,t= p if and only if Ts+1….s+m]=P[1….m
Idea of Rabin-Karp algorithm Input characters and string can be represented by graphical symbols or digits. Given a pattern P[1 ... m ], let p denote its corresponding decimal value. Then, ts = p if and only if T[ s + 1 ... s + m] = P[1 ... m ]. We also let ts denote the decimal value of the lengthm substring T[ s + 1 ... s + m ], for s = 0, 1, ..., n − m. p = P [ m] + 10(P [ m − 1]) + 10( P [ m − 2] + ... + 10( P[2] + 10 P[1])...))
dea o f Rabin-Karp algorithm tu+i can be computed from ts in constant time, since ts+1=10(5-10m7s+1)+7s+m+1 Constant is precomputed which can be done in time o(gm) For example, if=5 and t=31415, then we remove the high-order digit T[s+1-3 and bring in the new low-order digit T[s +5+1]=2 to obtain 10(31415-10000·3)+2 14152
Constant is precomputed which can be done in time O (lgm ) Idea of Rabin-Karp algorithm ts+1 can be computed from ts in constant time, since, ts+1 = 10( ts − 10 m − 1 T[s + 1]) + T[ s + m + 1]. For example, if m = 5 and ts = 31415, then we remove the high-order digit T[ s + 1] = 3 and bring in the new low-order digit T[ s + 5 + 1] = 2 to obtain ts+1 = 10(31415 − 10000 · 3) + 2 = 14152
Improved Rabin-Karp algorithm ts+1=(10(-11s+1h)+Ts+m+1)modq g is typically chosen as a prime such that 10g just fits within one computer word h=10 -(mod g)
Improved Rabin-Karp algorithm ts+1 = (10( ts − T[ s + 1] h) + T[ s + m + 1]) mod q. y q is typically chosen as a prime such that 10 q just fits within one computer word. 1 10 m h − y ≡ (mod q )
Improved Rabin-Karp algorithm old high new low order digit 8 order digit 314 1 52 7 14152≡10·(31415-310000+2(mod13) =10·(31415-3·3)+2(mod13) 8(mod13)
Improved Rabin-Karp algorithm 3 1 4 1 5 2 7 8 old highorder digit new low- order digit 14152 10 (31415 3 10000) 2 (mod13) ≡ ⋅ −⋅ + ≡ ⋅ −⋅ + 10 (31415 3 3) 2 (mod13) ≡ 8 (mod13)
Improved Rabin-Karp algorithm 123456789101112131415161718 2|=|59o2|3 1415 267|3|9|9|2 89|31101784510117911 mod 13 Valid match Spurious hit s= p(mod g) does not imply that ts-p
Improved Rabin-Karp algorithm 2 3 5 9 0 2 3 1 4 1 5 2 6 7 3 9 9 2 8 9 3 11 0 1 7 8 4 5 10 11 7 9 11 1 10 11 2345678 9 12 13 14 15 16 18 17 ... ... mod 13 Valid match Spurious hit st p ≡ (mod q ) does not imply that ts = p
Rabin-Karp algorithm RABIN-KARP-MATCHER(T, P, d, g) 1.n← length[ 2.m← length 3.h←dm- mod q Preprocessing 4.p←0 0 6.fori←ltom ∥ Preprocessing 7.dop←(cp+P[)modq Matching 8.doto←(do+7ij)mod 9.fors←0ton-m∥ Matching 10. do if p=t then if P[1l….m]=7[s+1….s+m] 12 then print "Pattern occurs with shift" s 13 ifs<n 14 then ts+1+((ts- T[s+I]h)+T[s+m+lD mod q
Rabin-Karp algorithm RABIN-KARP-MATCHER ( T, P, d, q ) 1. n ← length [ T ] 2. m ← length [ P ] 3. h ← d m −1 mod q 4. p ← 0 5. t 0 ← 0 6. for i ← 1 to m // Preprocessing. 7. do p ← (dp + P [ i]) mod q 8. do t 0 ← (dt 0 + T[ i]) mod q 9. for s ← 0 to n − m // Matching. 10. do if p = ts 11. then if P[1 ... m] = T [ s +1 ... s + m ] 12. then print "Pattern occurs with shift" s 13. if s < n − m 14. then ts+1 ← ( d ( ts − T [s + 1] h) + T [ s + m + 1]) mod q Preprocessing Ф ( m ) Matching O(( n − m + 1) m )