snowbird UT. USA. 2014 A Pivotal Prefix Based Filtering Algorithm for String Similarity search Dong deng, guoliang Li, Jianhua Feng Database Group, Tsinghua University 小 1911 Present by dong deng
Dong Deng, Guoliang Li, Jianhua Feng Database Group, Tsinghua University Present by Dong Deng A Pivotal Prefix Based Filtering Algorithm for String Similarity Search
Search is Important Google Searches per Year 1,600,000,000,000 ■ Search queries 1,200,000,000,000 800000,000,000 40,000,000,000 19992000200120022003200420052006200720082009201020112012 Google searches per Year Source:http://www.internetlivestats.com/google-search-statistics/
Search is Important Source: http://www.internetlivestats.com/google-search-statistics/ Google Searches per Year
Speed matters But when questions aren' t answered quickly, people ask less GOOGLE FOUND THAT SLOWING SEARCH RESULTS BY JUST 4/10THS OF A SECOND would reduce the number of searches by 8,000,000 a d Source
Speed Matters Source:
Data is dirty DBLP Complete Search ypos Argyrios zymnis 008 2EE Argyrios Zymis, Stephen P. Boyd, Dimitry Ml. Gorinevsky: Mixed state estimation for a linear Gaussi an Markov model. CDC2008:3219-3226 I EE Argyris Zymmis, Stephen P. Boyd, Dimitry M. Gorinevsky: Mixed state estimation for a linear Gaussian Markov model. cAo00:1011 argyris Zymnis relaxed E Yai I, Moses Charikar, Piotr Indyk: On page migration and otherrelaxed task systems. Theor. Comput. Sci. Tcs)268(1):43-6(2001) 1997 1 EE Yair Bartal, Moses Charikar, Piotr Indyk: On Page Migration and Other Related Task Systems. SODA 1997: 43-52 related
Data is Dirty • Typos • Typo in “title” relaxed related Argyrios Zymnis Argyris Zymnis DBLP Complete Search
Similarity Search Query All the strings similar to the query String Dataset
Similarity Search Query String Dataset All the strings similar to the query
Edit distance ED( s): The min number of edit operations (insertion/deletion/substitution) needed to transform r to s For example: ED(sigcom, sigmod )=2 sitcom substitute c with m sIemon substitute m with d sigmod
• ED(r, s): The min number of edit operations (insertion/deletion/substitution) needed to transform r to s. • For example: ED(sigcom, sigmod) = 2 Edit Distance sigcom sigmom sigmod substitute c with m substitute m with d
Problem definition DEFINITION 1 (STRING SIMILARITY SEARCH). Given a string dataset R, a query string s, and an edit-distance thresh old T, the string similarity search with edit-distance con- straints finds all strings r ER such that ed(r,S)<T id string Query string lmyouteca s= yotubecom"andτ=2 r2 ubuntucom r3 utubbecou r4 youtbecom r5 yoytubeca ed(s, r4<=2 output ra as a result string dataset R
Problem Definition Query string s = “yotubecom” and τ = 2 string dataset R ed(s, r4 ) <= 2 output r4 as a result
Application Spell checking Copy detection Entity Linking ·Bⅰ informatic
Application • Spell Checking • Copy Detection • Entity Linking • Bioinformatic …
Challenge Naive method Time complexity: for each query O(RIs t
Challenge Naïve Method Time complexity: for each query 𝑂 |𝑅| |𝑠| τ
Filter-and-Verification framework Query string s Filter: Index No es Dataset r Signature()∩ Verify: Results Signature (r)=o? ED(rs)≤τ? Threshold T Pruning power Filtering Cost depends on depends on Sig(r)Ix(sig(s)) and Index: (sig(r)I(Probe set size) min((sig(r)(Sig(s) No Index: (sig(r)+(sig(s)I
No Filter-and-Verification Framework Dataset R Threshold τ Query string s Results Filter: Signature(s) ∩ Signature(r) = ϕ? Verify: ED(r,s) ≤ τ? Yes Index