Top-k String Similarity Search with Edit-Distance Constraints Dong Deng (Tsinghua, Beijing, China) Guoliang Li (Tsinghua, Beijing, China) Jianhua Feng (Tsinghua, Beijing, China) GHUA NIVE -1911- Wen-Syan Li(SAP Labs, Shanghai, China)
Dong Deng (Tsinghua, Beijing, China) Guoliang Li (Tsinghua, Beijing, China) Jianhua Feng (Tsinghua, Beijing, China) Wen-Syan Li(SAP Labs, Shanghai, China)
Outline ● Motivation ● Problem formulation o Progressive Framework Pivotal Entry-based Method ● Range- based method ° Experiment ● Conclusion 2/2/2021 Topk Search(@ ICD E2013
Outline Motivation Problem Formulation Progressive Framework Pivotal Entry-based Method Range-based Method Experiment Conclusion 2/2/2021 TopkSearch @ ICDE2013 2/42
Real-world Data is Rather Dirty! a DBLP Complete search ● Typo in“aut Argyrios zymnis 008 2EE Argyrios Zymis, Stephen P. Boyd, Dimitry Ml. Gorinevsky: Mixed state estimation for a linear Gaussi an Markov model. CDC2008:3219-3226 I EE Argyris Zymmis, Stephen P. Boyd, Dimitry M. Gorinevsky: Mixed state estimation for a linear Gaussian Markov model. □Awvw:051011 Argyris Zymnis Typo in“ title relaxe Yair Bartal, Moses Charikar, Piotr Indyk: On page migration and otherrelaxed task systems. Theor. Comput. Sci. Tcs)268(1):43-6(2001) 1997 1 EE Yair Bartal, Moses Charikar, Piotr Indyk: On Page Migration and Other Related Task Systems. SODA 1997: 43-52 related 2/2/2021 Topk Search(@ ICD E2013 /42
Typo in “author” Typo in “title” relaxed related Argyrios Zymnis Argyris Zymnis DBLP Complete Search 2/2/2021 Real-world Data is Rather Dirty! TopkSearch @ ICDE2013 3/42
Web search Google litany spears. Search Advanced Search Web News Images video Did you mean: britney spears o Errors in queries 488941 britney spears o Errors in data 40134 brittany spears 36315 brittney spears 24342 britany spears o Bring query and meaningful 7331 britny spears 6633 briteny spears Actual 2696 brittany spears results closer together 1807 briney spears queres 1635 brittny spears gathered 1479 brintey spears bv 479 britanny spears 338 britiny spears Google 1211 bitnet spears 1096 britiney spears 2/2/2021 Topk Search(@ ICDE2013 4
Web Search 2/2/2021 TopkSearch @ ICDE2013 4/42 Errors in queries Errors in data Bring query and meaningful results closer together Actual queries gathered by Google
Example: a movie database Iron man The man Find movies starred Samuel Jackson Star Title Year Genre Keanu reeves The matrix 1999Scⅰ-Fi Samuel Jackson Iron man 2008Sc|-Fi Schwarzenegger The Terminator 1984Scⅰ-Fi Samuel Jackson「 The man 2006 Crime
5 Example: a movie database Star Title Year Genre Keanu Reeves The Matrix 1999 Sci-Fi Samuel Jackson Iron man 2008 Sci-Fi Schwarzenegger The Terminator 1984 Sci-Fi Samuel Jackson The man 2006 Crime Find movies starred Samuel Jackson Iron man The man
Query: Schwarzenegger? The user doesnt know the exact spelling Star Title Year Genre Keanu reeves The matrix 1999Sc|-Fi Samuel jackson Iron man 2008sc|-Fi Schwarzenegger The Terminator 1984 Sci-Fi Samuel Jackson The man 2006Crime
6 Query: Schwarzenegger? Star Title Year Genre Keanu Reeves The Matrix 1999 Sci-Fi Samuel Jackson Iron man 2008 Sci-Fi Schwarzenegger The Terminator 1984 Sci-Fi Samuel Jackson The man 2006 Crime The user doesn’t know the exact spelling!
Relaxing Conditions Find movies with a star similar to Schwarrzenger Star Title ear Genre Keanu reeves The matrix 1999 Sci-Fi Samuel Jackson Iron man 2008 Sci-Fi Schwarzeneggerpfrhe Terminator 1984 Sci-Fi Samuel Jackson The man 2006 Crime
7 Relaxing Conditions Star Title Year Genre Keanu Reeves The Matrix 1999 Sci-Fi Samuel Jackson Iron man 2008 Sci-Fi Schwarzenegger The Terminator 1984 Sci-Fi Samuel Jackson The man 2006 Crime Find movies with a star “similar to” Schwarrzenger
String Similarity Search String Similarity Search finds all entries from the dictionary that approximately match the query. ° Applications: Biology, Bioinformatics Information retrieve Data Quality, Data Cleaning 2/2/2021 Topk Search(@ ICD E2013
String Similarity Search finds all entries from the dictionary that approximately match the query. Applications: Biology, Bioinformatics Information Retrieve Data Quality, Data Cleaning …. String Similarity Search 2/2/2021 TopkSearch @ ICDE2013 8/42
Outline ● Motivation ● Problem formulation o Progressive Framework Pivotal Entry-based Method ● Range- based method ° Experiment ● Conclusion 2/2/2021 Topk Search(@ ICD E2013 9/42
Outline Motivation Problem Formulation Progressive Framework Pivotal Entry-based Method Range-based Method Experiment Conclusion 2/2/2021 TopkSearch @ ICDE2013 9/42
Problem formulation o Top-k String Similarity Search: Given a string set s and a query string g, top-k string similarity search returns a string setr sS such that r/=k and for any string reR and sES-R, ED( q s ED(s, q) TABLE I A STRING SET S AND A QUERY q="srajit ID s2 S3 S4 S5 S6 String (sarit (erajD suijt suit(urajitthrifty the top-3 similar strings of srajit 2/2/2021 Topk Search(@ ICD E2013
Problem Formulation 2/2/2021 TopkSearch @ ICDE2013 Top-k String Similarity Search: Given a string set S and a query string q, top-k string similarity search returns a string set R ⊆ S such that |R|=k and for any string r∈ R and s∈ S − R, ED(r, q) ≤ ED(s, q). 10/42 the top-3 similar strings of srajit