Pass-Join:A Partition based Method for Similarity Joins Guoliang Li (Tsinghua,China) Dong Deng (Tsinghua,China) Jiannan Wang (Tsinghua,China) h心 -1911- Jianhua Feng (Tsinghua,China)
Guoliang Li (Tsinghua, China) Dong Deng (Tsinghua, China) Jiannan Wang (Tsinghua, China) Jianhua Feng (Tsinghua, China)
Real-world Data is Rather Dirty! a DBLP Complete search ● Typo in“aut Argyrios zymnis 008 2EE Argyrios Zymis, Stephen P. Boyd, Dimitry Ml. Gorinevsky: Mixed state estimation for a linear Gaussi an Markov model. CDC2008:3219-3226 I EE Argyris Zymmis, Stephen P. Boyd, Dimitry M. Gorinevsky: Mixed state estimation for a linear Gaussian Markov model. □Awvw:051011 Argyris Zymnis Typo in“ title relaxe Yair Bartal, Moses Charikar, Piotr Indyk: On page migration and otherrelaxed task systems. Theor. Comput. Sci. Tcs)268(1):43-6(2001) 1997 1 EE Yair Bartal, Moses Charikar, Piotr Indyk: On Page Migration and Other Related Task Systems. SODA 1997: 43-52 1/29/2021 related PassJoin a VLDB2012
Typo in “author” Typo in “title” relaxed related Argyrios Zymnis Argyris Zymnis DBLP Complete Search 1/29/2021 Real-world Data is Rather Dirty! PassJoin @ VLDB2012 2
Similarity Join Equal Join Datasets Dataset s Conference Conference VLDB CIDR SIGMOD SIGMOD ICDE PVLDB 20/2021 PassJoin a VLDB2012
Similarity Join 1/29/2021 PassJoin @ VLDB2012 3 Conference VLDB SIGMOD ICDE Conference CIDR SIGMOD PVLDB Dataset R Dataset S Equal Join
Similarity Join Similarity join Datasets Dataset s Conference Conference VLDB CIDR SIGMOD SIGMOD ICDE PVLDB 20/2021 PassJoin a VLDB2012
Similarity Join 1/29/2021 PassJoin @ VLDB2012 4 Conference VLDB SIGMOD ICDE Conference CIDR SIGMOD PVLDB Dataset R Dataset S Similarity Join
Applications o Data Cleaning and Integration Near duplicate object detection o Collaborative Filtering 20/2021 PassJoin a VLDB2012
Applications Data Cleaning and Integration Near Duplicate Object Detection Collaborative Filtering …….. 1/29/2021 PassJoin @ VLDB2012 5
Edit distance ED(r s): The minimum number of single-character edit operations(insertion/deletion/substitution to transform r to s Forexample: ED(hilton, huston)=2 hilton substitute i with u hulton substitute l with u huston Property:ED(r, s)2IIrI-Isl 20/2021 PassJoin a VLDB2012
ED(r, s): The minimum number of single-character edit operations(insertion/deletion/substitution) to transform r to s. For example: ED(hilton, huston) = 2 Property: ED(r, s) ≥ ||r|-|s|| Edit Distance 1/29/2021 PassJoin @ VLDB2012 6 hilton hulton huston substitute i with u substitute l with u
Problem formulation Give threshold t=3 [ID Strings $1 vankatesh S2 avataresha s3 kaushic chaduri S4 kaushik chakra S5 kaushuk chadhui s6=5ED=13ED=12 ED=14ED=12ED=12ED=12 ED=14ED=4ED=8 ED=4ED=8 1/29/2021 PassJoin a VLDB2012
Problem Formulation 1/29/2021 PassJoin @ VLDB2012 7 Give threshold τ=3 ED =5 ED =13 ED =12 ED =12 ED =14 ED =12 ED =12 ED =12 ED =14 ED =5 ED =4 ED =8 ED =4 ED =3 ED =8
Filter-and-refine methods Basic idea e Filter a large number of dissimilar string pairs Verify the remaining potentially similar pairs 20/2021 PassJoin a VLDB2012
Filter-and-refine Methods Basic idea Filter a large number of dissimilar string pairs Verify the remaining potentially similar pairs 1/29/2021 PassJoin @ VLDB2012 8
Filter-and-refine methods Give threshold t=3 DI Strings Length s1 vankatesh 9 Pruning Condition. s2/ avataresha s3 kaushic chaduri 15 S 3 s4 kaushik chakrab15 ss kaushuk chadhui|15 s6 caushik chakrabar|17 ED≤s1,2>=5113B1412E“112 E1当-44E23212E=5ED=4ED=8 ED=4ED=3ED=8 1/29/2021 PassJoin a VLDB2012
Filter-and-refine Methods 1/29/2021 PassJoin @ VLDB2012 9 Give threshold τ=3 ED =5 ED =13 ED =12 ED =12 ED =14 ED =12 ED =12 ED =12 ED =14 ED =5 ED =4 ED =8 ED =4 ED =3 ED =8 Pruning Condition: ||si | - |sj || > 3
Filter-and-refine methods ● Basic idea Filter a large number of dissimilar string pairs Verify the remaining potentially similar pairs Drawbacks o Need to tune parameters Bad for short strings 20/2021 PassJoin a VLDB2012
Filter-and-refine Methods Basic idea Filter a large number of dissimilar string pairs Verify the remaining potentially similar pairs Drawbacks Need to tune parameters Bad for short strings 1/29/2021 PassJoin @ VLDB2012 10