Record Linkage for Big Data
Slides from Luna Dong’s VLDB Tutorial
Record Linkage
Matching based on identifying content: color, pattern
Record Linkage: Three Steps [EIV07, GM12]
Record linkage: blocking + pairwise matching + clustering
– Scalability, similarity, semantics
Diagram: Blocking, Pairwise Matching, Clustering
Record Linkage: Three Steps
Blocking: efficiently create small blocks of similar records
– Ensures scalability
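The blocking step can be sketched as follows. This is a minimal illustration, not the tutorial's method: the blocking key (first three letters of the lowercased name) and the sample records are invented for the example. Real systems usually choose or learn keys more carefully.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the lowercased name.
    return record["name"].strip().lower()[:3]


def build_blocks(records):
    # Group record ids by blocking key.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    return dict(blocks)


def candidate_pairs(blocks):
    # Only records that share a block are ever compared.
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Invented sample data.
records = [
    {"id": 1, "name": "Panasonic Lumix DMC-SZ3"},
    {"id": 2, "name": "panasonic lumix dmc sz3"},
    {"id": 3, "name": "Canon PowerShot A2500"},
    {"id": 4, "name": "Canon Powershot A-2500"},
]

blocks = build_blocks(records)
pairs = candidate_pairs(blocks)
all_pairs = len(records) * (len(records) - 1) // 2
print(blocks)
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

With four records the saving is small (2 comparisons instead of 6), but the number of all-pairs comparisons grows quadratically with the number of records, which is why blocking is what makes the later steps scale.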
Record Linkage: Three Steps
Pairwise matching: compares all record pairs in a block
– Computes similarity
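A minimal sketch of pairwise matching inside one block. It assumes a normalized edit-distance similarity and a threshold of 0.8; both are illustrative choices, and the sample names are invented. The slides do not prescribe a particular similarity function.

```python
from itertools import combinations


def levenshtein(a, b):
    # Classic dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    # Normalize the distance into [0, 1]; 1.0 means identical strings.
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def match_block(block, threshold=0.8):
    # Compare every pair of records in the block and keep those above the threshold.
    matches = []
    for (id_a, text_a), (id_b, text_b) in combinations(sorted(block.items()), 2):
        if similarity(text_a, text_b) >= threshold:
            matches.append((id_a, id_b))
    return matches


# Invented sample block: record id -> normalized name.
block = {1: "jon smith", 2: "john smith", 3: "jane smyth"}
print(match_block(block))
```

Because every pair in a block is compared, the cost of this step is quadratic in the block size, which is why the preceding blocking step aims for small blocks.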
Record Linkage: Three Steps
Clustering: groups sets of records into entities
– Ensures semantics
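One simple way to turn pairwise match decisions into entities is to take the transitive closure of the matches, that is, the connected components of the match graph. The sketch below does this with a union-find structure. It is only one possible clustering strategy, chosen here for brevity, and the record ids and pairs are invented.

```python
def cluster(record_ids, matched_pairs):
    # Union-find: each record starts as its own entity.
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the entities of every matched pair.
    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect records by root to form the final entities.
    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(g) for g in groups.values())


# Invented input: ids 1-6 and the pairs judged to match.
entities = cluster([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])
print(entities)
```

Records 1 and 3 end up in the same entity even though they were never matched directly. Transitive closure can also chain together records that should stay apart, which is one reason other clustering methods are used in practice.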
BDI: Record Linkage
Volume: dealing with billions of records
– Map-reduce based record linkage [VCL10, KTR12]
– Adaptive record blocking [DNS+12, MKB12, VN12]
– Blocking in heterogeneous data spaces [PIP+12, PKP+13]
Velocity
– Incremental record linkage [WGM10, WGM13]
BDI: Record Linkage
Variety
– Matching structured and unstructured data [KGA+11, KTT+12]
– Matching Web tables and catalogs [LSC10]
Veracity
– Linking temporal records [LDM+11]
Matching with Unstructured Data
Matching product offers: 1000s of stores, millions of products
– Product offers are terse, unstructured text
– Many similar but different product offers
Example offers (from the product-listing screenshot on the slide):
– Panasonic Lumix DMC-SZ3 16.1 MP Digital Camera - Black; other style options: Violet ($124), White ($125)
– Panasonic Lumix DMC-ZS25 16.1 MP Digital Camera - Silver; other style options: Black ($225)
– Panasonic Lumix DMC-ZS8 14.1 MP Digital Camera - Black; other style options: Silver ($200)
Each offer carries a terse spec line, e.g. for the DMC-ZS8: Point & Shoot, 14.1 megapixel, Compact Sensor, 16x optical zoom, SD Card, Built-in Flash, 6.6 ounce, ISO 6,400
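To illustrate why similar but different offers are hard to tell apart, the sketch below compares two of the Panasonic Lumix titles from the slide using token-level Jaccard similarity, then pulls out the model code with a regular expression. Both the tokenization and the `DMC-...` pattern are assumptions made for this example; this is not the matching approach of [KGA+11] or [KTT+12].

```python
import re


def title_tokens(title):
    # Lowercase, split on whitespace, and drop bare separator dashes.
    return {tok for tok in title.lower().split() if tok != "-"}


def jaccard(a, b):
    # Share of distinct tokens that the two titles have in common.
    ta, tb = title_tokens(a), title_tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def model_code(title):
    # Hypothetical pattern for this product line: "DMC-" + letters + digits.
    found = re.search(r"DMC-[A-Z]+\d+", title)
    return found.group(0) if found else None


offer_a = "Panasonic Lumix DMC-SZ3 16.1 MP Digital Camera - Black"
offer_b = "Panasonic Lumix DMC-ZS25 16.1 MP Digital Camera - Silver"

score = jaccard(offer_a, offer_b)
same_model = model_code(offer_a) == model_code(offer_b)
print(f"title similarity: {score:.2f}, same model code: {same_model}")
```

The two titles share 6 of their 10 distinct tokens, so a purely textual threshold could easily treat them as the same product, while the model codes (DMC-SZ3 and DMC-ZS25) show that they are different cameras.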