Big Data Integration
Xin Luna Dong (Google Inc.)
Divesh Srivastava (AT&T Labs-Research)
What is “Big Data Integration?”
Big data integration = Big data + data integration
Data integration: easy access to multiple data sources [DHI12]
  – Virtual: mediated schema, query reformulation, link + fuse answers
  – Warehouse: materialized data, easy querying, consistency issues
Big data: all about the V’s ☺
  – Size: large volume of data, collected and analyzed at high velocity
  – Complexity: huge variety of data, of questionable veracity
  – Utility: data of considerable value
What is “Big Data Integration?”
Big data integration = Big data + data integration
Data integration: easy access to multiple data sources [DHI12]
  – Virtual: mediated schema, query reformulation, link + fuse answers
  – Warehouse: materialized data, easy querying, consistency issues
Big data in the context of data integration: still about the V’s ☺
  – Size: large volume of sources, changing at high velocity
  – Complexity: huge variety of sources, of questionable veracity
  – Utility: sources of considerable value
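The virtual approach above (mediated schema, query reformulation, link + fuse answers) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two toy sources, the attribute names, the `MAPPINGS` table, and the helpers `reformulate` and `query` are hypothetical, linkage is a naive match on normalized names, and fusion simply keeps the first value seen.

```python
from collections import defaultdict

# Hypothetical sources, each exposing its own local schema.
SOURCES = {
    "A": [{"title": "Taj Mahal", "town": "Agra"}],
    "B": [{"name": "Taj Mahal", "nation": "India"}],
}

# Assumed mappings: mediated-schema attribute -> local attribute, per source.
MAPPINGS = {
    "A": {"name": "title", "city": "town"},
    "B": {"name": "name", "country": "nation"},
}


def reformulate(attrs, source_id):
    """Rewrite mediated attributes into the local attributes a source understands."""
    mapping = MAPPINGS[source_id]
    return {a: mapping[a] for a in attrs if a in mapping}


def query(attrs):
    """Answer a mediated-schema query: ask every source, then link and fuse."""
    fused = defaultdict(dict)
    for source_id, rows in SOURCES.items():
        local = reformulate(["name"] + list(attrs), source_id)
        if "name" not in local:
            continue  # this source cannot be linked on name
        for row in rows:
            # Link: naive entity key based on the normalized name.
            key = row[local["name"]].strip().lower()
            for mediated_attr, local_attr in local.items():
                if local_attr in row:
                    # Fuse: keep the first value seen for each attribute.
                    fused[key].setdefault(mediated_attr, row[local_attr])
    return dict(fused)


result = query(["city", "country"])
print(result)
```

A warehouse approach would instead run the same mapping and fusion once, ahead of time, and store the fused records; queries then become simple lookups, at the cost of keeping the stored copy consistent with the sources.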
Outline
Motivation
  – Why do we need big data integration?
  – How has “small” data integration been done?
  – Challenges in big data integration
Schema alignment
Record linkage
Data fusion
Emerging topics
Why Do We Need “Big Data Integration?”
Building web-scale knowledge bases
[Figure: logos of web-scale knowledge bases: Google knowledge graph, MSR knowledge base, ProBase, Freebase, NELL; tagline “A Little Knowledge Goes a Long Way”]
Why Do We Need “Big Data Integration?”
Reasoning over linked data
Why Do We Need “Big Data Integration?”
Geo-spatial data fusion
http://axiomamuse.wordpress.com/2011/04/18/
Why Do We Need “Big Data Integration?”
Scientific data analysis
[Figure labels: Genes, Genotypes, Disease Models, Expression, Recombinases (cre), Function, Pathways, Strains/SNPs, Orthology, Tumors]
http://scienceline.org/2012/01/from-index-cards-to-information-overload/
“Small” Data Integration: What Is It?
Data integration = solving lots of jigsaw puzzles
  – Each jigsaw puzzle (e.g., Taj Mahal) is an integrated entity
  – Each piece of a puzzle comes from some source
  – Small data integration → solving small puzzles