Lecture 3 HASHING
Lecture 3 HASHING
Why we need HASHING? Wal-Mart:267 million items/day;4PB data warehouse Sloan Digital Sky Survey:New Mexico telescope captures 200 GB image data/day T吉 nature Science The FOURTH PARADIGM DATA-TENS SCIENCEIN THE PETABYTE ERA data Challenge in big data applications: 1.Curse of dimensionality 2.Storage cost 3.Query speed
Challenge in big data applications: 1. Curse of dimensionality 2. Storage cost 3. Query speed • Wal-Mart: 267 million items/day; 4PB data warehouse • Sloan Digital Sky Survey: New Mexico telescope captures 200 GB image data/day Why we need HASHING?
Example 1.Information Retrieval h(Statue of Liberty)= h (Napoleon)= h (Napoleon)= 10001010 01100001 011001Q1 flipped bit Should be very different Should be similar
Example 1. Information Retrieval
Example 2.Storage Cost Gist vector Binary reduction 10 million images 20 GB 160MB 512values 128bits
Example 2. Storage Cost
Example 3.Fast Nearest Neighbor Search Given a query point g(high dimensional),return the points closest(similar)to g in the database. ● 98 0 KD-TREE KD-tree cannot handle high-dimensional data
Example 3. Fast Nearest Neighbor Search Given a query point q (high dimensional), return the points closest (similar) to q in the database. KD-TREE KD-tree cannot handle high-dimensional data
WHAT WILL WE TALK? 1.Locality-Sensitive Hashing (Shingling+MinHash) 2.Learning to Hash 7
7 1. Locality-Sensitive Hashing (Shingling+ MinHash) 2. Learning to Hash WHAT WILL WE TALK?
Locality-Sensitive Hashing Find Similar Items
Locality-Sensitive Hashing Find Similar Items
Introduction Many Web-mining problems can be expressed as finding "similar"sets: 1.Pages with similar words,e.g.,for classification by topic. 2.NetFlix users with similar tastes in movies,for recommendation systems. 3.Movies with similar sets of fans. 4.Images of related things. 9
9 Many Web-mining problems can be expressed as finding “similar” sets: 1. Pages with similar words, e.g., for classification by topic. 2. NetFlix users with similar tastes in movies, for recommendation systems. 3. Movies with similar sets of fans. 4. Images of related things. Introduction Introduction
CASE STUDY Finding Similar Documents
CASE STUDY Finding Similar Documents
Given a body of documents,e.g.,the Web,find pairs of documents with a lot of text in common, e.g.: -Mirror sites,or approximate mirrors. Application:Don't want to show both in a search -Plagiarism,including large quotations. -Similar news articles at many news sites. Application:Cluster articles by "same story." 11
11 • Given a body of documents, e.g., the Web, find pairs of documents with a lot of text in common, e.g.: – Mirror sites, or approximate mirrors. • Application: Don’t want to show both in a search. – Plagiarism, including large quotations. – Similar news articles at many news sites. • Application: Cluster articles by “same story.” Introduction