当前位置:高等教育资讯网  >  中国高校课件下载中心  >  大学文库  >  浏览文档

电子科技大学:《大数据分析与挖掘 Big Data Analysis and Mining》课程教学资源(课件讲稿)Lecture 3 Hashing

资源类别:文库,文档格式:PDF,文档页数:61,文件大小:2.11MB,团购合买
Locality-Sensitive Hashing Find Similar Items CASE STUDY Finding Similar Documents Min-Hashing Locality-Sensitive Hashing Learn to Hash
点击下载完整版文档(PDF)

Lecture 3 HASHING

Lecture 3 HASHING

Why we need HASHING? Wal-Mart:267 million items/day;4PB data warehouse Sloan Digital Sky Survey:New Mexico telescope captures 200 GB image data/day T吉 nature Science The FOURTH PARADIGM DATA-TENS SCIENCEIN THE PETABYTE ERA data Challenge in big data applications: 1.Curse of dimensionality 2.Storage cost 3.Query speed

Challenge in big data applications: 1. Curse of dimensionality 2. Storage cost 3. Query speed • Wal-Mart: 267 million items/day; 4PB data warehouse • Sloan Digital Sky Survey: New Mexico telescope captures 200 GB image data/day Why we need HASHING?

Example 1.Information Retrieval h(Statue of Liberty)= h (Napoleon)= h (Napoleon)= 10001010 01100001 011001Q1 flipped bit Should be very different Should be similar

Example 1. Information Retrieval

Example 2.Storage Cost Gist vector Binary reduction 10 million images 20 GB 160MB 512values 128bits

Example 2. Storage Cost

Example 3.Fast Nearest Neighbor Search Given a query point g(high dimensional),return the points closest(similar)to g in the database. ● 98 0 KD-TREE KD-tree cannot handle high-dimensional data

Example 3. Fast Nearest Neighbor Search Given a query point q (high dimensional), return the points closest (similar) to q in the database. KD-TREE KD-tree cannot handle high-dimensional data

WHAT WILL WE TALK? 1.Locality-Sensitive Hashing (Shingling+MinHash) 2.Learning to Hash 7

7 1. Locality-Sensitive Hashing (Shingling+ MinHash) 2. Learning to Hash WHAT WILL WE TALK?

Locality-Sensitive Hashing Find Similar Items

Locality-Sensitive Hashing Find Similar Items

Introduction Many Web-mining problems can be expressed as finding "similar"sets: 1.Pages with similar words,e.g.,for classification by topic. 2.NetFlix users with similar tastes in movies,for recommendation systems. 3.Movies with similar sets of fans. 4.Images of related things. 9

9 Many Web-mining problems can be expressed as finding “similar” sets: 1. Pages with similar words, e.g., for classification by topic. 2. NetFlix users with similar tastes in movies, for recommendation systems. 3. Movies with similar sets of fans. 4. Images of related things. Introduction Introduction

CASE STUDY Finding Similar Documents

CASE STUDY Finding Similar Documents

Given a body of documents,e.g.,the Web,find pairs of documents with a lot of text in common, e.g.: -Mirror sites,or approximate mirrors. Application:Don't want to show both in a search -Plagiarism,including large quotations. -Similar news articles at many news sites. Application:Cluster articles by "same story." 11

11 • Given a body of documents, e.g., the Web, find pairs of documents with a lot of text in common, e.g.: – Mirror sites, or approximate mirrors. • Application: Don’t want to show both in a search. – Plagiarism, including large quotations. – Similar news articles at many news sites. • Application: Cluster articles by “same story.” Introduction

点击下载完整版文档(PDF)VIP每日下载上限内不扣除下载券和下载次数;
按次数下载不扣除下载券;
24小时内重复下载只扣除一次;
顺序:VIP每日次数-->可用次数-->下载券;
共61页,可试读20页,点击继续阅读 ↓↓
相关文档

关于我们|帮助中心|下载说明|相关软件|意见反馈|联系我们

Copyright © 2008-现在 cucdc.com 高等教育资讯网 版权所有