当前位置:高等教育资讯网  >  中国高校课件下载中心  >  大学文库  >  浏览文档

清华大学:A Pivotal Prefix Based Filtering Algorithm for String Similarity Search(PPT讲稿)

资源类别:文库,文档格式:PPTX,文档页数:54,文件大小:1.41MB,团购合买
点击下载完整版文档(PPTX)

snowbird UT. USA. 2014 A Pivotal Prefix Based Filtering Algorithm for String Similarity search Dong deng, guoliang Li, Jianhua Feng Database Group, Tsinghua University 小 1911 Present by dong deng

Dong Deng, Guoliang Li, Jianhua Feng Database Group, Tsinghua University Present by Dong Deng A Pivotal Prefix Based Filtering Algorithm for String Similarity Search

Search is Important Google Searches per Year 1,600,000,000,000 ■ Search queries 1,200,000,000,000 800000,000,000 40,000,000,000 19992000200120022003200420052006200720082009201020112012 Google searches per Year Source:http://www.internetlivestats.com/google-search-statistics/

Search is Important Source: http://www.internetlivestats.com/google-search-statistics/ Google Searches per Year

Speed matters But when questions aren' t answered quickly, people ask less GOOGLE FOUND THAT SLOWING SEARCH RESULTS BY JUST 4/10THS OF A SECOND would reduce the number of searches by 8,000,000 a d Source

Speed Matters Source:

Data is dirty DBLP Complete Search ypos Argyrios zymnis 008 2EE Argyrios Zymis, Stephen P. Boyd, Dimitry Ml. Gorinevsky: Mixed state estimation for a linear Gaussi an Markov model. CDC2008:3219-3226 I EE Argyris Zymmis, Stephen P. Boyd, Dimitry M. Gorinevsky: Mixed state estimation for a linear Gaussian Markov model. cAo00:1011 argyris Zymnis relaxed E Yai I, Moses Charikar, Piotr Indyk: On page migration and otherrelaxed task systems. Theor. Comput. Sci. Tcs)268(1):43-6(2001) 1997 1 EE Yair Bartal, Moses Charikar, Piotr Indyk: On Page Migration and Other Related Task Systems. SODA 1997: 43-52 related

Data is Dirty • Typos • Typo in “title” relaxed related Argyrios Zymnis Argyris Zymnis DBLP Complete Search

Similarity Search Query All the strings similar to the query String Dataset

Similarity Search Query String Dataset All the strings similar to the query

Edit distance ED( s): The min number of edit operations (insertion/deletion/substitution) needed to transform r to s For example: ED(sigcom, sigmod )=2 sitcom substitute c with m sIemon substitute m with d sigmod

• ED(r, s): The min number of edit operations (insertion/deletion/substitution) needed to transform r to s. • For example: ED(sigcom, sigmod) = 2 Edit Distance sigcom sigmom sigmod substitute c with m substitute m with d

Problem definition DEFINITION 1 (STRING SIMILARITY SEARCH). Given a string dataset R, a query string s, and an edit-distance thresh old T, the string similarity search with edit-distance con- straints finds all strings r ER such that ed(r,S)<T id string Query string lmyouteca s= yotubecom"andτ=2 r2 ubuntucom r3 utubbecou r4 youtbecom r5 yoytubeca ed(s, r4<=2 output ra as a result string dataset R

Problem Definition Query string s = “yotubecom” and τ = 2 string dataset R ed(s, r4 ) <= 2 output r4 as a result

Application Spell checking Copy detection Entity Linking ·Bⅰ informatic

Application • Spell Checking • Copy Detection • Entity Linking • Bioinformatic …

Challenge Naive method Time complexity: for each query O(RIs t

Challenge Naïve Method Time complexity: for each query 𝑂 |𝑅| |𝑠| τ

Filter-and-Verification framework Query string s Filter: Index No es Dataset r Signature()∩ Verify: Results Signature (r)=o? ED(rs)≤τ? Threshold T Pruning power Filtering Cost depends on depends on Sig(r)Ix(sig(s)) and Index: (sig(r)I(Probe set size) min((sig(r)(Sig(s) No Index: (sig(r)+(sig(s)I

No Filter-and-Verification Framework Dataset R Threshold τ Query string s Results Filter: Signature(s) ∩ Signature(r) = ϕ? Verify: ED(r,s) ≤ τ? Yes Index

点击下载完整版文档(PPTX)VIP每日下载上限内不扣除下载券和下载次数;
按次数下载不扣除下载券;
24小时内重复下载只扣除一次;
顺序:VIP每日次数-->可用次数-->下载券;
共54页,可试读18页,点击继续阅读 ↓↓
相关文档

关于我们|帮助中心|下载说明|相关软件|意见反馈|联系我们

Copyright © 2008-现在 cucdc.com 高等教育资讯网 版权所有