当前位置:高等教育资讯网  >  中国高校课件下载中心  >  大学文库  >  浏览文档

中国人民大学:Similarity Measures in Deep Web Data Integration

资源类别:文库,文档格式:PPT,文档页数:26,文件大小:252KB,团购合买
点击下载完整版文档(PPT)

Similarity Measures in Deep Web Data Integration Fangliao Jiang

Similarity Measures in Deep Web Data Integration Fangjiao Jiang

Outline Motivation Brief review on Existing Similarity Measures Challenges and Our Solutions Conclusion

Outline ◼ Motivation ◼ Brief Review on Existing Similarity Measures ◼ Challenges and Our Solutions ◼ Conclusion

Outline Motivation Brief Review on Existing Similarity Measures Challenges and Our Solutions Conclusion

Outline ◼ Motivation ◼ Brief Review on Existing Similarity Measures ◼ Challenges and Our Solutions ◼ Conclusion

Similarity measure an essential point in data integration Variations from Representation Typographical errors, misspellings, abbreviations, etc Extraction From unstructured or semi-unstructured documents or web pages 44W. 4th st 44 West fourth Street Smith Abroms “KFC Smoth abrams "Kentucky fried chicken' fR. Smith" i Richard smith'h

Similarity measure — an essential point in data integration Variations from: ◼ Representation Typographical errors, misspellings, abbreviations, etc ◼ Extraction From unstructured or semi-unstructured documents or web pages Smith Smoth 44 W. 4th St. 44 West Fourth Street "R. Smith" " Richard Smith" Abroms Abrams “KFC" “Kentucky Fried Chicken

Similarity measure an essential point in data integration Similarity measure will be applied to I Keyword search From key word query interface to structured query interface Schema matching From integrated query interface to local query interface Result merge Duplicate records detection(field level) Q={keyl,key2,…} Record record 1 Integrated Interface=k,, Record Record2 ocal Interface=, , Record3Record3 Record4→ Record4

Similarity measure — an essential point in data integration ◼ Similarity measure will be applied to: ◼ Keyword search From keyword query interface to structured query interface ◼ Schema matching From integrated query interface to local query interface ◼ Result merge Duplicate records detection (field level ) Q ={ , , …} Integrated Interface={ , ,…} Local Interface={ , ,…} key1 key2 Record1 Record2 Record3 Record4 Record1 Record2 Record3 Record4

Outline Motivation Brief review on Existing Similarity Measures Challenges and Our Solutions Conclusion

Outline ◼ Motivation ◼ Brief Review on Existing Similarity Measures ◼ Challenges and Our Solutions ◼ Conclusion

Similarity methods Similarity methods String Similarity Numeric Data Similarity Character-based Token-based Treated as strings Edit distance Atomic strings Affine gap distance WHIRL Smith -Waterman distance Q-grams with tf idf Jaro distance metric Q-gram distance

Similarity methods Similarity methods String Similarity Numeric Data Similarity Character-based Token-based Edit distance Affine gap distance Smith-Waterman distance Jaro distance metric Q-gram distance Atomic strings WHIRL Q-grams with tf.idf Treated as strings

edit distance edit distance, a.k.a. Levenshtein distance The minimum number of edit operations(insertions, deletions and substitutions) of single characters needed to transform the string SI into $2 E xample SI: unne cessarily Edit distance(s1, s2)=3 S2: un escessaraly O(SI, S2D Problem last names first names and street names that did not agree on a character-by- character basis For example: Similarity (John R Smith, Johathan Richard Smith)=11

Edit distance ◼ Edit distance, a.k.a. Levenshtein distance The minimum number of edit operations (insertions, deletions, and substitutions) of single characters needed to transform the string S1 into S2. ◼ Problem: last names, first names, and street names that did not agree on a character-by￾character basis For example: Similarity(John R.Smith,Johathan Richard Smith)=11 ◼ Example1: S1:unne cessarily Edit distance(S1,S2)=3 S2:un escessaraly O(|S1| ,|S2|)

Affine gap distance Two extra edit operations: open gap and extend gap cost(g)=S+e×l(e<s), where s is the cost of opening a gap e is the cost of extending a gap, and l is the length of a gap in the alignment of two strings Example2(Affine gap distance) F.R.Smith o hn richard smi ith This method is better when matching strings have been truncated or shortened

Affine gap distance ◼ Two extra edit operations: open gap and extend gap cost(g) =s + e × l ( e<s ), where s is the cost of opening a gap, e is the cost of extending a gap, and l is the length of a gap in the alignment of two strings ◼ Example2 (Affine gap distance): ◼ This method is better when matching strings have been truncated or shortened "J. R. S m i t h“ " J o h n R i c h a r d S m i t h

Smith-Waterman distance Extension of edit distance and affine gap distance Mismatches at the beginning and the end of strings have lower cost than mismatches in the middle Example 3 ProfJohn r smith, University of Calgary" Johnr. smith Prof

Smith-Waterman distance ◼ Extension of edit distance and affine gap distance Mismatches at the beginning and the end of strings have lower cost than mismatches in the middle. ◼ Example 3 : “Prof.John R.Smith,University of Calgary“ “John R.Smith, Prof

点击下载完整版文档(PPT)VIP每日下载上限内不扣除下载券和下载次数;
按次数下载不扣除下载券;
24小时内重复下载只扣除一次;
顺序:VIP每日次数-->可用次数-->下载券;
共26页,试读已结束,阅读完整版请下载
相关文档

关于我们|帮助中心|下载说明|相关软件|意见反馈|联系我们

Copyright © 2008-现在 cucdc.com 高等教育资讯网 版权所有