
Mining of Massive Web Data
Lecture 54: Introduction to Web Information Retrieval
More material: http://web.stanford.edu/class/cs276/
School of Computer Science and Technology, Wuhan University of Technology

Lecture 14: Introduction to Web Information Retrieval
• Introduction
• Information Retrieval
• Web Search
• IR History

Information Retrieval (IR)
• The indexing and retrieval of textual documents.
• Searching for pages on the World Wide Web is the most recent “killer app.”
• Concerned firstly with retrieving documents relevant to a query.
• Concerned secondly with retrieving efficiently from large sets of documents.

Typical IR Task
• Given:
  - A corpus of textual natural-language documents.
  - A user query in the form of a textual string.
• Find:
  - A ranked set of documents that are relevant to the query.

IR System
[Figure: Query string + document corpus → IR system → ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]

Relevance
• Relevance is a subjective judgment and may include:
  - Being on the proper subject.
  - Being timely (recent information).
  - Being authoritative (from a trusted source).
  - Satisfying the goals of the user and his/her intended use of the information (information need).

Keyword Search
• The simplest notion of relevance is that the query string appears verbatim in the document.
• A slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
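The two notions of relevance above can be sketched in a few lines of Python. This is only an illustrative sketch, not a method from the lecture: the tokenizer, the function names (`verbatim_match`, `bag_of_words_score`, `rank`), the scoring by raw word counts, and the three-document corpus are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def verbatim_match(query, document):
    """Strict notion: the query string appears verbatim (ignoring case)."""
    return query.lower() in document.lower()


def bag_of_words_score(query, document):
    """Looser notion: how often the query words occur, in any order."""
    counts = Counter(tokenize(document))
    return sum(counts[word] for word in set(tokenize(query)))


def rank(query, corpus):
    """Return (doc_id, score) pairs with a positive score, best first."""
    scored = [(doc_id, bag_of_words_score(query, text))
              for doc_id, text in corpus.items()]
    hits = [pair for pair in scored if pair[1] > 0]
    # Sort by descending score; break ties by document id for stable output.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical toy corpus, invented for this example.
corpus = {
    "doc1": "Web search applies information retrieval to the web.",
    "doc2": "Retrieval of information from large document sets must be efficient.",
    "doc3": "A spider assembles the document corpus.",
}

ranking = rank("information retrieval", corpus)
```

Here `doc2` fails the verbatim test because its words are in a different order, yet it still scores under the bag-of-words notion, which is the difference between the two bullets.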

Problems with Keywords
• May not retrieve relevant documents that include synonymous terms.
  - “restaurant” vs. “café”
  - “PRC” vs. “China”
• May retrieve irrelevant documents that include ambiguous terms.
  - “bat” (baseball vs. mammal)
  - “Apple” (company vs. fruit)
  - “bit” (unit of data vs. act of eating)
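Both failure modes can be reproduced with a naive keyword matcher. The sketch below is an assumption-laden illustration: `keyword_hits` and the three toy documents are invented for the example, and matching is reduced to "shares at least one word with the query".

```python
import re


def tokenize(text):
    """Return the set of lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_hits(query, corpus):
    """Return sorted ids of documents sharing at least one word with the query."""
    query_words = tokenize(query)
    return sorted(doc_id for doc_id, text in corpus.items()
                  if query_words & tokenize(text))


# Hypothetical toy corpus, invented for this example.
corpus = {
    "cafe_review": "A quiet cafe with good coffee near the station.",
    "baseball": "He swung the bat and hit a home run.",
    "zoology": "The bat is a mammal that can fly.",
}

# Synonym problem: a relevant document is missed because it says "cafe".
synonym_miss = keyword_hits("restaurant", corpus)

# Ambiguity problem: both senses of "bat" are returned for one query.
ambiguous = keyword_hits("bat", corpus)
```

The query "restaurant" returns nothing even though the cafe review is plausibly relevant, while "bat" returns both the baseball and the zoology document regardless of which sense the user meant.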

Web Search
• Application of IR to HTML documents on the World Wide Web.
• Differences:
  - Must assemble the document corpus by spidering the web.
  - Can exploit the structural layout information in HTML (XML).
  - Documents change uncontrollably.
  - Can exploit the link structure of the web.

Web Search System
[Figure: Web → spider → document corpus; query string + document corpus → IR system → ranked documents (1. Page1, 2. Page2, 3. Page3, ...)]
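The spider stage of this pipeline can be sketched as a breadth-first traversal that follows links and stores each page's text in a corpus. This is a minimal sketch under stated assumptions: the "web" is a hypothetical in-memory dictionary (`TOY_WEB`) rather than real HTTP fetching, and the names `LinkAndTextParser` and `spider` are invented here. Only Python's standard `html.parser` and `collections.deque` are used.

```python
from collections import deque
from html.parser import HTMLParser


class LinkAndTextParser(HTMLParser):
    """Collect the href targets of <a> tags and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


# Hypothetical in-memory "web" (URL -> HTML), standing in for real fetching.
TOY_WEB = {
    "page1": '<html><body><h1>Home</h1><a href="page2">IR</a> '
             '<a href="page3">Spiders</a></body></html>',
    "page2": '<html><body><p>Information retrieval</p>'
             '<a href="page1">Home</a></body></html>',
    "page3": '<html><body><p>Spidering the web</p><a href="page2">IR</a>'
             '<a href="missing">Dead link</a></body></html>',
}


def spider(seed, fetch):
    """Crawl breadth-first from seed; return a {url: page text} corpus."""
    corpus = {}
    queue = deque([seed])
    seen = {seed}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # unreachable page: skip it
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        parser.close()
        corpus[url] = " ".join(parser.text_parts)
        for link in parser.links:
            if link not in seen:  # visit each URL at most once
                seen.add(link)
                queue.append(link)
    return corpus


corpus = spider("page1", TOY_WEB.get)
```

The resulting `corpus` dictionary is what the IR system would then index and search. A real spider would also need politeness delays, URL normalization, and re-crawling, since documents on the web change uncontrollably.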