Evaluating Search Engines (Chapter 8 of the book Search Engines: Information Retrieval in Practice) http://net.pku.edu.cn/~course/cs410/2011/ Hongfei Yan, School of EECS, Peking University, 3/28/2011. Refer to the book’s slides.
08: Evaluating Search Engines 8.1 Why Evaluate 8.2 The Evaluation Corpus 8.3 Logging (+) 8.4 Effectiveness Metrics (+) 8.5 Efficiency Metrics 8.6 Training, Testing, and Statistics (+) 8.7 The Bottom Line (skip) 3/N
Search engine design and the core information retrieval issues Relevance -Effective ranking Evaluation -Testing and measuring Information needs -User interaction Performance -Efficient search and indexing Incorporating new data -Coverage and freshness Scalability -Growing with data and users Adaptability -Tuning for applications Specific problems -E.g., spam 4/N
Evaluation • Evaluation is key to building effective and efficient search engines – measurement usually carried out in controlled laboratory experiments – online testing can also be done • Effectiveness, efficiency and cost are related – e.g., if we want a particular level of effectiveness and efficiency, this will determine the cost of the system configuration – efficiency and cost targets may impact effectiveness 5/N
08: Evaluating Search Engines 8.1 Why Evaluate 8.2 The Evaluation Corpus 8.3 Logging 8.4 Effectiveness Metrics 8.5 Efficiency Metrics 8.6 Training, Testing, and Statistics 6/N
Evaluation Corpus • Test collections consisting of documents, queries, and relevance judgments, e.g., – CACM: Titles and abstracts from the Communications of the ACM from 1958-1979. Queries and relevance judgments generated by computer scientists. – AP: Associated Press newswire documents from 1988-1990 (from TREC disks 1-3). Queries are the title fields from TREC topics 51-150. Topics and relevance judgments generated by government information analysts. – GOV2: Web pages crawled from websites in the .gov domain during early 2004. Queries are the title fields from TREC topics 701-850. Topics and relevance judgments generated by government analysts. 7/N
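As a concrete illustration (not from the book), the relevance judgments of such a collection are commonly distributed as a TREC-style qrels file with one judgment per line. A minimal Python sketch for loading it, assuming the usual "topic iteration doc-id relevance" layout; the file name and the document id in the comment are hypothetical:

```python
# Minimal sketch: load TREC-style relevance judgments (qrels) into a dict.
# Assumes lines of the form "<topic_id> <iteration> <doc_id> <relevance>".
from collections import defaultdict

def load_qrels(path):
    """Map each topic id to a dict of {doc_id: relevance judgment}."""
    qrels = defaultdict(dict)
    with open(path) as f:
        for line in f:
            topic_id, _iteration, doc_id, relevance = line.split()
            qrels[topic_id][doc_id] = int(relevance)
    return qrels

# Hypothetical usage:
# qrels = load_qrels("qrels.terabyte.txt")
# print(qrels["794"])   # e.g. {"GX001-23-1234567": 1, ...}
```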
Test Collections

Collection   Number of documents   Size      Average number of words/doc
CACM         3,204                 2.2 Mb    64
AP           242,918               0.7 Gb    474
GOV2         25,205,179            426 Gb    1073

Collection   Number of queries   Average number of words/query   Average number of relevant docs/query
CACM         64                  13.0                            16
AP           100                 4.3                             220
GOV2         150                 3.1                             180

8/N
TREC Topic Example

Number: 794
Title: pet therapy
Description: How are pets or animals used in therapy for humans and what are the benefits?
Narrative: Relevant documents must include details of how pet- or animal-assisted therapy is or has been used. Relevant details include information about pet therapy programs, descriptions of the circumstances in which pet therapy is used, the benefits of this type of therapy, the degree of success of this therapy, and any laws or regulations governing it.

9/N
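In the distributed topic files these fields are marked with SGML-style tags (<top>, <num>, <title>, <desc>, <narr>). A minimal sketch of pulling the fields out of one topic under that assumption; the regular expressions and the file name in the comment are illustrative only:

```python
# Minimal sketch: extract the fields of one TREC <top> block.
import re

def parse_topic(text):
    """Return number, title, description, and narrative of a single topic."""
    return {
        "number": re.search(r"<num>\s*Number:\s*(\d+)", text).group(1),
        "title": re.search(r"<title>\s*(.*?)\s*<desc>", text, re.S).group(1),
        "description": re.search(r"<desc>\s*Description:\s*(.*?)\s*<narr>", text, re.S).group(1),
        "narrative": re.search(r"<narr>\s*Narrative:\s*(.*?)\s*</top>", text, re.S).group(1),
    }

# Hypothetical usage:
# topic = parse_topic(open("topic.794.txt").read())
# print(topic["title"])   # "pet therapy"
```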
Relevance Judgments • Obtaining relevance judgments is an expensive, time-consuming process – who does it? – what are the instructions? – what is the level of agreement? • TREC judgments – depend on task being evaluated – generally binary – agreement good because of “narrative” 10/N
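The slide only notes that agreement between judges is good; as an illustration of how agreement can be quantified (a common choice, not one prescribed by the book), here is a minimal sketch of Cohen's kappa over two judges' binary labels for the same documents, with hypothetical judgments:

```python
def cohen_kappa(judge_a, judge_b):
    """Cohen's kappa for two judges' binary relevance labels on the same documents."""
    docs = sorted(set(judge_a) & set(judge_b))
    n = len(docs)
    observed = sum(judge_a[d] == judge_b[d] for d in docs) / n
    # chance agreement from each judge's marginal rate of judging "relevant"
    p_a = sum(judge_a[d] for d in docs) / n
    p_b = sum(judge_b[d] for d in docs) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

# Hypothetical binary judgments (1 = relevant, 0 = not relevant) for five documents
a = {"d1": 1, "d2": 1, "d3": 0, "d4": 0, "d5": 1}
b = {"d1": 1, "d2": 0, "d3": 0, "d4": 0, "d5": 1}
print(round(cohen_kappa(a, b), 2))  # 0.62
```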
Pooling • Exhaustive judgments for all documents in a collection is not practical • Pooling technique is used in TREC – top k results (for TREC, k varied between 50 and 200) from the rankings obtained by different search engines (or retrieval algorithms) are merged into a pool – duplicates are removed – documents are presented in some random order to the relevance judges • Produces a large number of relevance judgments for each query, although still incomplete 11/N
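A minimal sketch of the pooling procedure just described: take the top k documents from each system's ranking, merge them, remove duplicates, and shuffle before showing the pool to the judges. The runs and the value of k below are hypothetical:

```python
import random

def build_pool(rankings, k=100):
    """rankings: list of ranked doc-id lists, one per search engine or retrieval algorithm."""
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:k])   # top k from each run
    pool = list(pool)              # duplicates already removed by the set
    random.shuffle(pool)           # random order hides which system returned each document
    return pool

# Hypothetical runs from three retrieval algorithms
run1 = ["d3", "d1", "d7", "d2"]
run2 = ["d1", "d5", "d3", "d9"]
run3 = ["d8", "d1", "d2", "d4"]
print(build_pool([run1, run2, run3], k=3))
```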