The text of links is treated in a special way in our search engine. Most search engines associate the text of a link with the page that the link is on. In addition, we associate it with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases.
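As a rough illustration of this anchor-text idea (the data layout and function names below are ours for illustration, not from the paper), an indexer can credit a link's anchor words to the link's *target* as well as to the page the link appears on:

```python
from collections import defaultdict

def build_anchor_index(pages):
    """Index each page's own words AND the anchor text of links pointing
    at it, so a target can match queries even if it was never crawled
    (e.g. an image, a program, or a database front-end)."""
    index = defaultdict(set)  # word -> set of URLs
    for url, text, links in pages:
        for word in text.lower().split():
            index[word].add(url)
        # Propagate anchor text to the target of each outgoing link.
        for target_url, anchor_text in links:
            for word in anchor_text.lower().split():
                index[word].add(target_url)
    return index

pages = [
    ("http://a.example", "welcome home page",
     [("http://b.example/logo.gif", "company logo image")]),
]
index = build_anchor_index(pages)
print(index["logo"])  # contains the image URL, though it was never crawled
```

Note that `http://b.example/logo.gif` becomes searchable by the word "logo" purely through anchor propagation — which is also why an uncrawled (or even nonexistent) target can surface in results, as the text above warns.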
This makes it possible to return web pages which have not actually been crawled. Note that pages that have not been crawled can cause problems, since they are never checked for validity before being returned to the user. In this case, the search engine can even return a page that never actually existed, but had hyperlinks pointing to it. However, it is possible to sort the results, so that this particular problem rarely happens.

This idea of propagating anchor text to the page it refers to was implemented in the World Wide Web Worm [McBryan 94] especially because it helps search non-text information, and expands the search coverage with fewer downloaded documents. We use anchor propagation mostly because anchor text can help provide better quality results. Using anchor text efficiently is technically difficult because of the large amounts of data which must be processed. In our current crawl of 24 million pages, we had over 259 million anchors which we indexed.

2.3 Other Features

Aside from PageRank and the use of anchor text, Google has several other features. First, it has location information for all hits and so it makes extensive use of proximity in search. Second, Google keeps track of some visual presentation details such as font size of words. Words in a larger or bolder font are weighted higher than other words. Third, full raw HTML of pages is available in a repository.

3 Related Work

Search research on the web has a short and concise history. The World Wide Web Worm (WWWW) [McBryan 94] was one of the first web search engines. It was subsequently followed by several other academic search engines, many of which are now public companies. Compared to the growth of the Web and the importance of search engines there are precious few documents about recent search engines [Pinkerton 94]. According to Michael Mauldin (chief scientist, Lycos Inc) [Mauldin], "the various services (including Lycos) closely guard the details of these databases".
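The hit-weighting ideas listed in Section 2.3 (position information enabling proximity search, and higher weight for larger fonts) can be pictured with a toy scorer. The specific weights below are invented for illustration; the paper does not give Google's actual formula here:

```python
def score_hits(hits, query_terms):
    """Toy scorer. `hits` maps a term to a list of (position, font_size)
    pairs within one document. Larger fonts count more, and query terms
    that occur close together earn a proximity bonus."""
    score = 0.0
    positions = []
    for term in query_terms:
        for pos, font_size in hits.get(term, []):
            score += font_size / 10.0      # bigger or bolder text weighs more
            positions.append(pos)
    if len(positions) > 1:                 # proximity: smaller gaps score higher
        positions.sort()
        min_gap = min(b - a for a, b in zip(positions, positions[1:]))
        score += 1.0 / (1 + min_gap)
    return score

# Two documents with identical hits except for how far apart the terms fall:
near = {"bill": [(10, 12)], "clinton": [(11, 12)]}
far = {"bill": [(10, 12)], "clinton": [(300, 12)]}
print(score_hits(near, ["bill", "clinton"]) >
      score_hits(far, ["bill", "clinton"]))  # True
```

The point of the sketch is only that storing positions and font sizes per hit is what makes such proximity- and presentation-aware ranking possible at query time.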
However, there has been a fair amount of work on specific features of search engines. Especially well represented is work which can get results by post-processing the results of existing commercial search engines, or produce small scale "individualized" search engines. Finally, there has been a lot of research on information retrieval systems, especially on well controlled collections. In the next two sections, we discuss some areas where this research needs to be extended to work better on the web.

3.1 Information Retrieval

Work in information retrieval systems goes back many years and is well developed [Witten 94]. However, most of the research on information retrieval systems is on small, well controlled, homogeneous collections such as collections of scientific papers or news stories on a related topic. Indeed, the primary benchmark for information retrieval, the Text Retrieval Conference [TREC 96], uses a fairly small, well controlled collection for its benchmarks. The "Very Large Corpus" benchmark is only 20GB compared to the 147GB from our crawl of 24 million web pages. Things that work well on TREC often do not produce good results on the web. For example, the standard vector space model tries to return the document that most closely approximates the query, given that both query and document are vectors defined by their word occurrence. On the web, this strategy often returns very short documents that are the query plus a few words. For example, we have seen a major search engine return a page containing only "Bill Clinton Sucks" and a picture from a "Bill Clinton" query. Some argue that on the web, users should specify more accurately what they want and add more words to their query. We disagree vehemently with this position. If a user issues a query like "Bill Clinton" they should get reasonable results, since there is an enormous amount of high quality information available on this topic.
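The failure mode described above is easy to reproduce with a standard bag-of-words cosine similarity (the documents below are made up for illustration): a page that is essentially the query itself outscores a longer, genuinely informative page.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two texts as word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

query = "bill clinton"
short_doc = "bill clinton sucks"
long_doc = ("bill clinton served as the 42nd president of the "
            "united states and this page discusses his policies")

# The near-empty page is the closer vector to the query:
print(cosine(query, short_doc) > cosine(query, long_doc))  # True
```

Because the long document's extra words enlarge its vector norm, every additional (relevant!) word pushes it further from the query vector — exactly the behavior that makes the pure vector space model a poor fit for short web queries.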
Given examples like these, we believe that the standard information retrieval work needs to be extended to deal effectively with the web.

http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm