We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.

PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web pages can be computed in a few hours on a medium size workstation.
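The iterative calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `pagerank`, the `links` dictionary, the fixed iteration count, and the four-page toy web are all hypothetical, and every page is assumed to have at least one outgoing link. One detail worth noting: with the formula exactly as written, the iteration on such a graph preserves a total of N (the number of pages) rather than 1, so dividing each value by N yields a distribution that sums to one.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to. This sketch
    assumes every page has at least one outgoing link (no dangling pages)
    and that every link target also appears as a key.
    """
    pages = list(links)
    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in pages}
        for source, targets in links.items():
            share = d * pr[source] / len(targets)  # d * PR(T) / C(T)
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr


# Hypothetical four-page web: A and B link to each other and to C,
# C links back to A, and D (which nobody cites) links to C.
toy_links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_links)
total = sum(ranks.values())
normalized = {page: value / total for page, value in ranks.items()}
```

In this toy web, page D receives no citations and so keeps only the baseline value 1 - d, while A, which is cited by the well-cited page C, ends up ranked highest.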
There are many other details which are beyond the scope of this paper.

2.1.2 Intuitive Justification

PageRank can be thought of as a model of user behavior. We assume there is a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back", but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. The damping factor d governs how often, at each page, the "random surfer" gets bored and requests another random page: in the formula above, the surfer follows a link with probability d and jumps to a random page with probability 1-d. One important variation is to only add the damping factor d to a single page, or a group of pages. This allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking. We have several other extensions to PageRank, again see [Page 98].

Another intuitive justification is that a page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well cited from many places around the web are worth looking at. Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not high quality, or was a broken link, it is quite likely that Yahoo's homepage would not link to it. PageRank handles both these cases and everything in between by recursively propagating weights through the link structure of the web.

2.2 Anchor Text

The text of links is treated in a special way in our search engine. Most search engines associate the text of a link with the page that the link is on. In addition, we associate it with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases.
This makes it possible to return web pages which have not actually been crawled. Note that pages that have not been crawled can cause problems, since they are never checked for validity before being returned to the user. In this case, the search engine can even return a page that never actually existed, but had hyperlinks pointing to it. However, it is possible to sort the results, so that this particular problem rarely happens.

http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm
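The idea of associating anchor text with the page a link points to can be illustrated with a small Python sketch. It is not the search engine's actual data structure: the function `build_anchor_index`, the tuple layout, and the example.com URLs are all hypothetical, and for brevity the sketch indexes anchor words only under the target page, not under the page the link is on.

```python
from collections import defaultdict


def build_anchor_index(links):
    """Map each anchor-text word to the set of pages that links point to.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples.
    """
    index = defaultdict(set)
    for _source_url, target_url, anchor_text in links:
        for word in anchor_text.lower().split():
            index[word].add(target_url)
    return index


# Hypothetical link data; the image itself was never crawled or text-indexed.
crawled_links = [
    ("http://example.com/home", "http://example.com/logo.gif", "Company logo"),
    ("http://example.com/about", "http://example.com/logo.gif", "our logo"),
    ("http://example.com/home", "http://example.com/docs", "Product documentation"),
]
anchor_index = build_anchor_index(crawled_links)
matches = anchor_index.get("logo", set())
```

Here a query for "logo" finds the image even though the image has no indexable text of its own. The same mechanism is what allows a target that was never fetched, or never existed, to appear in results.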