Link Analysis Mining Massive Datasets Wu-Jun li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 7: Link Analysis
Link Analysis 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 7: Link Analysis Mining Massive Datasets
Link Analysis Link Analysis algorithms PageRank Hubs and authorities Topic-Sensitive PageRank Spam Detection Algorithms Other interesting topics we wont cover Detecting duplicates and mirrors Mining for communities community detection (Refer to Chapter 10 of the textbook)
Link Analysis 2 Link Analysis Algorithms ▪ PageRank ▪ Hubs and Authorities ▪ Topic-Sensitive PageRank ▪ Spam Detection Algorithms ▪ Other interesting topics we won’t cover ▪ Detecting duplicates and mirrors ▪ Mining for communities (community detection) (Refer to Chapter 10 of the textbook)
Link Analysis Outline PageRank Topic-Sensitive PageRank Hubs and authorities Spam detection
Link Analysis 3 Outline ▪ PageRank ▪ Topic-Sensitive PageRank ▪ Hubs and Authorities ▪ Spam Detection
Link Analysis PageRank Ranking web pages Web pages are not equally important www.joe-schmoe.comvwww.stanford.edu Inlinks as votes www.stanford.eduhas23,400inlinks www.joe-schmoe.comhas1inlink Are all inlinks equal? Recursive question!
Link Analysis 4 Ranking web pages ▪ Web pages are not equally “important” ▪ www.joe-schmoe.com v www.stanford.edu ▪ Inlinks as votes ▪ www.stanford.edu has 23,400 inlinks ▪ www.joe-schmoe.com has 1 inlink ▪ Are all inlinks equal? ▪ Recursive question! PageRank
Link Analysis PageRank Simple recursive formulation Each link's vote is proportional to the importance of Its source page If page P with importance x has n outlinks, each link gets x/n votes Page p's own importance is the sum of the votes on its inlinks
Link Analysis 5 Simple recursive formulation ▪ Each link’s vote is proportional to the importance of its source page ▪ If page P with importance x has n outlinks, each link gets x/n votes ▪ Page P’s own importance is the sum of the votes on its inlinks PageRank
Link Analysis PageRank Simple flow"model The web in 1839 y/2 Yahoo y=y/2+a/2 a=y 2+m a/2 y m=a/2 m Amazon Soft a/2 a m
Link Analysis 6 Simple “flow” model The web in 1839 Yahoo Amazon M’soft y a m y/2 y/2 a/2 a/2 m y = y /2 + a /2 a = y /2 + m m = a /2 PageRank
Link Analysis PageRank Solving the flow equations 3 equations 3 unknowns, no constants No unique solution All solutions equivalent modulo scale factor Additional constraint forces uniqueness yta+m 1 ny=2/5,a=2/5,m=1/5 Gaussian elimination method works for small examples but we need a better method for large graphs
Link Analysis 7 Solving the flow equations ▪ 3 equations, 3 unknowns, no constants ▪ No unique solution ▪ All solutions equivalent modulo scale factor ▪ Additional constraint forces uniqueness ▪ y+a+m = 1 ▪ y = 2/5, a = 2/5, m = 1/5 ▪ Gaussian elimination method works for small examples, but we need a better method for large graphs PageRank
Link Analysis PageRank Matrix formulation Matrix M has one row and one column for each web page Suppose page j has n outlinks fj→ i, then m=1/n ■E|seM:=0 M is a column stochastic matrix Columns sum to 1 Suppose r is a vector with one entry per web page r; is the importance score of page i Call it the rank vector
Link Analysis 8 Matrix formulation ▪ Matrix M has one row and one column for each web page ▪ Suppose page j has n outlinks ▪ If j i, then Mij=1/n ▪ Else Mij=0 ▪ M is a column stochastic matrix ▪ Columns sum to 1 ▪ Suppose r is a vector with one entry per web page ▪ ri is the importance score of page i ▪ Call it the rank vector ▪ |r| = 1 PageRank
Link Analysis PageRank Example Suppose page j links to 3 pages, including i
Link Analysis 9 Example Suppose page j links to 3 pages, including i i j M r r = i 1/3 PageRank
Link Analysis PageRank Eigenvector formulation The flow equations can be written r= Mr So the rank vector is an eigenvector of the stochastic web matrix In fact, its first or principal eigenvector, with corresponding eigenvalue 1
Link Analysis 10 Eigenvector formulation ▪ The flow equations can be written r = Mr ▪ So the rank vector is an eigenvector of the stochastic web matrix ▪ In fact, its first or principal eigenvector, with corresponding eigenvalue 1 PageRank