CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.

Link Analysis Algorithms
- Page Rank
- Hubs and Authorities
- Topic-Specific Page Rank
- Spam Detection Algorithms
- Other interesting topics we won't cover: detecting duplicates and mirrors, mining for communities, classification, spectral clustering

Ranking web pages
- Web pages are not equally "important"
- Inlinks as votes: one page may have 23,400 inlinks, another just 1
- Are all inlinks equal? Recursive question!

Simple recursive formulation
- Each link's vote is proportional to the importance of its source page
- If page P with importance x has n outlinks, each link gets x/n votes

Simple "flow" model
[Example graph, "the web in 1839": three pages Yahoo (y), Amazon (a), M'soft (m); Yahoo links to itself and Amazon, Amazon links to Yahoo and M'soft, M'soft links to Amazon]

Flow equations:
  y = y/2 + a/2
  a = y/2 + m
  m = a/2

Solving the flow equations
- 3 equations, 3 unknowns, no constants: no unique solution; all solutions equivalent modulo a scale factor
- An additional constraint forces uniqueness: y + a + m = 1, giving y = 2/5, a = 2/5, m = 1/5
- Gaussian elimination works for small examples, but we need a better method for large graphs
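The small system above can be checked directly; a minimal NumPy sketch (the least-squares call simply recovers the exact solution of this consistent overdetermined system):

```python
import numpy as np

# Flow equations for the 3-page example, rewritten as a linear system
# together with the normalization constraint y + a + m = 1.
A = np.array([
    [-0.5,  0.5,  0.0],   # y = y/2 + a/2  ->  -y/2 + a/2     = 0
    [ 0.5, -1.0,  1.0],   # a = y/2 + m    ->   y/2 - a + m   = 0
    [ 0.0,  0.5, -1.0],   # m = a/2        ->   a/2 - m       = 0
    [ 1.0,  1.0,  1.0],   # y + a + m = 1
])
b = np.array([0.0, 0.0, 0.0, 1.0])

# The 4x3 system is overdetermined but consistent, so least squares
# returns the exact solution y = 2/5, a = 2/5, m = 1/5.
r, *_ = np.linalg.lstsq(A, b, rcond=None)
print(r)
```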

Matrix formulation
- Matrix M has one row and one column for each web page
- Suppose page j has n outlinks: if j → i, then M_ij = 1/n, else M_ij = 0
- M is a column stochastic matrix: columns sum to 1
- Suppose r is a vector with one entry per web page: r_i is the importance score of page i; call it the rank vector
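A sketch of building M from an adjacency list; the node numbering and link structure are illustrative, matching the y/a/m example (0 = y, 1 = a, 2 = m):

```python
import numpy as np

# Hypothetical adjacency list: y links to {y, a}, a links to {y, m},
# m links to {a}.
links = {0: [0, 1], 1: [0, 2], 2: [1]}
N = 3

# M[i][j] = 1/|O(j)| if page j links to page i, else 0.
M = np.zeros((N, N))
for j, dests in links.items():
    for i in dests:
        M[i, j] = 1.0 / len(dests)

print(M.sum(axis=0))   # column stochastic: every column sums to 1
```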

Example: suppose page j links to 3 pages, including i. Then column j of M has 1/3 in row i, so in r = Mr page i receives r_j/3 from page j.

Eigenvector formulation
- The flow equations can be written r = Mr
- So the rank vector is an eigenvector of the stochastic web matrix: in fact, it's the first or principal eigenvector, with corresponding eigenvalue 1

Example (Yahoo = y, Amazon = a, M'soft = m)

        y    a    m
  y  [ 1/2  1/2   0 ]
  a  [ 1/2   0    1 ]
  m  [  0   1/2   0 ]

Flow equations:
  y = y/2 + a/2
  a = y/2 + m
  m = a/2

In matrix form, r = Mr:
  [y]   [1/2  1/2   0] [y]
  [a] = [1/2   0    1] [a]
  [m]   [ 0   1/2   0] [m]

Power Iteration method
- Simple iterative scheme (aka relaxation)
- Suppose there are N web pages
- Initialize: r_0 = [1/N, …, 1/N]^T
- Iterate: r_{k+1} = M r_k
- Stop when |r_{k+1} - r_k|_1 < ε
  |x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm; any other vector norm works too, e.g., Euclidean
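The scheme above as a short NumPy sketch (the eps and max_iter values are illustrative choices):

```python
import numpy as np

def power_iterate(M, eps=1e-9, max_iter=1000):
    """Power iteration: r_{k+1} = M r_k, stopping on the L1 change."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)                  # r_0 = [1/N, ..., 1/N]^T
    for _ in range(max_iter):
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:   # |r_{k+1} - r_k|_1 < eps
            return r_next
        r = r_next
    return r

# The y/a/m example matrix from the previous slide.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iterate(M))   # converges to [2/5, 2/5, 1/5]
```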

Power Iteration Example (same M as before)

      y    a    m
  y [1/2  1/2   0]
  a [1/2   0    1]
  m [ 0   1/2   0]

  [y]   [1/3]   [1/3]   [5/12]   [3/8  ]       [2/5]
  [a] = [1/3] → [1/2] → [1/3 ] → [11/24] → … → [2/5]
  [m]   [1/3]   [1/6]   [1/4 ]   [1/6  ]       [1/5]

Random Walk Interpretation
- Imagine a random web surfer: at any time t, the surfer is on some page P; at time t+1, the surfer follows an outlink from P uniformly at random, ending up on some page Q linked from P; the process repeats indefinitely
- Let p(t) be a vector whose i-th component is the probability that the surfer is at page i at time t: p(t) is a probability distribution over pages

The stationary distribution
- Where is the surfer at time t+1? Following a link uniformly at random: p(t+1) = Mp(t)
- Suppose the random walk reaches a state such that p(t+1) = Mp(t) = p(t): then p(t) is called a stationary distribution for the random walk
- Our rank vector r satisfies r = Mr, so it is a stationary distribution for the random surfer

Existence and Uniqueness A central result from the theory of random walks (aka Markov processes): For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0.

Spider traps
- A group of pages is a spider trap if there are no links from within the group to outside the group: the random surfer gets trapped
- Spider traps violate the conditions needed for the random walk theorem

Microsoft becomes a spider trap (M'soft now links only to itself)

      y    a    m
  y [1/2  1/2   0]
  a [1/2   0    0]
  m [ 0   1/2   1]

  [y]   [1]   [ 1 ]   [3/4]   [5/8]       [0]
  [a] = [1] → [1/2] → [1/2] → [3/8] → … → [0]
  [m]   [1]   [3/2]   [7/4]   [ 2 ]       [3]

Random teleports
- The Google solution for spider traps
- At each time step, the random surfer has two options:
  - With probability β, follow a link at random
  - With probability 1-β, jump to some page uniformly at random
- Common values for β are in the range 0.8 to 0.9
- The surfer will teleport out of a spider trap within a few time steps

Matrix formulation
- Suppose there are N pages; consider a page j with set of outlinks O(j)
- We have M_ij = 1/|O(j)| when j → i, and M_ij = 0 otherwise
- The random teleport is equivalent to:
  - adding a teleport link from j to every other page with probability (1-β)/N
  - reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)|
- Equivalently: tax each page a fraction (1-β) of its score and redistribute it evenly

Page Rank
- Construct the N × N matrix A as follows: A_ij = βM_ij + (1-β)/N
- Verify that A is a stochastic matrix
- The page rank vector r is the principal eigenvector of this matrix, satisfying r = Ar
- Equivalently, r is the stationary distribution of the random walk with teleports

Previous example with β = 0.8

          [1/2  1/2   0]       [1/3  1/3  1/3]   [ 7/15  7/15   1/15]
  A = 0.8 [1/2   0    0] + 0.2 [1/3  1/3  1/3] = [ 7/15  1/15   1/15]
          [ 0   1/2   1]       [1/3  1/3  1/3]   [ 1/15  7/15  13/15]

  [y]   [1]         [ 7/11]
  [a] = [1] → … → [ 5/11]
  [m]   [1]         [21/11]
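This can be checked numerically; a sketch assuming the spider-trap matrix of this example. The slide iterates from [1, 1, 1], so its final vector sums to 3; starting from a probability vector gives the same answer scaled by 1/3:

```python
import numpy as np

beta = 0.8
# Spider-trap example: M'soft links only to itself.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
N = 3
A = beta * M + (1 - beta) / N      # A_ij = beta*M_ij + (1-beta)/N

# Power iteration on A from a probability vector; the result is
# [7/11, 5/11, 21/11] normalized to sum to 1, i.e. [7, 5, 21]/33.
r = np.full(N, 1.0 / N)
for _ in range(200):
    r = A @ r
print(r)
```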

Dead ends
- Pages with no outlinks are "dead ends" for the random surfer: nowhere to go on the next step

Microsoft becomes a dead end (M'soft now has no outlinks)

          [1/2  1/2   0]       [1/3  1/3  1/3]   [7/15  7/15  1/15]
  A = 0.8 [1/2   0    0] + 0.2 [1/3  1/3  1/3] = [7/15  1/15  1/15]
          [ 0   1/2   0]       [1/3  1/3  1/3]   [1/15  7/15  1/15]

  Non-stochastic! The m column sums to 1/5, not 1, so importance leaks out and the iterates converge to [0, 0, 0].

Dealing with dead-ends
- Teleport: follow random teleport links with probability 1.0 from dead-ends; adjust the matrix accordingly
- Prune and propagate: preprocess the graph to eliminate dead-ends (might require multiple passes); compute page rank on the reduced graph; approximate values for dead-ends by propagating values from the reduced graph
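A minimal sketch of the teleport option: detect all-zero columns and replace them with uniform teleports (the function name is illustrative):

```python
import numpy as np

def fix_dead_ends(M):
    """From a dead end (all-zero column), teleport to every page
    uniformly, making M column stochastic again."""
    N = M.shape[0]
    M = M.copy()
    dead = M.sum(axis=0) == 0   # columns that sum to 0 are dead ends
    M[:, dead] = 1.0 / N        # teleport with probability 1.0
    return M

# M'soft as a dead end: its column is all zeros.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
M_fixed = fix_dead_ends(M)
print(M_fixed[:, 2])   # [1/3, 1/3, 1/3]
```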

Computing page rank
- Key step is a matrix-vector multiply: r_new = A r_old
- Easy if we have enough main memory to hold A, r_old, r_new
- Say N = 1 billion pages, and we need 4 bytes for each entry (say): 2 billion entries for the two vectors, approx 8GB
- But matrix A has N² = 10^18 entries, a very large number!

Sparse matrix formulation
- Although A is a dense matrix, it is obtained from a sparse matrix M: roughly 10 links per node, so approx 10N nonzero entries
- We can restate the page rank equation: r = βMr + [(1-β)/N]_N, where [(1-β)/N]_N is an N-vector with all entries (1-β)/N
- So in each iteration, we need to: compute r_new = βM r_old, then add the constant value (1-β)/N to each entry in r_new
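The restated iteration as a sketch, using the dead-end-free y/a/m graph (M is stored densely here only for brevity; the point is the update rule, not the storage):

```python
import numpy as np

beta = 0.8
N = 3
# Column-stochastic M for the original y/a/m graph (no dead ends).
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

r = np.full(N, 1.0 / N)
for _ in range(200):
    # r = beta*M*r + [(1-beta)/N]_N : multiply, then add the constant
    r = beta * (M @ r) + (1 - beta) / N

print(r)   # fixed point of the restated equation
```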

Sparse matrix encoding
- Encode the sparse matrix using only its nonzero entries
- Space proportional roughly to the number of links: say 10N, or 4*10*1 billion = 40GB; still won't fit in memory, but will fit on disk

  source node | degree | destination nodes
  0           | 3      | 1, 5, …
  1           | …      | …, 64, 113, 117, …
  2           | …      | …, 23

Basic Algorithm
- Assume we have enough RAM to fit r_new, plus some working memory; store r_old and matrix M on disk
- Initialize: r_old = [1/N]_N
- Iterate:
  - Update: perform a sequential scan of M and r_old, and update r_new
  - Write out r_new to disk as r_old for the next iteration
  - Every few iterations, compute |r_new - r_old| and stop if it is below a threshold; this needs both vectors read into memory

Update step
Scan the encoded (source | degree | destinations) table together with r_old:

  Initialize all entries of r_new to (1-β)/N
  For each page p (out-degree n):
    Read into memory: p, n, dest_1, …, dest_n, r_old(p)
    for j = 1..n:
      r_new(dest_j) += β * r_old(p) / n
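The same update as runnable Python, scanning hypothetical (src, degree, destinations) records as they would come off disk; the tiny record list encodes the y/a/m example:

```python
beta = 0.8
N = 3
# Hypothetical encoded rows: (source, out-degree, destination list).
records = [(0, 2, [0, 1]),   # page 0 links to pages 0 and 1
           (1, 2, [0, 2]),   # page 1 links to pages 0 and 2
           (2, 1, [1])]      # page 2 links to page 1

r_old = [1.0 / N] * N
r_new = [(1 - beta) / N] * N          # initialize to the teleport share
for src, degree, dests in records:
    for dest in dests:
        # r_new(dest_j) += beta * r_old(p) / n
        r_new[dest] += beta * r_old[src] / degree

print(r_new)
```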

Analysis
- In each iteration, we have to read r_old and M, and write r_new back to disk: IO cost = 2|r| + |M|
- What if we had enough memory to fit both r_new and r_old?
- What if we could not even fit r_new in memory? (e.g., 10 billion pages)

Block-based update algorithm
[Diagram: r_new is broken into blocks that fit in memory; for each block, the full (source | degree | destinations) table and r_old are scanned, and only the updates landing in the current block are applied]

Analysis of Block Update
- Similar to a nested-loop join in databases: break r_new into k blocks that fit in memory, and scan M and r_old once for each block
- k scans of M and r_old: cost = k(|M| + |r|) + |r| = k|M| + (k+1)|r|
- Can we do better? Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration

Block-Stripe Update algorithm
[Diagram: M is pre-partitioned into vertical stripes, one per block of r_new; each stripe lists, for every source, only the destinations that fall in its block, so only that stripe is scanned when updating the block]

Block-Stripe Analysis
- Break M into stripes: each stripe contains only the destination nodes in the corresponding block of r_new
- Some additional overhead per stripe, but usually worth it
- Cost per iteration: |M|(1+ε) + (k+1)|r|

Next
- Topic-Specific Page Rank
- Hubs and Authorities
- Spam Detection