Web Mining Link Analysis Algorithms Page Rank. Ranking web pages  Web pages are not equally “important” www.joe-schmoe.com v www.stanford.edu  Inlinks.

Slides:



Advertisements
Similar presentations
Overview of this week Debugging tips for ML algorithms
Advertisements

Markov Models.
Matrices, Digraphs, Markov Chains & Their Use by Google Leslie Hogben Iowa State University and American Institute of Mathematics Leslie Hogben Iowa State.
Link Analysis: PageRank and Similar Ideas. Recap: PageRank Rank nodes using link structure PageRank: – Link voting: P with importance x has n out-links,
CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.
Link Analysis: PageRank
CS345 Data Mining Page Rank Variants. Review Page Rank  Web graph encoded by matrix M N £ N matrix (N = number of web pages) M ij = 1/|O(j)| iff there.
Google’s PageRank By Zack Kenz. Outline Intro to web searching Review of Linear Algebra Weather example Basics of PageRank Solving the Google Matrix Calculating.
More on Rankings. Query-independent LAR Have an a-priori ordering of the web pages Q: Set of pages that contain the keywords in the query q Present the.
Experiments with MATLAB Experiments with MATLAB Google PageRank Roger Jang ( 張智星 ) CSIE Dept, National Taiwan University, Taiwan
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
Introduction to PageRank Algorithm and Programming Assignment 1 CSC4170 Web Intelligence and Social Computing Tutorial 4 Tutor: Tom Chao Zhou
Estimating the Global PageRank of Web Communities Paper by Jason V. Davis & Inderjit S. Dhillon Dept. of Computer Sciences University of Texas at Austin.
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Multimedia Databases SVD II. Optimality of SVD Def: The Frobenius norm of a n x m matrix M is (reminder) The rank of a matrix M is the number of independent.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
Page Rank.  Intuition: solve the recursive equation: “a page is important if important pages link to it.”  Maximailly: importance = the principal eigenvector.
Lexicon/dictionary DIC Inverted Index Allows quick lookup of document ids with a particular word Stanford UCLA MIT … PL(Stanford) PL(UCLA)
Link Analysis, PageRank and Search Engines on the Web
Scaling Personalized Web Search Glen Jeh, Jennfier Widom Stanford University Presented by Li-Tal Mashiach Search Engine Technology course (236620) Technion.
1 Evaluating the Web PageRank Hubs and Authorities.
1 Evaluating the Web PageRank Hubs and Authorities.
CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.
PageRank Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata October 27, 2014.
HCC class lecture 22 comments John Canny 4/13/05.
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
The effect of New Links on Google Pagerank By Hui Xie Apr, 07.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
Presented By: - Chandrika B N
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
System of Linear Equations Nattee Niparnan. LINEAR EQUATIONS.
Using Adaptive Methods for Updating/Downdating PageRank Gene H. Golub Stanford University SCCM Joint Work With Sep Kamvar, Taher Haveliwala.
MapReduce and Graph Data Chapter 5 Based on slides from Jimmy Lin’s lecture slides ( (licensed.
Link Analysis 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 7: Link Analysis Mining Massive Datasets.
Roshnika Fernando P AGE R ANK. W HY P AGE R ANK ?  The internet is a global system of networks linking to smaller networks.  This system keeps growing,
Random Walks and Semi-Supervised Learning Longin Jan Latecki Based on : Xiaojin Zhu. Semi-Supervised Learning with Graphs. PhD thesis. CMU-LTI ,
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
1 Page Rank uIntuition: solve the recursive equation: “a page is important if important pages link to it.” uIn technical terms: compute the principal eigenvector.
The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd Presented by Anca Leuca, Antonis Makropoulos.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
PageRank. s1s1 p 12 p 21 s2s2 s3s3 p 31 s4s4 p 41 p 34 p 42 p 13 x 1 = p 21 p 34 p 41 + p 34 p 42 p 21 + p 21 p 31 p 41 + p 31 p 42 p 21 / Σ x 2 = p 31.
Copyright © D.S.Weld12/3/2015 8:49 PM1 Link Analysis CSE 454 Advanced Internet Systems University of Washington.
Understanding Google’s PageRank™ 1. Review: The Search Engine 2.
L 7: Linear Systems and Metabolic Networks. Linear Equations Form System.
Google PageRank Algorithm
CompSci 100E 4.1 Google’s PageRank web site xxx web site yyyy web site a b c d e f g web site pdq pdq.. web site yyyy web site a b c d e f g web site xxx.
Database and Knowledge Discovery : from Market Basket Data to Web PageRank Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY.
Link Analysis Algorithms Page Rank Slides from Stanford CS345, slightly modified.
Ljiljana Rajačić. Page Rank Web as a directed graph  Nodes: Web pages  Edges: Hyperlinks 2 / 25 Ljiljana Rajačić.
Google's Page Rank. Google Page Ranking “The Anatomy of a Large-Scale Hypertextual Web Search Engine” by Sergey Brin and Lawrence Page
CS 440 Database Management Systems Web Data Management 1.
CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.
Reliability of Disk Systems. Reliability So far, we looked at ways to improve the performance of disk systems. Next, we will look at ways to improve the.
Jeffrey D. Ullman Stanford University.  Web pages are important if people visit them a lot.  But we can’t watch everybody using the Web.  A good surrogate.
Motivation Modern search engines for the World Wide Web use methods that require solving huge problems. Our aim: to develop multiscale techniques that.
Search Engines and Link Analysis on the Web
PageRank Random Surfers on the Web Transition Matrix of the Web Dead Ends and Spider Traps Topic-Specific PageRank Hubs and Authorities Jeffrey D. Ullman.
PageRank Random Surfers on the Web Transition Matrix of the Web Dead Ends and Spider Traps Topic-Specific PageRank Jeffrey D. Ullman Stanford University.
Link Analysis 2 Page Rank Variants
PageRank and Markov Chains
DTMC Applications Ranking Web Pages & Slotted ALOHA
CSE 454 Advanced Internet Systems University of Washington
CSE 454 Advanced Internet Systems University of Washington
Iterative Aggregation Disaggregation
Lecture 22 SVD, Eigenvector, and Web Search
CS 440 Database Management Systems
Junghoo “John” Cho UCLA
Lecture 22 SVD, Eigenvector, and Web Search
Lecture 22 SVD, Eigenvector, and Web Search
Link Analysis Many slides are borrowed from Stanford Data Mining Class taught by Drs Anand Rajaraman, Jeffrey D. Ullman, and Jure Leskovec.
Presentation transcript:

Web Mining Link Analysis Algorithms Page Rank

Ranking web pages  Web pages are not equally “important” v  Inlinks as votes has 23,400 inlinks has 1 inlink  Are all inlinks equal? Recursive question!

Simple recursive formulation  Each link’s vote is proportional to the importance of its source page  If page P with importance x has n outlinks, each link gets x/n votes

Simple “flow” model The web in 1839 Yahoo M’softAmazon y a m y/2 a/2 m y = y /2 + a /2 a = y /2 + m m = a /2

Solving the flow equations  3 equations, 3 unknowns, no constants No unique solution All solutions equivalent modulo scale factor  Additional constraint forces uniqueness y+a+m = 1 y = 2/5, a = 2/5, m = 1/5  Gaussian elimination method works for small examples, but we need a better method for large graphs

Matrix formulation  Matrix M has one row and one column for each web page  If page j has n outlinks and links to page i Then M ij =1/n Else M ij =0  M is a column stochastic matrix Columns sum to 1  Suppose r is a vector with one entry per web page r i is the importance score of page i Call it the rank vector

Example Suppose page j links to 3 pages, including i i j M r r = i 1/3

Eigenvector formulation  The flow equations can be written r = Mr  So the rank vector is an eigenvector of the stochastic web matrix In fact, its first or principal eigenvector, with corresponding eigenvalue 1

Example Yahoo M’softAmazon y 1/2 1/2 0 a 1/2 0 1 m 0 1/2 0 y a m y = y /2 + a /2 a = y /2 + m m = a /2 r = Mr y 1/2 1/2 0 y a = 1/2 0 1 a m 0 1/2 0 m

Power Iteration method  Simple iterative scheme (aka relaxation)  Suppose there are N web pages  Initialize: r 0 = [1/N,….,1/N] T  Iterate: r k+1 = Mr k  Stop when |r k+1 - r k | 1 <  |x| 1 =  1 · i · N |x i | is the L 1 norm Can use any other vector norm e.g., Euclidean

Power Iteration Example Yahoo M’softAmazon y 1/2 1/2 0 a 1/2 0 1 m 0 1/2 0 y a m y a = m 1/3 1/2 1/6 5/12 1/3 1/4 3/8 11/24 1/6 2/5 1/5...

Spider traps  A group of pages is a spider trap if there are no links from within the group to outside the group Random surfer gets trapped  Spider traps violate the conditions needed for the random walk theorem

Microsoft becomes a spider trap Yahoo M’softAmazon y 1/2 1/2 0 a 1/2 0 0 m 0 1/2 1 y a m y a = m /2 3/2 3/4 1/2 7/4 5/8 3/

Random teleports  The Google solution for spider traps  At each time step, the random surfer has two options: With probability , follow a link at random With probability 1-, jump to some page uniformly at random Common values for  are in the range 0.8 to 0.9  Surfer will teleport out of spider trap within a few time steps

Matrix formulation  Suppose there are N pages Consider a page j, with set of outlinks O(j) We have M ij = 1/|O(j)| when |O(j)|>0 and j links to i; M ij = 0 otherwise The random teleport is equivalent to  adding a teleport link from j to every other page with probability (1-)/N  reducing the probability of following each outlink from 1/|O(j)| to /|O(j)|  Equivalent: tax each page a fraction (1-) of its score and redistribute evenly

Page Rank  Construct the N x N matrix A as follows A ij = M ij + (1-)/N  Verify that A is a stochastic matrix  The page rank vector r is the principal eigenvector of this matrix satisfying r = Ar  Equivalently, r is the stationary distribution of the random walk with teleports

Previous example with =0.8 Yahoo M’softAmazon 1/2 1/2 0 1/ /2 1 1/3 1/3 1/3 y 7/15 7/15 1/15 a 7/15 1/15 1/15 m 1/15 7/15 13/ y a = m /11 5/11 21/11...

Dead ends  Pages with no outlinks are “dead ends” for the random surfer Nowhere to go on next step

Microsoft becomes a dead end Yahoo M’softAmazon y a = m /2 1/2 0 1/ /2 0 1/3 1/3 1/3 y 7/15 7/15 1/15 a 7/15 1/15 1/15 m 1/15 7/15 1/ Non- stochastic!

Dealing with dead-ends  Teleport Follow random teleport links with probability 1.0 from dead-ends Adjust matrix accordingly  Prune and propagate Preprocess the graph to eliminate dead-ends Might require multiple passes Compute page rank on reduced graph Approximate values for deadends by propagating values from reduced graph

Computing page rank  Key step is matrix-vector multiply r new = Ar old  Easy if we have enough main memory to hold A, r old, r new  Say N = 1 billion pages We need 4 bytes for each entry (say) 2 billion entries for vectors, approx 8GB Matrix A has N 2 entries  is a large number!

Sparse matrix formulation  Although A is a dense matrix, it is obtained from a sparse matrix M 10 links per node, approx 10N entries  We can restate the page rank equation r = Mr + [(1-)/N] N [(1-)/N] N is an N-vector with all entries (1-)/N  So in each iteration, we need to: Compute r new = Mr old Add a constant value (1-)/N to each entry in r new

Sparse matrix encoding  Encode sparse matrix using only nonzero entries Space proportional roughly to number of links say 10N, or 4*10*1 billion = 40GB still won’t fit in memory, but will fit on disk 03 1, 5, , 64, 113, 117, , 23 source node degree destination nodes

Basic Algorithm  Assume we have enough RAM to fit r new, plus some working memory Store r old and matrix M on disk Basic Algorithm:  Initialize: r old = [1/N] N  Iterate: Update: Perform a sequential scan of M and r old and update r new Write out r new to disk as r old for next iteration Every few iterations, compute |r new -r old | and stop if it is below threshold

Update step 03 1, 5, , 64, 113, , 23 srcdegreedestination r new r old Initialize all entries of r new to (1-  )/N For each page p (out-degree n): Read into memory: p, n, dest 1,…,dest n, r old (p) for j = 1..n: r new (dest j ) +=  *r old (p)/n

Analysis  In each iteration, we have to: Read r old and M Write r new back to disk IO Cost = 2|r| + |M|  What if we had enough memory to fit both r new and r old ?  What if we could not even fit r new in memory? 10 billion pages

Block-based update algorithm 04 0, 1, 3, , , 4 srcdegreedestination r new r old

Analysis of Block Update  Similar to nested-loop join in databases Break r new into k blocks that fit in memory Scan M and r old once for each block  k scans of M and r old k(|M| + |r|) + |r| = k|M| + (k+1)|r|  Can we do better?  Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration

Block-Stripe Update algorithm 04 0, srcdegreedestination r new r old

Block-Stripe Analysis  Break M into stripes Each stripe contains only destination nodes in the corresponding block of r new  Some additional overhead per stripe But usually worth it  Cost per iteration |M|(1+) + (k+1)|r|