PageRank Algorithm -- Bringing Order to the Web (Hu Bin)



Earlier Search Engines
Search by keywords: return all pages that contain those words.
If thousands of pages are returned, which one comes first?
-- Return the pages in which those words occur most frequently first.
Problem
-- If people want their page at the top of a word search (e.g., “database”), they only need to repeat the word many times.
Search engines can be easily “fooled”.

Link Analysis
The goal is to rank pages.
Intuition: recommendations
-- The importance of each page should be decided by what other pages “say” about it.
-- A “link” from page A to page B is a recommendation of page B by the author of A.
This implies that we need to mine the structure of the web graph.
Two main approaches:
(1) Static: use the links in all pages to calculate a ranking of every page (Google)
(2) Dynamic: search by keywords first, then use the links in the results to determine a ranking dynamically (IBM Clever: Hubs and Authorities)

Static -- Google’s Approach
PageRank was defined by the founders of Google:
$R(A) = (1 - d) + d \sum_{B \to A} \frac{R(B)}{\mathrm{outDegree}(B)}$, where the sum runs over all pages B that link to page A.
-- outDegree(B) = # edges leaving page B = # hyperlinks in page B
-- R(B)/outDegree(B) means that page B distributes its rank equally over all the pages it points to
-- d is a damping factor between 0 and 1, normally set to 0.85
Example: if R(B) = 3 and R(C) = 4, and from the graph outD(B) = 1 and outD(C) = 2, then
R(A) = 0.85 * (3/1 + 4/2) + (1 - 0.85) = 4.4
Problem: how do we get these values? To get R(B) we need to know R(A) first, and to get R(A) we need to know R(B) first. Where do we START?
(Figure: graph on pages A, B, and C, with B and C linking to A.)
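The arithmetic in the example can be checked with a few lines of Python. This is a minimal sketch: the function name `rank_from_inlinks` and its argument layout are chosen here for illustration and do not come from the slides.

```python
def rank_from_inlinks(inlinks, d=0.85):
    """Return (1 - d) + d * sum(R(B) / outDegree(B)) over the pages B linking to A.

    inlinks is a list of (rank, out_degree) pairs, one pair per linking page.
    """
    return (1 - d) + d * sum(rank / out_degree for rank, out_degree in inlinks)


# Values from the slide: R(B) = 3 with outD(B) = 1, and R(C) = 4 with outD(C) = 2.
r_a = rank_from_inlinks([(3, 1), (4, 2)])
print(round(r_a, 2))
```

The call only works because R(B) and R(C) are given; in the real problem they depend on R(A) in turn, which is the circularity the slide points out.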

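Matrix Formulation
PageRank’s matrix formulation
-- for one page: $R_1 = (1 - d) + d \left( \alpha_{11} \frac{R_1}{\mathrm{outD}_1} + \alpha_{12} \frac{R_2}{\mathrm{outD}_2} + \dots + \alpha_{1n} \frac{R_n}{\mathrm{outD}_n} \right)$
-- for all pages: $R_i = (1 - d) + d \sum_{j=1}^{n} \alpha_{ij} \frac{R_j}{\mathrm{outD}_j}, \quad i = 1, \dots, n$  (1)
-- $\alpha_{ij} = 0$ if there is no link to page i in page j; otherwise $\alpha_{ij} = 1$
How do we calculate it?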

Calculate PageRank’s Matrix
Formulation (1) can be written as: R = C + M * R  (2)
-- here C is the vector whose entries are all (1 - d), and M is the matrix of coefficients of R_1, ..., R_n in (1)
Can we solve it like this?
(I - M) * R = C
R = (I - M)^{-1} * C
Be careful! It is a matrix equation.
-- The determinant of (I - M) may be zero; in other words, (I - M) may be a singular matrix. In that case (I - M)^{-1} does NOT exist!
Is (I - M) a singular matrix?
-- No? How do we prove it?
-- Yes? How do we solve equation (2)?
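To make equation (2) concrete, the sketch below builds C and M for a small graph and evaluates C + M * R once. It assumes that M has entries d * alpha_ij / outD_j, which is what formulation (1) implies; the three-page graph and the names `build_system` and `apply_once` are hypothetical.

```python
def build_system(links, n, d=0.85):
    """Return (C, M) with C_i = 1 - d and M[i][j] = d * alpha_ij / outD_j.

    links[j] is the list of pages that page j links to (0-based page numbers).
    """
    c = [1.0 - d] * n
    m = [[0.0] * n for _ in range(n)]
    for j, targets in links.items():
        for i in targets:                 # alpha_ij = 1 exactly for these i
            m[i][j] = d / len(targets)
    return c, m


def apply_once(c, m, r):
    """Evaluate the right-hand side C + M * R of equation (2)."""
    n = len(r)
    return [c[i] + sum(m[i][j] * r[j] for j in range(n)) for i in range(n)]


# Hypothetical graph: A (0) -> C, B (1) -> A, C (2) -> A and B.
links = {0: [2], 1: [0], 2: [0, 1]}
c, m = build_system(links, 3)
r_next = apply_once(c, m, [1.0, 3.0, 4.0])
print([round(x, 3) for x in r_next])
```

With R(B) = 3 and R(C) = 4, the first entry reproduces the value R(A) = 4.4 from the earlier example.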

Solve the Matrix Formulation
We cannot assume in general that (I - M) is nonsingular. (For example, it is singular when d = 1 and every column of M sums to 1.)
We solve it iteratively instead:
-- R_{n+1} = C + M * R_n  (2)
-- We assume an initial value R_0 to calculate R_1, use R_1 to get R_2, and so on.
-- After many iterations, what happens? See the example.
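The iteration can be sketched as follows. This is not the authors' code: the stopping tolerance `tol`, the cap `max_iter`, and the three-page graph are assumptions made for the illustration, and the matrix product is written as a loop over links rather than as an explicit matrix.

```python
def pagerank(links, n, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate R_{n+1} = C + M * R_n starting from R_0 = (1, ..., 1).

    links[j] is the list of pages that page j links to.
    Returns the final ranks and the number of iterations performed.
    """
    r = [1.0] * n
    iteration = 0
    for iteration in range(1, max_iter + 1):
        new_r = [1.0 - d] * n                       # the constant vector C
        for j, targets in links.items():
            for i in targets:
                new_r[i] += d * r[j] / len(targets)  # contribution of M * R_n
        change = max(abs(a - b) for a, b in zip(new_r, r))
        r = new_r
        if change < tol:                            # values no longer change
            break
    return r, iteration


# Hypothetical graph: A (0) -> C, B (1) -> A, C (2) -> A and B.
links = {0: [2], 1: [0], 2: [0, 1]}
ranks, iterations = pagerank(links, 3)
print([round(x, 4) for x in ranks], iterations)
```

Because every page in this graph has at least one outgoing link, the ranks sum to the number of pages once the iteration has settled.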

Example of PageRank
At first, we assume the rank of every page is 1.
After many iterations, the values no longer change.
-- Why? The iteration converges.
If the initial values are different, does the solution change?
-- We get the same solution whatever initial values are taken!
-- Why? The iteration converges.
(Figure: example graph on the pages Ne, Am, and MS.)

Convergence
It means:
-- whatever values we start at,
-- after a number of iterations we end up with the same final values,
-- and these values no longer change even if we run further iterations.
The reason that R_{n+1} = C + M * R_n converges is:
-- the Markov chain theorem
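The claim that the starting values do not matter can be checked numerically. The sketch below runs the same iteration from two different starting vectors; the graph, the number of steps, and the second starting vector are arbitrary choices made for this illustration.

```python
def iterate(links, start, d=0.85, steps=200):
    """Run R_{n+1} = C + M * R_n for a fixed number of steps from `start`."""
    r = list(start)
    for _ in range(steps):
        new_r = [1.0 - d] * len(r)
        for j, targets in links.items():
            for i in targets:
                new_r[i] += d * r[j] / len(targets)
        r = new_r
    return r


# Hypothetical graph: A (0) -> C, B (1) -> A, C (2) -> A and B.
links = {0: [2], 1: [0], 2: [0, 1]}
from_ones = iterate(links, [1.0, 1.0, 1.0])
from_other = iterate(links, [10.0, 0.0, 5.0])

# Largest difference between the two results, entry by entry.
gap = max(abs(a - b) for a, b in zip(from_ones, from_other))
print(gap < 1e-9)
```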

Markov chain theorem (I)
Surfing the web as a theoretical random walk
-- Go from page A to page B by randomly choosing an outgoing link in page A.
-- This can lead to (1) dead ends at pages with no outgoing links, and (2) cycles around cliques of interconnected pages.
The theoretical random walk is known as a Markov chain or Markov process.

Markov chain theorem (II)
The Markov chain R_{n+1} = M * R_n has a unique stationary distribution under three conditions:
(1) M is stochastic
(2) M is aperiodic
(3) M is irreducible
To make these conditions true:
(1) All columns of M add up to 1 and no value is negative
(2) Make sure that G is not bipartite
(3) Make sure that G is strongly connected
(Note: G is the graph that M corresponds to.)
(Note: proved in G. R. Grimmett and D. R. Stirzaker, Probability and Random Processes.)

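PageRank and Markov chain -- condition (1)
M is stochastic: all columns of M add up to 1 and no value is negative
What is M?
-- $\alpha_{ij} = 0$ if there is no link to page i in page j; otherwise $\alpha_{ij} = 1$
-- outD_j = total # edges leaving page j = total # hyperlinks in page j
Therefore, for the Markov chain R_{n+1} = M * R_n (no d) with $M_{ij} = \alpha_{ij} / \mathrm{outD}_j$, we have for every page j with at least one outgoing link:
$\sum_{i=1}^{n} M_{ij} = \frac{1}{\mathrm{outD}_j} \sum_{i=1}^{n} \alpha_{ij} = \frac{\mathrm{outD}_j}{\mathrm{outD}_j} = 1$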
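Condition (1) is easy to test mechanically. The sketch below assumes M has entries alpha_ij / outD_j (no damping factor); the two small graphs and the function names are hypothetical. The second graph contains a page without outgoing links, so its column sums to 0 and the check fails.

```python
def link_matrix(links, n):
    """Return M with M[i][j] = 1 / outD_j if page j links to page i, else 0."""
    m = [[0.0] * n for _ in range(n)]
    for j, targets in links.items():
        for i in targets:
            m[i][j] = 1.0 / len(targets)
    return m


def is_column_stochastic(m, eps=1e-12):
    """True if no entry is negative and every column adds up to 1."""
    n = len(m)
    no_negatives = all(x >= 0 for row in m for x in row)
    columns_ok = all(
        abs(sum(m[i][j] for i in range(n)) - 1.0) < eps for j in range(n)
    )
    return no_negatives and columns_ok


complete = {0: [2], 1: [0], 2: [0, 1]}      # every page has an outgoing link
with_dead_end = {0: [1], 1: [], 2: [0, 1]}  # page 1 has no outgoing links

ok_complete = is_column_stochastic(link_matrix(complete, 3))
ok_dead_end = is_column_stochastic(link_matrix(with_dead_end, 3))
print(ok_complete, ok_dead_end)
```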

PageRank and Markov chain -- condition (2)
M is aperiodic: G is not bipartite
However, sometimes G is bipartite.
(Figure: example of a bipartite graph on nodes A through G.)
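A two-page graph shows why a bipartite graph is a problem for the plain chain R_{n+1} = M * R_n. In this hypothetical example page A links only to B and B links only to A, so the rank swaps sides on every step and never settles.

```python
def step(m, r):
    """One step of the undamped chain R_{n+1} = M * R_n."""
    n = len(r)
    return [sum(m[i][j] * r[j] for j in range(n)) for i in range(n)]


# Column j describes where page j sends its rank: A -> B and B -> A.
m = [[0.0, 1.0],
     [1.0, 0.0]]

r0 = [1.0, 0.0]
r1 = step(m, r0)
r2 = step(m, r1)
print(r0, r1, r2)
```

After two steps the vector is back where it started, so the sequence oscillates with period 2 instead of converging.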

PageRank and Markov chain -- condition (3)
M is irreducible: G is strongly connected
Definition of a strongly connected graph:
-- A strongly connected digraph is a directed graph in which it is possible to reach any node starting from any other node by traversing edges in the direction(s) in which they point.
However, G is not always strongly connected.
Two problems: (1) rank leak (2) rank sink
(Figures: (1) rank leak and (2) rank sink, each on a graph with nodes A, B, C. In (2) we cannot get to “A” or “B” from “C”.)
(Note: rank leak is a special case of rank sink, but their solutions are different.)

Rank Leak
A page that has no outgoing links.
Called a rank leak because all importance will “leak” out of the web
-- the importance of every page ends up as zero.
Solution: assume B has links to all web pages with equal probability.
(Figure: graph on A, B, C in which B has no outgoing links.)
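The leak and its fix can both be seen on a three-page example. The sketch below is illustrative only: the edges other than "B has no outgoing links" are assumptions, and the flag `fix_leaks` is a name invented here for the slide's solution of linking a leak page to every page with equal probability.

```python
def link_matrix(links, n, fix_leaks=False):
    """Column j holds 1 / outD_j for every page that page j links to."""
    m = [[0.0] * n for _ in range(n)]
    for j in range(n):
        targets = links.get(j, [])
        if not targets and fix_leaks:
            targets = list(range(n))      # pretend the leak page links to every page
        for i in targets:
            m[i][j] = 1.0 / len(targets)
    return m


def walk(m, steps):
    """Run the undamped chain R_{n+1} = M * R_n from R_0 = (1, ..., 1)."""
    n = len(m)
    r = [1.0] * n
    for _ in range(steps):
        r = [sum(m[i][j] * r[j] for j in range(n)) for i in range(n)]
    return r


# Hypothetical graph: A (0) -> B, C (2) -> A and B; B (1) has no outgoing links.
links = {0: [1], 1: [], 2: [0, 1]}

leaky = walk(link_matrix(links, 3), steps=3)
fixed = walk(link_matrix(links, 3, fix_leaks=True), steps=50)
print(sum(leaky), round(sum(fixed), 6))
```

Without the fix the total rank drains away within three steps on this graph; with the fix every column sums to 1 again and the total stays at 3.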

Rank Sink
A group of pages with no links out of the group.
Called a rank sink because this group accumulates all the importance of the web; the importance of all pages that do not belong to the group ends up as zero.
(Figure: graph on A, B, C containing a rank sink.)

Rank Sink Solution
The original Markov chain is: R_{n+1} = M * R_n (NO “d”!)
Practical PageRank formula: $R(A) = (1 - d) + d \sum_{B \to A} \frac{R(B)}{\mathrm{outDegree}(B)}$
Why do we need “d”?
(1) Intuition in the random surfer model
-- d is the probability of jumping from page “A” to page “B” by following a link in page “A”
-- 1 - d is the probability of jumping to a random page on the web instead of following a link in the current page
(2) It implicitly builds a non-bipartite, strongly connected graph, so conditions (2) and (3) are satisfied
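The effect of “d” on a rank sink can be sketched by running the same iteration with d = 1 (the original chain, no damping) and with d = 0.85. The three-page graph, in which B and C link only to each other, is a hypothetical example, as are the function name and step counts.

```python
def iterate(links, n, d, steps):
    """Run R_{n+1} = (1 - d) + d * M * R_n from R_0 = (1, ..., 1).

    With d = 1.0 this is the original Markov chain R_{n+1} = M * R_n.
    """
    r = [1.0] * n
    for _ in range(steps):
        new_r = [1.0 - d] * n
        for j, targets in links.items():
            for i in targets:
                new_r[i] += d * r[j] / len(targets)
        r = new_r
    return r


# Hypothetical graph: A (0) -> B, B (1) -> C, C (2) -> B. The pair {B, C} is a sink.
links = {0: [1], 1: [2], 2: [1]}

undamped = iterate(links, 3, d=1.0, steps=10)
damped = iterate(links, 3, d=0.85, steps=500)
print(undamped[0], round(damped[0], 4))
```

Without damping, page A ends up with rank 0 while the sink holds all of it; with d = 0.85, A keeps the baseline rank 1 - d even though no page links to it.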

Build a non-bipartite & strongly connected graph
The original graph is bipartite and not strongly connected: we cannot go from “b” to “a” or “f”, cannot go from “d” or “f” to any other page, and so on.
By introducing “d”, the graph becomes strongly connected and non-bipartite. For example, there is a 1 - d chance of jumping from “b” to “a”, “b”, or “f”. We can go to any page from the current position.
(Figure: the graph on pages a, b, c, d, f before and after introducing “d”.)
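One way to see that “d” connects every page to every other page is to write the random surfer as a single transition matrix and check that all its entries are positive. The sketch below uses the normalized form d * M + (1 - d) / n, which is an assumption: it rescales the slide's formula so that the ranks sum to 1 instead of n. The edges of the five-page graph are also hypothetical, chosen so that “d” and “f” have no outgoing links, and such pages are treated as linking to every page.

```python
def google_matrix(links, n, d=0.85):
    """Return G = d * M + (1 - d) / n, with leak pages linking to all pages."""
    g = [[(1.0 - d) / n] * n for _ in range(n)]   # random jump to any page
    for j in range(n):
        targets = links.get(j, []) or list(range(n))
        for i in targets:
            g[i][j] += d / len(targets)            # follow a link in page j
    return g


# Hypothetical edges for pages a, b, c, d, f numbered 0..4:
# a -> b and f, b -> c, c -> d; d and f have no outgoing links.
links = {0: [1, 4], 1: [2], 2: [3], 3: [], 4: []}
g = google_matrix(links, 5)

all_positive = all(x > 0 for row in g for x in row)
column_sums = [sum(g[i][j] for i in range(5)) for j in range(5)]
print(all_positive, [round(s, 6) for s in column_sums])
```

Every entry being positive means each page can be reached from each page in a single step, including from a page to itself, so the chain is both strongly connected and aperiodic.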

Summary of PageRank
Build a matrix based on the pages and the links in those pages.
By introducing “d”, make the graph non-bipartite and strongly connected.
Therefore the system is a Markov chain with a stationary distribution.
Solve the equation iteratively: R_{n+1} = C + M * R_n

Dynamic -- Hubs & Authorities
Authority: a page that offers info about a topic
Hub: a page that doesn’t provide much info itself, but tells us where to find pages about a topic
Good hub: a page that points to many good authorities
Good authority: a page pointed to by many good hubs

Goal -- Hubs & Authorities
Goal: given a keyword query, assume there is a set of pages P that match the query; calculate a hub value and an authority value for each page in P instead of for the whole web.
Pages with high authority are the results of the query (to find good sources of content).

Build A Subgraph
Find the pages S containing the keyword, and use the set S to build a subgraph:
-- Find all pages that the S pages point to, i.e., their forward neighbors.
-- Find all pages that point to S pages, i.e., their backward neighbors.
-- Run the computation on this subgraph.
(Figure: the query results Result 1, ..., Result n form the start set; pages f_1, ..., f_s form the forward set; pages b_1, ..., b_m form the back set.)
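The subgraph construction can be sketched with plain sets. The link table and page names below are made up for the illustration, and `build_base_set` is a hypothetical helper, not part of any library.

```python
def build_base_set(links, start_set):
    """Return the start set together with its forward and backward neighbors.

    links maps each page to the list of pages it points to.
    """
    start = set(start_set)
    forward = {q for p in start for q in links.get(p, [])}
    backward = {p for p, targets in links.items()
                if any(q in start for q in targets)}
    return start | forward | backward


# Hypothetical link table: r1 and r2 are the query results (start set).
links = {
    "b1": ["r1"],          # b1 points to a result, so it is a backward neighbor
    "r1": ["f1", "f2"],    # f1 and f2 are forward neighbors
    "r2": ["f2"],
    "x": ["y"],            # unrelated pages stay outside the subgraph
}
base_set = build_base_set(links, {"r1", "r2"})
print(sorted(base_set))
```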

Computing Hubs and Authorities
For all pages:
(1) Number the pages {1, 2, ..., n}.
(2) Define their adjacency matrix M to be the n*n matrix where M_ij = 1 if page i links to page j, and 0 otherwise.
(3) Define A = (a_1, a_2, ..., a_n) and H = (h_1, h_2, ..., h_n). Each page p has a non-negative authority weight a_p and a non-negative hub weight h_p.
Update rules:
-- authority: $a(p) = h(v_1) + h(v_2) + h(v_3)$, the sum of the hub weights of the pages v_1, v_2, v_3 that point to p
-- hub: $h(p) = a(q_1) + a(q_2) + a(q_3)$, the sum of the authority weights of the pages q_1, q_2, q_3 that p points to

Example (1)
(1) $H_i = M * A_{i-1}$
(2) $A_i = M^T * H_{i-1}$
After each iteration of operations (1) and (2), normalize H and A (otherwise the values keep increasing).
(Figure: graph on pages X, Y, Z and its adjacency matrix M, with rows and columns indexed by X, Y, Z.)
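Operations (1) and (2) with normalization can be sketched as below. Several details are assumptions: the adjacency matrix is a made-up three-page graph (the matrix from the slide is not available), both vectors start at all ones, and the normalization uses the Euclidean length.

```python
import math


def normalize(v):
    """Scale v to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v


def hits(m, iterations=50):
    """Iterate H_i = M * A_{i-1} and A_i = M^T * H_{i-1}, normalizing each time."""
    n = len(m)
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        new_hub = [sum(m[i][j] * auth[j] for j in range(n)) for i in range(n)]
        new_auth = [sum(m[j][i] * hub[j] for j in range(n)) for i in range(n)]
        hub, auth = normalize(new_hub), normalize(new_auth)
    return hub, auth


# Hypothetical graph: X -> Y, X -> Z, Y -> Z (M[i][j] = 1 if page i links to page j).
m = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
hub, auth = hits(m)
print([round(x, 4) for x in hub])
print([round(x, 4) for x in auth])
```

In this graph X is the best hub because it points to both other pages, and Z is the best authority because both other pages point to it.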

Example (2)
(Table: values of H and A for the graph on X, Y, Z at iteration 0, iteration 1, after normalization, and so on until they converge.)

Proof of Convergence
Theorem 3.1. The sequences x_1, x_2, x_3, ... and y_1, y_2, y_3, ... converge.
Proof. Let G = (V, E), with V = {p_1, p_2, ..., p_n}, and let A denote the adjacency matrix of the graph G; the (i, j)th entry of A is equal to 1 if (p_i, p_j) is an edge of G, and is equal to 0 otherwise. One easily verifies that the I and O operations can be written x ← A^T y and y ← A x respectively. Thus x_k is the unit vector in the direction of (A^T A)^{k-1} A^T z, and y_k is the unit vector in the direction of (A A^T)^k z. Now, a standard result of linear algebra states that if M is a symmetric n × n matrix, and v is a vector not orthogonal to the principal eigenvector ω_1(M), then the unit vector in the direction of M^k v converges to ω_1(M) as k increases without bound. Also (as a corollary), if M has only non-negative entries, then the principal eigenvector of M has only non-negative entries.
Source: Jon M. Kleinberg, “Authoritative Sources in a Hyperlinked Environment”
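The standard result quoted in the proof can be sketched by expanding v in an orthonormal eigenbasis of the symmetric matrix M. The sketch assumes two things that the quoted text leaves implicit: the principal eigenvalue λ_1 is positive, and it is strictly larger in absolute value than every other eigenvalue. The coefficients c_i are notation introduced here.

```latex
v = \sum_{i=1}^{n} c_i \,\omega_i(M), \qquad c_1 \neq 0
\quad \text{(v is not orthogonal to } \omega_1(M)\text{)}

M^k v = \sum_{i=1}^{n} c_i \lambda_i^k \,\omega_i(M)
      = \lambda_1^k \Bigl( c_1\,\omega_1(M)
        + \sum_{i=2}^{n} c_i \Bigl(\frac{\lambda_i}{\lambda_1}\Bigr)^{k} \omega_i(M) \Bigr)

\Bigl|\frac{\lambda_i}{\lambda_1}\Bigr| < 1 \ \text{for } i \ge 2
\quad\Longrightarrow\quad
\frac{M^k v}{\lVert M^k v \rVert} \;\to\; \frac{c_1}{|c_1|}\,\omega_1(M)
\quad \text{as } k \to \infty
```

For M = A^T A or M = A A^T all eigenvalues are non-negative, so the positivity assumption only excludes the case where A is the zero matrix.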

PageRank vs. Authorities
PageRank (Google)
-- Query-independent: computed for all web pages stored in the database prior to the query
-- Quality depends only on the web pages stored in the database
-- Computes authorities only
-- Trivial and fast to compute
HITS (CLEVER)
-- Query-dependent: performed on the set of retrieved web pages for each query
-- Quality depends on the quality of the start set
-- Computes authorities and hubs
-- Easy to compute, but real-time execution is hard

Reference
Sergey Brin and Larry Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine
Larry Page, Sergey Brin, R. Motwani, and T. Winograd (1998). The PageRank Citation Ranking: Bringing Order to the Web
Chris Ridings, Mike Shishigin, and Jill Whalen (2002). PageRank Uncovered
Monica Bianchini, Marco Gori, and Franco Scarselli (2000). Inside PageRank
Amy N. Langville and Carl D. Meyer (2004). Deeper Inside PageRank
Jon M. Kleinberg (1997). Authoritative Sources in a Hyperlinked Environment
Ayman Farahat, Thomas Lofaro, Joel C. Miller, Gregory Rae, and Lesley A. Ward (2001). Authority Rankings from HITS, PageRank, and SALSA: Existence, Uniqueness, and Effect of Initialization

Reference
Alessandro Panconesi, DI, La Sapienza of Rome (2005). The Stationary Distribution of a Markov Chain
Dean L. Isaacson and Richard W. Madsen (1976). Markov chains, theory and applications
John G. Kemeny and J. Laurie Snell (2002). Finite Markov chains and Algorithmic Applications
Monika Henzinger. Hyperlink Analysis on the Web
Vagelis Hristidis. Random Walks in Ranking Query Results in Semistructured Databases
Shang-Hua Teng. SVD, Eigenvector, and Web Search
G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes
Dragomir R. Radev. Information Retrieval

THANKS!