Matrices, Digraphs, Markov Chains & Their Use by Google
Leslie Hogben, Iowa State University and American Institute of Mathematics
Bay Area Mathematical Adventures, February 27, 2008

Outline (with material from Becky Atherton)
• Matrices
• Markov Chains
• Digraphs
• Google’s PageRank

Introduction to Matrices
• A matrix is a rectangular array of numbers.
• Matrices are used to solve systems of equations.
• Matrices are easy for computers to work with.

Matrix arithmetic
• Matrix Addition
• Matrix Multiplication
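These two operations can be sketched in a few lines of Python on plain nested lists (the helper names mat_add and mat_mul are mine, not from the talk):

```python
def mat_add(A, B):
    """Entrywise sum of two same-shape matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_mul(A, B):
    """Product: entry (i, j) is the dot product of row i of A and column j of B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(mat_add(A, B))  # [[6, 8], [10, 12]]
print(mat_mul(A, B))  # [[19, 22], [43, 50]]
```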

Introduction to Markov Chains
• At each time period, every object in the system is in exactly one state, one of 1, …, n.
• Objects move according to the transition probabilities: the probability of going from state j to state i is t_ij.
• Transition probabilities do not change over time.

The transition matrix of a Markov chain
• T = [t_ij] is an n × n matrix.
• Each entry t_ij is the probability of moving from state j to state i.
• 0 ≤ t_ij ≤ 1
• The sum of the entries in each column must equal 1 (the matrix is stochastic).

Example: Customers can choose from three major grocery stores: H-Mart, Freddy’s, and Shopper’s Market.
• Each year H-Mart retains 80% of its customers, while losing 15% to Freddy’s and 5% to Shopper’s Market.
• Freddy’s retains 65% of its customers, loses 20% to H-Mart and 15% to Shopper’s Market.
• Shopper’s Market keeps 70% of its customers, loses 20% to H-Mart and 10% to Freddy’s.

Example: the transition matrix (rows and columns ordered H-Mart, Freddy’s, Shopper’s Market; column j gives where store j’s customers go):

           H      F      S
    H   0.80   0.20   0.20
    F   0.15   0.65   0.10
    S   0.05   0.15   0.70

Look at the calculation used to determine the probability of starting at H-Mart and shopping there two years later: stay both years, leave for Freddy’s and return, or leave for Shopper’s Market and return:

    (0.80)(0.80) + (0.15)(0.20) + (0.05)(0.20) = 0.64 + 0.03 + 0.01 = 0.68

We can obtain the same result by multiplying row one by column one in the transition matrix:

    (0.80)(0.80) + (0.20)(0.15) + (0.20)(0.05) = 0.68

This matrix tells us the probabilities of going from one store to another after 2 years:

    T² =   0.68     0.32     0.32
           0.2225   0.4675   0.165
           0.0975   0.2125   0.515

• The probability of shopping at each store 2 years after shopping at Shopper’s Market is column three of T²: H-Mart 0.32, Freddy’s 0.165, Shopper’s Market 0.515.
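As a quick check, squaring the grocery-store transition matrix reproduces the two-year probabilities (a sketch; mat_mul is an illustrative helper, not from the talk):

```python
def mat_mul(A, B):
    """Entry (i, j) is the dot product of row i of A and column j of B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Columns ordered H-Mart, Freddy's, Shopper's Market; entry (i, j) is the
# probability that a customer of store j moves to store i in one year.
T = [[0.80, 0.20, 0.20],
     [0.15, 0.65, 0.10],
     [0.05, 0.15, 0.70]]

T2 = mat_mul(T, T)
print(round(T2[0][0], 4))                      # 0.68: H-Mart to H-Mart in two years
print([round(T2[i][2], 4) for i in range(3)])  # column 3: two years after Shopper's
```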

• If the initial distribution was evenly distributed between H-Mart, Freddy’s, and Shopper’s Market, the distribution after two years is T² q(0) with q(0) = (1/3, 1/3, 1/3), i.e. the average of the columns of T²: (0.44, 0.285, 0.275).

To utilize a Markov chain to compute probabilities, we need to know the initial probability vector q(0). If there are n states, let the initial probability vector be q(0) = (q_1, …, q_n), where
• q_i is the probability of being in state i initially
• all entries satisfy 0 ≤ q_i ≤ 1
• the entries sum to 1 (column sum = 1)

Example: What happens after 10 years? Computing T^10, every column is already nearly identical, approximately (0.50, 0.28, 0.22): after 10 years the market shares barely depend on the starting store.

• Let q(k) be the probability distribution after k steps.
• We are iterating q(k+1) = T q(k).
• Eventually, for large enough k, q(k+1) = q(k) = s.
• This gives s = T s.
• s is called a steady state vector.
• s is an eigenvector of T for eigenvalue 1.
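The iteration q(k+1) = T q(k) is easy to run directly. A minimal sketch using the grocery-store matrix, whose steady state works out to (1/2, 5/18, 2/9) ≈ (0.5, 0.278, 0.222):

```python
# Power iteration: repeatedly apply the transition matrix until the
# distribution stops changing, yielding the steady state vector s with T s = s.
T = [[0.80, 0.20, 0.20],
     [0.15, 0.65, 0.10],
     [0.05, 0.15, 0.70]]

def step(T, q):
    """One Markov step: (T q)_i = sum over j of t_ij * q_j."""
    return [sum(T[i][j] * q[j] for j in range(len(q))) for i in range(len(q))]

q = [1/3, 1/3, 1/3]  # start evenly distributed over the three stores
for _ in range(200):
    q_next = step(T, q)
    if max(abs(a - b) for a, b in zip(q_next, q)) < 1e-12:
        break
    q = q_next

print([round(x, 4) for x in q])  # [0.5, 0.2778, 0.2222]
```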

In the grocery example, there was a unique steady state vector s, and q(k) → s. This does not need to be the case: for example, the transition matrix that swaps two states at every step, T = [0 1; 1 0], keeps oscillating, so q(k) never converges unless the chain starts exactly at (1/2, 1/2).

How can we guarantee convergence to a unique steady state vector regardless of initial conditions?
• One way is by having a regular transition matrix.
• A nonnegative matrix is regular if some power of the matrix has only nonzero entries.
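The regularity test can be sketched directly from the definition: keep taking powers and check whether every entry becomes strictly positive (is_regular and the power cap are my choices, not from the talk):

```python
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def is_regular(T, max_power=60):
    """True if some power of the nonnegative matrix T has all positive entries."""
    P = T
    for _ in range(max_power):
        if all(entry > 0 for row in P for entry in row):
            return True
        P = mat_mul(P, T)
    return False

grocery = [[0.80, 0.20, 0.20],
           [0.15, 0.65, 0.10],
           [0.05, 0.15, 0.70]]
swap = [[0, 1],
        [1, 0]]   # flips the two states forever: no power is all-positive

print(is_regular(grocery))  # True (already all positive at the first power)
print(is_regular(swap))     # False
```

For an n × n matrix, Wielandt’s bound says checking powers up to n² − 2n + 2 suffices, so a small cap is safe for examples of this size.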

Digraphs
• A directed graph (digraph) is a set of vertices (nodes) and a set of directed edges (arcs) between vertices.
• The arcs indicate relationships between nodes.
• Digraphs can be used as models, e.g.:
  – cities and airline routes between them
  – web pages and links

How Matrices, Markov Chains and Digraphs are used by Google

How does Google work?
• Robot web crawlers find web pages.
• Pages are indexed & cataloged.
• Pages are assigned PageRank values.
• PageRank is a program that prioritizes pages, developed by Larry Page & Sergey Brin in 1998.
• When pages are identified in response to a query, they are ranked by PageRank value.

Why is PageRank important?
• Only a few years ago users waited much longer for search engines to return results to their queries.
• When a search engine finally responded, the returned list had many links to irrelevant information, and useless links invariably appeared at or near the top of the list, while useful links were deeply buried.
• The Web’s information is not structured like information in organized databases and document collections; it is self-organized.
• The enormous size of the Web, currently containing ~10^9 pages, completely overwhelmed traditional information retrieval (IR) techniques.

• By 1997 it was clear that IR technology of the past wasn’t well suited for Web search, and researchers set out to devise new approaches.
• Two big ideas emerged, each capitalizing on the link structure of the Web to differentiate between relevant information and fluff.
• One approach, HITS (Hypertext Induced Topic Search), was introduced by Jon Kleinberg.
• The other, which changed everything, is Google’s PageRank, developed by Sergey Brin and Larry Page.

How are PageRank values assigned?
• The number of links to and from a page gives information about the importance of a page.
• More inlinks → the more important the page.
• Inlinks from “good” pages carry more weight than inlinks from “weaker” pages.
• If a page points to several pages, its weight is distributed proportionally.

• Imagine the World Wide Web as a directed graph (digraph):
  – Each page is a vertex.
  – Each link is an arc.
• A sample 6-page web (a 6-vertex digraph).

• PageRank defines the rank of page i recursively by

    r_i = Σ_{j ∈ I_i} r_j / |O_j|

  where
  – r_j is the rank of page j
  – I_i is the set of pages that point into page i
  – O_j is the set of pages that page j links out to (so |O_j| is the number of outlinks of page j)

 For example, the rank of page 2 in our sample web:

• Since this is a recursive definition, PageRank assigns an initial ranking equally to all pages, r_i^(0) = 1/n, and then iterates

    r_i^(k+1) = Σ_{j ∈ I_i} r_j^(k) / |O_j|

The process can be written using matrix notation.
• Let q(k) be the PageRank vector at the kth iteration.
• Let T be the transition matrix for the web.
• Then q(k+1) = T q(k).
• T is the matrix such that t_ij is the probability of moving from page j to page i in one time step, based on the assumption that all outlinks are equally likely to be selected.

Using our 6-node sample web:  Transition matrix:

To eliminate dangling nodes and obtain a stochastic matrix, replace a column of zeros with a column of 1/n’s, where n is the number of web pages.

• The Web’s nature is such that T would not be regular.
• Brin & Page force the transition matrix to be regular by making sure every entry satisfies 0 < t_ij < 1:
  – Create a perturbation matrix E having all entries equal to 1/n.
  – Form the “Google matrix”: G = α T + (1 − α) E, where 0 < α < 1.
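Putting the dangling-node fix and the perturbation together, here is a sketch that builds G = αT + (1 − α)E and iterates to the PageRank vector. The 4-page link structure below is invented for illustration; it is not the talk’s 6-page sample web.

```python
n = 4
# Hypothetical web: outlinks[j] lists the pages that page j links to.
# Page 3 is a dangling node (no outlinks).
outlinks = {0: [1, 2], 1: [2], 2: [0], 3: []}

# Column-stochastic transition matrix; a dangling page's column becomes 1/n.
T = [[0.0] * n for _ in range(n)]
for j, outs in outlinks.items():
    targets = outs if outs else list(range(n))
    for i in targets:
        T[i][j] = 1 / len(targets)

# Google matrix: every entry of G is strictly positive, so G is regular.
alpha = 0.85
G = [[alpha * T[i][j] + (1 - alpha) / n for j in range(n)] for i in range(n)]

# Iterate q(k+1) = G q(k) from the uniform vector.
q = [1 / n] * n
for _ in range(100):
    q = [sum(G[i][j] * q[j] for j in range(n)) for i in range(n)]

print([round(x, 3) for x in q])  # the PageRank scores: positive, summing to 1
```

In this invented web, page 2, which is linked from both page 0 and page 1, ends up with the highest score.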

Using  = 0.85 for our 6-node sample web:

• By calculating powers of the transition matrix, we can determine the stationary vector:

• Stationary vector for our 6-node sample web:

How does Google use this stationary vector?
• Query requests term 1 or term 2.
• Inverted file storage is accessed:
  – Term 1 → doc 3, doc 2, doc 6
  – Term 2 → doc 1, doc 3
• Relevancy set is {1, 2, 3, 6}.
• s_1 = .2066, s_2 = .1770, s_3 = .1773, s_6 = .1309
• Doc 1 is deemed most important.

• Adding a perturbation matrix seems reasonable, based on the “random jump” idea: the user types in a URL.
• This is only the basic idea behind Google, which has many refinements we have ignored.
• PageRank as originally conceived and described here ignores the “Back” button.
• PageRank is still undergoing development.
• The details of PageRank’s operation and the value of α are a trade secret.

• Updates to the Google matrix are done periodically.
• The Google matrix is HUGE.
• Sophisticated numerical methods are used.

Thank you!