How PageRank Works Ketan Mayer-Patel University of North Carolina January 31, 2011.


Me vs. Jeff
Me:
– High school: public school in Texas
– College: The University of California, Berkeley
– Faculty member at: UNC
Jeff:
– High school: hoity-toity, private all-boys school in Jersey
– College: Stanford
– Faculty member at: Duke

The World Wide Web A Simple Request/Response System Request for web page. Web page returned.

Making The Request
How do you make a web request?
– Use a browser.
  Specify what you want directly.
  Follow a link.
– Turns out we very rarely specify documents directly.
– Uniform Resource Locator (URL)
– Two key characteristics of hyperlinks:
  Directional
  Unilateral

Web Search In Three Easy Steps What’s step one? – Cut a hole in the box.

Web Search In Three Easy Steps
First, crawl.
– Try to find all of the web pages.
– Follow the links.
Second, index.
– Organize what you find.
– Lots of secret sauce here.
Third, query.
– Usually, text query words.
– Retrieves a list of related pages, usually because they contain the query text.
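The three steps can be sketched on a toy scale. This is only an illustration, not how a production engine works: the in-memory `WEB` dictionary, the page names, and the helper functions `crawl`, `build_index`, and `query` are all hypothetical stand-ins.

```python
from collections import deque

# Hypothetical in-memory "web": page name -> (page text, outgoing links).
WEB = {
    "a.html": ("tar heels basketball", ["b.html", "c.html"]),
    "b.html": ("duke basketball", ["c.html"]),
    "c.html": ("page rank notes", ["a.html"]),
    "d.html": ("unlinked page about basketball", []),
}

def crawl(start):
    """Step 1: follow links breadth-first from a start page."""
    seen, queue = {start}, deque([start])
    while queue:
        url = queue.popleft()
        for link in WEB[url][1]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

def build_index(urls):
    """Step 2: map each word to the set of pages that contain it."""
    index = {}
    for url in urls:
        for word in WEB[url][0].split():
            index.setdefault(word, set()).add(url)
    return index

def query(index, words):
    """Step 3: return the pages that contain every query word."""
    hits = [index.get(w, set()) for w in words.split()]
    return sorted(set.intersection(*hits)) if hits else []

index = build_index(crawl("a.html"))
print(query(index, "basketball"))  # d.html is missing: nothing links to it
```

The crawler never finds `d.html` because no crawled page links to it. Note also that `query` returns its matches in alphabetical order, with no notion of which page matters more.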

Which to list first?
Possible clues:
– Number of times the query term appears
– Where it appears: title, body text, URL, metadata, etc.
– How it appears: style of text, role of text
– Position in the document graph. This is what distinguished Google from other search engines at the time.

PageRank
– Supposedly named after Larry Page
– Part of his research in grad school
– Patented while he was in grad school.
– Licensed to Google for ~1 million shares of Google, which were later sold for about $300M.

Document Graph

Probability Distribution of a Random Walk
– Start walking the graph.
– After some reasonably long amount of time, stop.
– What’s the chance that you are on a particular page?
– Larger chance => more important page
– Is this actually true? Maybe, maybe not.
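A rough way to see this distribution emerge is to simulate the walk. The sketch below takes a shortcut that is an assumption of this example: instead of stopping many separate walks and recording where each one ended, it counts the fraction of steps one long walk spends on each page. The three-page `GRAPH` and the function name `walk_frequencies` are hypothetical.

```python
import random

# Hypothetical 3-page graph: page -> pages it links to (no dead ends).
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def walk_frequencies(graph, steps, seed=0):
    """Walk the graph at random and return the share of steps per page."""
    rng = random.Random(seed)
    page = rng.choice(sorted(graph))
    visits = {p: 0 for p in graph}
    for _ in range(steps):
        page = rng.choice(graph[page])  # follow a random outgoing link
        visits[page] += 1
    return {p: n / steps for p, n in visits.items()}

freq = walk_frequencies(GRAPH, 200_000)
print(freq)
```

For this graph the long-run shares settle near 0.4 for A and C and 0.2 for B. B is reached only half of the time the walker leaves A, which is why it scores lowest.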

Random Walk Example

Trapdoors and Dead Ends Shangri-La: Can’t ever get here. Hotel California: Can’t ever leave.

Spider Traps

Fixing Our Random Walk
What can we do to fix it?
– Add a bit more randomness.
  At each step, with probability α jump to any random page.
  Otherwise, randomly follow a link.
– Provides a way in to / out of trapdoors / dead ends and spider traps.
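A minimal sketch of the fixed walk follows. Several things here are assumptions of the example rather than anything stated in the talk: the value `alpha=0.15`, the rule that a dead end always forces a jump, the sample `GRAPH`, and the function name `surf`.

```python
import random

# Hypothetical graph: S has no inbound links ("Shangri-La"), D is a dead
# end ("Hotel California"), and E <-> F form a spider trap.
GRAPH = {
    "S": ["A"],
    "A": ["B", "D", "E"],
    "B": ["A"],
    "D": [],
    "E": ["F"],
    "F": ["E"],
}

def surf(graph, steps, alpha=0.15, seed=0):
    """Random walk that jumps to a random page with probability alpha."""
    rng = random.Random(seed)
    pages = sorted(graph)
    page = rng.choice(pages)
    visits = {p: 0 for p in pages}
    for _ in range(steps):
        links = graph[page]
        # Assumption: a dead end has nothing to follow, so it always jumps.
        if not links or rng.random() < alpha:
            page = rng.choice(pages)
        else:
            page = rng.choice(links)
        visits[page] += 1
    return {p: n / steps for p, n in visits.items()}

freq = surf(GRAPH, 100_000)
print(freq)
```

With the jumps in place every page is visited, including S, and the walker no longer gets stuck in D or in the E/F loop.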

Random Walk Scalability
Problem: Would need to simulate the random walk over and over again to even come close to discovering the underlying probability distribution.
– Easy to do for small graphs.
– Pain in the ass for large ones.
Markov Chain
– Tool for analyzing stochastic processes.
– Power method

Power Method Equation
N : Number of documents
R_k : Page rank of document k
L_k : Number of outgoing links in k
δ(k,j) : Delta function for links between k and j; δ(k,j) = 1 if and only if there exists a link from document k to document j
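One update rule consistent with these symbols is written out below. It is a reconstruction from the definitions, not necessarily the slide's exact notation. The superscript (i), marking the iteration number, is an assumption of this sketch, and the sum is taken only over documents k with L_k > 0.

```latex
R_j^{(i+1)} \;=\; \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k^{(i)}}{L_k}
```

Each document k splits its current rank evenly among its L_k outgoing links, and document j collects the shares sent to it.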

Power Method Equation
Our definition is circular.
– To calculate page rank of a page we need to already know the page rank of other pages.
Iterative solution.
– Start with an initial assignment. Basically set the page rank of every page to 1/N. Why 1/N?
– Calculate an updated value for every page using the current values.
– Keep repeating until the values are stable.
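The iterative solution can be sketched in a few lines. The stopping tolerance `tol`, the iteration cap `max_iter`, the sample `GRAPH`, and the function name `power_method` are assumptions of this example; the talk only says to repeat until the values are stable.

```python
def power_method(links, tol=1e-10, max_iter=1000):
    """links: page -> list of pages it links to. Returns page -> rank."""
    n = len(links)
    rank = {p: 1.0 / n for p in links}         # initial assignment: 1/N each
    for _ in range(max_iter):
        new = {p: 0.0 for p in links}
        for k, outs in links.items():
            for j in outs:
                new[j] += rank[k] / len(outs)  # k sends a fair share to j
        # Stable once no page's value moves by more than tol.
        if max(abs(new[p] - rank[p]) for p in links) < tol:
            return new
        rank = new
    return rank

# Hypothetical 3-page graph with no dead ends or spider traps.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = power_method(GRAPH)
print(ranks)
```

Starting every page at 1/N makes the initial values sum to 1, and on a graph without dead ends each update preserves that total, so the result can still be read as a probability distribution.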

Power Method Equation Intuition: – Page rank of a document is the sum of its fair share of the page ranks of the pages that link to the document.

Example
(Figures showing the page rank values at successive iterations i are not reproduced.)
Something is wrong!
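One way to see what goes wrong is to run the plain update on graphs that contain a dead end or a spider trap. The two sample graphs and the helper names `step` and `run` are hypothetical, chosen only to make the effect visible.

```python
def step(links, rank):
    """One plain power-method update: each page splits its rank over its links."""
    new = {p: 0.0 for p in links}
    for k, outs in links.items():
        for j in outs:
            new[j] += rank[k] / len(outs)
    return new

def run(links, iterations):
    rank = {p: 1.0 / len(links) for p in links}
    for _ in range(iterations):
        rank = step(links, rank)
    return rank

# Hypothetical graphs: D is a dead end; E and F form a spider trap.
DEAD_END = {"A": ["B", "D"], "B": ["A"], "D": []}
TRAP = {"A": ["B", "E"], "B": ["A"], "E": ["F"], "F": ["E"]}

leaked = run(DEAD_END, 50)
trapped = run(TRAP, 50)
print(sum(leaked.values()))         # total rank drains away through D
print(trapped["E"] + trapped["F"])  # the trap ends up holding nearly everything
```

Rank that reaches D has nowhere to go and is dropped on the next update, so the total shrinks toward zero. Rank that enters the E/F loop never returns to A or B.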

Power Method v2
Dead ends leak. Spider traps slowly collect everything.
Translating our random walk solution:
– Add a “virtual” link from every document to every other document.
– Define a weighting factor α between 0.0 and 1.0
  Distribute α proportion of your page rank over the virtual links
  Distribute (1 - α) proportion of your page rank over the real links
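A sketch of this second version is shown below. Three things are assumptions of the example and not taken from the talk: the value `alpha=0.15`, the fixed count of 50 iterations, and the rule that a dead end, having no real links, puts all of its rank on the virtual links. The sample `GRAPH` and the name `pagerank_v2` are hypothetical.

```python
def pagerank_v2(links, alpha=0.15, iterations=50):
    """Power method with virtual links from every page to every other page."""
    pages = list(links)
    n = len(pages)  # needs at least two pages
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for k in pages:
            outs = links[k]
            # Assumption: a dead end spreads its whole rank over the
            # virtual links instead of leaking the (1 - alpha) part.
            virtual = alpha if outs else 1.0
            for j in pages:
                if j != k:
                    new[j] += virtual * rank[k] / (n - 1)
            for j in outs:
                new[j] += (1 - alpha) * rank[k] / len(outs)
        rank = new
    return rank

# Hypothetical graph: S has no inbound links, D is a dead end,
# and E <-> F form a spider trap.
GRAPH = {
    "S": ["A"],
    "A": ["B", "D", "E"],
    "B": ["A"],
    "D": [],
    "E": ["F"],
    "F": ["E"],
}
ranks = pagerank_v2(GRAPH)
print(ranks)
```

Under these assumptions the total rank stays at 1 and every page ends up with a nonzero score. The trap pages E and F still rank high but no longer absorb everything.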

Convergence
– Typical value for α is …
– Convergence typically occurs in about 50 iterations even for large graphs.

Billions and billions
How do you do this with billions of documents?
– Can be implemented using matrix math.
– Special techniques for sparse matrices.
– PageRank is roughly equivalent to the first (principal) eigenvector of the link matrix.
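The matrix view can be illustrated on a tiny graph. This sketch builds a dense matrix purely for readability; at web scale a dense matrix is impossible and sparse representations are used instead. Several choices are assumptions of the example: `alpha=0.15`, a jump that may land on any page including the current one, a graph with no dead ends, and the helper names `google_matrix`, `matvec`, and `principal_eigenvector`.

```python
def google_matrix(links, alpha=0.15):
    """Dense matrix G where G[j][k] is the chance of moving from k to j."""
    pages = sorted(links)
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}
    G = [[alpha / n] * n for _ in range(n)]  # random jump to any page
    for k, outs in links.items():
        for j in outs:
            G[idx[j]][idx[k]] += (1 - alpha) / len(outs)  # real links
    return pages, G

def matvec(G, v):
    """Multiply matrix G by vector v."""
    return [sum(g * x for g, x in zip(row, v)) for row in G]

def principal_eigenvector(G, iterations=100):
    """Power method: repeated multiplication converges to G v = v."""
    v = [1.0 / len(G)] * len(G)
    for _ in range(iterations):
        v = matvec(G, v)
    return v

# Hypothetical 3-page graph in which every page has an outgoing link.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages, G = google_matrix(GRAPH)
v = principal_eigenvector(G)
print(dict(zip(pages, v)))
```

One more multiplication by `G` leaves the result essentially unchanged, which is what it means for `v` to be an eigenvector with eigenvalue 1.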

Gaming The System
Google Bomb!
– Create a lot of links to the page that you want to be highly ranked.
Create your own spider trap.
– Relatively easy to combat by discounting links that come from the same domain.
Comment spam.
Porn trap.
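As a rough illustration of discounting same-domain links, the sketch below drops them entirely before any ranking is run. Two simplifications are assumptions of this example: "domain" is taken to mean the exact host name in the URL, and links are removed outright instead of being down-weighted. The function name `drop_same_domain` and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

def drop_same_domain(edges):
    """Keep only links whose source and target hosts differ."""
    return [
        (src, dst)
        for src, dst in edges
        if urlparse(src).netloc != urlparse(dst).netloc
    ]

edges = [
    ("http://spam.example/a", "http://spam.example/b"),     # same host
    ("http://spam.example/b", "http://spam.example/a"),     # same host
    ("http://blog.example/post", "http://spam.example/b"),  # kept
]
print(drop_same_domain(edges))
```

With the internal links gone, a self-built spider trap contributes nothing. Only the link from the other site still counts toward the target page.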

Last Notes Stanford Sucks! GO HEELS!

Bad Math When originally presented, the final version of the power method equation was shown as: The simplification for the first term is wrong and should have been: