The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

• The Internet (1969) is a network that’s:
  ▪ Global
  ▪ Decentralized
  ▪ Redundant
  ▪ Made up of many different types of machines
• How many machines make up the Internet?

from Fluency with Information Technology, 4th edition by Lawrence Snyder, Addison-Wesley, 2010, ISBN 0-13-609182-2

 Sir Tim Berners-Lee

• The World Wide Web (or just Web) is:
  ▪ Global
  ▪ Decentralized
  ▪ Redundant (sometimes)
  ▪ Made up of Web pages and interactive Web services
• How many Web pages are on the Web?

• Links are useful to us humans for navigating Web sites and finding things
• Links are also useful to search engines
  [Figure: a link reading "Latest News", labeled with its anchor text and its destination link (URL)]
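To make the two parts of a link concrete, here is a minimal sketch of how a crawler might pull anchor text and destination URLs out of a page, using Python's standard `html.parser` module. The class name `AnchorCollector`, the sample HTML, and the example.com URL are hypothetical and chosen only for illustration; the slides do not prescribe any particular parser.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (destination URL, anchor text) pairs from <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, anchor_text) pairs
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears between <a ...> and </a>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical page fragment; the URL is a placeholder.
html = '<p>See <a href="http://example.com/news.html">Latest News</a> today.</p>'
collector = AnchorCollector()
collector.feed(html)
collector.close()
print(collector.links)
```

A search engine could then index each anchor text under its destination URL, so that the words other pages use to describe a page become searchable for that page.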

• How does anchor text apply to ranking?
  ▪ Anchor text describes the content of the destination page
  ▪ Anchor text is short, descriptive, and often coincides with query text
  ▪ Anchor text is typically written by an unbiased third party

• We often represent Web pages as vertices and links as edges in a webgraph
  [Image: http://www.openarchives.org/ore/0.1/datamodel-images/WebGraphBase.jpg]

• An example:
  [Image: http://www.growyourwritingbusiness.com/images/web_graph_flower.jpg]

• Links may be interpreted as describing a destination Web page in terms of its:
  ▪ Popularity
  ▪ Importance
• We focus on incoming links (inlinks)
  ▪ And use this for ranking matching documents
  ▪ Drawback is obtaining incoming link data
• Authority
• Incoming link count
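The drawback mentioned above follows from how links are stored: a page lists only its own outgoing links, so inlinks can be found only by scanning every crawled page. A minimal sketch, assuming a tiny hypothetical crawl stored as a dictionary of outlinks (the page names A, B, C and the link structure are illustrative assumptions):

```python
# Hypothetical crawl result: each page maps to the pages it links to.
outlinks = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}


def invert_links(outlinks):
    """Build the inlink sets by scanning every page's outgoing links."""
    inlinks = {page: set() for page in outlinks}
    for source, targets in outlinks.items():
        for target in targets:
            # A target may be a page we never crawled, so create it on demand.
            inlinks.setdefault(target, set()).add(source)
    return inlinks


inlinks = invert_links(outlinks)
inlink_counts = {page: len(sources) for page, sources in inlinks.items()}
print(inlink_counts)
```

Ranking matching documents by `inlink_counts` is the simplest use of this data; PageRank refines it by weighting each inlink by the rank of the page it comes from.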

• PageRank is a link analysis algorithm
• PageRank is credited to Sergey Brin and Lawrence Page (the Google guys!)
• The original PageRank paper:
  ▪ http://infolab.stanford.edu/~backrub/google.html

• Browse the Web as a random surfer:
  ▪ Choose a random number r between 0 and 1
  ▪ If r < λ, then go to a random page
  ▪ Else follow a random link from the current page
  ▪ Repeat!
• The PageRank of page A (denoted PR(A)) is the probability that this “random surfer” will be looking at that page
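The random-surfer procedure can be simulated directly. The sketch below is an illustration, not the method used by any real search engine: the function name `random_surfer`, the three-page graph, the step count, and the fixed seed are all assumptions. It also assumes that a page with no outgoing links forces a random jump, which is one common way to avoid getting stuck.

```python
import random
from collections import Counter


def random_surfer(outlinks, lam=0.15, steps=100_000, seed=0):
    """Estimate PageRank as the fraction of steps spent on each page."""
    rng = random.Random(seed)
    pages = sorted(outlinks)
    current = rng.choice(pages)
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        r = rng.random()                      # random number r between 0 and 1
        links = outlinks[current]
        if r < lam or not links:
            current = rng.choice(pages)       # jump to a random page
        else:
            current = rng.choice(links)       # follow a random link
    return {page: visits[page] / steps for page in pages}


# Hypothetical three-page Web: A links to B and C, B links to C, C links to A.
outlinks = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
estimate = random_surfer(outlinks)
print(estimate)
```

The estimates always sum to 1, since the surfer is on exactly one page at every step. More steps give a less noisy estimate.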

• Jumping to a random page avoids getting stuck in:
  ▪ Pages that have no links
  ▪ Pages that only have broken links
  ▪ Pages that loop back to previously visited pages

• PageRank of page C is the probability a random surfer is viewing page C
  ▪ Based on inlinks
• PR(C) = PR(A) / 2 + PR(B) / 1 (page A has two outgoing links, one of them to C; page B has one outgoing link, to C)
• We assume PageRank is initially distributed evenly across all pages (so 0.33 for A, B, and C)
• PR(C) = 0.33 / 2 + 0.33 / 1 ≈ 0.50

• More generally:
  $$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L_v}$$
  ▪ $B_u$ is the set of pages that point to u
  ▪ $L_v$ is the number of outgoing links from page v (not counting duplicate links)
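The general rule says that each page v shares its PageRank equally among the distinct pages it links to, and PR(u) is the sum of the shares u receives. A minimal sketch of one such update, assuming a hypothetical three-page graph in which A links to B and C, B links to C, and C links to A (only A's two outlinks and B's single outlink to C are implied by the earlier example; the rest is an assumption), and assuming every linked page appears as a key in `outlinks`:

```python
def pagerank_step(outlinks, pr):
    """Apply one update: each page v gives PR(v) / L_v to every page it links to."""
    new_pr = {page: 0.0 for page in outlinks}
    for v, targets in outlinks.items():
        distinct = set(targets)               # L_v does not count duplicate links
        for u in distinct:
            new_pr[u] += pr[v] / len(distinct)
    return new_pr


# Hypothetical graph; start with PageRank spread evenly over the three pages.
outlinks = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = {page: 1 / len(outlinks) for page in outlinks}
step = pagerank_step(outlinks, pr)
print({page: round(value, 2) for page, value in step.items()})
```

For page C this reproduces the hand calculation PR(C) = PR(A) / 2 + PR(B) / 1 from the even starting distribution.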

• We can account for the “random jumps” by incorporating constant λ into the equation:
  $$PR(u) = \frac{\lambda}{N} + (1 - \lambda) \sum_{v \in B_u} \frac{PR(v)}{L_v}$$
  ▪ N is the number of pages
  ▪ Typically, λ is low (e.g. λ = 0.15)
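With the random-jump term included, every page receives λ / N plus (1 − λ) times the link shares it is sent, and repeating the update lets the values settle. The sketch below is one possible way to iterate it, not a definitive implementation: the function name `pagerank`, the fixed iteration count, the three-page graph, and the choice to spread a dead-end page's rank evenly over all pages are assumptions.

```python
def pagerank(outlinks, lam=0.15, iterations=100):
    """Iterate PR(u) = lam / N + (1 - lam) * sum of PR(v) / L_v over inlinks v."""
    pages = list(outlinks)
    n = len(pages)
    pr = {page: 1 / n for page in pages}            # start evenly distributed
    for _ in range(iterations):
        new_pr = {page: lam / n for page in pages}  # random-jump share
        for v in pages:
            distinct = set(outlinks[v])             # ignore duplicate links
            if not distinct:
                # Assumption: a page with no outlinks spreads its rank evenly.
                for u in pages:
                    new_pr[u] += (1 - lam) * pr[v] / n
            else:
                for u in distinct:
                    new_pr[u] += (1 - lam) * pr[v] / len(distinct)
        pr = new_pr
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links to A.
outlinks = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(outlinks)
print({page: round(value, 3) for page, value in ranks.items()})
```

Because λ / N is added to every page and the remaining (1 − λ) is only redistributed, the values keep summing to 1, so they can still be read as probabilities.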

 A cycle tends to negate the effectiveness of the PageRank algorithm

 Read and study Chapter 4.5
