The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice,


1 The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

2 The Internet (1969) is a network that's:
• Global
• Decentralized
• Redundant
• Made up of many different types of machines
How many machines make up the Internet?

3 from Fluency with Information Technology, 4th edition by Lawrence Snyder, Addison-Wesley, 2010, ISBN 0-13-609182-2

4 Sir Tim Berners-Lee

5 The World Wide Web (or just Web) is:
• Global
• Decentralized
• Redundant (sometimes)
• Made up of Web pages and interactive Web services
How many Web pages are on the Web?

6 Links are useful to us humans for navigating Web sites and finding things
• Links are also useful to search engines
• Example link: "Latest News" is the anchor text; the destination link is a URL

7 How does anchor text apply to ranking?
• Anchor text describes the content of the destination page
• Anchor text is short, descriptive, and often coincides with query text
• Anchor text is typically written by an unbiased third party
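To make the role of anchor text concrete, here is a minimal sketch of how a crawler might collect (destination URL, anchor text) pairs from a page. It uses Python's standard `html.parser` module; the class name `AnchorCollector`, the helper `extract_anchors`, and the example URL are illustrative assumptions, not something taken from the slides.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (destination URL, anchor text) pairs from <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, anchor text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_anchors(html):
    """Return a list of (href, anchor text) pairs found in an HTML string."""
    parser = AnchorCollector()
    parser.feed(html)
    return parser.links


# Hypothetical page fragment, for illustration only.
page = '<p>See <a href="http://example.com/news">Latest News</a> today.</p>'
print(extract_anchors(page))
```

A search engine could then index the anchor text under the destination URL, so that a query matching the anchor text can retrieve the destination page.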

8 We often represent Web pages as vertices and links as edges in a webgraph
(image: http://www.openarchives.org/ore/0.1/datamodel-images/WebGraphBase.jpg)

9 An example:
(image: http://www.growyourwritingbusiness.com/images/web_graph_flower.jpg)

10 Links may be interpreted as describing a destination Web page in terms of its:
• Popularity
• Importance
• Authority
We focus on incoming links (inlinks), for example the incoming link count
• And use this for ranking matching documents
• Drawback is obtaining incoming link data
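The drawback noted above comes from the fact that a crawler sees each page's outgoing links, so inlink data has to be derived by inverting the webgraph. The sketch below counts inlinks from an outlink table; the function name `inlink_counts` and the three-page graph are assumptions made for illustration.

```python
def inlink_counts(outlinks):
    """Count incoming links per page, given {page: [pages it links to]}.

    Duplicate links from the same source page are counted once.
    """
    counts = {page: 0 for page in outlinks}
    for source, targets in outlinks.items():
        for target in set(targets):
            counts[target] = counts.get(target, 0) + 1
    return counts


# Hypothetical webgraph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(inlink_counts(graph))
```

Here page C has the highest incoming link count, so a ranking based only on inlink count would favor C among matching documents.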

11 PageRank is a link analysis algorithm
• PageRank is credited to Sergey Brin and Lawrence Page (the Google guys!)
• The original PageRank paper:
▪ http://infolab.stanford.edu/~backrub/google.html

12 Browse the Web as a random surfer:
• Choose a random number r between 0 and 1
• If r < λ then go to a random page
• else follow a random link from the current page
• Repeat!
The PageRank of page A (denoted PR(A)) is the probability that this "random surfer" will be looking at that page
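The random-surfer procedure can be sketched as a small simulation that estimates PageRank as the fraction of steps spent on each page. The function name `random_surfer`, the default λ = 0.15, the step count, and the three-page graph are assumptions for illustration. The sketch also assumes that the surfer jumps to a random page whenever the current page has no outgoing links, and that every link target appears as a key in the table.

```python
import random


def random_surfer(outlinks, lam=0.15, steps=100_000, seed=0):
    """Estimate PageRank by simulating a random surfer.

    outlinks maps each page to the list of pages it links to.
    Returns {page: fraction of steps the surfer spent on that page}.
    """
    rng = random.Random(seed)
    pages = sorted(outlinks)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        links = outlinks[current]
        # With probability lam, or if there is nothing to follow
        # (an assumption of this sketch), jump to a random page.
        if rng.random() < lam or not links:
            current = rng.choice(pages)
        else:
            current = rng.choice(links)
    return {page: visits[page] / steps for page in pages}


# Hypothetical webgraph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(random_surfer(graph))
```

The estimates vary slightly with the seed and the number of steps; more steps give values closer to the underlying probabilities.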

13 Jumping to a random page avoids getting stuck in:
• Pages that have no links
• Pages that only have broken links
• Pages that loop back to previously visited pages

14 PageRank of page C is the probability that a random surfer is viewing page C
• Based on inlinks
• PR(C) = PR(A) / 2 + PR(B) / 1 (A has two outgoing links, B has one)
• We assume PageRank is initially distributed evenly across all pages (so 1/3 ≈ 0.33 each for A, B, and C)
• PR(C) = 0.33 / 2 + 0.33 / 1 ≈ 0.50

15 More generally:
$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L_v}$$
• $B_u$ is the set of pages that point to u
• $L_v$ is the number of outgoing links from page v (not counting duplicate links)
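Read together with the PR(C) example on the previous slide, these definitions say that each page v splits its PageRank evenly among its outgoing links, and page u sums the shares it receives from the pages pointing to it. A minimal sketch of one such update follows. The function name `pagerank_step` is hypothetical, and the edges A→B, A→C, B→C, C→A are an assumption chosen to be consistent with PR(C) = PR(A) / 2 + PR(B) / 1.

```python
def pagerank_step(outlinks, pr):
    """Apply one update of PR(u) = sum over v in B_u of PR(v) / L_v.

    outlinks maps each page to the pages it links to; pr maps each page
    to its current PageRank. Duplicate links are counted once.
    """
    new_pr = {page: 0.0 for page in outlinks}
    for v, targets in outlinks.items():
        unique = set(targets)          # L_v ignores duplicate links
        for u in unique:
            new_pr[u] += pr[v] / len(unique)
    return new_pr


# Assumed webgraph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
start = {page: 1 / 3 for page in graph}   # PageRank spread evenly at first
print(pagerank_step(graph, start))
```

Starting from 1/3 for each page, one update gives PR(C) = (1/3) / 2 + (1/3) / 1 = 0.5, matching the worked example.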

16 We can account for the "random jumps" by incorporating constant λ into the equation:
$$PR(u) = \frac{\lambda}{N} + (1 - \lambda) \sum_{v \in B_u} \frac{PR(v)}{L_v}$$
• N is the number of pages
• Typically, λ is low (e.g. λ = 0.15)
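A minimal sketch of computing PageRank with random jumps is shown below. It assumes the common form of the equation in which each page receives λ/N from random jumps plus (1 − λ) times the link-based sum, and it simply repeats the update a fixed number of times. The function name `pagerank`, the iteration count, and the three-page graph are illustrative assumptions; the sketch also assumes every page has at least one outgoing link.

```python
def pagerank(outlinks, lam=0.15, iterations=100):
    """Iterate PR(u) = lam/N + (1 - lam) * sum over v in B_u of PR(v) / L_v.

    outlinks maps each page to the pages it links to. This sketch assumes
    every page has at least one outgoing link (no dead ends).
    """
    n = len(outlinks)
    pr = {page: 1.0 / n for page in outlinks}    # start evenly distributed
    for _ in range(iterations):
        new_pr = {page: lam / n for page in outlinks}   # random-jump share
        for v, targets in outlinks.items():
            unique = set(targets)                # L_v ignores duplicates
            for u in unique:
                new_pr[u] += (1 - lam) * pr[v] / len(unique)
        pr = new_pr
    return pr


# Hypothetical webgraph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, score in sorted(pagerank(graph).items()):
    print(page, round(score, 3))
```

Because every page's full PageRank is redistributed on each pass, the scores continue to sum to 1 across iterations.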

17

18 A cycle tends to negate the effectiveness of the PageRank algorithm

19 Read and study Chapter 4.5

