LIS618 lecture 9 Web retrieval Thomas Krichel 2011-11-21.

1 LIS618 lecture 9 Web retrieval Thomas Krichel 2011-11-21

2 structure A mini Web theory. Google “theory”, mainly page rank.

3 literature Brin and Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, http://infolab.stanford.edu/~backrub/google.html. Calishain and Dornfest, “Google Hacks”, O'Reilly, 2003. Schneider et al., “How to Do Everything with Google”, McGraw-Hill Osborne, 2004. Google web site.

4 web information retrieval We can think of the web as a pile of documents called pages. Some pages are hard to index: PDF documents, pictures, sound files. But a majority of pages are written in HTML. These are easy to index and have a loose structure.

5 Google uses the structure of HTML Google finds the title of the page, i.e. the contents of the <title> element. Google analyzes headings and large font sizes and gives priority weight to terms found there. Most importantly, Google uses the link structure of the web to find important pages.
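
To make the idea of "using the structure of HTML" concrete, here is a minimal sketch that pulls the title and the headings out of a page with Python's standard html.parser module. It only illustrates which parts of a page an indexer could weight more heavily. The class name TitleAndHeadings and the function extract are made up for this example, and nothing here is claimed to be how Google does it.

```python
from html.parser import HTMLParser


class TitleAndHeadings(HTMLParser):
    """Collect the text of <title> and of <h1>..<h6> elements (illustrative only)."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self._current = None  # tag we are currently inside, if it interests us

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADINGS:
            self._current = tag
            if tag != "title":
                self.headings.append("")  # start a new heading

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current in self.HEADINGS:
            self.headings[-1] += data


def extract(html):
    """Return (title, list of heading texts) for an HTML string."""
    parser = TitleAndHeadings()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), [h.strip() for h in parser.headings]


sample = ("<html><head><title>Web retrieval</title></head>"
          "<body><h1>Page rank</h1><p>text</p><h2>The monkey</h2></body></html>")
print(extract(sample))
```

Terms found in the returned title and headings could then be given a higher weight than terms from ordinary paragraphs.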

6 classic IR and the web In classic information retrieval, every document has the same importance. Documents differ only as to their relevance to a query. In classic information retrieval, a document d is relevant if the query terms appear relatively frequently in d rather than in other documents. But if a web page contains the words “Bill Clinton sucks” and a picture, it is not a good hit for “Bill Clinton”.
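
The classic notion of relevance described here (terms frequent in d, rare elsewhere) is commonly captured by a tf-idf weight. The following is a rough sketch under that assumption. The function name tf_idf_scores, the whitespace tokenization and the three sample documents are all invented for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query, documents):
    """Score every document for the query with a plain tf-idf sum.

    All documents are treated as equally important; only term
    frequencies decide the ranking.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    scores = []
    for words in tokenized:
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            # number of documents that contain the term at all
            df = sum(1 for other in tokenized if term in other)
            if df == 0 or not words:
                continue
            tf = counts[term] / len(words)   # frequency inside this document
            idf = math.log(n_docs / df)      # rarity across the collection
            score += tf * idf
        scores.append(score)
    return scores


docs = ["clinton clinton speech", "weather report today", "clinton weather"]
scores = tf_idf_scores("clinton", docs)
print(scores)
```

A very short page that repeats the query terms scores highly under such a scheme, which is exactly the weakness the “Bill Clinton” example points at.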

7 Google finds important pages The idea is that the documents on the web have different degrees of importance. Google will show the most important pages first. The idea is that more important pages are likely to be more relevant to any query than non-important pages.

8 Google’s monkey Imagine that the web has P pages. Each page has its own address (URL). Imagine a monkey who sits at a terminal. He follows links at random, but on rare occasions he gets bored and types in an address of a random page out of those P. Will the monkey visit all pages with equal probability?

9 page rank The Google page rank of a page is the probability that Google’s monkey will visit the page. – The monkey will come frequently to pages that have a lot of links to them. – Once he is there, he is likely to move on to a page that this important page links to. The structure of all the links on the entire web reveals the importance of the page.

10 many page ranks There is an infinite number of ways to calculate the page rank, depending on – how likely the monkey is to get bored. – the probability that the bored monkey visits each page. Potentially, there is a page rank for each user of the web. Google tries to observe users and may be assigning personal page ranks.

11 notation Assume that the monkey gets bored with probability d. If bored, it will visit page p with probability π_p. For any page p, let o_p be the number of outgoing links on p. Let l(p',p) be the number of links from page p' to page p.

12 page rank formula The page rank for a page p is r_p = π_p d + (1-d) ∑ l(p',p) r_p' / o_p', where the sum runs over all pages p'. In words, it is the likelihood that the monkey, if bored, goes to the page p, plus the likelihood that he, if not bored, gets there from another page p'. The likelihood of getting there from p' is the likelihood of being at p', times the number of links from p' to p, divided by the number of outgoing links on p'.
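
The formula can be turned into a small program by applying it repeatedly to a first guess. The sketch below does that in Python, assuming that every page has at least one outgoing link. The function name page_rank, its parameters and the three-page web X, Y, Z are hypothetical and only serve the illustration.

```python
def page_rank(links, d, pi, iterations=1000):
    """Iterate r_p = pi_p * d + (1 - d) * sum over p' of l(p', p) * r_p' / o_p'.

    links[q] lists the targets of the links on page q, one entry per link,
    so o_q is len(links[q]) and l(q, p) is links[q].count(p).
    d is the probability of getting bored, pi[p] the probability of
    jumping to p when bored. Assumes every page has an outgoing link.
    """
    pages = list(pi)
    rank = {p: 1.0 / len(pages) for p in pages}  # first guess: all equal
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            inflow = sum(
                links[q].count(p) * rank[q] / len(links[q])
                for q in pages
                if links.get(q)
            )
            new_rank[p] = pi[p] * d + (1 - d) * inflow
        rank = new_rank
    return rank


# hypothetical web: X links to Y and Z, Y links to Z, Z links to X
web = {"X": ["Y", "Z"], "Y": ["Z"], "Z": ["X"]}
jump = {"X": 1 / 3, "Y": 1 / 3, "Z": 1 / 3}
ranks = page_rank(web, d=0.2, pi=jump)
print(ranks)
```

Each round computes all new ranks from the old ones before replacing them, so the ranks keep summing to 1 throughout.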

13 example Let there be a web of four pages A, B, C and D. A links to B. B links to C. C links to A and D. D links to A. Let the probability of getting bored be ¼, and let there be a ¼ chance of moving to any given page when bored.
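
The monkey from the earlier slides can be let loose on this small web. The following sketch simulates him with Python's random module and reports how often each page was visited. The function name simulate_monkey, the number of steps and the fixed seed are choices made for this illustration only.

```python
import random

# the four-page web of the example: page -> pages it links to
LINKS = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": ["A"]}


def simulate_monkey(steps, bored=0.25, seed=0):
    """Return the share of visits each page receives from a random surfer."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    pages = sorted(LINKS)
    visits = dict.fromkeys(pages, 0)
    page = rng.choice(pages)
    for _ in range(steps):
        if rng.random() < bored:
            page = rng.choice(pages)        # bored: type in a random address
        else:
            page = rng.choice(LINKS[page])  # otherwise follow a random link
        visits[page] += 1
    return {p: visits[p] / steps for p in pages}


freq = simulate_monkey(200000)
print(freq)
```

The visit shares are only estimates, but with enough steps they should show that the monkey does not visit all pages equally often: D, which only receives half of the links of C, comes out clearly behind the others.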

14 page ranks The following system calculates the ranks:
r_A = ¼ · ¼ + ¾ (r_C / 2 + r_D)
r_B = ¼ · ¼ + ¾ r_A
r_C = ¼ · ¼ + ¾ r_B
r_D = ¼ · ¼ + ¾ r_C / 2
Since solving such a system is fairly complicated, Google uses an iterative approximation to calculate the rank. Note that the sum of all ranks is 1.
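
For a web this small, the system can still be solved exactly, which gives a useful check on any approximation. The sketch below moves all unknowns to the left-hand side and applies Gauss-Jordan elimination with Python's fractions module. The helper name solve is made up for the example, and this is not meant as the way ranks are computed for the real web.

```python
from fractions import Fraction as F


def solve(matrix, rhs):
    """Solve a square linear system by Gauss-Jordan elimination, exactly."""
    n = len(rhs)
    rows = [list(map(F, row)) + [F(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # find a row with a non-zero entry in this column and move it up
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [x / rows[col][col] for x in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [row[n] for row in rows]


# the four equations with all unknowns on the left, in the order A, B, C, D
matrix = [
    [1, 0, F(-3, 8), F(-3, 4)],   # r_A - 3/8 r_C - 3/4 r_D = 1/16
    [F(-3, 4), 1, 0, 0],          # r_B - 3/4 r_A           = 1/16
    [0, F(-3, 4), 1, 0],          # r_C - 3/4 r_B           = 1/16
    [0, 0, F(-3, 8), 1],          # r_D - 3/8 r_C           = 1/16
]
rhs = [F(1, 16)] * 4
ranks = solve(matrix, rhs)
print(ranks)        # r_A = 371/1292, r_B = 359/1292, r_C = 175/646, r_D = 53/323
print(sum(ranks))   # 1
```

The fraction 371/1292 is about 0.2872, so page A has the highest rank in this web.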

15 a sample program Imagine we start in our previous web with a situation where all pages have a rank of 0.25. Then we run updates of the rank according to our formula, say 1000 times. We find – r_A is 0.287151702786378 – r_B is 0.277863777089783 – r_C is 0.270897832817337 – r_D is 0.164086687306502. The sum is 1.

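16
use strict;
use warnings;

# start with every page at rank 0.25
my $r_A = .25;
my $r_B = .25;
my $r_C = .25;
my $r_D = .25;
my $count = 0;
# apply the page rank formula 1000 times; 1/16 is the chance of
# getting bored (1/4) times the chance of picking the page (1/4)
while ($count < 1000) {
    $r_A = 1/16 + 3/4 * ($r_C / 2 + $r_D);
    $r_B = 1/16 + 3/4 * $r_A;
    $r_C = 1/16 + 3/4 * $r_B;
    $r_D = 1/16 + 3/4 * $r_C / 2;
    $count++;
}
print "r_A is $r_A, r_B is $r_B, r_C is $r_C, r_D is $r_D\n";
my $sum = $r_A + $r_B + $r_C + $r_D;
print "sum is $sum\n";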

17 http://openlib.org/home/krichel Please shut down the computers when you are done. Thank you for your attention!

