Presentation is loading. Please wait.

Presentation is loading. Please wait.

CompSci 100E 4.1 Google’s PageRank web site xxx web site yyyy web site a b c d e f g web site pdq pdq.. web site yyyy web site a b c d e f g web site xxx.

Similar presentations


Presentation on theme: "CompSci 100E 4.1 Google’s PageRank web site xxx web site yyyy web site a b c d e f g web site pdq pdq.. web site yyyy web site a b c d e f g web site xxx."— Presentation transcript:

1 CompSci 100E 4.1 Google’s PageRank web site xxx web site yyyy web site a b c d e f g web site pdq pdq.. web site yyyy web site a b c d e f g web site xxx Inlinks are “good” (recommendations) Inlinks from a “good” site are better than inlinks from a “bad” site but inlinks from sites with many outlinks are not as “good”... “Good” and “bad” are relative. web site xxx W. Cohen

2 CompSci 100E 4.2 Google’s PageRank web site xxx web site yyyy web site a b c d e f g web site pdq pdq.. web site yyyy web site a b c d e f g web site xxx Imagine a “pagehopper” that always either follows a random link, or jumps to random page W. Cohen

3 CompSci 100E 4.3 Google’s PageRank (Brin & Page, http://www-db.stanford.edu/~backrub/google.html) web site xxx web site yyyy web site a b c d e f g web site pdq pdq.. web site yyyy web site a b c d e f g web site xxx Imagine a “pagehopper” that always either follows a random link, or jumps to random page PageRank ranks pages by the amount of time the pagehopper spends on a page: or, if there were many pagehoppers, PageRank is the expected “crowd size” W. Cohen

4 CompSci 100E 4.4 PageRank Google's PageRank™ algorithm. [Sergey Brin and Larry Page, 1998]  Measure popularity of pages based on hyperlink structure of Web. Revolutionized access to world's information.

5 CompSci 100E 4.5 90-10 Rule Model. Web surfer chooses next page:  90% of the time surfer clicks random hyperlink.  10% of the time surfer types a random page. Caveat. Crude, but useful, web surfing model.  No one chooses links with equal probability.  No real potential to surf directly to each page on the web.  The 90-10 breakdown is just a guess.  It does not take the back button or bookmarks into account.  We can only afford to work with a small sample of the web.  …

6 CompSci 100E 4.6 Web Graph Input Format Input format.  N pages numbered 0 through N-1.  Represent each hyperlink with a pair of integers. Graph representation

7 CompSci 100E 4.7 Transition Matrix Transition matrix. p[i][j] = prob. that surfer moves from page i to j. surfer on page 1 goes to page 2 next 38% of the time © Sedgewick & Wayne

8 CompSci 100E 4.8 Web Graph to Transition Matrix % java Transition tiny.txt 5 0.02000 0.92000 0.02000 0.02000 0.02000 0.02000 0.02000 0.38000 0.38000 0.20000 0.02000 0.02000 0.02000 0.92000 0.02000 0.92000 0.02000 0.02000 0.02000 0.02000 0.47000 0.02000 0.47000 0.02000 0.02000 © Sedgewick & Wayne

9 CompSci 100E 4.9 Monte Carlo Simulation Monte Carlo simulation.  Surfer starts on page 0.  Repeatedly choose next page, according to transition matrix.  Calculate how often surfer visits each page. transition matrix page How? see next slide © Sedgewick & Wayne

10 CompSci 100E 4.10 Random Surfer Random move. Surfer is on page page. How to choose next page j ?  Row page of transition matrix gives probabilities.  Compute cumulative probabilities for row page.  Generate random number r between 0.0 and 1.0.  Choose page j corresponding to interval where r lies. page transition matrix

11 CompSci 100E 4.11 Random Surfer Random move. Surfer is on page page. How to choose next page j ?  Row page of transition matrix gives probabilities.  Compute cumulative probabilities for row page.  Generate random number r between 0.0 and 1.0.  Choose page j corresponding to interval where r lies. // make one random move double r = Math.random(); double sum = 0.0; for (int j = 0; j < N; j++) { // find interval containing r sum += p[page][j]; if (r < sum) { page = j; break; } } © Sedgewick & Wayne

12 CompSci 100E 4.12 Random Surfer: Monte Carlo Simulation public class RandomSurfer { public static void main(String[] args) { int T = Integer.parseInt(args[0]); // number of moves int N = in.nextInt(); // number of pages int page = 0; // current page double[][] p = new int[N][N]; // transition matrix // read in transition matrix... // simulate random surfer and count page frequencies int[] freq = new int[N]; for (int t = 0; t < T; t++) { // make one random move freq[page]++; } // print page ranks for (int i = 0; i < N; i++) { System.out.println(String.format("%8.5f", (double) freq[i] / T); } System.out.println(); } see previous slide page rank

13 CompSci 100E 4.13 Mathematical Context Convergence. For the random surfer model, the fraction of time the surfer spends on each page converges to a unique distribution, independent of the starting page. "page rank" "stationary distribution" of Markov chain "principal eigenvector" of transition matrix © Sedgewick & Wayne

14 CompSci 100E 4.14 The Power Method Q. If the surfer starts on page 0, what is the probability that surfer ends up on page i after one step? A. First row of transition matrix. © Sedgewick & Wayne

15 CompSci 100E 4.15 The Power Method Q. If the surfer starts on page 0, what is the probability that surfer ends up on page i after two steps? A. Matrix-vector multiplication.

16 CompSci 100E 4.16 The Power Method Power method. Repeat until page ranks converge. © Sedgewick & Wayne

17 CompSci 100E 4.17 Mathematical Context Convergence. For the random surfer model, the power method iterates converge to a unique distribution, independent of the starting page. "page rank" "stationary distribution" of Markov chain "principal eigenvector" of transition matrix © Sedgewick & Wayne

18 CompSci 100E 4.18

19 CompSci 100E 4.19 Random Surfer: Scientific Challenges Google's PageRank™ algorithm. [Sergey Brin and Larry Page, 1998]  Rank importance of pages based on hyperlink structure of web, using 90-10 rule.  Revolutionized access to world's information. Scientific challenges. Cope with 4 billion-by-4 billion matrix!  Need data structures to enable computation.  Need linear algebra to fully understand computation. © Sedgewick & Wayne


Download ppt "CompSci 100E 4.1 Google’s PageRank web site xxx web site yyyy web site a b c d e f g web site pdq pdq.. web site yyyy web site a b c d e f g web site xxx."

Similar presentations


Ads by Google