1 COMP4332 Web Data Thanks for Raymond Wong’s slides.

1 COMP4332 Web Data. Thanks for Raymond Wong’s slides.

2 Web Databases. Raymond Wong

3 COMP5331 How to rank the webpages?

4 Ranking Methods: HITS Algorithm, PageRank Algorithm

5 HITS Algorithm. HITS is a ranking algorithm which ranks “hubs” and “authorities”.

6 HITS Algorithm. Each page v has two weights: 1. Authority weight a(v) 2. Hub weight h(v)

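7 HITS Algorithm. Each vertex v has two weights, an authority weight a(v) and a hub weight h(v):
$$a(v) = \sum_{u \to v} h(u) \qquad \text{(sum over all pages } u \text{ that link to } v\text{)}$$
$$h(v) = \sum_{v \to u} a(u) \qquad \text{(sum over all pages } u \text{ that } v \text{ links to)}$$
A good authority has many incoming edges from good hubs. A good hub has many outgoing edges to good authorities.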
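The two update rules can be sketched directly on an edge list. This is a minimal illustration, not the slides' own code: the function names `update_authority` and `update_hub` are made up here, and the edge list is an assumed example (it matches the three-page N/MS/A graph used in the iteration-step slides).

```python
# Edges (u, v) mean "page u links to page v". The edge list is an assumed example.
edges = [("N", "N"), ("N", "MS"), ("N", "A"), ("MS", "A"), ("A", "N"), ("A", "MS")]
pages = ["N", "MS", "A"]

def update_authority(hub, edges):
    """a(v) = sum of h(u) over all edges u -> v."""
    auth = {p: 0.0 for p in hub}
    for u, v in edges:
        auth[v] += hub[u]
    return auth

def update_hub(auth, edges):
    """h(v) = sum of a(u) over all edges v -> u."""
    hub = {p: 0.0 for p in auth}
    for v, u in edges:
        hub[v] += auth[u]
    return hub

hub = {p: 1.0 for p in pages}          # start with every hub weight equal to 1
auth = update_authority(hub, edges)    # one authority update
hub = update_hub(auth, edges)          # one hub update using the new authorities
print(auth)
print(hub)
```

Starting from all-ones hub weights, one authority update followed by one hub update gives the hub vector (6, 2, 4) for (N, MS, A), before any normalization.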

8 HITS Algorithm. HITS involves two major steps. Step 1: Sampling Step. Step 2: Iteration Step.

9 Step 1: Sampling Step. Given a user query with several terms, collect a set of pages that are highly relevant to it, called the base set. How do we find the base set? First, retrieve all webpages that contain the query terms; this set is called the root set. Next, find the link pages: pages that have a hyperlink to some page in the root set, or pages that some page in the root set links to. The root set together with the link pages forms the base set.
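The sampling step can be sketched as a set expansion. This is only an illustration under assumptions: `build_base_set` is a hypothetical helper, the root set is taken as given (no real text search is performed), and `out_links` is a made-up toy link map.

```python
def build_base_set(root_set, out_links):
    """Expand a root set into a base set.

    out_links maps each page to the set of pages it links to (assumed input).
    """
    base = set(root_set)
    for page, targets in out_links.items():
        if page in root_set:
            base.update(targets)   # pages that a root page links to
        if targets & root_set:
            base.add(page)         # pages that link to some root page
    return base

# Hypothetical toy web: r1 and r2 are the pages containing the query terms.
out_links = {
    "r1": {"x"},
    "r2": set(),
    "y": {"r2"},
    "z": {"x"},   # z touches no root page, so it stays outside the base set
}
root = {"r1", "r2"}
base = build_base_set(root, out_links)
print(sorted(base))
```

Page `z` is excluded because it neither links to a root page nor is linked from one; only one hop around the root set is collected.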

10 HITS Algorithm. HITS involves two major steps. Step 1: Sampling Step. Step 2: Iteration Step.

11 Step 2: Iteration Step. Goal: to find the pages in the base set that are good hubs and good authorities.

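12 Step 2: Iteration Step. Example graph with three pages. N: Netscape, MS: Microsoft, A: Amazon.com.
h(N) = a(N) + a(MS) + a(A)
h(MS) = a(A)
h(A) = a(N) + a(MS)
With rows and columns ordered N, MS, A, this is h = M a, where M is the adjacency matrix:
$$\begin{pmatrix} h(N) \\ h(MS) \\ h(A) \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix} \begin{pmatrix} a(N) \\ a(MS) \\ a(A) \end{pmatrix}, \qquad M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$

13 Step 2: Iteration Step. N: Netscape, MS: Microsoft, A: Amazon.com.
a(N) = h(N) + h(A)
a(MS) = h(N) + h(A)
a(A) = h(N) + h(MS)
With the same ordering, this is a = M^T h:
$$\begin{pmatrix} a(N) \\ a(MS) \\ a(A) \end{pmatrix} = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix} \begin{pmatrix} h(N) \\ h(MS) \\ h(A) \end{pmatrix}$$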

14 Step 2: Iteration Step. We have h = M a and a = M^T h. Substituting one into the other, we derive h = M M^T h and a = M^T M a.

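15 Step 2: Iteration Step. With rows and columns ordered N, MS, A:
$$M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}, \qquad M^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$
$$M M^T = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix}, \qquad M^T M = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}$$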
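The two products can be checked with a few lines of plain Python. This is a small sketch; the helper names `transpose` and `matmul` are introduced here for illustration, and M is read off the hub equations for N, MS, A.

```python
# Rows and columns are ordered N, MS, A; M[i][j] = 1 if page i links to page j.
M = [
    [1, 1, 1],  # N  links to N, MS, A
    [0, 0, 1],  # MS links to A
    [1, 1, 0],  # A  links to N, MS
]

def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Return the matrix product a @ b for list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

MT = transpose(M)
MMT = matmul(M, MT)   # used in the hub update h = M M^T h
MTM = matmul(MT, M)   # used in the authority update a = M^T M a
print(MMT)
print(MTM)
```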

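16 Step 2: Iteration Step. Hub update h = M M^T h, with rows and columns ordered N, MS, A:
$$\begin{pmatrix} h(N) \\ h(MS) \\ h(A) \end{pmatrix} = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix} \begin{pmatrix} h(N) \\ h(MS) \\ h(A) \end{pmatrix}$$
Each non-normalized column is M M^T times the previous normalized column. The normalized vector is scaled so that the sum of all elements in the vector = 3.

Hub (non-normalized)
Iteration No.   1   2   3   4       5       6       7
N               1   6   7   7.071   7.091   7.096   7.098
MS              1   2   2   1.929   1.909   1.904   1.902
A               1   4   5   5.143   5.182   5.192   5.195

Hub (normalized)
Iteration No.   1   2     3       4       5       6       7
N               1   1.5   1.5     1.5     1.5     1.5     1.5
MS              1   0.5   0.429   0.409   0.404   0.402   0.402
A               1   1     1.071   1.091   1.096   1.098   1.098

Result for (N, MS, A): Hub = (1.5, 0.402, 1.098)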

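17 Step 2: Iteration Step. Authority update a = M^T M a, with rows and columns ordered N, MS, A:
$$\begin{pmatrix} a(N) \\ a(MS) \\ a(A) \end{pmatrix} = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix} \begin{pmatrix} a(N) \\ a(MS) \\ a(A) \end{pmatrix}$$
N and MS always receive the same value because a(N) and a(MS) are both h(N) + h(A). The normalized vector is scaled so that the sum of all elements in the vector = 3.

Authority (non-normalized)
Iteration No.   1   2   3       4       5       6       7
N               1   5   5.143   5.182   5.192   5.195   5.196
MS              1   5   5.143   5.182   5.192   5.195   5.196
A               1   4   3.857   3.818   3.808   3.805   3.804

Authority (normalized)
Iteration No.   1   2       3       4       5       6       7
N               1   1.071   1.091   1.096   1.098   1.098   1.098
MS              1   1.071   1.091   1.096   1.098   1.098   1.098
A               1   0.857   0.818   0.808   0.805   0.804   0.804

Result for (N, MS, A): Hub = (1.5, 0.402, 1.098), Authority = (1.098, 1.098, 0.804)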
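The hub and authority iterations can be reproduced with a short loop. This is a minimal sketch, assuming the adjacency matrix read off the equations for N, MS, A and the slides' convention of rescaling each vector to sum to 3; the names `normalize` and `hits`, and the choice of 50 iterations, are assumptions made here.

```python
M = [
    [1, 1, 1],  # N  -> N, MS, A
    [0, 0, 1],  # MS -> A
    [1, 1, 0],  # A  -> N, MS
]

def normalize(vec):
    """Rescale so the entries sum to the number of pages (3 in this example)."""
    total = sum(vec)
    return [len(vec) * x / total for x in vec]

def hits(M, iterations=50):
    """Iterate hub <- M M^T hub and auth <- M^T M auth, normalizing each round."""
    n = len(M)
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        # hub <- M (M^T hub)
        mt_h = [sum(M[i][j] * hub[i] for i in range(n)) for j in range(n)]
        hub = normalize([sum(M[i][j] * mt_h[j] for j in range(n)) for i in range(n)])
        # auth <- M^T (M auth)
        m_a = [sum(M[i][j] * auth[j] for j in range(n)) for i in range(n)]
        auth = normalize([sum(M[i][j] * m_a[i] for i in range(n)) for j in range(n)])
    return hub, auth

hub, auth = hits(M)
print([round(x, 3) for x in hub])
print([round(x, 3) for x in auth])
```

After enough iterations the rounded vectors settle at the values the slides report after seven iterations; a fixed iteration count is used here instead of a convergence test to keep the sketch short.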

18 How to Rank. There are many ways: 1. Rank in descending order of hub value only. 2. Rank in descending order of authority value only. 3. Rank in descending order of a value computed from both hub and authority (e.g., the sum of the hub value and the authority value). For (N, MS, A): Hub = (1.5, 0.402, 1.098), Authority = (1.098, 1.098, 0.804).
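The three ranking options can be compared on the example scores. This is a small sketch: the helper `rank` is introduced here, and the authority value for MS is taken to equal that of N, which follows from a(N) and a(MS) both being h(N) + h(A).

```python
hub = {"N": 1.5, "MS": 0.402, "A": 1.098}
authority = {"N": 1.098, "MS": 1.098, "A": 0.804}

def rank(scores):
    """Return page names in descending order of score (ties keep input order)."""
    return sorted(scores, key=scores.get, reverse=True)

# Option 3 from the slide: combine both scores, here by simple addition.
combined = {p: hub[p] + authority[p] for p in hub}

by_hub = rank(hub)
by_authority = rank(authority)
by_sum = rank(combined)
print(by_hub, by_authority, by_sum)
```

Ranking by hub value and ranking by the sum both put N first and MS last, while ranking by authority ties N with MS and puts A last, which shows that the choice of score can change the order.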

19 Ranking Methods: HITS Algorithm, PageRank Algorithm

20 PageRank Algorithm (Google). Disadvantage of HITS: because there are two concepts, namely hubs and authorities, we do not know which concept is more important for ranking. Advantage of PageRank: PageRank involves only one concept for ranking.

21 PageRank Algorithm (Google). The PageRank algorithm uses a stochastic approach to rank the pages.

22 Link Structure of the Web. 150 million web pages, 1.7 billion links. Backlinks and forward links: A and B are C’s backlinks; C is a forward link of both A and B. Intuitively, a webpage is important if it has a lot of backlinks. What if a webpage has only one backlink, but that link comes from www.yahoo.com?

23 A Simple Version of PageRank.
$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
u: a web page. B_u: the set of u’s backlinks. N_v: the number of forward links of page v. c: the normalization factor that makes ||R||_{L1} = 1, where ||R||_{L1} = |R_1 + … + R_n|.
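The simple version can be sketched as a repeated redistribution of rank along forward links. This is an illustration under assumptions: the function name `simplified_pagerank`, the three-page graph, and the fixed count of 100 iterations are all made up here, and every page is assumed to have at least one forward link.

```python
def simplified_pagerank(out_links, iterations=100):
    """Iterate R(u) = c * sum of R(v) / N_v over the backlinks v of u."""
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}   # uniform initial assignment
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for v, targets in out_links.items():
            for u in targets:
                new[u] += rank[v] / len(targets)  # v gives R(v) / N_v to each forward link
        total = sum(new.values())
        rank = {p: r / total for p, r in new.items()}  # c rescales so the L1 norm is 1
    return rank

# Hypothetical graph: A -> B, A -> C, B -> C, C -> A.
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = simplified_pagerank(out_links)
print({p: round(r, 3) for p, r in rank.items()})
```

On this graph the ranks settle at 0.4 for A, 0.2 for B, and 0.4 for C: C passes all of its rank to A, A splits its rank between B and C, and B passes its share on to C.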

24 An example of Simplified PageRank. PageRank calculation: first iteration.

25 An example of Simplified PageRank. PageRank calculation: second iteration.

26 An example of Simplified PageRank. Convergence after some iterations.

27 A Problem with Simplified PageRank. A loop: during each iteration, the loop accumulates rank but never distributes rank to other pages!

28 An example of the Problem
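The figure for this slide is not in the transcript, so the following is only a sketch of the loop problem on a hypothetical four-page graph: A and B link to each other, B also links to C, and C and D form a loop with no links back out. The name `iterate_rank` is an assumption made here.

```python
def iterate_rank(out_links, iterations=100):
    """Simplified PageRank without rescaling; total rank stays 1 on this graph."""
    rank = {p: 1.0 / len(out_links) for p in out_links}
    for _ in range(iterations):
        new = {p: 0.0 for p in out_links}
        for v, targets in out_links.items():
            for u in targets:
                new[u] += rank[v] / len(targets)
        rank = new
    return rank

# Hypothetical graph: the loop C <-> D receives rank from B but never gives any back.
out_links = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["C"]}
rank = iterate_rank(out_links)
loop_share = rank["C"] + rank["D"]
outside_share = rank["A"] + rank["B"]
print(round(loop_share, 6), round(outside_share, 6))
```

Every round, half of B's rank leaks into the loop and none returns, so the loop ends up holding essentially all of the rank while A and B drop toward zero.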

31 Random Walks in Graphs. The Random Surfer Model (the simplified model): the rank corresponds to the standing probability distribution of a random walk on the graph of the web, in which the surfer simply keeps clicking successive links at random. The Modified Model: the “random surfer” simply keeps clicking successive links at random, but periodically “gets bored” and jumps to a random page chosen according to the distribution E.

32 Modified Version of PageRank.
$$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)$$
E(u): a distribution over web pages that “users” jump to when they “get bored” of following successive links at random.
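The effect of E can be sketched by adding a jump term to each page before rescaling. This is a hedged illustration, not the original implementation: the function name `modified_pagerank`, the hypothetical four-page graph with a loop, the uniform E with total mass 0.15, and the fixed 200 iterations are all assumptions.

```python
def modified_pagerank(out_links, E, iterations=200):
    """Iterate R'(u) = c * (sum of R'(v) / N_v over backlinks v, plus E(u))."""
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: E[p] for p in pages}                 # the "bored surfer" term E(u)
        for v, targets in out_links.items():
            for u in targets:
                new[u] += rank[v] / len(targets)       # the link-following term
        total = sum(new.values())
        rank = {p: r / total for p, r in new.items()}  # c keeps the L1 norm at 1
    return rank

# Hypothetical graph with a loop C <-> D that has no links back to A or B.
out_links = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["C"]}
E = {p: 0.15 / len(out_links) for p in out_links}      # assumed: uniform, total mass 0.15
rank = modified_pagerank(out_links, E)
print({p: round(r, 3) for p, r in rank.items()})
```

With the jump term in place, A and B keep a clearly positive share of rank even though the loop still holds most of it.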

33 An example of Modified PageRank

34 Dangling Links. Dangling links are links that point to a page with no outgoing links. Most such pages have simply not been downloaded yet. They affect the model because it is not clear where their weight should be distributed. They do not directly affect the ranking of any other page, so they can simply be removed before the PageRank calculation and added back afterwards.
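Removing dangling links before the calculation can be sketched as pruning pages without forward links. This is an illustration under assumptions: `remove_dangling` is a hypothetical helper, and repeating the pruning until no dangling page remains is a choice made here (removing one page can leave another without forward links), not something the slide specifies.

```python
def remove_dangling(out_links):
    """Drop pages with no forward links, and the links that point to them."""
    graph = {p: set(t) for p, t in out_links.items()}
    removed = []
    while True:
        # Pages with no forward links, plus link targets missing from the graph.
        dangling = {p for p, t in graph.items() if not t}
        dangling |= {u for t in graph.values() for u in t if u not in graph}
        if not dangling:
            return graph, removed
        removed.extend(sorted(dangling))
        graph = {p: t - dangling for p, t in graph.items() if p not in dangling}

# Hypothetical graph: D has no forward links; once D is gone, C has none either.
out_links = {"A": {"B", "C"}, "B": {"A"}, "C": {"D"}, "D": set()}
graph, removed = remove_dangling(out_links)
print(graph, removed)
```

The `removed` list records what was pruned, in order, so those pages can be added back after the ranks of the remaining pages have been computed.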

35 PageRank Implementation. 1. Convert each URL into a unique integer and store each hyperlink in a database, using the integer IDs to identify pages. 2. Sort the link structure by ID. 3. Remove all the dangling links from the database. 4. Make an initial assignment of ranks and start the iteration; choosing a good initial assignment can speed up convergence of the PageRank computation. 5. Add the dangling links back.
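The first two implementation steps can be sketched with an in-memory dictionary in place of a database. This is only an illustration: `assign_ids` is a hypothetical helper, the URLs are made-up placeholders, and IDs are assigned in order of first appearance, which is an assumption.

```python
def assign_ids(links):
    """Map each URL to a unique integer and return the links as sorted ID pairs."""
    ids = {}
    edges = []
    for src, dst in links:
        for url in (src, dst):
            if url not in ids:
                ids[url] = len(ids)      # next unused integer ID
        edges.append((ids[src], ids[dst]))
    edges.sort()                         # sort the link structure by ID
    return ids, edges

# Hypothetical hyperlinks as (source URL, destination URL) pairs.
links = [
    ("http://b.example/", "http://a.example/"),
    ("http://a.example/", "http://c.example/"),
    ("http://a.example/", "http://b.example/"),
]
ids, edges = assign_ids(links)
print(ids)
print(edges)
```

Working with small integers instead of URL strings keeps the link structure compact, and sorting by source ID groups each page's forward links together for the iteration.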

