Published by Raven Wedgeworth. Modified over 2 years ago.

1
Mining Web’s Link Structure
Sushanth Rai
University of Texas at Arlington
http://cseweb.uta.edu/~rai


3
Structure of WWW
Highly decentralized
Unstructured
Hyperlink based
Disorganized presentation

4
Searching the WWW
Searching: the process of discovering high-quality, relevant pages in response to a specific need for certain information

5
Challenges in Search Engines
Index-based search engines may return a single result or a million results
Ranking heuristics rely on the frequency of occurrence of words
Spamming can mislead index-based search engines
Human language exhibits synonymy and polysemy
Web pages are not self-descriptive

6
Searching with Hyperlinks
Features
–Hyperlinks represent latent human judgment
–Hyperlinks provide an opportunity to find potential authorities
Pitfalls
–Links are also created for purposes other than conferring authority, such as navigation
–Popularity must be balanced against relevance

7
Focused Subgraph of WWW
Authority: a page that is pointed to by many good hubs
Hub: a page that points to many good authorities
Authorities and hubs are extracted from a focused subgraph, a set of pages that
–Is relatively small
–Is rich in content related to the query
–Contains the strongest authorities

8
[Figure: the root set and the base set]

9
Construction of Subgraph
Subgraph($\sigma$, $\mathcal{E}$, $t$, $d$)
  $\sigma$: a query string
  $\mathcal{E}$: a text-based search engine
  $t$, $d$: natural numbers
  Let $R_\sigma$ denote the top $t$ results of $\mathcal{E}$ on $\sigma$
  Set $S_\sigma := R_\sigma$
  For each page $p \in R_\sigma$
    Let $\Gamma^+(p)$ denote the set of all pages $p$ points to
    Let $\Gamma^-(p)$ denote the set of all pages pointing to $p$
    Add all pages in $\Gamma^+(p)$ to $S_\sigma$
    If $|\Gamma^-(p)| \le d$ then
      Add all pages in $\Gamma^-(p)$ to $S_\sigma$
    Else
      Add an arbitrary set of $d$ pages from $\Gamma^-(p)$ to $S_\sigma$
  End
  Return $S_\sigma$
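The expansion step above can be sketched in Python. This is a minimal sketch, not the presenter's implementation: the function name `build_base_set` and the two dictionaries `out_links` and `in_links` (standing in for the sets of pages a page points to and is pointed to by) are assumptions, and the root set is passed in directly instead of being fetched from a search engine.

```python
import random

def build_base_set(root, out_links, in_links, d, rng=None):
    """Expand a root set into a base set, following the slide's procedure.

    root      -- iterable of page ids (assumed to be the top-t search results)
    out_links -- dict: page -> set of pages it points to (hypothetical input)
    in_links  -- dict: page -> set of pages pointing to it (hypothetical input)
    d         -- cap on how many in-linking pages are added per root page
    """
    rng = rng or random.Random(0)          # seeded so the sketch is repeatable
    base = set(root)                       # Set S = R
    for p in root:
        base |= out_links.get(p, set())    # add every page p points to
        preds = in_links.get(p, set())     # pages pointing to p
        if len(preds) <= d:
            base |= preds                  # few enough: add them all
        else:
            # "arbitrary" subset of size d; here a seeded random sample
            base |= set(rng.sample(sorted(preds), d))
    return base

# Toy data, purely illustrative.
example_base = build_base_set(
    root=["a", "b"],
    out_links={"a": {"c"}, "b": {"d"}},
    in_links={"a": {"h1", "h2", "h3"}, "b": {"h4"}},
    d=2,
)
```

The cap `d` keeps a single very popular root page from flooding the base set with its in-links.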

10
Pruning the Subgraph
In the graph G[S] induced by the set S
–Classify each link as transverse (between pages with different domain names) or intrinsic (between pages with the same domain name)
–Delete all intrinsic links and retain only the transverse links
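A rough illustration of the pruning step, assuming a link is intrinsic when both endpoints share a domain name and transverse otherwise. Treating the URL host name as the "domain name" is a simplifying assumption of this sketch, and `prune_intrinsic` is a hypothetical helper name.

```python
from urllib.parse import urlparse

def prune_intrinsic(edges):
    """Return only the transverse links from a list of (source, target) URLs.

    A link is treated as intrinsic when both URLs have the same host name;
    this host-name comparison is an assumption of the sketch.
    """
    return [
        (src, dst)
        for src, dst in edges
        if urlparse(src).hostname != urlparse(dst).hostname
    ]

links = [
    ("http://java.sun.com/index.html", "http://java.sun.com/docs"),  # intrinsic
    ("http://www.gamelan.com", "http://java.sun.com"),               # transverse
]
transverse = prune_intrinsic(links)
```

Dropping intrinsic links removes most purely navigational links inside a single site, which carry little information about authority.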

11
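Computing Hubs and Authorities
Associate a non-negative authority weight $x^{\langle p \rangle}$ and a non-negative hub weight $y^{\langle p \rangle}$ with each page $p$
Weights of each type are normalized so that their squares sum to 1
Use the I and O operations iteratively to update the weights
–I: $x^{\langle p \rangle} \leftarrow \sum_{q:(q,p) \in E} y^{\langle q \rangle}$
–O: $y^{\langle p \rangle} \leftarrow \sum_{q:(p,q) \in E} x^{\langle q \rangle}$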
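A small worked instance may make the two operations concrete. Assume a hypothetical four-page graph (the names $h_1, h_2, a_1, a_2$ are illustrative, not from the slides) with links $h_1 \to a_1$, $h_1 \to a_2$, $h_2 \to a_1$, and every weight starting at 1. The I operation sets each authority weight to the sum of the hub weights of the pages linking in; the O operation then sets each hub weight to the sum of the fresh authority weights of the pages linked to. Vectors are written in the order $(a_1, a_2, h_1, h_2)$:

```latex
\begin{align*}
\text{I:}\quad
  & x^{\langle a_1 \rangle} = y^{\langle h_1 \rangle} + y^{\langle h_2 \rangle} = 2, \quad
    x^{\langle a_2 \rangle} = y^{\langle h_1 \rangle} = 1, \quad
    x^{\langle h_1 \rangle} = x^{\langle h_2 \rangle} = 0 \\
\text{O:}\quad
  & y^{\langle h_1 \rangle} = x^{\langle a_1 \rangle} + x^{\langle a_2 \rangle} = 3, \quad
    y^{\langle h_2 \rangle} = x^{\langle a_1 \rangle} = 2, \quad
    y^{\langle a_1 \rangle} = y^{\langle a_2 \rangle} = 0 \\
\text{Normalize:}\quad
  & x = \tfrac{1}{\sqrt{5}}\,(2,\ 1,\ 0,\ 0), \qquad
    y = \tfrac{1}{\sqrt{13}}\,(0,\ 0,\ 3,\ 2)
\end{align*}
```

After one round, $a_1$ already outranks $a_2$ as an authority and $h_1$ outranks $h_2$ as a hub.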

12
[Figure: hubs, authorities, and an unrelated page of large in-degree]

13
Iterative Algorithm
Iterate($G$, $k$)
  $G$: a collection of $n$ linked pages
  $k$: a natural number
  Let $z$ denote the vector $(1, 1, 1, \ldots, 1) \in \mathbb{R}^n$
  Set $x_0 := z$
  Set $y_0 := z$
  For $j = 1, 2, \ldots, k$
    Apply the I operation to $(x_{j-1}, y_{j-1})$, obtaining new x-weights $x'_j$
    Apply the O operation to $(x'_j, y_{j-1})$, obtaining new y-weights $y'_j$
    Normalize $x'_j$, obtaining $x_j$
    Normalize $y'_j$, obtaining $y_j$
  End
  Return $(x_k, y_k)$
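The loop above might look as follows in Python. This is a minimal sketch under stated assumptions: the graph is given as a list of page ids plus a list of `(source, target)` links, weights are stored in dictionaries, and the names `iterate` and `_normalize` are illustrative.

```python
import math

def _normalize(weights):
    """Scale weights so that their squares sum to 1 (no-op for an all-zero vector)."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0:
        return dict(weights)
    return {p: v / norm for p, v in weights.items()}

def iterate(pages, edges, k):
    """Run k rounds of the I and O operations, normalizing after each round."""
    x = {p: 1.0 for p in pages}   # authority weights, x_0 = (1, ..., 1)
    y = {p: 1.0 for p in pages}   # hub weights,       y_0 = (1, ..., 1)
    for _ in range(k):
        # I operation: authority weight = sum of hub weights of in-linking pages
        new_x = {p: 0.0 for p in pages}
        for q, p in edges:
            new_x[p] += y[q]
        # O operation, using the fresh x-weights:
        # hub weight = sum of authority weights of the pages linked to
        new_y = {p: 0.0 for p in pages}
        for p, q in edges:
            new_y[p] += new_x[q]
        x, y = _normalize(new_x), _normalize(new_y)
    return x, y

# Toy graph, purely illustrative: h1 -> a1, h1 -> a2, h2 -> a1.
toy_pages = ["h1", "h2", "a1", "a2"]
toy_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
auth, hub = iterate(toy_pages, toy_edges, 20)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

On this toy graph the page with the most in-links from hubs ends up with the largest authority weight, and the page linking to both authorities ends up with the largest hub weight.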

14
Results
(java) Authorities
.328 http://www.gamelan.com
.251 http://java.sun.com
.190 http://www.digitalfocus.com/digitalfocus/faq/howdoi.html
.183 http://sunsite.unc.edu/javafaq/javafaq.html
(Gates) Authorities
.643 http://www.roadahead.com
.458 http://www.microsoft.com
.440 http://www.microsoft.com/corpinfo/bill-g.htm

15
Results (Contd.)
Comparison of Altavista, Yahoo, and Clever on 26 broad search topics, with results rated as “bad”, “fair”, “good”, or “fantastic”
For 31% of the topics, Yahoo and Clever received equivalent evaluations
For 50%, Clever received the higher evaluation
For 19%, Yahoo received the higher evaluation
Altavista did not receive the highest evaluation on any of the 26 topics

16
Applications
Constructing taxonomies semiautomatically
Trawling the web for emerging cybercommunities
Mining structured information that succumbs to database techniques

17
Web Resources
Clever - http://www.almaden.ibm.com/cs/k53/clever.html
Google - http://www.google.com
WebL - http://www.research.compaq.com/SRC/WebL

18
Questions?
