The Structure of Broad Topics on the Web. Soumen Chakrabarti, Mukul M. Joshi, et al. Presentation by Na Dai.


1 The Structure of Broad Topics on the Web
Soumen Chakrabarti, Mukul M. Joshi, et al.
Presentation by Na Dai

2 Introduction & Contribution
Convergence of topic distribution on undirected random walks
Degree distribution restricted to topics
How topic-biased are breadth-first crawls?
Representation of topics in Web directories
Topic convergence on directed walks
Link-based vs. content-based Web communities

3 Building Blocks
Sampling Web pages
–PageRank-based random walk → Wander walk
–The Bar-Yossef random walk → Sampling walk (undirected graph, regular)
Taxonomy design & Document classification
–271,954 topics, 6 levels, 1,697,266 sample URLs
–Pruned: taxonomy 482 leaf nodes, 144,859 sample URLs
–Classification: Rainbow naïve Bayes classifier
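The "undirected graph, regular" note on the Sampling walk can be illustrated with a small sketch. It assumes the usual way of making an undirected graph regular: pad every node with self-loops up to a common degree, so that a simple random walk has a uniform stationary distribution. The toy graph, the function name `regular_walk_sample`, and the step counts are hypothetical and are not taken from the slides.

```python
import random

def regular_walk_sample(neighbors, start, steps, max_degree, rng):
    """Walk on an undirected graph padded with self-loops to degree max_degree.

    At each step the walker moves to a uniformly chosen neighbor with
    probability deg(node) / max_degree and otherwise stays put (self-loop).
    """
    node = start
    for _ in range(steps):
        nbrs = neighbors[node]
        if rng.random() < len(nbrs) / max_degree:
            node = rng.choice(nbrs)
    return node

# Hypothetical toy graph: a star with center 0 and leaves 1, 2, 3.
toy_graph = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
rng = random.Random(0)

counts = {node: 0 for node in toy_graph}
for _ in range(2000):
    counts[regular_walk_sample(toy_graph, 0, 50, 3, rng)] += 1
```

On this toy graph a plain random walk would oversample the high-degree center; with the self-loops, the four nodes should be visited about equally often.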

4 Convergence
Sampling method
–Sampling walk
Topic distribution of a set
–Soft counting
Difference measure
–L1 distance
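A minimal sketch of the two measures named on this slide, assuming that "soft counting" means averaging the classifier's per-page class-probability vectors rather than assigning each page to a single topic. The three-topic probability vectors are made-up illustration data, and the function names are hypothetical.

```python
def topic_distribution(class_probs):
    """Soft-count topic distribution: the mean of the per-page probability vectors."""
    n_pages = len(class_probs)
    totals = [0.0] * len(class_probs[0])
    for page_probs in class_probs:
        for topic, prob in enumerate(page_probs):
            totals[topic] += prob
    return [total / n_pages for total in totals]

def l1_distance(p, q):
    """L1 distance between two topic distributions over the same topics."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical classifier outputs for two small page samples.
sample_a = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]]
sample_b = [[0.5, 0.5, 0.0], [0.1, 0.1, 0.8]]

dist_a = topic_distribution(sample_a)
dist_b = topic_distribution(sample_b)
gap = l1_distance(dist_a, dist_b)
```

For two probability distributions the L1 distance lies between 0 (identical) and 2 (disjoint support), so a shrinking `gap` between samples taken at successive stages of a walk is one way to read "convergence" here.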

5 The background distribution vs. breadth-first crawls

6 Faithful representation of topics in Web directory

7 Topic-specific degree distributions
Power law distribution
–$\Pr(i) = k / i^{x}$, with $x > 1$
Contribution to class c
–Soft counting
–$\Delta_d \, p_c(d)$
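To make the power-law form concrete, the sketch below computes the constant k so that the probabilities sum to 1 over degrees 1 to a cutoff. The exponent 2.1 and the cutoff of 1000 are illustrative assumptions only, not values reported on the slides.

```python
def power_law_pmf(x, max_degree):
    """Return Pr(i) = k / i**x for i = 1..max_degree, with k chosen so the sum is 1."""
    weights = [i ** -x for i in range(1, max_degree + 1)]
    k = 1.0 / sum(weights)
    return [k * w for w in weights]

# Hypothetical exponent and cutoff, chosen only for illustration.
pmf = power_law_pmf(2.1, 1000)

# Doubling the degree divides the probability by 2**x, whatever the value of k.
ratio = pmf[0] / pmf[1]
```

On a log-log plot such a distribution is a straight line with slope -x, which is the usual visual check for power-law behavior.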

8 Topical locality and link-based prestige ranking
Sampling method
–Wander walk
Class selection
–Dmoz, well-populated
Collect all the pages at distance i (i>0)

9 Topical locality and link-based prestige ranking

10 Relations between topics
Topic citation matrix
Contribution to topic citation matrix C
–$C \leftarrow C + p(u)^{T} p(v)$
Implications and application
–Improved hypertext classification
–Enhanced focused crawling
–Reorganizing topic directories
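The update rule for the topic citation matrix can be sketched as follows, assuming p(u) and p(v) are row vectors of class probabilities for the source and target page of a link, so that p(u)^T p(v) is their outer product. The two-topic link data and the name `add_link` are made up for illustration.

```python
def add_link(citation, p_u, p_v):
    """Apply C <- C + p(u)^T p(v) in place for one link u -> v."""
    for i, prob_u in enumerate(p_u):
        for j, prob_v in enumerate(p_v):
            citation[i][j] += prob_u * prob_v

n_topics = 2
C = [[0.0] * n_topics for _ in range(n_topics)]

# Hypothetical links, each given as (p(u), p(v)).
links = [
    ([1.0, 0.0], [0.5, 0.5]),
    ([0.2, 0.8], [0.0, 1.0]),
]
for p_u, p_v in links:
    add_link(C, p_u, p_v)
```

Entry C[i][j] then holds the soft count of links from topic i to topic j. Because each probability vector sums to 1, every link adds a total mass of 1 to the matrix.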

11 Concluding remarks
Characterize some important notions of topical locality on the Web
Open problems
–PageRank jump parameter
–Topical stability of distillation algorithms
–Better crawling algorithms

12 Q & A?

