Published by Clifford Booker. Modified over 8 years ago.
1
The Structure of Broad Topics on the Web
Soumen Chakrabarti, Mukul M. Joshi, et al.
Presentation by Na Dai
2
Introduction & Contribution
Convergence of topic distribution on undirected random walks
Degree distribution restricted to topics
How topic-biased are breadth-first crawls?
Representation of topics in Web directories
Topic convergence on directed walks
Link-based vs. content-based Web communities
3
Building Blocks
Sampling Web pages
–PageRank-based random walk (the Wander walk)
–The Bar-Yossef random walk (the Sampling walk), run on an undirected, regular graph
Taxonomy design & Document classification
–271,954 topics, 6 levels, 1,697,266 sample URLs
–Pruned taxonomy: 482 leaf nodes, 144,859 sample URLs
–Classification: Rainbow naïve Bayes classifier
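A minimal sketch of the idea behind a walk on an undirected, regular graph, not the authors' implementation. It assumes the usual construction: links are made bidirectional, and every node is padded with self-loops up to a common degree, so that the walk's stationary distribution is uniform over nodes. The names `undirected`, `sampling_walk`, and `max_degree` are hypothetical.

```python
import random


def undirected(links):
    """Symmetrize a list of directed (u, v) links into adjacency lists."""
    adj = {}
    for u, v in links:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return {node: sorted(nbrs) for node, nbrs in adj.items()}


def sampling_walk(neighbors, start, steps, max_degree, rng=None):
    """Walk on the graph as if each node had self-loops padding it to max_degree.

    max_degree must be at least the largest real degree in the graph.
    Returns the node reached after `steps` steps.
    """
    rng = rng or random.Random()
    node = start
    for _ in range(steps):
        nbrs = neighbors[node]
        # Follow a real edge with probability deg/max_degree;
        # otherwise take one of the implicit self-loops and stay put.
        if rng.random() < len(nbrs) / max_degree:
            node = rng.choice(nbrs)
    return node
```

On a small connected graph, repeated long walks should land on each node about equally often, regardless of how many links the node has.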
4
Convergence
Sampling method
–Sampling walk
Topic distribution of a set
–Soft counting
Difference measure
–L1 distance
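The two measurements on this slide can be sketched as follows. This is an illustration under assumptions: each page is represented by a dictionary of class probabilities from the classifier, soft counting is taken to mean averaging those vectors over the set, and the function names are hypothetical.

```python
def soft_topic_distribution(class_probs):
    """Average per-page class-probability dicts (soft counting)."""
    if not class_probs:
        raise ValueError("need at least one page")
    totals = {}
    for probs in class_probs:
        for topic, p in probs.items():
            # Each page adds its probability mass to every topic,
            # instead of a single vote for its most likely topic.
            totals[topic] = totals.get(topic, 0.0) + p
    n = len(class_probs)
    return {topic: total / n for topic, total in totals.items()}


def l1_distance(p, q):
    """Sum of absolute differences between two topic distributions."""
    topics = set(p) | set(q)
    return sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in topics)
```

Convergence of a walk could then be tracked by computing `l1_distance` between the distribution of the pages sampled so far and a reference distribution, and watching it shrink as the sample grows.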
5
The background distribution vs. breadth-first crawls
6
Faithful representation of topics in Web directory
7
Topic-specific degree distributions
Power law distribution
–$\Pr(i) = k / i^{x}$, $x > 1$
Contribution to Class c
–Soft-counting
–$\Delta_d \; p_c(d)$
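A rough sketch of a soft-counted, per-topic degree histogram, assuming each page d adds its class probability p_c(d) to the count at its own degree for class c. The log-log least-squares slope is a simple stand-in for estimating the power-law exponent x and is not taken from the slides; all function names are hypothetical.

```python
import math
from collections import defaultdict


def topic_degree_histogram(pages):
    """pages: iterable of (degree, {topic: probability}) pairs.

    Returns {topic: {degree: soft count}}.
    """
    hist = defaultdict(lambda: defaultdict(float))
    for degree, probs in pages:
        for topic, p in probs.items():
            hist[topic][degree] += p
    return {topic: dict(by_degree) for topic, by_degree in hist.items()}


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(degree).

    For counts proportional to 1 / degree**x the slope is -x.
    """
    pts = [(math.log(i), math.log(c)) for i, c in counts.items() if i > 0 and c > 0]
    if len(pts) < 2:
        raise ValueError("need at least two positive (degree, count) points")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx
```

A straight-line fit on log-log axes is only a quick check; it is sensitive to noise in the sparse high-degree tail.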
8
Topical locality and link-based prestige ranking
Sampling method
–Wander walk
Class selection
–Dmoz, well-populated
Collect all the pages at distance i (i>0)
9
Topical locality and link-based prestige ranking
10
Relations between topics
Topic citation matrix
Contribution to topic citation matrix C
–$C \leftarrow C + p(u)^{T} p(v)$
Implications and application
–Improved hypertext classification
–Enhanced focused crawling
–Reorganizing topic directories
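The citation-matrix update above can be sketched as follows, assuming it means that every link (u, v) adds the outer product of the class-probability vectors of u and v to C. The function name and the data layout (dictionaries of class probabilities keyed by page) are hypothetical.

```python
def topic_citation_matrix(links, class_probs, topics):
    """Accumulate soft topic-to-topic citation counts.

    links: iterable of (u, v) page pairs, u linking to v.
    class_probs: {page: {topic: probability}}.
    topics: ordered list of topic names; fixes row and column order.
    """
    index = {topic: k for k, topic in enumerate(topics)}
    size = len(topics)
    C = [[0.0] * size for _ in range(size)]
    for u, v in links:
        for topic_u, p_u in class_probs[u].items():
            for topic_v, p_v in class_probs[v].items():
                # Entry (i, j) grows by p_i(u) * p_j(v): the soft count
                # of "a page about topic i cites a page about topic j".
                C[index[topic_u]][index[topic_v]] += p_u * p_v
    return C
```

When each page's probabilities sum to 1, every link adds a total mass of 1 to C, so a row of C shows how the outgoing links of one topic are spread across all topics.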
11
Concluding remarks
Characterize some important notions of topical locality on the web
Open problems
–PageRank jump parameter
–Topical stability of distillation algorithms
–Better crawling algorithms
12
Q & A?