1
The Structure of Broad Topics on the Web
Soumen Chakrabarti, Mukul M. Joshi, Kunal Punera (Lab. for Intelligent Internet Research, IIT Bombay)
David M. Pennock (NEC Research Institute)
2
Graph structure of the Web
Over two billion nodes, 20 billion links
Power-law degree distribution: Pr(degree = k) ∝ 1/k^2.1
Looks like a “bow-tie” at large scale
[Figure: bow-tie diagram with regions IN, OUT, and the strongly connected core (SCC), captioned “This is the Web”]
3
The need for content-based models
Why does a radius-1 expansion help in topic distillation?
Why does topic-specific focused crawling work?
Why is a global PageRank useful for specific queries?
[Figure labels: search engine, query, root set; crawler, classifier, check frontier topic, prune if irrelevant; uniform jump, walk to out-neighbor]
4
The need for content-based models
How are different topics linked to each other?
  Applications: crawling, classification, clustering
Are URL collections representative of Web topic populations?
  Web directories: Dmoz, Yahoo!
  TREC Web track
[Figure caption: “This is the Web with topics!”]
5
How to characterize “topics”
Web directories: a natural choice
Start with
Keep pruning until all leaf topics have enough (>300) samples
Approx. 120k sample URLs
Flatten to approx. 482 topics
Train a text classifier (Rainbow)
Characterize a new document d as a vector of probabilities p_d = (Pr(c|d) for all topics c)
[Figure: test doc → classifier]
Generalize to groups of documents
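The slide does not say how the per-document vector is generalized to a group of documents. A minimal sketch, assuming the group vector is simply the average of the per-document vectors (the function name `group_topic_vector` and the toy probabilities are hypothetical, not from the talk):

```python
from typing import Dict, List

TopicVector = Dict[str, float]  # topic name -> Pr(c|d)

def group_topic_vector(doc_vectors: List[TopicVector]) -> TopicVector:
    """Average per-document topic probabilities over a group (an assumed rule)."""
    if not doc_vectors:
        raise ValueError("need at least one document vector")
    topics = set()
    for p_d in doc_vectors:
        topics.update(p_d)
    n = len(doc_vectors)
    # A topic missing from a document counts as probability 0 for that document.
    return {c: sum(p_d.get(c, 0.0) for p_d in doc_vectors) / n
            for c in sorted(topics)}

# Toy classifier outputs for two documents (made-up numbers).
docs = [{"Health": 0.75, "Sports": 0.25}, {"Health": 0.25, "Sports": 0.75}]
group = group_topic_vector(docs)
```

Because each input vector sums to 1, the average also sums to 1, so the result can be treated as a topic distribution for the group.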
6
Critique and defense
Cannot capture fine-grained or emerging topics
  Emerging topics most often specialize existing broad topics, which rarely change
Classifier may be inaccurate
  Adequate if much better than random guess
  Can compensate for errors using held-out validation data
Results depend on one Web directory
  Can repeat with many others and compare
7
Background topic distribution
What fraction of Web pages are about Health?
Sampling via random walk
  PageRank walk (Henzinger et al.)
  Undirected regular walk (Bar-Yossef et al.)
    Make the graph undirected (link:…)
    Add self-loops so that all nodes have the same degree
Sample with a large stride
Collect topic histograms
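The undirected regular walk can be sketched on a small in-memory graph. This is an illustration under stated assumptions, not the authors' crawler: the adjacency dict, the `max_degree` bound, and the function name are hypothetical, and the self-loops are simulated by staying put rather than stored explicitly.

```python
import random
from typing import Dict, List

def regular_walk_samples(adj: Dict[str, List[str]], start: str,
                         max_degree: int, steps: int, stride: int,
                         seed: int = 0) -> List[str]:
    """Walk an undirected graph padded with self-loops to a common degree."""
    rng = random.Random(seed)
    node = start
    samples = []
    for t in range(1, steps + 1):
        neighbors = adj[node]
        # A node of degree k has (max_degree - k) implicit self-loops, so it
        # follows a real edge with probability k / max_degree and stays otherwise.
        if rng.random() < len(neighbors) / max_degree:
            node = rng.choice(neighbors)
        # Keep only every stride-th node to reduce correlation between samples.
        if t % stride == 0:
            samples.append(node)
    return samples

# Toy undirected star graph: hub "a" with three leaves.
adj = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
samples = regular_walk_samples(adj, "a", max_degree=3, steps=40000, stride=4)
```

On a connected graph this walk has a uniform stationary distribution over nodes, which is why the sampled pages can be classified and tallied into a topic histogram without a degree correction.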
8
Convergence
Start from pairs of diverse topics
Two random walks, sample from each walk
Measure distance between topic distributions
  L1 distance |p1 – p2| = Σ_c |p1(c) – p2(c)|, in [0, 2]
  Below 0.05 to 0.2 within 300 to 400 physical pages
Makes precise the notion that “memory of topic is rapidly lost in a Web walk”
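The L1 distance between two topic distributions is straightforward to compute. A small sketch, assuming distributions are stored as dicts from topic name to probability (the representation and the example values are illustrative):

```python
from typing import Dict

def l1_distance(p1: Dict[str, float], p2: Dict[str, float]) -> float:
    """Sum over topics c of |p1(c) - p2(c)|; missing topics count as 0."""
    topics = set(p1) | set(p2)
    return sum(abs(p1.get(c, 0.0) - p2.get(c, 0.0)) for c in topics)

same = l1_distance({"Health": 0.5, "Sports": 0.5}, {"Health": 0.5, "Sports": 0.5})
disjoint = l1_distance({"Health": 1.0}, {"Sports": 1.0})
```

Identical distributions give 0 and distributions with disjoint support give 2, which are the two ends of the [0, 2] range.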
9
Biases in topic directories
Use Dmoz to train a classifier
Sample the Web
Classify the samples
Diff the Dmoz topic distribution from the Web sample topic distribution
Report maximum deviation in fractions
NOTE: Not exactly Dmoz
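The diff step can be sketched as follows. This assumes each page gets a single hard topic label, which is a simplification of the probability vectors used elsewhere in the talk; the function names and label lists are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

def topic_fractions(labels: List[str]) -> Dict[str, float]:
    """Turn a list of topic labels into fractions that sum to 1."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def max_deviation(directory: Dict[str, float],
                  web: Dict[str, float]) -> Tuple[str, float]:
    """Return the topic whose Web fraction differs most from the directory's."""
    topics = set(directory) | set(web)
    devs = {c: web.get(c, 0.0) - directory.get(c, 0.0) for c in topics}
    # Sort first so ties are broken deterministically by topic name.
    worst = max(sorted(devs), key=lambda c: abs(devs[c]))
    return worst, devs[worst]

directory = topic_fractions(["Arts"] * 5 + ["Health"] * 5)
web = topic_fractions(["Arts"] * 2 + ["Health"] * 6 + ["Sports"] * 2)
worst, dev = max_deviation(directory, web)
```

A negative deviation means the topic is over-represented in the directory relative to the Web sample; a positive one means it is under-represented.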
10
Topic-specific degree distribution
Preferential attachment: connect to v with probability proportional to the current degree of v, regardless of topic
More realistic: u has a topic, and links to v with related topics
Unclear if the power law should still hold
  Holds for large degree
[Figure panels: intra-topic linkage, inter-topic linkage]
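The topic-blind baseline can be sketched in a few lines. This is a generic illustration of preferential attachment with one link per new node, not the model the authors measured; the function name and parameters are assumptions.

```python
import random
from typing import List, Tuple

def preferential_attachment(n_nodes: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Grow a graph where each new node links to v w.p. proportional to deg(v)."""
    if n_nodes < 2:
        raise ValueError("need at least two nodes")
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Each node appears in this list once per incident edge, so a uniform
    # draw from it picks a node with probability proportional to its degree.
    endpoints = [0, 1]
    for u in range(2, n_nodes):
        v = rng.choice(endpoints)
        edges.append((u, v))
        endpoints.extend((u, v))
    return edges

edges = preferential_attachment(1000)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
```

A topic-aware variant would restrict or reweight the choice of v by topic similarity to u, which is the case the slide flags as unclear.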
11
Random forward walk without jumps
Sampling walk is designed to mix topics well
How about walking forward without jumping?
  Start from a page u0 on a specific topic
  Sample many forward random walks (u0, u1, …, ui, …)
  Compare (Pr(c|ui) for all c) with (Pr(c|u0) for all c) and with the background distribution
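A forward walk and its drift from the starting topic can be sketched as below. The graph, the per-page topic vectors, and the choice of L1 distance as the comparison are illustrative assumptions; the comparison against the background distribution would follow the same pattern.

```python
import random
from typing import Dict, List

def forward_walk(out_links: Dict[str, List[str]], u0: str,
                 length: int, rng: random.Random) -> List[str]:
    """Follow out-links uniformly at random, with no jumps; stop at dead ends."""
    path = [u0]
    for _ in range(length):
        successors = out_links.get(path[-1], [])
        if not successors:
            break
        path.append(rng.choice(successors))
    return path

def drift_from_start(path: List[str],
                     topic_of: Dict[str, Dict[str, float]]) -> List[float]:
    """L1 distance between each page's topic vector and the start page's."""
    p0 = topic_of[path[0]]
    drifts = []
    for u in path:
        p = topic_of[u]
        drifts.append(sum(abs(p.get(c, 0.0) - p0.get(c, 0.0))
                          for c in set(p) | set(p0)))
    return drifts

out_links = {"a": ["b"], "b": ["c"], "c": []}
topic_of = {"a": {"Health": 1.0},
            "b": {"Health": 0.5, "Sports": 0.5},
            "c": {"Sports": 1.0}}
path = forward_walk(out_links, "a", 5, random.Random(0))
drifts = drift_from_start(path, topic_of)
```

Averaging the drift at each step i over many sampled walks gives a curve of topic drift against path length.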
12
Observations and implications
Forward walks wander away from the starting topic slowly
But do not converge to the background distribution
Global PageRank is OK also for topic-specific queries
  Jump parameter d = 0.1 to 0.2
  Topic drift not too bad within a path length of 5 to 10
  Prestige conferred mostly by same-topic neighbors
Also explains why focused crawling works
[Figure: PageRank walk. With probability d, jump to a random node; with probability (1-d), walk to an out-neighbor uniformly at random. Labels: jump, high-prestige node]
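The role of the jump parameter d is easiest to see in a small power-iteration sketch of PageRank. The treatment of pages without out-links (their mass is spread uniformly) is an assumption made here for completeness, and the toy graph is made up.

```python
from typing import Dict, List

def pagerank(out_links: Dict[str, List[str]], d: float = 0.15,
             iterations: int = 100) -> Dict[str, float]:
    """Power iteration: w.p. d jump to a random node, else follow an out-link."""
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new = {u: 0.0 for u in nodes}
        dangling = 0.0
        for u in nodes:
            successors = out_links.get(u, [])
            if successors:
                share = (1.0 - d) * rank[u] / len(successors)
                for v in successors:
                    new[v] += share
            else:
                # Assumption: a dead-end page hands its mass to a uniform jump.
                dangling += (1.0 - d) * rank[u]
        base = (d + dangling) / n
        for u in nodes:
            new[u] += base
        rank = new
    return rank

ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]}, d=0.15)
```

With d in the range of 0.1 to 0.2, the expected number of links followed between jumps is roughly 5 to 10, which matches the path length over which the slide reports tolerable topic drift.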
13
Citation matrix
Given a page is about topic i, how likely is it to link to topic j?
  Matrix C[i,j] = probability that a page about topic i links to a page about topic j
  Soft counting: for each link (u, v), C[i,j] += Pr(i|u) Pr(j|v)
Applications
  Classifying Web pages into topics
  Focused crawling for topic-specific pages
  Finding relations between topics in a directory
[Figure: link from page u to page v]
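Soft counting can be sketched directly from the update rule on the slide. The row normalization at the end is an assumption made here so that each row reads as a probability distribution over target topics; the link list and topic vectors are toy data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def citation_matrix(links: List[Tuple[str, str]],
                    topic_of: Dict[str, Dict[str, float]]
                    ) -> Dict[str, Dict[str, float]]:
    """Accumulate C[i][j] += Pr(i|u) * Pr(j|v) per link, then normalize rows."""
    counts = defaultdict(lambda: defaultdict(float))
    for u, v in links:
        for i, p_i in topic_of[u].items():
            for j, p_j in topic_of[v].items():
                counts[i][j] += p_i * p_j
    matrix = {}
    for i, row in counts.items():
        total = sum(row.values())
        if total > 0:
            # Assumed normalization: each row sums to 1.
            matrix[i] = {j: w / total for j, w in row.items()}
    return matrix

links = [("u", "v"), ("u", "w")]
topic_of = {"u": {"Health": 1.0},
            "v": {"Health": 1.0},
            "w": {"Health": 0.5, "Shopping": 0.5}}
C = citation_matrix(links, topic_of)
```

In this toy case a Health page links to Health with probability 0.75 and to Shopping with probability 0.25.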
14
Citation, confusion, correction
Classifier’s confusion on held-out documents can be used to correct confusion matrix
[Figure: matrices over the top-level topics Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports. Axis labels: from topic / to topic, and true topic / guessed topic]
15
Fine-grained views of citation
Prominent off-diagonal entries (/Arts/Music to /Shopping/Music) raise design issues for taxonomy editors and maintainers
Drill-down operation: an applet coupled with a database and the Web lets us sample pages and link distributions and inspect data points
Other examples: News, Regional
Clear block-structure derived from coarse-grain topics
Strong diagonal blocks reflect tightly-knit topic communities
16
Concluding remarks
A model for content-based communities
  New characterization and measurement of topical locality on the Web
  How to set the PageRank jump parameter?
  Topical stability of topic distillation
  Better crawling and classification
A tool for Web directory maintenance
  Fair sampling and representation of topics
  Block-structure and off-diagonals
  Taxonomy inversion