A Graph-Theoretic Approach to Webpage Segmentation
Deepayan Chakrabarti, Ravi Kumar

1 A Graph-Theoretic Approach to Webpage Segmentation
Deepayan Chakrabarti (deepay@yahoo-inc.com)
Ravi Kumar (ravikuma@yahoo-inc.com)
Kunal Punera (kpunera@yahoo-inc.com)

2 Motivation and Related Work
[Figure: example webpage with labeled regions: header, navigation bar, primary content, related links, copyright, ad]

3 Motivation and Related Work
[Figure: example webpage with labeled regions: header, navigation bar, primary content, related links, copyright, ad]
Divide a webpage into visually and semantically cohesive sections.

4 Motivation and Related Work
Sectioning can be useful in:
- Webpage classification
- Displaying webpages on mobile phones and small-screen devices
- Webpage ranking
- Duplicate detection
- …

5 Motivation and Related Work
A lot of recent interest:
- Informative structure mining [Cai+/2003, Kao+/2005]
- Displaying webpages on small screens [Chen+/2005, Baluja/2006]
- Template detection [Bar-Yossef+/2002]
- Topic distillation [Chakrabarti+/2001]
Existing methods are based solely on visual, or content, or DOM-based cues.
They are mostly heuristic approaches.

6 Motivation and Related Work
Our contributions:
- Combine visual, DOM, and content-based cues
- Propose a formal graph-based combinatorial optimization approach
- Develop two instantiations, both with:
  - Approximation guarantees
  - Automatic determination of the number of sections
- Develop methods for automatic learning of graph weights

7 Outline
- Motivation and Related Work
- Proposed Work
- Experiments
- Conclusions

8 Proposed Work
A graph-based approach:
- Construct a neighborhood graph of DOM tree nodes
- Two nodes are neighbors if they are close according to:
  - DOM tree distance, or
  - visual distance when rendered on the screen, or
  - similar content types
- Partition the neighborhood graph to optimize a cost function
[Figure: a DOM tree over nodes A, B, C, D, E and the corresponding neighborhood graph]
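The construction step above can be sketched roughly as follows. This is only an illustration under stated assumptions: the tree shape, the node centers, the two thresholds (`max_tree_dist`, `max_visual_dist`) and the function names are all hypothetical and do not come from the slides, and the content-type criterion is left out for brevity.

```python
import math
from itertools import combinations

def tree_distance(parent, a, b):
    """Number of edges between a and b in a tree given as a child -> parent map."""
    def path_to_root(n):
        path = [n]
        while n in parent:
            n = parent[n]
            path.append(n)
        return path
    pa, pb = path_to_root(a), path_to_root(b)
    depth_in_a = {n: i for i, n in enumerate(pa)}
    for j, n in enumerate(pb):
        if n in depth_in_a:  # first common ancestor
            return depth_in_a[n] + j
    raise ValueError("nodes are not in the same tree")

def visual_distance(center, a, b):
    """Euclidean distance between the rendered centers of two nodes."""
    (xa, ya), (xb, yb) = center[a], center[b]
    return math.hypot(xa - xb, ya - yb)

def neighborhood_graph(parent, center, max_tree_dist=1, max_visual_dist=50.0):
    """Connect two nodes if they are close in the DOM tree OR close on screen."""
    edges = set()
    for a, b in combinations(sorted(center), 2):
        if (tree_distance(parent, a, b) <= max_tree_dist
                or visual_distance(center, a, b) <= max_visual_dist):
            edges.add((a, b))
    return edges

# Hypothetical page: A is the root, B and E are its children, C and D are children of B.
parent = {"B": "A", "E": "A", "C": "B", "D": "B"}
center = {"A": (100, 100), "B": (100, 60), "C": (90, 60),
          "D": (120, 60), "E": (100, 300)}
edges = neighborhood_graph(parent, center)
```

In this toy page, C and D are siblings (tree distance 2) but are rendered next to each other, so they still become neighbors; C and E are far apart under both measures and stay unconnected.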

9 Proposed Work
A graph-based approach:
- What is a good cost function?
  - Intuitive
  - Has polynomial-time algorithms that can get provably close to the optimal
  - Candidates: Correlation Clustering, Energy-minimizing Graph Cuts
- How should we set weights in the neighborhood graph?
[Figure: a DOM tree over nodes A, B, C, D, E and the corresponding neighborhood graph]

10 Correlation Clustering
- Assign each DOM node p to a section S(p)
- V_pq are edge weights in the neighborhood graph
- V_pq acts as the penalty for having DOM nodes p and q in different sections
[Figure: neighborhood graph over nodes A, B, C, D, E with edge weights V_AB, V_AE, V_BC]
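The cost function itself is not preserved in this transcript. One common way to write a correlation-clustering objective that matches the description above is sketched here, under the assumptions that each V_pq lies in [0, 1] and that E denotes the edge set of the neighborhood graph (both are assumptions, not taken from the slides):

```latex
\min_{S} \;
\sum_{\substack{(p,q) \in E \\ S(p) \neq S(q)}} V_{pq}
\;+\;
\sum_{\substack{(p,q) \in E \\ S(p) = S(q)}} \bigl(1 - V_{pq}\bigr)
```

The first sum charges V_pq whenever two neighbors are split; the second charges the complementary amount whenever they are merged, so a weight near 1 favors placing p and q in the same section.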

11 Correlation Clustering
Rendering Constraint:
- Each pixel on the screen must belong to at most one section
- Parent section = child section
- The constraint only applies to DOM nodes “aimed” at visual rendering
[Figure: DOM tree with parent A and children B, C; S_A = ?]
Either S_A = S_B = S_C, or S_A ≠ S_B and S_A ≠ S_C.

12 Correlation Clustering
Rendering Constraint:
- Each pixel on the screen must belong to at most one section
- Not enforced by CCLUS
Workaround: use only leaf nodes in the neighborhood graph
- But content cues may be too noisy at the leaf level
[Figure: DOM tree with parent A and children B, C; S_A = ?]
Either S_A = S_B = S_C, or S_A ≠ S_B and S_A ≠ S_C.

13 Correlation Clustering
Algorithm [Ailon+/2005]:
- Pick a random leaf node p
- Create a new section containing p and all nodes q that are strongly connected to p
- Remove p and all such q from the neighborhood graph
- Iterate
Properties:
- Within a factor of 2 of the optimal
- Number of sections is picked automatically
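The pivot-style loop described above can be sketched as follows. The slide's exact definition of "strongly connected" is not preserved, so this sketch assumes it means V_pq >= 0.5; the function name `cc_pivot`, the `threshold` parameter and the toy weights are likewise illustrative only.

```python
import random

def cc_pivot(nodes, weight, threshold=0.5, seed=0):
    """Pivot-based clustering: repeatedly pick a random remaining node and
    group it with every remaining node whose edge weight to it is high enough.

    `weight(p, q)` is assumed to return a value in [0, 1], where larger
    values mean p and q should share a section.
    """
    rng = random.Random(seed)
    remaining = set(nodes)
    sections = []
    while remaining:
        pivot = rng.choice(sorted(remaining))
        section = {pivot} | {q for q in remaining
                             if q != pivot and weight(pivot, q) >= threshold}
        sections.append(section)
        remaining -= section  # the pivot and its partners leave the graph
    return sections

# Toy weights: {A, B, C} are mutually similar, as are {D, E}.
groups = [{"A", "B", "C"}, {"D", "E"}]

def toy_weight(p, q):
    return 0.9 if any(p in g and q in g for g in groups) else 0.1

sections = cc_pivot("ABCDE", toy_weight)
```

Note that the number of sections is never passed in: it falls out of how many pivots are needed before the graph is empty.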

14 Proposed Work
A graph-based approach:
- What is a good cost function?
  - Intuitive
  - Has polynomial-time algorithms that can get provably close to the optimal
  - Candidates: Correlation Clustering, Energy-minimizing Graph Cuts
- How should we set weights in the neighborhood graph?
[Figure: a DOM tree over nodes A, B, C, D, E and the corresponding neighborhood graph]

15 Energy-minimizing Graph Cuts
- Extra: a predefined set of labels
- Assign to each node p a label S(p)
- The cost has two parts: the distance of each node to its label, and the distance between pairs of nodes
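The energy function is not preserved in this transcript. The standard form for this kind of labeling energy, written with the D_p and V_pq symbols used on the later slides and with E assumed to be the edge set of the neighborhood graph, is roughly:

```latex
E(S) \;=\; \sum_{p} D_p\bigl(S(p)\bigr)
\;+\; \sum_{(p,q) \in E} V_{pq}\bigl(S(p),\, S(q)\bigr)
```

Here D_p(α) is the cost of giving node p the label α (distance of node to label), and V_pq(α, β) is the cost of giving neighbors p and q the labels α and β (distance between pairs of nodes).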

16 Energy-minimizing Graph Cuts
Difference from CCLUS:
- Node weights D_p in addition to edge weights V_pq
  - D_p: distance of node to label
  - V_pq: distance between pairs of nodes
- D_p and V_pq can depend on the labels (not just “same” or “different”)
[Figure: neighborhood graph over nodes A, B, C, D, E with edge weights V_AB, V_AE, V_BC and node weights D_A, D_B, D_E]

17 Energy-minimizing Graph Cuts
How can we fit the Rendering Constraint?
- Have a special “invisible” label ξ
- Parent is invisible, unless all children have the same label
- Can set the V_pq values accordingly
[Figure: DOM tree with parent A and children B, C; S_A = ? becomes S_A = ξ]

18 Energy-minimizing Graph Cuts
How can we fit the Rendering Constraint?
- Have a special “invisible” label ξ
- Parent is invisible, unless all children have the same label
- Can set the V_pq values accordingly
- Automatically infer “rendering” versus “structural” DOM nodes
[Figure: DOM tree with parent A and children B, C]
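A small sketch of the invisible-label rule, written as a direct rule plus a pairwise parent-child cost of the kind that could be folded into V_pq. The names `INVISIBLE`, `parent_label`, `parent_child_cost` and the constant `big` are hypothetical; the slides do not give the actual V_pq values.

```python
INVISIBLE = "xi"  # stands in for the special label ξ

def parent_label(child_labels):
    """Label a parent may take: the children's common label, else invisible."""
    labels = set(child_labels)
    if len(labels) == 1 and INVISIBLE not in labels:
        return labels.pop()
    return INVISIBLE

def parent_child_cost(parent, child, big=1e9):
    """Pairwise cost on a parent-child edge (an assumed encoding).

    A visible parent must agree with each child; an invisible parent is
    always allowed. `big` is an arbitrary prohibitive cost.
    """
    if parent == INVISIBLE or parent == child:
        return 0.0
    return big

mixed = parent_label([5, 7])    # children disagree
uniform = parent_label([5, 5])  # children agree
```

Because a visible parent is only cheap when every child carries the same label, nodes that end up invisible behave as purely structural nodes, while the rest behave as rendering nodes.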

19 Energy-minimizing Graph Cuts
Why couldn’t we use this trick in CCLUS as well?
- CCLUS only asks: are nodes p and q in the same section or not?
- It cannot handle “special” sections like the invisible section
- Hence, labels give us extra power

20 Energy-minimizing Graph Cuts
Advantages:
- Can use all DOM nodes, while still obeying the Rendering Constraint
  - Better than CCLUS
- Factor-of-2 approximation of the optimal, by performing iterative min-cuts of specially constructed graphs
  - We extend [Kolmogorov+/2004]
- Number of sections is picked automatically

21 Energy-minimizing Graph Cuts
Theorem: V_pq must obey the constraint
- Separation cost ≥ merge cost
- Set V_pq(different) >> V_pq(same) for nodes that are extremely close
- Cost minimization then tries to place them in the same section

22 Energy-minimizing Graph Cuts
Theorem: V_pq must obey the constraint
- Separation cost ≥ merge cost
- However, we cannot use V_pq to push two nodes into different sections
- Use D_p instead

23 Energy-minimizing Graph Cuts
To separate nodes p and q:
- Ensure that either D_p(α) or D_q(α) is large, for any label α (D_p is the distance of node to label)
- So, assigning both p and q to the same label will be too costly
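A tiny worked example of how node costs can separate two nodes even when the edge cost always prefers merging. It minimizes the energy by brute force over all labelings, which is only feasible for toy inputs and is not the iterative min-cut procedure named on the earlier slide; the labels, the numbers and the function names are made up for illustration.

```python
from itertools import product

def energy(labeling, D, V, edges):
    """Sum of node-to-label costs plus pairwise costs over the edges."""
    node_cost = sum(D[p][labeling[p]] for p in labeling)
    edge_cost = sum(V(labeling[p], labeling[q]) for p, q in edges)
    return node_cost + edge_cost

def best_labeling(nodes, labels, D, V, edges):
    """Exhaustive search over all labelings (toy sizes only)."""
    candidates = (dict(zip(nodes, combo))
                  for combo in product(labels, repeat=len(nodes)))
    return min(candidates, key=lambda lab: energy(lab, D, V, edges))

def V(a, b):
    # Separation cost (1.0) >= merge cost (0.0), as the constraint requires.
    return 0.0 if a == b else 1.0

# For every label, at least one of D_p, D_q is large.
D = {"p": {"alpha": 0.0, "beta": 5.0},
     "q": {"alpha": 5.0, "beta": 0.0}}
edges = [("p", "q")]

best = best_labeling(["p", "q"], ["alpha", "beta"], D, V, edges)
best_energy = energy(best, D, V, edges)
```

Putting both nodes on one label costs 5.0 here, while splitting them costs only the separation penalty of 1.0, so the minimizer separates p and q without V_pq ever rewarding separation.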

24 Energy-minimizing Graph Cuts
- The invisible label lets us use the parent-child DOM tree structure
- D_p ensures that nodes with very different content or visual features are split up
- V_pq ensures that nodes with very similar content or visual features are merged

25 Proposed Work
A graph-based approach:
- What is a good cost function?
  - Intuitive
  - Has polynomial-time algorithms that can get provably close to the optimal
  - Candidates: Correlation Clustering, Energy-minimizing Graph Cuts
- How should we set weights in the neighborhood graph?
[Figure: a DOM tree over nodes A, B, C, D, E and the corresponding neighborhood graph]

26 Learning graph weights
Extract content and visual features from training data.
Learning V_pq(.):
- Learn a logistic regression classifier (probability that p and q belong to the same section)
[Figure: neighborhood graph over nodes A, B, C, D, E with edge weights V_AB, V_AE, V_BC and node weights D_A, D_B, D_E]
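A minimal sketch of learning a same-section probability with logistic regression, using plain batch gradient descent so that it needs no external library. The two pair features (normalized DOM distance and normalized visual distance), the training pairs, and the hyperparameters are all invented for illustration; the slides do not list the actual features.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, b, x):
    """Estimated probability that a node pair with features x shares a section."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logistic(X, y, lr=1.0, epochs=2000):
    """Batch gradient descent on the average logistic loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical features per pair: [normalized DOM distance, normalized visual distance]
X = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.2], [0.8, 0.7], [0.9, 0.9], [0.7, 0.8]]
y = [1, 1, 1, 0, 0, 0]  # 1 = the two nodes were in the same hand-labeled section

w, b = train_logistic(X, y)
p_close = predict(w, b, [0.1, 0.1])
p_far = predict(w, b, [0.9, 0.9])
```

The predicted probability can then be used to set V_pq so that pairs likely to share a section are expensive to separate.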

27 Learning graph weights
Extract content and visual features from training data.
Learning D_p(.):
- Training data does not provide labels
- Set of labels = set of DOM tree nodes in that webpage
- D_p(α) = distance in some feature space
- Learn a Mahalanobis distance metric between nodes (distances within a section < distances across sections)
[Figure: neighborhood graph over nodes A, B, C, D, E with edge weights V_AB, V_AE, V_BC and node weights D_A, D_B, D_E]
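For reference, a Mahalanobis-style distance with a given matrix M can be computed as below. This sketch only evaluates the distance; how M is learned from the sectioned training pages is not described on the slide and is not attempted here. The matrix and feature vectors are made-up values.

```python
import math

def mahalanobis(x, y, M):
    """sqrt((x - y)^T M (x - y)) for a positive semidefinite matrix M."""
    d = [a - b for a, b in zip(x, y)]
    Md = [sum(M[i][j] * d[j] for j in range(len(d))) for i in range(len(d))]
    return math.sqrt(sum(di * mdi for di, mdi in zip(d, Md)))

# Hypothetical learned metric: the first feature matters four times as much.
M = [[4.0, 0.0],
     [0.0, 1.0]]

label_features = [0.0, 0.0]  # feature vector of the DOM node used as label alpha
dist_first = mahalanobis([1.0, 0.0], label_features, M)
dist_second = mahalanobis([0.0, 1.0], label_features, M)
```

With M equal to the identity this reduces to ordinary Euclidean distance; a learned M stretches the feature directions that best distinguish sections.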

28 Outline
- Motivation and Related Work
- Proposed Work
- Experiments
- Conclusions

29 Experiments
Manually sectioned 105 randomly chosen webpages to get 1088 sections.
Two measures were used:
- Adjusted Rand index: fraction of leaf node pairs that are correctly predicted to be together or apart (over and above random sectioning)
- Normalized Mutual Information
- Both are between 0 and 1, with higher values indicating better results
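The first measure can be computed from two section assignments over the same leaf nodes as sketched below, using the usual pair-counting definition of the Adjusted Rand index. Whether the authors used exactly this variant is not stated on the slide. Note that this standard definition can dip below 0 when agreement is worse than chance.

```python
from collections import Counter
from math import comb

def adjusted_rand(truth, pred):
    """Adjusted Rand index between two labelings of the same items."""
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs that are together in both labelings.
    together_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    together_truth = sum(comb(c, 2) for c in Counter(truth).values())
    together_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = together_truth * together_pred / total_pairs
    maximum = (together_truth + together_pred) / 2
    if maximum == expected:
        return 1.0
    return (together_both - expected) / (maximum - expected)

# Section ids per leaf node; only the grouping matters, not the id values.
perfect = adjusted_rand([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand([0, 0, 1, 1], [0, 1, 0, 1])
```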

30 Experiments
- CCLUS: only 20% of the webpages score better than 0.6
- GCUTS: almost 50% of the webpages score better than 0.6
[Figure: Adjusted Rand score against % of webpages below that score]

31 Experiments
GCUTS is better than CCLUS (results over all webpages).

32 Experiments
Application to duplicate detection on the Web:
- Collected lyrics of the same songs from 3 different sites (~2300 webpages)
  - Nearly similar content
  - Different template structures
- Our approach:
  - Section all webpages
  - Perform duplicate detection using only the largest section (primary content)
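The "largest section only" idea can be illustrated with a small sketch. The slide does not say which duplicate detector was used, so word shingles with Jaccard similarity are an assumption here, as are the function names and the toy pages; each page is represented as a list of section texts that a segmenter is assumed to have produced already.

```python
def shingles(text, k=3):
    """Set of k-word shingles of a text (lowercased, whitespace tokenized)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def primary_content(sections):
    """Largest section by text length, taken as the primary content."""
    return max(sections, key=len)

def section_similarity(page_a, page_b):
    """Compare two pages using only their largest sections."""
    return jaccard(shingles(primary_content(page_a)),
                   shingles(primary_content(page_b)))

lyrics = "la la la the quick brown fox jumps over the lazy dog tonight"
page1 = ["Site One | Home | Lyrics", lyrics, "Copyright Site One"]
page2 = ["Lyrics Hub navigation", lyrics, "(c) Lyrics Hub all rights reserved"]

with_sectioning = section_similarity(page1, page2)
without_sectioning = jaccard(shingles(" ".join(page1)), shingles(" ".join(page2)))
```

In this toy case the two pages carry identical lyrics inside different templates: comparing only the largest sections reports an exact match, while comparing whole pages is diluted by the template text.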

33 Experiments
- Sectioning > no sectioning
- GCUTS > CCLUS

34 Outline
- Motivation and Related Work
- Proposed Work
- Experiments
- Conclusions

35 Conclusions
- Combined visual, DOM, and content-based cues
- Optimization on a neighborhood graph
  - Node and edge weights are learnt from training data
- Developed CCLUS and GCUTS, both with:
  - Approximation guarantees
  - Automatic determination of the number of sections

36 Learning graph weights
Extract content and visual features from training data.
[Figure: neighborhood graph over nodes A, B, C, D, E with edge weights V_AB, V_AE, V_BC and node weights D_A, D_B, D_E]

37 Energy-minimizing Graph Cuts
What is such a D_p(.) function?
- Use the set of internal DOM nodes as the set of labels
- D_p(α) measures the difference in feature vectors between node p and internal node (label) α
- If nodes p and q are very different, D_p(α) and D_q(α) will differ for all α

38 Correlation Clustering
Does not enforce the Rendering Constraint:
- Each pixel on the screen must belong to at most one section
- Parent nodes should have the same section as their children
Workaround: consider only leaf nodes in the neighborhood graph
- But content cues may be too noisy at the leaf level

39 Correlation Clustering
Does not enforce the Rendering Constraint:
- Each pixel on the screen must belong to at most one section
- Parent section = child section
- Apply the rule only for ancestors “aimed” at visual rendering
[Figure: DOM tree with parent A and children B, C; S_A = ?]
Either S_A = S_B = S_C, or S_A ≠ S_B and S_A ≠ S_C.

40 Correlation Clustering
Does not enforce the Rendering Constraint.
Workaround: consider only leaf nodes in the neighborhood graph
- But content cues may be too noisy at the leaf level
[Figure: DOM tree with parent A and children B, C; S_B = 5, S_C = 7, S_A = ?]
Either S_A = S_B = S_C, or S_A ≠ S_B and S_A ≠ S_C.

41 Energy-minimizing Graph Cuts
How can we fit the Rendering Constraint?
- Have a special “invisible” label ξ
- Parent is invisible, unless all children have the same label
- Can set the V_pq values accordingly
- Automatically infer “rendering” versus “structural” DOM nodes
[Figure: DOM tree with parent A and children B, C. With S_B = 5 and S_C = 7, S_A = ξ; with S_B = 5 and S_C = 5, S_A = 5]

42 Energy-minimizing Graph Cuts
What is the set of labels?
- The set of internal DOM nodes
  - Available at the beginning of the algorithm
  - The labels are themselves nodes, with feature vectors
- D_p(α) = distance in some feature space
  - “Tuned” to the current webpage

