Query Directed Web Page Clustering. Daniel Crabtree, Peter Andreae, Xiaoying Gao. Victoria University of Wellington.


1 Query Directed Web Page Clustering
Daniel Crabtree, Peter Andreae, Xiaoying Gao
Victoria University of Wellington

2 Related Work: Web Page Clustering
All Standard Algorithms
– partitioning (k-means), hierarchical (agglomerative, divisive), …
Web Features
– structure, hyperlinks, colour
Textual Features
– STC: phrases, Lingo: latent semantic indexing
Word Semantics
– Global document analysis, co-occurrence statistics
Query is never used

3 QDC – Query Directed Clustering
1: Find Base Clusters
2: Merge Clusters
3: Split Clusters
4: Select Clusters
5: Clean Clusters

4 QDC – 1: Find Base Clusters
Clean Pages
Identify Base Clusters
Prune Small Clusters
Semantic Prune #1
Semantic Prune #2
[Figure: example base clusters with sizes: Mac (28), Car (40), Auto (25), Animal (18), OS (12), Atari (22), Game (5), Service (80), Forest (11). Figure labels: cluster size; distance(cluster, query); Score #1 = …; Score #2 = …]
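The "Identify Base Clusters" and "Prune Small Clusters" steps can be sketched as follows. This is a minimal illustration only: it assumes one base cluster per word, holding the pages that contain that word, which the slide's example labels suggest but do not state. The function name and the `min_size` parameter are hypothetical, and the two semantic pruning scores are left out because their formulas are not in the transcript.

```python
from collections import defaultdict

def find_base_clusters(pages, min_size=2):
    """Build one candidate cluster per word and drop the small ones.

    pages: dict mapping page id -> iterable of words on that page.
    min_size: hypothetical pruning threshold (not taken from the slides).
    """
    clusters = defaultdict(set)
    for page_id, words in pages.items():
        for word in set(words):          # count each word once per page
            clusters[word].add(page_id)
    # "Prune Small Clusters": keep only clusters with enough pages.
    return {w: ids for w, ids in clusters.items() if len(ids) >= min_size}

# Toy result set for the query "jaguar" (invented data).
pages = {
    "p1": ["jaguar", "car", "auto"],
    "p2": ["jaguar", "car"],
    "p3": ["jaguar", "animal"],
    "p4": ["jaguar", "mac", "os"],
    "p5": ["jaguar", "mac", "os"],
}
base = find_base_clusters(pages, min_size=2)
```

In this toy data the single-page clusters "auto" and "animal" are pruned, while "car", "mac", "os" and the query word itself survive.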

5 QDC – 1: Query Distance
Query: Jaguar
[Figure labels: Car, Home Page, Toyota, Specific, Broad, Ambiguous]

6 QDC – 1: Find Base Clusters
Removes Many Base Clusters
– Normally Negative Effect on Performance
But … Query Directed Score
– Reliable Guide to Cluster Quality
– Removes just Low Quality Clusters
– Improves Performance

7 QDC – 2: Merge Clusters
Merging
[Figure: base clusters Mac (28), Car (40), Auto (25), Animal (18), OS (12), Atari (22); merged clusters Car, Auto (40) and Mac, OS (28)]

8 QDC – 2: Merge Clusters
Single-link Clustering
Similarity Function
– Extension (by page overlap)
– Intension (by description similarity)
Global document analysis: co-occurrence frequency relative to expected frequency if independent
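The merging ideas on this slide can be sketched roughly as below. It is an illustration under stated assumptions, not the authors' implementation: `cooccurrence_ratio` is one plausible reading of "co-occurrence frequency relative to expected frequency if independent", `page_overlap` normalises by the smaller cluster (an assumption), and the 0.6 threshold is invented. Single-link merging is implemented with a small union-find, so any chain of pairwise-similar clusters ends up in one group.

```python
def cooccurrence_ratio(docs_a, docs_b, n_docs):
    """Observed co-occurrence count divided by the count expected
    if the two words occurred independently across n_docs documents."""
    if not docs_a or not docs_b or n_docs <= 0:
        return 0.0
    observed = len(docs_a & docs_b)
    expected = len(docs_a) * len(docs_b) / n_docs
    return observed / expected

def page_overlap(pages_a, pages_b):
    """Shared pages as a fraction of the smaller cluster (assumed normalisation)."""
    smaller = min(len(pages_a), len(pages_b))
    return len(pages_a & pages_b) / smaller if smaller else 0.0

def single_link_merge(clusters, similar):
    """Merge clusters connected by any chain of similar pairs.

    clusters: dict name -> set of pages; similar(a, b) -> bool.
    Returns a list of (sorted names, union of pages).
    """
    names = list(clusters)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similar(a, b):
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return [(sorted(g), set().union(*(clusters[n] for n in g)))
            for g in groups.values()]

# Invented example data and threshold.
clusters = {"car": {1, 2, 3, 4}, "auto": {2, 3, 4},
            "mac": {5, 6}, "os": {5, 6, 7}, "animal": {8, 9}}
merged = single_link_merge(
    clusters, lambda a, b: page_overlap(clusters[a], clusters[b]) >= 0.6)
merged_names = sorted(names for names, _ in merged)
ratio = cooccurrence_ratio({1, 2}, {2, 3}, n_docs=4)
```

A description-similarity test based on `cooccurrence_ratio` could be combined with the overlap test inside the `similar` callback; how the two are actually combined in QDC is not specified in the slides.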

9 QDC – 2: Merge Clusters
Reducing Page Overlap Threshold
– Normally Negative Effect on Performance
But … Description Similarity
– More semantically related clusters merge: increasing cluster coverage
– Fewer semantically unrelated clusters merge: increasing cluster quality

10 QDC – 3: Split Clusters
Single Link Merging
– Cluster Chaining (Drifting)
Hierarchical Agglomerative
– Distance Measure: Path Length
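One way to picture splitting a chained cluster by path length is sketched below. The assumptions are loud ones: the merged cluster is treated as a graph whose nodes are base clusters and whose edges are the similar pairs that caused merging, the distance between two base clusters is the shortest path length in that graph, and groups are agglomerated with a complete-link rule up to a hypothetical `max_path` limit. The slides name only "hierarchical agglomerative" and "path length", so the linkage rule and the limit are illustrative choices.

```python
from collections import deque

def path_lengths(graph, start):
    """Breadth-first search: shortest path length from start to each node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def split_chain(graph, max_path=2):
    """Agglomerate base clusters while every pair in a group stays
    within max_path steps of each other (complete-link, assumed)."""
    dist = {n: path_lengths(graph, n) for n in graph}
    groups = [[n] for n in sorted(graph)]

    def gap(g1, g2):
        return max(dist[a].get(b, float("inf")) for a in g1 for b in g2)

    while len(groups) > 1:
        d, i, j = min((gap(g1, g2), i, j)
                      for i, g1 in enumerate(groups)
                      for j, g2 in enumerate(groups) if i < j)
        if d > max_path:
            break                      # remaining groups are too far apart
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups

# A chain a-b-c-d-e produced by single-link merging (invented example).
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
         "d": ["c", "e"], "e": ["d"]}
parts = split_chain(chain, max_path=2)
```

Here the two ends of the chain are four steps apart, so the chain is split into smaller groups instead of being kept as one drifting cluster.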

11 QDC – 4: Select Clusters
ESTC cluster selection algorithm
– Heuristic based hill-climbing search with look-ahead and advanced branch and bound pruning
Original heuristic
– Page Coverage and Cluster Overlap
New heuristic
– Page Coverage and Cluster Overlap
– Pages Not Covered and Cluster Quality
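The trade-off between page coverage and cluster overlap can be illustrated with a plain greedy hill-climb. This is a simplified sketch, not the ESTC selection algorithm: it has no look-ahead and no branch and bound pruning, it ignores the cluster-quality term of the new heuristic, and the `k` and `overlap_weight` parameters are hypothetical.

```python
def select_clusters(clusters, k=3, overlap_weight=1.0):
    """Greedily pick up to k clusters, rewarding newly covered pages
    and penalising pages that are already covered.

    clusters: dict name -> set of pages.
    Returns (selected names in pick order, set of covered pages).
    """
    selected = []
    covered = set()
    remaining = dict(clusters)

    while remaining and len(selected) < k:
        def gain(name):
            pages = remaining[name]
            return len(pages - covered) - overlap_weight * len(pages & covered)

        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=gain)
        if gain(best) <= 0:
            break                      # no candidate improves the selection
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered

# Invented example: "cars" and "autos" overlap heavily.
candidates = {"cars": {1, 2, 3, 4}, "autos": {2, 3, 4, 5},
              "animals": {6, 7, 8}, "mac": {9, 10}}
chosen, covered_pages = select_clusters(candidates, k=3)
```

Once one of the two overlapping clusters is chosen, the other has a negative gain and is skipped in favour of clusters that cover new pages.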

12 QDC – 5: Clean Clusters
Page-Cluster Relevance
– Based on Base Cluster Membership
– Cluster Size, Cluster Quality
Remove Outliers and Erroneous Inclusions
Sorting improves usability
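A rough sketch of the cleaning step follows. It assumes that a page's relevance to a final cluster is the fraction of that cluster's base clusters containing the page. This is a stand-in only: the slide says relevance also depends on cluster size and cluster quality, whose weighting is not given, and `min_relevance` is a hypothetical threshold.

```python
def clean_cluster(base_clusters, min_relevance=0.5):
    """Drop weakly attached pages and sort the rest by relevance.

    base_clusters: list of page sets that were merged into one final cluster.
    Returns page ids ordered from most to least relevant.
    """
    pages = set().union(*base_clusters) if base_clusters else set()
    relevance = {
        p: sum(p in b for b in base_clusters) / len(base_clusters)
        for p in pages
    }
    kept = [p for p in pages if relevance[p] >= min_relevance]
    # Most relevant first; ties broken by page id for a stable order.
    return sorted(kept, key=lambda p: (-relevance[p], p))

# Invented example: "c" and "d" each appear in only one of three base clusters.
cleaned = clean_cluster([{"a", "b", "c"}, {"a", "b"}, {"a", "d"}])
```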

13 Evaluation
Algorithm Efficiency on 250 Documents
– Ten Times Faster than STC
– One Hundred Times Faster than K-means
Algorithm Performance
– External Evaluation against a rich gold standard
Real World Usability
– Informal Usability Comparison with four algorithms: K-means, ESTC, Lingo, Vivisimo

14 Evaluation: Algorithm Performance
External Evaluation against a rich gold standard
Four Algorithms
– STC, ESTC, K-means, Random
Four Data Sets
– Salsa, Jaguar, GP, Victoria University
Eleven Measurements
– Average and Weighted: Quality, Coverage, Precision, Recall, and Entropy
– Mutual Information
Snippets and Full Page Text
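For readers unfamiliar with the external measures named here, the standard textbook definitions of precision, recall and cluster entropy can be sketched as below. This assumes each page carries a single gold-standard label, which may be simpler than the "rich gold standard" used in the talk; the exact definitions of the eleven measurements are not given in the slides.

```python
import math
from collections import Counter

def precision_recall(cluster, topic):
    """Precision and recall of one cluster against one gold-standard topic."""
    hits = len(cluster & topic)
    precision = hits / len(cluster) if cluster else 0.0
    recall = hits / len(topic) if topic else 0.0
    return precision, recall

def cluster_entropy(cluster, labels):
    """Entropy (in bits) of the gold labels inside a cluster; 0 means pure."""
    counts = Counter(labels[p] for p in cluster)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Invented example data.
p, r = precision_recall({1, 2, 3, 4}, {1, 2, 3, 5, 6})
h = cluster_entropy({1, 2, 3, 4}, {1: "car", 2: "car", 3: "cat", 4: "cat"})
```

The weighted variants on the slide would typically weight each cluster's value by its size before averaging, but that detail is an assumption here.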

15 Evaluation: Quality and Coverage

16 Evaluation: Improvement over Random

17 Evaluation: Precision and Recall

18 Evaluation: Entropy and Mutual Information

19 Evaluation: Real World Usability
QDC finds broader topics
– Maximizes probability of refinement
– Simplifies user's decision process
Fewer choices
Less chance of multiple relevant choices
Fewer semantically meaningless clusters
[Figure: Jaguar Results]

20 Evaluation: Real World Usability
Performance better than indicated by external evaluation
– No penalty for overly specific clusters since gold standard included them
External evaluation shows QDC clusters:
– Contain fewer irrelevant pages
– Cover more relevant pages

21 Conclusion
QDC: New Web Page Clustering Algorithm
Key innovations:
– Query Directed Scoring
– Merging using cluster descriptions
– Solve cluster chaining by splitting
– Improved cluster selection heuristic
Vastly improved performance over other algorithms
– External evaluation
– Informal usability evaluation

22 Further Extension
Use Phrases rather than just Words
– STC, Lingo show large improvement possible
Use Wiki Link similarity (WikiMiner) instead of GND
Future work:
– Improve cluster description similarity merging to consider entire description
– Common shared phrases as key features, use VSM, build vectors for each cluster, new weighting
– Formal usability evaluation

