A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results (Kummamuru et al.). Presented by Bei Yu, Sept. 22nd, 2004


1 A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results
  Kummamuru et al.
  Presented by Bei Yu, Sept. 22nd, 2004

2 Roadmap
  - Properties of topic hierarchy
  - Automatic taxonomy generation (ATG)
  - Monothetic ATG
    - DisCover algorithm
    - CAARD algorithm
    - DSP algorithm
  - Result comparison
  - Questions

3 Generating Topic Hierarchy (Taxonomy)
  Desirable properties of topic hierarchy:
  - Document coverage
  - Compactness (breadth/depth, number of nodes)
  - Sibling node distinctiveness
  - Node label predictiveness
  - General to specific
  - Reach time

4 Monothetic ATG
  Automatic Taxonomy Generation (ATG):
  - Monothetic vs. polythetic
    - Monothetic: cluster assignment based on a single feature
    - Polythetic: cluster assignment based on multiple features
  - Clustering keywords vs. documents vs. both
  - Top-down vs. bottom-up
  Monothetic ATG:
  - Subsumption algorithm (Sanderson and Croft, 1999)
  - DSP (Lawrie et al., 2001)
  - CAARD (Kummamuru and Krishnapuram, 2001)
  - DisCover (this paper)
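
The monothetic/polythetic distinction above can be illustrated with a minimal Python sketch. The toy documents, the centroid term sets, and both function names are hypothetical and are not taken from the paper. The polythetic variant uses simple term overlap as a stand-in for whatever similarity measure a real system would use.

```python
def monothetic_assign(docs, feature):
    """Return ids of documents that contain one given feature (concept)."""
    return {doc_id for doc_id, terms in docs.items() if feature in terms}


def polythetic_assign(docs, centroids):
    """Assign each document to the centroid it shares the most terms with.

    Term overlap is an assumed similarity measure; ties go to the
    alphabetically first centroid name.
    """
    assignment = {}
    for doc_id, terms in docs.items():
        overlap = {name: len(terms & c) for name, c in centroids.items()}
        assignment[doc_id] = max(sorted(overlap), key=overlap.get)
    return assignment


# Toy data (hypothetical): each document is a set of terms.
docs = {
    "d1": {"jaguar", "car", "engine"},
    "d2": {"jaguar", "cat", "habitat"},
    "d3": {"car", "engine", "dealer"},
}

# Monothetic: membership is decided by the single feature "car",
# which can also serve directly as the cluster label.
car_cluster = monothetic_assign(docs, "car")

# Polythetic: membership depends on several features at once.
centroids = {
    "vehicles": {"car", "engine", "dealer"},
    "animals": {"cat", "habitat", "jungle"},
}
clusters = polythetic_assign(docs, centroids)
```

In the monothetic case the defining feature doubles as a readable node label, which is one reason monothetic clusters suit browsing.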

5 DisCover
  - Progressively grow the hierarchy
  - Trade off coverage against compactness
  - Generate an optimal permuted sequence of the concepts under a node
    - Every document is represented as a set of concepts; "concepts under the node" means all the other concepts in the documents covered by the node
  - Select an optimal subset of the concepts with maximal coverage and distinctiveness
  - Question: is the number of child nodes preset?

6 DisCover
  - Coverage
  - Distinctiveness
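
The slides name coverage and distinctiveness as the two selection criteria but do not reproduce the formulas. The Python sketch below shows one way a greedy concept ordering under a node could work. The scoring function, the weights `w_cov` and `w_dist`, the cap `k` on child nodes, and the toy data are all assumptions made for illustration, not the paper's definitions.

```python
def greedy_concept_sequence(docs, candidates, k, w_cov=0.5, w_dist=0.5):
    """Greedily order up to k concepts by an assumed coverage/distinctiveness score.

    coverage(c)        = newly covered documents / all documents
    distinctiveness(c) = newly covered documents / documents containing c
    Both definitions are illustrative stand-ins.
    """
    concept_docs = {
        c: {d for d, terms in docs.items() if c in terms} for c in candidates
    }
    covered, sequence = set(), []
    remaining = sorted(candidates)  # sorted so that ties break deterministically
    total = len(docs)

    while remaining and len(sequence) < k:
        def score(c):
            new = concept_docs[c] - covered
            coverage = len(new) / total
            distinct = len(new) / len(concept_docs[c]) if concept_docs[c] else 0.0
            return w_cov * coverage + w_dist * distinct

        best = max(remaining, key=score)
        if not concept_docs[best] - covered:
            break  # no remaining concept adds a new document
        sequence.append(best)
        covered |= concept_docs[best]
        remaining.remove(best)
    return sequence, covered


# Toy documents under a hypothetical parent node "jaguar".
docs = {
    "d1": {"jaguar", "car", "engine"},
    "d2": {"jaguar", "car", "dealer"},
    "d3": {"jaguar", "cat", "habitat"},
    "d4": {"jaguar", "cat", "zoo"},
    "d5": {"jaguar", "os", "apple"},
}
sequence, covered = greedy_concept_sequence(docs, ["car", "cat", "engine", "os"], k=3)
```

Here "engine" is never chosen because its only document is already covered by "car", which is the kind of redundancy among siblings that a distinctiveness term is meant to penalize.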

7 CAARD (Kummamuru and Krishnapuram, 2001)
  [Diagram; labels: corpus, concepts, inclusion degree, top-level, Min_subset, rest subset, recursive]
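
The CAARD slide refers to an inclusion degree between concepts, but the formula did not survive in the transcript. A minimal Python sketch follows, assuming inclusion is measured as the fraction of one concept's documents that also contain the other concept. This ratio is a common choice and may differ from the measure CAARD actually uses; the function name and data are hypothetical.

```python
def inclusion_degree(docs_a, docs_b):
    """Fraction of the documents containing concept a that also contain concept b.

    Assumed definition: |docs_a & docs_b| / |docs_a|. The measure is
    asymmetric, so inclusion_degree(a, b) != inclusion_degree(b, a) in general.
    """
    if not docs_a:
        return 0.0
    return len(docs_a & docs_b) / len(docs_a)


# Toy document sets (hypothetical).
engine_docs = {"d1", "d3"}
car_docs = {"d1", "d2", "d3", "d4"}

engine_in_car = inclusion_degree(engine_docs, car_docs)
car_in_engine = inclusion_degree(car_docs, engine_docs)
```

Every "engine" document is also a "car" document but not the reverse, so under this assumed measure "car" would sit above "engine" in a hierarchy.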

8 DSP (Lawrie et al., 2001)
  [Diagram; labels: corpus, topic terms, top-level topic terms, vocabulary terms, maximal predictive power and vocabulary coverage, language model, A: topic term, B: vocabulary, A=B, recursion: A <- subtopic term around topic, B=A?]

9 Evaluation
  In general:
  - Precision
  - F-measure
  - User study
  - Summary evaluation (EMIM compared with TF*IDF)
  - Reachability
  - Reach time
  This paper compares:
  - Computational complexity
  - Coverage and compactness
  - Reach time
  - User study
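
Two of the measures listed above, coverage and reach time, can be sketched for a one-level hierarchy in Python. The cost model (one unit per node label read plus one unit per document scanned inside the chosen node) is a simplifying assumption for illustration. The paper's exact reach-time definition is not given in these slides and may differ; the function names and data are hypothetical.

```python
def coverage(nodes, all_docs):
    """Fraction of all documents that appear under at least one node."""
    covered = {d for _, members in nodes for d in members}
    return len(covered & set(all_docs)) / len(all_docs)


def reach_cost(nodes, target):
    """Labels read plus documents scanned before reaching target.

    Assumed model: the user reads node labels in order, opens the first
    node containing the target, then scans its documents in order.
    Returns None if the target is not covered by any node.
    """
    for node_pos, (_, members) in enumerate(nodes, start=1):
        if target in members:
            return node_pos + members.index(target) + 1
    return None


# Toy one-level hierarchy (hypothetical): ordered (label, documents) pairs.
nodes = [("car", ["d1", "d2"]), ("cat", ["d3", "d4"])]
all_docs = ["d1", "d2", "d3", "d4", "d5"]

cov = coverage(nodes, all_docs)
cost_d4 = reach_cost(nodes, "d4")
cost_d5 = reach_cost(nodes, "d5")
```

Adding more nodes can raise coverage while also raising the cost of reaching documents under later nodes, which is the coverage/compactness tension the evaluation looks at.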

10 Results


13 Questions
  - How does performance change as the number of nodes increases further (beyond 9)?
  - How exactly is the concept sequence mapped to the tree structure?

