A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng WWW 07.

1 A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng WWW 07

2 1. Document Clustering Agglomerative Hierarchical Clustering (AHC)

3 Suffix Tree Clustering (STC): commonly used for search result clustering

4 2-1. Suffix Tree Clustering. Ex: 3 documents: "cat ate cheese", "cat ate mouse too", "mouse ate cheese too"

5 cat ate cheese
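
The word-level suffix structure for the three example documents can be sketched as follows. This is a simplified illustration, not the slides' construction: it enumerates every phrase explicitly instead of building a compact suffix tree in linear time, so a phrase such as "cat" appears separately even though a compact tree would fold it into the "cat ate" node. The names `suffix_phrases` and `base_clusters` are hypothetical.

```python
from collections import defaultdict

def suffix_phrases(docs):
    """Map each word-level phrase to the set of document ids containing it.

    Every phrase is a prefix of some suffix of a document, i.e. a path
    from the root of an (uncompacted) word-level suffix trie.
    """
    nodes = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.split()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                nodes[tuple(words[start:end])].add(doc_id)
    return nodes

docs = ["cat ate cheese", "cat ate mouse too", "mouse ate cheese too"]
nodes = suffix_phrases(docs)

# A base cluster is a phrase shared by at least two documents.
base_clusters = {p: d for p, d in nodes.items() if len(d) >= 2}
for phrase, doc_ids in sorted(base_clusters.items()):
    print(" ".join(phrase), sorted(doc_ids))
```

Here "ate" is shared by all three documents, while "cat ate" and "ate cheese" are each shared by two.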

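9 2-2. Base Cluster. $\mathrm{score}(B) = |B| \cdot f(|P|)$, where $|B|$ is the number of documents in base cluster $B$ and $|P|$ is the number of words in its phrase $P$ that have a non-zero score. A word scores zero if it is a stopword, appears in 3 or fewer documents, or appears in more than 40% of the documents. $f$ penalizes single-word phrases and is constant for $|P| > 6$.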
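
A minimal sketch of this scoring rule, under stated assumptions: the slide gives only the shape of f (penalty for single words, constant beyond 6 words), so the concrete values in `phrase_factor` (0.5 for one word, linear growth capped at 6) are illustrative choices, and all function names are hypothetical.

```python
def effective_length(phrase, doc_freq, n_docs, stopwords):
    """Count the words of a phrase that have a non-zero score."""
    count = 0
    for word in phrase:
        df = doc_freq.get(word, 0)
        # Zero-score words: stopwords, too rare (<= 3 docs), too common (> 40%).
        if word in stopwords or df <= 3 or df > 0.4 * n_docs:
            continue
        count += 1
    return count

def phrase_factor(length):
    """Assumed shape of f: penalize one word, grow up to 6, then stay constant."""
    if length <= 0:
        return 0.0
    if length == 1:
        return 0.5  # assumed penalty value for single-word phrases
    return float(min(length, 6))

def score(base_docs, phrase, doc_freq, n_docs, stopwords):
    """score(B) = |B| * f(|P|)."""
    length = effective_length(phrase, doc_freq, n_docs, stopwords)
    return len(base_docs) * phrase_factor(length)

# Hypothetical collection statistics for illustration only.
doc_freq = {"suffix": 10, "tree": 20, "the": 90, "rare": 2}
stopwords = {"the"}
print(score({1, 2, 3, 4, 5}, ("the", "suffix", "tree"), doc_freq, 100, stopwords))
```

With these made-up statistics, "the" is dropped as a stopword, so the phrase has an effective length of 2.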

10 2-3. Combining Base Clusters. Keep the top k (= 500) base clusters. Merge base clusters with high overlap: merge $B_i$ and $B_j$ iff $|B_i \cap B_j| / |B_i| > 0.5$ and $|B_i \cap B_j| / |B_j| > 0.5$.
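
The merge rule can be sketched as below. One assumption goes beyond the slide: merging is treated as transitive, so each final cluster is the union of a connected group of base clusters (tracked here with a small union-find). The function name `merge_base_clusters` is hypothetical, and base clusters are assumed to be non-empty sets of document ids.

```python
def merge_base_clusters(base_clusters, threshold=0.5):
    """Merge base clusters whose mutual overlap exceeds the threshold."""
    n = len(base_clusters)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            common = len(base_clusters[i] & base_clusters[j])
            # Both directions of the overlap test must pass.
            if (common / len(base_clusters[i]) > threshold
                    and common / len(base_clusters[j]) > threshold):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), set()).update(base_clusters[i])
    return sorted(sorted(g) for g in groups.values())

print(merge_base_clusters([{1, 2, 3}, {2, 3, 4}, {7, 8}]))
```

The first two base clusters share two of their three documents each, so they merge; the third stays separate.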

11 2-4. Advantages: high precision even when using only snippets; incremental and linear time; order independent; no magic k. But: top k base clusters? 0.5?

13 3. New Suffix Tree Clustering (NSTC). $d_i^T = [\mathrm{tfidf}(n_1, d_i), \mathrm{tfidf}(n_2, d_i), \ldots]$, where each $n_k$ is a suffix tree node. Clustering: Group-average AHC (GAHC).
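
A rough sketch of this pipeline, with several assumptions not stated on the slide: the tf-idf weight is taken as (1 + log tf) * log(1 + N / df), documents are compared with cosine similarity, and "group-average" is read as the mean pairwise similarity between two clusters. The input format `node_tf` and all function names are hypothetical.

```python
import math

def tfidf_vectors(node_tf, n_docs):
    """Build one vector per document with one component per suffix tree node.

    node_tf[k] maps doc_id -> frequency of node k's phrase in that document.
    """
    vectors = [[0.0] * len(node_tf) for _ in range(n_docs)]
    for k, tf_by_doc in enumerate(node_tf):
        if not tf_by_doc:
            continue
        idf = math.log(1 + n_docs / len(tf_by_doc))  # assumed idf form
        for doc_id, tf in tf_by_doc.items():
            vectors[doc_id][k] = (1 + math.log(tf)) * idf  # assumed tf form
    return vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def gahc(vectors, n_clusters):
    """Group-average agglomerative clustering down to n_clusters clusters."""
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the highest average similarity.
        _, i, j = max((avg_sim(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Four documents, four nodes: docs 0-1 share nodes 0-1, docs 2-3 share nodes 2-3.
node_tf = [{0: 1, 1: 1}, {0: 2, 1: 2}, {2: 1, 3: 1}, {2: 3, 3: 3}]
vectors = tfidf_vectors(node_tf, 4)
print(sorted(sorted(c) for c in gahc(vectors, 2)))
```

This quadratic loop recomputes similarities at every step and is meant only to show the idea; it is not incremental.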

14 4. Evaluation. Use F-measure: $\mathrm{precision}(C_i, G_j) = |C_i \cap G_j| / |C_i|$, $\mathrm{recall}(C_i, G_j) = |C_i \cap G_j| / |G_j|$
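
The two definitions above can be turned into a small evaluation sketch. The slide does not spell out how precision and recall are combined, so two common conventions are assumed here: F is the harmonic mean 2PR / (P + R), and the overall score weights each ground-truth class $G_j$ by its size and takes its best-matching cluster. Function names are hypothetical.

```python
def precision(cluster, truth):
    return len(cluster & truth) / len(cluster)

def recall(cluster, truth):
    return len(cluster & truth) / len(truth)

def f_measure(cluster, truth):
    """Harmonic mean of precision and recall (assumed combination)."""
    p, r = precision(cluster, truth), recall(cluster, truth)
    return 2 * p * r / (p + r) if p + r else 0.0

def overall_f(clusters, classes):
    """Size-weighted best F per ground-truth class (assumed aggregation)."""
    n = sum(len(g) for g in classes)
    return sum(len(g) / n * max(f_measure(c, g) for c in clusters)
               for g in classes)

clusters = [{1, 2, 3}, {4, 5}]   # C_i: clusters produced by the algorithm
classes = [{1, 2}, {3, 4, 5}]    # G_j: ground-truth classes
print(round(overall_f(clusters, classes), 3))
```

In this toy example document 3 is the only misplaced one, and each class finds a best-matching cluster with F = 0.8.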

15 OHSUMED document collection: MeSH indexing terms. RCV1 document collection: categories.

17 5. Comparison. STC: seldom generates large clusters. NSTC: not incremental.

