A New Suffix Tree Similarity Measure for Document Clustering


1 A New Suffix Tree Similarity Measure for Document Clustering
Hung Chim, Xiaotie Deng
City University of Hong Kong
WWW 2007

2 INTRODUCTION
Goal: to develop a document clustering algorithm that categorizes the Web documents in an online community
The Vector Space Document (VSD) model: represents any document as a feature vector of its words
The suffix tree document model: identifies phrases that are common to groups of documents

3 suffix sub-string

4 Suffix Tree Document Model
1. cat ate cheese
2. mouse ate cheese too
3. cat ate mouse too
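The three example documents can be indexed with a small sketch. It is an assumption-laden simplification: instead of a compact suffix tree it stores an uncompacted word-level suffix trie as a flat dictionary, where each key is the word sequence on the path from the root to a node. The name `build_suffix_trie` is hypothetical.

```python
from collections import defaultdict

def build_suffix_trie(docs):
    """Map each root-to-node word path to the set of documents containing it.

    Simplification: every prefix of every suffix becomes its own node,
    so single-child chains are not compacted as in a real suffix tree.
    """
    nodes = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.split()
        for start in range(len(words)):               # one suffix per start position
            for end in range(start + 1, len(words) + 1):
                nodes[tuple(words[start:end])].add(doc_id)
    return nodes

docs = {1: "cat ate cheese", 2: "mouse ate cheese too", 3: "cat ate mouse too"}
nodes = build_suffix_trie(docs)

# Phrases shared by at least two documents.
shared = {" ".join(path): sorted(ids) for path, ids in nodes.items() if len(ids) > 1}
```

Here `shared` holds the phrases common to more than one document, such as "ate" (documents 1, 2 and 3) and "ate cheese" (documents 1 and 2).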

5 STC Algorithm (Suffix Tree Clustering)
1. Generating the common suffix tree
2. Selecting base clusters
Each base cluster B is assigned a score s(B), where |B| = the number of documents in B and |P| = the number of words in phrase P
3. Merging clusters
Jaccard coefficient
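Steps 2 and 3 can be sketched as follows. Several details are assumptions rather than anything stated on the slide: the score is taken to have the form s(B) = |B| · f(|P|), the shape and values of `phrase_factor` are hypothetical, and base clusters are linked when their Jaccard coefficient exceeds a hypothetical threshold of 0.5, with the connected components of that graph becoming the final clusters.

```python
def phrase_factor(num_words):
    # Hypothetical f(|P|): penalize single words, grow linearly up to 6 words, then stay flat.
    if num_words <= 1:
        return 0.5
    return float(min(num_words, 6))

def base_cluster_score(doc_ids, phrase):
    # Assumed form of the score: s(B) = |B| * f(|P|).
    return len(doc_ids) * phrase_factor(len(phrase.split()))

def jaccard(a, b):
    # Jaccard coefficient of two document sets.
    return len(a & b) / len(a | b)

def merge_base_clusters(base_clusters, threshold=0.5):
    """Union base clusters that are linked by a Jaccard coefficient above the threshold."""
    names = list(base_clusters)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(base_clusters[a], base_clusters[b]) > threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).update(base_clusters[n])
    return sorted(sorted(g) for g in groups.values())

base_clusters = {
    "cat ate": {1, 3}, "ate": {1, 2, 3}, "cheese": {1, 2},
    "mouse": {2, 3}, "too": {2, 3}, "ate cheese": {1, 2},
}
merged = merge_base_clusters(base_clusters)
```

With these toy base clusters every two-document cluster overlaps "ate" with coefficient 2/3, so everything collapses into a single merged cluster.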

6 The base cluster graph

7 Problem of STC
The STC algorithm sometimes generates large clusters of poor quality
No quality measure like tf-idf
No single-link, group-average or complete-link merging
Solution: map each node of the suffix tree to a unique dimension of an M-dimensional space
M = total number of nodes in the suffix tree, excluding the root node

8 The New Suffix Tree Similarity Measure
Each document d can be represented as a feature vector of the weights of the M nodes
df(n) = the number of different documents that have traversed node n
tf(n, d) = the total number of times document d traverses node n
ex. df(b) = 3, tf(b, 1) = 1
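The two counts can be illustrated on the three example documents from slide 4. This is a sketch under the assumption that nodes are modeled as word paths of an uncompacted word-level suffix trie, so a document traverses a node once per occurrence of that phrase; `node_statistics` is a hypothetical name.

```python
from collections import Counter, defaultdict

def node_statistics(docs):
    """Return tf[node][doc_id] and df[node] for every word-path node."""
    tf = defaultdict(Counter)
    for doc_id, text in docs.items():
        words = text.split()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                # Each suffix inserted through this path counts as one traversal.
                tf[tuple(words[start:end])][doc_id] += 1
    df = {node: len(counts) for node, counts in tf.items()}
    return tf, df

docs = {1: "cat ate cheese", 2: "mouse ate cheese too", 3: "cat ate mouse too"}
tf, df = node_statistics(docs)
```

The node for "ate" ends up with df = 3 and tf = 1 for document 1, the same values the slide quotes for node b; whether b is that node in the slide's figure is an assumption.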

9 The New Suffix Tree Similarity Measure
tf-idf formula
cosine similarity
GAHC algorithm (group-average agglomerative hierarchical clustering)
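A minimal sketch of the three ingredients, with stated assumptions: the weight uses one common tf-idf variant, (1 + log tf) · log(1 + N/df), which the slide does not spell out; GAHC is read as group-average agglomerative hierarchical clustering; and clusters are compared by their average pairwise cosine similarity, merging until the best score drops below a hypothetical threshold.

```python
import math

def tfidf_weight(tf, df, num_docs):
    # Assumed weighting: (1 + log tf) * log(1 + N / df); zero if the node is never visited.
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(1 + num_docs / df)

def cosine(u, v):
    # Vectors are sparse dicts mapping node -> weight.
    dot = sum(w * v.get(node, 0.0) for node, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def gahc(vectors, threshold):
    """Merge the pair of clusters with the highest average pairwise similarity
    until no pair reaches the threshold."""
    clusters = [[doc] for doc in vectors]
    while len(clusters) > 1:
        best, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j]]
                avg = sum(sims) / len(sims)
                if avg > best:
                    best, best_pair = avg, (i, j)
        if best < threshold:
            break
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy weight vectors (hypothetical values, not taken from the slides).
vectors = {"d1": {"a": 1.0}, "d2": {"a": 1.0, "b": 0.1}, "d3": {"c": 1.0}}
clusters = gahc(vectors, threshold=0.5)
```

In the toy run, d1 and d2 share node "a" and merge, while d3 has no node in common with them and stays on its own.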

10 A Closer Look at the Suffix Tree Document Model
Efficiency analysis
Constructing the suffix tree naively takes O(m^2)
Ukkonen's paper provides an algorithm that builds a suffix tree in O(m)
Stopword or stopnode
Stopword: words in the stoplist (affects the score s(B) of a base cluster)
Stopnode: a node with a high df can be ignored
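The stopnode idea can be sketched as a simple filter. The cutoff is an assumption: the slide only says a node with a high df can be ignored, so `max_df_ratio` and the function name `drop_stopnodes` are hypothetical.

```python
def drop_stopnodes(df, num_docs, max_df_ratio=0.5):
    """Keep only nodes whose document frequency is at most max_df_ratio of the collection."""
    return {node: count for node, count in df.items()
            if count / num_docs <= max_df_ratio}

# Document frequencies for a few nodes of the three-document example.
df = {("ate",): 3, ("cat", "ate"): 2, ("cheese", "too"): 1}
kept = drop_stopnodes(df, num_docs=3, max_df_ratio=0.7)
```

With this cutoff the node "ate", which every document traverses, is dropped, and the two rarer nodes are kept.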

11 Document Preparing
1. combine all posts of the same thread into a single document
2. all non-word tokens are stripped
3. all stopwords are identified and removed
4. Porter stemming algorithm is applied
6. the posts containing at least 3 distinct words are selected
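The preparation steps can be sketched as a small pipeline. Assumptions are marked in the comments: the stoplist is a tiny hypothetical one, text is lowercased, the stemmer is a pluggable parameter that defaults to the identity function (a real Porter stemmer would be passed in), and the distinct-word check is applied to the combined thread document.

```python
import re

STOPWORDS = {"the", "a", "is", "of", "and", "to"}   # hypothetical tiny stoplist

def prepare_thread(posts, stem=lambda word: word, min_distinct=3):
    """Return the cleaned word list for one thread, or None if it is too short."""
    text = " ".join(posts)                               # step 1: one document per thread
    tokens = re.findall(r"[a-z]+", text.lower())         # step 2: keep word tokens only (lowercasing is an assumption)
    tokens = [t for t in tokens if t not in STOPWORDS]   # step 3: remove stopwords
    tokens = [stem(t) for t in tokens]                   # step 4: stemming (identity by default)
    if len(set(tokens)) < min_distinct:                  # step 6: require at least 3 distinct words
        return None
    return tokens

document = prepare_thread(["The cat ate the cheese!", "A mouse ate cheese, too."])
too_short = prepare_thread(["The a of", "and to"])
```

Passing a stemming function as `stem` keeps the sketch free of third-party dependencies while leaving room for the Porter algorithm named on the slide.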

12 Cluster Topic Summary Generating
Topic summary generation involves two information retrieval tasks
1. ranking the documents in a cluster by a quality score
2. extracting common phrases as the topic summary

13 Cluster Topic Summary Generating
Document quality evaluation
Web documents provide some additional human assessments for document quality evaluation: view clicks, reply posts and recommend clicks
The top 10% of documents are taken as the representatives of the cluster
The nodes traversed by the representative documents are selected and sorted by their idf in ascending order
Finally, the top 5 nodes are selected
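The selection procedure can be sketched as below. The quality score itself is not defined on the slide, so it is passed in as a precomputed value per document; idf is assumed to be log(N/df), and ties are broken by node name to keep the result deterministic. All names are hypothetical.

```python
import math

def topic_summary(cluster_docs, quality, doc_nodes, df, num_docs,
                  top_fraction=0.1, top_nodes=5):
    """Pick representative documents, then the lowest-idf nodes they traverse."""
    ranked = sorted(cluster_docs, key=lambda d: quality[d], reverse=True)
    keep = max(1, math.ceil(len(ranked) * top_fraction))   # top 10% of the cluster
    representatives = ranked[:keep]

    candidates = set()
    for doc in representatives:
        candidates.update(doc_nodes[doc])

    def idf(node):
        return math.log(num_docs / df[node])               # assumed idf definition

    # Ascending idf, so the most widely shared nodes come first.
    summary = sorted(candidates, key=lambda n: (idf(n), n))[:top_nodes]
    return representatives, summary

cluster_docs = [f"d{i}" for i in range(10)]
quality = {doc: i for i, doc in enumerate(cluster_docs)}   # hypothetical scores
doc_nodes = {doc: set() for doc in cluster_docs}
doc_nodes["d9"] = {"a", "b", "c", "d", "e", "f"}
df = {"a": 10, "b": 8, "c": 6, "d": 4, "e": 2, "f": 1}
representatives, summary = topic_summary(cluster_docs, quality, doc_nodes, df, num_docs=10)
```

In this toy cluster of ten documents a single representative is chosen, and the rarest node "f" is the one left out of the five-node summary.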

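14 EVALUATION
Clusters produced by the system: C = {C1, C2, …, Ck}
Answer (ground-truth) clusters
Recall(i, j) and Precision(i, j) are computed for each pair of an answer cluster and a system cluster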
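The slide does not spell the two formulas out. Assuming the standard set-overlap definitions, where precision divides the overlap by the size of the system cluster and recall divides it by the size of the answer cluster, a sketch looks like this:

```python
def precision_recall(system_cluster, answer_cluster):
    """Standard overlap-based precision and recall for one pair of clusters (assumed definitions)."""
    overlap = len(system_cluster & answer_cluster)
    precision = overlap / len(system_cluster)   # share of the system cluster that is correct
    recall = overlap / len(answer_cluster)      # share of the answer cluster that was found
    return precision, recall

system_cluster = {1, 2, 3, 4}
answer_cluster = {3, 4, 5, 6, 7, 8}
precision, recall = precision_recall(system_cluster, answer_cluster)
```

Both clusters are plain sets of document ids, so the same function applies to every pair of a system cluster and an answer cluster.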

15 Document Collections
OHSUMED Document Collection
8 categories, 800 documents, containing 6,281 distinct words
The average length of the documents is about 110 words
RCV1 Document Collection
10 groups of documents, containing 19,229 distinct words
The average length of the documents is about 150 words

16 Results and Discussion

17 Results and Discussion
STC algorithm: there is no effective measure for evaluating the quality of the clusters during cluster merging
Thus the STC algorithm seldom generated large clusters of high quality in the experiments

18 Results and Discussion
DS3 document

23 CONCLUSIONS AND FUTURE WORK
By mapping all nodes of the common suffix tree into an M-dimensional space of the VSD model, the advantages of both the VSD model and the suffix tree model are inherited
The suffix tree similarity measure is very simple, but the implementation is quite difficult: time efficiency and space efficiency
Future work: applying the new similarity measure to Chinese documents

