The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005 Department of Electronic & Electrical Engineering University College.

1 The 5th annual UK Workshop on Computational Intelligence, London, 5-7 September 2005
Department of Electronic & Electrical Engineering, University College London, UK
Learning Topic Hierarchies from Text Documents using a Scalable Hierarchical Fuzzy Clustering Method
E. Mendes Rodrigues and L. Sacks {mmendes,

2 Outline
- Document clustering process
- H-FCM: Hyper-spherical Fuzzy C-Means
- H2-FCM: Hierarchical H-FCM
- Clustering experiments
- Topic hierarchies

3 Document Clustering Process
[Diagram labels: Document Collection; Document Representation (Pre-processing, Document Encoding); Document Clustering (Document Similarity, Clustering Method); Document Clusters; Cluster Validity; Application]
Document representation:
- Identify all unique words in the document collection
- Discard common words that are included in the stop list
- Apply stemming algorithm and combine identical word stems
- Apply term weighting scheme to the final set of k indexing terms
- Discard terms using pre-processing filters
Document vectors (Vector-Space Model of Information Retrieval):
X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Nk} \end{bmatrix}
- Very high-dimensional
- Very sparse (over 95%)
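The representation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list is a toy set, `crude_stem` is a placeholder rather than a real stemming algorithm, and tf-idf is assumed as the term weighting scheme because the slide does not name one. The pre-processing filters are left out.

```python
import math
import re

# Toy stop list (assumption; a real system would use a much larger one).
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def crude_stem(word):
    """Placeholder stemmer: drops a trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_vectors(texts):
    """Return (terms, X) where X is the N x k document-term matrix."""
    docs = [
        [crude_stem(w) for w in re.findall(r"[a-z]+", t.lower())
         if w not in STOP_WORDS]
        for t in texts
    ]
    terms = sorted({w for d in docs for w in d})  # the k indexing terms
    n = len(docs)
    # Document frequency of each term, used by the assumed tf-idf weighting.
    df = {t: sum(t in d for d in docs) for t in terms}
    X = [[d.count(t) * math.log(n / df[t]) for t in terms] for d in docs]
    return terms, X


terms, X = build_vectors([
    "The clusters of documents",
    "Fuzzy clusters",
    "Fuzzy control is robust",
])
zeros = sum(x == 0 for row in X for x in row)
print(terms)
print(f"sparsity: {zeros / (len(X) * len(terms)):.0%}")
```

Even on three short texts most entries of X are zero, which is the sparsity effect the slide describes for real collections.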

4 Measures of Document Relationship
- FCM applies the Euclidean distance, which is inappropriate for high-dimensional text clustering
  - Non-occurrence of the same terms in both documents is handled in a similar way as the co-occurrence of terms
- Cosine (dis)similarity measure:
  - Widely applied in Information Retrieval
  - Represents the cosine of the angle between two document vectors
  - Insensitive to different document lengths, since it is normalised by the length of the document vectors
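A small Python sketch can illustrate the length-insensitivity point. The two vectors are made-up examples: `long_doc` repeats the content of `short_doc` twice, so the Euclidean distance separates them while the cosine similarity does not.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def euclidean_distance(a, b):
    """Straight-line distance between two term-weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


short_doc = [1.0, 2.0, 0.0]   # hypothetical term weights
long_doc = [2.0, 4.0, 0.0]    # same topic mix, twice the length
other_doc = [0.0, 0.0, 3.0]   # shares no terms with the others

print(cosine_similarity(short_doc, long_doc))
print(euclidean_distance(short_doc, long_doc))
print(cosine_similarity(short_doc, other_doc))
```

The cosine dissimilarity used for clustering is then simply one minus this value.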

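5 H-FCM: Hyper-spherical Fuzzy C-Means
- Applies the cosine measure to assess document relationships
- Modified objective function: [equation not preserved in transcript]
- Subject to an additional constraint: [equation not preserved in transcript]
- Fuzzy memberships (u) and cluster centroids (v): [equations not preserved in transcript]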
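Since the formulas on this slide are missing, the following Python sketch shows one plausible H-FCM-style iteration rather than the authors' exact equations. It assumes unit-length document vectors, a dissimilarity of one minus the dot product, the usual fuzzy membership update with exponent 1/(m-1), and centroids renormalised to unit length as the additional constraint. The function names and the fuzzifier default `m=2.0` are illustrative choices.

```python
import math


def normalise(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def hfcm_step(docs, centroids, m=2.0):
    """One assumed H-FCM iteration: update memberships, then centroids."""
    eps = 1e-12  # avoids division by zero when a document equals a centroid
    # Cosine dissimilarity; docs and centroids are assumed unit length.
    D = [[max(1.0 - sum(x * y for x, y in zip(d, c)), eps) for c in centroids]
         for d in docs]
    expo = 1.0 / (m - 1.0)
    # U[i][a] is the membership of document i in cluster a.
    U = [[1.0 / sum((row[a] / row[b]) ** expo for b in range(len(row)))
          for a in range(len(row))]
         for row in D]
    new_centroids = []
    for a in range(len(centroids)):
        acc = [0.0] * len(docs[0])
        for i, d in enumerate(docs):
            w = U[i][a] ** m
            for j, x in enumerate(d):
                acc[j] += w * x
        new_centroids.append(normalise(acc))  # keep centroids on the unit sphere
    return U, new_centroids


docs = [normalise(d) for d in
        [[1, 0, 0], [0.9, 0.1, 0], [0, 0, 1], [0, 0.1, 0.9]]]
centroids = [normalise([1, 1, 0]), normalise([0, 1, 1])]
U, centroids = hfcm_step(docs, centroids)
print(U[0], U[2])
```

In practice the step would be repeated until the memberships stop changing by more than a chosen tolerance.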

6 How many clusters?
- Usually the final number of clusters is not known a priori
  - Run the algorithm for a range of c values
  - Apply validity measures and determine which c leads to the best partition (cluster compactness, density, separation, etc.)
- How compact and dense are clusters in a sparse high-dimensional problem space?
  - A very small percentage of documents within a cluster present high similarity to the respective centroid → clusters are not compact
  - However, there is always a clear separation between intra- and inter-cluster similarity distributions

7 H2-FCM: Hierarchical Hyper-spherical Fuzzy C-Means
- Key concepts
  - Apply partitional algorithm (H-FCM) to obtain a sufficiently large number of clusters
  - Exploit the granularity of the topics associated with each cluster to link cluster centroids hierarchically
  - Form a topic hierarchy
- Asymmetric similarity measure
  - Identify parent-child type relationships between cluster centroids
  - Child should be less similar to parent than parent to child
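The slide does not give the asymmetric measure or the linking rule, so the Python sketch below is only one way the idea could work. It assumes a measure S(a, b) equal to the share of a's total term weight that b also covers, and it picks as parent the candidate that is most similar to the child among those satisfying the slide's rule. The `threshold` parameter is a hypothetical stand-in for a parent-child similarity cut-off.

```python
def asym_sim(a, b):
    """Assumed measure: share of a's total weight that b also covers."""
    total = sum(a)
    return sum(min(x, y) for x, y in zip(a, b)) / total if total else 0.0


def link_centroids(centroids, threshold):
    """Return one parent index per centroid; None marks a hierarchy root."""
    parents = []
    for i, child in enumerate(centroids):
        best, best_sim = None, threshold
        for j, cand in enumerate(centroids):
            if i == j:
                continue
            to_parent = asym_sim(child, cand)  # child's similarity to candidate
            to_child = asym_sim(cand, child)   # candidate's similarity to child
            # Slide rule: child is less similar to parent than parent to child.
            if to_parent < to_child and to_child >= best_sim:
                best, best_sim = j, to_child
        parents.append(best)
    return parents


# Hypothetical centroids over three terms.
a = [0.6, 0.6, 0.0]
b = [0.5, 0.5, 0.5]
c = [0.0, 0.0, 1.0]
print(asym_sim(a, b), asym_sim(b, a))
print(link_centroids([a, b, c], threshold=0.5))
```

Under this assumed measure a parent always has a smaller total weight than its child, so the links cannot form a cycle.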

8 [Figure: asymmetric similarity S(v_8, v_5) between cluster centroids v_8 and v_5]

9 Scalability of the Algorithm
- H2-FCM time complexity depends on H-FCM and the centroid linking heuristic
- H-FCM computation time is O(Nc²k)
- Linking heuristic is at most O(c²k)
  - Computation of the asymmetric similarity between every pair of cluster centroids: O(c²k)
  - Generation of the cluster hierarchy: O(c²)
- Overall, H2-FCM time complexity is O(Nc²k)
- Scales well to large document sets!
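To make the complexity terms concrete, this short Python sketch plugs numbers into the leading-order expressions from the slide. Constants are ignored and the collection sizes are hypothetical, so the output only shows relative growth, not running time.

```python
def h2fcm_cost_terms(n_docs, n_clusters, n_terms):
    """Leading-order operation counts from the slide, constants ignored."""
    hfcm = n_docs * n_clusters ** 2 * n_terms   # O(N c^2 k)
    similarity = n_clusters ** 2 * n_terms      # O(c^2 k), all centroid pairs
    hierarchy = n_clusters ** 2                 # O(c^2)
    return {"hfcm": hfcm, "linking": similarity + hierarchy}


small = h2fcm_cost_terms(n_docs=1000, n_clusters=20, n_terms=5000)
large = h2fcm_cost_terms(n_docs=2000, n_clusters=20, n_terms=5000)
print(large["hfcm"] / small["hfcm"])      # clustering term doubles with N
print(small["linking"] / small["hfcm"])   # linking is a small fraction
```

Doubling the number of documents doubles the H-FCM term and leaves the linking term unchanged, which is why the overall cost is dominated by O(Nc²k).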

10 Description of Experiments
- Goal: evaluate the H2-FCM performance
- Evaluation measures: clustering Precision (P) and Recall (R)
- H2-FCM algorithm run for a range of c values
- No. hierarchy roots = No. reference classes → t_PCS dynamically set
- Are sub-clusters of the same topic assigned to the same branch?

                          In reference class     Not in reference class
Assigned to cluster       true positives (tp)    false positives (fp)
Not assigned to cluster   false negatives (fn)   true negatives (tn)
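The contingency counts above lead to the standard definitions of precision and recall, sketched here in Python. How the authors aggregate these values across clusters or hierarchy branches is not stated in the transcript, and the example counts are made up.

```python
def precision_recall(tp, fp, fn):
    """Standard precision and recall from contingency counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical cluster: 8 documents correctly assigned, 2 wrongly assigned,
# and 4 documents of the reference class left out.
p, r = precision_recall(tp=8, fp=2, fn=4)
print(f"P = {p:.2f}, R = {r:.2f}")
```

True negatives do not enter either measure, which keeps both meaningful when most documents lie outside any given reference class.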

11 Test Document Collections
- Reuters test collection
- Open Directory Project (ODP)
- INSPEC database

Class labels (as listed on the slide): back-propagation, fuzzy control, pattern clustering; game, lego, math, safety, sport; crude, interest, money-fx, ship, trade; acq, earn, trade

Table columns: Collection; Size (N, k); Document length (avg, stdev); Document sparsity (avg, stdev). Recoverable values:

Collection   Sparsity avg   Sparsity stdev
reuters1     99.67 %
reuters      99.39 %        0.47 %
odp          97.69 %        0.50 %
inspec       99.59 %        0.14 %

12 Clustering Results: H2-FCM Precision and Recall
[Figure panels: reuters1, reuters2, odp, inspec]

13 Topic Hierarchy
- Each centroid vector consists of a set of weighted terms
- Terms describe the topics associated with the document cluster
- Centroid hierarchy produces a topic hierarchy
  - Useful for efficient access to individual documents
  - Provides context to users in exploratory information access

14 Topic Hierarchy Example

15 Concluding Remarks
- H2-FCM clustering algorithm
  - Partitional clustering (H-FCM)
  - Linking heuristic organizes centroids hierarchically based on asymmetric similarity
- Scales linearly with the number of documents
- Exhibits good clustering performance
- Topic hierarchy can be extracted from the centroid hierarchy

16 Clustering in Sparse High-dimensional Spaces
[Figure: intra- and inter-cluster similarity CDFs; panels: reuters1, reuters2]

17 Clustering in Sparse High-dimensional Spaces (contd.)
[Figure: intra- and inter-cluster similarity CDFs; panels: odp, inspec]

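18 FCM: Fuzzy C-Means
- Iterative optimization of an objective function: [equation not preserved in transcript]
- Subject to constraints: [equation not preserved in transcript]
- Fuzzy memberships (u) and cluster centroids (v): [equations not preserved in transcript]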
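For reference, the standard textbook form of Fuzzy C-Means is given below, using the deck's symbols (N documents x_i, c clusters, memberships u, centroids v). The fuzzifier m and the index names are notational assumptions, and the slide's own typesetting may have differed.

```latex
\begin{align}
  % Objective function minimised iteratively
  J_m(U, V) &= \sum_{i=1}^{N} \sum_{\alpha=1}^{c}
               u_{\alpha i}^{m} \, \lVert x_i - v_\alpha \rVert^{2},
               \qquad m > 1 \\
  % Constraint: memberships of each document sum to one
  \sum_{\alpha=1}^{c} u_{\alpha i} &= 1 \quad \forall i,
               \qquad u_{\alpha i} \in [0, 1] \\
  % Membership update
  u_{\alpha i} &= \left[ \sum_{\beta=1}^{c}
               \left( \frac{\lVert x_i - v_\alpha \rVert}
                           {\lVert x_i - v_\beta \rVert} \right)^{2/(m-1)}
               \right]^{-1} \\
  % Centroid update
  v_\alpha &= \frac{\sum_{i=1}^{N} u_{\alpha i}^{m} \, x_i}
                   {\sum_{i=1}^{N} u_{\alpha i}^{m}}
\end{align}
```

The two update rules are applied alternately until the memberships or the objective value stop changing appreciably.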

