The 5th Annual UK Workshop on Computational Intelligence, London, 5-7 September 2005. Department of Electronic & Electrical Engineering, University College London.

The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005 Department of Electronic & Electrical Engineering University College London, UK Learning Topic Hierarchies from Text Documents using a Scalable Hierarchical Fuzzy Clustering Method E. Mendes Rodrigues and L. Sacks {mmendes,

Outline
- Document clustering process
- H-FCM: Hyper-spherical Fuzzy C-Means
- H²-FCM: Hierarchical H-FCM
- Clustering experiments
- Topic hierarchies

Document Clustering Process
The overall process: a document collection goes through document representation (pre-processing and document encoding), then document clustering (document similarity, clustering method, cluster validity), and the resulting document clusters feed the target application.
Pre-processing:
- Identify all unique words in the document collection
- Discard common words that are included in the stop list
- Apply a stemming algorithm and combine identical word stems
- Apply a term weighting scheme to the final set of k indexing terms
- Discard terms using pre-processing filters
Document encoding yields the Vector-Space Model of Information Retrieval: an N×k matrix X = [x_ij] of document vectors, which is very high-dimensional and very sparse (over 95%).
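The pre-processing steps above can be sketched as follows. The stop list and the TF-IDF weighting variant are illustrative assumptions, and stemming is omitted for brevity; these are not the exact choices used in the original experiments.

```python
# Minimal sketch of the pre-processing pipeline: tokenise, drop
# stop-listed words, and apply a TF-IDF term weighting scheme to
# obtain sparse, high-dimensional document vectors.
import math
from collections import Counter

STOP_LIST = {"the", "a", "of", "and", "in", "to", "is"}  # illustrative

def tfidf_vectors(documents):
    """Return one {term: weight} vector per document."""
    tokenised = [
        [w.lower() for w in doc.split() if w.lower() not in STOP_LIST]
        for doc in documents
    ]
    n_docs = len(documents)
    # document frequency of each unique indexing term
    df = Counter(t for tokens in tokenised for t in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

docs = ["fuzzy clustering of text documents",
        "hierarchical clustering of the web",
        "fuzzy control systems"]
vecs = tfidf_vectors(docs)
```

Only the non-zero entries are stored, which is what makes the 95%+ sparsity manageable in practice.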

Measures of Document Relationship
FCM applies the Euclidean distance, which is inappropriate for high-dimensional text clustering:
- non-occurrence of the same terms in both documents is handled in the same way as co-occurrence of terms
Cosine (dis)similarity measure:
- widely applied in Information Retrieval
- represents the cosine of the angle between two document vectors
- insensitive to different document lengths, since it is normalised by the length of the document vectors
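The cosine measure can be computed directly on sparse {term: weight} vectors; doubling every term weight (a "longer" copy of the same document) leaves the score unchanged, which illustrates the length-insensitivity noted above. The example vectors are invented for illustration.

```python
# Cosine similarity on sparse document vectors.
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse document vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

d1 = {"fuzzy": 1.0, "clustering": 2.0}
d2 = {"fuzzy": 2.0, "clustering": 4.0}   # same document, twice the length
d3 = {"retrieval": 1.0}                  # no terms in common with d1
```

Note that d1 and d3 share no terms, so their similarity is exactly zero; non-occurrence contributes nothing, unlike with Euclidean distance.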

H-FCM: Hyper-spherical Fuzzy C-Means
Applies the cosine measure S(x_j, v_i) to assess document relationships.
Modified objective function:
J_m = \sum_{i=1}^{c} \sum_{j=1}^{N} u_{ij}^m (1 - S(x_j, v_i))
Subject to an additional constraint (unit-length, "hyper-spherical" centroids):
\|v_i\| = 1, \quad i = 1, \dots, c
Fuzzy memberships (u) and cluster centroids (v):
u_{ij} = \left[ \sum_{l=1}^{c} \left( \frac{1 - S(x_j, v_i)}{1 - S(x_j, v_l)} \right)^{\frac{1}{m-1}} \right]^{-1}, \quad v_i = \frac{\sum_{j=1}^{N} u_{ij}^m x_j}{\left\| \sum_{j=1}^{N} u_{ij}^m x_j \right\|}
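One H-FCM update step can be sketched as below, assuming the standard FCM update form with the cosine dissimilarity 1 − S in place of squared Euclidean distance, plus re-normalisation of centroids to unit length. This is an illustrative reading of the slide, not the authors' implementation; the toy data is invented.

```python
# Sketch of one H-FCM-style membership/centroid update.
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def cos_sim(a, b):
    ua, ub = unit(a), unit(b)
    return sum(x * y for x, y in zip(ua, ub))

def hfcm_step(X, V, m=2.0, eps=1e-9):
    """One update: memberships from cosine dissimilarity, then
    fuzzy-weighted centroids re-normalised to unit length."""
    c = len(V)
    # Membership update: u_ij proportional to (1 - S(x_j, v_i))^(-1/(m-1)).
    U = [[0.0] * len(X) for _ in range(c)]
    for j, x in enumerate(X):
        w = [(1.0 - cos_sim(x, v) + eps) ** (-1.0 / (m - 1.0)) for v in V]
        s = sum(w)
        for i in range(c):
            U[i][j] = w[i] / s
    # Centroid update: weighted mean, then the hyper-spherical constraint.
    newV = []
    for i in range(c):
        acc = [0.0] * len(X[0])
        for j, x in enumerate(X):
            for k, xk in enumerate(x):
                acc[k] += (U[i][j] ** m) * xk
        newV.append(unit(acc))
    return U, newV

X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # toy document vectors
V = [[1.0, 0.0], [0.0, 1.0]]               # initial unit centroids
U, V_new = hfcm_step(X, V)
```

The small eps guards against division-by-zero when a document coincides with a centroid.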

How many clusters? Usually the final number of clusters is not known a priori:
- Run the algorithm for a range of c values
- Apply validity measures and determine which c leads to the best partition (cluster compactness, density, separation, etc.)
How compact and dense are clusters in a sparse, high-dimensional problem space?
- Only a very small percentage of documents within a cluster present high similarity to the respective centroid, so clusters are not compact
- However, there is always a clear separation between the intra- and inter-cluster similarity distributions
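The separation between intra- and inter-cluster similarity distributions can be probed with a simple check like the one below; the hard partition and toy data are illustrative, and the actual validity measures used in the talk are not specified in the transcript.

```python
# Mean pairwise cosine similarity within clusters vs across clusters,
# a simple proxy for the intra/inter separation discussed above.
import math
from itertools import combinations

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def intra_inter_similarity(X, labels):
    """Mean pairwise similarity within and across clusters."""
    intra, inter = [], []
    for i, j in combinations(range(len(X)), 2):
        (intra if labels[i] == labels[j] else inter).append(cos_sim(X[i], X[j]))
    mean = lambda s: sum(s) / len(s) if s else 0.0
    return mean(intra), mean(inter)

# Toy partition: two well-separated groups of document vectors.
X = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
labels = [0, 0, 1, 1]
intra, inter = intra_inter_similarity(X, labels)
```

Running this for each candidate c and comparing the two distributions is one way to pick the value where the gap is clearest.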

H²-FCM: Hierarchical Hyper-spherical Fuzzy C-Means
Key concepts:
- Apply the partitional algorithm (H-FCM) to obtain a sufficiently large number of clusters
- Exploit the granularity of the topics associated with each cluster to link cluster centroids hierarchically
- Form a topic hierarchy
Asymmetric similarity measure:
- Identifies parent-child relationships between cluster centroids
- A child should be less similar to its parent than the parent is to the child

The H²-FCM Algorithm
(Flowchart: clusters C1-C12 with centroids v_1-v_12 are linked into a hierarchy; e.g. S(v_1, v_5) ≥ t_PCS creates a parent-child link, while S(v_8, v_5) < t_PCS and S(v_8, v_1) < t_PCS make v_8 a new root.)
- Apply H-FCM with parameters (c, m); while some cluster has size < t_ND, reduce c (c = c − K) and re-run
- Compute the asymmetric similarity S(v_α, v_β) for every pair of centroids
- While the set of free centroids V_F ≠ ∅: select the free centroid with maximal similarity S(v_α, v_β); if the hierarchy V_H is empty, add it as a root; otherwise select the best parent, and add the centroid as a child if S ≥ t_PCS, or as a new root otherwise
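The linking loop in the flowchart can be sketched as follows. The asymmetric similarity measure itself is not reproduced in the transcript, so it is passed in as a function (a plain dot product in the usage example), and the centroid-selection order is simplified to first-come; both are assumptions for illustration.

```python
def link_centroids(centroids, sim, t_pcs):
    """Return {centroid_index: parent_index or None (root)} links."""
    parents = {}
    placed = []                          # centroids already in the hierarchy
    free = list(range(len(centroids)))   # centroids still to be linked
    while free:
        cur = free.pop(0)
        if not placed:
            parents[cur] = None          # first centroid starts the hierarchy
        else:
            # candidate parent: placed centroid most similar to `cur`
            best = max(placed, key=lambda p: sim(centroids[p], centroids[cur]))
            if sim(centroids[best], centroids[cur]) >= t_pcs:
                parents[cur] = best      # add as child
            else:
                parents[cur] = None      # similarity below t_PCS: new root
        placed.append(cur)
    return parents

centroids = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
dot = lambda a, b: sum(x * y for x, y in zip(a, b))  # stand-in similarity
links = link_centroids(centroids, dot, t_pcs=0.5)
```

With this data the first two centroids end up in one branch and the third becomes a second root, mirroring the flowchart's two outcomes.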

Scalability of the Algorithm
H²-FCM time complexity depends on H-FCM and on the centroid-linking heuristic:
- H-FCM computation time is O(Nc²k)
- The linking heuristic is at most O(c²k): computing the asymmetric similarity between every pair of cluster centroids is O(c²k), and generating the cluster hierarchy is O(c²)
Overall, H²-FCM time complexity is O(Nc²k): it scales well to large document sets!

Description of Experiments
Goal: evaluate H²-FCM performance — are sub-clusters of the same topic assigned to the same branch?
Evaluation measures: clustering Precision (P) and Recall (R).
The H²-FCM algorithm is run for a range of c values, with the number of hierarchy roots matched to the number of reference classes (t_PCS is set dynamically).

                          In reference class     Not in reference class
Assigned to cluster       true positives (tp)    false positives (fp)
Not assigned to cluster   false negatives (fn)   true negatives (tn)
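Precision and recall follow directly from the contingency counts in the table above; the counts in the example are made up for illustration.

```python
def precision_recall(tp, fp, fn):
    """Clustering precision and recall from contingency counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical counts for one cluster/class pair.
p, r = precision_recall(tp=8, fp=2, fn=2)
```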

Test Document Collections
Class labels:
- Reuters test collection: crude, interest, money-fx, ship, trade, acq, earn, trade
- Open Directory Project (ODP): game, lego, math, safety, sport
- INSPEC database: back-propagation, fuzzy control, pattern clustering

Collection  Size (N)  Terms (k)  Doc length (avg / stdev)  Doc sparsity (avg / stdev)
reuters1    –         –          –                         99.67 % / –
reuters     –         –          –                         99.39 % / 0.47 %
odp         –         –          –                         97.69 % / 0.50 %
inspec      –         –          –                         99.59 % / 0.14 %

Clustering Results: H²-FCM Precision and Recall
(Figure: precision and recall plots for the reuters1, reuters2, odp and inspec collections.)

Topic Hierarchy
Each centroid vector consists of a set of weighted terms, and the terms describe the topics associated with the document cluster. The centroid hierarchy therefore produces a topic hierarchy:
- Useful for efficient access to individual documents
- Provides context to users in exploratory information access
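Extracting a topic label from a centroid, as described above, amounts to taking its highest-weighted terms; the centroid weights in the example are invented for illustration.

```python
def topic_label(centroid, n_terms=3):
    """Top n weighted terms of a centroid vector as its topic label."""
    ranked = sorted(centroid.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:n_terms]]

# Invented centroid weights, purely for illustration.
centroid = {"fuzzy": 0.9, "clustering": 0.8, "algorithm": 0.5, "text": 0.1}
label = topic_label(centroid)
```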

Topic Hierarchy Example

Concluding Remarks
H²-FCM clustering algorithm:
- Partitional clustering (H-FCM)
- A linking heuristic organizes the centroids hierarchically based on asymmetric similarity
- Scales linearly with the number of documents
- Exhibits good clustering performance
- A topic hierarchy can be extracted from the centroid hierarchy

Clustering in Sparse High-dimensional Spaces
(Figure: intra- and inter-cluster similarity CDFs for the reuters1 and reuters2 collections, for a range of c values.)

Clustering in Sparse High-dimensional Spaces (contd.)
(Figure: intra- and inter-cluster similarity CDFs for the odp and inspec collections, for a range of c values.)

FCM: Fuzzy C-Means
Iterative optimization of an objective function:
J_m = \sum_{i=1}^{c} \sum_{j=1}^{N} u_{ij}^m \|x_j - v_i\|^2
Subject to constraints:
\sum_{i=1}^{c} u_{ij} = 1, \quad u_{ij} \in [0, 1]
Fuzzy memberships (u) and cluster centroids (v):
u_{ij} = \left[ \sum_{l=1}^{c} \left( \frac{\|x_j - v_i\|}{\|x_j - v_l\|} \right)^{\frac{2}{m-1}} \right]^{-1}, \quad v_i = \frac{\sum_{j=1}^{N} u_{ij}^m x_j}{\sum_{j=1}^{N} u_{ij}^m}