
Design and Evaluation of Clustering Approaches for Large Document Collections: The "BIC-Means" Method
Nikolaos Hourdakis
Technical University of Crete, Department of Electronic and Computer Engineering

Motivation
- Large document collections arise in many applications: digital libraries, the Web.
- There is increasing interest in methods for more effective management of information: abstraction, browsing, classification, retrieval.
- Clustering is a means of achieving better organization of information: the data space is partitioned into groups of entities with similar content.

Outline
- Background: state-of-the-art clustering approaches; partitional and hierarchical methods.
- K-Means and its variants: Incremental K-Means, Bisecting Incremental K-Means.
- Proposed method: BIC-Means, Bisecting Incremental K-Means using BIC as a stopping criterion.
- Evaluation of clustering methods.
- Application in Information Retrieval.

Hierarchical Clustering (1/3)
- Produces a nested sequence of clusters.
- Two approaches:
  A. Agglomerative: starting from singleton clusters, recursively merges the two most similar clusters until only one cluster remains.
  B. Divisive (e.g., Bisecting K-Means): starting with all documents in a single root cluster, iteratively splits each cluster into K sub-clusters.

Hierarchical Clustering – Example (2/3) [dendrogram figure]

Hierarchical Clustering (3/3)
- Organization and browsing of large document collections call for hierarchical clustering, but:
- Agglomerative clustering has quadratic time complexity, prohibitive for large data sets.

Partitional Clustering
- We focus on partitional clustering: K-Means, Incremental K-Means, Bisecting K-Means.
- At least as good as hierarchical methods.
- Low complexity, O(KN).
- Faster than hierarchical methods for large document collections.

K-Means
1. Randomly select K centroids.
2. Repeat ITER times or until the centroids do not change:
   a) Assign each instance to the cluster whose centroid is closest to it.
   b) Re-compute the cluster centroids.
- Generates a flat partition of K clusters (K must be known in advance).
- A centroid is the mean of a group of instances.
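To make the loop above concrete, here is a minimal NumPy sketch of plain K-Means (the variable names and convergence test are illustrative, not the thesis implementation):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-Means over an (N, M) array of document vectors X."""
    rng = np.random.default_rng(seed)
    # 1. Randomly select K instances as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # 2a. Assign each instance to the cluster whose centroid is closest.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2b. Re-compute each centroid as the mean of its assigned instances.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids did not change
        centroids = new_centroids
    return labels, centroids
```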

K-Means Example [figure: data points and cluster centroids]

K-Means demo (1/7 – 7/7) [sequence of figures stepping through the algorithm's iterations]

Comments
- No proof of convergence.
- Converges to a local minimum of the distortion measure (the average squared distance of the points from their nearest centroids): (1/N) Σ_c Σ_{d ∈ C_c} ||d − μ_c||².
- Too slow for practical databases.
- K-Means is fully deterministic once the initial centroids have been selected.
- A bad choice of initial centroids leads to poor clusters.
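The distortion measure above, written out as a short helper for clarity (same quantity, mean squared distance of each point to its assigned centroid):

```python
import numpy as np

def distortion(X, labels, centroids):
    """Mean squared distance of each point to its assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum() / len(X))
```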

Incremental K-Means (IK)
- In K-Means, new centroids are computed after each iteration (after all documents have been examined).
- In Incremental K-Means, each cluster centroid is updated immediately after a document is assigned to a cluster, as a running mean: μ ← (n·μ + d)/(n + 1), where n is the number of documents already in the cluster (see the sketch below).
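A minimal sketch of the incremental assignment-and-update step (the running-mean form of the update is the standard one; the function name is my own):

```python
import numpy as np

def assign_incremental(doc, centroids, counts):
    """Assign one document vector to its nearest centroid and immediately
    update that centroid as a running mean of its documents."""
    j = int(np.linalg.norm(centroids - doc, axis=1).argmin())
    counts[j] += 1
    # running mean over the n documents assigned so far: mu += (doc - mu) / n
    centroids[j] += (doc - centroids[j]) / counts[j]
    return j
```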

Comments
- Not as sensitive as K-Means to the selection of initial centroids.
- Faster convergence; much faster in general.

Bisecting IK-Means (1/4)
- A hierarchical clustering solution is produced by recursively applying Incremental K-Means to a document collection.
- The documents are initially partitioned into two clusters.
- The algorithm iteratively selects and bisects each leaf cluster until singleton clusters are reached.

Bisecting IK-Means (2/4)
- Input: (d_1, d_2, …, d_N)
- Output: a hierarchy of clusters
1. Start with all documents in one cluster C.
2. Apply Incremental K-Means to split C into K clusters (K = 2): C_1, C_2, …, C_K become leaf clusters.
3. Iteratively split each leaf cluster C_i until K clusters (if K is given) or singleton clusters are produced at the leaves.
A minimal sketch of this recursion is given below.
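A compact sketch of the exhaustive bisecting loop, assuming a 2-way split routine such as Incremental K-Means with K = 2; which leaf to bisect next is a policy choice, and the largest leaf is used here for illustration:

```python
def bisecting_kmeans(num_docs, split):
    """Exhaustive bisecting clustering.

    num_docs : number of documents in the collection.
    split    : callable(list_of_indices) -> (left, right), e.g. a 2-way
               (incremental) K-Means over those documents.
    Returns the leaf clusters as lists of document indices.
    """
    leaves = [list(range(num_docs))]       # start: all documents in one cluster
    while True:
        # pick the largest leaf that can still be split
        candidates = [i for i, c in enumerate(leaves) if len(c) > 1]
        if not candidates:
            break                          # only singleton leaves remain
        i = max(candidates, key=lambda i: len(leaves[i]))
        target = leaves.pop(i)
        left, right = split(target)        # bisect the selected leaf
        leaves.extend([list(left), list(right)])
    return leaves
```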

Bisecting IK-Means (3/4)
- The algorithm is exhaustive, terminating at singleton clusters (unless K is known).
- Terminating at singleton clusters is time consuming, and singleton clusters are meaningless.
- Intermediate clusters are more likely to correspond to real classes.
- There is no criterion for stopping bisections before singleton clusters are reached.

Bayesian Information Criterion (BIC) (1/3)
- To prevent over-splitting, we define a strategy to stop the bisecting algorithm when meaningful clusters are reached.
- Bayesian Information Criterion (BIC), also known as the Schwarz Criterion [Schwarz 1978].
- X-Means [Pelleg and Moore, 2000] used BIC for estimating the best K within a given range of values.

Bayesian Information Criterion (BIC) (2/3)
- In this work, we use BIC as the splitting criterion of a cluster, i.e., to decide whether a cluster should be split or not.
- It measures the improvement in cluster structure between a cluster and its two children clusters.
- We compute the BIC score of a cluster and of its two children clusters.

Bayesian Information Criterion (BIC) (3/3)
- If the BIC score of the produced children clusters is less than the BIC score of their parent cluster, we do not accept the split; we keep the parent cluster as it is.
- Otherwise, we accept the split and the algorithm proceeds in the same way at lower levels.

Example
- Parent cluster: BIC(K=1) = 1980; two resulting clusters: BIC(K=2) = 2245.
- The BIC score of the parent cluster is less than the BIC score of the generated cluster structure, so we accept the bisection.

Computing BIC
- The BIC score of a data collection is defined as (Kass and Wasserman, 1995):
  BIC(M_j) = l_j(D) − (p_j / 2) · log R
  where l_j(D) is the log-likelihood of the data set D according to model M_j, p_j = M·K + 1 is the number of independent parameters, and R is the number of points.

Log-likelihood
- Given a cluster of points modelled by a Gaussian distribution N(μ, σ²), the log-likelihood is the (log) probability that the data points were generated by this distribution.
- The log-likelihood of the data can be considered a measure of the cohesiveness of a cluster: it estimates how close to the centroid the points of the cluster are.

Parameters p_j
- Sometimes, due to the complexity of the data (many dimensions or many data points), the data may follow other distributions.
- We therefore penalize the log-likelihood by a function of the number of independent parameters (p_j/2 · log R).

Notation
- μ_j: coordinates of the j-th centroid
- μ(i): centroid nearest to the i-th data point
- D: input set of data points
- D_j: set of data points that have μ_j as their closest centroid
- R = |D| and R_j = |D_j|
- M: the number of dimensions
- M_j: family of alternative models (different models correspond to different clustering solutions)
- BIC scores the models and chooses the best among the K models.

Computing BIC (1/3)
- To compute the log-likelihood of the data we need the parameters of the Gaussian that models the data.
- Maximum likelihood estimate (MLE) of the variance (under the spherical Gaussian assumption):
  σ̂² = (1 / (R − K)) · Σ_i ||x_i − μ(i)||²

Computing BIC (2/3)
- Probability of a point x_i: a Gaussian with the estimated variance σ̂² and, as mean, the cluster centroid nearest to x_i:
  P(x_i) = (R_(i)/R) · (2π σ̂²)^(−M/2) · exp(−||x_i − μ(i)||² / (2σ̂²))
- Log-likelihood of the data: l(D) = Σ_i log P(x_i)

Computing BIC (3/3)
- Focusing on the set D_n of points which belong to centroid n, the log-likelihood takes the closed form (following Pelleg and Moore, 2000):
  l(D_n) = −(R_n/2)·log(2π) − (R_n·M/2)·log(σ̂²) − (R_n − K)/2 + R_n·log(R_n/R)
A sketch of the full BIC computation is given below.
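Putting the three "Computing BIC" slides together, a sketch of the BIC score of a clustering under the identical spherical Gaussian assumption (this follows the X-Means formulation of Pelleg and Moore, 2000, with p_j = M·K + 1 as on the earlier slide; an illustration, not the thesis code):

```python
import numpy as np

def bic_score(X, labels, centroids):
    """BIC of a clustering of X (R points, M dimensions) into K clusters."""
    R, M = X.shape
    K = len(centroids)
    # MLE of the shared spherical variance: squared distances to assigned centroids
    sq_dist = float(((X - centroids[labels]) ** 2).sum())
    sigma2 = max(sq_dist / max(R - K, 1), 1e-12)
    log_likelihood = 0.0
    for j in range(K):
        Rn = int(np.sum(labels == j))
        if Rn == 0:
            continue
        log_likelihood += (
            -Rn / 2.0 * np.log(2.0 * np.pi)
            - Rn * M / 2.0 * np.log(sigma2)
            - (Rn - K) / 2.0
            + Rn * np.log(Rn / R)
        )
    p_j = M * K + 1                        # independent parameters (slide's choice)
    return log_likelihood - p_j / 2.0 * np.log(R)
```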

Proposed Method: BIC-Means (1/2)
- BIC-Means: Bisecting InCremental K-Means clustering, incorporating BIC as the stopping criterion.
- BIC performs a splitting test at each leaf cluster to prevent it from over-splitting.
- BIC-Means does not terminate at singleton clusters; it terminates when there are no separable clusters left according to BIC.

Proposed Method: BIC-Means (2/2)
- Combines the strengths of partitional and hierarchical clustering methods:
  - Hierarchical clustering structure
  - Low complexity, O(N·K)
  - Good clustering quality
  - Meaningful clusters at the leaves

BIC-Means Algorithm
- Input: S = (d_1, d_2, …, d_n), all data in one cluster.
- Output: a hierarchy of clusters.
1. Place all documents in one cluster C.
2. Apply Incremental K-Means to split C into C_1, C_2.
3. Compute BIC for C and for (C_1, C_2):
   I. If BIC(C) < BIC(C_1, C_2), put C_1 and C_2 in the queue.
   II. Otherwise do not split C.
4. Repeat steps 2 and 3 until there are no separable leaf clusters in the queue according to BIC.
A compact sketch of this loop follows.
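A compact sketch of the BIC-Means loop, reusing a 2-way split routine and the bic_score function sketched earlier (the names and queue handling are illustrative, not the thesis code):

```python
from collections import deque
import numpy as np

def bic_means(X, split2, bic_score):
    """BIC-Means: keep bisecting a leaf only while the split improves BIC.

    X         : (N, M) array of document vectors.
    split2    : callable(indices) -> (left, right), a 2-way incremental K-Means.
    bic_score : callable(points, labels, centroids) -> float (see earlier sketch).
    Returns the accepted leaf clusters as arrays of document indices.
    """
    queue = deque([np.arange(len(X))])
    leaves = []
    while queue:
        c = queue.popleft()
        if len(c) < 2:
            leaves.append(c)               # too small to split
            continue
        left, right = (np.asarray(s) for s in split2(c))
        parent_bic = bic_score(X[c], np.zeros(len(c), dtype=int),
                               X[c].mean(axis=0, keepdims=True))
        child_points = X[np.concatenate([left, right])]
        child_labels = np.concatenate([np.zeros(len(left), dtype=int),
                                       np.ones(len(right), dtype=int)])
        child_centroids = np.vstack([X[left].mean(axis=0), X[right].mean(axis=0)])
        if parent_bic < bic_score(child_points, child_labels, child_centroids):
            queue.extend([left, right])    # accept the split, keep bisecting
        else:
            leaves.append(c)               # reject the split, keep the parent leaf
    return leaves
```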

Evaluation
- Evaluation of document clustering algorithms on two data sets: OHSUMED (233,445 Medline documents) and Reuters (21,578 documents).
- Application of clustering to information retrieval: evaluation of several cluster-based retrieval strategies, and comparison with retrieval by exhaustive search on OHSUMED.

F-Measure
- Measures how well the clusters approximate the data classes.
- The F-measure for cluster C and class T is defined as F_TC = 2·P·R / (P + R), where P = |C ∩ T| / |C| (precision) and R = |C ∩ T| / |T| (recall).
- The F-measure of a class T is the maximum value it achieves over all clusters C: F_T = max_C F_TC.
- The F-measure of the clustering solution is the mean of F_T over all classes.
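A small sketch of this clustering F-measure (the standard per-class/best-cluster form; the per-class scores are averaged uniformly here, matching the slide's "mean over all classes"):

```python
import numpy as np

def clustering_f_measure(class_labels, cluster_labels):
    """Mean over classes of the best F-score any cluster achieves for that class."""
    class_labels = np.asarray(class_labels)
    cluster_labels = np.asarray(cluster_labels)
    f_per_class = []
    for t in np.unique(class_labels):
        in_class = class_labels == t
        best = 0.0
        for c in np.unique(cluster_labels):
            in_cluster = cluster_labels == c
            overlap = np.sum(in_class & in_cluster)
            if overlap == 0:
                continue
            precision = overlap / np.sum(in_cluster)
            recall = overlap / np.sum(in_class)
            best = max(best, 2 * precision * recall / (precision + recall))
        f_per_class.append(best)
    return float(np.mean(f_per_class))
```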

Comparison of Clustering Algorithms [results figure]

Evaluation of Incremental K-Means [results figure]

MeSH Representation of Documents
- We use MeSH terms for describing medical documents (OHSUMED).
- Each document is represented by a vector of MeSH terms (multi-word terms instead of single-word terms), as in the illustration below.
- This leads to a more compact representation (each vector contains fewer terms, about 20).
- A sequential approach is used to extract MeSH terms from OHSUMED documents.
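As a purely illustrative example (the specific headings and weights are hypothetical, and the weighting scheme is not specified on the slide), a MeSH-based document vector is short and uses multi-word headings as features:

```python
# Hypothetical MeSH-term vector for one OHSUMED abstract:
# multi-word MeSH headings as features, term weights as values.
doc_vector = {
    "Myocardial Infarction": 0.42,
    "Aspirin": 0.31,
    "Thrombolytic Therapy": 0.27,
    "Hypertension": 0.15,
}
```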

Bisecting Incremental K-Means – Clustering Quality [results figure]

Speed of Clustering [results figure]

Evaluation of BIC-Means [results figure]

Speed of Clustering [results figure]

Comments
- BIC-Means is much faster than Bisecting Incremental K-Means: it is not an exhaustive algorithm.
- It achieves approximately the same F-Measure as the exhaustive bisecting approach.
- It is better suited for clustering large document collections.

Application of Clustering to Information Retrieval
- We demonstrate that it is possible to reduce the size of the search (and therefore the retrieval response time) on large data sets (OHSUMED).
- BIC-Means is applied to the entire OHSUMED collection; each document is represented by MeSH terms.
- We chose 61 queries from the original OHSUMED query set developed by Hersh et al.; the OHSUMED documents carry relevance judgements with respect to these queries.

Query – Document Similarity
- Similarity is defined as the cosine of the angle θ between the document and query vectors.
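For reference, a one-function sketch of the cosine similarity between a document vector and a query vector (dense NumPy vectors assumed for simplicity):

```python
import numpy as np

def cosine_similarity(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    denom = np.linalg.norm(d) * np.linalg.norm(q)
    return float(np.dot(d, q) / denom) if denom else 0.0
```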

Information Retrieval Methods
- Method 1: search the M clusters closest to the query; compute the similarity between each cluster centroid and the query.
- Method 2: search the M clusters closest to the query; each cluster is represented by the 20 most frequent terms of its centroid.
- Method 3: search the M clusters whose centroid contains the terms of the query.
A sketch of the Method 3 cluster-selection step follows.
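A sketch of the Method 3 cluster-selection step, assuming each cluster centroid is available as a mapping from MeSH terms to weights (this representation and the names are assumptions, not the thesis code):

```python
def select_clusters_method3(query_terms, cluster_centroids):
    """Return the ids of clusters whose centroid contains all query terms.

    query_terms       : iterable of MeSH terms in the query.
    cluster_centroids : dict mapping cluster_id -> {term: weight}.
    """
    query_terms = set(query_terms)
    return [
        cid for cid, centroid in cluster_centroids.items()
        if query_terms.issubset(centroid.keys())
    ]
```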

Method 1: search the M clusters closest to the query (similarity between cluster centroid and query). [results figure]

Method 2: search the M clusters closest to the query; each cluster is represented by the 20 most frequent terms of its centroid. [results figure]

Method 3: search the M clusters whose centroid contains the terms of the query. [results figure]

Size of Search [results figure]

Comments
- Best cluster-based retrieval strategy: retrieve only the clusters that contain all the MeSH query terms in their centroid vector (Method 3), then search the documents contained in the retrieved clusters and order them by similarity to the query.
- Advantages: searches only about 30% of all OHSUMED documents, as opposed to exhaustive searching of all 233,445 documents, and is almost as effective as retrieval by exhaustive search (searching without clustering).

Conclusions (1/2)
- We implemented and evaluated various partitional clustering techniques: Incremental K-Means and Bisecting Incremental K-Means (the exhaustive approach).
- BIC-Means incorporates BIC as a stopping criterion to prevent clustering from over-splitting, and produces meaningful clusters at the leaves.

Conclusions (2/2)
- BIC-Means is much faster than Bisecting Incremental K-Means, as effective as the exhaustive bisecting approach, and better suited for clustering large document collections.
- Cluster-based retrieval strategies reduce the size of the search; the best proposed retrieval method is as effective as exhaustive searching (searching without clustering).

Future Work
- Evaluation using more, or application-specific, data sets.
- Examine additional cluster-based retrieval strategies (top-down, bottom-up).
- Clustering and browsing on Medline.
- Clustering dynamic document collections.
- Semantic similarity methods in document clustering.

References
- Nikos Hourdakis, Michalis Argyriou, Euripides G.M. Petrakis, Evangelos Milios, "Hierarchical Clustering in Medical Document Collections: the BIC-Means Method", Journal of Digital Information Management (JDIM), Vol. 8, No. 2, April 2010.
- Dan Pelleg, Andrew Moore, "X-means: Extending K-means with Efficient Estimation of the Number of Clusters", Proc. of the 17th International Conference on Machine Learning (ICML), 2000.

Thank you! Questions?