1 CLUTO: A Clustering Toolkit (Release 2.0)
George Karypis (University of Minnesota)
Presented by Shin Nan-young, 012ITI, Internet Technology Major (3rd semester)

2 Contents
Ⅰ. Introduction
1. What is clustering? 2. What is CLUTO? 3. Three different classes of CLUTO algorithms
Ⅱ. Using CLUTO
1. Via the stand-alone programs 2. The vcluster & scluster programs 3. Input file formats 4. Output file formats 5. Examples of vcluster & scluster
Ⅲ. The information produced by CLUTO's programs
1. Internal cluster quality statistics 2. External cluster quality statistics 3. Looking at each cluster's features 4. Looking at the hierarchical agglomerative tree 5. Looking at the visualizations
Ⅳ. Which clustering algorithm should I use?
Ⅴ. CLUTO's library interface
1. Using CLUTO's library 2. Matrix and graph data structures 3. Clustering parameters 4. Object modeling parameters 5. Debugging parameter 6. Clustering routines 7. Graph creation routines 8. Cluster statistics routines

3 What is Clustering?
Dividing data into meaningful or useful groups, so that intra-cluster similarity is maximized and inter-cluster similarity is minimized
Helps explain the characteristics of the data distribution
Serves as a mining and analysis technique for many kinds of data
Example applications:
- Characterizing customer groups by purchasing pattern
- Categorizing documents on the WWW

4 What is CLUTO? A software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the resulting clusters.
Characteristics
- Seeks to maximize or minimize a particular clustering criterion function defined over the clustering solution
- Identifies the features that best describe and discriminate each cluster; these feature sets make it easier to understand the set of objects assigned to each cluster and provide a concise summary of a cluster's contents
- Visualization capabilities: can be used to see the relationships between clusters, objects, and features
- Optimized for operating on very large datasets (in both the number of objects and the number of dimensions): can quickly cluster datasets with a very large number of objects and dimensions; directly exploits sparsity and requires memory that is roughly linear in the input size
- Provides tools for analyzing the discovered clusters, to understand the relations among the objects assigned to each cluster and the relations between the different clusters, and tools for visualizing the discovered clustering solutions
Distribution
- Stand-alone programs (vcluster and scluster)
- A library through which an application program can access CLUTO's algorithms

5 Three Different Classes of CLUTO Algorithms
CLUTO provides three different classes of clustering algorithms that operate either directly in the objects' feature space or in the objects' similarity space.
Partitional criterion-driven clustering
- Computes a k-way clustering of a set of objects either directly or via a sequence of repeated bisections
- k-way clustering: initially a set of k objects is selected; for each remaining object its similarity to these is computed and it is assigned to the closest cluster; this forms the initial k-way clustering, which is then repeatedly refined so that it optimizes the chosen criterion function
- Used to optimize a criterion function
Agglomerative clustering
- Finds the clusters by initially assigning each object to its own cluster and then repeatedly merging pairs of clusters until a stopping criterion is met
Graph-partitioning-based clustering
- Well suited for finding clusters that form contiguous regions spanning different dimensions of the underlying feature space
- Uses high-quality and efficient multilevel graph-partitioning algorithms

6 Using CLUTO via the Stand-Alone Programs
CLUTO provides access to its various clustering and analysis algorithms via the vcluster and scluster stand-alone programs.
◐ Key difference between vcluster and scluster ◑
- vcluster takes as input the actual multi-dimensional representation of the objects that need to be clustered (the "v" comes from vector)
- scluster takes as input the similarity matrix or graph between these objects (the "s" comes from similarity)
♧ Besides this difference, both programs provide similar functionality.

7 The vcluster and scluster Programs
Both programs cluster a collection of objects into a predetermined number of clusters k.
Command line:
vcluster [optional parameters] MatrixFile NClusters
scluster [optional parameters] GraphFile NClusters
- MatrixFile: the file that stores the n objects to be clustered (as points in a multi-dimensional space)
- GraphFile: the file that stores the adjacency matrix of the similarity graph between the n objects to be clustered
- NClusters: the number of desired clusters
Optional parameters fall into three groups:
① control various aspects of the clustering algorithm
② control the type of analysis and reporting performed on the computed clusters
③ control the visualization of the clusters
Optional parameters are specified as -paramname or -paramname=value.
▶ The computed clustering solution is stored in a file named MatrixFile.clustering.NClusters (or GraphFile.clustering.NClusters); see the example below.
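For example, the 10-way clustering of the sports.mat matrix shown in Figure 1 could be produced with an invocation of the following form (illustrative only; reporting and visualization options are omitted here):
vcluster sports.mat 10
The resulting clustering solution would then be written to the file sports.mat.clustering.10.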

8 Input File Formats (1) – Matrix File
Each row of this matrix represents a single object, and the various columns correspond to the dimensions of the objects. (Sparse format: the header line has three numbers; dense format: the header line has two numbers.)
Sparse matrix format
- The first line contains three numbers: the number of rows, the number of columns, and the number of non-zero entries
- A matrix A with n rows and m columns is stored in a plain-text file containing n+1 lines; the first line gives the size of the matrix, while the remaining n lines give the information for each row of A as a sequence of column-number & value pairs
Dense matrix format
- The first line contains exactly two numbers: the first integer is the number of rows and the second integer is the number of columns
- The remaining n lines store the values of the m columns; each line contains m space-separated floating-point values
A small example of both formats is shown below.
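As an illustration, consider a hypothetical 3x4 matrix with 5 non-zero entries. This sketch assumes the sparse format numbers columns starting from 1; check the CLUTO manual for the exact indexing convention.
Sparse format (header: rows, columns, non-zeros; each remaining line lists column-number & value pairs):
3 4 5
1 0.5 3 2.0
2 1.5
1 1.0 4 0.8
Dense format (header: rows, columns; each remaining line lists all 4 values of the row):
3 4
0.5 0.0 2.0 0.0
0.0 1.5 0.0 0.0
1.0 0.0 0.0 0.8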

9 Input File Formats (2) – Graph File
The adjacency matrix of the graph that specifies the similarity between the objects to be clustered. (Sparse format: the header line has two numbers; dense format: the header line has one number.)
Sparse graph format
- The adjacency matrix A of a sparse graph with n vertices is stored in a plain-text file containing n+1 lines
- The first line of the file contains two integers: the number of vertices in the graph (n) and the number of edges in the graph
- The remaining n lines describe the non-zero structure of A; each pair contains the number of the adjacent vertex followed by the similarity of the corresponding edge
- Vertex numbers are integers and similarities are floating-point numbers
Dense graph format
- The first line of the file contains exactly one number: the number of vertices n of the graph
- The remaining n lines store the values of the n columns of the adjacency matrix for each vertex
- Each line contains exactly n space-separated floating-point values, such that the ith value corresponds to the similarity to the ith vertex of the graph (see the small example below)
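As a small illustration, a hypothetical similarity graph over 3 objects written in the dense graph format would look as follows (the choice of 1.0 self-similarities on the diagonal is only illustrative); the sparse format would store the same information as adjacent-vertex & similarity pairs:
3
1.0 0.8 0.0
0.8 1.0 0.6
0.0 0.6 1.0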

10 Input File Formats (3) – others
Row label file
- When the -rlabelfile parameter is used, CLUTO's stand-alone programs read a file that stores the label for each row of the matrix
- The ith line of this file contains the label of the ith row of the matrix
Column label file
- When the -clabelfile parameter is used, they read a file that stores the label for each column of the matrix
- The ith line of this file contains the label of the ith column of the matrix
Row class-label file
- When the -rclassfile parameter is used, they read a file that stores the class label for each row of the matrix
- The ith line of this file contains the class label of the ith row of the matrix
- To indicate that a set of objects belongs to the same class, their corresponding rows in the class-label file must contain identical strings (see the example below)
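For instance, a hypothetical class-label file for a five-row matrix whose first, second, and fourth objects belong to class "sports" and whose third and fifth objects belong to class "finance" would contain:
sports
sports
finance
sports
finance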

11 Output File Formats
Clustering solution file
- For a matrix (or graph) with n rows, this file consists of n lines with a single number per line
- The ith line of the file contains the cluster number that the ith object/row/vertex belongs to
- Cluster numbers run from zero to the number of clusters minus one
- If -zscores is specified, each line of this file contains two additional numbers right after the cluster number: the first is the object's internal z-score and the second is its external z-score
Tree file
- The tree produced by performing a hierarchical agglomerative clustering on top of the k-way clustering solution is stored in a file in the form of a parent array
- If k is the number of clusters, the tree file contains 2k-1 lines, such that the ith line contains the parent of the ith node of the tree
- For the root node, which is stored in the last line of the file, the parent is set to -1
A small example of both files is shown below.
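As an illustration (hypothetical values), a 2-way clustering of 5 objects might produce the following clustering-solution file and, when tree construction is enabled, the following tree file. The tree file has 2k-1 = 3 lines: leaves 0 and 1 have the root (node 2) as their parent, and the root's parent is -1.
Clustering solution file (one cluster number per object):
0
1
0
0
1
Tree file (parent array):
2
2
-1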

12 Example of Using vcluster and scluster
Figure 1: output of vcluster for the matrix sports.mat, 10-way clustering. Figure 2: output of scluster for the graph la1.graph, 10-way clustering.
- The program initially prints information about the matrix or graph
- Next, it prints information about the various options and the number of desired clusters
- Once it computes the clustering solution, it displays information about the quality of the overall clustering solution and the quality of each cluster
- Finally, it reports the time taken by the various phases of the program

13 Internal Cluster Quality Statistics
The quality of each cluster is measured by the criterion function being used and by the similarity between the objects in each cluster. This is reported in the "Solution" section of Figures 1 and 2.
- First, the program reports the overall value of the criterion function for the clustering solution and how many of the original objects it was able to cluster, e.g., for the 10-way clustering: [I2=2.29e+03] [8580 of 8580]
- It then displays a table in which each row contains various statistics for one cluster
▶ The meaning of each column ◀
- cid: cluster number
- Size: the number of objects that belong to the cluster
- ISim: the average similarity between the objects of the cluster (internal similarity)
- ISdev: the standard deviation of the internal similarities
- ESim: the average similarity of the objects of the cluster to the rest of the objects (external similarity)
- ESdev: the standard deviation of the external similarities
In general, both vcluster and scluster try to cluster all objects. However, when some of the objects (vertices) do not share any dimensions (edges) with the rest of the objects, or when the various edge- and vertex-pruning parameters are used, both programs may end up clustering fewer than the total number of input objects.

14 External Cluster Quality Statistics
When class information for the objects is supplied via the -rclassfile option, the programs take into account the classes the various objects belong to and compute various statistics that measure the quality of the clusters with respect to those classes (external quality measures).
- In addition to the overall entropy and purity values, two additional sets of statistics are printed for each cluster:
  - first set: the entropy and purity of each cluster
  - second set: information about how the different classes are distributed in each cluster; each column shows the number of documents of that class that are in each cluster
- Small entropy and large purity values indicate good clustering solutions
- Looking at the class-distribution table, one can easily determine the quality of the different clusters
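For reference, the standard definitions of these measures (written here in their usual form; the normalization used in CLUTO's output may differ slightly): for a cluster $S_r$ of size $n_r$ in which $n_r^i$ objects belong to class $i$, out of $q$ classes,
$$E(S_r) = -\frac{1}{\log q}\sum_{i=1}^{q}\frac{n_r^i}{n_r}\log\frac{n_r^i}{n_r}, \qquad P(S_r) = \frac{1}{n_r}\max_i\, n_r^i,$$
and the overall entropy and purity are the cluster-size-weighted averages $\sum_r \frac{n_r}{n}E(S_r)$ and $\sum_r \frac{n_r}{n}P(S_r)$.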

15 Looking at Each Cluster's Features
CLUTO can analyze each of the clusters and determine the set of features that best describe and discriminate each cluster (via the -showfeatures option).
- For each cluster it displays three lines of information:
  - first line: basic statistics for the cluster
  - second line: the five most descriptive features
  - third line: the five most discriminating features
- Descriptive features: reported with the percentage of the within-cluster similarity that each feature accounts for
- Discriminating features: reported with the percentage of the dissimilarity between the cluster and the rest of the objects that each feature accounts for
→ The discriminating percentages are typically smaller than the descriptive ones, because some of the descriptive features of a cluster may also be present in a small fraction of the objects that do not belong to that cluster
※ -clabelfile option: without it, each feature is identified by its column number; with it, the features are shown by their labels

16 Looking at the Hierarchical Agglomerative Tree (1)
The discovered clusters form the leaf nodes of this tree. In constructing the tree, the algorithm repeatedly merges a particular pair of clusters; the pair to be merged is selected so that the resulting clustering solution at that point optimizes the specified clustering criterion function.
Via the -showtree option
- The tree is displayed in a rotated fashion: the root of the tree is in the first column, and the tree grows from left to right (the root is the highest-numbered node)
- The leaves of this tree are numbered from 0 to NClusters-1, and the internal nodes are numbered from NClusters to 2*NClusters-2
- The numbering of the internal nodes reflects the agglomeration order: nodes obtained by merging a pair of clusters at an earlier stage of the agglomerative process have lower numbers than nodes obtained at a later stage
※ The display also shows how the objects of the various classes are distributed in each cluster; if -rclassfile is not specified, this information is omitted

17 Looking at the Hierarchical Agglomerative Tree (2)
With the -labeltree option (used in addition to -showtree), the tree nodes are labeled with statistics regarding their quality and with a set of descriptive features.
For each tree node it prints a number of statistics:
- Size: the number of objects in the cluster
- ISim: the average similarity between the objects of the cluster
- XSim: the average similarity between the objects of each pair of clusters that are children of the same node of the tree
- Gain: the change in the value of the particular clustering criterion function as a result of the merge
※ It also prints the set of features that best describe each cluster

18 Looking at the Visualizations (1)
CLUTO can produce a number of graphical visualizations showing the relations between the different objects, features, and clusters (via the various -plotXXX options).
Example 1: produced when -plotmatrix is specified for a sparse matrix
- (a): the rows of the input matrix are re-ordered so that the rows assigned to each of the ten clusters become consecutive; each non-zero positive element of the matrix is displayed by a different shade of red
- (b): the rows and columns are also re-ordered according to a hierarchical clustering
- (c): the 10-way clustering obtained by scluster

19 Looking at the Visualizations (2)
Example 2: produced when -plotmatrix is specified for a dense matrix
- Three different visualizations obtained by executing different command options for a particular micro-array gene-expression dataset
- Each row has a label
- The plots contain both red and green boxes, representing positive and negative values respectively

20 Looking at the Visualizations (3)
Example 3: produced when -plotclusters is specified for a sparse matrix
- Particularly useful for displaying very large datasets, as the number of rows in the plot is only equal to the number of clusters

21 Looking at the Visualizations (4)
Example 4: produced when -plottree is specified
- Shows the entire hierarchical tree
- The leaves of the tree are labeled with the particular row-id (or row label, if available)

22 Which Clustering Algorithm should I use?(1)
It depends highly on the nature of your dataset and on what constitutes a meaningful cluster in your application; what follows are only general usage guidelines.
Cluster types
There are two different types of clusters; what differentiates them is the relationship between the cluster's objects and the dimensions of their feature space.
1. First type
- Contains objects that exhibit a strong pattern of conservation along a subset of their dimensions: there is a subset of the original dimensions on which a large fraction of the objects agree
- This subset of dimensions is often referred to as a subspace; it is exactly this variation in subspace size and density that complicates the problem of discovering this type of cluster
- Such clusters tend to contain objects in which the similarity between all pairs of objects is high
2. Second type
- Again there is a subspace associated with the cluster, but here there are sub-clusters that share only a very small number of the subspace's dimensions, while strong paths within the cluster connect them
- That is, many objects may have quite low direct pair-wise similarity, yet be connected by many paths that stay within the cluster and traverse high-similarity edges

23 Which Clustering Algorithm should I use?(2)
Matching algorithms to cluster types
- CLUTO provides clustering algorithms for finding both of these types of clusters
- The different clustering criterion functions used by the partitional and agglomerative algorithms affect the extent to which a particular instance of the algorithm can find globular clusters, clusters with different internal consensus (i.e., clusters whose average pair-wise similarity differs), and clusters of dramatically different size
Similarity measures between objects
- The objects to be clustered are viewed as vectors in a high-dimensional space, and the degree of similarity between them can be measured with several different functions
1. Cosine- and correlation-based similarity measures
- Well suited for clustering the high-dimensional datasets arising in many diverse application areas
2. Euclidean-distance-based similarity functions
- Well suited for finding clusters in the original feature space, as is the case for spatial clusters
Scalability of CLUTO's clustering algorithms
- The algorithms have different scalability characteristics (time and space complexity)
- In terms of time and memory, the most scalable method is vcluster's repeated-bisecting algorithm
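For reference, the cosine similarity between two object vectors u and v is the standard
$$\cos(u, v) = \frac{u^{T} v}{\|u\|\,\|v\|},$$
which depends only on the directions of the vectors, whereas a Euclidean-distance-based measure also depends on their magnitudes; this is one reason the former tends to behave better for sparse, high-dimensional data such as documents.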

24 CLUTO’s Library Interface(1)
Using Cluto’s Library In order to use Cluto’s stand-along library you must link your program with Cluto’s pre-compiled library that is provide in the s/w distribution Matrix & Graph Data Structure Most of the routine in Cluto’s library take, as input, the objects to be clustered in the form of a matrix That is, the row are the object and the columns are the features Sparse Matrix and Graph Data Structure Row수 +1 Matrix의 nonzero entry의 Column 위치 저장 Nonzero value들 Dense Matrix Data Structure By using only the rowval array and seting the rowptr and rowind array to NULL

25 CLUTO’s Library Interface(2)
Clustering parameters
Two parameters control the similarity function to be used while clustering the objects and the clustering criterion function to be optimized in the process:
- the simfun parameter: specifies the similarity function
- the crfun parameter: specifies the clustering criterion function
※ The cstype parameter specifies the method to be used for selecting the next cluster to be bisected
Object modeling parameters
The routines take as input parameters that control how the rows and columns of the input matrix will be modeled:
- rowmodel: used for scaling the various columns of each row
- colmodel: used for scaling the various columns globally across all the rows of the matrix
- grmodel: the type of k-nearest-neighbor graph that will be built by CLUTO's graph-partitioning algorithms
- colprune: controls how the columns of the matrix will be pruned before clustering
- edgeprune: controls how the edges in the graph-partitioning clustering algorithms will be pruned based on the link-connectivity of their incident vertices
- vtxprune: controls how outlier vertices in the graph-partitioning clustering algorithms will be pruned based on their degree

26 CLUTO’s Library Interface(3)
Debugging parameter
The routines take as input a parameter called dbglvl that controls the amount of information to be printed; it is used for internal purposes and should be set to 0, which suppresses any debugging output.
Clustering routines
1. void CLUTO_VP_ClusterDirect
→ clusters a matrix into a specified number of clusters using a partitional clustering algorithm that computes the k-way clustering directly
2. void CLUTO_VP_ClusterRB
→ computes the k-way clustering by performing a sequence of repeated bisections; considerably faster than CLUTO_VP_ClusterDirect, and it should be preferred when the number of desired clusters is quite large (more than 20-30)
3. void CLUTO_VP_GraphClusterRB
→ clusters a matrix into a specified number of clusters using a graph-partitioning-based clustering algorithm that computes the k-way clustering by performing a sequence of repeated min-cut bisections
4. void CLUTO_VA_Cluster
→ clusters a matrix into a specified number of clusters using a hierarchical agglomerative clustering algorithm; due to its high computational requirements, CLUTO_VA_Cluster should only be used to cluster matrices that have fewer than 3K-6K rows

27 CLUTO’s Library Interface(4)
Clustering routines (continued)
5. void CLUTO_SP_ClusterDirect
→ clusters a graph into a specified number of clusters using a partitional clustering algorithm that computes the k-way clustering directly
6. void CLUTO_SP_ClusterRB
→ clusters a graph into a specified number of clusters using a partitional clustering algorithm that computes the k-way clustering by performing a sequence of repeated bisections; considerably faster than CLUTO_SP_ClusterDirect, and it should be preferred when the number of desired clusters is quite large (more than 20-30)
7. void CLUTO_SP_GraphClusterRB
→ clusters a graph into a specified number of clusters using a graph-partitioning-based clustering algorithm that computes the k-way clustering by performing a sequence of repeated min-cut bisections
8. void CLUTO_SA_Cluster
→ clusters a graph into a specified number of clusters using a hierarchical agglomerative clustering algorithm; due to its high computational requirements, CLUTO_SA_Cluster should only be used to cluster graphs that have fewer than 3K-6K vertices
9. void CLUTO_V_BuildTree / void CLUTO_S_BuildTree
→ build a hierarchical agglomerative tree that preserves the clustering solution supplied in the part array; two types of tree can be built (one built on top of a particular clustering solution, and one that is a complete agglomerative tree)

28 CLUTO’s Library Interface(5)
Graph creation routines
1. void CLUTO_V_GetGraph / void CLUTO_S_GetGraph
→ create a nearest-neighbor graph of the set of objects; this graph can be used as input to the graph-partitioning-based clustering algorithms (e.g., CLUTO_SP_GraphClusterRB)
Cluster statistics routines
1. void CLUTO_V_GetSolutionQuality / void CLUTO_S_GetSolutionQuality
→ return the value of a particular criterion function for a given clustering solution; this routine can be used to find the value of any clustering criterion function, regardless of the criterion function that was used to compute the clustering solution
2. void CLUTO_V_GetClusterStats / void CLUTO_S_GetClusterStats
→ return various statistics about a given clustering solution; in order for this routine to compute accurate statistics for a particular clustering solution, the values of the rowmodel, colmodel, and colprune parameters should be identical to those used to compute the clustering solution
3. void CLUTO_V_GetClusterFeatures
→ returns the set of features that best describe and discriminate each cluster of a given clustering solution
4. void CLUTO_V_GetTreeStats
→ returns a number of statistics about the clusters corresponding to the different nodes of the hierarchical agglomerative tree
5. void CLUTO_V_GetTreeFeatures
→ returns the set of features that best describe and discriminate the clusters corresponding to the various nodes of the hierarchical agglomerative tree that was built on top of the clustering solution

