
1 Cluto – Clustering toolkit by G. Karypis, UMN
Andrea Tagarelli, Univ. of Calabria, Italy

2 What is CLUTO?
CLUstering Toolkit for very large, high-dimensional & sparse datasets.
Main characteristics:
- Seeks to optimize a particular clustering criterion function
- Identifies the features that best describe and discriminate each cluster
- Allows for visually examining the relations between clusters, objects, and features
- Handles sparsity, and its memory requirements grow roughly linearly with the input size
Analysis goals:
- To understand the relations between the objects assigned to each cluster and the relations between the different clusters
- To visualize the discovered clustering solution
Distributions:
- Stand-alone programs (vcluster and scluster)
- Library through which an application program can access the CLUTO algorithms

3 Clustering algorithms
Programs:
- vcluster: takes as input a multidimensional representation of the objects to be clustered
- scluster: takes as input the object similarity graph
The clustering method is selected with the -clmethod=string parameter:
- Partitional: direct k-way clustering (direct), bisecting k-way clustering (rb, rbr)
- Agglomerative hierarchical (agglo)
- Partitional-based agglomerative hierarchical (bagglo)
- Graph-partitioning-based (graph)
rb: the desired k-way clustering solution is computed by performing a sequence of k-1 repeated bisections (see the sketch below). The matrix is first clustered into two groups, then one of these groups is selected and bisected further, and the process continues until the desired number of clusters is found. At each step, the cluster is bisected so that the resulting 2-way clustering solution optimizes a particular clustering criterion function (selected with the -crfun parameter). This ensures that the criterion function is locally optimized within each bisection, but in general it is not globally optimized. The cluster selected for further partitioning is controlled by the -cstype parameter. By default, vcluster uses this approach to find the k-way clustering solution.
bagglo (biased agglomerative): uses a partitional sqrt(n)-way clustering solution to bias the agglomeration process. The key motivation is to use a partitional clustering solution that optimizes a global criterion function in order to limit the number of errors made during the early stages of the agglomerative algorithm. Extensive experiments on document datasets show that these algorithms lead to superior clustering solutions.
graph: first models the objects using a nearest-neighbor graph (each object becomes a vertex and is connected to its most similar objects), and then splits the graph into k clusters using a min-cut graph-partitioning algorithm. Note that if the graph contains more than one connected component, vcluster and scluster return a (k + m)-way clustering solution, where m is the number of connected components in the graph.
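The repeated-bisection idea behind rb can be summarized in a few lines. The following is a minimal sketch of the idea, not CLUTO's implementation; bisect and select_cluster_to_split are hypothetical placeholders for a criterion-driven 2-way partitioner (-crfun) and for the cluster-selection rule (-cstype):

    # Minimal sketch of k-way clustering by repeated bisections (not CLUTO's code).
    def repeated_bisection(objects, k, bisect, select_cluster_to_split):
        clusters = [list(objects)]             # start with every object in one cluster
        while len(clusters) < k:               # k-1 bisections in total
            target = select_cluster_to_split(clusters)
            clusters.remove(target)
            clusters.extend(bisect(target))    # replace it with its 2-way split
        return clusters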

4 Usage
vcluster [optional parameters] MatrixFile NClusters
scluster [optional parameters] GraphFile NClusters
- MatrixFile: the file that stores the objects to be clustered
- GraphFile: the file that stores the adjacency matrix of the object similarity graph
- NClusters: the number of desired clusters
Optional parameters:
- specified as -paramname or -paramname=value
- categorized into three groups: parameters that control various aspects of the clustering algorithm, parameters that control the type of analysis and reporting performed on the computed clusters, and parameters that control the visualization of the clusters
The output clustering solution is stored in a file named MatrixFile.clustering.NClusters (or GraphFile.clustering.NClusters for scluster).
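For example, a typical invocation might look like the following (sports.mat and sports.rclass are hypothetical input files); the resulting clustering would be written to sports.mat.clustering.10:

    vcluster -clmethod=rbr -showfeatures -rclassfile=sports.rclass sports.mat 10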

5 Input file format: matrix file
Plain text with n+1 lines storing the data matrix for n m-dimensional objects. Each row represents a single object, and the columns correspond to the object's attributes.
Dense format:
- Metadata (first line): #rows, #columns
- Each remaining line contains m space-separated floating-point values
Sparse format:
- Metadata (first line): #rows, #columns, #nonzero entries
- Each remaining line stores one object's nonzero entries as space-separated pairs: a column index (starting from 1) followed by the corresponding value
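As an illustration (values made up), the same 3x4 matrix with five nonzero entries in dense and in sparse format:

    Dense:
    3 4
    1.0 0.0 2.0 0.0
    0.0 3.0 0.0 0.0
    4.0 0.0 0.0 5.0

    Sparse:
    3 4 5
    1 1.0 3 2.0
    2 3.0
    1 4.0 4 5.0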

6 Input file format: graph file
Plain text with n+1 lines storing the adjacency matrix of the graph that specifies the similarity between the n objects.
Dense format:
- Metadata (first line): #vertices (n)
- Each of the remaining n lines stores n space-separated floating-point values, such that the ith value corresponds to the similarity to the ith vertex of the graph
Sparse format:
- Metadata (first line): #vertices (n) and #edges
- Each of the remaining n lines stores the adjacency list of one vertex as space-separated pairs, each consisting of the number of an adjacent vertex followed by the similarity of the corresponding edge
- Vertex numbers are integers; similarities are floating-point numbers
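A made-up sparse graph file for three vertices whose adjacency lists contain six entries in total (a symmetric similarity graph):

    3 6
    2 0.8 3 0.1
    1 0.8 3 0.5
    1 0.1 2 0.5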

7 Input file format: labels
Row label file (-rlabelfile parameter): stores the label for each of the rows of the matrix (objects)
Column label file (-clabelfile parameter): stores the label for each of the columns of the matrix (attributes)
Row class label file (-rclassfile parameter): stores the class label for each of the rows of the matrix (objects)
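For instance, an -rclassfile for a matrix with five rows contains one class label per line, the ith line labeling the ith object (labels made up):

    sports
    sports
    finance
    sports
    finance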

8 Output file format
Clustering solution file:
- n lines, with a single number per line
- The ith line contains the cluster number that the ith object/row/vertex belongs to
- Cluster numbers run from zero to the number of clusters minus one
- If -zscores is specified, each line contains two additional numbers right after the cluster number: the internal z-score and the external z-score
Tree file:
- Produced by performing agglomerative hierarchical clustering on top of the k-way clustering solution, and stored in the form of a parent array
- 2k-1 lines, such that the ith line contains the parent of the ith node of the tree
- For the root node, which is stored in the last line of the file, the parent is set to -1
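A made-up example for k = 3: a clustering solution file for six objects, and a tree file with 2k-1 = 5 lines in which the leaves 0-2 are the clusters and node 4 is the root (parent -1):

    Clustering solution file (6 objects):
    0
    2
    1
    0
    1
    2

    Tree file (k = 3):
    3
    3
    4
    4
    -1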

9 Output example
The program's textual output includes:
- Matrix/graph information
- Settings
- Clustering/cluster quality statistics
- Timing information

10 Internal clustering quality

11 External clustering quality
Comparison with a reference classification (provided via -rclassfile):
- Overall entropy and purity
- For each cluster: local entropy and purity, and the distribution of its objects over the classes
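For reference, the definitions commonly used in the CLUTO papers (Zhao & Karypis): for a cluster S_r of size n_r containing n_r^i objects of class i, with q classes in total,

    Purity(S_r)  = (1/n_r) * max_i { n_r^i }
    Entropy(S_r) = -(1/log q) * sum_i (n_r^i / n_r) * log(n_r^i / n_r)

The overall purity and entropy are then the per-cluster values weighted by cluster size, n_r / n.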

12 Cluster description
Determine the best set of descriptive & discriminating features for each cluster (via -showfeatures):
- Top-L most descriptive features, each with the percentage of the within-cluster similarity it explains
- Top-L most discriminating features, each with the percentage of the dissimilarity between the cluster and the rest of the objects it explains

13

14 Cluster tree (1/2)
Via -showtree; the tree is displayed in a rotated fashion:
- The first column is the root, and the tree grows from left to right
- The leaves of the tree are the discovered clusters (numbered 0 to NClusters-1), and the internal nodes are numbered from NClusters to 2*NClusters-2
- If -rclassfile is specified, the output also shows how the objects of the various classes are distributed in each cluster

15 Cluster tree (2/2)
Via -showtree and -labeltree; further statistics are reported for each of the clusters:
- Size: number of objects in the cluster
- ISim: average similarity between the objects of the cluster
- XSim: average similarity between the objects of each pair of clusters that are children of the same node of the tree
- Gain: change in the value of a particular clustering criterion function as a result of merging the two child clusters

16 Cluster visualization
Example 3: produced when -plotcluster is specified for a sparse matrix.
- Color-intensity plot of the relations between the different clusters of documents and features
- A subset of the features is displayed: the union of the most descriptive and discriminating features of each cluster, re-ordered according to a hierarchical agglomerative clustering
- A brighter red cell for a feature-cluster pair indicates that the feature is more descriptive (i.e., it explains a larger fraction of the within-cluster similarity) and more discriminating (i.e., it explains a larger fraction of the dissimilarity between the cluster and the rest of the objects) for that cluster
- The width of each cluster column is proportional to the logarithm of the corresponding cluster's size

17 Example 1: produced when -plotmatrix is specified for a sparse matrix.
(a) The rows of the input matrix are re-ordered so that the rows assigned to each of the ten clusters appear consecutively; each non-zero positive element of the matrix is displayed in a different shade of red
(b) The rows and columns are also re-ordered according to a hierarchical clustering
(c) The same plot for the 10-way clustering obtained by scluster

18 Cluster visualization
Example 4: produced when -plottree is specified.
- Plots the entire hierarchical tree
- The leaves of the tree are labeled with the corresponding row id (or row label, if available)

