Ricochet: A Family of Unconstrained Algorithms for Graph Clustering

Similar presentations
Clustering k-mean clustering Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.

1. Find the cost of each of the following using the Nearest Neighbor Algorithm. a)Start at Vertex M.
Clustering.
Fill Reduction Algorithm Using Diagonal Markowitz Scheme with Local Symmetrization Patrick Amestoy ENSEEIHT-IRIT, France Xiaoye S. Li Esmond Ng Lawrence.
PARTITIONAL CLUSTERING
Minimizing Seed Set for Viral Marketing Cheng Long & Raymond Chi-Wing Wong Presented by: Cheng Long 20-August-2011.
More on Clustering Hierarchical Clustering to be discussed in Clustering Part2 DBSCAN will be used in programming project.
Clustering Clustering of data is a method by which large sets of data is grouped into clusters of smaller sets of similar data. The example below demonstrates.
Supervised Learning Recap
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Protein Domain Finding Problem Olga Russakovsky, Eugene Fratkin, Phuong Minh Tu, Serafim Batzoglou Algorithm Step 1: Creating a graph of k-mers First,
Unsupervised Learning and Data Mining
Dimension reduction : PCA and Clustering Christopher Workman Center for Biological Sequence Analysis DTU.
Clustering. 2 Outline  Introduction  K-means clustering  Hierarchical clustering: COBWEB.
1 Partitioning Algorithms: Basic Concepts  Partition n objects into k clusters Optimize the chosen partitioning criterion Example: minimize the Squared.
Cluster Analysis (1).
Adapted by Doug Downey from Machine Learning EECS 349, Bryan Pardo Machine Learning Clustering.
Clustering Ram Akella Lecture 6 February 23, & 280I University of California Berkeley Silicon Valley Center/SC.
18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star.
Clustering Unsupervised learning Generating “classes”
Graph clustering Jin Chen CSE Fall 2012 MSU 1.
Nirmalya Roy School of Electrical Engineering and Computer Science Washington State University Cpt S 223 – Advanced Data Structures Graph Algorithms: Minimum.
Network Aware Resource Allocation in Distributed Clouds.
Presented by Tienwei Tsai July, 2005
COMMON EVALUATION FINAL PROJECT Vira Oleksyuk ECE 8110: Introduction to machine Learning and Pattern Recognition.
1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods.
An Efficient Approach to Clustering in Large Multimedia Databases with Noise Alexander Hinneburg and Daniel A. Keim.
May 1, 2002Applied Discrete Mathematics Week 13: Graphs and Trees 1News CSEMS Scholarships for CS and Math students (US citizens only) $3,125 per year.
A Clustering Algorithm based on Graph Connectivity Balakrishna Thiagarajan Computer Science and Engineering State University of New York at Buffalo.
Module 5 – Networks and Decision Mathematics Chapter 23 – Undirected Graphs.
START OF DAY 8 Reading: Chap. 14. Midterm Go over questions General issues only Specific issues: visit with me Regrading may make your grade go up OR.
1 Motivation Web query is usually two or three words long. –Prone to ambiguity –Example “keyboard” –Input device of computer –Musical instruments How can.
Classification Heejune Ahn SeoulTech Last updated May. 03.
CLUSTERING. Overview Definition of Clustering Existing clustering methods Clustering examples.
Graph-based Text Classification: Learn from Your Neighbors Ralitsa Angelova , Gerhard Weikum : Max Planck Institute for Informatics Stuhlsatzenhausweg.
Markov Cluster (MCL) algorithm Stijn van Dongen.
Andreas Papadopoulos - [DEXA 2015] Clustering Attributed Multi-graphs with Information Ranking 26th International.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Clustering COMP Research Seminar BCB 713 Module Spring 2011 Wei Wang.
COMP Data Mining: Concepts, Algorithms, and Applications 1 K-means Arbitrarily choose k objects as the initial cluster centers Until no change,
Clustering I. 2 The Task Input: Collection of instances –No special class label attribute! Output: Clusters (Groups) of instances where members of a cluster.
CLUSTER ANALYSIS Introduction to Clustering Major Clustering Methods.
Clustering Gene Expression Data BMI/CS 576 Colin Dewey Fall 2010.
Clustering.
Hierarchical Clustering for POS Tagging of the Indonesian Language Derry Tanti Wijaya and Stéphane Bressan.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology A modified version of the K-means algorithm with a distance.
Clustering Clustering is a technique for finding similarity groups in data, called clusters. I.e., it groups data instances that are similar to (near)
CS 8751 ML & KDDData Clustering1 Clustering Unsupervised learning Generating “classes” Distance/similarity measures Agglomerative methods Divisive methods.
Zeidat&Eick, MLMTA, Las Vegas K-medoid-style Clustering Algorithms for Supervised Summary Generation Nidal Zeidat & Christoph F. Eick Dept. of Computer.
Community Discovery in Social Network Yunming Ye Department of Computer Science Shenzhen Graduate School Harbin Institute of Technology.
Designing Factorial Experiments with Binary Response Tel-Aviv University Faculty of Exact Sciences Department of Statistics and Operations Research Hovav.
Cluster Analysis Dr. Bernard Chen Assistant Professor Department of Computer Science University of Central Arkansas.
CZ5211 Topics in Computational Biology Lecture 4: Clustering Analysis for Microarray Data II Prof. Chen Yu Zong Tel:
Prims Algorithm for finding a minimum spanning tree
Lecture 19 Minimal Spanning Trees CSCI – 1900 Mathematics for Computer Science Fall 2014 Bill Pine.
Network Partition –Finding modules of the network. Graph Clustering –Partition graphs according to the connectivity. –Nodes within a cluster is highly.
Cluster Analysis Dr. Bernard Chen Ph.D. Assistant Professor Department of Computer Science University of Central Arkansas Fall 2010.
Clustering Wei Wang. Outline What is clustering Partitioning methods Hierarchical methods Density-based methods Grid-based methods Model-based clustering.
1 Distributed Vertex Coloring. 2 Vertex Coloring: each vertex is assigned a color.
Non-parametric Methods for Clustering Continuous and Categorical Data Steven X. Wang Dept. of Math. and Stat. York University May 13, 2010.
Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods.
DATA MINING: CLUSTER ANALYSIS Instructor: Dr. Chun Yu School of Statistics Jiangxi University of Finance and Economics Fall 2015.
Semi-Supervised Clustering
Lin Lu, Margaret Dunham, and Yu Meng
CSE572, CBS572: Data Mining by H. Liu
Alan Kuhnle*, Victoria G. Crawford, and My T. Thai
CSE572: Data Mining by H. Liu
Presentation transcript:

Ricochet: A Family of Unconstrained Algorithms for Graph Clustering

Background

Clustering is an unsupervised process of discovering natural clusters:
- Objects within the same cluster are "similar"
- Objects from different clusters are "dissimilar"

When we have a similarity metric, we can represent objects in a similarity graph:
- Vertices represent objects
- Edges represent similarity between objects

Clustering then translates to graph clustering (for dense graphs).
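To make this representation concrete, here is a minimal sketch of building such a similarity graph as a weighted adjacency map; the `sim` function and the `threshold` for keeping an edge are illustrative assumptions, not specified by the paper.

```python
from itertools import combinations

def build_similarity_graph(objects, sim, threshold=0.0):
    """Build a weighted similarity graph as an adjacency map.

    objects   -- the items to cluster
    sim       -- symmetric similarity function, sim(a, b) -> float
    threshold -- keep only edges with similarity above this value
                 (illustrative; the paper does not prescribe one)
    """
    graph = {i: {} for i in range(len(objects))}
    for i, j in combinations(range(len(objects)), 2):
        s = sim(objects[i], objects[j])
        if s > threshold:
            graph[i][j] = s  # vertices are object indices,
            graph[j][i] = s  # edge weights are similarities
    return graph
```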

Background

Motivation: clustering algorithms often necessitate a priori decisions on parameters.

Ricochet is based on our study of:
- Star clustering [1] to select significant vertices: we use Star clustering's method for selecting cluster seeds, without needing the number of clusters
- Single-link hierarchical clustering [2] to select significant edges: we use single-link hierarchical clustering's method for selecting edges, without needing a threshold
- K-means [3] for the termination condition: by re-assigning vertices, cluster quality can be updated and improved, reaching a terminating condition without needing the number of clusters or a threshold

Contribution

Ricochet does not require any parameter to be set a priori. It alternates two phases:
- Choice of vertices to be seeds, using the average metric [1]: $ave(v) = \sum_{v_i \in adj(v)} sim(v_i, v) \,/\, degree(v)$
- Assignment of vertices into clusters, using the single-link hierarchical clustering and K-means methods

Pictorially, the process resembles the rippling of stones thrown into a pond, hence the name: Ricochet.
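A minimal sketch of this average metric over the adjacency-map graph from the previous sketch, where `graph[v]` maps each neighbor of `v` to its similarity:

```python
def ave(graph, v):
    """Average similarity of vertex v to its adjacent vertices:
    ave(v) = sum_{v_i in adj(v)} sim(v_i, v) / degree(v)."""
    neighbors = graph[v]
    if not neighbors:
        return 0.0  # isolated vertex: no neighbors to average over
    return sum(neighbors.values()) / len(neighbors)
```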

Ricochet family

Sequential rippling (stones are thrown one after another):
- Hard clustering
- A straightforward extension to K-means

Concurrent rippling (stones are thrown at the same time):
- Soft clustering

Sequential Rippling

Sequential Rippling (SR):
- Choose the heaviest vertex (the one with the largest ave(v)) as the first seed; one cluster is formed containing all vertices
- Subsequent seeds are chosen from the list of vertices ordered from heaviest to lightest
- When a new seed is added, re-assign vertices to their nearest seeds; clusters reduced to singletons are assigned to other nearest seeds
- Stop when all vertices have been considered

Balanced Sequential Rippling (BSR):
- Balances the distribution of seeds: each subsequent seed is chosen to maximize the ratio of its weight (ave(v)) to the sum of its similarities to the existing seeds (see the sketch below)
- Stop when there are no more re-assignments

Complexity: O(N³)
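A minimal sketch of how BSR's seed-selection rule might look on the representations above; `ave` and the adjacency-map graph come from the earlier sketches, and details such as tie-breaking and the zero-similarity guard are illustrative assumptions rather than the paper's exact procedure.

```python
def next_bsr_seed(graph, seeds, candidates):
    """Pick the next Balanced Sequential Rippling seed: the candidate
    maximizing ave(v) / (sum of its similarities to the existing seeds).
    Non-adjacent vertex/seed pairs contribute 0 similarity (an assumption).
    """
    best, best_ratio = None, float("-inf")
    for v in candidates:
        if v in seeds:
            continue
        closeness = sum(graph[v].get(s, 0.0) for s in seeds)
        # A candidate far from every existing seed is maximally "balanced".
        ratio = ave(graph, v) / closeness if closeness > 0 else float("inf")
        if ratio > best_ratio:
            best, best_ratio = v, ratio
    return best
```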

Balanced Sequential Rippling

Concurrent Rippling

Concurrent Rippling (CR):
- Each vertex is initially a seed
- At each iteration, find all edges connecting vertices to their next most similar neighbors, and let e_min be the minimum-weight edge among them
- Collect all unprocessed edges whose weight is ≥ the weight of e_min, and process them from heaviest to lightest:
  - If an edge connects a seed to a non-seed, add the non-seed to the seed's cluster
  - If an edge connects two seeds, the cluster of one is absorbed by the other if its weight (ave(v)) is smaller than that of the other seed
- Stop when the seeds no longer change

Ordered Concurrent Rippling (OCR):
- At each iteration, process the edges connecting vertices to their next most similar neighbors from heaviest to lightest (see the edge-processing sketch below)

Complexity: O(N² log N)
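A minimal sketch of the edge-processing rule shared by CR and OCR, reusing `ave` and the adjacency-map graph from the sketches above; the `seed_of` and `clusters` bookkeeping is an illustrative simplification, not the paper's exact data structures.

```python
def process_edge(graph, u, v, seeds, seed_of, clusters):
    """Apply the rippling rule to the edge (u, v).

    seeds    -- set of current seed vertices
    seed_of  -- maps each clustered vertex to its seed
    clusters -- maps each seed to the set of vertices in its cluster
    """
    if u in seeds and v in seeds:
        # Two seeds meet: the lighter seed's cluster is absorbed.
        winner, loser = (u, v) if ave(graph, u) >= ave(graph, v) else (v, u)
        clusters[winner] |= clusters.pop(loser)
        for w in clusters[winner]:
            seed_of[w] = winner
        seeds.discard(loser)
    elif u in seeds or v in seeds:
        # A seed meets a non-seed: attach the non-seed to the seed's cluster.
        seed, other = (u, v) if u in seeds else (v, u)
        clusters[seed].add(other)
        seed_of[other] = seed
```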

Ordered Concurrent Rippling

[Figure: example run of OCR, showing the seed vertices (S) after the 1st and 2nd iterations]

Ordered Concurrent Rippling

At each step, OCR tries to maximize the average similarity between vertices and their seeds:
- OCR processes the adjacent vertices of each vertex in order of similarity, from highest to lowest, ensuring the best possible merger for the vertex at each iteration
- Whenever two seeds are adjacent to one another, OCR keeps the vertex with the bigger weight (ave(v)) as seed; as in [1, 4], this approximates maximizing the average similarity between the seed and its vertices

Experiments

- Compare performance with constrained clustering algorithms (K-medoids [5], Star clustering [4]) and an unconstrained clustering algorithm (Markov Clustering [6])
- Use data from Reuters-21578, Tipster-AP, and our own collection, Google
- Measure effectiveness with recall, precision, and F1 (defined below); measure efficiency with running time
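For reference, the standard definitions behind these effectiveness measures, in terms of true/false positives and negatives (the paper's exact averaging over clusters and classes is not restated here):

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 P R}{P + R}
```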

Experimental Results

Comparison with constrained algorithms:
- Effectiveness: BSR and OCR are the most effective
  - BSR achieves higher precision than K-medoids, Star, and Star-Ave
  - OCR achieves F1 that is higher than or comparable to K-medoids, Star, and Star-Ave
- Efficiency: OCR is faster than Star and Star-Ave, but slower than K-medoids due to the pre-processing time required to build the graph

Experimental Results

[Figure: effectiveness comparison on Tipster-AP and Reuters]

Experimental Results

[Figure: efficiency comparison on Tipster-AP and Reuters]

Experimental Results

Comparison with unconstrained algorithms: Markov Clustering (MCL), which has an intrinsic inflation parameter to which it is sensitive.
- Effectiveness: BSR and OCR are competitive with MCL set at its best inflation value, and much more effective than MCL at its minimum and maximum inflation values
- Efficiency: BSR and OCR are significantly faster than MCL at all inflation values

Experimental Results

[Figure: effectiveness and efficiency of MCL at different inflation parameters]

Experimental Results

[Figure: effectiveness and efficiency comparison on Tipster-AP]

Summary

- We propose Ricochet, a family of algorithms for clustering weighted graphs
- Our proposed algorithms are unconstrained: they do not require a priori setting of extrinsic or intrinsic parameters
- OCR yields very respectable effectiveness while being efficient
- Pre-processing time is still a bottleneck compared to non-graph clustering algorithms like K-medoids

References

1. Wijaya, D., Bressan, S.: Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star Clustering. In: 18th International Conference on Database and Expert Systems Applications (DEXA) (2007)
2. Croft, W. B.: Clustering Large Files of Documents Using the Single-link Method. Journal of the American Society for Information Science (1977)
3. MacQueen, J. B.: Some Methods for Classification and Analysis of Multivariate Observations. In: 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. University of California Press, Berkeley (1967)
4. Aslam, J., Pelekhov, K., Rus, D.: The Star Clustering Algorithm. Journal of Graph Algorithms and Applications 8(1) (2004)
5. Kaufman, L., Rousseeuw, P.: Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York (1990)
6. Van Dongen, S. M.: Graph Clustering by Flow Simulation. PhD thesis, Universiteit Utrecht (2000)