Stable Clustering
Vladyslav Kolbasin

Clustering data

Clustering is part of the exploratory process. Standard definition:
- Clustering - grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups
There is no "true" solution for clustering:
- We don't have any "true" Y values
- Usually we want to do some data exploration or simplification, or even to find some data taxonomy
- Usually we don't have a precise mathematical definition of clustering
In practice we iterate through different methods, each optimizing a different mathematical target function, and then keep the one that works best.

Usual clustering methods

Methods:
- K-means
- Hierarchical clustering
- Spectral clustering
- DBSCAN
- BIRCH
- …
Issues:
- Need to estimate the number of clusters
- Non-determinism
- Instability

Are standard methods stable? K-means

(figure: k-means clustering results)
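This kind of instability is easy to reproduce: a minimal sketch (an added illustration on synthetic data, not from the slides) that runs k-means twice on identical data, varying only the random seed, and compares the partitions with the adjusted Rand index:

```python
# Illustration (not from the slides): k-means run twice on the same data,
# differing only in random initialisation.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=2000, centers=8, cluster_std=3.0, random_state=0)

labels_a = KMeans(n_clusters=8, n_init=1, random_state=1).fit_predict(X)
labels_b = KMeans(n_clusters=8, n_init=1, random_state=2).fit_predict(X)

# ARI = 1.0 means identical partitions; noticeably lower values mean the
# result depends on initialisation, i.e. the method is unstable here.
print("ARI between runs:", adjusted_rand_score(labels_a, labels_b))
```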

Are standard methods stable? Hierarchical clustering

(figure: hierarchical clustering results)

Audience data

- A lot of attributes
- All attributes are binary
- There are several data providers
- No single attribute is very important

Stability importance

- Data comes from different providers and is very noisy
- Results should not change from run to run
  - Usually the audience doesn't change a lot over a short period
- Many algorithms "explode" when we increase the data size
  - Non-linear complexity of clustering
  - The best number of clusters moves to higher values for bigger data

Stable clustering. Requirements

Let's add an additional requirement to clustering:
- the clustering result should be a structure on the data set that is "stable"
So there should be similar results when:
- We change some small portion of the data
- We apply clustering to several datasets from the same underlying model
- We apply clustering to several subsets of the initial dataset
We don't want to have to process gigabytes and terabytes just to get a few stable clusters that are independent of the randomness of sampling. A quick way to check this requirement on subsamples is sketched below.
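One concrete way to check the requirement (a sketch under assumed data and parameters, not part of the original slides): cluster two overlapping subsamples and compare the labels assigned to the points they share.

```python
# Sketch: stability check on two overlapping subsamples (assumed setup).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=5000, centers=6, random_state=0)
rng = np.random.default_rng(0)

idx_a = rng.choice(len(X), size=3000, replace=False)
idx_b = rng.choice(len(X), size=3000, replace=False)
shared = np.intersect1d(idx_a, idx_b)        # points present in both samples

model_a = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X[idx_a])
model_b = KMeans(n_clusters=6, n_init=10, random_state=1).fit(X[idx_b])

# A stable method should partition the shared points almost identically.
ari = adjusted_rand_score(model_a.predict(X[shared]),
                          model_b.predict(X[shared]))
print("stability (ARI on shared points):", ari)
```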

Natural restrictions:
- We don't want too many clusters
- We don't want too small or too big clusters
  - Too small clusters are usually useless for further processing
  - Too big clusters do not bring significantly new information
- Some points can be noise points, so let's try to find only the significant tendencies
It is a big benefit if we can easily scale the results:
- To be able to look at the inner structure of a selected cluster without a full rerun
Any additional instruments for manual analysis of the clustering are welcome.

Stable clustering ideas

- Do not use the whole dataset; use many small subsamples instead
- Use several samplings to mine as much information as possible from the data
- Aggregate all the clusterings of the samples to get a stable result

Stable clustering algorithm

1) Select N samples from the whole dataset
2) Do clustering for each sample
   - So for each sample we have a set of clusters (possibly very different)
3) Do some cluster preprocessing
4) Associate clusters from different samples with each other
   - Build a relationship structure - a clusters graph
   - Set a relationship measure - a distance measure
5) Do clustering on the relationship structure
   - Do communities search
A compact sketch of the whole pipeline is shown below.
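Putting the five steps together (a reconstruction under assumptions: the slides fix neither libraries nor parameters such as sample_size or threshold, and the community search here uses greedy modularity rather than the igraph methods listed later):

```python
# Sketch of the two-stage pipeline; parameter values are assumptions.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from scipy.spatial.distance import cdist
from sklearn.cluster import AgglomerativeClustering

def stable_clustering(X, n_samples=20, sample_size=500, k=8, threshold=1.0):
    rng = np.random.default_rng(0)
    centers = []                       # one center per (sample, cluster) pair
    # Stage 1: cluster many small subsamples independently.
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=sample_size, replace=False)
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(X[idx])
        for c in range(k):
            centers.append(X[idx][labels == c].mean(axis=0))
    centers = np.asarray(centers)
    # Stage 2: connect similar sample-level clusters and find communities.
    d = cdist(centers, centers)
    g = nx.Graph()
    g.add_nodes_from(range(len(centers)))
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if d[i, j] < threshold:    # drop edges between very different clusters
                g.add_edge(i, j, weight=d[i, j])
    # Each community of sample-level clusters is one stable cluster.
    return list(greedy_modularity_communities(g))
```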

2. Sample clustering

Any clustering method:
- K-means
- Hierarchical clustering
- …
It is convenient to use hierarchical clustering:
- It is a rather fast clustering method
- We can estimate the number of clusters from the natural restrictions, instead of special criteria like we usually need for k-means
- We can dig into the internal structure without any additional calculations

2.1. Dendrogram clustering

Recursive splitting of large clusters, with natural restrictions:
- Set the max possible cluster size (in %)
- Set the min cluster size (in %); any smaller cluster is noise
- Max count of splits
- …
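One possible implementation of this splitting (a sketch; Ward linkage and the percentage thresholds are assumptions, as the slide names the restrictions but not their values):

```python
# Sketch: recursive dendrogram splitting with the natural restrictions.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def dendrogram_clusters(X, max_frac=0.4, min_frac=0.02, max_splits=5):
    """Recursively split too-large clusters; drop too-small ones as noise."""
    n = len(X)
    result = []

    def split(idx, depth):
        if len(idx) < min_frac * n:            # too small: treat as noise
            return
        if len(idx) <= max_frac * n or depth >= max_splits:
            result.append(idx)                 # acceptable size: keep it
            return
        z = linkage(X[idx], method="ward")     # dendrogram of this subset
        labels = fcluster(z, t=2, criterion="maxclust")  # cut into 2 branches
        for c in (1, 2):
            split(idx[labels == c], depth + 1)

    split(np.arange(n), 0)
    return result
```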

2.1. Dendrogram clustering

(figure: dendrogram clustering example)

3. Do cluster preprocessing

- Reduce noise points
- Cluster smoothing
- Make clusters more convenient for associating:
  - A cluster can be similar to several other clusters (1-to-many)
  - If we split it, it can turn into 1-to-1 & 1-to-1 clusters
- And some other heuristics...

4. Associate clusters from different samples with each other

How similar are the clusters to each other? Set a relationship measure:
- The simplest measure is the distance between cluster centers
- But we can use any suitable measure
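For the simplest measure, a short sketch (clusters_a and clusters_b are hypothetical names for the index sets produced by clustering two different samples):

```python
# Sketch: pairwise distances between cluster centers from two samples.
import numpy as np
from scipy.spatial.distance import cdist

def centers(X, clusters):
    # cluster center = mean of the points belonging to the cluster
    return np.asarray([X[idx].mean(axis=0) for idx in clusters])

def cluster_distances(X, clusters_a, clusters_b):
    return cdist(centers(X, clusters_a), centers(X, clusters_b))
```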

4. Associate clusters from different samples with each other

The clusters relationship structure is a clusters graph:
- But we are not interested in edges between very different clusters
So we need some threshold:
- We can estimate it manually, then hard-code it
- We can estimate it automatically
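Turning the distances into such a graph might look like this (a sketch; dist is assumed to be the square matrix of center distances over all pooled sample-level clusters, and the threshold value is an assumption):

```python
# Sketch: build the clusters graph, keeping only edges between similar clusters.
import networkx as nx

def build_clusters_graph(dist, threshold):
    n = dist.shape[0]
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i, j] < threshold:   # skip very different cluster pairs
                g.add_edge(i, j, weight=dist[i, j])
    return g
```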

5. Communities search in networks

Methods:
- walktrap.community
- edge.betweenness.community
- fastgreedy.community
- spinglass.community
- …
It is possible that some clusters will not fall into any community; we then mark such clusters as a special-type community.
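The names above are R igraph functions; python-igraph exposes the same algorithms, for example (a sketch on a toy graph, not the audience data):

```python
# Sketch: the same community-detection algorithms via python-igraph.
import igraph as ig

g = ig.Graph.Famous("Zachary")              # classic toy social network

walktrap    = g.community_walktrap().as_clustering()
betweenness = g.community_edge_betweenness().as_clustering()
fastgreedy  = g.community_fastgreedy().as_clustering()

for name, clustering in [("walktrap", walktrap),
                         ("edge betweenness", betweenness),
                         ("fastgreedy", fastgreedy)]:
    print(name, "->", len(clustering), "communities")
```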

5.1 Community structure detection based on edge betweenness

edge.betweenness.community() implements the Girvan–Newman algorithm.
Betweenness: the number of geodesics (shortest paths) going through an edge.
Algorithm:
- Calculate edge betweenness for all edges
- Remove the edge with the highest betweenness
- Recalculate betweenness
- Repeat until all edges are removed, or the modularity function is optimized (depending on the variation)
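These steps translate directly into code; a minimal sketch with networkx (an illustration only: networkx also ships a ready-made girvan_newman(), and this version stops at the first split rather than optimizing modularity):

```python
# Sketch: one Girvan–Newman step - remove highest-betweenness edges until
# the graph splits into a new connected component.
import networkx as nx

def girvan_newman_step(g):
    start = nx.number_connected_components(g)
    while nx.number_connected_components(g) == start:
        # 1) calculate edge betweenness for all edges
        betweenness = nx.edge_betweenness_centrality(g)
        # 2) remove the edge with the highest betweenness
        g.remove_edge(*max(betweenness, key=betweenness.get))
        # 3) betweenness is recalculated on the next loop iteration
    return list(nx.connected_components(g))

g = nx.karate_club_graph()
print(girvan_newman_step(g.copy()))       # first split into two communities
```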

5. Communities examples

(figures: examples of detected communities)

Algorithm analysis

(figures: experimental analysis of the algorithm, three slides of plots)

Summary

- Issues in clustering algorithms
- Why stability is important for business questions
- Two-stage clustering algorithm:
  - 1st stage - apply simple clustering on samples
  - 2nd stage - do clustering on the clusters graph
- Real-data clustering example
- The algorithm can be easily parallelized:
  - Most time is spent on the 2nd step (per-sample clustering)