Robust Information-theoretic Clustering
By C. Bohm, C. Faloutsos, J.-Y. Pan, and C. Plant
Presenter: Niyati Parikh

Objective
- Find the natural clustering in a dataset
- Two questions: how to measure the goodness of a clustering, and how to find a good clustering efficiently

Defining "goodness"
- Goodness = the ability to describe the clusters succinctly
- Adopt VAC (Volume After Compression), the number of bytes needed to record:
  - the number of clusters k
  - the type of each cluster (Gaussian, uniform, ...)
  - the compressed location of each point

VAC
- Tells which grouping is better: lower VAC => better grouping
- The formula uses a decorrelation matrix, i.e. the matrix whose columns are the eigenvectors of the cluster's covariance matrix

Computing VAC
Steps (see the sketch below):
1. Compute the covariance matrix of cluster C
2. Run PCA on it to obtain the eigenvector (decorrelation) matrix
3. Compute VAC from the decorrelated coordinates
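The paper's full VAC also charges bits for the number of clusters and for each cluster's model type; the sketch below (Python with NumPy) covers only the per-point coding cost after PCA decorrelation, assuming a per-dimension Gaussian model and a quantization grid of width delta. The function name vac_cost and the grid parameter are illustrative choices, not the paper's exact encoding.

```python
import numpy as np

def vac_cost(points, delta=1e-3):
    """Simplified VAC-style coding cost (in bits) for one cluster.

    Decorrelates the cluster with PCA and charges each coordinate
    -log2(pdf(x) * delta) under a per-dimension Gaussian model.
    Omits the cluster-count and model-type terms of the full VAC.
    """
    centered = points - points.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # The eigenvectors of the covariance matrix form the decorrelation matrix.
    _, eigvecs = np.linalg.eigh(cov)
    decorrelated = centered @ eigvecs
    sigma = decorrelated.std(axis=0) + 1e-12          # per-dimension spread
    # Gaussian density of every decorrelated coordinate.
    pdf = np.exp(-0.5 * (decorrelated / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    # Bits needed to encode each coordinate on a grid of width delta.
    bits = -np.log2(np.clip(pdf * delta, 1e-300, None))
    return float(bits.sum())
```

Lower values mean the cluster is easier to compress, so between two candidate groupings of the same points, the one with the smaller summed vac_cost would be preferred.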

Efficient algorithm
- Take an initial clustering produced by any algorithm (e.g. K-Means)
- Refine that clustering to remove outliers/noise
- Output a better clustering by post-processing (merging redundant clusters)

Refining clusters
- Use VAC to refine existing clusters by removing outliers from a given cluster C
- Define Core and Out as the sets of core points and outliers of C; initially Out contains all points of C
- Sort the points of C in ascending order of their distance from the cluster center
- Compute VAC, then pick the closest point in Out, move it to Core, and compute the new VAC
- If the new VAC increases, stop; otherwise pick the next closest point and repeat (a sketch follows this list)
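A minimal sketch of this purification loop, reusing the vac_cost function above. It codes the Out points as uniform noise over the cluster's bounding box (the paper's handling of outliers is more detailed) and seeds Core with the d+1 points nearest the center so that a covariance matrix can be estimated; uniform_bits and refine_cluster are names introduced here for illustration.

```python
def uniform_bits(points, data_range, delta=1e-3):
    """Bits to code points as noise: uniform over the data bounding box."""
    return points.shape[0] * float(np.sum(np.log2(data_range / delta)))

def refine_cluster(points, delta=1e-3):
    """Split one cluster into (Core, Out).

    Points are moved from Out to Core, closest to the center first, as long
    as the total cost (Core coded via vac_cost plus Out coded as uniform
    noise) keeps decreasing: a simplification of RIC's purification step.
    """
    d = points.shape[1]
    data_range = points.max(axis=0) - points.min(axis=0) + 1e-12
    center = np.median(points, axis=0)                       # robust center
    order = np.argsort(np.linalg.norm(points - center, axis=1))
    core, out = list(order[:d + 1]), list(order[d + 1:])     # seed Core so a covariance exists
    best = vac_cost(points[core], delta) + uniform_bits(points[out], data_range, delta)
    while out:
        trial_core, trial_out = core + [out[0]], out[1:]
        cost = (vac_cost(points[trial_core], delta)
                + uniform_bits(points[trial_out], data_range, delta))
        if cost > best:                                      # the VAC would increase: stop
            break
        core, out, best = trial_core, trial_out, cost
    return points[core], points[out]
```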

VAC and robust estimation
- Conventional estimation: the covariance matrix is computed around the mean
- Robust estimation: the covariance matrix is computed around the median
- The median is less affected by outliers than the mean (see the comparison sketch below)
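To illustrate just the centering choice mentioned on this slide (the paper's robust estimation is more elaborate), a small comparison; note that in this toy version only the center resists outliers, which still inflate the estimated spread.

```python
def conventional_cov(points):
    """Standard covariance: center the data on the per-dimension mean."""
    centered = points - points.mean(axis=0)
    return centered.T @ centered / len(points)

def robust_cov(points):
    """Median-centered covariance: a few outliers pull the center far less."""
    centered = points - np.median(points, axis=0)
    return centered.T @ centered / len(points)
```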

Sample result
- Imperfect clusters formed by K-Means affect the purifying process
- This may result in redundant clusters that could be merged

Cluster merging
- Merge Ci and Cj only if the combined VAC decreases:
  savedCost(Ci, Cj) = VAC(Ci) + VAC(Cj) - VAC(Ci ∪ Cj)
- If savedCost(Ci, Cj) > 0, merge Ci and Cj
- Greedy search to maximize savedCost, hence minimize the total VAC (sketched below)
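One possible reading of this greedy merge as code, again reusing the vac_cost sketch. The savedCost criterion and the stopping rule follow the slide; the exhaustive pairwise scan and the name merge_clusters are assumptions made for illustration.

```python
def merge_clusters(clusters, delta=1e-3):
    """Greedily merge the cluster pair with the largest positive savedCost,
    repeating until no merge lowers the total VAC."""
    clusters = list(clusters)                                # list of (n_i, d) arrays
    while len(clusters) > 1:
        best_pair, best_saved = None, 0.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged = np.vstack([clusters[i], clusters[j]])
                saved = (vac_cost(clusters[i], delta) + vac_cost(clusters[j], delta)
                         - vac_cost(merged, delta))          # savedCost(Ci, Cj)
                if saved > best_saved:                       # merging Ci and Cj lowers the VAC
                    best_pair, best_saved = (i, j), saved
        if best_pair is None:                                # no pair saves cost: done
            break
        i, j = best_pair
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```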

Final Result

Experiment results

Example

Thank You
Questions?