On K-Means Cluster Preservation using Quantization Schemes
Deepak Turaga (1), Michalis Vlachos (2), Olivier Verscheure (1)
(1) IBM T.J. Watson Research Center, NY, USA
(2) IBM Zürich Research Laboratory, Switzerland

overview – what we want to do…
Examine under what conditions compression methodologies retain the clustering outcome.
We focus on the K-Means algorithm.
[Figure: k-Means on the original data and on the quantized data produces identical clustering results (clusters 1, 2, 3)]

why we want to do that…
Reduced storage – the quantized data take up less space.
Faster execution – since the data can be represented in a more compact form, the clustering algorithm requires less runtime.
Anonymization/privacy preservation – the original values are not disclosed.
Authentication – encode a message within the quantization.
We achieve all of the above and still guarantee the same clustering results.

other cluster preservation techniques
– We do not transform into another space
– Space requirements are the same – no data simplification
– Shape preservation
[Oliveira04] S. R. M. Oliveira and O. R. Zaïane. Privacy Preservation When Sharing Data for Clustering, 2004.
[Parameswaran05] R. Parameswaran and D. Blough. A Robust Data Obfuscation Approach for Privacy Preservation of Clustered Data, 2005.
[Figure: original vs. quantized series]

k-means overview
K-Means Algorithm:
1. Initialize k clusters (k specified by user) randomly.
2. Repeat until convergence:
   a. Assign each object to the nearest cluster center.
   b. Re-estimate the cluster centers.
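For concreteness, here is a minimal k-Means sketch in Python/NumPy (my own illustration, not the authors' code); the random-point initialization, iteration cap, and convergence test are assumptions:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-Means: X is an (n_points, n_dims) array."""
    rng = np.random.default_rng(seed)
    # 1. Initialize k cluster centers by picking random data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2a. Assign each object to the nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2b. Re-estimate cluster centers as the mean of the assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # convergence
            break
        centers = new_centers
    return labels, centers
```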

k-means example

k-means applications/usage
– Fast pre-clustering
– Real-time clustering (e.g., image and video effects)
  – Color/image segmentation

k-means objective function
Objective: minimize the sum of intra-class variances, where \mu_k denotes the centroid of cluster C_k:
  J = \sum_{k=1}^{K} \sum_{x \in C_k} \| x - \mu_k \|^2
After some algebraic manipulation, this can be written per cluster and per dimension (or time instance) d in terms of the 2nd and 1st moments of the cluster objects:
  J = \sum_{k=1}^{K} \sum_{d=1}^{D} \Big( \sum_{x \in C_k} x_d^2 - \tfrac{1}{|C_k|} \big( \sum_{x \in C_k} x_d \big)^2 \Big)
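To make the decomposition concrete, here is a small numerical check (my own sketch on toy data, not from the slides) that the objective can be computed purely from per-cluster, per-dimension first and second moments:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))           # toy data: 200 objects, 4 dimensions
labels = rng.integers(0, 3, size=200)   # an arbitrary assignment into k = 3 clusters

# Direct objective: sum of squared distances to each cluster centroid.
J_direct = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in range(3))

# Moment form: per cluster and dimension, 2nd moment minus (1st moment)^2 / |C_k|.
J_moments = 0.0
for k in range(3):
    C = X[labels == k]
    s1 = C.sum(axis=0)          # unnormalized 1st moment per dimension
    s2 = (C ** 2).sum(axis=0)   # unnormalized 2nd moment per dimension
    J_moments += (s2 - s1 ** 2 / len(C)).sum()

assert np.isclose(J_direct, J_moments)
```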

k-means objective function
So we can preserve the k-Means outcome if:
– we maintain the cluster assignment, and
– we preserve the 1st and 2nd moments of the cluster objects (per cluster and per dimension/time instance).

moment preserving quantization
1st moment: average
2nd (central) moment: variance
3rd moment: skewness
4th moment: kurtosis

In order to preserve the first and second moments we use the following quantizer:
[Figure: two reconstruction levels – everything below the mean value is 'snapped' to the lower level, everything above the mean value is 'snapped' to the upper level]
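The transcript does not spell out the reconstruction levels, so the following is only a sketch of the standard two-level (1-bit) moment-preserving choice: values below the mean snap to a lower level, values at or above the mean snap to an upper level, and the two levels are chosen so that the mean and variance of the original values are matched exactly.

```python
import numpy as np

def mpq_1bit(x):
    """Two-level moment-preserving quantizer for a 1-D array x.

    Values below the mean snap to `lo`, values at or above the mean snap
    to `hi`; the levels are chosen so that the mean and the variance
    (1st and 2nd moments) of x are preserved exactly.
    """
    mu, sigma = x.mean(), x.std()
    below = x < mu
    n_lo, n_hi = below.sum(), (~below).sum()
    if n_lo == 0 or n_hi == 0:              # degenerate case: all values equal
        return np.full_like(x, mu)
    lo = mu - sigma * np.sqrt(n_hi / n_lo)
    hi = mu + sigma * np.sqrt(n_lo / n_hi)
    return np.where(below, lo, hi)
```

A quick sanity check: for q = mpq_1bit(x), q.mean() and q.var() agree with x.mean() and x.var() up to floating-point error.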

[Figure: original vs. quantized values for one dimension – everything below the mean is snapped to the lower level, everything above to the upper level; the average of the quantized values equals the average of the original values]

These are the points of a single dimension d (or time instance d) for one cluster of objects. The process is repeated for all dimensions and for all clusters, so we have one quantizer per class.

our quantization
– One quantizer per class
– The quantized data are binary

our quantization
The fact that we have one quantizer per class means that we need to run k-Means once before we quantize.
This is not a shortcoming of the technique: we need to know the cluster boundaries to know how much we can simplify the data.
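Putting the pieces together, a hedged sketch of the overall procedure (cluster once, then quantize each cluster independently, with one quantizer per class and per dimension), reusing the hypothetical kmeans and mpq_1bit helpers sketched above:

```python
import numpy as np

def quantize_dataset(X, k):
    """Cluster X once with k-Means, then quantize each cluster independently,
    applying one 1-bit moment-preserving quantizer per dimension."""
    labels, centers = kmeans(X, k)          # initial clustering (see sketch above)
    Xq = X.astype(float).copy()
    for c in range(k):
        members = labels == c
        for d in range(X.shape[1]):         # one quantizer per class, per dimension
            Xq[members, d] = mpq_1bit(X[members, d])
    return Xq, labels, centers
```

After this step, each dimension of each cluster holds only two distinct values, i.e., one bit per value plus the two reconstruction levels per cluster and dimension.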

why quantization works?
Why does the clustering remain the same before and after quantization?
– The centers do not change (the averages remain the same).
– The cluster assignment does not change, because the clusters 'shrink' due to quantization.
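As an illustration of these two claims (my own check on synthetic, well-separated data, reusing the hypothetical quantize_dataset sketch above):

```python
import numpy as np

rng = np.random.default_rng(0)
# Three well-separated Gaussian blobs in 5 dimensions.
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(100, 5)) for m in (0.0, 3.0, 6.0)])

Xq, labels, centers = quantize_dataset(X, k=3)

# Centers do not change: per-cluster averages are preserved by the quantizer.
for c in range(3):
    assert np.allclose(X[labels == c].mean(axis=0), Xq[labels == c].mean(axis=0))

# Cluster assignment does not change: the nearest center is the same for quantized points.
d = np.linalg.norm(Xq[:, None, :] - centers[None, :, :], axis=2)
assert (d.argmin(axis=1) == labels).all()
```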

will it always work?
The results will be identical for datasets with well-formed clusters.
A discrepancy between the results indicates that the clusters were not that dense.

recap
– Use moment-preserving quantization to preserve the objective function (per-cluster, per-dimension 1st and 2nd moments).
– Due to cluster shrinkage, cluster assignments will not change.
– Identical results for optimal k-Means.
– One quantizer per class; a 1-bit quantizer per dimension.

example: shape preservation

[Bagnall06] A. J. Bagnall, C. A. Ratanamahatana, E. J. Keogh, S. Lonardi, and G. J. Janacek. A Bit Level Representation for Time Series Data Mining with Shape Based Similarity. Data Mining and Knowledge Discovery, 13(1):11–40, 2006.

example: cluster preservation
3 years of Nasdaq stock ticker data; we cluster into k = 8 clusters.
[Figure: confusion matrix]

3% mislabeled data after the moment-preserving quantization; with Binary Clipping: 80% mislabeled.
[Figure: cluster centers]

quantization levels indicate cluster spread

example: label preservation
2 datasets:
– Contours of fish
– Contours of leaves
Clustering and then k-NN voting.
For rotation invariance we use a rotation-invariant feature space (time-frequency).
[Figure: example leaf contours – Acer platanoides, Salix fragilis, Tilia, Quercus robur]

example: label preservation
– Very low mislabeling error for MPQ
– High error rate for Binary Clipping

other nice characteristics
Low sensitivity to initial centers
– The mismatch when starting from different centers is around 7%.
Neighborhood preservation
– even though we are not optimizing for that directly…
– good results because we are preserving the 'shape' of the objects.

Size reduction by a factor of 3 when using the quantized scheme.
The achievable compression decreases as K increases.

summary
– 1-bit quantizer per dimension sufficient to preserve k-Means 'as well as possible'
– Theoretically the results will be identical (under conditions)
– Good 'shape' preservation
Future work:
– Multi-bit quantization
– Multi-dimensional quantization

end..