Fast and Exact K-Means Clustering

Fast and Exact K-Means Clustering
Ruoming Jin, Anjan Goswami, Gagan Agrawal (The Ohio State University)
Hello, everyone, my name is Ruoming Jin. Today, I will present the paper "Fast and Exact K-Means Clustering".

Mining Out-of-Core Datasets
- The need to efficiently process disk-resident datasets: in many cases, the data is too large to fit into main memory.
- The processor-memory performance gap, and consequently the processor-secondary-storage gap, keeps widening: processor performance improves roughly 50% per year (Moore's law), while the latency gap between disks (~5 ms) and DRAM (~50 ns) spans several orders of magnitude.
- The problem: most mining algorithms are I/O (data) intensive, and many, such as decision tree construction and k-means clustering, have to scan or rewrite the dataset many times.
- Some remedies: approximate mining algorithms; working on samples.
- How can we develop efficient out-of-core mining algorithms without losing accuracy?

Processor/Disk Race
How can we let Carl Lewis (the processor) do more of the running, so that the turtle (the disk) has less distance to cover?

Sampling-Based Approach
- Use samples to obtain approximate results or information.
- Find criteria to test or estimate the accuracy of the sample results, and prune the statistically incorrect ones.
- Scan the complete dataset, collecting the information needed (guided by the approximate results) to derive the exact final results.
- If the estimate from the sample is wrong and causes a false pruning, a re-scan is needed.

Applications of this Approach
- Scaling and parallelizing decision tree construction: uses RainForest (RF-read) as the basis; Statistically Pruning Intervals for Enhanced Scalability (SPIES), SDM 2003; reduces both memory and communication requirements.
- Fast and Exact K-means (this paper).
- Distributed, Fast, and Exact K-means (submitted).

K-means Algorithm
- An iterative procedure that alternates between assigning points to their closest center and shifting each center to the mean of its assigned points.
- In-core datasets: acceleration using kd-trees (Pelleg and Moore).
- Out-of-core datasets: single-pass approximate algorithms by Bradley and Fayyad, by Farnstrom and colleagues, and by Domingos and Hulten.
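To make the baseline concrete, here is a minimal sketch of the standard (Lloyd's) k-means iteration that FEKM aims to reproduce exactly. It is not from the paper; it assumes the data fits in a NumPy array, and the function and parameter names are illustrative.

```python
import numpy as np

def kmeans(data, k, max_iters=100, tol=1e-6, seed=0):
    """Standard (Lloyd's) k-means: iterative assignment and center shifting."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: each point goes to its closest center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Shifting step: each center moves to the mean of its assigned points.
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```

On a disk-resident dataset, every one of these iterations is a full scan, which is exactly the cost FEKM tries to avoid.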

The Problem
Can we design an algorithm that requires fewer passes over the entire dataset yet produces the same results as the original k-means?
Fast and Exact K-Means Algorithm (FEKM):
- Typically requires only one, or a small number of, passes over the entire dataset.
- Provably produces the same cluster centers as reported by the original k-means algorithm.
- Experimental results on a number of real and synthetic datasets show speedups between a factor of 2 and 4.5 compared to k-means.

Basic Ideas of FEKM
- Run the original k-means algorithm on a sample; store the centers computed after each iteration on the sampled data.
- Build a confidence radius for every cluster center at every iteration: an estimate of the difference between the sample-derived center and the corresponding exact k-means center.
- Apply the confidence radii to find the points whose center assignment is likely to differ when k-means is run on the complete dataset (see the sketch below).
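The following is a minimal sketch of this first, sample-only phase. It records the centers and a confidence radius per cluster per iteration. The names are hypothetical, and the confidence_radius function is only an illustrative stand-in; the paper derives its own estimator (see the later slides on confidence radius).

```python
import numpy as np

def confidence_radius(points, center):
    # Illustrative placeholder only: a standard-error-style estimate of how far
    # the sample center may drift from the exact k-means center. The paper uses
    # its own bound and heuristics; this is not that formula.
    if len(points) == 0:
        return 0.0
    return float(np.linalg.norm(points.std(axis=0)) / np.sqrt(len(points)))

def fekm_sample_phase(sample, k, max_iters=100, seed=0):
    """Run k-means on the sample, recording centers and confidence radii per iteration."""
    rng = np.random.default_rng(seed)
    centers = sample[rng.choice(len(sample), size=k, replace=False)]
    history = []  # one (centers, radii) entry per iteration of k-means on the sample
    for _ in range(max_iters):
        dists = np.linalg.norm(sample[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers, radii = [], []
        for j in range(k):
            members = sample[labels == j]
            c = members.mean(axis=0) if len(members) else centers[j]
            new_centers.append(c)
            radii.append(confidence_radius(members, c))
        new_centers = np.array(new_centers)
        history.append((new_centers.copy(), np.array(radii)))
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return history
```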

Sampling the Datasets
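The slides do not specify how the sample is drawn from the disk-resident dataset. One standard way to take a uniform random sample in a single sequential scan is reservoir sampling, sketched below under that assumption; the chunked reading interface is hypothetical.

```python
import numpy as np

def reservoir_sample(chunks, sample_size, seed=0):
    """Uniform random sample of a dataset streamed from disk (reservoir sampling)."""
    rng = np.random.default_rng(seed)
    reservoir = []
    seen = 0
    for chunk in chunks:                    # e.g. blocks of rows read sequentially from disk
        for x in chunk:
            seen += 1
            if len(reservoir) < sample_size:
                reservoir.append(np.asarray(x, dtype=float))
            else:
                j = rng.integers(0, seen)   # keep new point with probability sample_size/seen
                if j < sample_size:
                    reservoir[j] = np.asarray(x, dtype=float)
    return np.array(reservoir)
```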

K-means Clustering on Samples
Confidence radius: an estimate of an upper bound on the distance between a sample-derived center and the corresponding exact k-means center.

Boundary Points
Let C1 be the closest sample center to a point, at distance d1 and with confidence radius δ1, and let C2 be another center at distance d2 with confidence radius δ2. The point is a boundary point if |d1 - d2| < δ1 + δ2: another center is close enough, relative to the confidence radii, that the true assignment in exact k-means could differ.
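The test from the slide translates directly into a small check; this sketch (with assumed function and parameter names) is reused by the full-dataset pass shown later.

```python
import numpy as np

def is_boundary_point(x, centers, radii):
    """Return True if some other center is close enough to the closest center,
    relative to the confidence radii, that the exact assignment could differ."""
    d = np.linalg.norm(centers - x, axis=1)
    c1 = d.argmin()
    for c2 in range(len(centers)):
        if c2 != c1 and abs(d[c1] - d[c2]) < radii[c1] + radii[c2]:
            return True
    return False
```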

Processing the Complete Dataset
- Identify whether each point is a boundary point.
- Boundary points are simply kept in main memory.
- Other points (stable points) are assigned to the closest sample-derived centers for each possible iteration.
- Sufficient statistics for the stable points are cached in a CA-table of size (number of clusters) x (number of iterations) x (dimension). A sketch of this pass follows.
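A minimal sketch of this single pass, assuming the per-iteration (centers, radii) history from the sample phase above and reusing the is_boundary_point test from the previous sketch; here the CA-table holds per-iteration, per-cluster sums and counts for the stable points, and the iteration over disk chunks is an assumption about the I/O layer.

```python
import numpy as np

def process_full_dataset(data_chunks, history):
    """One pass over the disk-resident data: retain boundary points in memory and
    fold stable points into per-iteration sufficient statistics (the CA-table)."""
    n_iters = len(history)
    k, dim = history[0][0].shape
    ca_sum = np.zeros((n_iters, k, dim))          # per-iteration, per-cluster sums
    ca_count = np.zeros((n_iters, k), dtype=np.int64)
    boundary = []                                 # boundary points kept in main memory
    for chunk in data_chunks:                     # chunks read sequentially from disk
        for x in chunk:
            if any(is_boundary_point(x, centers, radii) for centers, radii in history):
                boundary.append(x)
                continue
            # Stable point: assign to its closest sample center in every iteration
            # and accumulate the sufficient statistics instead of storing the point.
            for t, (centers, _) in enumerate(history):
                j = np.linalg.norm(centers - x, axis=1).argmin()
                ca_sum[t, j] += x
                ca_count[t, j] += 1
    return ca_sum, ca_count, np.array(boundary)
```

The exact centers for each iteration can then be recovered from the CA-table plus the retained boundary points, without another scan unless a false pruning forces a re-scan.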

Confidence Radius
- Computation of the confidence radius.
- Trade-off between a large radius (more points classified as boundary points, hence more memory) and a small radius (higher chance of a false pruning and a re-scan).
- Heuristics are used in practice.

Discussion
- Correctness: FEKM is guaranteed to find the same clusters as the original k-means.
- Performance analysis: the cost is determined by the number of passes over the dataset.

Experimental Setup and Datasets
- Machines: 700 MHz Pentium processors, 1 GB memory.
- Synthetic datasets: similar to those used by Bradley et al.; 18 datasets of 1.1 GB and 2 of 4.4 GB; 5, 10, 20, 50, 100, and 200 dimensions; 5, 10, and 20 clusters.
- Real datasets (UCI ML archive): KDDCup99 (38 dimensions, 1.8 GB, k=5); Corel image (32 dimensions, 1.9 GB, k=16); Reuters text database (258 dimensions, 2 GB, k=25); super-sampled and normalized to [0,1].

Performance of k-means and FEKM Algorithms on Synthetic Datasets
Running time in seconds with 20 clusters, on the 1.1 GB and 4.4 GB synthetic datasets. Table columns: dataset size, number of dimensions, number of iterations, time of k-means, time of FEKM, sample size (%), and number of passes. For example, on the 1.1 GB, 200-dimension dataset, 100 iterations of k-means took 54862.33 s, while FEKM took 27388.85 s using a 10% sample and 2 passes.

Performance of k-means and FEKM Algorithms on Real Datasets
Running time in seconds and the squared error between the final centers and the centers obtained after sampling. Table columns: dataset, number of iterations, time of k-means, time of FEKM, sample size (%), number of passes, and squared error. For example, on KDDCup99 (19 iterations), k-means took 7151 s while FEKM took 2317 s with a 10% sample and 2 passes, with squared error 4.0; on Corel (43 iterations), 28442 s versus 10503 s; on Reuters (20 iterations), 41290 s versus 10311 s.

Summary
- Both algorithms (SPIES and FEKM) use information derived from samples to decide what needs to be cached, summarized, or dropped from the complete dataset: SPIES constructs detailed class histograms for the unpruned intervals, while FEKM caches the sufficient statistics for stable points and stores the boundary points.
- Both algorithms achieve significant performance gains without losing any accuracy.