Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang Ma, Qing He.


Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang Ma, Qing He CloudCom, 2009 Aug 1, 2014 Kyung-Bin Lim

2 / 24 Outline  Introduction  Methodology  Results  Conclusion

3 / 24 What is clustering?  Classification of objects into different groups; more precisely, the partitioning of a data set into subsets (clusters)  The data in each subset (ideally) share some common trait, often according to some defined distance measure  Clustering is also called "grouping"

4 / 24 K-Means Clustering  The k-means algorithm clusters n objects, based on their attributes, into k partitions, where k < n  It assumes that the object attributes form a vector space  The grouping is done by minimizing the sum of squared distances between each data point and its cluster centroid

5 / 24 K-means Algorithm  For a given cluster assignment C of the data points, compute the cluster means $m_k = \frac{1}{N_k}\sum_{i:\,C(i)=k} x_i$, for $k = 1, \ldots, K$  For the current set of cluster means, assign each observation to the nearest mean: $C(i) = \arg\min_{1 \le k \le K} \lVert x_i - m_k \rVert^2$  Iterate the two steps above until the assignments no longer change
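The two-step iteration on this slide can be written out as a minimal single-machine sketch in Python (an illustration of Lloyd's algorithm, not the paper's implementation):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means sketch on tuples of coordinates."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            tuple(sum(d) / len(cl) for d in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: assignments stopped changing
            break
        centers = new_centers
    return centers
```

On two well-separated groups of points, the returned centers settle at the group means after a few iterations.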

6 / 24 K-means clustering example

7 / 24 MapReduce Programming  Framework that supports distributed computing on clusters of computers  Introduced by Google in 2004  Map step  Reduce step  Combine step (Optional)  Applications
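As a minimal illustration of the map/shuffle/reduce flow (not from the paper), the canonical word-count job can be simulated in a single Python process; the function names here are hypothetical:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map step: emit one (word, 1) pair per word in the line.
    return [(w, 1) for w in line.split()]

def reduce_fn(word, counts):
    # Reduce step: sum all counts for one key.
    return word, sum(counts)

def run_mapreduce(lines):
    """Toy single-process simulation of the MapReduce flow."""
    pairs = [kv for line in lines for kv in map_fn(line)]
    pairs.sort(key=itemgetter(0))  # stands in for the shuffle/sort phase
    return dict(
        reduce_fn(key, [v for _, v in grp])
        for key, grp in groupby(pairs, key=itemgetter(0))
    )
```

In a real framework the mappers and reducers run on different machines and the sort happens during the shuffle; the data flow is the same.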

8 / 24 MapReduce Model

9 / 24 Outline  Introduction  Methodology  Results  Conclusion

10 / 24 Parallel K-means Clustering Based on MapReduce

11 / 24 Map Function

12 / 24 Map Function  The input dataset is a sequence file of <key, value> pairs  The dataset is split across the mappers, and the current cluster centers are globally broadcast to all of them  Output: – key = index of the closest center point – value = string comprising the values of the point's dimensions
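The per-point work of the map function can be sketched as follows (a hedged sketch: `centers` stands in for however the broadcast variable is exposed, which the slide does not specify):

```python
def kmeans_map(point, centers):
    """Map step sketch: emit (index of nearest center, point).

    `point` is a tuple of coordinates; `centers` is the globally
    broadcast list of current cluster centers.
    """
    nearest = min(
        range(len(centers)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centers[j])),
    )
    return nearest, point
```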

13 / 24 Combine Function  Partially sum the values of the points assigned to the same cluster
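A sketch of that partial summation (emitting a coordinate-wise sum plus a count per cluster, so the reducer can later form an exact mean):

```python
def kmeans_combine(key, points):
    """Combine step sketch: partially sum points assigned to one cluster.

    Emits (cluster index, (coordinate-wise sum, point count)), which cuts
    the data shuffled to the reducers from one record per point to one
    partial record per mapper per cluster.
    """
    total = tuple(sum(d) for d in zip(*points))
    return key, (total, len(points))
```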

14 / 24 Reduce Function  Sum the partial sums and counts of all samples assigned to the same cluster, then divide → get the new centers for the next iteration
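Continuing the sketch, the reducer merges the combiners' partial records for one cluster and divides to produce the new center:

```python
def kmeans_reduce(key, partials):
    """Reduce step sketch: merge partial (sum, count) records for one
    cluster and emit (cluster index, new center)."""
    dims = len(partials[0][0])
    total = [0.0] * dims
    count = 0
    for vec, n in partials:
        count += n
        for i, v in enumerate(vec):
            total[i] += v
    new_center = tuple(t / count for t in total)
    return key, new_center
```

The driver would then compare the new centers with the old ones and launch another MapReduce iteration until they converge.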

15 / 24 Map  (diagram: eight points a–h split across four map tasks, each point labeled with its nearest center, A or B)

16 / 24 Combine  (diagram: each combiner partially sums the points it received for centers A and B)

17 / 24 Reduce  (diagram: after the shuffle, two reducers merge the partial sums and compute the new centers A = (26/4, 26/4) and B = (14/4, 12/4))

18 / 24 Outline  Introduction  Methodology  Results  Conclusion

19 / 24 Experimental Setup  Hadoop  Cluster of machines – Each with two 2.8 GHz cores and 4GB memory  Java 1.5.0_14

20 / 24 Speedup

21 / 24 Scaleup  The ability of an m-times larger system to perform an m-times larger job in the same running time

22 / 24 Sizeup  Fixes the number of computers and measures how the running time grows as the dataset becomes m times larger

23 / 24 Outline  Introduction  Methodology  Results  Conclusion

24 / 24 Conclusion  A simple and fast MapReduce solution to the clustering problem  The results show the algorithm can process large datasets effectively, in terms of – Speedup – Scaleup – Sizeup