Venkatram Ramanathan: Wavelet Transform and Co-Clustering on FREERIDE

Presentation transcript:

Venkatram Ramanathan

Outline:
- Motivation
- Evolution of Multi-Core Machines and the Challenges
- Summary of Contributions
- Background: MapReduce and FREERIDE
- Wavelet Transform on FREERIDE
- Co-clustering on FREERIDE
- Conclusion

Motivation:
- Performance increase now comes from a larger number of cores running at lower clock frequencies
- Cost-effective scalability of performance
- HPC environments: clusters of multi-cores

Multi-level parallelism:
- Across cores within a node: shared-memory parallelism (Pthreads, OpenMP)
- Across nodes: distributed-memory parallelism (MPI)
- Achieving both programmability and performance is a major challenge

A possible solution: use higher-level/restricted APIs
- Reduction-based APIs such as MapReduce
- A single high-level API can program a whole cluster of multi-cores
- Their expressive power is considered limited, so the challenge is expressing computations using reduction-based APIs

Two algorithms: Wavelet Transform and Co-Clustering
- Expressed as reduction structures and parallelized on FREERIDE
- Speedup of 42 on 64 cores for Wavelet Transform
- Speedup of 21 on 32 cores for Co-Clustering

MapReduce:
- map(in_key, in_value) -> list(out_key, intermediate_value)
- reduce(out_key, list(intermediate_value)) -> list(out_value)
FREERIDE:
- Users explicitly declare a Reduction Object and update it
- The Map and Reduce steps are combined
- Each data element is processed and reduced before the next element is processed
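To make the reduction-object idea concrete, here is a minimal sketch of the pattern in C++. The names (ReductionObject, local_reduce, global_combine) and the single-key sum are illustrative assumptions, not FREERIDE's actual API.

```cpp
// Minimal sketch of the generalized-reduction pattern described above.
// Names are illustrative; this is not FREERIDE's actual API.
#include <cstddef>
#include <vector>

struct ReductionObject {
    std::vector<double> slots;            // one accumulation slot per key
    explicit ReductionObject(std::size_t n) : slots(n, 0.0) {}
    void accumulate(std::size_t key, double v) { slots[key] += v; }
};

// Each element is folded into the reduction object immediately --
// map and reduce are combined, unlike classic MapReduce.
void local_reduce(const std::vector<double>& chunk, ReductionObject& ro) {
    for (double x : chunk)
        ro.accumulate(0, x);              // single-key sum as a toy example
}

// After all nodes/threads finish, their reduction objects are merged.
void global_combine(ReductionObject& out, const ReductionObject& in) {
    for (std::size_t i = 0; i < out.slots.size(); ++i)
        out.slots[i] += in.slots[i];
}
```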

Wavelet Transform:
- An important tool in medical imaging
- fMRI as a probing mechanism for brain activation
- Seeks to study behavior across spatio-temporal data

Discrete Wavelet Transform:
- Defined for an input of 2^n numbers
- Convolution along the time domain yields 2^n output values
- Steps (see the sketch after the example below):
  - Pair up the input values
  - Store the differences
  - Pass the sums on
  - Repeat until there are 2^n - 1 differences and 1 sum

Serial Wavelet Transform algorithm:
Input: a1, a2, a3, a4, a5, a6, a7, a8
Output:
- a1-a2, a3-a4, a5-a6, a7-a8
- a1+a2-a3-a4, a5+a6-a7-a8
- a1+a2+a3+a4-a5-a6-a7-a8
- a1+a2+a3+a4+a5+a6+a7+a8
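The steps above can be written compactly. The following is a sketch (not the thesis code) that reproduces the example's output ordering: all level-1 differences first, then each coarser level's differences, then the overall sum.

```cpp
// Unnormalized Haar-style transform matching the slide's example:
// differences are stored, sums are passed to the next level.
// Input length must be a power of 2.
#include <cstdio>
#include <vector>

std::vector<double> wavelet(std::vector<double> a) {
    std::vector<double> out;
    while (a.size() > 1) {
        std::vector<double> sums;
        for (std::size_t i = 0; i + 1 < a.size(); i += 2) {
            out.push_back(a[i] - a[i + 1]);   // store the difference
            sums.push_back(a[i] + a[i + 1]);  // pass the sum on
        }
        a = sums;                             // recurse on the sums
    }
    out.push_back(a[0]);                      // the final overall sum
    return out;                               // 2^n - 1 differences + 1 sum
}

int main() {
    std::vector<double> in = {1, 2, 3, 4, 5, 6, 7, 8};
    for (double v : wavelet(in)) std::printf("%g ", v);
    std::printf("\n");                        // -1 -1 -1 -1 -4 -4 -16 36
}
```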

Parallel algorithm, distributed memory:
- Time series length = T; number of nodes = P; time series per node = T/P
- If P is a power of 2, T/P - 1 final output values per node can be calculated locally, so T - P final values are produced without communication
- The remaining P values require inter-process communication:
  - Allocate a reduction object of size P on each node
  - Each node updates the reduction object with its contribution
  - After a global reduction, the last P values can be calculated
- Since the output is out of order, the index in the output where each final value belongs can be calculated
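The communication step can be pictured directly with MPI, assuming an all-reduce stands in for the reduction object's global combination (FREERIDE performs this internally); all names here are illustrative.

```cpp
// Sketch: each node contributes its local sum (its input to the
// cross-node wavelet levels) to a reduction object of size P.
#include <mpi.h>
#include <numeric>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Stand-in for this node's T/P chunk of the time series.
    std::vector<double> chunk(1024, rank + 1.0);
    double local_sum = std::accumulate(chunk.begin(), chunk.end(), 0.0);

    // Reduction object of size P: slot 'rank' holds this node's sum.
    std::vector<double> contrib(nprocs, 0.0), combined(nprocs, 0.0);
    contrib[rank] = local_sum;
    MPI_Allreduce(contrib.data(), combined.data(), nprocs,
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    // 'combined' now holds all P per-node sums; the remaining P output
    // values (the top log2(P) levels) are pairwise sums and differences
    // of these entries.
    MPI_Finalize();
}
```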

Hybrid parallelization:
- Input data is distributed among nodes; threads within a node share its data
- Size of the reduction object: #Threads x #Nodes
- Each thread computes local final values and updates the reduction object at slot ThreadID + (#Threads x NodeID)
- After the global combination, the last #Threads x #Nodes values are calculated from the data in the reduction object
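A sketch of the per-thread update, using OpenMP thread ids to stand in for FREERIDE's worker threads; node_id would come from the node's rank. The slot formula is the one from the slide; everything else is illustrative.

```cpp
#include <omp.h>
#include <vector>

// ro must be sized num_threads * num_nodes before the call.
void update_reduction(std::vector<double>& ro, int node_id,
                      const std::vector<double>& local_values) {
    #pragma omp parallel
    {
        int tid  = omp_get_thread_num();
        int nt   = omp_get_num_threads();
        int slot = tid + nt * node_id;    // slide's indexing formula
        double partial = 0.0;
        #pragma omp for
        for (long i = 0; i < (long)local_values.size(); ++i)
            partial += local_values[i];   // each thread sums its share
        ro[slot] += partial;              // slots are disjoint: no lock needed
    }
}
```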

The computation of the last #Threads x #Nodes values is itself parallelized:
- A local reduction step, then a global reduction step over a global array
- Size of the reduction object: #Threads in the local reduction step, #Nodes in the global reduction step

Output index computation [index formulas not captured in the transcript]:
- One formula gives the index for iteration I = 0, another for iteration I > 0
- term is the local index of the value calculated in the current iteration
- chunkid is ThreadID + (NodeID x #Threads)
- I is the current iteration

Experimental setup: a cluster of multi-core machines
- Intel Xeon CPU E5345, quad core, 2.33 GHz clock frequency, 6 GB main memory
Datasets: varying p, the dimension of the spatial cube, and s, the number of time-steps in the time series
- p = 10; s = (DS1)
- p = 32; s = 2048 (DS2)
- p = 32; s = 4096 (DS3)
- p = 32; s = 8192 (DS4)
- p = 39; s = 8192 (DS5)

Clustering: grouping together of "similar" objects
- Hard clustering: each object belongs to a single cluster
- Soft clustering: each object is probabilistically assigned to clusters

Co-clustering clusters both words and documents simultaneously.

Co-clustering:
- Involves simultaneous clustering of rows into row clusters and columns into column clusters
- Maximizes mutual information
- Uses the Kullback-Leibler divergence
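For reference, the divergence in question is D(p || q) = Σ_i p_i · log(p_i / q_i). A small generic sketch (not code from the thesis):

```cpp
// Kullback-Leibler divergence between two discrete distributions,
// the quantity the co-clustering reassignment step minimizes.
#include <cmath>
#include <vector>

double kl_divergence(const std::vector<double>& p,
                     const std::vector<double>& q) {
    double d = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i)
        if (p[i] > 0.0 && q[i] > 0.0)      // 0 * log(0/q) is taken as 0
            d += p[i] * std::log(p[i] / q[i]);
    return d;
}
```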

Data distribution:
- The input matrix and its transpose are pre-computed, divided into files, and distributed among the nodes
- Each node gets the same amount of row and column data
- rowCL and colCL, the row and column cluster assignments, are replicated on all nodes
- The initial clustering is assigned in round-robin fashion, for consistency across nodes

Preprocessing:
- pX and pY are normalized by the total sum, so normalization must wait until all nodes have processed their data
- Each node calculates pX and pY with its local data
- The reduction object is updated with the partial sums and the pX and pY values
- The accumulated partial sums give the total sum, and pX and pY are normalized
- xnorm and ynorm are calculated in a second iteration, since they need the total sum
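A sketch of this two-phase pattern, assuming a row-major local block; the struct and function names are illustrative, not the thesis code.

```cpp
// Each node folds its partial row marginals (pX) and column marginals
// (pY) into the reduction object; after the global combination, both
// are normalized by the total sum.
#include <vector>

struct Preproc {
    std::vector<double> pX, pY;   // row and column marginals
    double total = 0.0;
};

void local_contribution(const std::vector<std::vector<double>>& rows,
                        std::size_t ncols, Preproc& ro) {
    ro.pY.assign(ncols, 0.0);
    for (const auto& row : rows) {
        double rsum = 0.0;
        for (std::size_t j = 0; j < ncols; ++j) {
            rsum += row[j];
            ro.pY[j] += row[j];
        }
        ro.pX.push_back(rsum);
        ro.total += rsum;
    }
}

// Called once all nodes' objects have been combined:
void normalize(Preproc& ro) {
    for (double& v : ro.pX) v /= ro.total;
    for (double& v : ro.pY) v /= ro.total;
}
```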

Computing the compressed matrix:
- A compressed matrix of size #rowclusters x #colclusters is calculated with local data: the sum of the values of each row cluster across each column cluster
- The final compressed matrix is the sum of the local compressed matrices
- The local compressed matrices are updated in the reduction object, which produces the final compressed matrix on accumulation
- The cluster centroids are then calculated
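A sketch of the local compressed-matrix computation, reusing the rowCL/colCL assignment arrays mentioned earlier; the function name and types are illustrative.

```cpp
// Accumulate each local element into the cell for its
// (row cluster, column cluster) pair. Per-node results are then summed
// via the reduction object to give the final compressed matrix.
#include <vector>

using Matrix = std::vector<std::vector<double>>;

Matrix compress(const Matrix& local, const std::vector<int>& rowCL,
                const std::vector<int>& colCL, int nRowCl, int nColCl) {
    Matrix c(nRowCl, std::vector<double>(nColCl, 0.0));
    for (std::size_t i = 0; i < local.size(); ++i)
        for (std::size_t j = 0; j < local[i].size(); ++j)
            c[rowCL[i]][colCL[j]] += local[i][j];
    return c;
}
```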

Reassigning clusters:
- Row reassignment is determined by the Kullback-Leibler divergence, and the reduction object is updated
- The compressed matrix is recomputed and the reduction object updated again
- Column clustering proceeds similarly
- The objective function is finalized, then the next iteration begins
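A sketch of the reassignment test, reusing kl_divergence as sketched earlier (declared here for self-containment); the row distributions and centroid names are illustrative.

```cpp
#include <vector>

// As sketched earlier; declaration only.
double kl_divergence(const std::vector<double>& p,
                     const std::vector<double>& q);

// Assign a row to the row cluster whose centroid distribution is
// closest in Kullback-Leibler divergence.
int reassign_row(const std::vector<double>& row_dist,
                 const std::vector<std::vector<double>>& centroids) {
    int best = 0;
    double best_d = kl_divergence(row_dist, centroids[0]);
    for (int c = 1; c < (int)centroids.size(); ++c) {
        double d = kl_divergence(row_dist, centroids[c]);
        if (d < best_d) { best_d = d; best = c; }
    }
    return best;
}
```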

The algorithm is the same for shared-memory, distributed-memory, and hybrid parallelization. Experiments were conducted on 2 clusters:
env1:
- Intel Xeon E5345, quad core, 2.33 GHz clock frequency
- 6 GB main memory, 8 nodes
env2:
- AMD Opteron 8350 CPU, 8 cores
- 16 GB main memory, 4 nodes

2 datasets:
- 1 GB dataset: matrix dimensions 16k x 16k
- 4 GB dataset: matrix dimensions 32k x 32k
- Datasets and their transposes are split into 32 files each (row partitioning) and distributed among the nodes
- Number of row and column clusters: 4

Results:
- The preprocessing stage is the bottleneck for the smaller dataset, since it is not compute intensive, so speedup with preprocessing trails speedup without it
- For the larger dataset the preprocessing stage involves more computation and scales well; speedup is the same with and without preprocessing

Conclusion: parallelized two data-intensive applications, Wavelet Transform and Co-clustering, by
- Representing the algorithms as generalized reduction structures
- Implementing them on FREERIDE
Wavelet Transform achieved a speedup of 42 on 64 cores; Co-clustering achieved a speedup of 21 on 32 cores.
