Data-Intensive Computing: From Clouds to GPUs
Gagan Agrawal
December 3, 2015


Motivation
- Growing need for analysis of large-scale data
  - Scientific
  - Commercial
- Data-Intensive Supercomputing (DISC)
- Map-Reduce has received a lot of attention
  - Database and data mining communities
  - High performance computing community

Motivation (2)
- Processor architecture trends
  - Clock speeds are not increasing
  - Trend towards multi-core architectures
  - Accelerators like GPUs are very popular
  - Clusters of multi-cores / GPUs are becoming common
- Trend towards the cloud
  - Use storage/computing/services from a provider
  - How many of you prefer gmail over a cse/osu account for email?
  - Utility model of computation
  - Need high-level APIs and adaptation

My Research Group
- Data-intensive theme at multiple levels:
  - Parallel programming models
  - Multi-cores and accelerators
  - Adaptive middleware
  - Scientific data management / workflows
  - Deep Web integration and analysis

This Talk
- Parallel programming API for data-intensive computing
  - An alternate API and system for Google's Map-Reduce
  - Show actual comparison
  - Fault-tolerance for data-intensive computing
- Data-intensive computing on accelerators
  - Compilation for GPUs

Map-Reduce
- Simple API for (data-intensive) parallel programming
- Computation is:
  - Apply map on each input data element
  - Produce (key, value) pair(s)
  - Sort them using the key
  - Apply reduce on each set of values with a distinct key
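A minimal CPU-only sketch of this pipeline, using word count as the canonical example. The function names and the use of an ordered `std::map` to play the role of the sort are illustrative choices, not part of any particular framework:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// map: emit one (key, value) pair per input word.
std::vector<std::pair<std::string, int>>
map_phase(const std::vector<std::string>& words) {
    std::vector<std::pair<std::string, int>> pairs;
    for (const auto& w : words)
        pairs.emplace_back(w, 1);          // produce (key, value) pair(s)
    return pairs;
}

// shuffle + reduce: group pairs by key, then sum each group's values.
std::map<std::string, int>
reduce_phase(const std::vector<std::pair<std::string, int>>& pairs) {
    std::map<std::string, int> groups;     // ordered map stands in for the sort
    for (const auto& [key, value] : pairs)
        groups[key] += value;              // reduce over one distinct key
    return groups;
}
```

Note that the runtime state between map and reduce is exactly the sorted/grouped set of (key, value) pairs, which is the source of the sorting overheads discussed later.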

Map-Reduce Execution (figure: execution overview)

Map-Reduce: Positives and Questions
- Positives:
  - Simple API
  - Functional language based
  - Very easy to learn
  - Support for fault-tolerance
  - Important for very large-scale clusters
- Questions:
  - Performance? Comparison with other approaches
  - Suitability for different classes of applications?

Class of Data-Intensive Applications
- Many different types of applications
  - Data-center kind of applications: data scans, sorting, indexing
  - More "compute-intensive" data-intensive applications: machine learning, data mining, NLP
  - Map-reduce / Hadoop being widely used for this class
  - Standard database operations
  - SIGMOD 2009 paper compares Hadoop with databases and OLAP systems
- What is map-reduce suitable for? What are the alternatives?
  - MPI/OpenMP/Pthreads: too low level?

FREERIDE: Goals
- Developed at Ohio State
- Framework for Rapid Implementation of Data mining Engines
- The ability to rapidly prototype a high-performance mining implementation:
  - Distributed memory parallelization
  - Shared memory parallelization
  - Ability to process disk-resident datasets
- Only modest modifications to a sequential implementation for the above three

FREERIDE: Technical Basis
- Popular data mining algorithms have a common canonical loop: generalized reduction
- Can be used as the basis for supporting a common API
- Demonstrated for popular data mining and scientific data processing applications

While( ) {
  forall (data instances d) {
    I = process(d)
    R(I) = R(I) op f(d)
  }
  .......
}

Comparing Processing Structure
- Similar, but with subtle differences (figure)

Processing Structure for FREERIDE
- Basic one-stage dataflow (figure)

Observations on Processing Structure
- Map-Reduce is based on a functional idea
  - Does not maintain state
  - This can lead to sorting overheads
- The FREERIDE API is based on a programmer-managed reduction object
  - Not as 'clean'
  - But avoids sorting
  - Can also help shared memory parallelization
  - Helps better fault-recovery

Experiment Design
- Tuning parameters in Hadoop:
  - Input split size
  - Max number of concurrent map tasks per node
  - Number of reduce tasks
- For comparison, we used four applications
  - Data mining: KMeans, KNN, Apriori
  - Simple data scan application: Wordcount
- Experiments on a multi-core cluster
  - 8 cores per node (8 map tasks)

Results – Data Mining
- KMeans: varying # of nodes (figure: avg. time per iteration (sec) vs. # of nodes; dataset: 6.4 GB, k = 1000, dim = 3)

Results – Data Mining (II)
- Apriori: varying # of nodes (figure: avg. time per iteration (sec) vs. # of nodes; dataset: 900 MB, support level: 3%, confidence level: 9%)

Results – Data Mining (III)
- KNN: varying # of nodes (figure: avg. time per iteration (sec) vs. # of nodes; dataset: 6.4 GB, k = 1000, dim = 3)

Results – Datacenter-like Application
- Wordcount: varying # of nodes (figure: total time (sec) vs. # of nodes; dataset: 6.4 GB)

Scalability Comparison
- KMeans: varying dataset size (figure: avg. time per iteration (sec) vs. dataset size; k = 100, dim = 3, on 8 nodes)

Scalability – Word Count
- Wordcount: varying dataset size (figure: total time (sec) vs. dataset size; on 8 nodes)

Observations
- Performance issues with Hadoop are now well known
- How much of a factor is the API?
  - Java, file system
- API comparison on the same platform
- Design of MATE: Map-reduce with an AlternaTE API

Basis: Phoenix Implementation
- Shared memory map-reduce implementation from Stanford
- C based
- An efficient runtime that handles parallelization, resource management, and fault recovery
- Extended to support a FREERIDE-like API

Functions
APIs provided by the runtime (R = required, O = optional):
- int mate_init(scheduler_args_t *args) (R)
- int mate_scheduler(void *args) (R)
- int mate_finalize(void *args) (O)
- void reduction_object_pre_init() (R)
- int reduction_object_alloc(int size) (R), returns the object id
- void reduction_object_post_init() (R)
- void accumulate/maximal/minimal(int id, int offset, void *value) (O)
- void reuse_reduction_object() (O)
- void *get_intermediate_result(int iter, int id, int offset) (O)

Experiments: K-means
- K-means: 400 MB, 3-dim points, k = 100, on one AMD node with 16 cores (figure: performance comparison)

Fault-Tolerance in FREERIDE/MATE
- Map-reduce supports fault-tolerance by replicating files
  - Storage and processing time overheads
- The FREERIDE/MATE API offers another option
  - The reduction object is a low-cost application-level checkpoint
  - Can support efficient recovery
  - Can also allow redistribution of work on other nodes

Fault Tolerance Results (figure)

This Talk
- Parallel programming API for data-intensive computing
  - An alternate API and system for Google's Map-Reduce
  - Show actual comparison
  - Fault-tolerance for data-intensive computing
- Data-intensive computing on accelerators
  - Compilation for GPUs

Background: GPU Computing
- Many-core architectures/accelerators are becoming more popular
- GPUs are inexpensive and fast
- CUDA is a high-level language for GPU programming

CUDA Programming
- Significant improvement over the use of graphics libraries
- But: need detailed knowledge of the GPU architecture and a new language
  - Must specify the grid configuration
  - Deal with memory allocation and movement
  - Explicit management of the memory hierarchy

Parallel Data Mining
Common structure of data mining applications (FREERIDE):

/* outer sequential loop */
while() {
  /* Reduction loop */
  foreach (element e) {
    (i, val) = process(e);
    Reduc(i) = Reduc(i) op val;
  }
}
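K-means, one of the applications evaluated in this talk, fits this structure directly. A CPU sketch of one reduction pass over 1-D points (illustrative only, not the generated GPU code): process(e) finds the nearest center i, and the reduction object accumulates per-center sums and counts.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reduction object for one k-means pass: per-center sum of assigned
// points and per-center count, updated with the + operator.
struct ReductionObject {
    std::vector<double> sum;
    std::vector<int> count;
};

ReductionObject kmeans_pass(const std::vector<double>& points,
                            const std::vector<double>& centers) {
    ReductionObject R{std::vector<double>(centers.size(), 0.0),
                      std::vector<int>(centers.size(), 0)};
    for (double e : points) {                 // foreach (element e)
        std::size_t i = 0;                    // (i, val) = process(e)
        for (std::size_t c = 1; c < centers.size(); ++c)
            if (std::fabs(e - centers[c]) < std::fabs(e - centers[i]))
                i = c;
        R.sum[i] += e;                        // Reduc(i) = Reduc(i) op val
        R.count[i] += 1;
    }
    return R;
}
```

The outer sequential loop of k-means recomputes each center as sum[i] / count[i] and repeats the pass until convergence.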

Porting on GPUs
- High-level parallelization is straightforward
- Details of data movement
- Impact of thread count on reduction time
- Use of shared memory

Architecture of the System
- User input: variable information, reduction functions, optional functions
- Code analyzer (in LLVM): variable analyzer and code generator
  - Produces the variable access pattern and combination operations
- Output: host program (grid configuration and kernel invocation) and kernel functions, compiled into an executable

User Input
- A sequential reduction function
- Optional functions (initialization function, combination function, ...)
- Values of each variable or the size of each array
- Variables to be used in the reduction function

Analysis of Sequential Code
- Get the access features of each variable
- Determine the data to be replicated
- Get the operator for global combination
- Identify variables for shared memory

Memory Allocation and Copy
- Variables updated in the reduction need a copy for each thread (figure: per-thread copies of arrays A, B, C across threads T0..T63)
- Copy the updates back to host memory after the kernel reduction function returns
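The replicate-and-combine scheme can be sketched on the CPU (illustrative only; on the GPU the private copies live in device or shared memory and the threads run concurrently): each thread updates its own copy of the reduction object without synchronization, and a global combination merges the copies afterwards.

```cpp
#include <cstddef>
#include <vector>

// Each "thread" gets a private copy of the reduction object (here, a
// histogram) so updates are race-free; a global combination merges them.
std::vector<int> replicated_reduction(const std::vector<int>& data,
                                      int num_threads, int num_bins) {
    // One private copy of the reduction object per thread.
    std::vector<std::vector<int>> copies(num_threads,
                                         std::vector<int>(num_bins, 0));
    for (std::size_t i = 0; i < data.size(); ++i) {
        int t = static_cast<int>(i) % num_threads;   // owning thread
        copies[t][data[i] % num_bins] += 1;          // race-free update
    }
    // Global combination: fold the per-thread copies together.
    std::vector<int> combined(num_bins, 0);
    for (const auto& c : copies)
        for (int b = 0; b < num_bins; ++b)
            combined[b] += c[b];
    return combined;
}
```

The combination step is valid because the reduction operator is associative and commutative, which is exactly the property the code analyzer checks for before replicating a variable.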

Generating CUDA Code and C++/C Code Invoking the Kernel Function
- Memory allocation and copy
- Thread grid configuration (block number and thread number)
- Global function
- Kernel reduction function
- Global combination

Optimizations
- Using shared memory
- Providing user-specified initialization functions and combination functions
- Specifying variables that are allocated once

Applications
- K-means clustering
- EM clustering
- PCA

K-means Results (figure: speedups)

Speedup of EM (figure)

Speedup of PCA (figure)

Summary
- Data-intensive computing is of growing importance
- One size doesn't fit all
- Map-reduce has many limitations
- Accelerators can be promising for achieving performance