
1 Data-Intensive Computing: From Clouds to GPUs. Gagan Agrawal. December 3, 2015.

2 Motivation  Growing need for analysis of large-scale data  Scientific  Commercial  Data-Intensive Supercomputing (DISC)  Map-Reduce has received a lot of attention  Database and data mining communities  High-performance computing community

3 Motivation (2)  Processor architecture trends  Clock speeds are not increasing  Trend towards multi-core architectures  Accelerators like GPUs are very popular  Clusters of multi-cores / GPUs are becoming common  Trend towards the cloud  Use storage/computing/services from a provider  How many of you prefer gmail over a cse/osu account for e-mail?  Utility model of computation  Need high-level APIs and adaptation

4 My Research Group  Data-intensive theme at multiple levels  Parallel programming models  Multi-cores and accelerators  Adaptive middleware  Scientific data management / workflows  Deep web integration and analysis

5 This Talk  Parallel programming API for data-intensive computing  An alternate API and system for Google's Map-Reduce  Show actual comparison  Fault-tolerance for data-intensive computing  Data-intensive computing on accelerators  Compilation for GPUs

6 Map-Reduce  Simple API for (data-intensive) parallel programming  Computation is:  Apply map to each input data element  Produce (key, value) pair(s)  Sort them by key  Apply reduce to the set of values associated with each distinct key  (A word-count sketch of this structure follows below.)
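
To make the three phases concrete, here is a minimal word-count sketch in the map-reduce style. This is our illustration, not Google's or Hadoop's code: map emits (word, 1) pairs, an explicit grouping map stands in for the runtime's sort/shuffle step, and reduce sums the values for each distinct key.

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    // Map: emit a (word, 1) pair for every word in the input element.
    static void map_fn(const std::string& line,
                       std::vector<std::pair<std::string, int>>& out) {
        std::istringstream words(line);
        std::string w;
        while (words >> w) out.push_back({w, 1});
    }

    // Reduce: sum the values collected for one distinct key.
    static int reduce_fn(const std::vector<int>& values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    int main() {
        std::vector<std::string> input = {"the cat sat", "the dog sat"};

        // Map phase: apply map to each input element.
        std::vector<std::pair<std::string, int>> pairs;
        for (const auto& line : input) map_fn(line, pairs);

        // Sort/group phase: gather the values for each distinct key.
        std::map<std::string, std::vector<int>> groups;
        for (const auto& p : pairs) groups[p.first].push_back(p.second);

        // Reduce phase: one reduce call per distinct key.
        for (const auto& g : groups)
            std::cout << g.first << ": " << reduce_fn(g.second) << "\n";
        return 0;
    }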

7 Map-Reduce Execution

8 Map-Reduce: Positives and Questions  Positives:  Simple API  Functional-language based  Very easy to learn  Support for fault-tolerance  Important for very large-scale clusters  Questions:  Performance?  Comparison with other approaches  Suitability for different classes of applications?

9 Class of Data-Intensive Applications  Many different types of applications  Data-center kind of applications  Data scans, sorting, indexing  More "compute-intensive" data-intensive applications  Machine learning, data mining, NLP  Map-reduce / Hadoop is widely used for this class  Standard database operations  A SIGMOD 2009 paper compares Hadoop with databases and OLAP systems  What is map-reduce suitable for?  What are the alternatives?  MPI/OpenMP/Pthreads: too low level?

10 FREERIDE: Goals  Developed 2001-2003 at Ohio State  Framework for Rapid Implementation of Data Mining Engines  The ability to rapidly prototype a high-performance mining implementation  Distributed memory parallelization  Shared memory parallelization  Ability to process disk-resident datasets  Only modest modifications to a sequential implementation for the above three

11 FREERIDE – Technical Basis  Popular data mining algorithms have a common canonical loop  Generalized Reduction  Can be used as the basis for supporting a common API  Demonstrated for popular data mining and scientific data processing applications  (A concrete instance follows below.)

    while ( ) {
        forall (data instances d) {
            I = process(d)
            R(I) = R(I) op f(d)
        }
        ...
    }
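
As a concrete instance of the canonical loop, consider k-means (one of the applications used in the experiments later): process(d) finds the nearest centroid for point d, and the reduction object holds a coordinate sum and a count per centroid, updated with op = +. A minimal sequential sketch, with all names ours:

    // Sketch: one k-means iteration in the generalized-reduction form.
    // reduc[i] holds the running coordinate sums and count for centroid i.
    struct Slot { double sum[3]; long count; };   // dim = 3, as in the experiments

    void kmeans_iteration(const double (*points)[3], long n,
                          const double (*centroids)[3], int k, Slot* reduc) {
        for (long d = 0; d < n; d++) {            // forall (data instances d)
            int i = 0;                            // I = process(d): nearest centroid
            double best = 1e300;
            for (int c = 0; c < k; c++) {
                double dist = 0;
                for (int x = 0; x < 3; x++) {
                    double diff = points[d][x] - centroids[c][x];
                    dist += diff * diff;
                }
                if (dist < best) { best = dist; i = c; }
            }
            for (int x = 0; x < 3; x++)           // R(I) = R(I) op f(d), op = +
                reduc[i].sum[x] += points[d][x];
            reduc[i].count += 1;
        }
    }

Because op is associative and commutative, each node or thread can keep its own copy of reduc and the copies can be merged at the end, which is what makes both the distributed-memory and shared-memory parallelizations possible from the same sequential code.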

12 Comparing Processing Structure  Similar, but with subtle differences

13 Processing Structure for FREERIDE  Basic one-stage dataflow

14 Observations on Processing Structure  Map-Reduce is based on a functional idea  It does not maintain state  This can lead to sorting overheads  The FREERIDE API is based on a programmer-managed reduction object  Not as 'clean'  But avoids sorting  Can also help shared memory parallelization  Enables better fault recovery  (The same aggregation written both ways is sketched below.)
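
The difference is easiest to see with the same toy aggregation written both ways. This is our illustration of the contrast, not code from either system:

    #include <cstdio>
    #include <map>
    #include <vector>

    // Toy example: summing values by key.
    int main() {
        std::vector<std::pair<int, int>> data = {{1, 10}, {2, 5}, {1, 7}};

        // Map-reduce style: emit (key, value) pairs; the runtime must then
        // sort/group the pair stream before reduce can run per key.
        std::multimap<int, int> emitted(data.begin(), data.end());
        // ... grouping pass over `emitted`, then reduce per distinct key ...

        // FREERIDE style: update a programmer-managed reduction object in
        // place; no intermediate pair stream and no sorting pass is needed.
        std::map<int, int> reduction_object;
        for (const auto& p : data) reduction_object[p.first] += p.second;

        for (const auto& r : reduction_object)
            std::printf("%d -> %d\n", r.first, r.second);
        return 0;
    }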

15 Experiment Design  Tuning parameters in Hadoop  Input split size  Max number of concurrent map tasks per node  Number of reduce tasks  For comparison, we used four applications  Data mining: KMeans, KNN, Apriori  Simple data scan application: Wordcount  Experiments on a multi-core cluster  8 cores per node (8 map tasks)

16 Results – Data Mining  KMeans, varying # of nodes (chart: avg. time per iteration in seconds vs. # of nodes; dataset: 6.4 GB, k = 1000, dim = 3)

17 Results – Data Mining (II)  Apriori, varying # of nodes (chart: avg. time per iteration in seconds vs. # of nodes; dataset: 900 MB, support level: 3%, confidence level: 9%)

18 Results – Data Mining (III)  KNN, varying # of nodes (chart: avg. time per iteration in seconds vs. # of nodes; dataset: 6.4 GB, k = 1000, dim = 3)

19 Results – Datacenter-like Application  Wordcount, varying # of nodes (chart: total time in seconds vs. # of nodes; dataset: 6.4 GB)

20 Scalability Comparison  KMeans, varying dataset size (chart: avg. time per iteration in seconds vs. dataset size; k = 100, dim = 3, on 8 nodes)

21 Scalability – Word Count  Wordcount, varying dataset size (chart: total time in seconds vs. dataset size; on 8 nodes)

22 Observations  Performance issues with Hadoop are now well known  How much of a factor is the API?  Java, file system  An API comparison on the same platform  Design of MATE  Map-reduce with an AlternaTE API

23 Basis: Phoenix Implementation  Shared memory map-reduce implementation from Stanford  C based  An efficient runtime that handles parallelization, resource management, and fault recovery  Extended to support a FREERIDE-like API

24 Functions  APIs provided by the runtime (R = required, O = optional):

    int mate_init(scheduler_args_t *args)                               R
    int mate_scheduler(void *args)                                      R
    int mate_finalize(void *args)                                       O
    void reduction_object_pre_init()                                    R
    int reduction_object_alloc(int size)   (returns the object id)      R
    void reduction_object_post_init()                                   R
    void accumulate/maximal/minimal(int id, int offset, void *value)    O
    void reuse_reduction_object()                                       O
    void *get_intermediate_result(int iter, int id, int offset)         O
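
A sketch of how these calls might be sequenced for one iteration of a k-means-style job. Only the function names and signatures come from the table above; the header name, the contents of the scheduler arguments, the offsets, and the way per-element accumulate() calls are hooked into the scheduler are all assumptions:

    #include "mate.h"   // assumed header exposing the API listed above

    // Hypothetical call sequence; illustrative only.
    void run_one_iteration(int k) {
        scheduler_args_t args;                 // splits, thread counts, ... (assumed fields)
        mate_init(&args);

        reduction_object_pre_init();
        int obj = reduction_object_alloc(k * (int)sizeof(double));  // k slots
        reduction_object_post_init();

        // The scheduler runs the map tasks; instead of emitting (key, value)
        // pairs, each task would call accumulate(obj, offset, &value).
        mate_scheduler(0);

        void* result = get_intermediate_result(/*iter=*/0, obj, /*offset=*/0);
        (void)result;                          // inspect/consume the result here
        reuse_reduction_object();              // reset for the next iteration
        mate_finalize(0);
    }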

25 Experiments: K-means  K-means: 400 MB, 3-dimensional points, k = 100, on one AMD node with 16 cores

26 Fault-Tolerance in FREERIDE/MATE  Map-reduce supports fault-tolerance by replicating files  Storage and processing time overheads  The FREERIDE/MATE API offers another option  The reduction object is a low-cost application-level checkpoint  Can support efficient recovery  Can also allow redistribution of work to other nodes  (A sketch of the checkpointing idea follows below.)
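
A minimal sketch of the checkpointing idea, assuming the reduction object is a contiguous buffer; the file format, checkpoint frequency, and function names are our choices, not the system's:

    #include <cstddef>
    #include <cstdio>

    // The reduction object doubles as a low-cost application-level
    // checkpoint: it is small (e.g., k centroid slots, not the input data),
    // so writing it out is far cheaper than replicating input files.
    bool checkpoint(const void* reduction_object, size_t size, const char* path) {
        std::FILE* f = std::fopen(path, "wb");
        if (!f) return false;
        bool ok = std::fwrite(reduction_object, 1, size, f) == size;
        std::fclose(f);
        return ok;
    }

    // On a failure, a surviving node restores the object and resumes from
    // the last checkpointed position in the input; the remaining work can
    // be redistributed across the other nodes.
    bool restore(void* reduction_object, size_t size, const char* path) {
        std::FILE* f = std::fopen(path, "rb");
        if (!f) return false;
        bool ok = std::fread(reduction_object, 1, size, f) == size;
        std::fclose(f);
        return ok;
    }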

27 Fault Tolerance Results

28 This Talk  Parallel programming API for data-intensive computing  An alternate API and system for Google's Map-Reduce  Show actual comparison  Fault-tolerance for data-intensive computing  Data-intensive computing on accelerators  Compilation for GPUs

29 Background – GPU Computing  Many-core architectures / accelerators are becoming more popular  GPUs are inexpensive and fast  CUDA is a high-level language for GPU programming

30 CUDA Programming  A significant improvement over using graphics libraries  But: requires detailed knowledge of the GPU architecture and a new language  Must specify the grid configuration  Must deal with memory allocation and movement  Explicit management of the memory hierarchy  (A minimal example of this bookkeeping follows below.)
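
A minimal CUDA example of the bookkeeping those bullets refer to; the kernel is trivial on purpose, and all names are ours:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Trivial kernel: each thread doubles one element.
    __global__ void scale(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float* host = new float[n];
        for (int i = 0; i < n; i++) host[i] = 1.0f;

        float* dev;
        cudaMalloc(&dev, n * sizeof(float));          // explicit device allocation
        cudaMemcpy(dev, host, n * sizeof(float),
                   cudaMemcpyHostToDevice);           // explicit data movement

        int threads = 256;                            // grid configuration must be
        int blocks = (n + threads - 1) / threads;     // chosen by the programmer
        scale<<<blocks, threads>>>(dev, n);

        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        std::printf("host[0] = %f\n", host[0]);
        cudaFree(dev);
        delete[] host;
        return 0;
    }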

31 Parallel Data Mining  Common structure of data mining applications (FREERIDE):

    /* outer sequential loop */
    while ( ) {
        /* reduction loop */
        foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }

32 Porting on GPUs  High-level parallelization is straightforward  Details of data movement  Impact of thread count on reduction time  Use of shared memory

33 Architecture of the System  (Diagram.) User input: variable information, the reduction functions, and optional functions. A code analyzer (in LLVM) and a variable analyzer extract the variable access pattern and combination operations; the code generator then emits the host program (grid configuration and kernel invocation) and the kernel functions, which are compiled into the executable.

34 User Input  A sequential reduction function  Optional functions (initialization function, combination function, ...)  Values of each variable, or the size of each array  Variables to be used in the reduction function

35 Analysis of Sequential Code  Get the access features of each variable  Determine the data to be replicated  Get the operator for the global combination  Identify the variables for shared memory

36 Memory Allocation and Copy  (Diagram: arrays A, B, and C laid out with one copy per thread, T0 ... T63.)  Need a copy for each thread  Copy the updates back to host memory after the kernel reduction function returns  (A sketch of this replication scheme follows below.)
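
A sketch of the replication scheme the diagram shows, assuming a sum reduction: every thread updates its own copy of the k-slot reduction object, so no locking is needed during the loop, and a second pass combines the copies before the result is copied back to the host. Layout and names are ours:

    #include <cuda_runtime.h>

    // copies[t * k + i] is thread t's private slot i. The copies buffer is
    // assumed zeroed (e.g., via cudaMemset) before launch.
    __global__ void reduce_private(const float* data, int n, float* copies, int k) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        int nthreads = gridDim.x * blockDim.x;
        for (int d = t; d < n; d += nthreads) {
            int i = ((int)data[d]) % k;        // stand-in for process(d)
            copies[t * k + i] += data[d];      // R(I) = R(I) op val, private copy
        }
    }

    // Global combination: merge all per-thread copies of slot i; the caller
    // then copies `result` back to host memory.
    __global__ void combine(const float* copies, int nthreads, int k, float* result) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= k) return;
        float acc = 0.0f;
        for (int t = 0; t < nthreads; t++) acc += copies[t * k + i];
        result[i] = acc;
    }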

37 Generating CUDA Code, and C++/C Code Invoking the Kernel Function  Memory allocation and copy  Thread grid configuration (block number and thread number)  Global function  Kernel reduction function  Global combination

38 Optimizations  Using shared memory (sketched below)  Providing user-specified initialization functions and combination functions  Specifying variables that are allocated once
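
A sketch of the shared-memory optimization, assuming the per-thread copies of the reduction object are small enough to fit on-chip: each block keeps its copies in shared memory during the reduction loop and writes them to global memory once at the end. Sizes and names are ours:

    #include <cuda_runtime.h>

    // Shared-memory variant: blockDim.x private copies of a k-slot object
    // live in fast on-chip memory. This only pays off when
    // blockDim.x * k floats fit within the shared-memory budget.
    __global__ void reduce_shared(const float* data, int n, float* out, int k) {
        extern __shared__ float smem[];          // blockDim.x * k floats
        float* mine = smem + threadIdx.x * k;    // this thread's private copy
        for (int i = 0; i < k; i++) mine[i] = 0.0f;

        int t = blockIdx.x * blockDim.x + threadIdx.x;
        int nthreads = gridDim.x * blockDim.x;
        for (int d = t; d < n; d += nthreads) {
            int i = ((int)data[d]) % k;          // stand-in for process(e)
            mine[i] += data[d];                  // updates stay on-chip
        }

        // One write-out per thread at the end; a separate pass performs the
        // global combination across all copies.
        for (int i = 0; i < k; i++) out[t * k + i] = mine[i];
    }

    // Launch sketch, passing the dynamic shared-memory size explicitly:
    //   reduce_shared<<<blocks, threads, threads * k * sizeof(float)>>>(dev_data, n, dev_out, k);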

39 Applications  K-means clustering  EM clustering  PCA

40 K-means Results  (Chart: speedups.)

41 Speedup of EM

42 Speedup of PCA

43 Summary  Data-intensive computing is of growing importance  One size doesn't fit all  Map-reduce has many limitations  Accelerators can be promising for achieving performance

