PDAC-10
Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds
Gagan Agrawal, Ohio State University
(Joint work with Tekin Bicer, David Chiu, Yu Su, ..)

Motivation
Cloud Resources
Pay-as-you-go
Elasticity
Black boxes from a performance viewpoint
Scientific Data
–Specialized formats, like NetCDF, HDF5, etc.
–Very large scale

Ongoing Work at Ohio State
MATE-EC2: Middleware for Data-Intensive Computing on EC2
–Alternative to Amazon Elastic MapReduce
Data Management Solutions for Scientific Datasets
–Target NetCDF and HDF5
Accelerating Data Mining Computations Using Accelerators
Resource Allocation Problems on Clouds

MATE-EC2: Motivation
MATE: MapReduce with an Alternate API
MATE-EC2: Implementation for AWS Environments
Cloud resources are black boxes
Need for services and tools that can…
–get the most out of cloud resources
–help their users with easy APIs

MATE vs. Map-Reduce Processing Structure
The reduction object represents the intermediate state of the execution.
The reduce function is commutative and associative.
Sorting, grouping, and similar overheads are eliminated by the reduction function and reduction object.
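The reduction-object idea can be illustrated with a minimal sketch. This is not the MATE API: the function names (`local_reduction`, `merge`, `kmeans_step`) and the 1-D k-means setting are assumptions chosen for illustration. Each chunk updates a small accumulator directly, and accumulators are combined with a commutative, associative merge, so no intermediate key-value pairs are sorted or grouped.

```python
from functools import reduce


def nearest(point, centroids):
    """Index of the centroid closest to a 1-D point."""
    return min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))


def new_reduction_object(k):
    # One [sum, count] accumulator per centroid: the intermediate state.
    return [[0.0, 0] for _ in range(k)]


def local_reduction(chunk, centroids):
    """Process one chunk, updating a reduction object element by element."""
    robj = new_reduction_object(len(centroids))
    for point in chunk:
        i = nearest(point, centroids)
        robj[i][0] += point
        robj[i][1] += 1
    return robj


def merge(a, b):
    """Combine two reduction objects; the result does not depend on order."""
    return [[sa + sb, ca + cb] for (sa, ca), (sb, cb) in zip(a, b)]


def kmeans_step(chunks, centroids):
    """One k-means iteration: local reductions, then a global merge."""
    partials = [local_reduction(c, centroids) for c in chunks]
    total = reduce(merge, partials, new_reduction_object(len(centroids)))
    # Keep the old centroid if no point was assigned to it.
    return [s / c if c else centroids[i] for i, (s, c) in enumerate(total)]
```

Because `merge` is commutative and associative, the partial results can be combined in whatever order workers finish.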

MATE-EC2 Design
Data organization
–Three levels: Buckets, Chunks and Units
–Metadata information
Chunk Retrieval
–Threaded Data Retrieval
–Selective Job Assignment
Load Balancing and handling heterogeneity
–Pooling mechanism

MATE-EC2 Processing Flow
[Figure labels: EC2 Master Node, Job Scheduler, Metadata File, EC2 Slave Node, Computing Layer, retrieval threads T0 to T3, S3 Data Object with chunks C0 ... C5 ... Cn]
Steps shown in the figure, in order:
–Request job from Master Node
–C0 is assigned as job
–Retrieve chunk pieces and write them into the buffer
–Pass retrieved chunk to Computing Layer and process
–Request another job
–C5 is assigned as a job
–Retrieve the new job
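The flow above can be sketched roughly as follows. Everything here is a stand-in: `STORE` is a hypothetical in-memory substitute for S3, `fetch_range` simulates a ranged read, and the job pool is a local queue rather than a remote master node. The sketch only shows the shape of the loop: request a chunk, retrieve its pieces with several threads into one buffer, process it, and request the next.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for S3: chunk id -> bytes.
STORE = {f"C{i}": bytes(range(i, i + 16)) for i in range(4)}


def fetch_range(chunk_id, start, end):
    """Fetch one piece of a chunk (simulates a ranged read)."""
    return start, STORE[chunk_id][start:end]


def retrieve_chunk(chunk_id, size, n_threads=4):
    """Retrieve a chunk as up to n_threads pieces written into one buffer."""
    buf = bytearray(size)
    piece = max(1, -(-size // n_threads))  # ceiling division, at least 1
    ranges = [(s, min(s + piece, size)) for s in range(0, size, piece)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for start, data in pool.map(lambda r: fetch_range(chunk_id, *r), ranges):
            buf[start:start + len(data)] = data
    return bytes(buf)


def slave_loop(jobs, process):
    """Request jobs until the pool is empty; retrieve, then process each chunk."""
    results = {}
    while True:
        try:
            chunk_id = jobs.get_nowait()  # plays the role of "request job"
        except queue.Empty:
            return results
        data = retrieve_chunk(chunk_id, len(STORE[chunk_id]))
        results[chunk_id] = process(data)


jobs = queue.Queue()
for cid in STORE:
    jobs.put(cid)
totals = slave_loop(jobs, sum)  # per-chunk byte sums
```

Because workers pull from a shared pool instead of receiving a fixed assignment, a faster instance simply ends up taking more chunks.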

Experiments
Goals:
–Finding the most suitable setting for AWS
–Performance of MATE-EC2 on heterogeneous and homogeneous environments
–Performance comparison of MATE-EC2 and Map-Reduce
Applications: KMeans and PCA
Used Resources:
–4 Large EC2 instances for processing, 1 Large instance for Master
–16 Data objects on S3 (8.2GB total data set for both applications)

Different Data Chunk Sizes
KMeans, 16 retrieval threads
Performance increase
–8M vs. others: 1.13 to 1.30
–1-thread vs. 16-thread versions: 1.24 to 1.81

Different Numbers of Threads
128MB chunk size
Performance increase in figure (KMeans)
–1.37 to 1.90
Performance increase for PCA
–1.38 to 1.71

Selective Job Assignment
Performance increase in figure (KMeans)
–1.01 to 1.14
For PCA
–1.19 to 1.68

Heterogeneous Environment
L: Large instances
S: Small instances
128MB chunk size
Overheads in figure (KMeans)
–Under 1%
Overheads for PCA
–1.1 to 11.7

MATE-EC2 vs. Map-Reduce
Scalability (MATE)
–Efficiency: 90%
Scalability (MR)
–Efficiency: 74%
Speedups:
–MATE vs. MR: 3.54 to 4.58

MATE-EC2: Continuing Directions
Cloud Bursting
–Cloud as a Complement or On-Demand Alternative to Local Resources
Autotuning for a New Cloud Environment
–Data storage can be a black box
Data-Intensive Applications on Clusters of GPUs
–Programming Model, System Design

Outline
MATE-EC2: Middleware for Data-Intensive Computing on EC2
–Alternative to Amazon Elastic MapReduce
Data Management Solutions for Scientific Datasets
–Target NetCDF and HDF5
Accelerating Data Mining Computations Using Accelerators
Resource Allocation Problems on Clouds

Data Management: Motivation
Datasets are becoming extremely large
Scientific datasets are in formats like NetCDF and HDF5
Existing database solutions are not scalable
–Can't help with native data formats

Data Management: Use Scenarios
Data Dissemination Efforts
–Support User-Defined Subsetting and Data Aggregation
Implementing Data Processing Applications
–Higher-level API than NetCDF/HDF5 libraries
Visualization Tools (ParaView etc.)
–Data Format Conversion on Large Datasets

Initial Prototype: Data Subsetting With Relational View on NetCDF
[Figure labels, in order:]
–Parse the SQL expression
–Metadata for NetCDF dataset
–Generate data access code
–Filter variable value
–Filter dimensions
–Partition tasks and assign to slave processes
–Execute query

Metadata Descriptor
Dataset Storage Description
–Lists the nodes and the directories where the data is resident
Dataset Layout Description
–Header part of each NetCDF file
Naturally included in the NetCDF dataset
Saves the effort of generating the metadata
–Describes the layout of each NetCDF file

Pre-filter and Post-filter
Pre-filter:
–Takes SQL grammar and metadata as input
–Filters based on the dimensions of the variable
–Supports both direct dimensions and coordinate variables
Post-filter:
–Filters based on the variable value
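A rough sketch of the two filtering stages, under simplifying assumptions: the variable is a 2-D nested list indexed by (cell, layer) rather than a NetCDF file, the SQL has already been parsed into per-dimension index bounds, and only direct dimension bounds are handled (coordinate variables are not). The function names are hypothetical.

```python
def prefilter(dim_sizes, dim_bounds):
    """Turn per-dimension bounds into index ranges before any data is read.

    dim_bounds maps a dimension name to an inclusive (low, high) index pair;
    dimensions without a bound are read in full.
    """
    ranges = {}
    for name, size in dim_sizes.items():
        low, high = dim_bounds.get(name, (0, size - 1))
        ranges[name] = range(max(low, 0), min(high, size - 1) + 1)
    return ranges


def read_subset(data, ranges):
    """Read only the selected (cell, layer) entries of a 2-D variable."""
    for c in ranges["cells"]:
        for l in ranges["layers"]:
            yield (c, l), data[c][l]


def postfilter(rows, predicate):
    """Keep rows whose variable value satisfies the predicate."""
    return [(idx, v) for idx, v in rows if predicate(v)]


# Toy variable: 4 cells x 3 layers, value = 100 * cell + layer.
pressure = [[100 * c + l for l in range(3)] for c in range(4)]

# Roughly: SELECT pressure ... WHERE cells<=1 AND layers<2 AND pressure>100
ranges = prefilter({"cells": 4, "layers": 3}, {"cells": (0, 1), "layers": (0, 1)})
rows = postfilter(read_subset(pressure, ranges), lambda v: v > 100)
```

The pre-filter shrinks what is read (4 of 12 entries here), while the post-filter can only run after the values have been read.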

Query Partition
Partition the current query into several sub-queries and assign each sub-query to a slave process.
Two partition criteria
–Consider the contiguity of the data in memory
–Consider data aggregation (future)
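One way to respect contiguity is to split the query along a single index range so that each sub-query covers one unbroken block. The sketch below assumes the split is made along the slowest-varying dimension of a row-major layout; that choice, and the name `partition_range`, are illustrative assumptions rather than details from the slides.

```python
def partition_range(start, stop, n_workers):
    """Split [start, stop) into at most n_workers contiguous, near-equal pieces."""
    total = stop - start
    n = min(n_workers, total)
    if n <= 0:
        return []
    base, extra = divmod(total, n)
    pieces = []
    lo = start
    for i in range(n):
        # The first `extra` pieces take one additional index each.
        hi = lo + base + (1 if i < extra else 0)
        pieces.append((lo, hi))
        lo = hi
    return pieces
```

For example, `partition_range(0, 10, 4)` yields four half-open index ranges that together cover 0 to 10 with no gaps or overlaps, one per slave process.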

Experiment Setup
Application:
–Global Cloud Resolving Model and Data (GCRM)
Environment:
–Glenn System at the Ohio Supercomputer Center

SQL Queries
No. | Description | Percent
SQL 1 | SELECT pressure FROM dataset; | 100%
SQL 2 | SELECT pressure FROM dataset WHERE cells<=20481 | 50%
SQL 3 | SELECT pressure FROM dataset WHERE cells>20481 AND layers>330; | 25%
SQL 4 | SELECT pressure FROM dataset WHERE cells<=20481 AND layers<250; | 10%
SQL 5 | SELECT pressure FROM dataset WHERE cells <= 20481 AND time<=781710 AND layers<250; | 1%

Scalability with Different Data Sizes
8 processes
Execution time scaled almost linearly within each query

Time Improvement from Using the Pre-filter
4 processes; SQL 5 (queries only 1% of the data)
The pre-filter efficiently decreases the query size and improves performance.

Scalability with Increasing No. of Sources
4G dataset; SQL 1 (full scan of the data table)
Execution time scaled almost linearly

Data Management: Continuing Work
Similar Prototype with HDF5 under Implementation
Consider processing, not just subsetting/aggregation
–Map-Reduce like Processing for NetCDF/HDF5 datasets?
Consider Format Conversion for Existing Tools

Outline
MATE-EC2: Middleware for Data-Intensive Computing on EC2
–Alternative to Amazon Elastic MapReduce
Data Management Solutions for Scientific Datasets
–Target NetCDF and HDF5
Accelerating Data Mining Computations
Resource Allocation Problems on Clouds

System for Mapping to Heterogeneous Configurations
[Figure labels:]
–User Input: Simple C code with annotations
–Application Developer
–Compilation Phase: Code Generator
–Multi-core Middleware API
–GPU Code for CUDA
–Run-time System: Worker Thread Creation and Management
–Map Computation to CPU and GPU
–Dynamic Work Distribution
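The dynamic work distribution named in the figure can be pictured with a small sketch. This is an assumption-laden illustration, not the system's runtime: both the "cpu" and "gpu" workers are plain Python threads, and `run_dynamic` is a hypothetical name. The point is only that each device pulls the next chunk when it becomes idle, so the split between devices is decided at run time rather than fixed in advance.

```python
import queue
import threading


def run_dynamic(chunks, workers):
    """Distribute chunks dynamically.

    workers maps a device name to a function(chunk) -> partial result.
    Returns the combined result and how many chunks each device processed.
    """
    todo = queue.Queue()
    for c in chunks:
        todo.put(c)
    lock = threading.Lock()
    partials = []
    counts = {name: 0 for name in workers}

    def loop(name, fn):
        while True:
            try:
                chunk = todo.get_nowait()  # an idle worker grabs the next chunk
            except queue.Empty:
                return
            result = fn(chunk)
            with lock:
                partials.append(result)
                counts[name] += 1

    threads = [threading.Thread(target=loop, args=item) for item in workers.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials), counts
```

The per-device counts vary from run to run, while the combined result does not, since the partial sums are combined with an order-independent operation.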

K-Means on GPU + Multi-Core CPUs [figure]

Summary
Dataset sizes are increasing
Clouds add many challenges
Many challenges in data processing on clouds
