Yi Wang, Wei Jiang, Gagan Agrawal


1 SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats
  Yi Wang, Wei Jiang, Gagan Agrawal
  The Ohio State University
  CCGrid 2012, May 15th, Ottawa, Canada
  2019/2/19

2 Outline
  Introduction
  System Design
  System Optimization
  Experimental Results
  Conclusion and Future Work

3 Scientific Data Analysis Today
  Increasingly data-intensive
    Volume approximately doubles each year
  Stored in certain specialized formats
    NetCDF, HDF5, ADIOS…
  Popularity of MapReduce and its variants
    Free accessibility
    Easy programmability
    Good scalability
    Built-in fault tolerance

4 NetCDF
  Network Common Data Form

5 HDF5
  Hierarchical Data Format

6 MATE
  MapReduce with AlternaTE API
    MATE outperforms Hadoop by factors of 5 to 10
  Generalized Reduction
    First proposed in FREERIDE, which was developed at Ohio State
    Shares a similar processing structure with MapReduce
    Avoids storing and shuffling a large number of key-value pairs between Map and Reduce
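
The contrast drawn on this slide can be illustrated with a minimal Python sketch. It is not MATE's actual API: the function names, the per-key sum workload, and the use of plain dictionaries as "reduction objects" are all assumptions made for illustration. The first function stores and groups every emitted pair before reducing; the second folds each element directly into a per-split reduction object and then combines those objects.

```python
from collections import defaultdict


def mapreduce_sum(records):
    """MapReduce-style per-key sum: store pairs, shuffle, then reduce."""
    # Map phase: every emitted (key, value) pair is kept in memory.
    pairs = [(key, value) for key, value in records]
    # Shuffle phase: group the stored pairs by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: combine the values of each group.
    return {key: sum(values) for key, values in groups.items()}


def generalized_reduction_sum(splits):
    """Generalized-reduction-style per-key sum (hypothetical sketch)."""
    local_objects = []
    for split in splits:
        # Each split updates its own reduction object in place,
        # so no intermediate key-value pairs are stored or shuffled.
        reduction_object = defaultdict(int)
        for key, value in split:
            reduction_object[key] += value
        local_objects.append(reduction_object)
    # Global combination: merge the per-split reduction objects.
    combined = defaultdict(int)
    for reduction_object in local_objects:
        for key, value in reduction_object.items():
            combined[key] += value
    return dict(combined)
```

Both functions return the same result for the same input; the difference is that the second never materializes the list of intermediate pairs.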

7 Scientific Data Analysis Today (Cont’d)
  “Store-first-analyze-after”
    Reload data into another file system
      E.g., load data from PVFS to HDFS
    Reload data into another data format
      E.g., load NetCDF/HDF5 data into a specific format
  Problem
    Long data migration/transformation time
    Stresses network and disks

8 Outline
  Introduction
  System Design
  System Optimization
  Experimental Results
  Conclusion and Future Work

9 SciMATE Framework
  “In-situ data analysis” (no data reloading!)
  Extends MATE for scientific data analysis [Wei Jiang et al., CCGRID’10]
  Customizable data format adaption API
    Can be adapted to support processing of any (or even a new) scientific data format
  Optimized by
    Access strategies
    Access patterns

10 System Overview
  Key feature: scientific data processing module

11 Scientific Data Processing Module

12 Integrating a New Data Format
  Data adaption layer is customizable
    Insert a third-party adapter
    Open for extension but closed for modification
  An adapter must implement the generic block loader interface
    Partitioning function and auxiliary functions
      E.g., partition, get_dimensionality
    Full read function and partial read functions
      E.g., full_read, partial_read, partial_read_by_block
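
A rough idea of what such an adapter might look like is sketched below in Python. Only the five function names come from the slide; the parameter lists, the return types, the meaning assigned to get_dimensionality (number of dimensions), and the toy InMemoryLoader class are assumptions for illustration, not SciMATE's real signatures.

```python
from abc import ABC, abstractmethod


class BlockLoader(ABC):
    """Hypothetical rendering of the generic block loader interface."""

    @abstractmethod
    def get_dimensionality(self):
        """Return the number of dimensions of the dataset (assumed meaning)."""

    @abstractmethod
    def partition(self, num_blocks):
        """Split the rows into blocks, returned as (row_start, row_count)."""

    @abstractmethod
    def full_read(self):
        """Read the whole dataset."""

    @abstractmethod
    def partial_read(self, row_start, row_count, col_start, col_count):
        """Read a rectangular subset of the dataset."""

    @abstractmethod
    def partial_read_by_block(self, block, col_start, col_count):
        """Read a column range restricted to one partitioned block."""


class InMemoryLoader(BlockLoader):
    """Toy third-party adapter over a 2-D list of rows (illustration only)."""

    def __init__(self, rows):
        self.rows = rows

    def get_dimensionality(self):
        return 2

    def partition(self, num_blocks):
        # Spread the rows as evenly as possible over the blocks.
        base, extra = divmod(len(self.rows), num_blocks)
        blocks, start = [], 0
        for i in range(num_blocks):
            count = base + (1 if i < extra else 0)
            blocks.append((start, count))
            start += count
        return blocks

    def full_read(self):
        return [list(row) for row in self.rows]

    def partial_read(self, row_start, row_count, col_start, col_count):
        selected = self.rows[row_start:row_start + row_count]
        return [row[col_start:col_start + col_count] for row in selected]

    def partial_read_by_block(self, block, col_start, col_count):
        row_start, row_count = block
        return self.partial_read(row_start, row_count, col_start, col_count)
```

Code written against BlockLoader stays unchanged when a new adapter is added, which is the "open for extension but closed for modification" property the slide mentions.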

13 Outline
  Introduction
  System Design
  System Optimization
  Experimental Results
  Conclusion and Future Work

14 Data Access Strategies and Patterns
  Full Read: probably too expensive for reading a small data subset
  Partial Read
    Strided pattern
    Column pattern
    Discrete point pattern
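
The three partial-read patterns can be pictured on a small 2-D array. The sketch below uses plain Python lists rather than any NetCDF or HDF5 call; the function names and the (start, count, stride) parameter layout are assumptions chosen for illustration.

```python
def strided_read(data, start, count, stride):
    """Strided pattern: take `count` elements per dimension, `stride` apart."""
    row0, col0 = start
    n_rows, n_cols = count
    row_step, col_step = stride
    return [
        [data[row0 + i * row_step][col0 + j * col_step] for j in range(n_cols)]
        for i in range(n_rows)
    ]


def column_read(data, columns):
    """Column pattern: keep only the requested columns of every row."""
    return [[row[c] for c in columns] for row in data]


def discrete_point_read(data, points):
    """Discrete point pattern: fetch individually addressed (row, col) cells."""
    return [data[r][c] for r, c in points]
```

For a 4 x 6 array whose cell (r, c) holds r * 10 + c, strided_read(data, (0, 1), (2, 3), (2, 2)) picks rows 0 and 2 and columns 1, 3 and 5, giving [[1, 3, 5], [21, 23, 25]].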

15 Access Pattern Optimization
  Strided pattern: directly supported by API
  Discrete point pattern: rarely used, so no optimization for now
  Column pattern
    Fixed-size column read
    Contiguous column read (our choice)
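
One way to see why contiguous column reads can be attractive is to count read calls. The sketch below is an interpretation, not the authors' implementation: it assumes a fixed-size read fetches exactly one column per call, and that a contiguous read fetches one run of adjacent columns per call. Each call is represented as a hypothetical (first_column, column_count) pair.

```python
def fixed_size_calls(columns):
    """One read call per requested column (assumed fixed size of one column)."""
    return [(c, 1) for c in sorted(set(columns))]


def contiguous_calls(columns):
    """One read call per run of adjacent requested columns."""
    calls = []
    for c in sorted(set(columns)):
        if calls and calls[-1][0] + calls[-1][1] == c:
            # Column c directly follows the last run, so extend that run.
            start, count = calls[-1]
            calls[-1] = (start, count + 1)
        else:
            # Column c starts a new run.
            calls.append((c, 1))
    return calls
```

Requesting columns 0, 1, 2, 5 and 6 takes five calls under the fixed-size assumption but only two contiguous calls, (0, 3) and (5, 2).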

16 Outline
  Introduction
  System Design
  System Optimization
  Experimental Results
  Conclusion and Future Work

17 Evaluation
  System functionality and scalability
  Use 16 GB datasets
  Data processing times (K-means, PCA, kNN)
    Thread scalability
    Node scalability
  Data loading times (K-means, PCA)
    Compare partial read with full read
    Compare fixed-size column read with contiguous column read

18 Evaluating Thread Scalability
  Data processing times for kNN (on an 8-core node)

19 Evaluating Node Scalability
  Data processing times for K-means on 16 GB of data (8 threads per node)

20 Evaluating Node Scalability
  Data loading times for PCA (8 threads per node)

21 Outline
  Introduction
  System Design
  System Optimization
  Experimental Results
  Conclusion and Future Work

22 Conclusion and Future Work
  Avoids bulk data transfers and large-scale data transformation
  Provides a customizable data format adaption API
  Supports optimized reads via access strategies and patterns
  Future Work
    Compare performance with SciHadoop [Joe Buck et al., SC’11]

