IPDPS 2013 - Boston Integrating Online Compression to Accelerate Large-Scale Data Analytics Applications Tekin Bicer, Jian Yin, David Chiu, Gagan Agrawal.


1 IPDPS Boston
Integrating Online Compression to Accelerate Large-Scale Data Analytics Applications
Tekin Bicer, Jian Yin, David Chiu, Gagan Agrawal and Karen Schuchardt
Ohio State University, Washington State University, Pacific Northwest National Laboratory

2 Introduction
- Scientific simulations and instruments can generate large amounts of data
  - E.g. Global Cloud Resolving Model: 1PB of data for 4km grid cells
  - Higher resolutions produce more and more data
  - I/O operations become the bottleneck
- Problems
  - Storage, I/O performance
- Compression

3 Motivation
- Generic compression algorithms
  - Good for low-entropy sequences of bytes
  - Scientific datasets are hard to compress
    - Floating point numbers: exponent and mantissa
    - Mantissa can be highly entropic
- Using compression in applications is challenging
  - Suitable compression algorithms
  - Utilization of available resources
  - Integration of compression algorithms
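The point about exponent versus mantissa can be checked with a small sketch. It assumes a synthetic, smoothly varying 32-bit series in place of a real scientific variable; `byte_entropy` and the sample data are illustrative names, not part of the talk.

```python
import math
import struct
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0 = constant, 8 = uniform)."""
    counts = Counter(data)
    total = len(data)
    return sum(c / total * math.log2(total / c) for c in counts.values())

# Synthetic, smoothly varying "temperature" series (an assumption, not GCRM data).
values = [280.0 + 5.0 * math.sin(i / 50.0) for i in range(10_000)]

# Big-endian float32: byte 0 holds the sign and the top 7 exponent bits,
# byte 3 holds the lowest 8 mantissa bits.
packed = [struct.pack(">f", v) for v in values]
high_bytes = bytes(p[0] for p in packed)
low_bytes = bytes(p[3] for p in packed)

exp_entropy = byte_entropy(high_bytes)
mant_entropy = byte_entropy(low_bytes)
print(f"sign/exponent byte: {exp_entropy:.2f} bits per byte")
print(f"low mantissa byte:  {mant_entropy:.2f} bits per byte")
```

In this toy series the leading byte never changes, while the trailing mantissa byte looks close to random. That is the part a generic byte-oriented compressor struggles with.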

4 Outline
- Introduction
- Motivation
- Compression Methodology
- Online Compression Framework
- Experimental Results
- Related Work
- Conclusion

5 Compression Methodology
- Common properties of scientific datasets
  - Multidimensional arrays
  - Consist of floating point numbers
  - Relationship between neighboring values
- Domain-specific solutions can help
- Approach:
  - Prediction-based differential compression
    - Predict the values of neighboring cells
    - Store the difference
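The "predict, then store the difference" idea can be sketched as follows. This is a minimal illustration, not the authors' algorithm: it assumes 32-bit floats, uses the left neighbor as the prediction, and takes the XOR of the bit patterns as the "difference". All function names are hypothetical.

```python
import struct

def float_to_bits(x: float) -> int:
    """Reinterpret a value as its 32-bit IEEE 754 pattern."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_to_float(b: int) -> float:
    return struct.unpack(">f", struct.pack(">I", b))[0]

def encode(values):
    """Predict each cell from its left neighbor and keep only the XOR residual."""
    residuals, prediction = [], 0
    for v in values:
        bits = float_to_bits(v)
        residuals.append(bits ^ prediction)
        prediction = bits  # the next cell is predicted to equal this one
    return residuals

def decode(residuals):
    """Invert encode(): XOR each residual with the running prediction."""
    values, prediction = [], 0
    for r in residuals:
        bits = r ^ prediction
        values.append(bits_to_float(bits))
        prediction = bits
    return values

# Neighboring temperatures (made-up numbers), rounded to float32 first.
row = [bits_to_float(float_to_bits(t)) for t in (281.25, 281.3, 281.28, 281.4)]
residuals = encode(row)
restored = decode(residuals)
leading_zeros = [32 - r.bit_length() for r in residuals[1:]]
print(restored == row, leading_zeros)
```

Because neighboring values share their sign, exponent, and leading mantissa bits, the residuals start with long runs of zeros, which is what makes them cheap to store.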

6 Example: GCRM Temperature Variable Compression
- E.g.: temperature record
- The values of neighboring cells are highly related
- X table (after prediction):
- X compressed values
  - 5 bits for prediction + difference
- Lossless and lossy compression
- Fast, with good compression ratios
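The transcript does not preserve the exact bit layout behind "5 bits for prediction + difference". One possible reading, sketched here under stated assumptions: values are 32 bits wide, the 5-bit field stores how many leading bits agree with the predicted (previous) value, and only the remaining bits are written. The bit stream is modeled as a string of '0' and '1' characters for readability.

```python
import struct

WIDTH = 32        # assumed value width (single precision)
HEADER_BITS = 5   # from the slide: 5 bits describe the prediction match

def to_bits(x: float) -> int:
    return struct.unpack(">I", struct.pack(">f", x))[0]

def encode(words):
    """Emit, per word, a 5-bit match length followed by the residual bits."""
    out, prediction = [], 0
    for w in words:
        diff = w ^ prediction
        # Leading bits that agree with the prediction, capped at 31 so the
        # count fits in the 5-bit header.
        match = min(WIDTH - diff.bit_length(), 2**HEADER_BITS - 1)
        tail = WIDTH - match
        out.append(format(match, "05b") + format(diff, f"0{tail}b"))
        prediction = w
    return "".join(out)

def decode(stream, count):
    """Rebuild the original words from the bit string."""
    words, prediction, pos = [], 0, 0
    for _ in range(count):
        match = int(stream[pos:pos + HEADER_BITS], 2)
        pos += HEADER_BITS
        tail = WIDTH - match
        diff = int(stream[pos:pos + tail], 2)
        pos += tail
        prediction ^= diff
        words.append(prediction)
    return words

temps = [281.25, 281.3, 281.28, 281.4, 281.4]  # made-up sample values
words = [to_bits(t) for t in temps]
stream = encode(words)
raw_bits = WIDTH * len(words)
print(f"{len(stream)} bits instead of {raw_bits}")
```

The first value costs more than 32 bits because it has no useful prediction. The savings come from the later values, where most leading bits match their neighbor.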

7 Compression Framework
- Improve end-to-end application performance
- Minimize the application I/O time
  - Pipelining I/O and (de)compression operations
- Hide computational overhead
  - Overlapping application computation with the compression framework
- Easy implementation of different compression algorithms
- Easy integration with applications
  - Similar API to POSIX I/O
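Pipelining I/O with decompression can be illustrated with two threads and a bounded queue. This is only a sketch of the general pattern: `zlib` stands in for the compression algorithm, an in-memory list stands in for storage, and the function names are hypothetical.

```python
import queue
import threading
import zlib

def read_chunks(stored_chunks, out_queue):
    """I/O stage: hand compressed chunks to the next stage as they arrive."""
    for chunk in stored_chunks:
        out_queue.put(chunk)
    out_queue.put(None)  # sentinel: no more chunks

def decompress_chunks(in_queue, results):
    """Decompression stage: runs while the I/O stage keeps reading."""
    while True:
        chunk = in_queue.get()
        if chunk is None:
            break
        results.append(zlib.decompress(chunk))

original = [bytes([i]) * 4096 for i in range(8)]
stored = [zlib.compress(c) for c in original]

pending = queue.Queue(maxsize=2)  # bounded, so reads stay just ahead of decoding
decoded = []
reader = threading.Thread(target=read_chunks, args=(stored, pending))
decoder = threading.Thread(target=decompress_chunks, args=(pending, decoded))
reader.start()
decoder.start()
reader.join()
decoder.join()
print(decoded == original)
```

With real storage, the reader spends most of its time waiting on I/O, so the decoder's CPU work overlaps with that wait instead of adding to it.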

8 A Compression Framework for Data Intensive Applications
- Chunk Resource Allocation (CRA) Layer
  - Initializes the system
  - Generates chunk requests and enqueues them for processing
  - Converts original offset and data size requests to their compressed counterparts
- Parallel Compression Engine (PCE)
  - Applies encode(), decode() functions to chunks
  - Manages in-memory cache with informed prefetching
  - Creates I/O requests
- Parallel I/O Layer (PIOL)
  - Creates parallel chunk requests to the storage medium
  - Each chunk request is handled by a group of threads
  - Provides an abstraction for different data transfer protocols
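The offset conversion performed by the CRA layer can be pictured as an index lookup. The sketch below assumes fixed-size uncompressed chunks and a table of compressed chunk sizes; `CHUNK_SIZE`, `build_index`, and `translate` are hypothetical names, and the slides do not describe the real data structure.

```python
from itertools import accumulate

CHUNK_SIZE = 1024  # assumed uncompressed chunk size in bytes

def build_index(compressed_sizes):
    """Start offset of every compressed chunk in the stored file."""
    return [0] + list(accumulate(compressed_sizes))[:-1]

def translate(offset, size, compressed_sizes, index):
    """Map an uncompressed (offset, size) request to compressed chunk reads.

    Returns (chunk id, compressed offset, compressed size) tuples.
    """
    first = offset // CHUNK_SIZE
    last = (offset + size - 1) // CHUNK_SIZE
    return [(cid, index[cid], compressed_sizes[cid]) for cid in range(first, last + 1)]

compressed_sizes = [300, 512, 128, 700]   # made-up sizes for four chunks
index = build_index(compressed_sizes)     # [0, 300, 812, 940]
requests = translate(1500, 1000, compressed_sizes, index)
print(requests)  # → [(1, 300, 512), (2, 812, 128)]
```

A request for uncompressed bytes 1500 to 2499 touches chunks 1 and 2, so only those two compressed ranges need to be fetched and decoded.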

9 Compression Framework API
- User-defined functions:
  - encode_t(…): (R) Code for compression
  - decode_t(…): (R) Code for decompression
  - prefetch_t(…): (O) Informed prefetching function
- Applications can use the functions below
  - comp_read: Applies decode_t to a compressed chunk
  - comp_write: Applies encode_t to an original chunk
  - comp_seek: Mimics fseek, also utilizes prefetch_t
  - comp_init: Initializes the system (thread pools, cache, etc.)
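To show how these pieces might fit together, here is a toy in-memory stand-in. The function names `comp_write`, `comp_seek`, and `comp_read` come from the slide, but the signatures are not given there, so every parameter below is an assumption. The class `CompStore` is hypothetical and is not the framework's implementation, and `zlib` only stands in for user-supplied encode and decode functions.

```python
import zlib

class CompStore:
    """Toy in-memory stand-in for the framework; signatures are assumptions."""

    def __init__(self, encode, decode, chunk_size=1024):
        # Plays the role of comp_init: register the user-defined functions.
        self.encode, self.decode = encode, decode
        self.chunk_size = chunk_size
        self.chunks = []   # compressed chunks, in order
        self.position = 0  # current uncompressed offset

    def comp_write(self, data: bytes) -> None:
        """Split data into chunks and apply the user's encode function to each.

        Simplification: every write except the last must be a multiple of
        chunk_size, otherwise the offset arithmetic in comp_read breaks.
        """
        for start in range(0, len(data), self.chunk_size):
            self.chunks.append(self.encode(data[start:start + self.chunk_size]))

    def comp_seek(self, offset: int) -> None:
        """Move the uncompressed read position, in the spirit of fseek."""
        self.position = offset

    def comp_read(self, size: int) -> bytes:
        """Decode only the chunks that overlap the requested byte range."""
        if size <= 0:
            return b""
        first = self.position // self.chunk_size
        last = (self.position + size - 1) // self.chunk_size
        data = b"".join(self.decode(c) for c in self.chunks[first:last + 1])
        skip = self.position - first * self.chunk_size
        self.position += size
        return data[skip:skip + size]

store = CompStore(encode=zlib.compress, decode=zlib.decompress)
payload = bytes(range(256)) * 16  # 4096 bytes of sample data
store.comp_write(payload)
store.comp_seek(1000)
piece = store.comp_read(100)
print(piece == payload[1000:1100])
```

The application works entirely with uncompressed offsets, as it would with POSIX-style I/O, and never sees the compressed chunk boundaries.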

10 Prefetching and In-Memory Cache
- Overlapping application layer computation with I/O
- Reusability of already accessed data is small
- Prefetching and caching the prospective chunks
  - Default is LRU
  - User can analyze history and provide a prospective chunk list
- Cache uses a row-based locking scheme for efficient consecutive chunk requests
- Informed prefetching: prefetch(…)
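An LRU chunk cache with a user-supplied prefetch hint can be sketched as below. It is a single-threaded simplification: prefetching happens synchronously and the row-based locking from the slide is omitted. `ChunkCache`, `load_chunk`, and the `prefetch` callback signature are all hypothetical.

```python
from collections import OrderedDict

class ChunkCache:
    """LRU cache of decoded chunks with an optional informed-prefetch callback."""

    def __init__(self, capacity, load_chunk, prefetch=None):
        self.capacity = capacity
        self.load_chunk = load_chunk  # fetches and decodes one chunk by id
        self.prefetch = prefetch      # optional: chunk id -> ids likely needed next
        self.entries = OrderedDict()
        self.hits = self.misses = 0

    def _insert(self, chunk_id):
        if chunk_id in self.entries:
            return
        self.entries[chunk_id] = self.load_chunk(chunk_id)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used chunk

    def get(self, chunk_id):
        if chunk_id in self.entries:
            self.hits += 1
            self.entries.move_to_end(chunk_id)  # mark as most recently used
        else:
            self.misses += 1
            self._insert(chunk_id)
        data = self.entries[chunk_id]
        if self.prefetch is not None:
            for upcoming in self.prefetch(chunk_id):
                self._insert(upcoming)
        return data

# Sequential scan where the application hints that chunk n + 1 follows chunk n.
cache = ChunkCache(capacity=2,
                   load_chunk=lambda cid: f"chunk-{cid}",
                   prefetch=lambda cid: [cid + 1])
for cid in range(5):
    cache.get(cid)
print(cache.hits, cache.misses)  # → 4 1
```

Even though no chunk is ever reused in this scan, the hint turns every access after the first into a cache hit. This matches the slide's observation that reuse is small and prefetching is what pays off.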

11 Integration with a Data-Intensive Computing System
- MapReduce-style API
  - Remote data processing
  - Sensitive to I/O bandwidth
- Processes data in…
  - local cluster
  - cloud
  - or both (hybrid cloud)

12 Outline
- Introduction
- Motivation
- Compression Methodology
- Online Compression Framework
- Experimental Results
- Related Work
- Conclusion

13 Experimental Setup
- Two datasets:
  - GCRM: 375GB (L:270 + R:105)
  - NPB: 237GB (L:166 + R:71)
- 16x8 cores (Intel Xeon 2.53GHz)
- Storage of datasets
  - Lustre FS (14 storage nodes)
  - Amazon S3 (Northern Virginia)
- Compression algorithms
  - CC, FPC, LZO, bzip, gzip, lzma
- Applications: AT, MMAT, KMeans

14 Performance of MMAT
- Breakdown of performance
- Overhead (local): 15.41%
- Read speedup: 1.96

15 Lossy Compression (MMAT)
- Lossy
  - #e: number of dropped bits
  - Error bound: 5 x 10^-5
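The transcript does not say which bits are dropped or how the error bound was obtained. As a rough illustration only, the sketch below assumes single-precision values and zeroes the `e` least significant mantissa bits. For normal values, that changes a number by less than 2^(e-23) in relative terms. `drop_bits` and `relative_error_bound` are hypothetical helpers.

```python
import struct

MANTISSA_BITS = 23  # IEEE 754 single precision

def drop_bits(x: float, e: int) -> float:
    """Zero the e least significant mantissa bits of x stored as float32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    mask = 0xFFFFFFFF ^ ((1 << e) - 1)
    return struct.unpack(">f", struct.pack(">I", bits & mask))[0]

def relative_error_bound(e: int) -> float:
    """Upper bound on the relative error from truncating e mantissa bits."""
    return 2.0 ** (e - MANTISSA_BITS)

# Start from a value that is exactly representable in float32.
exact = struct.unpack(">f", struct.pack(">f", 281.3))[0]
for e in (4, 8, 12):
    lossy = drop_bits(exact, e)
    error = abs(exact - lossy) / abs(exact)
    print(f"e={e:2d} lossy={lossy:.6f} rel. error={error:.2e} bound={relative_error_bound(e):.2e}")
```

Each extra dropped bit doubles the bound, so `e` gives the user a direct knob for trading accuracy against compressed size.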

16 Performance of KMeans
- NPB dataset
- Compression ratio: 24.01% (180GB)
- More computation
  - More opportunity to fetch and decompress

17 Conclusion
- Management and analysis of scientific datasets are challenging
  - Generic compression algorithms are inefficient for scientific datasets
- We proposed a compression framework and methodology
  - Domain-specific compression algorithms are fast and space efficient
    - 51.68% compression ratio
    - 53.27% improvement in execution time
  - Easy plug-and-play of compression
  - Integration of the proposed framework and methodology with a data analysis middleware

18 Thanks!

19 Multithreading & Prefetching
- Different numbers of PCE and I/O threads
- 2P - 4IO
  - 2 PCE threads, 4 I/O threads
- One core is assigned to the compression framework

20 Related Work
- (Scientific) data management
  - NetCDF, PNetCDF, HDF5
  - Nicolae et al. (BlobSeer)
    - Distributed data management service for efficient reading, writing and appending operations
- Compression
  - Generic: LZO, bzip, gzip, szip, LZMA, etc.
  - Scientific
    - Schendel and Jin et al. (ISOBAR)
      - Organizes highly entropic data into compressible data chunks
    - Burtscher et al. (FPC)
      - Efficient double-precision floating point compression
    - Lakshminarasimhan et al. (ISABELA)

