
Slide 1: A GPU Accelerated Storage System
NetSysLab, The University of British Columbia
Abdullah Gharaibeh, with Samer Al-Kiswany, Sathish Gopalakrishnan, and Matei Ripeanu

Slide 2: GPUs Radically Change the Cost Landscape
(Figure: cost comparison with price points $600 and $1279; source: CUDA Guide.)

Slide 3: Harnessing GPU Power Is Challenging
- More complex programming model
- Limited memory space
- Accelerator / co-processor model

Slide 4: Motivating Question
Does the 10x reduction in computation costs that GPUs offer change the way we design and implement distributed systems?
Context: distributed storage systems.

Slide 5: Computationally Intensive Operations in Distributed Systems
Operations: hashing, erasure coding, encryption/decryption, membership testing (Bloom filter), compression. These operations are computationally intensive and limit performance.
Techniques they enable: similarity detection, content addressability, security, integrity checks, redundancy, load balancing, summary cache, storage efficiency.

Slide 6: Distributed Storage System Architecture
An application accesses the system through a file-system API on the client; files are divided into a stream of blocks (b1, b2, b3, ..., bn). The client's access module talks to the metadata manager and the storage nodes. The application layer applies techniques that improve performance/reliability (similarity detection, security, integrity checks, redundancy), built on enabling operations (hashing, compression, encoding/decoding, encryption/decryption) that run on the CPU or, through an offloading layer, on the GPU.
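The deck names the offloading layer without showing its interface; purely as an illustration of the separation of concerns it provides, a hypothetical host-side interface might look like the sketch below (all names here are invented, not from the paper).

```cuda
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical offloading-layer interface (names invented for illustration):
// the access module calls hash_blocks() and never sees whether a CPU or a
// GPU backend executes it, which keeps the concerns separated.
struct HashBackend {
    virtual ~HashBackend() {}
    // Hash num_blocks fixed-size blocks; write one digest per block.
    virtual void hash_blocks(const uint8_t* data, size_t block_size,
                             size_t num_blocks, uint64_t* digests) = 0;
};

// CPU reference backend; a GpuBackend would implement the same interface
// on top of CUDA, so switching targets is invisible to callers.
struct CpuBackend : HashBackend {
    void hash_blocks(const uint8_t* data, size_t block_size,
                     size_t num_blocks, uint64_t* digests) override {
        for (size_t b = 0; b < num_blocks; ++b) {
            uint64_t h = 1469598103934665603ULL;    // FNV-1a as a stand-in
            for (size_t i = 0; i < block_size; ++i) {
                h ^= data[b * block_size + i];
                h *= 1099511628211ULL;
            }
            digests[b] = h;
        }
    }
};

int main() {
    std::vector<uint8_t> file(4 * 1024, 0x42);      // 4 blocks of 1 KB
    uint64_t digests[4];
    CpuBackend backend;
    backend.hash_blocks(file.data(), 1024, 4, digests);
    printf("digest of block 0: %016llx\n", (unsigned long long)digests[0]);
    return 0;
}
```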

Slide 7: Contributions
- A GPU-accelerated storage system: design and prototype implementation that integrates similarity detection and GPU support.
- End-to-end system evaluation: a 2x throughput improvement for a realistic checkpointing workload.

Slide 8: Challenges
- Integration challenges: minimizing the integration effort; transparency; separation of concerns.
- Extracting major performance gains: hiding memory allocation overheads; hiding data transfer overheads; efficient utilization of the GPU memory units; use of multi-GPU systems.

Slide 9: Past Work: Hashing on GPUs
HashGPU [1]: a library that exploits GPUs to support specialized use of hashing in distributed storage systems.
One performance data point: it accelerates hashing by up to 5x compared to a single-core CPU.
However, significant speedup is achieved only for large blocks (>16 MB), so it is not suitable for efficient similarity detection.
[1] "Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems", S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, M. Ripeanu, HPDC '08.
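HashGPU's internals are not shown in the deck; as a toy illustration of the idea of hashing a stream of fixed-size blocks on the GPU, the sketch below assigns one thread per block and computes a non-cryptographic FNV-1a hash. The real library uses cryptographic hashes and a different parallelization; everything here is illustrative.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// Toy illustration only: one thread computes an FNV-1a hash of one data
// block. This just shows the "hash a stream of blocks" shape of the problem.
__global__ void fnv1a_per_block(const uint8_t* data, size_t block_size,
                                int num_blocks, uint64_t* hashes) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= num_blocks) return;
    const uint8_t* p = data + (size_t)b * block_size;
    uint64_t h = 1469598103934665603ULL;          // FNV offset basis
    for (size_t i = 0; i < block_size; ++i) {
        h ^= p[i];
        h *= 1099511628211ULL;                    // FNV prime
    }
    hashes[b] = h;
}

int main() {
    const int num_blocks = 256;
    const size_t block_size = 64 * 1024;          // 64 KB blocks (illustrative)
    const size_t total = num_blocks * block_size;

    uint8_t* d_data;  uint64_t* d_hashes;
    cudaMalloc(&d_data, total);
    cudaMalloc(&d_hashes, num_blocks * sizeof(uint64_t));
    cudaMemset(d_data, 0xAB, total);              // stand-in file content

    int threads = 128;
    int grid = (num_blocks + threads - 1) / threads;
    fnv1a_per_block<<<grid, threads>>>(d_data, block_size, num_blocks, d_hashes);

    uint64_t h0;
    cudaMemcpy(&h0, d_hashes, sizeof(h0), cudaMemcpyDeviceToHost);
    printf("hash of block 0: %016llx\n", (unsigned long long)h0);

    cudaFree(d_data);  cudaFree(d_hashes);
    return 0;
}
```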

Slide 10: Profiling HashGPU
Profiling shows at least 75% of HashGPU's time is overhead. Amortizing memory allocation and overlapping data transfers with computation may bring important benefits.

Slide 11: CrystalGPU
CrystalGPU: a layer of abstraction that transparently enables common GPU optimizations. In the offloading layer, it sits between HashGPU and the GPU.
One performance data point: CrystalGPU improves the speedup of the HashGPU library by more than one order of magnitude.

Slide 12: CrystalGPU Opportunities and Enablers
- Opportunity: reusing GPU memory buffers. Enabler: a high-level memory manager.
- Opportunity: overlapping communication and computation. Enabler: double buffering and asynchronous kernel launch.
- Opportunity: multi-GPU systems (e.g., the GeForce 9800 GX2 and GPU clusters). Enabler: a task queue manager.
The first two enablers are sketched in code below.
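CrystalGPU's implementation is not shown in the deck; as a minimal sketch of the two single-GPU enablers, the following CUDA pattern pre-allocates pinned host and device buffers once (buffer reuse) and alternates two streams so transfers overlap kernel execution (double buffering). Batch sizes and the stand-in kernel are illustrative.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Stand-in for the real work (hashing, encoding, ...): flip every byte.
__global__ void process(const uint8_t* in, uint8_t* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] ^ 0xFF;
}

int main() {
    const size_t batch = 8 << 20;       // 8 MB per batch (illustrative)
    const int num_batches = 8;

    // Buffer reuse: allocate pinned host and device buffers once, up front,
    // instead of paying the allocation cost on every call.
    uint8_t *h_in[2], *h_out[2], *d_in[2], *d_out[2];
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) {
        cudaHostAlloc(&h_in[i], batch, cudaHostAllocDefault);
        cudaHostAlloc(&h_out[i], batch, cudaHostAllocDefault);
        cudaMalloc(&d_in[i], batch);
        cudaMalloc(&d_out[i], batch);
        cudaStreamCreate(&s[i]);
    }

    // Double buffering: batch k runs in stream k%2, so while one stream
    // computes, the other transfers; communication overlaps computation.
    for (int k = 0; k < num_batches; ++k) {
        int b = k & 1;
        cudaStreamSynchronize(s[b]);    // prior use of this buffer pair done
        // ... fill h_in[b] with the next batch of file blocks here ...
        cudaMemcpyAsync(d_in[b], h_in[b], batch, cudaMemcpyHostToDevice, s[b]);
        process<<<(unsigned)((batch + 255) / 256), 256, 0, s[b]>>>(
            d_in[b], d_out[b], batch);
        cudaMemcpyAsync(h_out[b], d_out[b], batch, cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) {
        cudaFreeHost(h_in[i]);  cudaFreeHost(h_out[i]);
        cudaFree(d_in[i]);      cudaFree(d_out[i]);
        cudaStreamDestroy(s[i]);
    }
    return 0;
}
```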

Slide 13: Experimental Evaluation
- CrystalGPU evaluation
- End-to-end system evaluation

Slide 14: CrystalGPU Evaluation
Testbed: a machine with an Intel quad-core 2.66 GHz CPU, a PCI Express 2.0 x16 bus, and an NVIDIA GeForce 9800 GX2 dual-GPU card.
Experiment space: HashGPU on top of CrystalGPU vs. the original HashGPU, across three optimizations:
- Buffer reuse
- Overlapping communication and computation
- Exploiting the two GPUs (see the dispatch sketch below)
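For the two-GPU experiments: a 9800 GX2 appears to CUDA as two devices. As a rough sketch of what a task queue manager's dispatch might look like, the code below farms independent tasks across devices round-robin; the real CrystalGPU scheduler is not shown in the deck, so this is illustrative only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in task body; the real tasks would be hashing/encoding kernels.
__global__ void work(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * 2.0f + 1.0f;
}

int main() {
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);   // a 9800 GX2 shows up as two devices
    if (num_devices == 0) return 1;

    const int n = 1 << 20;
    const int num_tasks = 8;
    float* d_buf[num_tasks];

    // Round-robin dispatch: a simplified stand-in for a task queue manager,
    // which would pull tasks from a queue and assign them to idle GPUs.
    for (int t = 0; t < num_tasks; ++t) {
        cudaSetDevice(t % num_devices);
        cudaMalloc(&d_buf[t], n * sizeof(float));
        cudaMemset(d_buf[t], 0, n * sizeof(float));
        work<<<(n + 255) / 256, 256>>>(d_buf[t], n);
    }
    // Drain all devices, then release the buffers.
    for (int d = 0; d < num_devices; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }
    for (int t = 0; t < num_tasks; ++t) {
        cudaSetDevice(t % num_devices);
        cudaFree(d_buf[t]);
    }
    printf("ran %d tasks across %d device(s)\n", num_tasks, num_devices);
    return 0;
}
```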

Slide 15: HashGPU Performance on Top of CrystalGPU
The gains enabled by the three optimizations can be realized. Baseline: a single CPU core.

Slide 16: End-to-End System Evaluation
- Testbed: four storage nodes and one metadata server; one client with a 9800 GX2 GPU.
- Three implementations: no similarity detection (without-SD); similarity detection on the CPU, 4 cores @ 2.6 GHz (SD-CPU); similarity detection on the GPU, 9800 GX2 (SD-GPU).
- Three workloads: a real checkpointing workload; completely similar files (all possible gains in terms of data saving); completely different files (only overheads, no gains).
- Success metrics: system throughput; impact on a competing compute- or I/O-intensive application.

Slide 17: System Throughput (Checkpointing Workload)
The integrated system preserves the throughput gains on a realistic workload: a 1.8x improvement.

Slide 18: System Throughput (Synthetic Workload of Similar Files)
Offloading to the GPU enables close-to-optimal performance on a workload with room for a 2x improvement.

Slide 19: Impact on a Competing (Compute-Intensive) Application
Workload: writing checkpoints back to back. Offloading frees CPU resources for the competing application while preserving the throughput gains: a 2x throughput improvement at the cost of only a 7% reduction for the competing application.

Slide 20: Summary
- We present the design and implementation of a distributed storage system that integrates GPU power.
- We present CrystalGPU: a management layer that transparently enables common GPU optimizations across GPGPU applications.
- We empirically demonstrate that employing the GPU enables close-to-optimal system performance.
- We shed light on the impact of GPU offloading on competing applications running on the same node.

Slide 21: netsyslab.ece.ubc.ca

Slide 22: Similarity Detection
Hashing file A yields block hashes X, Y, Z; hashing file B yields W, Y, Z. Only the first block differs, so only that block needs to be transferred and stored, potentially improving write throughput.
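In code, the comparison step is straightforward; below is a minimal host-side sketch (with small integers standing in for the hash values X, Y, Z, W) that returns the indices of blocks that must actually be written.

```cuda
#include <cstdio>
#include <cstdint>
#include <vector>

// Host-side sketch of similarity detection: compare the per-block hash
// list of a new file version against the stored version; only blocks
// whose hashes differ need to be written to the storage nodes.
// (The hash values would come from the GPU hashing path shown earlier.)
std::vector<size_t> changed_blocks(const std::vector<uint64_t>& stored,
                                   const std::vector<uint64_t>& fresh) {
    std::vector<size_t> out;
    for (size_t i = 0; i < fresh.size(); ++i)
        if (i >= stored.size() || stored[i] != fresh[i])
            out.push_back(i);                    // new or modified block
    return out;
}

int main() {
    // The slide's example: file A = {X, Y, Z}, file B = {W, Y, Z}.
    std::vector<uint64_t> fileA = {0x58, 0x59, 0x5A};   // X, Y, Z
    std::vector<uint64_t> fileB = {0x57, 0x59, 0x5A};   // W, Y, Z
    for (size_t i : changed_blocks(fileA, fileB))
        printf("block %zu must be written\n", i);       // prints: block 0
    return 0;
}
```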

Slide 23: Execution Path on GPU (Data Processing Application)
T_total = T_preprocessing + T_data(host-to-GPU) + T_processing + T_data(GPU-to-host) + T_postprocessing
1. Preprocessing (memory allocation)
2. Data transfer in (host to GPU)
3. GPU processing
4. Data transfer out (GPU to host)
5. Postprocessing
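One way to see where the time goes is to bracket each stage with CUDA events; a minimal sketch follows (the timings are approximate, and the host-side postprocessing stage would be timed with a host timer instead).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void kernel(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;            // stand-in for the real processing
}

int main() {
    const int n = 1 << 22;
    const size_t bytes = n * sizeof(float);
    float* h = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    // One event before and after each GPU-visible stage of the path.
    cudaEvent_t e[5];
    for (int i = 0; i < 5; ++i) cudaEventCreate(&e[i]);

    cudaEventRecord(e[0]);
    float* d;  cudaMalloc(&d, bytes);                   // 1. preprocessing
    cudaEventRecord(e[1]);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);    // 2. T_data(HtoG)
    cudaEventRecord(e[2]);
    kernel<<<(n + 255) / 256, 256>>>(d, n);             // 3. T_processing
    cudaEventRecord(e[3]);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);    // 4. T_data(GtoH)
    cudaEventRecord(e[4]);
    cudaEventSynchronize(e[4]);

    const char* stage[4] = {"prealloc", "HtoD", "kernel", "DtoH"};
    for (int i = 0; i < 4; ++i) {
        float ms = 0;
        cudaEventElapsedTime(&ms, e[i], e[i + 1]);
        printf("%-8s %.3f ms\n", stage[i], ms);
    }
    cudaFree(d);  free(h);
    return 0;
}
```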

