
1. StoreGPU: Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems
NetSysLab, The University of British Columbia
Samer Al-Kiswany, with Abdullah Gharaibeh, Elizeu Santos-Neto, George Yuan, Matei Ripeanu

2. Computation Landscape
Recent GPUs dramatically change the computation cost landscape. A quiet revolution (GPU vs. CPU):
- Computation: 367 vs. 32 GFLOPS (128 vs. 4 cores)
- Memory bandwidth: 86.4 vs. 8.4 GB/s
- Prices shown on the slide: $220, $290
Figure: Floating-point operations per second for the CPU and GPU. (Source: CUDA 1.1 Guide)
HPDC '08

3. Computation Landscape
Recent GPUs dramatically change the computation cost landscape. They are:
- Affordable
- Widely available in commodity desktops
- Built with 10s to 100s of cores (and can support 1000s of threads)
- Friendly to general-purpose programming

4. Exploiting GPUs' Computational Power
Studies exploiting the GPU:
- Bioinformatics: [Liu06]
- Chemistry: [Vogt08]
- Physics: [Anderson08]
- And many more: [Owens07]
These studies report 4x to 50x speedups.
But: mostly scientific and specialized applications.

5. Motivating Question
System design is a balancing act in a multi-dimensional space: e.g., given certain objectives (say, job turnaround), minimize total system cost given component prices, I/O bottlenecks, bounds on storage and network traffic, energy consumption, etc.
Q: Does the 10x reduction in computation costs that GPUs offer change the way we design and implement (distributed) system middleware?

6. Distributed Systems: Computationally Intensive Operations
Operations:
- Hashing
- Erasure coding
- Encryption/decryption
- Compression
- Membership testing (Bloom filter)
These operations are computationally intensive and are often avoided in existing systems.
Used in:
- Storage systems
- Security protocols
- Data dissemination techniques
- Virtual machine memory management
- And many more

7. Why Start with Hashing?
Popular, used in many situations:
- Similarity detection
- Content addressability
- Integrity
- Copyright infringement detection
- Load balancing

8. How Hashing Is Used in Similarity Detection
Figure: File A (blocks X, Y, Z) and File B (blocks W, Y, Z) are each hashed block by block. Only the first block is different.
ICDCS '08

9. How Hashing Is Used in Similarity Detection
How to divide the file into blocks:
- Fixed-size blocks
- Content-based block boundaries
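To make the compare-by-hash idea concrete, here is a minimal sketch using fixed-size blocks. The function names, the 4-byte block size, and the toy file contents are assumptions made for illustration only; the slides do not show the actual implementation. MD5 is used because it is one of the hash functions the talk evaluates.

```python
import hashlib


def block_hashes(data: bytes, block_size: int) -> list:
    """Hash every fixed-size block of `data` (the last block may be shorter)."""
    return [
        hashlib.md5(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def similarity_ratio(old: bytes, new: bytes, block_size: int) -> float:
    """Fraction of blocks in `new` whose hash already appears in `old`."""
    known = set(block_hashes(old, block_size))
    new_hashes = block_hashes(new, block_size)
    if not new_hashes:
        return 0.0
    shared = sum(1 for h in new_hashes if h in known)
    return shared / len(new_hashes)


# Toy versions of File A (X, Y, Z) and File B (W, Y, Z) from slide 8.
file_a = b"XXXX" + b"YYYY" + b"ZZZZ"
file_b = b"WWWW" + b"YYYY" + b"ZZZZ"
ratio = similarity_ratio(file_a, file_b, block_size=4)
print(f"similar blocks: {ratio:.2f}")
```

Only the blocks whose hashes are not already known would need to be stored or transferred. Fixed-size blocks are cheap to compute, but an insertion near the start of a file shifts every later block, which is the motivation for content-based boundaries.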

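10. Detecting Content-based Block Boundaries
Figure: a window of m bytes is moved over File i in steps of offset bytes and hashed at each position. If k bits of the hash value equal 0, the position is a block boundary. The boundaries split the file into blocks B1, B2, B3, B4.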
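The boundary test on this slide can be sketched as follows. This is an illustrative sketch, not the StoreGPU code: the function names are hypothetical, the choice of the low-order k bits is an assumption (the slide only says "k bits"), and each window is hashed from scratch with MD5 rather than with any incremental scheme.

```python
import hashlib


def find_boundaries(data: bytes, m: int, offset: int, k: int) -> list:
    """Return end positions of windows whose hash has its low k bits equal to 0.

    m      -- window size in bytes
    offset -- step between consecutive windows, in bytes
    k      -- number of hash bits compared with zero (assumed: low-order bits)
    """
    mask = (1 << k) - 1
    boundaries = []
    for start in range(0, len(data) - m + 1, offset):
        digest = hashlib.md5(data[start:start + m]).digest()
        if int.from_bytes(digest, "big") & mask == 0:
            boundaries.append(start + m)
    return boundaries


def split_blocks(data: bytes, boundaries: list) -> list:
    """Cut `data` at the given (increasing) boundary positions."""
    blocks, prev = [], 0
    for b in boundaries:
        blocks.append(data[prev:b])
        prev = b
    if prev < len(data):
        blocks.append(data[prev:])
    return blocks


data = bytes(range(256)) * 4
cuts = find_boundaries(data, m=20, offset=4, k=3)
blocks = split_blocks(data, cuts)
print(len(cuts), "boundaries,", len(blocks), "blocks")
```

Every window is hashed independently, so this step amounts to hashing a large number of small inputs. A larger k makes boundaries rarer and blocks larger on average.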

11. Hashing Use in Similarity Detection: Two Scenarios
I. Computing block hashes: hashing large blocks of data (100s of KB to 10s of MB).
II. Computing block boundaries: hashing a large number of small data blocks (a few bytes each).

12. StoreGPU
StoreGPU: a library that exploits GPUs to support distributed storage systems by offloading computationally intensive functions.
StoreGPU v1.0 implements the hashing functions used in computing block hashes and block boundaries.
One performance data point: in similarity detection, StoreGPU achieves an 8x speedup and 5x data compression for a checkpointing application.
Implication: GPUs unleash a valuable set of optimization techniques into the high-performance systems design space, although GPUs were not designed with this usage in mind.

13. Outline
- GPU architecture
- GPU programming
- Typical application flow
- StoreGPU design
- Evaluation

14. NVIDIA CUDA GPU Architecture
- SIMD architecture
- Four memories:
  - Device (a.k.a. global): slow (400-600 cycles access latency), large (256 MB to 1 GB)
  - Shared: fast (4 cycles access latency), small (16 KB)
  - Texture: read only
  - Constant: read only

15. GPU Programming
NVIDIA CUDA programming model:
- Abstracts the GPU architecture
- Is an extension to the C programming language (compiler directives)
- Provides a GPU-specific API (device properties, timing, memory management, etc.)
Programming is still challenging:
- Parallel programming is challenging: extracting parallelism at large scale, in a SIMD model
- Memory management
- Synchronization
- Immature debugging tools

16. Performance Tips
- Use 1000s of threads to make the best use of the GPU hardware
- Optimize the use of shared memory and registers
Challenge: shared memory and registers are limited; shared memory is small and subject to bank conflicts.

17. Shared Memory Complications
Shared memory is organized into 16 banks of 1 KB each (Bank 0, Bank 1, ..., Bank 15).
Complication I: Concurrent accesses to the same bank are serialized (a bank conflict), which slows execution down.
Complication II: Banks are interleaved: consecutive 4-byte words (byte offsets 0, 4, 8, ...) fall in consecutive banks (Bank 0, Bank 1, Bank 2, ...).
Tip: Assign different threads to different banks.
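The interleaving rule can be modeled in a few lines. This is a simplified model under stated assumptions: 16 banks and 4-byte words as on the slide, a group of 16 threads accessing shared memory together, and no special handling for several threads reading the very same word. The function names are hypothetical.

```python
from collections import Counter

NUM_BANKS = 16   # banks of shared memory, as stated on the slide
WORD_BYTES = 4   # consecutive 4-byte words map to consecutive banks


def bank_of(byte_address: int) -> int:
    """Bank that holds the 4-byte word containing `byte_address`."""
    return (byte_address // WORD_BYTES) % NUM_BANKS


def conflict_degree(addresses) -> int:
    """Largest number of accesses that land in one bank.

    In this simplified model, a degree of 1 means no conflict and a degree
    of d means the accesses are serialized into d steps.
    """
    counts = Counter(bank_of(a) for a in addresses)
    return max(counts.values())


def strided(stride_bytes: int, threads: int = 16) -> list:
    """Byte addresses touched by `threads` threads, thread i at i * stride."""
    return [i * stride_bytes for i in range(threads)]


for stride in (4, 8, 64):
    print(f"stride {stride:2d} bytes -> conflict degree "
          f"{conflict_degree(strided(stride))}")
```

A 4-byte stride spreads 16 threads over all 16 banks, while a 64-byte stride sends every thread to Bank 0. A per-thread buffer laid out contiguously therefore tends to conflict, which is why the tip is to assign different threads to different banks.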

18. Execution Path on GPU: Data Processing Application
1. Preprocessing
2. Data transfer in
3. GPU processing
4. Data transfer out
5. Postprocessing
$T_{Total} = T_{Preprocessing} + T_{DataHtoG} + T_{Processing} + T_{DataGtoH} + T_{PostProc}$
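The additive model is simple enough to write down directly. The sketch below uses hypothetical stage times (the numbers are placeholders, not measurements from the talk) to show how a total and a per-stage share could be derived from it.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class StageTimes:
    """Time spent in each of the five stages, in any consistent unit."""
    preprocessing: float
    data_h_to_g: float      # host-to-GPU transfer
    processing: float       # GPU kernel execution
    data_g_to_h: float      # GPU-to-host transfer
    postprocessing: float

    def total(self) -> float:
        """T_Total as the sum of the five stage times."""
        return sum(asdict(self).values())

    def shares(self) -> dict:
        """Fraction of T_Total spent in each stage."""
        total = self.total()
        return {name: value / total for name, value in asdict(self).items()}


# Placeholder values, chosen only to exercise the model.
example = StageTimes(1.0, 4.0, 3.0, 0.5, 1.5)
print("total:", example.total())
for name, share in example.shares().items():
    print(f"{name:15s} {share:.0%}")
```

Breaking the total down this way shows which stage dominates: if the transfers outweigh the kernel, optimizing the kernel alone cannot yield a large end-to-end speedup.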

19. Outline
- GPU architecture
- GPU programming
- Typical application flow
- StoreGPU design
- Evaluation

20. StoreGPU Design
I. Computing block hashes: hashing large blocks of data (100s of KB to 10s of MB).
II. Computing block boundaries: hashing a large number of small data blocks (a few bytes each).

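21. Computing Block Hash: Module Design
Figure: the input data is divided into chunks 1, 2, ..., i and processed by a host-plus-GPU pipeline with these stages: preprocessing (host machine), data transfer in, data transfer to shared memory, processing, result transfer to global memory, result transfer out, and execution of the final hash (host machine) to produce the output.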

22. Computing Block Hash: Module Design
- The design is highly parallel.
- The last step runs on the CPU to avoid synchronization.
- The resulting hash is not compatible with standard MD5 and SHA1 but is equally collision resistant [Damgard89].
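A rough sketch of this hash-the-chunks-then-hash-the-digests structure is shown below. It is not the StoreGPU implementation: CPU threads stand in for GPU threads, and the chunk size, worker count, and function name are assumptions made for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def parallel_block_hash(data: bytes, chunk_size: int, workers: int = 4) -> str:
    """Hash fixed-size chunks independently, then hash the joined digests.

    The per-chunk hashes need no coordination with each other; only the
    final, single-threaded step combines them.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so the output is deterministic.
        digests = list(pool.map(lambda c: hashlib.md5(c).digest(), chunks))
    # Final step: one ordinary hash over the concatenated intermediate digests.
    return hashlib.md5(b"".join(digests)).hexdigest()


block = b"abc" * 1000
two_level = parallel_block_hash(block, chunk_size=256)
standard = hashlib.md5(block).hexdigest()
print("two-level:", two_level)
print("standard :", standard)
```

The two printed values differ, which is the incompatibility the slide mentions: the two-level result identifies the block just as a hash would, but it cannot be compared against an MD5 digest computed elsewhere unless the other side uses the same chunk size and construction.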

23. Detecting Block Boundaries: Module Design
Figure: the input data is processed by a host-plus-GPU pipeline with these stages: preprocessing (host machine), data transfer in, data transfer to shared memory, processing, result transfer to global memory, and result transfer out to produce the output.

24. StoreGPU v1.0 Optimizations
- Optimized shared memory usage. The StoreGPU shared memory management mechanism assigns threads to different banks while providing a contiguous space abstraction.
- Memory pinning
- Reduced output size
Figures (repeated from earlier slides): content-based block boundary detection (window of m bytes, offset, k bits of the hash value compared with 0) and the interleaved shared memory banks (4-byte words across Bank 0, Bank 1, Bank 2, ...).

25. Outline
- GPU architecture
- GPU programming
- Typical application flow
- StoreGPU design
- StoreGPU v1.0 optimizations
- Evaluation

26. Evaluation
Testbed: a machine with
- CPU: Intel Core2 Duo 6600, 2 GB RAM (priced at $290)
- GPU: GeForce 8600 GTS (32 cores, 256 MB RAM, PCIe x16) (priced at $100)
Experiment space:
- GPU vs. single CPU core
- MD5 and SHA1 implementations
- Three optimizations
- Detecting block boundary configurations (m and offset)

27. Computing Block Hash
Figure: Computing Block Hash, MD5.
Over 4x speedup in computing block hashes.

28. Computing Block Boundary
Figure: Computing Block Boundary, MD5 (m = 20 bytes, offset = 4 bytes).
Over 8x speedup in detecting block boundaries.

29. Dissecting GPU Execution Time
1. Preprocessing
2. Data transfer in
3. GPU processing
4. Data transfer out
5. Postprocessing
$T_{Total} = T_{Preprocessing} + T_{DataHtoG} + T_{Processing} + T_{DataGtoH} + T_{PostProc}$

30. Dissecting GPU Execution Time
$T_{Total} = T_{Preprocessing} + T_{DataHtoG} + T_{Processing} + T_{DataGtoH} + T_{PostProc}$
Figure: MD5 computing-block-hashes module with all optimizations enabled.

31. Application Level Performance: Similarity Detection
Application: similarity detection between checkpoint images.
Data: checkpoints from BLAST (bioinformatics) collected using BLCR; checkpoint interval: 5 minutes.
Table: Online similarity detection throughput and speedup using MD5.

Technique                        StoreGPU (MBps)   Standard (MBps)   Speedup   Similarity ratio detected
Fixed size, compare by hash      840               193               4.3x      23%
Content based, compare by hash   114               13.5              8.4x      80%

Implication: similarity detection can be used even on 10 Gbps setups.
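The reported speedups follow directly from the throughput figures on this slide. The short check below uses only those four numbers; the dictionary layout and names are for illustration.

```python
# Throughput in MBps as reported on the slide: (StoreGPU, standard CPU version).
throughput = {
    "fixed size, compare by hash": (840.0, 193.0),
    "content based, compare by hash": (114.0, 13.5),
}

speedups = {
    technique: gpu / cpu for technique, (gpu, cpu) in throughput.items()
}

for technique, speedup in speedups.items():
    print(f"{technique}: {speedup:.2f}x")
```

The ratios come out at roughly 4.35 and 8.44, which the slide reports as 4.3x and 8.4x.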

32. Summary
StoreGPU:
- Offloads computationally intensive operations from the CPU
- Achieves considerable speedups
Contributions:
- Feasibility of using GPUs to support (distributed) middleware
- Performance model
- StoreGPU library
Implication: GPUs unleash a valuable set of optimization techniques into the high-performance systems design space.

33. Other GPU Applications
Current NetSysLab GPU-related projects:
- Exploring GPUs to support other middleware primitives: Bloom filters (BloomGPU)
- Packet classification
- Medical imaging compression
Computationally intensive operations (repeated from slide 6):
- Hashing
- Erasure coding
- Encryption/decryption
- Compression
- Membership testing (Bloom filter)

34. Thank you
netsyslab.ece.ubc.ca

35. References
[Damgard89] Damgard, I. A Design Principle for Hash Functions. In Advances in Cryptology - CRYPTO, Lecture Notes in Computer Science, 1989.
[Liu06] Liu, W., et al. Bio-sequence database scanning on a GPU. In Parallel and Distributed Processing Symposium (IPDPS), 2006.
[Vogt08] Vogt, L., et al. Accelerating Resolution-of-the-Identity Second-Order Moller-Plesset Quantum Chemistry Calculations with Graphical Processing Units. J. Phys. Chem. A, 112(10), 2049-2057, 2008.
[Anderson08] Anderson, J. A., Lorenz, C. D., and Travesset, A. General purpose molecular dynamics simulations fully implemented on graphics processing units. Journal of Computational Physics, 227(10), 5342-5359, 1 May 2008.
[Owens07] Owens, J. D., et al. A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, 26(1), 80-113, 2007.

