
1 Shredder: GPU-Accelerated Incremental Storage and Computation. Pramod Bhatotia §, Rodrigo Rodrigues §, Akshat Verma ¶ (§ MPI-SWS, Germany; ¶ IBM Research-India). USENIX FAST 2012

2 Handling the data deluge. Data stored in data centers is growing at a fast pace. Challenge: how do we store and process this data? Key technique: redundancy elimination. Applications of redundancy elimination: incremental storage (data de-duplication) and incremental computation (selective re-execution). Pramod Bhatotia 2

3 Redundancy elimination is expensive. Content-based chunking [SOSP01]: for large-scale data, chunking easily becomes a bottleneck. [Figure: a file passes through chunking, which cuts it into chunks at content markers; each chunk is hashed, and the fingerprint is matched against known fingerprints to decide whether the chunk is a duplicate.]
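Content-based chunking places boundaries where a rolling hash over a small window of recent bytes matches a marker, so boundaries depend on the data itself rather than on byte offsets. A minimal sketch of the idea (a toy polynomial rolling hash standing in for the Rabin fingerprinting used in practice; window, mask, and size parameters are illustrative):

```python
# Toy content-defined chunker in the spirit of the chunking Shredder
# accelerates. A boundary is declared where the rolling hash of the last
# `window` bytes matches a marker pattern; boundaries therefore depend on
# content, not on offsets. (Illustrative parameters; real systems use
# Rabin fingerprints and much larger chunks.)
BASE = 257
MOD = 1 << 31

def chunk(data, window=16, mask=0x3F, min_size=32, max_size=256):
    """Split `data` into content-defined chunks; returns a list of bytes."""
    pw = pow(BASE, window - 1, MOD)   # BASE^(window-1), to drop old bytes
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i >= window:               # slide the window: remove oldest byte
            h = (h - data[i - window] * pw) % MOD
        h = (h * BASE + b) % MOD      # add the new byte
        size = i + 1 - start
        # cut on a hash match (content marker) or when the chunk is too big
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])   # trailing partial chunk
    return chunks
```

Because boundaries are content-defined, inserting bytes in the middle of a file leaves the chunks before the edit unchanged, and chunks after the edit re-align once the window has passed the modified region, which is what makes de-duplication by fingerprint matching effective.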

4 Accelerate chunking using GPUs. GPUs have been successfully applied to compute-intensive tasks; using GPUs for data-intensive tasks presents new challenges. [Figure: storage servers; annotations: 2X, 20 Gbps (2.5 GBps).]

5 Rest of the talk. Shredder design: basic design; background on the GPU architecture and programming model; challenges and optimizations. Evaluation. Case studies: computation (incremental MapReduce) and storage (cloud backup).

6 Shredder basic design. [Figure: on the CPU (host), a Reader feeds data for chunking to a Transfer stage; the chunking kernel runs on the GPU (device); chunked data returns to the host for the Store stage.]
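The basic design is a staged pipeline: read, transfer to the device, chunk, store. A minimal host-side model of such a pipeline using threads and queues (the stage names follow the slide; `chunk_kernel` is a placeholder for the GPU kernel, and the Store stage is collapsed into a result list):

```python
import queue
import threading

def run_pipeline(blocks, chunk_kernel):
    """Run Reader -> Transfer -> Chunking -> Store over `blocks`."""
    to_transfer, to_chunk, results = queue.Queue(), queue.Queue(), []

    def reader():
        for b in blocks:
            to_transfer.put(b)
        to_transfer.put(None)          # end-of-stream marker

    def transfer():
        # stands in for the host-to-device copy
        while (b := to_transfer.get()) is not None:
            to_chunk.put(b)
        to_chunk.put(None)

    def chunker():
        # stands in for the GPU chunking kernel; results feed the Store stage
        while (b := to_chunk.get()) is not None:
            results.append(chunk_kernel(b))

    threads = [threading.Thread(target=f) for f in (reader, transfer, chunker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each stage runs in its own thread and hands work downstream through a queue, reading the next block can proceed while earlier blocks are still being transferred or chunked.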

7 GPU architecture. [Figure: the CPU (host) with host memory is connected over PCI to the GPU (device); the device holds global memory and multi-processors 1 to N, each containing scalar processors (SPs) and shared memory.]

8 GPU programming model. [Figure: the same host/device layout; input is copied from host memory to device global memory, threads running on the multi-processors compute over it, and output is copied back.]

9 Scalability challenges: (1) host-device communication bottlenecks, (2) device memory conflicts, (3) host bottlenecks (see paper for details).

10 Challenge #1: host-device communication bottleneck. With synchronous data transfer and kernel execution, the cost of the data transfer is comparable to the kernel execution, and large-scale data involves many such transfers. [Figure: a Reader performs I/O into host main memory; a Transfer stage moves the data across PCI into device global memory for the chunking kernel.]

11 Asynchronous execution. [Figure: over time, the host alternates copies into buffer 1 and buffer 2 of device global memory while the GPU computes on the other buffer, so the asynchronous copy overlaps with kernel execution.] Pros: overlaps communication with computation; generalizes to multi-buffering. Cons: requires page-pinning of buffers on the host side.
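With synchronous execution every block pays copy-then-compute in sequence; with double buffering the copy of block k+1 proceeds while block k is being computed, so the steady-state cost per block is max(copy, compute). A small timing model of that overlap (times are abstract units; on a real GPU the mechanism is asynchronous copies from pinned host buffers into alternating device buffers):

```python
def sync_time(n_blocks, copy, compute):
    """Total time when each copy is followed serially by its compute."""
    return n_blocks * (copy + compute)

def async_time(n_blocks, copy, compute):
    """Total time with double buffering: after the first copy, each
    remaining copy overlaps the previous block's compute, so the pipeline
    advances every max(copy, compute) units, plus the final compute."""
    if n_blocks == 0:
        return 0
    return copy + (n_blocks - 1) * max(copy, compute) + compute
```

When the copy cost is comparable to the kernel cost, as the slide notes for chunking, overlap approaches a 2X improvement: for example, 8 blocks with copy = compute = 1 take 16 units synchronously but only 9 with double buffering.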

12 Circular ring of pinned memory buffers. [Figure: on the host, data is memcpy'd from pageable buffers into a circular ring of pinned buffers, from which asynchronous copies proceed to device global memory.]
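Asynchronous device copies require pinned (page-locked) host memory, but pinning pages on demand is expensive, so a fixed ring of staging buffers is pinned once and pageable data is copied through it. A toy sketch of such a ring (the pinning itself is only simulated here; class and method names are illustrative):

```python
from collections import deque

class PinnedRing:
    """Fixed ring of reusable staging buffers, allocated (and, on a real
    host, page-pinned) once up front. stage() copies pageable data into
    the next free slot; release() returns the slot once the asynchronous
    copy to the device has completed."""

    def __init__(self, n_buffers, buf_size):
        self.bufs = [bytearray(buf_size) for _ in range(n_buffers)]
        self.free = deque(range(n_buffers))   # slot indices, circular order

    def stage(self, data):
        """Copy `data` into the next free pinned slot; return its index."""
        if not self.free:
            raise RuntimeError("all pinned buffers in flight")
        slot = self.free.popleft()
        self.bufs[slot][:len(data)] = data    # memcpy: pageable -> pinned
        return slot

    def release(self, slot):
        """Mark `slot` reusable after its device copy finishes."""
        self.free.append(slot)
```

The ring bounds the amount of pinned memory to a small constant regardless of input size, while the one-time pinning cost is amortized over every buffer reuse.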

13 Challenge #2: device memory conflicts. [Figure: the same host/device layout as before; the conflicts arise in device global memory as the chunking kernel reads its data.]

14 Accessing device memory. [Figure: threads 1 to 4 run on the scalar processors (SP-1 to SP-4) of a multi-processor; accesses to device global memory cost many cycles, while device shared memory takes only a few cycles.]

15 Memory bank conflicts. Uncoordinated accesses to global memory lead to a large number of memory bank conflicts. [Figure: a multi-processor with its SPs, device shared memory, and device global memory.]

16 Accessing memory banks. [Figure: interleaved memory; the LSBs of the memory address select one of Bank 0 to Bank 3 via chip enable, the MSBs address the word within the bank, and the banks' outputs are OR'd onto data out.]
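In an interleaved memory the low-order address bits select the bank and the high-order bits select the word within it, so consecutive addresses land in different banks and can be served in parallel, while addresses sharing the same low bits queue up on one bank. A small model of that mapping (4 banks, matching the figure; the conflict count is a simplification of real hardware serialization):

```python
from collections import Counter

N_BANKS = 4  # the figure shows Bank 0..Bank 3, selected by the LSBs

def bank_of(addr):
    """Low-order bits pick the bank; high-order bits pick the row in it."""
    return addr % N_BANKS

def conflicts(addrs):
    """Extra serialized accesses when `addrs` are issued together:
    each bank serves one access per cycle, so a bank hit k times
    contributes k - 1 stalls."""
    counts = Counter(bank_of(a) for a in addrs)
    return sum(c - 1 for c in counts.values())
```

Four consecutive addresses touch all four banks with no conflict, whereas a stride of 4 sends every access to the same bank and fully serializes them.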

17 Memory coalescing. [Figure: over time, threads issue coalesced accesses to device global memory, staging data into device shared memory.]
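Coalescing means the threads that execute together touch one contiguous, aligned region of global memory, so the hardware can satisfy them with a few wide transactions; scattered per-thread accesses each cost their own transaction. A toy transaction counter under a simplified model (each transaction fetches one aligned `seg`-byte segment; real coalescing rules are more detailed and device-specific):

```python
def transactions(addrs, seg=32):
    """Count the aligned `seg`-byte segments needed to cover all the
    addresses a group of threads issues together (a simplified model
    of global-memory coalescing)."""
    return len({a // seg for a in addrs})
```

Under this model, 32 threads reading 32 consecutive bytes need a single transaction, while the same threads reading with a 32-byte stride need 32, which is why Shredder-style kernels stage data cooperatively into shared memory with coalesced loads before per-thread processing.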

18 Processing the data. [Figure: threads 1 to 4 each process their portion of the data out of device shared memory.]

19 Outline: Shredder design, evaluation, case studies.

20 Evaluating Shredder. Goal: determine how Shredder works in practice. How effective are the optimizations? How does it compare with multicores? Implementation: host driver in C++, GPU code in CUDA. GPU: NVidia Tesla C2050 cards. Host machine: Intel Xeon with 12 cores. (See paper for details.)

21 Shredder vs. multicores. [Figure: throughput comparison between Shredder and a multicore implementation, annotated 5X.]

22 Outline: Shredder design, evaluation, case studies. Computation: incremental MapReduce. Storage: cloud backup. (See paper for details.)

23 Incremental MapReduce. [Figure: dataflow from reading the input, through map tasks and reduce tasks, to writing the output.]

24 Unstable input partitions. [Figure: the same MapReduce dataflow, illustrating how input partitions shift when the input changes.]

25 GPU-accelerated Inc-HDFS. [Figure: on copyFromLocal, the Shredder HDFS client runs the input file through content-based chunking, producing Split-1, Split-2, and Split-3 at content-defined boundaries.]
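The reason Inc-HDFS splits by content rather than by offset: with fixed-size splits, an insertion shifts every later boundary, so almost all splits change and must be re-processed; with content-defined splits, only the split containing the edit changes. A toy comparison (cutting after a marker byte stands in for the rolling-hash marker; sizes are illustrative):

```python
def fixed_splits(data, size=4):
    """Offset-based splitting, as in plain HDFS: cut every `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_splits(data, marker=0xFF):
    """Content-based splitting: cut after every occurrence of `marker`
    (a stand-in for a rolling-hash match)."""
    splits, start = [], 0
    for i, b in enumerate(data):
        if b == marker:
            splits.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        splits.append(data[start:])
    return splits
```

Inserting one byte near the front of a file changes every fixed-size split but only the first content-defined split, so an incremental MapReduce run can reuse the map outputs for all unchanged splits.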

26 Related work. GPU-accelerated systems: Gibraltar [ICPP10], HashGPU [HPDC10] (storage); SSLShader [NSDI11], PacketShader [SIGCOMM10], and others. Incremental computations: Incoop [SOCC11], Nectar [OSDI10], Percolator [OSDI10], and others.

27 Conclusions. Shredder is a GPU-accelerated framework for redundancy elimination that exploits massively parallel GPUs in a cost-effective manner. Its design incorporates novel optimizations and is more data-intensive than previous uses of GPUs. Shredder can be seamlessly integrated with storage systems to accelerate incremental storage and computation.

28 Thank You!

