Shredder: GPU-Accelerated Incremental Storage and Computation


1 Shredder: GPU-Accelerated Incremental Storage and Computation
Pramod Bhatotia§, Rodrigo Rodrigues§, Akshat Verma¶ — §MPI-SWS, Germany; ¶IBM Research-India. USENIX FAST 2012
Speaker notes: 1) In this talk, I will present a system called Shredder. 2) Shredder is a GPU-accelerated framework for redundancy elimination. 3) In this work, we use Shredder as a generic library to accelerate both incremental storage and computation. Transition: To start, I first motivate why we need to accelerate redundancy elimination.

2 Handling the data deluge
Data stored in data centers is growing at a fast pace Challenge: How to store and process this data ? Key technique: Redundancy elimination Applications of redundancy elimination Incremental storage: data de-duplication Incremental computation: selective re-execution Pramod Bhatotia

3 Redundancy elimination is expensive
(Figure: the pipeline File → Chunking → Hashing → Matching; each chunk's hash is looked up, and a duplicate chunk (Yes) is replaced by a reference while a new chunk (No) is stored.)
Content-based chunking [SOSP'01]: a fingerprint is computed over the file content, and a chunk boundary is placed wherever the fingerprint matches a marker.
For large-scale data, chunking easily becomes a bottleneck.
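To make the chunking step concrete, here is a minimal host-side sketch of content-based chunking. The real system uses Rabin fingerprinting; the polynomial rolling hash, window size, and boundary mask below are illustrative assumptions, not Shredder's actual parameters.

    // Minimal host-side sketch of content-based chunking (illustrative).
    // A production system (and the paper) would use Rabin fingerprints.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    const size_t   WINDOW = 48;            // sliding fingerprint window (bytes)
    const uint64_t B      = 257;           // rolling-hash base
    const uint64_t MASK   = (1 << 13) - 1; // ~8 KiB average chunk size

    std::vector<size_t> chunk_boundaries(const uint8_t* data, size_t len) {
        std::vector<size_t> cuts;
        // Precompute B^(WINDOW-1) for removing the outgoing byte (mod 2^64).
        uint64_t pw = 1;
        for (size_t i = 1; i < WINDOW; i++) pw *= B;

        uint64_t fp = 0;
        for (size_t i = 0; i < len; i++) {
            if (i >= WINDOW) fp -= data[i - WINDOW] * pw; // drop oldest byte
            fp = fp * B + data[i];                        // add newest byte
            // Declare a boundary when the fingerprint matches the marker.
            if (i + 1 >= WINDOW && (fp & MASK) == 0)
                cuts.push_back(i + 1);
        }
        cuts.push_back(len);  // the final (possibly short) chunk
        return cuts;
    }

Because boundaries depend only on nearby content, inserting a few bytes shifts only the chunks around the edit, which is what makes de-duplication effective. The fingerprint computation over every byte is also exactly why chunking is compute-heavy.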

4 Accelerate chunking using GPUs
GPUs have been successfully applied to compute-intensive tasks 20Gbps (2.5GBps) Storage Servers ? 2X Using GPUs for data-intensive tasks presents new challenges Pramod Bhatotia

5 Rest of the talk
Shredder design: background (GPU architecture & programming model), basic design, challenges and optimizations
Evaluation
Case studies: Computation: Incremental MapReduce; Storage: Cloud backup

6 Shredder basic design
(Figure: on the CPU (host), a Reader stages data for chunking in main memory; buffers are transferred to the GPU (device), where the chunking kernel runs; the chunked data is returned to the host and handed to the Store.)
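A minimal sketch of what this basic, synchronous pipeline might look like on the host. The chunk_kernel below is a stand-in for the real content-based chunking kernel: it merely flags bytes equal to an assumed marker value, and the buffer size and batch count are illustrative.

    // Sketch of the basic (synchronous) Shredder pipeline.
    #include <cuda_runtime.h>
    #include <cstdint>
    #include <vector>

    const size_t BUF = 1 << 20;  // 1 MiB transfer unit (illustrative)

    __global__ void chunk_kernel(const uint8_t* in, uint8_t* flags, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) flags[i] = (in[i] == 0x2A);  // placeholder boundary test
    }

    int main() {
        std::vector<uint8_t> host_in(BUF), host_flags(BUF);
        uint8_t *d_in, *d_flags;
        cudaMalloc((void**)&d_in, BUF);
        cudaMalloc((void**)&d_flags, BUF);

        for (int batch = 0; batch < 8; batch++) {  // Reader loop (I/O elided)
            // 1. Transfer: host main memory -> device global memory.
            cudaMemcpy(d_in, host_in.data(), BUF, cudaMemcpyHostToDevice);
            // 2. Chunking kernel runs on the device.
            chunk_kernel<<<(BUF + 255) / 256, 256>>>(d_in, d_flags, BUF);
            // 3. Chunk boundaries return to the host, then go to the Store.
            cudaMemcpy(host_flags.data(), d_flags, BUF, cudaMemcpyDeviceToHost);
        }
        cudaFree(d_in);
        cudaFree(d_flags);
        return 0;
    }

Note that each iteration serializes transfer and compute; this is exactly the bottleneck that slides 10-12 address.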

7 GPU architecture GPU (Device) Host memory Multi-processor N Device
global memory Multi-processor 2 CPU (Host) Multi-processor 1 Shared memory PCI SP SP SP SP Pramod Bhatotia

8 GPU programming model
(Figure: the same architecture as before, with Input flowing from host memory over PCI into device global memory and Output flowing back; kernel threads run on the multi-processors.)
Speaker notes: 1) The GPU programming model requires designing a hybrid application, where the control part runs on the CPU and the compute-intensive task is offloaded to the GPU. 2) The host coordinates the execution of application code on the GPU: it first ensures the data to operate on is available in device memory. 3) The host then launches the kernel program, and the GPU runtime dynamically schedules the kernel threads onto the scalar processors. 4) Threads fetch input directly from device global memory, or, when they want to reuse data across multiple operations, they stage it from global memory into shared memory for fast reuse. 5) The kernel executes SIMT code, where groups of threads share the resources of the device (GPU).
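Point 4 above (staging data from global into shared memory for reuse) looks roughly like this in CUDA; the kernel and tile size are illustrative, not Shredder's code:

    // Illustrative kernel: threads stage a tile from device global memory
    // into on-chip shared memory, synchronize, then reuse it repeatedly.
    const int TILE = 256;

    __global__ void reuse_tile(const unsigned char* g_data, unsigned int* g_sum) {
        __shared__ unsigned char tile[TILE];

        // Each thread copies one byte of the tile from global memory.
        tile[threadIdx.x] = g_data[blockIdx.x * TILE + threadIdx.x];
        __syncthreads();  // the whole tile is now in fast shared memory

        // Reuse: every thread now reads many tile bytes at shared-memory
        // latency instead of re-fetching each one from global memory.
        unsigned int sum = 0;
        for (int i = 0; i < TILE; i++)
            sum += tile[(threadIdx.x + i) % TILE];
        atomicAdd(g_sum, sum);
    }

A launch such as reuse_tile<<<num_tiles, TILE>>>(d_data, d_sum) would run one such cooperative block per tile.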

9 Scalability challenges
Host-device communication bottleneck
Device memory conflicts
Host bottlenecks (see paper for details)
Speaker notes: Recall that the basic design gives us 2X; we can do better by solving these three challenges.

10 Host-device communication bottleneck (Challenge #1)
(Figure: the Reader performs I/O into host main memory; data crosses the PCI bus into device global memory, where the chunking kernel consumes it.)
Data transfer and kernel execution are synchronous.
The cost of data transfer is comparable to that of kernel execution.
For large-scale data, this means many data transfers.

11 Asynchronous execution
(Figure: the host copies Buffer 1 and Buffer 2 asynchronously to device global memory; on the timeline, Copy to Buffer 2 overlaps with Compute Buffer 1, and so on.)
Pros:
+ Overlaps communication with computation
+ Generalizes to multi-buffering
Cons:
- Requires page-pinning of buffers on the host side
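A minimal double-buffering sketch with two CUDA streams, assuming the same stand-in chunk_kernel as in the basic-design sketch above. Pinned host buffers (cudaHostAlloc) are what make cudaMemcpyAsync genuinely overlap with kernel execution.

    // Double buffering with two CUDA streams (illustrative sketch).
    #include <cuda_runtime.h>
    #include <cstdint>

    __global__ void chunk_kernel(const uint8_t* in, uint8_t* flags, size_t n);

    const size_t BUF = 1 << 20;

    void pipeline(int num_batches) {
        uint8_t *h_in[2], *d_in[2], *d_flags[2];
        cudaStream_t s[2];
        for (int b = 0; b < 2; b++) {
            // Pinned host buffers: required for truly asynchronous copies.
            cudaHostAlloc((void**)&h_in[b], BUF, cudaHostAllocDefault);
            cudaMalloc((void**)&d_in[b], BUF);
            cudaMalloc((void**)&d_flags[b], BUF);
            cudaStreamCreate(&s[b]);
        }
        for (int i = 0; i < num_batches; i++) {
            int b = i % 2;                // alternate buffer/stream pairs
            cudaStreamSynchronize(s[b]);  // wait until buffer b is reusable
            // ... reader fills h_in[b] here ...
            cudaMemcpyAsync(d_in[b], h_in[b], BUF,
                            cudaMemcpyHostToDevice, s[b]);
            chunk_kernel<<<(BUF + 255) / 256, 256, 0, s[b]>>>(
                d_in[b], d_flags[b], BUF);
            // The copy for batch i overlaps with the kernel for batch i-1.
        }
        cudaDeviceSynchronize();
    }

The cost is the page-pinning itself: cudaHostAlloc locks the buffers into physical memory, which is what the ring of pinned buffers on the next slide manages.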

12 Circular ring of pinned memory buffers
(Figure: on the host, data arrives in ordinary pageable buffers and is memcpy'd into a circular ring of pinned buffers; from the ring it is copied asynchronously into device global memory.)
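A sketch of such a ring, assuming a fixed ring size: the reader memcpy's pageable data into the next pinned slot, and the asynchronous copy is issued from it. The event-based tracking of in-flight slots is an assumption about one possible implementation, not necessarily the paper's exact scheme.

    // Circular ring of pinned staging buffers (illustrative sketch).
    #include <cuda_runtime.h>
    #include <cstdint>
    #include <cstring>

    const int    RING = 4;
    const size_t BUF  = 1 << 20;

    struct PinnedRing {
        uint8_t*    slot[RING];
        cudaEvent_t done[RING];   // signals that a slot is reusable
        int         head = 0;

        PinnedRing() {
            for (int i = 0; i < RING; i++) {
                cudaHostAlloc((void**)&slot[i], BUF, cudaHostAllocDefault);
                cudaEventCreate(&done[i]);
                cudaEventRecord(done[i]);      // initially free
            }
        }
        // Stage one pageable buffer and start its asynchronous transfer.
        void stage(const uint8_t* pageable, uint8_t* d_dst, cudaStream_t s) {
            int i = head;
            head = (head + 1) % RING;
            cudaEventSynchronize(done[i]);     // wait until the slot is free
            memcpy(slot[i], pageable, BUF);    // pageable -> pinned
            cudaMemcpyAsync(d_dst, slot[i], BUF, cudaMemcpyHostToDevice, s);
            cudaEventRecord(done[i], s);       // free again once copy completes
        }
    };

This keeps the amount of pinned memory bounded at RING * BUF while still letting several transfers be in flight.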

13 Device memory conflicts (Challenge #2)
(Figure: the same host-device pipeline; the problem now lies in how the chunking kernel's threads access device global memory.)

14 Accessing device memory
(Figure: threads 1-4 running on scalar processors SP-1..SP-4 of a multi-processor pay many cycles per access to device global memory, but only a few cycles per access to the on-chip device shared memory.)

15 Memory bank conflicts
(Figure: scalar processors on a multi-processor issuing un-coordinated accesses to device global memory, bypassing shared memory.)
Un-coordinated accesses to global memory lead to a large number of memory bank conflicts.

16 Accessing memory banks
(Figure: a four-bank interleaved memory. Consecutive word addresses map to consecutive banks: addresses 0 and 4 fall in Bank 0, 1 and 5 in Bank 1, 2 and 6 in Bank 2, 3 and 7 in Bank 3. The address LSBs drive the chip-enable to select a bank; the MSBs select the word within the bank, and the selected bank drives Data out.)
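In code, the interleaved mapping is just a modulus; a tiny illustration, assuming the 4-bank layout of the figure:

    // Word-interleaved mapping onto 4 banks, as in the figure (illustrative).
    const int NUM_BANKS = 4;

    int bank_of(int word_addr)   { return word_addr % NUM_BANKS; }  // LSBs
    int offset_in(int word_addr) { return word_addr / NUM_BANKS; }  // MSBs

    // Addresses 0,1,2,3 land in banks 0,1,2,3: four accesses proceed in
    // parallel. Addresses 0,4,8,12 all land in bank 0: they serialize.

This is why the access pattern across threads, not just the total volume, determines memory throughput.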

17 Memory coalescing
(Figure: at each time step, threads 1-4 read consecutive words of device global memory, so the hardware coalesces the four reads into one memory transaction; the words are staged into device shared memory, from which the threads then process their data.)
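A sketch of that access pattern: each thread's chunking region is contiguous in the input, so the threads first cooperatively load the block's whole tile with consecutive (coalesced) addresses into shared memory, then each scans its own piece from the fast on-chip copy. Sizes and the boundary test are illustrative assumptions.

    // Coalesced staging of global memory into shared memory (sketch).
    #include <cstdint>

    const int THREADS    = 128;
    const int PER_THREAD = 64;                    // bytes each thread scans
    const int TILE       = THREADS * PER_THREAD;  // 8 KiB per block

    __global__ void scan_tile(const uint8_t* g_in, uint8_t* g_flags) {
        __shared__ uint8_t tile[TILE];
        const uint8_t* base = g_in + blockIdx.x * TILE;

        // Coalesced phase: consecutive threads read consecutive bytes.
        for (int i = threadIdx.x; i < TILE; i += THREADS)
            tile[i] = base[i];
        __syncthreads();

        // Compute phase: each thread scans its own contiguous region
        // from shared memory (a few cycles per access).
        const uint8_t* mine = tile + threadIdx.x * PER_THREAD;
        for (int i = 0; i < PER_THREAD; i++)
            g_flags[blockIdx.x * TILE + threadIdx.x * PER_THREAD + i] =
                (mine[i] == 0x2A);                // placeholder boundary test
    }

Without the staging loop, each thread would stride through global memory at PER_THREAD-byte intervals, and the un-coordinated accesses of the previous slide would return.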

18 Processing the data
(Figure: after the coalesced staging, Thread-1, Thread-2, Thread-3, ... each process their own region of the tile held in device shared memory.)

19 Outline
Shredder design
Evaluation
Case studies

20 Evaluating Shredder
Goal: determine how Shredder works in practice.
How effective are the optimizations?
How does it compare with multicores?
Implementation: host driver in C++, GPU code in CUDA.
GPU: NVIDIA Tesla C2050 cards. Host machine: Intel Xeon with 12 cores. (See paper for details.)

21 Shredder vs. multicores
(Figure: with its optimizations, Shredder achieves a 5X speedup over the multicore baseline.)

22 Outline
Shredder design
Evaluation
Case studies: Computation: Incremental MapReduce; Storage: Cloud backup (see paper for details)

23 Incremental MapReduce
(Figure: the MapReduce dataflow: read input → map tasks → reduce tasks → write output.)

24 Unstable input partitions
(Figure: the same dataflow, showing how a small change to the input shifts the input partitions, so map tasks re-execute even over unchanged data.)

25 GPU-accelerated Inc-HDFS
(Figure: a client invokes copyFromLocal; the HDFS client routes the input file through Shredder, whose content-based chunking produces splits Split-1, Split-2, Split-3 with content-defined rather than fixed-offset boundaries.)

26 Related work
GPU-accelerated systems:
Storage: Gibraltar [ICPP'10], HashGPU [HPDC'10]
SSLShader [NSDI'11], PacketShader [SIGCOMM'10], ...
Incremental computations:
Incoop [SOCC'11], Nectar [OSDI'10], Percolator [OSDI'10], ...

27 Conclusions
Shredder is a GPU-accelerated framework for redundancy elimination.
It exploits massively parallel GPUs in a cost-effective manner.
Shredder's design incorporates novel optimizations, and its workload is more data-intensive than previous uses of GPUs.
Shredder can be seamlessly integrated with storage systems to accelerate incremental storage and computation.

28 Thank You!

