Shredder: GPU-Accelerated Incremental Storage and Computation
Pramod Bhatotia§, Rodrigo Rodrigues§, Akshat Verma¶
§MPI-SWS, Germany   ¶IBM Research-India
USENIX FAST 2012

In this talk, I will present a system called Shredder, a GPU-accelerated framework for redundancy elimination. In this work, we use Shredder as a generic library to accelerate both incremental storage and incremental computation. To start, let me motivate why we need to accelerate redundancy elimination in the first place.
Handling the data deluge
Data stored in data centers is growing at a fast pace.
Challenge: how do we store and process all of this data?
Key technique: redundancy elimination.
Applications of redundancy elimination:
- Incremental storage: data de-duplication
- Incremental computation: selective re-execution
Redundancy elimination is expensive
Pipeline: File -> Chunking -> Chunks -> Hashing -> Hash -> Matching -> duplicate? (yes/no)
[Figure: content-based chunking [SOSP'01] — a fingerprint is computed over a sliding window of the file content, and a chunk boundary (marker) is placed wherever the fingerprint matches a predefined pattern]
For large-scale data, chunking easily becomes the bottleneck.
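To make the chunking step concrete, here is a minimal CPU sketch of content-based chunking with a rolling (Rabin-style) fingerprint; the window size, multiplier, and boundary mask are illustrative choices, not Shredder's actual parameters.

```cpp
// Minimal sketch of content-based chunking with a rolling hash.
// WINDOW, PRIME, and MASK below are illustrative, not Shredder's values.
#include <cstdint>
#include <vector>

static const size_t   WINDOW = 48;                   // bytes in the rolling window
static const uint64_t PRIME  = 1099511628211ULL;     // multiplier (assumed)
static const uint64_t MASK   = (1 << 13) - 1;        // ~8 KB average chunk size

// Returns the offsets at which chunk boundaries (markers) are placed.
std::vector<size_t> chunk(const uint8_t* data, size_t len) {
    std::vector<size_t> boundaries;
    uint64_t hash = 0;
    // Precompute PRIME^WINDOW so the oldest byte can be subtracted out.
    uint64_t pw = 1;
    for (size_t i = 0; i < WINDOW; i++) pw *= PRIME;
    for (size_t i = 0; i < len; i++) {
        hash = hash * PRIME + data[i];                    // slide new byte in
        if (i >= WINDOW) hash -= pw * data[i - WINDOW];   // slide old byte out
        // Declare a boundary when the fingerprint's low bits match the pattern.
        if (i >= WINDOW && (hash & MASK) == MASK)
            boundaries.push_back(i + 1);
    }
    return boundaries;
}
```

Because the boundary decision depends only on the window contents, an insertion early in the file shifts boundaries only locally, which is what makes the chunks reusable for de-duplication.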
Accelerate chunking using GPUs
GPUs have been successfully applied to compute-intensive tasks.
[Figure: can GPUs keep up with storage servers ingesting data at 20 Gbps (2.5 GB/s)? A basic GPU design gives roughly a 2X speedup.]
Using GPUs for data-intensive tasks presents new challenges.
Rest of the talk
Shredder design
- Background: GPU architecture & programming model
- Basic design
- Challenges and optimizations
Evaluation
Case studies
- Computation: incremental MapReduce
- Storage: cloud backup
Shredder basic design
[Figure: the Reader on the CPU (host) transfers data for chunking to the GPU (device), where the chunking kernel runs; the chunked data is returned to the Store on the host]
GPU architecture
[Figure: the CPU (host) with its host memory connects over PCI to the GPU (device); the device contains global memory and N multi-processors, each with its own shared memory and a set of scalar processors (SPs)]
GPU programming model
[Figure: as before, with input and output buffers in device global memory and groups of threads scheduled onto the multi-processors]
The programming model requires designing a hybrid application: the control part runs on the CPU, and the compute-intensive part is offloaded to the GPU. The host coordinates execution of the application code on the GPU. First, the host ensures that the data to operate on is available in device memory. It then launches the kernel program, and the GPU runtime dynamically schedules the kernel threads onto the scalar processors. These threads fetch their input directly from device global memory; when they need to reuse data across multiple operations, they can stage it from global memory into shared memory for fast reuse. The kernel executes SIMT code, where groups of threads share the resources of a multi-processor.
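As a concrete illustration of this host/device split, here is a minimal CUDA example following the copy-launch-copy pattern; the kernel and data are illustrative, not Shredder's chunking code.

```cpp
// Minimal CUDA sketch of the hybrid host/device pattern described above.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void square(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n) out[i] = in[i] * in[i];              // SIMT: all threads run this
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_in = new float[n], *h_out = new float[n];
    for (int i = 0; i < n; i++) h_in[i] = float(i);

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    // 1) Host ensures the input is available in device memory.
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    // 2) Host launches the kernel; the runtime schedules threads onto the SPs.
    square<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    // 3) Host copies the result back.
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    printf("out[2] = %f\n", h_out[2]);  // prints 4.0
    cudaFree(d_in); cudaFree(d_out);
    delete[] h_in; delete[] h_out;
    return 0;
}
```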
Scalability challenges
- Host-device communication bottleneck
- Device memory conflicts
- Host bottlenecks (see paper for details)
Recall that the basic design gives us only a 2X speedup; we can do better by addressing these three challenges.
Challenge #1: Host-device communication bottleneck
[Figure: the Reader fills main memory from I/O, and the data is transferred over PCI to device global memory before the chunking kernel runs]
- Data transfer and kernel execution are synchronous
- The cost of the data transfer is comparable to that of the kernel execution
- For large-scale data, this involves many such transfers
Asynchronous execution
[Figure: the host alternates asynchronous copies into two device buffers; while Buffer 2 is being filled, the GPU computes on Buffer 1, and vice versa]
Pros:
+ Overlaps communication with computation
+ Generalizes to multi-buffering
Cons:
- Requires page-pinning of the buffers on the host side
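A sketch of how this double buffering looks with CUDA streams; the kernel, buffer size, and launch configuration are assumptions for illustration, not Shredder's actual pipeline.

```cpp
// Double buffering with CUDA streams: the async copy into one buffer
// overlaps with the kernel still running on the other buffer.
#include <cuda_runtime.h>
#include <cstring>

__global__ void process(const char* in, size_t n) {
    // ... chunking computation would go here ...
}

void pipeline(const char* h_data, size_t total, size_t buf_size) {
    char* h_pinned[2];
    char* d_buf[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; i++) {
        cudaMallocHost(&h_pinned[i], buf_size);  // page-pinned host buffer
        cudaMalloc(&d_buf[i], buf_size);
        cudaStreamCreate(&stream[i]);
    }
    int b = 0;
    for (size_t off = 0; off < total; off += buf_size, b ^= 1) {
        size_t n = (total - off < buf_size) ? total - off : buf_size;
        // Wait until the previous transfer+kernel on this buffer finished.
        cudaStreamSynchronize(stream[b]);
        memcpy(h_pinned[b], h_data + off, n);    // stage into pinned memory
        cudaMemcpyAsync(d_buf[b], h_pinned[b], n,
                        cudaMemcpyHostToDevice, stream[b]);
        process<<<256, 256, 0, stream[b]>>>(d_buf[b], n);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < 2; i++) {
        cudaFreeHost(h_pinned[i]); cudaFree(d_buf[i]);
        cudaStreamDestroy(stream[i]);
    }
}
```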
Circular ring of pinned memory buffers
[Figure: on the host, data is memcpy'd from pageable buffers into a circular ring of pinned buffers, from which it is asynchronously copied to device global memory]
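One way to realize this ring is sketched below, using CUDA events to track when each pinned slot is free for reuse; the ring size and slot size are assumptions, not Shredder's actual values.

```cpp
// Circular ring of pinned staging buffers, decoupling the reader (which
// fills pageable memory) from the asynchronous device transfers.
#include <cuda_runtime.h>
#include <cstring>

const int    RING = 4;
const size_t BUF  = 1 << 22;  // 4 MB per slot (assumed)

struct PinnedRing {
    char*       slot[RING];
    cudaEvent_t done[RING];   // signals that slot i's async copy finished
    int head = 0;

    PinnedRing() {
        for (int i = 0; i < RING; i++) {
            cudaMallocHost(&slot[i], BUF);  // pinned allocation
            cudaEventCreate(&done[i]);
            cudaEventRecord(done[i]);       // initially marked reusable
        }
    }
    // Copy one pageable buffer into the ring and start its async transfer.
    void push(const char* pageable, size_t n, char* d_dst, cudaStream_t s) {
        int i = head;
        head = (head + 1) % RING;
        cudaEventSynchronize(done[i]);  // wait until the slot is free again
        memcpy(slot[i], pageable, n);   // pageable -> pinned
        cudaMemcpyAsync(d_dst, slot[i], n, cudaMemcpyHostToDevice, s);
        cudaEventRecord(done[i], s);    // slot busy until this copy completes
    }
};
```

The ring keeps the amount of pinned memory bounded while still letting several transfers be in flight, which is why it generalizes the two-buffer scheme of the previous slide.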
Challenge #2: Device memory conflicts
[Figure: the same architecture as before; the bottleneck now lies in how the chunking kernel accesses device global memory]
Accessing device memory
[Figure: threads on a multi-processor pay 400-600 cycles to access device global memory, but only a few cycles to access the multi-processor's shared memory]
Memory bank conflicts
Uncoordinated accesses to global memory lead to a large number of memory bank conflicts.
[Figure: scalar processors on a multi-processor issuing scattered accesses to device global memory]
Accessing memory banks
[Figure: interleaved memory — consecutive addresses are spread across banks (address 0 in Bank 0, 1 in Bank 1, 2 in Bank 2, 3 in Bank 3, 4 back in Bank 0, and so on); the low-order address bits act as the chip enable selecting the bank, while the high-order bits select the word within the bank]
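A tiny worked example of the bank-selection arithmetic; four banks and word-granularity interleaving are assumed, matching the figure.

```cpp
// Word-interleaved banking: with 4 banks, the two low-order bits of the
// word address select the bank, so consecutive words land in different
// banks and can be served in parallel.
#include <cstdio>

int main() {
    const unsigned NUM_BANKS = 4;
    for (unsigned addr = 0; addr < 8; addr++) {
        unsigned bank = addr % NUM_BANKS;  // LSBs: which bank (chip enable)
        unsigned row  = addr / NUM_BANKS;  // MSBs: word within the bank
        printf("address %u -> bank %u, row %u\n", addr, bank, row);
    }
    return 0;
}
```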
Memory coalescing
[Figure: instead of each thread streaming through its own region, threads 1-4 cooperatively read adjacent words at each time step, so their accesses to global memory coalesce; the data is staged into device shared memory]
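A sketch of this staging pattern in CUDA: the threads of a block cooperatively load adjacent words into shared memory, and only then does each thread scan its own sub-region out of fast shared memory. The tile size and kernel are illustrative, not Shredder's code.

```cpp
// Coalesced staging into shared memory, then per-thread processing.
#include <cuda_runtime.h>

const int TILE = 4096;  // words staged per block (assumed)

__global__ void stage_and_process(const unsigned* g_in, size_t n_words) {
    __shared__ unsigned tile[TILE];
    size_t base = (size_t)blockIdx.x * TILE;

    // Coalesced load: on every pass of this loop, the threads of the block
    // read consecutive words (thread t reads base + pass*blockDim.x + t),
    // so each warp's accesses to global memory are adjacent.
    for (int j = threadIdx.x; j < TILE && base + j < n_words; j += blockDim.x)
        tile[j] = g_in[base + j];
    __syncthreads();

    // Each thread now scans its own contiguous sub-region out of shared
    // memory, where scattered access costs only a few cycles.
    int per_thread = TILE / blockDim.x;
    int start = threadIdx.x * per_thread;
    for (int k = start; k < start + per_thread; k++) {
        // ... chunking computation over tile[k] would go here ...
    }
}
```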
Processing the data
[Figure: after staging, each thread processes its own contiguous region of device shared memory]
Outline
Shredder design
Evaluation
Case studies
Evaluating Shredder
Goal: determine how Shredder works in practice
- How effective are the optimizations?
- How does it compare with multicores?
Implementation:
- Host driver in C++, GPU code in CUDA
- GPU: NVIDIA Tesla C2050 cards
- Host machine: Intel Xeon with 12 cores
(See paper for details)
Shredder vs. multicores
[Figure: chunking throughput — Shredder achieves a 5X speedup over the multicore baseline]
Outline
Shredder design
Evaluation
Case studies
- Computation: incremental MapReduce
- Storage: cloud backup (see paper for details)
Incremental MapReduce
[Figure: dataflow — read input, map tasks, reduce tasks, write output]
Unstable input partitions
[Figure: the same dataflow — with fixed-size input partitioning, a small change to the input shifts the partition boundaries, so map and reduce tasks must re-execute even on unchanged data]
GPU-accelerated Inc-HDFS
[Figure: an input file is uploaded via copyFromLocal; the HDFS client invokes Shredder to perform content-based chunking, producing splits (Split-1, Split-2, Split-3) with content-defined boundaries]
Related work
GPU-accelerated systems
- Storage: Gibraltar [ICPP'10], HashGPU [HPDC'10]
- Networking: SSLShader [NSDI'11], PacketShader [SIGCOMM'10], ...
Incremental computations
- Incoop [SOCC'11], Nectar [OSDI'10], Percolator [OSDI'10], ...
Conclusions
Shredder: a GPU-accelerated framework for redundancy elimination
- Exploits massively parallel GPUs in a cost-effective manner
Shredder's design incorporates novel optimizations
- Targets workloads that are more data-intensive than previous uses of GPUs
Shredder can be seamlessly integrated with storage systems
- Accelerates both incremental storage and incremental computation
Thank You!