
1 A GPU Accelerated Storage System NetSysLab The University of British Columbia Abdullah Gharaibeh with: Samer Al-Kiswany Sathish Gopalakrishnan Matei Ripeanu

2 GPUs radically change the cost landscape [price chart: $600 vs. $1279] (Source: CUDA Guide)

3 Harnessing GPU Power is Challenging
– more complex programming model
– limited memory space
– accelerator / co-processor model

4 Motivating Question: Does the 10x reduction in computation costs GPUs offer change the way we design/implement distributed systems? Context: Distributed Storage Systems

5 Distributed Systems: Computationally Intensive Operations
Operations (computationally intensive; limit performance): hashing, erasure coding, encryption/decryption, membership testing (Bloom filter), compression.
Techniques they enable: similarity detection, content addressability, security, integrity checks, redundancy, load balancing, summary cache, storage efficiency.
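One of the operations named above, membership testing, is commonly implemented with a Bloom filter (used, for example, as a summary cache). A minimal illustrative sketch, not the system's actual implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit array.
    False positives are possible; false negatives are not."""

    def __init__(self, m_bits=1024, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # an int used as a bit array

    def _positions(self, item):
        # Derive k positions by salting a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha1(b"%d:%s" % (i, item)).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def may_contain(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))
```

A block digest is inserted on write; on a later write, a negative lookup guarantees the block is new, while a positive lookup triggers an exact check.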

6 Distributed Storage System Architecture
[Diagram] Application Layer: the application uses the FS API of a client Access Module; the client talks to a Metadata Manager and Storage Nodes; files are divided into a stream of blocks (b1, b2, b3, ..., bn).
Techniques to improve performance/reliability: similarity detection, security, integrity checks, redundancy.
Enabling operations: compression, encoding/decoding, encryption/decryption, hashing, executed on the CPU or the GPU through an Offloading Layer.

7 Contributions:
– A GPU-accelerated storage system: design and prototype implementation that integrates similarity detection and GPU support
– End-to-end system evaluation: 2x throughput improvement for a realistic checkpointing workload

8 Challenges
– Integration challenges: minimizing the integration effort, transparency, separation of concerns
– Extracting major performance gains: hiding memory allocation overheads, hiding data transfer overheads, efficient utilization of the GPU memory units, use of multi-GPU systems
[Diagram: files divided into a stream of blocks (b1, b2, b3, ..., bn) feed similarity detection, which offloads hashing to the GPU through the offloading layer]

9 Past Work: Hashing on GPUs
HashGPU [1]: a library that exploits GPUs to support specialized use of hashing in distributed storage systems (hashing a stream of blocks).
One performance data point: accelerates hashing by up to 5x compared to a single-core CPU.
However, significant speedup is achieved only for large blocks (>16MB), making it unsuitable for efficient similarity detection.
[1] "Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems", S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, M. Ripeanu, HPDC '08.
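HashGPU's specialized use case is hashing a file as a stream of fixed-size blocks so that each block gets its own digest. A CPU-side sketch of that pattern (block size and digest algorithm here are illustrative choices, not HashGPU's actual parameters):

```python
import hashlib

def hash_blocks(data, block_size=256 * 1024):
    """Split data into fixed-size blocks and hash each one independently.
    Per-block digests are what enable block-level similarity detection."""
    digests = []
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        digests.append(hashlib.md5(block).hexdigest())
    return digests
```

On a GPU, each block (or sub-chunk of a block) is processed by its own group of threads, which is one reason small blocks leave the device underutilized.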

10 Profiling HashGPU
[Profile: at least 75% of HashGPU's execution time is overhead]
Amortizing memory allocation and overlapping data transfers with computation may bring important benefits.

11 CrystalGPU
CrystalGPU: a layer of abstraction that transparently enables common GPU optimizations.
[Diagram: CrystalGPU sits between HashGPU and the GPU inside the offloading layer used by similarity detection]
One performance data point: CrystalGPU improves the speedup of the HashGPU library by more than one order of magnitude.

12 CrystalGPU Opportunities and Enablers
– Opportunity: reusing GPU memory buffers. Enabler: a high-level memory manager.
– Opportunity: overlapping communication and computation. Enabler: double buffering and asynchronous kernel launch.
– Opportunity: multi-GPU systems (e.g., GeForce 9800 GX2 and GPU clusters). Enabler: a task queue manager.
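The first two opportunities can be illustrated with a host-side sketch: a free-list that recycles allocations (standing in for pinned GPU buffers) and a two-stage pipeline that overlaps staging of batch i+1 with processing of batch i. This is a conceptual model of the optimizations, not CrystalGPU's API:

```python
import threading
import queue

class BufferPool:
    """Recycle fixed-size buffers so allocation cost is paid only once."""

    def __init__(self, size, count):
        self.free = queue.Queue()
        for _ in range(count):
            self.free.put(bytearray(size))  # stands in for pinned GPU buffers

    def acquire(self):
        return self.free.get()

    def release(self, buf):
        self.free.put(buf)

def pipeline(batches, transfer, compute, pool):
    """Double buffering: while batch i is being computed, batch i+1 is
    staged into a second buffer by a producer thread."""
    staged = queue.Queue(maxsize=1)  # depth-1 queue: two buffers in flight

    def producer():
        for batch in batches:
            buf = pool.acquire()
            transfer(batch, buf)     # stands in for an async host-to-GPU copy
            staged.put(buf)
        staged.put(None)             # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (buf := staged.get()) is not None:
        results.append(compute(buf))  # stands in for the kernel launch
        pool.release(buf)
    return results
```

The depth-1 queue is what bounds the pipeline to two buffers, mirroring classic double buffering; a deeper queue with more pooled buffers would generalize it.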

13 Experimental Evaluation:  CrystalGPU evaluation  End-to-end system evaluation

14 CrystalGPU Evaluation
Testbed: a machine with an Intel quad-core 2.66 GHz CPU on a PCI Express 2.0 x16 bus and an NVIDIA dual-GPU GeForce 9800 GX2.
Experiment space: HashGPU/CrystalGPU vs. original HashGPU, across three optimizations: buffer reuse, overlapping communication and computation, and exploiting the two GPUs.

15 HashGPU Performance on top of CrystalGPU
[Baseline: single-core CPU]
The gains enabled by the three optimizations can be realized!

16 End-to-End System Evaluation
– Testbed: four storage nodes and one metadata server; one client with a 9800 GX2 GPU
– Three implementations: no similarity detection (without-SD); similarity detection on the CPU (4 cores at 2.6 GHz) (SD-CPU); similarity detection on the GPU (9800 GX2) (SD-GPU)
– Three workloads: a real checkpointing workload; completely similar files (all possible gains in terms of data saving); completely different files (only overheads, no gains)
– Success metrics: system throughput; impact on a competing application (compute- or I/O-intensive)

17 System Throughput (Checkpointing Workload)
[Result: 1.8x throughput improvement]
The integrated system preserves the throughput gains on a realistic workload!

18 System Throughput (Synthetic Workload of Similar Files)
[Chart annotation: room for 2x improvement]
Offloading to the GPU enables close to optimal performance!

19 Impact on a Competing (Compute-Intensive) Application
[Chart annotations: writing checkpoints back to back; 2x improvement; 7% reduction]
Frees resources (CPU) for competing applications while preserving throughput gains!

20 Summary
– We present the design and implementation of a distributed storage system that integrates GPU power
– We present CrystalGPU: a management layer that transparently enables common GPU optimizations across GPGPU applications
– We empirically demonstrate that employing the GPU enables close to optimal system performance
– We shed light on the impact of GPU offloading on competing applications running on the same node

21 netsyslab.ece.ubc.ca

22 Similarity Detection
File A is hashed into blocks X, Y, Z; File B is hashed into blocks W, Y, Z. Only the first block is different, potentially improving write throughput.
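The comparison above can be sketched directly: hash both files block by block and transfer only the blocks whose digests have not been stored before. This is a hypothetical helper illustrating the idea, not the system's code:

```python
import hashlib

def blocks_to_send(file_blocks, stored_digests):
    """Return (index, block) pairs whose content hash is not yet in storage
    (content addressability); record the new digests as a side effect."""
    out = []
    for i, block in enumerate(file_blocks):
        d = hashlib.sha1(block).hexdigest()
        if d not in stored_digests:
            out.append((i, block))
            stored_digests.add(d)
    return out
```

For File A = [X, Y, Z] already stored and File B = [W, Y, Z], only block W needs to be transferred.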

23 Execution Path on GPU – Data Processing Application
T_Total = T_Preprocessing + T_DataHtoG + T_Processing + T_DataGtoH + T_PostProc
1. Preprocessing (memory allocation)
2. Data transfer in (host to GPU)
3. GPU processing
4. Data transfer out (GPU to host)
5. Postprocessing
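This breakdown also explains why the earlier optimizations pay off: buffer reuse amortizes the preprocessing stage away, and with double buffering the transfer stages of one batch hide behind the processing stage of another, so over a long stream the per-batch cost approaches the slowest stage rather than the sum of all stages. A toy cost model under those idealized assumptions (the stage times are illustrative, not measured values):

```python
def serial_time(n, t_pre, t_htog, t_proc, t_gtoh, t_post):
    """No overlap, no buffer reuse: every batch pays every stage."""
    return n * (t_pre + t_htog + t_proc + t_gtoh + t_post)

def pipelined_time(n, t_htog, t_proc, t_gtoh):
    """Idealized overlap with buffer reuse: allocation amortized away;
    steady-state throughput is bounded by the slowest stage."""
    stages = [t_htog, t_proc, t_gtoh]
    # Fill the pipeline once, then emit one batch per slowest-stage tick.
    return sum(stages) + (n - 1) * max(stages)
```

For example, with transfer-in, processing, and transfer-out times of 3, 4, and 3 units, 100 batches cost 1000 units serially (before even counting allocation) but about 406 units pipelined, a reduction of the same order as the 75%+ overhead profiled in HashGPU.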