
Accelerated Erasure Coding: The New Frontier of Software Defined Storage
Dineshkumar Bhaskaran, Aricent Technologies

Introduction to Data Resiliency
Traditional RAID and mirroring: multiple disks are used for data placement, improving performance and resiliency. Drawbacks: high storage overhead, high rebuild times, and difficulty recovering from correlated disk failures.
Erasure coding: a data protection method in which data is encoded into data blocks and parity blocks, which are then stored across locations or storage nodes. Drawback: compute intensive.
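To make the overhead gap concrete, a small illustrative comparison (the (10, 4) parameters below are assumptions for the example, not from the slides):

```python
# Storage overhead: extra storage as a fraction of user data.
# 3-way replication keeps 3 full copies; a (k, m) erasure code
# stores k data chunks plus m parity chunks.
def overhead_replication(copies):
    return copies - 1            # 3 copies -> 200% overhead

def overhead_ec(k, m):
    return m / k                 # (10, 4) -> 40% overhead

print(overhead_replication(3))   # 2.0
print(overhead_ec(10, 4))        # 0.4
```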

Erasure Coding: A Primer
A traditional erasure code is represented as (k, m): it encodes k data blocks into m parity blocks and writes them to k+m storage nodes. An optimal (or MDS) code can recover from any m node failures. A popular code is Reed-Solomon (RS); it has been used successfully in systems such as Linux RAID-6, Google file system II, Hadoop, and Facebook.
[Figure: a (4, 2) erasure code; data blocks D1-D4, parity blocks C1, C2]
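A minimal sketch of such a (k, m) = (4, 2) MDS code, using a systematic Cauchy-style generator matrix over GF(2^8). This is an illustration only; the coefficient choices and helper names are ours, and production systems use optimized libraries such as Jerasure or Intel ISA-L:

```python
# (4, 2) MDS erasure code sketch over GF(2^8). The generator matrix G stacks
# a 4x4 identity on top of a 2x4 Cauchy block, so the code is systematic and
# any 4 of the 6 chunks can rebuild the original data.

# GF(2^8) log/antilog tables, primitive polynomial x^8+x^4+x^3+x^2+1 (0x11d).
EXP, LOG = [0] * 512, [0] * 256
x = 1
for i in range(255):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gmul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def ginv(a):
    return EXP[255 - LOG[a]]

k, m = 4, 2
xs, ys = [0, 1], [2, 3, 4, 5]          # distinct field elements
G = [[1 if r == c else 0 for c in range(k)] for r in range(k)] + \
    [[ginv(xs[i] ^ ys[j]) for j in range(k)] for i in range(m)]

def matvec(rows, chunks):              # matrix times vector of byte chunks
    out = []
    for row in rows:
        acc = bytearray(len(chunks[0]))
        for coef, chunk in zip(row, chunks):
            if coef:
                for t, byte in enumerate(chunk):
                    acc[t] ^= gmul(coef, byte)
        out.append(bytes(acc))
    return out

def gf_inverse(mat):                   # Gauss-Jordan inversion over GF(2^8)
    n = len(mat)
    a = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(mat)]
    for col in range(n):
        piv = next(r for r in range(col, n) if a[r][col])
        a[col], a[piv] = a[piv], a[col]
        s = ginv(a[col][col])
        a[col] = [gmul(s, v) for v in a[col]]
        for r in range(n):
            if r != col and a[r][col]:
                f = a[r][col]
                a[r] = [v ^ gmul(f, w) for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]

data = [b"node", b"fail", b"ures", b"okay"]       # k = 4 data chunks
chunks = matvec(G, data)                          # 4 data + 2 parity chunks
alive = [0, 2, 4, 5]                              # chunks 1 and 3 are lost
dec = gf_inverse([G[i] for i in alive])           # invert surviving rows
assert matvec(dec, [chunks[i] for i in alive]) == data
```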

Erasure Coding: Read and Write
A traditional (4, 2) erasure code has 4 data chunks and 2 parity chunks.
Write path: the application's data block (file/object) is split into chunks F1-F4, encoded into data chunks D1-D4 plus parity chunks C1, C2, and written to disks S1-S6.
Read path: the data chunks are read back and the data block is reconstructed for the application.
Reconstruct path: after a disk failure, surviving chunks are read, the lost chunks are reconstructed, and they are written to new disks.

Erasure Coding: Shortcomings
Encoding is compute intensive: in the case of Reed-Solomon, a generator matrix of dimension (k+m, k) is used to create code chunks from data chunks.
Reconstruction is costly. It is triggered by:
Degraded read: the application receives a read exception while reading a data block on a node, due to software errors (hot-spot effects or system updates) or hardware errors.
Node repair: the whole node is down.
[Figure: number of failed nodes in a 3,000-node Facebook cluster over one month [4]]
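To see why repair is costly, a back-of-envelope example with assumed parameters:

```python
# With a classical (k, m) MDS code, rebuilding ONE lost chunk still requires
# reading k whole chunks from surviving nodes. Parameters are illustrative.
k, chunk_mb = 10, 256
print(f"repairing one {chunk_mb} MB chunk reads {k * chunk_mb} MB")  # 2560 MB
```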

Erasure Coding: Modern Approaches
Locally recoverable codes (LRC): LRCs trade storage efficiency for a faster recovery process. They apply an MDS code hierarchically, performing the encoding at multiple levels.
[Figure: data blocks D1-D8 with local parities L1, L2 and global parities G1-G4]
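A minimal sketch of the local-repair idea (the group layout and names are illustrative, not the exact LRC of any particular system): a local parity over a small group lets a single lost chunk be rebuilt from that group alone, without touching the global parities.

```python
# One local group of 4 data chunks plus its XOR parity L1. Rebuilding a lost
# chunk needs only the 3 surviving group members and L1 (4 reads, not k).
def xor_chunks(chunks):
    out = bytearray(len(chunks[0]))
    for c in chunks:
        for i, b in enumerate(c):
            out[i] ^= b
    return bytes(out)

group = [b"D1..", b"D2..", b"D3..", b"D4.."]   # local group D1-D4
l1 = xor_chunks(group)                          # local parity L1
# D2 is lost: rebuild it from D1, D3, D4 and L1 only.
rebuilt = xor_chunks([group[0], group[2], group[3], l1])
assert rebuilt == group[1]
```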

Erasure Coding: Modern Approaches
Regenerating codes: MDS codes (though not all regenerating codes are MDS) represented as (n, k, d, α, β), which divide chunks into smaller sub-chunks during encoding. They reduce repair bandwidth by reducing the amount of data read from each node. They are further classified into minimum storage regenerating (MSR) codes and minimum bandwidth regenerating (MBR) codes. Highly compute intensive.
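Worked numbers for the MSR point, using the standard regenerating-code formulas α = B/k and β = α/(d - k + 1); the (n, k, d) values are assumptions chosen for illustration:

```python
# Repair bandwidth at the MSR point vs classical RS repair.
B = 1.0                      # normalized file size
n, k, d = 14, 10, 13         # illustrative parameters
alpha = B / k                # per-node storage at the MSR point
beta = alpha / (d - k + 1)   # per-helper download during repair
print(f"classical RS repair reads {k * alpha:.3f} B")   # 1.000 B
print(f"MSR repair reads (d*beta) {d * beta:.3f} B")    # 0.325 B
```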

Erasure Code @ Aricent
Improve storage efficiency and latency: employ the new-generation Clay code [2], which has the least possible storage overhead and the least possible repair bandwidth and disk read. It has shown 3x repair-time reduction and up to 30% and 106% improvement in degraded read and write with CEPH.
Accelerate erasure coding: offload the computation to GPU, accelerate Cauchy RS and the Clay code, and integrate the accelerated erasure code algorithms into CEPH.

Cauchy Reed-Solomon
Cauchy Reed-Solomon uses Cauchy generator matrices, and each GF multiplication is reduced to XOR operations.
Accelerated Cauchy Reed-Solomon: the generator matrix is placed in GPU constant memory, and shared memory is used to optimize access to data in global memory.
[Figure: Cauchy RS erasure code [3]; an (mw x kw) bit generator matrix maps k data chunks, each split into w packets, to m parity chunks]
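A sketch of the bit-matrix trick that reduces the multiplications to XORs, assuming w = 4 and primitive polynomial x^4 + x + 1 (the field width and helper names are illustrative; see [3] for the GPU-optimized version):

```python
# A GF(2^w) multiplication by a constant 'a' is linear over GF(2), so it can
# be expanded into a w x w bit matrix. Splitting each chunk into w packets
# then turns every output packet into a pure XOR of input packets.
W, POLY = 4, 0b10011

def gf_mul(a, b):                    # carry-less multiply mod POLY
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & (1 << W):
            a ^= POLY
        b >>= 1
    return r

def bitmatrix(a):                    # row i, col j = bit i of a * x^j
    cols = [gf_mul(a, 1 << j) for j in range(W)]
    return [[(cols[j] >> i) & 1 for j in range(W)] for i in range(W)]

def mul_packets(a, packets):         # packets: W equal-length byte strings
    out = []
    for row in bitmatrix(a):
        acc = bytearray(len(packets[0]))
        for j, bit in enumerate(row):
            if bit:                  # XOR only: SIMD/GPU friendly
                for t, byte in enumerate(packets[j]):
                    acc[t] ^= byte
        out.append(bytes(acc))
    return out

# Sanity check on a single field element v: matrix action equals gf_mul.
a, v = 7, 9
bits_in = [(v >> j) & 1 for j in range(W)]
bits_out = [sum(b * x for b, x in zip(row, bits_in)) % 2 for row in bitmatrix(a)]
assert sum(b << i for i, b in enumerate(bits_out)) == gf_mul(a, v)
```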

Clay Code: Construction
Consider the (2, 2) encoding; each sub-chunk is represented as a point in a plane. Sub-chunks are classified as coupled (blue dots) and uncoupled (red dots).
Uncoupled sub-chunks are copied as-is, and a Pairwise Reverse Transform (PRT) is applied to the paired sub-chunks to obtain the elements of the uncoupled data cube (the cube on the right). An MDS code is used to fill in the rest of the uncoupled data cube.
From the newly constructed uncoupled data cube, a Pairwise Forward Transform (PFT) is applied to obtain the code chunks. Both PRT and PFT are (2, 2) MDS codes.
[Figure: a (2, 2) Clay code; uncoupled sub-chunks U1-U4, U1*-U4* on planes z = (0,0) through (1,1) and coupled sub-chunks C1-C4, C1*-C4*, connected by PRT + copy, MDS, then PFT + copy]
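A minimal sketch of a pairwise transform pair, assuming a coupling matrix of the form [[1, g], [g, 1]] over GF(2^8) with g^2 ≠ 1; the actual Clay construction [2] fixes the specific pairings and coefficients, so this is an illustration of the PFT/PRT mechanics only:

```python
# Pairwise Forward Transform (PFT) couples a pair (U, U*) into (C, C*);
# the Pairwise Reverse Transform (PRT) is its inverse.
def gmul(a, b, poly=0x11d):          # GF(2^8) carry-less multiply
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

g = 2
det = 1 ^ gmul(g, g)                 # det([[1,g],[g,1]]) = 1 + g^2
det_inv = next(v for v in range(1, 256) if gmul(det, v) == 1)

def pft(u, u_star):                  # uncoupled pair -> coupled pair
    return u ^ gmul(g, u_star), gmul(g, u) ^ u_star

def prt(c, c_star):                  # coupled pair -> uncoupled pair
    u      = gmul(det_inv, c ^ gmul(g, c_star))
    u_star = gmul(det_inv, gmul(g, c) ^ c_star)
    return u, u_star

u, u_star = 0x5a, 0xc3
assert prt(*pft(u, u_star)) == (u, u_star)   # round trip recovers the pair
```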

Clay Code: Decode/Recovery Process
Consider a single-data-node erasure:
The uncoupled data cube is created by applying the PRT and copying the unpaired sub-chunks.
MDS decode is performed on the planes selected for recovery, and the uncoupled sub-chunks are copied.
[Figure: PRT + copy, then MDS decode over sub-chunks C, C*, U, U*]

Clay Code: Decode/Recovery Process (continued)
With the Clay code construction, any two sub-chunks in the set {U, U*, C, C*} can be recovered from the remaining two using the PFT. Here C1* is computed from C1 and U1, and C2* from C2 and U2.
Repair bandwidth is reduced because only 2 (half) of the Z-planes are read during the recovery process.
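Why any two of the four determine the other two (assuming the coupling form sketched earlier, with coefficient g): each coupled pair satisfies

C = U + g·U*,  C* = g·U + U*

so, for example, given C and U one can solve U* = g^(-1)·(C + U) and then C* = g·U + U*; the other pairings work the same way because the coupling matrix is invertible.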

Clay Code: Enhancements
Use of accelerated Cauchy RS: the Clay code performs PFT and PRT via MDS codes through the existing erasure-code infrastructure in CEPH; a version of the accelerated Cauchy RS described earlier is used for PFT and PRT.
Reduced memory operations: the CEPH erasure-code infrastructure involved multiple memory allocations (on both the CPU and GPU sides) and related copies; these were optimized by removing redundant operations.
Optimized memory access and separate GPU kernels for PFT and PRT: the Clay construction uses data copies and various transforms to create intermediate and final results. The complete Clay operations were moved to GPU space, using CUDA primitives for the copy operations, and an optimized, independent (2, 2) erasure-code CUDA implementation is used for PFT and PRT.

Test Environment
Hardware: 16-core Intel(R) Xeon(R) CPU E5-2660 @ 2.20GHz with 64GB RAM; NVIDIA GTX 1080.
Software: CEPH 13.1.0 (Mimic); CUDA 8.0 (driver 384.111); Googletest.
[Figure: test architecture; a Googletest-based EC test fixture with test-case injector and result abridger drives the CEPH/RADOS erasure-code library through its ErasureInterface, whose plugins include Jerasure Reed-Solomon (Vand), CRS, accelerated CRS, Clay code, accelerated Clay code, Intel ISA-L, locally repairable erasure code, and shingled erasure code; OSD and RBD sit on top of RADOS]

Results – Jerasure Performance Gains
Encode performance for various (k, m) values and chunk sizes. [Figure: two charts, execution MB/s]
GPU encode performance improves with larger chunk sizes.

Results – Jerasure Performance Gains
Decode performance for various (k, m) values and chunk sizes. [Figure: two charts, execution MB/s]
REF decode performance drops approximately 35% as the number of erasures increases, while GPU decode drops less than 10%.

Results – Jerasure Performance Gains
Jerasure REF vs GPU performance with 4MB chunk size. [Figure: two charts, execution MB/s]
Decode shows a gain of roughly 5-7x. The encode gain at lower (k, m) values is significant (~16x for (6,4)), but it drops at higher (k, m) values (~7x for (20,10)).

Results – Clay Code Performance Gains
Encode performance for various (k, m) values and chunk sizes. [Figure: two charts, execution MB/s]
GPU encode improves with larger chunk sizes.

Results – Clay Code Performance Gains
Decode performance for various (k, m) values and chunk sizes. [Figure: two charts, execution MB/s]
GPU decode remains consistent as the number of erasures grows.

Results – Clay Code Performance Gains
Clay REF vs GPU performance with 4MB chunk size. [Figure: two charts, execution MB/s]
The decode gain falls at higher (k, m) values, from ~13x for (6,4) to ~7.37x for (12,6). The encode gain increases slowly with (k, m), e.g. from ~11x for (6,4) to ~40x for (20,10).

Summary
Accelerated Cauchy Reed-Solomon and Clay code show good performance gains over the corresponding CPU versions: Jerasure shows improvements of up to 16.03x for encode and 10.6x for decode, and the Clay code shows improvements of up to 41.32x for encode and 13.08x for decode across different (k, m) values.
Ongoing work: enhancing the Clay code on GPU, and testing the Clay code on a CEPH cluster of four server machines, each with a 16-core Intel Xeon CPU E5-2660 @ 2.20GHz, 64GB RAM, an NVIDIA GTX 1080 card, and a 60TB storage array.

Erasure Code: Future Possibilities
Erasure coding use cases:
Application-workload-dependent resiliency
Storage-technology-dependent resiliency
Integration of EC with the file system
Data migration for resiliency optimization

References
[1] Mingyuan Xia, Mohit Saxena, Mario Blaum, and David A. Pease. A Tale of Two Erasure Codes in HDFS. USENIX Conference on File and Storage Technologies, 2015.
[2] Myna Vajha, Vinayak Ramkumar, Bhagyashree Puranik, Ganesh Kini, Elita Lobo, Birenjith Sasidharan, and P. Vijay Kumar (Indian Institute of Science, Bangalore); Alexander Barg and Min Ye (University of Maryland); Srinivasan Narayanamurthy, Syed Hussain, and Siddhartha Nandi. Clay Codes: Moulding MDS Codes to Yield an MSR Code. USENIX Conference on File and Storage Technologies, 2018.
[3] Chengjian Liu, Qiang Wang, Xiaowen Chu, and Yiu-Wing Leung. G-CRS: GPU Accelerated Cauchy Reed-Solomon Coding. IEEE Transactions on Parallel and Distributed Systems, 2018.
[4] Maheswaran Sathiamoorthy, Alexandros G. Dimakis, Megasthenis Asteris, Ramkumar Vadali, Dhruba Borthakur, Dimitris Papailiopoulos, and Scott Chen. XORing Elephants: Novel Erasure Codes for Big Data. Proceedings of the VLDB Endowment, 2013.

Thank you