A Tale of Two Erasure Codes in HDFS

Presentation transcript:

A Tale of Two Erasure Codes in HDFS
Thomas McKenna
March 3, 2016

What are Erasure Codes?
- A technique used to recover lost or corrupt data
- FEC (Forward Error Correction): parity used to recover data (e.g. RAID 5)
- Uses less space than replication
Notes: Erasure codes recover data when only part of it is lost. FEC allows the distributed file system to recreate lost data without recomputing it.

How Parity Works
- Each set of data blocks produces a parity block
- A lost block can be calculated from the other blocks and the parity
- Every surviving block must be accessed to do so
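To make the parity idea concrete, here is a minimal sketch of single-parity (RAID-5 style) coding in Python; the block contents are invented for illustration:

```python
# Minimal XOR-parity sketch (RAID-5 style): one parity block per stripe.
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # a stripe of three data blocks
parity = xor_blocks(data)            # encode: parity = D0 ^ D1 ^ D2

# Lose block 1, then rebuild it: every surviving block plus the parity is read.
survivors = [data[0], data[2], parity]
recovered = xor_blocks(survivors)
assert recovered == data[1]
```

Recovering one block touches every other block in the stripe, which is why reconstruction cost matters in the rest of the talk.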

Terminology
- Skew, Hot, Cold
  - Hot: data that is accessed frequently; Cold: data that is rarely accessed
  - Skew describes the fact that most data is cold
- Degraded Reads vs. Reconstruction (two different things)
  - Degraded read: a read of a lost block that must first be reconstructed on the fly; ideally fast for hot blocks
  - Reconstruction: periodic background rebuilding of lost blocks using parity
- Encoding, Decoding, Upcoding, Downcoding
  - Encode: code the data so it is protected
  - Decode: use the code to repair lost data
  - Upcode: convert from a high-overhead (fast-reconstruction) code to a low-overhead one
  - Downcode: convert back to a high-overhead code with fast reconstruction
- PC, RS, LRC
  - PC: Product Codes
  - RS: Reed-Solomon
  - LRC: Local Reconstruction Codes

Motivation and Observation
- Lots of skew: a high ratio of cold to hot data
- Want fast recovery of hot data
- Want very low storage overhead
- 98% of failures involve a single block; 1.87% involve two blocks
- Erasure codes allow a tradeoff between storage overhead and reconstruction time

Current Techniques
- Replication (3x) is popular
  - Fast reconstruction
  - Requires a lot of storage overhead; 3x storage is not scalable
  - Its mean time to failure is lower than that of the erasure codes discussed next
- Single erasure code systems
  - Must be tuned for either low overhead or fast reconstruction, not both
  - Inefficient on the mostly cold data we see, compared with a dual-code system
  - Low overhead means more data blocks per parity block, so reconstruction must fetch all of the data blocks, causing many network transfers
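A rough back-of-the-envelope comparison of that tradeoff; the (k, m) parameters below are illustrative, not taken from the paper:

```python
# Storage overhead vs. repair cost for replication and an RS-style (k data, m parity) code.
def storage_overhead(k, m):
    return (k + m) / k        # blocks stored per block of user data

def repair_reads(k):
    return k                  # an MDS code like RS reads k blocks to rebuild one lost block

# 3-way replication: 3.0x storage, 1 read to repair a lost block.
print(storage_overhead(1, 2), 1)
# Illustrative RS(10, 4): 1.4x storage, but 10 block reads (network transfers) per repair.
print(storage_overhead(10, 4), repair_reads(10))
```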

Solution: HACFS
- Adaptively code the data
  - Cold data gets low overhead (at the cost of higher reconstruction time)
  - Hot data gets fast reconstruction (at the cost of higher overhead)
- Limit the total storage overhead ratio
  - Replication: ~3x; single EC: ~1.5x; dual EC (HACFS): ~1.5x
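A sketch of the bounded-overhead arithmetic; the overhead numbers below are assumptions for illustration, not HACFS's actual code parameters:

```python
# How much data can stay in the fast (high-overhead) code while the
# system-wide overhead stays under a fixed bound.
FAST_OVERHEAD    = 1.8   # assumed overhead of the fast-reconstruction code
COMPACT_OVERHEAD = 1.4   # assumed overhead of the compact code
OVERHEAD_BOUND   = 1.5   # assumed system-wide target

def blended_overhead(frac_fast):
    return frac_fast * FAST_OVERHEAD + (1 - frac_fast) * COMPACT_OVERHEAD

def max_fast_fraction():
    return (OVERHEAD_BOUND - COMPACT_OVERHEAD) / (FAST_OVERHEAD - COMPACT_OVERHEAD)

print(max_fast_fraction())       # 0.25: at most a quarter of the data can be fast-coded
print(blended_overhead(0.25))    # 1.5, right at the bound
```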

Construction of HACFS
- Built on the HDFS-RAID module (HDFS: Hadoop Distributed File System)
- Keeps per-file state: file size, modification time, read count, current encoding
- Dynamically up/down-codes between coding schemes
- Relies on HDFS placement so that no two blocks of a stripe land on the same disk
  - A code with only enough parity to survive one block loss could otherwise lose a whole stripe to a single disk failure; in practice, erasure codes can be tuned to tolerate an arbitrary number of failures
- Encoding and conversion are run as MapReduce jobs, which keeps the system scalable

How Up/Down Coding Works
- The system uses the per-file state to determine the best coding
- The amount of fast-coded data is bounded by the total overhead ratio
- Periodically upcodes cold data to the compact code and downcodes hot data to the fast code
Notes: When data enters the system it is 3-way replicated for durability and fast degraded reads. When the system detects that a file has become write-cold it erasure-codes it and, unlike a single-EC system, checks whether its reads are hot or cold to pick between the two codes.
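A hypothetical sketch of that decision loop; the state fields come from the previous slide, but the function, thresholds, and names are illustrative assumptions, not HACFS's actual implementation:

```python
# Hypothetical up/down-coding policy sketch (thresholds and names are made up).
from dataclasses import dataclass
import time

@dataclass
class FileState:
    size: int
    mtime: float       # last modification time
    read_count: int
    encoding: str      # "replicated", "fast", or "compact"

WRITE_COLD_AGE = 3600  # seconds since last write before the file is erasure-coded
HOT_READS = 100        # read count above which a file is treated as hot

def choose_encoding(state: FileState, fast_budget_left: bool) -> str:
    if time.time() - state.mtime < WRITE_COLD_AGE:
        return "replicated"    # still write-hot: keep 3 replicas
    if state.read_count >= HOT_READS and fast_budget_left:
        return "fast"          # read-hot: downcode to the fast-reconstruction code
    return "compact"           # read-cold: upcode to the low-overhead code
```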

Mechanics of Up/Down Coding
- Product Codes (PC, left figure)
  - Only needs to access the parity blocks to construct the new parity scheme
  - Requires only XOR operations
  - No data blocks transmitted (compared with 6-10 data blocks for a reconstruction)
- LRC (right figure)
  - Local parity can be constructed relatively cheaply (about 6 reads)
  - Global parity would require fetching all of the data blocks, so code collapsing is used instead: LRCfast -> LRCcomp
  - Most failure modes only require reading the local parity
  - Reconstruction is fast because of the smaller Galois field compared with RS
- Others?
  - The paper states that other modern EC schemes can be up/down-coded efficiently, but does not show how; it offers little beyond "we have done it"
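A small sketch of why product-code upcoding can touch only parity blocks; this uses plain XOR parities and an invented two-stripe layout, so it illustrates the linearity argument rather than the paper's exact construction:

```python
# With XOR parities, the parity of a merged stripe is the XOR of the old
# parities, so upcoding reads no data blocks at all.
from functools import reduce

def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

stripe_a = [b"AA", b"BB", b"CC"]
stripe_b = [b"DD", b"EE", b"FF"]
parity_a, parity_b = xor_blocks(stripe_a), xor_blocks(stripe_b)

# Upcode: merge the two 3-block stripes into one 6-block stripe with one parity.
merged_parity = xor_blocks([parity_a, parity_b])
assert merged_parity == xor_blocks(stripe_a + stripe_b)   # identical, zero data reads
```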

When Data Is Lost
- On a client read (degraded read)
  - The client fetches surviving data and parity blocks
  - Reconstructs the lost data and returns it from the read
  - Does not commit the reconstructed data, because data loss is usually transient (e.g. a node reboot)
- Periodic job
  - Same as the client path, but stores the reconstructed data
  - Prioritizes hot data to reduce the number of degraded reads during the recovery period

Testing Setup
- Uses load data from Facebook (production, real-world workloads)
- Network- and IO-sensitive MapReduce job
- Fault injection: simulates failure of a block, a disk, and a node
Notes: Presumably the Facebook workloads used in this test match those used in the development of LRC. The fault injection gives fairly full coverage.

Performance
- Bounded storage overhead
- On average, better degraded-read performance because hot data is reconstructed quickly
Notes: Bounded overhead is an important consideration. In every case HACFS has more overhead than Azure's LRC, yet it beats Azure disproportionately in reconstruction time and latency. The charts show percent improvement of HACFS over the other systems; more positive is better.

More Performance (Read Latency)
- HACFS latency is better at any given overhead
- The headline claims are relative to RS
- Red points represent a higher percentage of reads going to hot data, which is encoded with the fast code
Notes: The most striking performance gains claimed are with respect to RS coding, and LRC is already known to far outperform RS.

More Performance (Reconstruction)
- Note about LRC: the cost of parity recovery is larger than that of data recovery, and global reconstruction is more expensive than in Reed-Solomon
- Both the PC and LRC versions of HACFS reconstruct data faster than RS

More Performance (Costs)
- Construction costs are similar to LRC
- CC3, with a high percentage of hot data and hot reads, causes churn and a lot of conversion
- Coding cost is not a performance metric anyone cares much about, so simply matching the other systems is good enough
- In the chart, the black 'c' bars are conversion and the white 'e' bars are encoding

Related Work
- Some systems have optimal overhead but poor recovery time (e.g. RS-based systems)
- Some systems achieve fast recovery at the cost of encoding time
- Tiered storage systems

Thoughts
- It would have been nice to see:
  - Plain PC and LRC compared against HACFS-PC and HACFS-LRC
  - Latency vs. overhead curves instead of single points
- A lot of things in the paper needed more explanation and clarity