Understanding Real World Data Corruptions in Cloud Systems

Understanding Real World Data Corruptions in Cloud Systems
Peipei Wang, Daniel Dean, Xiaohui Gu
North Carolina State University
Hi everyone, my name is Peipei. Today I am going to present our paper on understanding real-world data corruptions in cloud systems.

Motivation
Let's first look at several real data corruption events from recent years. For example, Facebook temporarily lost more than 10% of its photos in a hard drive failure. Most people think their data stored in the cloud is safe, but that is not always the case. And while people assume that when data corruption happens it must be due to hardware problems, it can also be caused by software problems. It is important to understand the causes of data corruption.

HDFS Background
[Diagram: 1. The client issues an HDFS write operation. 2. The NameNode logs the operation and the block location in the HDFS system files. 3. The NameNode returns three block locations. DataNodes A, B, and C each store the block on disk together with its block metadata: block size, checksum, timestamp, version.]
Data corruption can happen at any level of storage and with any type of media; it can happen anywhere within a storage environment. Data can be corrupted simply by migrating it to a different platform. Media-level causes include bit rot, controller failures, deduplication metadata errors, and tape failures. Metadata corruption can be hardware-induced or caused by software glitches.

HDFS Background
[Same write-path diagram, annotated with where corruption can arise:]
Errors in logging can corrupt HDFS system files.
Pipeline and network failures can corrupt blocks.
Changing or updating a block without updating its block metadata corrupts the metadata file.
A race condition on block metadata leaves it unclear which copy is correct.
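The checksum in the block metadata is what lets a reader notice that the stored bytes no longer match what was written. The sketch below is a minimal illustration of that idea, not Hadoop's actual implementation; the class and method names are invented for this example.

```java
import java.util.zip.CRC32;

// Minimal sketch: a block carries a checksum in its metadata, and
// re-computing the checksum on read reveals bit-level corruption.
public class BlockChecksum {
    // Compute a CRC32 checksum over the block's bytes, as block
    // metadata would record alongside the data.
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
    }

    // Verify a block against the checksum recorded in its metadata.
    static boolean isIntact(byte[] block, long recorded) {
        return checksum(block) == recorded;
    }
}
```

Note that this only detects corruption of the data; if the metadata file itself is corrupted or stale, as in the race condition above, the comparison can be misleading.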

Methodology
Randomly sampled 138 Hadoop bug incidents related to data corruption. All incidents are resolved bug incidents. We manually studied each bug report (e.g., bug descriptions, patches).

System Name         System file corruption   Metadata corruption   Block corruption   Misreported corruption
Hadoop 1.x          15                       11                    46                 4
Hadoop 2.x (YARN)   1                        7
HDFS 1.x            17                       23
HDFS 2.x            8                        22                    10

The study covered both the 1.x and 2.x versions of Hadoop. HDFS is more likely to be involved in data corruption than other Hadoop components.

Outline
State of the art
Research goals
Data corruption impact
Data corruption detection
Data corruption causes
Data corruption handling
Key findings
Future work
Conclusion
(Speaker note: don't need to read all points.)

State of the Art
Data corruption studies [Zhang et al. FAST`10, Schroeder et al. FAST`07]: focused on hardware-induced data corruption problems.
Data corruption detection frameworks [Yang et al. OSDI`06, Subramanian et al. ICDE`10]: reactive approaches for stand-alone systems (e.g., file systems).
Bug characteristic studies [Jin et al. PLDI`12, Lu et al. ASPLOS`08]: focused on software bugs (e.g., performance bugs, concurrency bugs).

Research Goals
Understand real-world software-induced data corruptions:
What impact can data corruption have on the application and system?
How is data corruption detected?
What are the causes of data corruption?
What problems can occur while attempting to handle data corruption?

Data Corruption Impact on System
Integrity: block, metadata, system file
Availability: Hadoop failures, MapReduce job failures
Performance: time delay, decreased throughput

Data Corruption Impact Examples
HDFS-3277: fsimage load failure. An fsimage file contains the complete state of the file system at a point in time; when the on-disk fsimage does not match the in-memory file system state on the NameNode, loading fails, leading to Hadoop failures, job failures, and time delays.
HDFS-2798: a thread cannot complete a file operation because block appending and the block scanner disagree about a block and its block metadata on disk.

Data Corruption Detection
[Chart of detection outcomes: 42%: misreported, 21%: silent, 12%: misreported, 25%: correct]
Existing data corruption detection schemes are insufficient.

Data Corruption Detection Example
HDFS-1483: silent data corruption. The client calls getBlockLocations(), and the NameNode returns the block locations on DataNodes A, B, and C even though the replica on DataNode A is corrupted. The client does not know about the block corruption.
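The HDFS-1483 symptom can be sketched as a lookup table that hands out every replica location, including one already known to be corrupt. The names below (ReplicaMap, markCorrupt, the unchecked/checked lookup pair) are illustrative, not Hadoop's real API; this is a toy model of the bug and its fix.

```java
import java.util.*;

// Toy model: the NameNode-side map from blocks to replica locations.
public class ReplicaMap {
    private final Map<String, List<String>> locations = new HashMap<>();
    private final Set<String> corruptReplicas = new HashSet<>();

    void addReplica(String block, String dataNode) {
        locations.computeIfAbsent(block, k -> new ArrayList<>()).add(dataNode);
    }

    void markCorrupt(String dataNode) {
        corruptReplicas.add(dataNode);
    }

    // Buggy behavior: return all locations, corrupt or not, so the
    // client may silently read a bad replica.
    List<String> getBlockLocationsUnchecked(String block) {
        return locations.getOrDefault(block, List.of());
    }

    // Fixed behavior: filter out replicas already known to be corrupt.
    List<String> getBlockLocations(String block) {
        List<String> good = new ArrayList<>();
        for (String dn : getBlockLocationsUnchecked(block)) {
            if (!corruptReplicas.contains(dn)) good.add(dn);
        }
        return good;
    }
}
```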

Data Corruption Detection Example
HDFS-1524: misreported data corruption. Four bytes of compression-related information were left unread when the NameNode handled the compressed fsimage, so the fsimage file on disk is incomplete, and loading it fails because of the incomplete fsimage file.
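A misreported "corruption" of this flavor is easy to reproduce in plain Java: if a compressed image is written without finishing the compression stream, the trailing compression bytes never reach disk, and every later load fails even though no stored bits actually rotted. This is a sketch inspired by the incident, not HDFS's fsimage code; the class name is invented.

```java
import java.io.*;
import java.util.zip.*;

// Sketch: an incomplete compressed image looks "corrupt" on load.
public class CompressedImage {
    // Correct: close() finishes the deflater and writes the trailer.
    static byte[] writeComplete(byte[] payload) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(payload);
        }
        return buf.toByteArray();
    }

    // Buggy: flush() without close() leaves trailing bytes unwritten,
    // so the image on "disk" is incomplete.
    static byte[] writeTruncated(byte[] payload) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(buf);
        gz.write(payload);
        gz.flush();
        return buf.toByteArray();
    }

    // Loading a complete image succeeds; a truncated one throws.
    static byte[] load(byte[] image) throws IOException {
        try (GZIPInputStream gz =
                new GZIPInputStream(new ByteArrayInputStream(image))) {
            return gz.readAllBytes();
        }
    }
}
```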

Data Corruption Causes

Cause                               Number of incidents
Improper runtime checking           25
Race condition                      26
Inconsistent state                  16
Improper network failure handling   5
Improper node crash handling        10
Incorrect name/value
Lib/command errors                  4
Compression-related errors
Incorrect data movement             2

Data Corruption Causes Example
HDFS-3626: improper runtime check given an invalid file path.
Command with an invalid path (note the double slash):
hadoop fs -put filename hdfs://localhost:8020//temp/filename
The resulting operations recorded in edits.log (mkdir with path=/ and path=//temp, add block, set timestamp, update block, …) include an illegal operation, so Hadoop later fails to load edits.log.
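The lesson of HDFS-3626 is that a path should be validated or normalized before the operation is recorded in the edit log, so an invalid path like "//temp" can never be replayed. The sketch below illustrates that guard; the class and method names are invented for this example and are not HDFS's real API.

```java
import java.util.*;

// Sketch: validate/normalize paths at the boundary, before logging.
public class EditLogGuard {
    // Collapse repeated slashes: "//temp/file" -> "/temp/file".
    static String normalize(String path) {
        return path.replaceAll("/+", "/");
    }

    static boolean isValidPath(String path) {
        return !path.isEmpty() && path.startsWith("/") && !path.contains("//");
    }

    private final List<String> editLog = new ArrayList<>();

    // Only a normalized, validated path reaches the edit log, so
    // replaying the log cannot hit an illegal operation.
    void logMkdir(String path) {
        String p = normalize(path);
        if (!isValidPath(p)) {
            throw new IllegalArgumentException("invalid path: " + path);
        }
        editLog.add("MKDIR " + p);
    }

    List<String> log() {
        return editLog;
    }
}
```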

Data Corruption Causes Example
HADOOP-3069: improper network failure handling.

    void getFileServer(OutputStream outstream, ...) {
      try {
        ...
        outstream.write(buf, 0, num);
      } finally {
        outstream.close();
      }
    }

    try {
      ...
      TransferFsImage.getFileServer(response.getOutputStream(), nn.getFsImageName());
    } catch (IOException e) {
      response.sendError(...);
    }

[Diagram shows these fragments on the SecondaryNameNode and NameNode.] Because the stream is closed in the finally block even when the transfer failed, the error message cannot be sent out, and the NameNode will never know the file is corrupted.

Existing Data Corruption Handling Schemes
Data recovery
Data replication
Data deletion
Simple re-execution

Problems in Data Corruption Handling Schemes
HDFS-4799: incorrect data deletion. [Diagram: a NameNode reboot, with replicas spread across DataNodes A through F.]
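The danger behind an incident like HDFS-4799 is acting on an incomplete view: right after a restart, the NameNode has not yet heard from every DataNode, so replicas can look excess or unknown and get deleted even when they are the only good copies. One common defensive pattern, sketched below with invented names and an assumed 90% threshold, is to defer any deletion until enough nodes have reported.

```java
import java.util.*;

// Sketch: defer replica deletion until the post-restart view is
// complete enough. Threshold (90%) and names are illustrative.
public class ReplicationManager {
    private final int totalDataNodes;
    private final Set<String> reported = new HashSet<>();

    ReplicationManager(int totalDataNodes) {
        this.totalDataNodes = totalDataNodes;
    }

    // Record that a DataNode has sent its block report after restart.
    void receiveBlockReport(String dataNode) {
        reported.add(dataNode);
    }

    // Only act on apparent over-replication once at least 90% of the
    // DataNodes have reported; before that, "excess" may be an illusion.
    boolean safeToDeleteExcess() {
        return reported.size() * 10 >= totalDataNodes * 9;
    }
}
```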

Key Findings
The impact of data corruption is not limited to data integrity.
Existing data corruption detection schemes are quite insufficient.
There are various causes of data corruption.
Existing data corruption handling mechanisms make frequent mistakes.

Future Work
Data corruption detection schemes: trace data-related operations and run anomaly detection over the operation logs. Advantage: this approach is proactive.

Conclusion
Characteristic study of 138 real-world data corruption incidents:
Software-induced data corruptions are prevalent.
Data corruption detection schemes need to be improved.
Replication cannot completely solve data corruption problems.
Data corruption handling schemes may introduce other issues (e.g., mistaken block deletion, resource hogging).
Thank you!