Understanding the Effects and Implications of Compute Node Failures in Hadoop. Florin Dinu, T. S. Eugene Ng.

Presentation transcript:


Computing in the Big Data Era
15PB, 20PB, 100PB, 120PB
Big Data – challenging for previous systems
Big Data Frameworks – Google – – Yahoo & Facebook

Hadoop Is Widely Used
Image processing, protein sequencing, web indexing, machine learning, advertising analytics, log storage and analysis, and many more.

Building Around Hadoop
SIGMOD 2010

Building On Top Of Hadoop
Building on core Hadoop functionality

The Danger of Compute-Node Failures
“In each cluster’s first year, it’s typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur” (Jeff Dean, Google I/O 2008)
“Average worker deaths per job: 5.0” (Jeff Dean, Keynote I, PACT 2006)
Causes: large-scale use of commodity components

The Danger of Compute-Node Failures
In the cloud, compute node failures are the norm, NOT the exception (Amazon, SOSP 2009)

Failures From Hadoop’s Point of View
It is important to understand the effect of compute-node failures on Hadoop.
Situations indistinguishable from compute node failures:
– Switch failures
– Longer-term disconnectivity
– Unplanned reboots
– Maintenance work (upgrades)
– Quota limits
Challenging environments:
– Spot markets (price-driven availability)
– Volunteer computing systems
– Virtualized environments

The Problem
Hadoop is widely used and compute node failures are common, so Hadoop needs to be failure resilient.
Hadoop needs to be failure resilient in an efficient way:
– Minimize impact on job running times
– Minimize resources needed

Contribution
First in-depth analysis of the impact of failures on Hadoop
– Uncovers several inefficiencies
Potential for future work
– Immediate practical relevance
– Basis for realistic modeling of Hadoop

Quick Hadoop Background

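Background – the Tasks
[Diagram: the master (MGR) runs the JobTracker and NameNode; each worker runs a DataNode and a TaskTracker that hosts map tasks (M) and reducer tasks (R). TaskTrackers ask the master for work ("Give me work!", "More work?"). The example shows 2 waves of M and 2 waves of R.]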

Background – Data Flow
HDFS -> Map Tasks -> Shuffle -> Reducer Tasks -> HDFS

Background – Speculative Execution
0 <= Progress Score <= 1
Progress Rate = Progress Score / time (e.g. 0.05/sec)
Ideal case: similar progress rates

Background – Speculative Execution (SE)
Reality: varying progress rates!
Goal of SE: detect underperforming nodes and duplicate the computation
Reasons for underperforming tasks: node overload, network congestion, etc.
Underperforming tasks (outliers) in Hadoop: more than 1 STD slower than the mean progress rate
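The outlier rule on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Hadoop's actual code: the helper names `progress_rate` and `find_outliers`, the sample task ids and numbers, and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev

def progress_rate(progress_score, elapsed_seconds):
    """Progress rate = progress score / time, with 0 <= score <= 1."""
    if not 0.0 <= progress_score <= 1.0:
        raise ValueError("progress score must be in [0, 1]")
    return progress_score / elapsed_seconds

def find_outliers(rates):
    """Return ids of tasks more than 1 STD slower than the mean rate."""
    values = list(rates.values())
    threshold = mean(values) - pstdev(values)  # assumed: population STD
    return sorted(task for task, rate in rates.items() if rate < threshold)

# Hypothetical snapshot taken 10 seconds into the map phase.
rates = {
    "M1": progress_rate(0.50, 10),  # 0.05/sec, as in the slide's example
    "M2": progress_rate(0.50, 10),
    "M3": progress_rate(0.48, 10),
    "M4": progress_rate(0.10, 10),  # underperforming task
}
outliers = find_outliers(rates)
print(outliers)
```

With these sample values only M4 falls below the mean minus one standard deviation, so it is the only candidate for speculative execution.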

How does Hadoop detect failures?

Failures of the Distributed Processes
Timeouts, Heartbeats & Periodic Checks
[Diagram: the DataNode and TaskTracker on each worker send heartbeats to the master (MGR)]

Timeouts, Heartbeats & Periodic Checks
Conservative approach – last line of defense
– A failure interrupts the heartbeat stream
– The master periodically checks for changes
– Failure is declared after a number of checks ("AHA! It failed")
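The heartbeat-and-periodic-check scheme described on this slide can be sketched as follows. This is a simplified model under assumptions, not Hadoop's implementation: the class `HeartbeatMonitor`, its method names, the node ids, and the threshold of 3 silent checks are all hypothetical.

```python
class HeartbeatMonitor:
    """Declares a node failed after several checks with no new heartbeat."""

    def __init__(self, checks_before_failure):
        self.checks_before_failure = checks_before_failure  # assumed knob
        self.heartbeats_since_check = {}
        self.silent_checks = {}
        self.failed = set()

    def heartbeat(self, node):
        """Record one heartbeat from a node."""
        count = self.heartbeats_since_check.get(node, 0)
        self.heartbeats_since_check[node] = count + 1
        self.silent_checks.setdefault(node, 0)

    def periodic_check(self):
        """Look for changes since the last check; return newly failed nodes."""
        newly_failed = []
        for node in self.silent_checks:
            if node in self.failed:
                continue
            if self.heartbeats_since_check.get(node, 0) > 0:
                self.silent_checks[node] = 0  # node is alive, reset counter
            else:
                self.silent_checks[node] += 1
                if self.silent_checks[node] >= self.checks_before_failure:
                    self.failed.add(node)
                    newly_failed.append(node)
            self.heartbeats_since_check[node] = 0
        return newly_failed

monitor = HeartbeatMonitor(checks_before_failure=3)
detected_at = None
for check_round in range(1, 7):
    monitor.heartbeat("tt1")
    if check_round <= 2:
        monitor.heartbeat("tt2")  # tt2 fails after round 2
    if monitor.periodic_check() and detected_at is None:
        detected_at = check_round
print(detected_at, monitor.failed)
```

In this run the failed node goes silent from round 3 on but is only declared failed at round 5, which illustrates why the approach is conservative: detection latency grows with both the check period and the number of checks required.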

Failures of the Individual Tasks (Maps)
Infer map failures from notifications
Conservative – does not react to temporary failures
[Diagram: a reducer (R) asks a map task (M) "Give me data!"; after repeated attempts spaced Δt apart go unanswered, R notifies the master (MGR): "M does not answer!!"]
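A rough sketch of how a master could infer a lost map output from reducer notifications is shown below. It is an illustration only: the class `MapOutputTracker`, its method `notify`, and the default threshold of 3 notifications are assumptions for the example, not Hadoop's actual interface or configuration.

```python
from collections import Counter

class MapOutputTracker:
    """Counts 'M does not answer' notifications per map task."""

    def __init__(self, notifications_needed=3):
        # Assumed threshold: waiting for several notifications is what
        # keeps the mechanism from reacting to temporary failures.
        self.notifications_needed = notifications_needed
        self.counts = Counter()
        self.lost = set()

    def notify(self, reducer_id, map_id):
        """Record one notification; reducer_id is kept only for clarity.

        Returns True exactly once, when the map output is declared lost.
        """
        if map_id in self.lost:
            return False
        self.counts[map_id] += 1
        if self.counts[map_id] >= self.notifications_needed:
            self.lost.add(map_id)
            return True
        return False

tracker = MapOutputTracker(notifications_needed=3)
results = [tracker.notify("R1", "M1") for _ in range(3)]
print(results)
```

Only the third notification flips the map output to lost, so a single reducer that retries slowly can take a long time to trigger re-execution of the map task.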

Failures of the Individual Tasks (Reducers)
Notifications also help infer reducer failures
– R complains too much? (failed / succ. attempts)
– R stalled for too long? (no new succ. attempts)
[Diagram: reducers ask map tasks "Give me data!"; a reducer whose request fails (X) notifies the master (MGR): "M does not answer!!"]

Do these mechanisms work well?

Methodology
Focus on failures of distributed components (TaskTracker and DataNode)
Inject these failures separately
Single failures
– Enough to catch many shortcomings
– Identified mechanisms responsible
– Relevant to multiple failures too

Mechanisms Under Task Tracker Failure?
Setup: OpenCirrus, Sort 10GB, 15 nodes, 14 reducers; failure injected at a random time; 220s running time without failures
LARGE, VARIABLE, UNPREDICTABLE job running times
Poor performance under failure
Findings also relevant to larger jobs

Clustering Results Based on Cause
Few reducers impacted: notification mechanism ineffective, timeouts fire
70% of cases – notification mechanism ineffective
Failure has no impact (not due to notifications)

Clustering Results Based on Cause
More reducers impacted: notification mechanism detects failure, timeouts do not fire
Notification mechanism detects failure in:
– Few cases
– Specific moment in the job

Side Effects: Induced Reducer Death
R complains too much? (failed / total attempts), e.g. 3 out of 3 failed
Unlucky reducers die early
Failures propagate to healthy tasks
Negative effects:
– Time and resource waste for re-execution
– Job failure: a small number of runs fail completely

Side Effects: Induced Reducer Death
R stalled for too long? (no new succ. attempts)
All reducers may eventually die
Fundamental problem:
– Inferring task failures from connection failures
– Connection failures have many possible causes
– Hadoop has no way to distinguish the cause (src? dst?)
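The two reducer checks ("complains too much" and "stalled for too long") and the way they can kill a healthy reducer can be sketched like this. Everything here is an assumption made for illustration: the names `ShuffleStats` and `reducer_judged_faulty`, the thresholds (more than half of attempts failed, 300 seconds without a success), and the choice to treat either condition alone as sufficient.

```python
from dataclasses import dataclass

@dataclass
class ShuffleStats:
    """Per-reducer record of attempts to fetch map outputs."""
    failed: int = 0
    succeeded: int = 0
    last_success: float = 0.0

    def record(self, ok, now):
        if ok:
            self.succeeded += 1
            self.last_success = now
        else:
            self.failed += 1

def reducer_judged_faulty(stats, now, max_failed_fraction=0.5, max_stall=300.0):
    """Apply both checks; thresholds are hypothetical."""
    total = stats.failed + stats.succeeded
    complains_too_much = total > 0 and stats.failed / total > max_failed_fraction
    stalled_too_long = now - stats.last_success > max_stall
    return complains_too_much or stalled_too_long

# A healthy reducer whose first three fetches all target the failed node.
unlucky = ShuffleStats()
for t in (1.0, 2.0, 3.0):
    unlucky.record(ok=False, now=t)

# A healthy reducer that fetched three outputs before hitting the failed node.
lucky = ShuffleStats()
for t in (1.0, 2.0, 3.0):
    lucky.record(ok=True, now=t)
lucky.record(ok=False, now=4.0)

print(reducer_judged_faulty(unlucky, now=5.0))   # 3 out of 3 failed
print(reducer_judged_faulty(lucky, now=5.0))     # 1 out of 4 failed
print(reducer_judged_faulty(lucky, now=400.0))   # no new success for a long time
```

Neither reducer is broken; the connection failures originate at the source node. The checks cannot see that, which is the point made on the slide: the unlucky reducer is judged faulty early, and the lucky one is judged faulty later if it keeps waiting on the failed node.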

More Reducers: 4/Node = 56 Total
[Figure: CDF of job running time]
Job running time spread out even more
More reducers = more chances for the explained effects

Effect of DataNode Failures

Timeouts When Writing Data
Write Timeout (WTO)

Timeouts When Writing Data
Connect Timeout (CTO)

Effect on Speculative Execution
Outliers in Hadoop: more than 1 STD slower than the mean progress rate
[Figure: progress rates (low PR, high PR, very high PR) shown against AVG and AVG – 1*STD, with outliers marked]

Delayed Speculative Execution
50s: waiting for mappers
100s: map outputs read
[…]s: reducers write output
Outlier threshold: Avg(PR) - STD(PR)

Delayed Speculative Execution
[…]s: failure occurs; reducers time out (WTO); R9 is speculatively executed
> 200s: the new R9 skews the stats; R11's progress rate is very low, but not low enough
400s: R11 is finally speculatively executed (its progress rate is finally low enough)

Delayed Speculative Execution
Hadoop’s assumptions about progress rates invalidated
Stats skewed by very fast speculated task
Significant impact on job running time
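A small worked example shows how one very fast speculated task can skew the statistics. The numbers below are invented for illustration and do not come from the experiments; the helper `outlier_threshold` and the use of the population standard deviation are also assumptions.

```python
from statistics import mean, pstdev

def outlier_threshold(rates):
    """Tasks slower than mean - 1 STD are treated as outliers."""
    return mean(rates) - pstdev(rates)

# Hypothetical progress rates (progress score per second).
normal_reducers = [0.0040, 0.0042, 0.0038, 0.0041, 0.0039]
stuck_reducer = 0.0025      # e.g. a reducer waiting in a write timeout
fast_speculated = 0.05      # new copy that reads already-available map outputs

before = outlier_threshold(normal_reducers + [stuck_reducer])
after = outlier_threshold(normal_reducers + [stuck_reducer, fast_speculated])

flagged_before = stuck_reducer < before
flagged_after = stuck_reducer < after
print(before, flagged_before)
print(after, flagged_after)
```

Before the fast task appears, the stuck reducer is below the threshold and would be speculated. Once the fast task is included, the standard deviation grows so much that the threshold drops below zero and no task can be flagged. The stuck reducer then has to wait until its own rate decays further or the fast task leaves the statistics, which matches the delay described on the slide.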

52 reducers – 1 Wave
Reducers stuck in WTO
Delayed speculative execution
CTO after WTO
Reconnect to failed DataNode

Delayed SE – A General Problem
Failures and timeouts are not the only cause
Two ingredients are needed to suffer from delayed SE:
Slow tasks that benefit from SE
– Shown here: tasks stuck in a WTO
– Other: slow or heterogeneous nodes, slow transfers (heterogeneous networks)
Fast advancing tasks
– Shown here: varying data input availability
– Other: varying task input size, varying network speed
Statistical SE algorithms need to be used carefully

Conclusion - Inefficiencies Under Failures
Task Tracker failures
– Large, variable and unpredictable job running times
– Variable efficiency depending on reducer number
– Failures propagate to healthy tasks
– Success of TCP connections not enough
Data Node failures
– Delayed speculative execution
– No sharing of potential failure information (details in paper)

Ways Forward
Provide dynamic info about infrastructure to applications (at least in the private DCs)
Make speculative execution cause aware
– Why is a task slow at runtime?
– Move beyond statistical SE algorithms
– Estimate PR of tasks (use environment, data characteristics)
Share some information between tasks
– In Hadoop, tasks rediscover failures individually
– Lots of work on SE decisions (when, where to SE)
– These decisions can be invalidated by such runtime inefficiencies

Thank you

Backup slides

Experiment: Results
Large variability in job running times
[Figure: runs clustered into groups G1 through G7]

Group G1 – few reducers impacted
Slow recovery when few reducers impacted
M1 copied by all reducers before failure
After failure R1_1 cannot access M1
R1_1 needs to send 3 notifications (~1250s)
Task Tracker declared dead after […]s

Group G2 – timing of failure
Timing of failure relative to Job Tracker checks impacts job running time
200s difference between G1 and G2
[Timeline figure: G1 and G2 on a time axis with marks at 170s, 200s and 600s, and job end]

Group G3 – early notifications
Early notifications increase job running time variability
G1: notifications sent after 416s
G3: early notifications => map outputs declared lost
Causes:
– Code-level race conditions
– Timing of a reducer’s shuffle attempts
[Diagram: order of reducer R2's shuffle attempts on the outputs of M5 and M6]

Group G4 & G5 – many reducers impacted
Job running time under failure varies with the number of reducers impacted
G4: many reducers send notifications after 416s; the map output is declared lost before the Task Tracker is declared dead
G5: same as G4, but early notifications are sent

Task Tracker Failures
Few reducers impacted: not enough notifications, timeouts fire
Many reducers impacted: enough notifications sent, timeouts do not fire
LARGE, VARIABLE, UNPREDICTABLE job running times
Efficiency varies with number of affected reducers

Node Failures: No RST Packets
No RST -> No Notifications -> Timeouts always fire
[Figure: CDF]

Not Sharing Failure Information
Different SE algorithm (OSDI 08)
Tasks are speculatively executed even before the failure; delayed SE is not the cause
Both the initial and the speculative task connect to the failed node
No sharing of potential failure information

Delayed Speculative Execution
Outlier: avg(PR(all)) – std(PR(all)) > PR(t)
Stats skewed by very fast speculative tasks
Hadoop’s assumptions about progress rates invalidated

Delayed Speculative Execution
Timeline:
~50s: reducers wait for map outputs
~100s: reducers get map outputs
~200s: failure => reducers time out
~200s: R9 speculatively executed; huge progress rate; statistics skewed
~400s: R11 finally speculatively executed
Stats skewed by very fast speculative tasks
Hadoop’s assumptions about progress rates invalidated