Hadoop's Overload Tolerant Design Exacerbates Failure Detection and Recovery. Florin Dinu, T. S. Eugene Ng, Rice University.

Hadoop is Widely Used: image processing, protein sequencing, web indexing, machine learning, advertising analytics, log storage and analysis. (Source: recent research work.)

Compute-Node Failures Are Common. "... typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours" (Jeff Dean, Google I/O 2008). "5.0 average worker deaths per job" (Jeff Dean, Keynote I, PACT 2006). The costs: revenue, reputation, user experience.

Compute-node failures are common and damaging, and Hadoop is widely used. How does Hadoop behave under compute-node failures? Inflated, variable, and unpredictable job running times, and sluggish failure detection. What are the design decisions responsible? We answer this in this work.

Focus of This Work. Task Tracker failures: loss of intermediate data and loss of running tasks; the Data Nodes are not failed. Types of failures: Task Tracker process fail-stop failures and Task Tracker node fail-stop failures. Single failures are studied to expose the mechanisms and their interactions; the findings also apply to multiple failures. (The slide shows the architecture components: Name Node, Job Tracker, Task Tracker, Mapper, Reducer, Data Node.)

Declaring a Task Tracker Dead. Heartbeats are sent from the Task Tracker to the Job Tracker, usually every 3s. Every 200s the Job Tracker checks whether heartbeats have not been sent for at least 600s; if so, the Task Tracker is declared dead, its running tasks are restarted, and its completed maps are restarted. A conservative design.

Declaring a Task Tracker Dead. The failure detection time is variable: depending on where the failure falls relative to the periodic 200s checks, detection takes from roughly 600s up to roughly 800s. (The slide shows two example timelines, one with a detection time of ~600s and one of ~800s.)
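The variability follows directly from the two static parameters above. A minimal sketch, not actual Hadoop code (the names and the exact check schedule are assumptions based on the slides): detection takes between 600s and just under 800s depending on where the failure falls on the 200s check grid.

    EXPIRY_INTERVAL = 600.0   # no heartbeat for this long => TaskTracker declared dead
    CHECK_PERIOD = 200.0      # how often the JobTracker runs the expiry check

    def detection_delay(failure_time, first_check_at=0.0):
        """Delay between a TaskTracker's last heartbeat (= failure time) and being declared dead."""
        t = first_check_at
        while t < failure_time or t - failure_time < EXPIRY_INTERVAL:
            t += CHECK_PERIOD
        return t - failure_time

    print(detection_delay(0.0))    # 600.0 -> the ~600s timeline on the slide
    print(detection_delay(10.0))   # 790.0 -> close to the ~800s timeline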

Declaring a Map Output Lost. The Job Tracker uses notifications from running reducers; a notification is a message that a specific map output is unavailable. Map M is restarted to re-compute its lost output once #notif(M) > 0.5 * #running reducers and #notif(M) > 3. A conservative design with static parameters.
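A small sketch of the threshold rule stated above (the function and parameter names are illustrative, not from the Hadoop source):

    def map_output_lost(notifications_for_m, running_reducers):
        # Map M's output is declared lost (and M re-executed) only when enough
        # running reducers have complained about it, and always more than 3.
        return notifications_for_m > 0.5 * running_reducers and notifications_for_m > 3

    # With the 14 reducers used later in the experiments, 8 notifications are needed:
    print(map_output_lost(7, 14))   # False
    print(map_output_lost(8, 14))   # True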

Reducer Notifications. A notification signals that a specific map output is unavailable. On a connection error (reducer R1 in the slide's figure): re-attempt the connection, send a notification whenever the number of attempts is a multiple of 10, and wait exponentially between attempts, wait = 10 * 1.3^(nr_failed_attempts); usually 416s are needed for 10 attempts. On a read error (reducer R2): send a notification immediately. A conservative design with static parameters.
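The 416s figure can be reproduced from the backoff formula, assuming the exponent counts the failed attempts so far (so the waits preceding attempts 2 through 10 are summed); a sketch:

    def seconds_before_notification(attempts=10):
        # Total wait accumulated before the 10th connection attempt, i.e. before
        # the first notification is sent on a connection error.
        return sum(10 * 1.3 ** k for k in range(1, attempts))

    print(round(seconds_before_notification()))   # 416, matching the slide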

Declaring a Reducer Faulty. A reducer is considered faulty if (simplified version): #shuffles failed > 0.5 * #shuffles attempted and #shuffles succeeded < 0.5 * #shuffles necessary, or the reducer has stalled for too long. The check ignores the cause of the failed shuffles. Static parameters.
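A sketch of this check, with illustrative names (the "simplified version" caveat from the slide applies here as well):

    def reducer_faulty(failed, attempted, succeeded, necessary, stalled_too_long=False):
        # The decision is made from shuffle counts alone; why the shuffles failed
        # (congestion vs. a dead node) is never examined.
        too_many_failures = failed > 0.5 * attempted and succeeded < 0.5 * necessary
        return too_many_failures or stalled_too_long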

Experiment: Methodology. 15-node, 4-rack testbed in the OpenCirrus* cluster: 14 compute nodes, with 1 node reserved for the Job Tracker and Name Node. Sort job, 10GB input, 160 maps, 14 reducers, 200 runs per experiment. The job takes 220s in the absence of failures. A single Task Tracker process failure is injected at a random time between 0 and 220s. (* the HP/Intel/Yahoo! Open Cloud Computing Research Testbed, https://opencirrus.org/)

Experiment: Results. Large variability in job running times.

Experiment: Results. Large variability in job running times; the runs cluster into groups G1 through G7, labeled on the slide's plot.

Group G1 (few reducers impacted). Recovery is slow when few reducers are impacted. Map output M1 was copied by all reducers before the failure; after the failure, the re-executed reduce attempt R1_1 cannot access M1. R1_1 needs to send 3 notifications, roughly 1250s, while the Task Tracker itself is declared dead only after the 600-800s detection delay described earlier.
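As a rough consistency check on the ~1250s figure, and assuming the retry/backoff cycle restarts after each notification (an assumption; the slide does not say this explicitly):

    # Three notification cycles of roughly 416s each before the lost output is re-computed:
    print(3 * 416)   # 1248, i.e. the ~1250s quoted on the slide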

Group G2 (timing of failure). The timing of the failure relative to the Job Tracker's checks impacts the job running time; this accounts for the 200s difference between G1 and G2. (The slide compares G1 and G2 on two timelines; labels include 170s, 200s, 600s, and the job end.)

Group G3 (early notifications). Early notifications increase job running time variability. In G1 the notifications are sent after 416s; in G3, early notifications (sent before 416s) cause map outputs to be declared lost. Causes: code-level race conditions and the timing of a reducer's shuffle attempts.

Groups G4 & G5 (many reducers impacted). The job running time under failure varies with the number of reducers impacted. G4: many reducers send notifications after 416s, and the map output is declared lost before the Task Tracker is declared dead. G5: same as G4, but early notifications are sent.

Induced Reducer Death. A reducer is faulty if (simplified version): #shuffles failed > 0.5 * #shuffles attempted and #shuffles succeeded < 0.5 * #shuffles necessary, or it has stalled for too long. If the failed Task Tracker is contacted among the first Task Trackers, the reducer dies; if the failed Task Tracker is attempted too many times, the reducer dies. A failure can thus induce other failures in healthy reducers, and CPU time and network bandwidth are unnecessarily wasted.
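Plugging illustrative numbers into the reducer_faulty sketch above shows the induced-death case: a healthy reducer whose first few shuffle attempts all happen to target the failed Task Tracker is itself declared faulty (the attempt counts here are assumptions for illustration).

    # 4 shuffle attempts so far, all against the failed TaskTracker; none of the
    # 160 map outputs this reducer needs has been fetched yet.
    print(reducer_faulty(failed=4, attempted=4, succeeded=0, necessary=160))   # True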

56 vs 14 Reducers. With 56 reducers, job running times are spread out even more; there is an increased chance of induced reducer death or early notifications. (The slide shows a CDF of job running times.)

Simulating Node Failure. Without RST packets, all affected tasks wait for the Task Tracker to be declared dead. (The slide shows a CDF of job running times.)

Lack of Adaptivity. Recall: a notification is sent after 10 attempts. Inefficiency: a static, one-size-fits-all solution cannot handle all situations, and its efficiency varies with the number of reducers. A way forward: use more detailed information about the current job state.

Conservative Design. Recall: a Task Tracker is declared dead after at least 600s, and a notification is sent after 10 attempts and 416 seconds. Inefficiency: this assumes most problems are transient and responds sluggishly to permanent compute-node failure. A way forward: leverage additional information, such as network state and historical information about compute-node behavior [OSDI '10].

Simplistic Failure Semantics. Lack of TCP connectivity is equated with a problem with the tasks. Inefficiency: Hadoop cannot distinguish between the possible causes of lost connectivity, such as transient congestion versus compute-node failure. A way forward: decouple failure recovery from overload recovery, use AQM/ECN to provide extra congestion information, and allow direct communication between the application and the infrastructure.

Thank you. Company and product logos are from each company's website; conference logos are from the conference websites.

Group G3 (early notifications), continued. Early notifications increase job running time variability. In G1 the notifications are sent after 416s; in G3, early notifications cause map outputs to be declared lost. Causes: code-level race conditions and the timing of a reducer's shuffle attempts. (The slide illustrates two orderings of reducer R2's shuffle attempts for map outputs M5 and M6.)

Task Tracker Failure-Related Mechanisms: declaring a Task Tracker dead, declaring a map output lost, and declaring a reducer faulty.