Hadoop's Overload Tolerant Design Exacerbates Failure Detection and Recovery — Florin Dinu, T. S. Eugene Ng, Rice University.


1 Hadoop's Overload Tolerant Design Exacerbates Failure Detection and Recovery. Florin Dinu, T. S. Eugene Ng — Rice University

2 Hadoop is Widely Used: image processing, protein sequencing, web indexing, machine learning, advertising analytics, log storage and analysis.* (* Source: recent research work)

3 Compute-Node Failures Are Common. "... typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours" — Jeff Dean, Google I/O 2008. "5.0 average worker deaths per job" — Jeff Dean, Keynote I, PACT 2006. Failures cost revenue, reputation, and user experience.

4 Hadoop is widely used, and compute-node failures are common and damaging. How does Hadoop behave under compute-node failures? Inflated, variable, and unpredictable job running times; sluggish failure detection. What are the design decisions responsible? Answered in this work.

5 Focus of This Work. Task Tracker failures: loss of intermediate data and loss of running tasks; Data Nodes are not failed. Types of failures: Task Tracker process fail-stop failures and Task Tracker node fail-stop failures. Single failures expose the mechanisms and their interactions; the findings also apply to multiple failures. (Components: Name Node, Job Tracker, Task Tracker, Mapper, Reducer, Data Node.)

6 Declaring a Task Tracker Dead. Heartbeats are sent from the Task Tracker to the Job Tracker, usually every 3s. Every 200s, the Job Tracker checks whether heartbeats have not been sent for at least 600s; if so, it restarts the running tasks and re-executes the completed maps. Conservative design.
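The expiry check above can be sketched as follows. This is a simplified model of the slide's description, not Hadoop code; the function name and parameters are hypothetical:

```python
def detection_time(last_heartbeat, check_interval=200, expiry=600):
    """Time at which the Job Tracker declares the Task Tracker dead.

    Checks run at t = check_interval, 2 * check_interval, ...; the tracker
    is declared dead at the first check where no heartbeat has arrived
    for at least `expiry` seconds. (Sketch; not Hadoop code.)
    """
    check = check_interval
    while check - last_heartbeat < expiry:
        check += check_interval
    return check
```

Because the failure can fall anywhere within a 200s check period, the detection latency is not fixed: it ranges from 600s up to nearly 800s, which is exactly the variability the next slide shows.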

7 Declaring a Task Tracker Dead (continued). Failure detection time is variable: depending on where the failure falls relative to the 200s check boundaries, detection takes roughly 600s or roughly 800s.

8 Declaring a Map Output Lost. Hadoop uses notifications from running reducers to the Job Tracker: a notification is a message that a specific map output is unavailable. Map M is restarted to re-compute its lost output when #notif(M) > 0.5 * #running reducers and #notif(M) > 3. Conservative design; static parameters.
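The threshold test can be written down directly. A minimal sketch of the slide's condition, with hypothetical names:

```python
def map_output_lost(notifications, running_reducers):
    """Simplified condition for declaring map M's output lost.

    More than half of the running reducers must complain, and strictly
    more than 3 notifications are required. Both thresholds are static
    parameters. (Sketch of the slide's formula, not Hadoop code.)
    """
    return notifications > 0.5 * running_reducers and notifications > 3
```

Note the consequence exploited later in the talk: with few reducers impacted, the `> 3` floor means the output may not be declared lost for a very long time, while with many reducers the majority threshold dominates.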

9 Reducer Notifications. A notification signals that a specific map output is unavailable. On a connection error (R1): the reducer re-attempts the connection, with an exponential wait between attempts, wait = 10 * (1.3)^nr_failed_attempts; it sends a notification when the number of attempts is a multiple of 10 (usually ~416s are needed for 10 attempts). On a read error (R2): it sends a notification immediately. Conservative design; static parameters.
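The ~416s figure follows from summing the exponential waits between the 10 connection attempts. A sketch of the slide's formula (Hadoop's exact accounting may differ slightly; the function name is hypothetical):

```python
def time_to_notification(attempts=10, base=10.0, factor=1.3):
    """Total wait before a reducer's first notification on connection errors.

    After the n-th failed attempt the reducer waits base * factor**n
    seconds; a notification goes out once every 10 attempts, so the nine
    waits after attempts 1..9 are summed. (Sketch, not Hadoop code.)
    """
    return sum(base * factor ** n for n in range(1, attempts))
```

Summing 10 * 1.3^n for n = 1..9 gives about 416 seconds, matching the slide.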

10 Declaring a Reducer Faulty. A reducer is considered faulty if (simplified version): #shuffles failed > 0.5 * #shuffles attempted and #shuffles succeeded < 0.5 * #shuffles necessary, or the reducer has stalled for too long. The check ignores the cause of the failed shuffles. Static parameters.
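The simplified health check reads as follows. A sketch with hypothetical names, not Hadoop code:

```python
def reducer_faulty(failed, attempted, succeeded, necessary, stalled=False):
    """Simplified reducer-health condition from the slide.

    The condition looks only at shuffle counts and ignores *why* shuffles
    failed, so transient congestion and a dead Task Tracker are
    indistinguishable to it. (Sketch, not Hadoop code.)
    """
    return (failed > 0.5 * attempted
            and succeeded < 0.5 * necessary) or stalled
```

This cause-blindness is what later lets a single Task Tracker failure kill perfectly healthy reducers (slide 18).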

11 Experiment: Methodology. 15-node, 4-rack testbed in the OpenCirrus* cluster: 14 compute nodes, with 1 node reserved for the Job Tracker and Name Node. Sort job, 10GB input, 160 maps, 14 reducers, 200 runs per experiment. The job takes 220s in the absence of failures. A single Task Tracker process failure is injected at a random time between 0s and 220s. (* https://opencirrus.org/ — the HP/Intel/Yahoo! Open Cloud Computing Research Testbed)

12 Experiment: Results. Large variability in job running times.

13 Experiment: Results. Large variability in job running times; the runs cluster into seven groups, G1 through G7.

14 Group G1 – few reducers impacted. Recovery is slow when few reducers are impacted. M1 is copied by all reducers before the failure; after the failure, the re-executed reducer attempt R1_1 cannot access M1. R1_1 must send 3 notifications, which takes ~1250s, while the Task Tracker itself is declared dead after ~600–800s.

15 Group G2 – timing of failure. The timing of the failure relative to the Job Tracker's checks impacts the job running time: a ~200s difference separates G1 and G2.

16 Group G3 – early notifications. Early notifications increase the variability of job running times. In G1, notifications are sent after 416s (regular notifications); in G3, early notifications (<416s) cause map outputs to be declared lost. Causes: code-level race conditions and the timing of a reducer's shuffle attempts.

17 Groups G4 & G5 – many reducers impacted. Job running time under failure varies with the number of reducers impacted. G4: many reducers send notifications after 416s, so the map output is declared lost before the Task Tracker is declared dead. G5: same as G4, but early notifications are sent.

18 Induced Reducer Death. Recall that a reducer is considered faulty if (simplified version): #shuffles failed > 0.5 * #shuffles attempted and #shuffles succeeded < 0.5 * #shuffles necessary, or the reducer has stalled for too long. If the failed Task Tracker is contacted among a reducer's first Task Trackers, the reducer dies; if the failed Task Tracker is attempted too many times, the reducer also dies. Thus a failure can induce further failures in healthy reducers, and CPU time and network bandwidth are unnecessarily wasted.
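The induced-death effect can be illustrated with a tiny simulation. This is a hypothetical model, not Hadoop code: fetches from the failed tracker fail, all others succeed, and the slide-10 condition is applied after every attempt:

```python
def dies_early(contact_order, failed_tt, necessary=14):
    """Hypothetical simulation of induced reducer death.

    A healthy reducer fetches map outputs from Task Trackers in
    `contact_order`. If the failed tracker is met among the first
    attempts, the failed/attempted ratio trips the faultiness condition
    and the healthy reducer is killed. (Sketch, not Hadoop code.)
    """
    failed = succeeded = 0
    for tracker in contact_order:
        if tracker == failed_tt:
            failed += 1
        else:
            succeeded += 1
        attempted = failed + succeeded
        if failed > 0.5 * attempted and succeeded < 0.5 * necessary:
            return True  # reducer declared faulty and killed
    return False
```

The same reducer survives or dies depending only on where the failed tracker appears in its contact order, which is one source of the run-to-run variability in the results.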

19 56 vs 14 Reducers. With 56 reducers, job running times are spread out even more: there is an increased chance of induced reducer death or early notifications. [CDF of job running times]

20 Simulating Node Failure. Without RST packets, all affected tasks wait for the Task Tracker to be declared dead. [CDF of job running times]

21 Lack of Adaptivity. Recall: a notification is sent only after 10 attempts. Inefficiency: a static, one-size-fits-all solution cannot handle all situations; its efficiency varies with the number of reducers. A way forward: use more detailed information about the current job state.

22 Conservative Design. Recall: a Task Tracker is declared dead after at least 600s; a notification is sent after 10 attempts and ~416 seconds. Inefficiency: the design assumes most problems are transient, yielding a sluggish response to permanent compute-node failures. A way forward: leverage additional information, such as network state and historical information about compute-node behavior [OSDI '10].

23 Simplistic Failure Semantics. Lack of TCP connectivity is treated as a problem with the tasks. Inefficiency: Hadoop cannot distinguish between the multiple causes of lost connectivity, such as transient congestion versus compute-node failure. A way forward: decouple failure recovery from overload recovery; use AQM/ECN to provide extra congestion information; allow direct communication between the application and the infrastructure.

24 Thank you. Company and product logos are from the companies' websites; conference logos are from the conference websites.

25 Group G3 – early notifications (detail). Early notifications increase job running time variability: G1 notifications are sent after 416s, while in G3 early notifications cause map outputs to be declared lost. Causes: code-level race conditions and the timing of a reducer's shuffle attempts. [Figure: the interleaving of reducer R2's shuffle attempts for the outputs of maps M5 and M6]

26 Task Tracker Failure-Related Mechanisms: declaring a Task Tracker dead; declaring a map output lost; declaring a reducer faulty.


