Published by Loraine Dawson. Modified over 9 years ago.
1
Selective Recovery From Failures In A Task Parallel Programming Model
James Dinan*, Sriram Krishnamoorthy#, Arjun Singri*, P. Sadayappan*
*The Ohio State University
#Pacific Northwest National Laboratory
2
Faults at Scale
- Future systems will be built with a large number of components
  - MTBF is inversely proportional to the number of components
  - Faults will be frequent
- Checkpoint-restart is too expensive with numerous faults
  - Strain on system components, notably the file system
- Assumption of fault-free operation is infeasible
  - Applications need to think about faults
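The inverse relationship between MTBF and component count can be made concrete with a back-of-the-envelope calculation. The sketch below assumes independent components with identical, exponentially distributed failure times, which the slide does not state; the function name and the numbers are illustrative only.

```python
def system_mtbf_hours(component_mtbf_hours: float, num_components: int) -> float:
    """MTBF of a system that fails as soon as any one component fails.

    Assumption (not from the slides): failures are independent and
    exponentially distributed, so failure rates add up and
    system MTBF = component MTBF / number of components.
    """
    if num_components <= 0:
        raise ValueError("num_components must be positive")
    return component_mtbf_hours / num_components


# Illustrative numbers only: a 5-year component MTBF and 100,000 components.
five_years_in_hours = 5 * 365 * 24
example_mtbf = system_mtbf_hours(five_years_in_hours, 100_000)
print(f"system MTBF: {example_mtbf * 60:.1f} minutes")
```

Under these assumptions a component that fails once in five years still yields a system-level failure roughly every half hour, which is why frequent checkpoint-restart becomes hard to sustain.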
3
Programming Models
- SPMD ties computation to a process
  - Fixed machine model
  - Applications need to change with major architectural shifts
- Fault handling involves non-local design changes
  - Rely on p processes: what if one goes away?
- Message-passing makes it harder
  - Consistent cuts are challenging
  - Message logging, etc. is expensive
- Fault management requires a lot of user involvement
4
Problem Statement
- Fault management framework
  - Minimize user effort
- Components
  - Data state
    - Application data
    - Communication operations
  - Control state
    - What work is each process doing?
- Continue to completion despite faults
5
Approach
- One-sided communication model
  - Easy to derive consistent cuts
- Task parallel control model
  - Computation decoupled from processes
- User specifies computation
  - Collection of tasks on global data
- Runtime schedules computation
  - Load balancing
  - Fault management
6
Global Arrays (GA)
- PGAS family: UPC (C), CAF (Fortran), Titanium (Java), GA (library)
- Aggregate memory from multiple nodes into a global address space
- Data access via one-sided get(..), put(..), acc(..) operations
- Programmer controls data distribution and locality
- Fully interoperable with MPI and ARMCI
- Support for higher-level collectives: DGEMM, etc.
- Widely used: chemistry, sub-surface transport, bioinformatics, CFD
[Figure labels: Shared; Global address space; Private; Proc 0, Proc 1, Proc n; X[M][M][N]; X[1..9][1..9][1..9]]
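To illustrate the one-sided get/put/acc access pattern, here is a minimal single-process sketch. It is not the Global Arrays API: `ToyGlobalArray` and its methods are hypothetical names, the array is one-dimensional, and the "processes" are simulated by plain Python lists. The point is only that the caller names a global index range and no receive call is needed on the owner's side.

```python
class ToyGlobalArray:
    """Single-process stand-in for a block-distributed 1-D global array.

    Hypothetical illustration only; the real GA library is a C/Fortran
    library with a different interface.
    """

    def __init__(self, length: int, nprocs: int):
        self.length = length
        self.block = -(-length // nprocs)  # ceiling division: elements per owner
        # One private list per simulated process; together they form the array.
        self.local = [
            [0.0] * max(0, min(self.block, length - rank * self.block))
            for rank in range(nprocs)
        ]

    def owner(self, index: int) -> int:
        """Rank whose memory holds the element at a global index."""
        return index // self.block

    def _slot(self, index: int):
        if not 0 <= index < self.length:
            raise IndexError(index)
        return self.local[self.owner(index)], index % self.block

    def get(self, lo: int, hi: int):
        """Copy elements [lo, hi) out of the global array."""
        out = []
        for i in range(lo, hi):
            mem, off = self._slot(i)
            out.append(mem[off])
        return out

    def put(self, lo: int, values):
        """Overwrite elements starting at lo; repeating it changes nothing."""
        for k, v in enumerate(values):
            mem, off = self._slot(lo + k)
            mem[off] = v

    def acc(self, lo: int, values):
        """Add values into elements starting at lo; repeating it adds again."""
        for k, v in enumerate(values):
            mem, off = self._slot(lo + k)
            mem[off] += v


ga = ToyGlobalArray(length=10, nprocs=4)
ga.put(2, [1.0, 2.0, 3.0])   # spans the blocks owned by ranks 0 and 1
ga.acc(3, [10.0, 10.0])      # accumulate into indices 3 and 4
patch = ga.get(2, 5)
```

The difference between `put` (overwrite) and `acc` (add) matters for recovery: replaying a `put` is harmless, while replaying an `acc` double-counts.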
7
GA Memory Model
- Remote memory access
  - Dominant communication in GA programs
  - Destination known in advance
  - No receive operation or tag matching
  - Remote progress
    - Ensure overlap
- Atomics and collectives
  - Blocking
  - Few outstanding at any time
8
Saving Data State
- Data state = communication state + memory state
- Communication state
  - “Flush” pending RMA operations (single call)
  - Save atomic and collective ops (small state)
- Memory state
  - Force other processes to flush their pending ops
- Used in virtualized execution of GA apps (Comp. Frontiers’09)
  - Also enables pre-emptive migration
9
The Asynchronous Gap
- The PGAS memory model simplifies managing data
- Computation model is still regular, process-centric SPMD
  - Irregularity in the data can lead to load imbalance
- Extend the PGAS model to bridge the asynchronous gap
  - Dynamic, irregular view of the computation
  - Runtime system should perform load balancing
  - Allow for computation movement to exploit locality
[Figure labels: X[M][M][N]; X[1..9][1..9][1..9]; get(…)]
10
Control State – Task Model
- Express computation as a collection of tasks
  - Tasks operate on data stored in Global Arrays
  - Executed in collective task parallel phases
- Runtime system manages task execution
[Figure labels: SPMD; Task Parallel; Termination]
11
Task Model
- Inputs: global data, immediates, CLOs
- Outputs: global data, CLOs, child tasks
- Strict dependence: only parent → child (for now)
[Figure labels: Partitioned Global Address Space; Shared; Private; Proc 0, Proc 1, Proc n; X[0], X[1], X[N]; Y[0], Y[1], Y[N]; CLO 1; Task: f(...), In: 5, Y[0], ..., Out: X[1]]
12
Scioto Programming Interface
- High-level interface: shared global task collection
- Low-level interface: set of distributed task queues
  - Queues are prioritized by affinity
  - Use work-first principle (LIFO)
  - Load balancing via work stealing (FIFO)
13
Work Stealing Runtime System
- ARMCI task queue on each processor
  - Steals don’t interrupt the remote process
- When a process runs out of work
  - Select a victim at random and steal work from them
- Scaled to 8192 cores (SC’09)
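The LIFO/FIFO split and random victim selection can be sketched as follows. This is a single-process illustration with hypothetical names (`TaskQueue`, `next_task`), not the Scioto or ARMCI interface; it omits locking, affinity priorities, and termination detection, and it tries victims in random order until one yields a task, which is a simplification of "select a victim at random".

```python
import random
from collections import deque


class TaskQueue:
    """Per-process double-ended task queue (illustrative only)."""

    def __init__(self):
        self.tasks = deque()

    def push(self, task):
        self.tasks.append(task)          # owner adds at the tail

    def pop_local(self):
        # Owner takes the newest task (LIFO, work-first).
        return self.tasks.pop() if self.tasks else None

    def steal(self):
        # Thief takes the oldest task (FIFO) from the opposite end.
        return self.tasks.popleft() if self.tasks else None


def next_task(rank, queues, rng):
    """Return the next task for `rank`, stealing if its own queue is empty."""
    task = queues[rank].pop_local()
    if task is not None:
        return task
    victims = [r for r in range(len(queues)) if r != rank]
    rng.shuffle(victims)                 # random victim order
    for victim in victims:
        task = queues[victim].steal()
        if task is not None:
            return task
    return None                          # no work anywhere


rng = random.Random(0)
queues = [TaskQueue(), TaskQueue()]
for name in ("a", "b", "c"):
    queues[0].push(name)

first_local = next_task(0, queues, rng)   # newest task from rank 0's own queue
first_stolen = next_task(1, queues, rng)  # rank 1 is empty, steals oldest from rank 0
```

Taking from opposite ends keeps the owner and the thief from contending for the same task in the common case.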
14
Communication Markers
- Communication initiated by a failed process
  - Handling partial completions
  - Get(), Put() are idempotent: ignore
  - Acc() is non-idempotent
    - Mark beginning and end of acc() ops
- Overhead
  - Memory usage: proportional to # tasks
  - Communication: additional small messages
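A minimal sketch of bracketing an accumulate with begin/end markers is shown below. The names (`MarkedBlock`, `STARTED`, `CONTRIBUTED`, `fail_midway`) are hypothetical, the marker granularity (one marker per task per block) is an assumption, and the failure is simulated by returning early; the actual runtime's data layout is not shown in the slides.

```python
STARTED, CONTRIBUTED = "started", "contributed"


class MarkedBlock:
    """A data block plus per-task markers around acc() (illustrative only)."""

    def __init__(self, size: int):
        self.data = [0.0] * size
        self.markers = {}  # task id -> STARTED or CONTRIBUTED

    def acc(self, task_id, values, fail_midway=False):
        self.markers[task_id] = STARTED            # mark beginning
        for i, v in enumerate(values):
            if fail_midway and i == len(values) // 2:
                return                             # simulated crash of the initiator
            self.data[i] += v
        self.markers[task_id] = CONTRIBUTED        # mark end

    def incomplete(self):
        """Tasks whose accumulate began but never finished."""
        return sorted(t for t, s in self.markers.items() if s == STARTED)


block = MarkedBlock(4)
block.acc("t1", [1.0, 1.0, 1.0, 1.0])
block.acc("t2", [2.0, 2.0, 2.0, 2.0], fail_midway=True)
partially_updated = block.incomplete()
```

After the simulated crash the block holds a partial contribution from `t2`, and the marker left at `started` is what identifies it as needing recovery.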
15
Fault Tolerant Task Pool
- Re-execute incomplete tasks until a round completes without failures
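The re-execution loop can be sketched as below. This is an interpretation of the one-line slide summary, not the authors' algorithm: `run_fault_tolerant_pool` and the `run_round` callback are hypothetical, and failures are scripted rather than detected.

```python
def run_fault_tolerant_pool(tasks, run_round):
    """Repeat task-parallel rounds until one finishes with nothing to redo.

    `run_round(pending)` is a hypothetical callback that executes the
    pending tasks and returns the set of tasks that must be re-executed
    because a failure left them incomplete.
    """
    pending = set(tasks)
    rounds = 0
    while pending:
        rounds += 1
        pending = set(run_round(pending))   # empty set == round without failures
    return rounds


# Scripted example: two rounds lose work to failures, the third is clean.
losses = iter([{2, 5}, {5}, set()])
executed = []


def scripted_round(pending):
    executed.append(sorted(pending))
    return next(losses)


rounds_needed = run_fault_tolerant_pool(range(6), scripted_round)
```

Only the tasks affected by a failure are re-run in later rounds, so the amount of repeated work shrinks with the number of tasks lost rather than with the size of the pool.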
16
Task Execution
- Update result only if it has not already been modified
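One way to read "update only if not already modified" is a guard that makes a re-executed task's accumulate a no-op when its earlier contribution already landed. The sketch below assumes a per-block set of contributing task ids; `guarded_acc` and that bookkeeping are hypothetical, not taken from the slides.

```python
def guarded_acc(block, contributed, task_id, values):
    """Accumulate `values` into `block` unless `task_id` already contributed.

    Returns True if the update was applied, False if it was skipped.
    """
    if task_id in contributed:
        return False                 # result already present; skip on re-execution
    for i, v in enumerate(values):
        block[i] += v
    contributed.add(task_id)
    return True


block = [0.0, 0.0]
contributed = set()
first_attempt = guarded_acc(block, contributed, "t1", [1.0, 2.0])
second_attempt = guarded_acc(block, contributed, "t1", [1.0, 2.0])  # re-execution
```

With the guard in place, re-running a task after a failure cannot double-count its contribution.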
17
Detecting Incomplete Communication
- Data with ‘started’ set but not ‘contributed’
- Approach 1: “Naïve” scheme
  - Check all markers for any that remain ‘started’
  - Not scalable
- Approach 2: “Home-based” scheme
  - Invert the task-to-data mapping
  - Distributed meta-data check + all-to-all
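The two detection schemes can be contrasted with a small simulation. This is an interpretation under stated assumptions: markers are grouped by the rank that owns the data (the "home"), each home checks only its own markers, and the all-to-all exchange is modeled as a set union. The function names and marker layout are hypothetical.

```python
def naive_detect(markers_by_home):
    """Every process scans every marker in the system (not scalable)."""
    return {
        task
        for markers in markers_by_home.values()
        for task, state in markers.items()
        if state == "started"
    }


def local_check(markers):
    """A home rank inspects only the markers stored with its own data."""
    return {task for task, state in markers.items() if state == "started"}


def home_based_detect(markers_by_home):
    """Distributed check at each home, then a simulated all-to-all union."""
    per_home = {home: local_check(m) for home, m in markers_by_home.items()}
    everyone_knows = set().union(*per_home.values())
    return everyone_knows, per_home


markers_by_home = {
    0: {"t1": "contributed", "t2": "started"},
    1: {"t3": "contributed"},
    2: {"t2": "started", "t4": "started"},
}
incomplete, per_home = home_based_detect(markers_by_home)
```

Both functions find the same incomplete tasks; the difference is that in the home-based version each rank touches only its own meta-data before the exchange.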
18
Algorithm Characteristics
- Tolerance to an arbitrary number of failures
- Low overhead in the absence of failures
  - Small messages for markers
  - Can be optimized through pre-issue/speculation
- Space overhead proportional to task pool size
  - Storage for markers
- Recovery cost proportional to #failures
  - Redo work to produce data in failed processes
19
Bounding Cascading Failures
- A process with “corrupted” data
  - Incomplete comm. from failed process
  - Marking it as failed -> cascading failures
- Instead, a process with “corrupted” data
  - Flushes its communication; then recovers its data
- Each task computes only a few data blocks
  - Each process: pending comm. to few blocks at a time
- Total recovery cost
  - Data in failed processes + a small additional number
20
Experimental Setup
- Linux cluster
- Each node
  - Dual quad-core 2.5 GHz Opterons
  - 24 GB RAM
- InfiniBand interconnection network
- Self-Consistent Field (SCF) kernel: 48 Be atoms
- Worst-case fault: at the end of a task pool
21
Cost of Failure – Strong Scaling
- #tasks re-executed goes down with increase in process count
22
Worst Case Failure Cost
23
Relative Performance
- Less than 10% cost for one worst-case fault
24
Related Work
- Checkpoint restart
  - Continues to handle the SPMD portion of an app
  - Finer-grain recoverability using our approach
- BOINC: client-server
- CilkNOW: single assignment form
- Linda: requires transactions
- CHARM++: processor virtualization based
  - Needs message logging
- Efforts on fault tolerant runtimes
  - Complement this work
25
Conclusions
- Fault tolerance through
  - PGAS memory model
  - Task parallel computation model
- Fine-grain recoverability through markers
- Cost of failure proportional to #failures
- Demonstrated low-cost recovery for an SCF kernel
26
Thank You!