Counting on Failure: 10, 9, 8, 7, …, 3, 2, 1
Al Geist, CS Research Group, Computer Science and Mathematics Division, Oak Ridge National Laboratory
CCGSC Conference, Flat Rock, North Carolina, September 12, 2006
Research sponsored by the DOE Office of Science

Rapid growth in scale drives the need for fault tolerance.
E.g., the ORNL Leadership Computing Facility hardware roadmap, a 20X change in scale in 2½ years:

- Today: 25 TF
- Jul 2006: Cray XT3, 54 TF (56 cabinets), 5,294 nodes, 10,588 processors, 21 TB memory
- Nov 2006: Cray XT4, 100 TF (+68 cabinets), 11,706 nodes, 23,412 processors, 46 TB memory
- Dec 2007: 250 TF (68 quad-core cabinets), 11,706 nodes, 36,004 processors, 71 TB memory
- Nov 2008: Cray "Baker", 1 PF (136 new cabinets), 24,576 nodes, 98,304 processors, 175 TB memory

10 nodes a day: the "estimated" failure rate for a 1 Petaflop system (ORNL's 1 PF Cray "Baker" system in 2008), roughly today's rate of one failure every day or two, times 20.
- With 25,000 nodes, this is a tiny fraction (0.0004) of the whole system.
- The RAS system automatically configures around faults, so the machine stays up for days.
- But every one of these failures kills the application that was using that node!
- Today's applications and their runtime libraries may scale, but they are not prepared for the failure rates of these systems.
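The arithmetic behind the "10 nodes a day" estimate can be sketched as follows. The per-node MTBF figure is an illustrative assumption (chosen to reproduce the slide's rate), not a number from the talk:

```python
# Rough failure-rate arithmetic for a large system: if each node fails
# independently with a mean time between failures of `node_mtbf_days`,
# the system as a whole sees roughly nodes / mtbf failures per day.
def failures_per_day(nodes, node_mtbf_days):
    return nodes / node_mtbf_days

# Assumed per-node MTBF of 2,500 days (~7 years), an illustrative value:
# a 25,000-node system then sees about 10 node failures a day, even
# though each failure touches only 1/25,000 = 0.00004 of the nodes.
rate = failures_per_day(25_000, 2_500)
print(rate)  # 10.0
```

The point of the sketch is that no individual component needs to be unreliable for the aggregate failure rate to become an everyday event.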

'09: The End of Fault Tolerance as We Know It
The crossover point where checkpointing ceases to be viable (2009 is a guess):
- MTTI grows smaller as the number of parts increases.
- Time to checkpoint grows larger as the problem size increases.
- Past the crossover, the time to take a checkpoint exceeds the mean time to interrupt.
The good news is that MTTI is better than expected for LLNL's BG/L and ORNL's XT3: 6-7 days, not minutes (as of 2006).
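One common way to quantify this crossover is Young's first-order model for the optimal checkpoint interval, which is not named on the slide but captures its two trends; the hour figures below are illustrative assumptions:

```python
import math

# Young's first-order model: the optimal checkpoint interval is
# sqrt(2 * C * MTTI), where C is the time to write one checkpoint.
# As C grows and MTTI shrinks, the fraction of wall-clock time lost
# to checkpointing climbs toward 100% -- the slide's crossover.
def optimal_interval(checkpoint_time, mtti):
    return math.sqrt(2.0 * checkpoint_time * mtti)

def checkpoint_overhead_fraction(checkpoint_time, mtti):
    # Time spent writing one checkpoint per optimal interval.
    return checkpoint_time / optimal_interval(checkpoint_time, mtti)

# Illustrative trend: checkpoint time C rising, MTTI falling (hours).
for c, mtti in [(0.1, 24.0), (0.5, 12.0), (2.0, 4.0)]:
    print(c, mtti, round(checkpoint_overhead_fraction(c, mtti), 3))
```

In the last illustrative case half of all machine time goes to writing checkpoints, which is the sense in which checkpoint/restart "ceases to be viable".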

8 Strategies for an application to handle faults
1. Restart from a checkpoint file [large apps today]
2. Restart from a diskless checkpoint [avoids stressing the I/O system and causing more faults]
3. Recalculate lost data from an in-memory RAID
4. Lossy recalculation of lost data [for iterative methods]
5. Recalculate lost data from the initial and remaining data
6. Replicate computation across the system
7. Reassign lost work to another resource
8. Use naturally fault-tolerant algorithms
The strategies range from storing a full checkpoint (on disk or in memory), to saving only some state, to no checkpoint at all. We need to develop a rich methodology to "run through" faults.
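Strategies 2 and 3 rely on RAID-style redundancy held in memory rather than on disk. A minimal sketch of the idea, assuming a single parity process and single-failure recovery (the classic diskless-checkpointing setup, simplified to toy byte strings):

```python
from functools import reduce

# Diskless checkpointing sketch: each of N processes keeps its state in
# memory, and an extra "parity" process keeps the XOR of all states.
# If any single process is lost, its state equals the XOR of the parity
# with the survivors' states -- no disk I/O on the recovery path.
def parity(states):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), states)

states = [bytes([i] * 4) for i in range(1, 5)]   # toy per-process state
p = parity(states)                               # held by the parity process

lost = 2                                         # say process 2 fails
survivors = [s for i, s in enumerate(states) if i != lost]
recovered = parity(survivors + [p])              # XOR cancels the survivors
assert recovered == states[lost]
print("recovered", recovered.hex())
```

Real implementations stripe the parity and use Reed-Solomon-style codes to survive multiple simultaneous failures; the XOR case shows the principle.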

8 (cont.): Natural fault-tolerant algorithms
Demonstrated that scale invariance and natural fault tolerance can exist for both local and global algorithms where 100 failures happen across 100,000 processes:
- Finite difference (Christian Engelmann): demonstrated natural fault tolerance with a chaotic-relaxation, meshless, finite-difference solution of Laplace and Poisson problems.
- Global information (Kasidit Chanchio): demonstrated natural fault tolerance in a global-max problem with random, directed graphs.
- Gridless multigrid (Ryan Adams): combines the fast convergence of multigrid with the natural fault tolerance property; a hierarchical implementation of the finite-difference work above, with three different asynchronous update schemes explored.
- Theoretical analysis (Jeffery Chen)
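The chaotic-relaxation idea can be shown on a toy problem: a 1-D Laplace equation solved by Jacobi sweeps in which a random fraction of point updates is simply skipped each sweep, standing in for lost processes. This is a simplified illustration of the property, not the cited meshless implementation:

```python
import random

# Natural fault tolerance sketch: Jacobi relaxation for a 1-D Laplace
# problem with fixed boundary values 0 and 1. Even when a fraction of
# the interior updates is randomly "lost" every sweep, the iteration
# still converges to the same steady state (the linear profile),
# just more slowly -- the algorithm runs through the faults.
def relax(n=9, sweeps=2000, drop=0.3, seed=1):
    rng = random.Random(seed)
    u = [0.0] * n
    u[-1] = 1.0                              # boundary conditions
    for _ in range(sweeps):
        new = u[:]
        for i in range(1, n - 1):
            if rng.random() > drop:          # this update may be lost
                new[i] = 0.5 * (u[i - 1] + u[i + 1])
        u = new
    return u

u = relax()
exact = [i / 8 for i in range(9)]            # exact linear solution
assert all(abs(a - b) < 1e-6 for a, b in zip(u, exact))
print([round(x, 3) for x in u])
```

No update is ever rolled back or retried; the fixed point of the iteration is unchanged by which updates happen when, which is exactly the "naturally fault tolerant" property.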

7/24: The system can't ignore faults
- The file system can't let data be corrupted by faults; I/O nodes must recover and cover failures.
- The heterogeneous OS must be able to tolerate failures of any of its node types and instances. For example, a failed service node shouldn't take out a bunch of compute nodes.
- The schedulers and other system components must be aware of the dynamically changing system configuration, so that tasks get assigned around failed components.
- The system must support simultaneous updates, parallel recovery from multiple node failures, and fast recovery from faults (cf. the Harness P2P control research).

6 Options for the system to handle jobs
What should the system do when a fault occurs?
1. Restart, from a checkpoint or from the beginning
2. Notify the application and let it handle the problem
3. Migrate the task to other hardware before failure
4. Reassign the work to spare processor(s)
5. Replicate tasks across the machine
6. Ignore the fault altogether
We need a mechanism for each application (or component) to specify to the system what to do if a fault occurs.
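The mechanism the slide calls for, a way for each application to declare its preferred recovery option ahead of time, can be sketched as a small policy registry. The class and policy names here are hypothetical illustrations, not any real scheduler's API:

```python
# Sketch of a per-application fault-policy registry: the application
# registers which of the six options the system should apply to it,
# and the system consults the registry when a fault hits.
POLICIES = {"restart", "notify", "migrate", "reassign", "replicate", "ignore"}

class FaultPolicyRegistry:
    def __init__(self):
        self._policy = {}

    def register(self, component, policy):
        if policy not in POLICIES:
            raise ValueError(f"unknown policy: {policy}")
        self._policy[component] = policy

    def on_fault(self, component):
        # Default to a full restart if the component never said otherwise,
        # matching what large applications do today.
        return self._policy.get(component, "restart")

reg = FaultPolicyRegistry()
reg.register("climate_app", "notify")
print(reg.on_fault("climate_app"))   # notify
print(reg.on_fault("legacy_app"))    # restart
```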

5 Recovery modes for MPI applications
The Harness project's FT-MPI explored five modes of recovery. They affect the size (extent) and ordering of the communicators:
- ABORT: just do as vendor implementations do
- BLANK: leave holes, but make sure collectives do the right thing afterwards
- SHRINK: re-order processes to make a contiguous communicator; some ranks change
- REBUILD: re-spawn lost processes and add them to MPI_COMM_WORLD
- REBUILD_ALL: same as REBUILD, except it rebuilds all communicators and groups and resets all key values, etc.
It may be time to consider an MPI-3 standard that allows applications to recover from faults.
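The difference between the non-trivial modes is easiest to see on a toy communicator. The helpers below model the semantics only; they are an illustration, not the FT-MPI API:

```python
# Toy model of FT-MPI's communicator-repair modes. A communicator is a
# list of process ids ordered by rank; `None` marks a failed rank.
def blank(comm, failed):
    # BLANK: keep the extent and ordering; failed ranks become holes.
    return [None if p in failed else p for p in comm]

def shrink(comm, failed):
    # SHRINK: drop failed ranks; survivors are renumbered contiguously,
    # so a process's rank may change.
    return [p for p in comm if p not in failed]

def rebuild(comm, failed, spawned):
    # REBUILD: re-spawn replacements for the lost ranks, in place.
    it = iter(spawned)
    return [next(it) if p in failed else p for p in comm]

world = ["p0", "p1", "p2", "p3", "p4"]
dead = {"p1", "p3"}
print(blank(world, dead))                  # ['p0', None, 'p2', None, 'p4']
print(shrink(world, dead))                 # ['p0', 'p2', 'p4']
print(rebuild(world, dead, ["p5", "p6"]))  # ['p0', 'p5', 'p2', 'p6', 'p4']
```

BLANK keeps every survivor's rank stable at the cost of holes that collectives must step around; SHRINK restores a dense rank space at the cost of renumbering; REBUILD restores the original extent with fresh processes.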

4 Ways to fail anyway
Validating the answer on such large systems is hard:
1. The fault may not be detected.
2. Recovery introduces perturbations.
3. The result may depend on which nodes fail.
4. The result looks reasonable but is actually wrong.
"I'll just keep running the job till I get the answer I want." We can't afford to run every job three (or more) times; yearly allocations are like $5M-$10M grants.
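When a job can be replicated, the classic defense against a result that "looks reasonable but is actually wrong" is a majority vote over the runs. A minimal sketch, with the rounding tolerance as an illustrative assumption:

```python
from collections import Counter

# Triple-run validation sketch: run the job several times and accept a
# result only if a strict majority of runs agree on it (after rounding
# away harmless floating-point noise). A single silently-corrupted run
# is outvoted; if all runs disagree, no answer is trusted.
def majority(results, ndigits=6):
    counts = Counter(round(r, ndigits) for r in results)
    value, votes = counts.most_common(1)[0]
    return value if votes > len(results) // 2 else None

print(majority([3.141592, 3.141592, 3.141807]))  # 3.141592
print(majority([1.0, 2.0, 3.0]))                 # None -- no consensus
```

The slide's economic point stands: at $5M-$10M per yearly allocation, tripling every run is rarely affordable, which is why cheaper validation remains an open problem.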

3 Steps to fault tolerance
There are three main steps in fault tolerance:
1. Detection that something has gone wrong
   - System: detection in hardware
   - Framework: detection by the runtime environment
   - Library: detection in a math or communication library
2. Notification of the application, runtime, and system components
   - Interrupt: a signal sent to the job or a system component
   - Error code returned by an application routine
3. Recovery of the application from the fault
   - By the system
   - By the application
   - Or neither: natural fault tolerance
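The three steps can be wired together in a few lines: a detector raises an event, the notification layer fans it out, and whichever recovery handlers have subscribed take action. All names here are illustrative, not a real runtime's API:

```python
# Minimal detection -> notification -> recovery pipeline.
class FaultHandlingPipeline:
    def __init__(self):
        self.log = []
        self.handlers = []

    def subscribe(self, handler):
        # Step 2 wiring: who gets notified (application, runtime, system).
        self.handlers.append(handler)

    def detect(self, source, error):
        # Step 1: something has gone wrong somewhere in the stack.
        self.log.append(("detected", source, error))
        for h in self.handlers:
            h(source, error)             # Step 2: notification fan-out

pipe = FaultHandlingPipeline()
# Step 3: a recovery action registered by, say, the runtime.
pipe.subscribe(lambda src, err: pipe.log.append(("recovered", src)))
pipe.detect("node42", "ECC double-bit error")
print(pipe.log)
```

The interesting design decisions are all in who subscribes: the system, the application, or (for naturally fault-tolerant algorithms) nobody at all.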

2 Reasons the problem is only going to get worse
The drive for large-scale simulations in biology, nanotechnology, medicine, chemistry, materials, etc. means that applications:
1. Require much larger problems (Space): they easily consume the 2 GB per core in ORNL LCF systems.
2. Require much longer runs (Time): science teams in climate, combustion, and fusion want to run dedicated for a couple of months.
From a fault-tolerance perspective, Space means the job "state" to be recovered is huge, and Time means that many faults will occur during a single run.

1 Holistic solution: a Fault Tolerance Backplane
We need coordinated fault awareness, prediction, and recovery across the entire HPC system, from the application down through the middleware, operating system, and hardware. The backplane spans detection, notification, and recovery, with services for monitoring, logging, event management, configuration, prediction and prevention, autonomic actions, and recovery.
The CIFTS project is underway at ANL, ORNL, LBL, UTK, IU, and OSU. "Prediction and prevention are critical because the best fault is the one that never happens."
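The backplane idea, in the spirit of CIFTS, is essentially a shared publish/subscribe bus for fault events: every layer publishes, and services such as the logger, event manager, or prediction engine subscribe to the topics they care about. The component names follow the slide; the bus API itself is an illustrative sketch:

```python
from collections import defaultdict

# Sketch of a fault-event bus: layers (hardware, OS, middleware,
# applications) publish events to named topics; backplane services
# (logger, event manager, prediction & prevention, recovery) subscribe.
class FaultEventBus:
    def __init__(self):
        self.subs = defaultdict(list)

    def subscribe(self, topic, service):
        self.subs[topic].append(service)

    def publish(self, topic, event):
        # Fan the event out to every interested service.
        for service in self.subs[topic]:
            service(event)

bus = FaultEventBus()
seen = []
bus.subscribe("detection", lambda e: seen.append(("logger", e)))
bus.subscribe("detection", lambda e: seen.append(("predictor", e)))
bus.publish("detection", "disk SMART warning on ion12")
print(seen)
```

Because prediction services see the same event stream as recovery services, a disk warning can trigger a preemptive task migration before the fault ever happens, which is the slide's closing point.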

Thanks! Questions?