Fault Tolerance Challenges and Solutions
Presented by Al Geist
Network and Cluster Computing, Computational Sciences and Mathematics Division
Research supported by the Department of Energy's Office of Science, Office of Advanced Scientific Computing Research

Slide 2: Example: ORNL Leadership Computing Facility hardware roadmap
Rapid growth in scale drives the need for fault tolerance: a 20X scale change in 2.5 years, moving from the 25 TF Cray XT3 through the Cray XT4 to a 1 PF Cray "Baker" system.
Jul 2006: 54 TF (56 cabinets), 5,294 nodes, 10,588 processors, 21 TB memory
Nov 2006: 100 TF (+68 cabinets), 11,706 nodes, 23,412 processors, 46 TB memory
Dec 2007: 250 TF (68 quad-core cabinets), 11,706 nodes, 36,004 processors, 71 TB memory
Late 2008: 1 PF (136 new cabinets), 24,576 nodes, 98,304 processors, 175 TB memory

3 Geist_FT_0611 Today’s applications and their runtime libraries may scale but are not prepared for the failure rates of these systems ORNL 1 PF Cray “Baker” system, nodes a day: “Estimated” failure rate for 1 petaflop system (Today’s rate of 1–2 per day times 20)  With 25,000 nodes, this is a tiny fraction (0.0004) of the whole system  The RAS system automatically configures around faults – up for days  But every one of these failures kills the application that was using that node!

Slide 4: Good news: MTTI is better than expected for LLNL BG/L and ORNL XT3 (6–7 days, not minutes).
2009: the end of fault tolerance as we know it (the point where checkpointing ceases to be viable).
[Chart: mean time to interrupt (hours) versus time, with November–January measurements from 2006 and a 2009 estimate. Time to checkpoint increases with problem size while MTTI decreases as the number of parts increases; the two curves reach a crossover point around 2009.]
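To make the crossover concrete, a standard rule of thumb (not from the slides themselves) is Young's approximation for the checkpoint interval. With checkpoint write time \delta and mean time to interrupt M, the optimal interval is

    \tau_{\mathrm{opt}} \approx \sqrt{2\,\delta M}

and the fraction of machine time lost to writing checkpoints plus redoing lost work is roughly \sqrt{2\delta/M}. Checkpoint/restart stops being viable as \delta grows with problem size while M shrinks with part count, which is exactly the crossover the slide estimates for 2009.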

Slide 5: 8 strategies for an application to handle faults
We need to develop a rich methodology to "run through" faults. The slide arranges the strategies along a spectrum from storing a checkpoint in memory, through saving only some state, to no checkpoint at all (a sketch of the in-memory idea follows the list).
1. Restart from a checkpoint file [large apps today]
2. Restart from a diskless checkpoint [avoids stressing the I/O system and causing more faults]
3. Recalculate lost data from in-memory RAID
4. Lossy recalculation of lost data [for iterative methods]
5. Recalculate lost data from the initial and remaining data
6. Replicate the computation across the system
7. Reassign lost work to another resource
8. Use naturally fault-tolerant algorithms
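Below is a minimal, hypothetical C/MPI sketch of the in-memory idea behind strategies 2 and 3: each rank keeps a copy of a partner rank's state so a lost node can be rebuilt without touching the parallel file system. The array size, pairing scheme, and function names are assumptions made for illustration, not code from the presentation or from any library.

/*
 * Minimal sketch of diskless (partner-node) checkpointing, assuming the
 * application state is a simple 1-D array of doubles.
 */
#include <mpi.h>

#define N 1024                    /* local state size (assumed for the example) */

static double state[N];           /* this rank's working data                   */
static double partner_ckpt[N];    /* copy of the partner rank's checkpoint      */

/* Each rank exchanges its state with a partner and holds the partner's copy in
 * RAM, so a single failed node can be rebuilt from its partner, not from disk. */
static void diskless_checkpoint(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int partner = rank ^ 1;               /* pair ranks 0-1, 2-3, ...           */
    if (partner >= size) partner = rank;  /* odd rank count: keep own copy      */

    MPI_Sendrecv(state, N, MPI_DOUBLE, partner, 0,
                 partner_ckpt, N, MPI_DOUBLE, partner, 0,
                 comm, MPI_STATUS_IGNORE);
}

/* After the failed rank has been replaced, its partner ships the saved copy back. */
static void diskless_restore(MPI_Comm comm, int failed_rank)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    if (rank == (failed_rank ^ 1))
        MPI_Send(partner_ckpt, N, MPI_DOUBLE, failed_rank, 1, comm);
    else if (rank == failed_rank)
        MPI_Recv(state, N, MPI_DOUBLE, failed_rank ^ 1, 1,
                 comm, MPI_STATUS_IGNORE);
}

Strategy 3 (in-memory RAID) replaces the full partner copy with an XOR parity block spread across ranks, trading memory for a recomputation step at restore time.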

Slide 6: A 24/7 system can't ignore faults
The file system can't let data be corrupted by faults; I/O nodes must recover from and cover failures.
The heterogeneous OS must be able to tolerate failures of any of its node types and instances. For example, a failed service node shouldn't take out a bunch of compute nodes.
The schedulers and other system components must be aware of the dynamically changing system configuration, so that tasks get assigned around failed components (see the sketch below).
Harness P2P control research: fast recovery from a fault, parallel recovery from multiple node failures, and support for simultaneous updates.
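As a toy illustration of "assigning tasks around failed components," here is a hypothetical C fragment in which a scheduler filters its node pool against the nodes the RAS system has marked failed; the data structures and names are invented for the example and do not come from any real scheduler.

/* Hypothetical scheduler fragment: skip nodes the RAS system has marked failed. */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  id;
    bool failed;      /* set asynchronously from RAS fault events */
} node_t;

/* Return the number of healthy node ids copied into 'alloc', up to 'want'. */
size_t allocate_nodes(const node_t *pool, size_t pool_len, int *alloc, size_t want)
{
    size_t got = 0;
    for (size_t i = 0; i < pool_len && got < want; i++) {
        if (!pool[i].failed)           /* route the job around failed hardware */
            alloc[got++] = pool[i].id;
    }
    return got;                        /* caller checks got == want before launch */
}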

Slide 7: 6 options for the system to handle jobs
1. Restart, from a checkpoint or from the beginning
2. Notify the application and let it handle the problem
3. Migrate the task to other hardware before failure
4. Reassign the work to spare processor(s)
5. Replicate tasks across the machine
6. Ignore the fault altogether
We need a mechanism for each application (or component) to specify to the system what to do if a fault occurs; a hypothetical sketch of such a policy declaration follows.
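A minimal sketch, assuming a hypothetical runtime interface, of how an application might declare its preferred fault-handling option to the system. The enum values mirror the six options above, but the type, field, and function names are invented for illustration and are not an existing API.

/* Hypothetical fault-policy declaration; not an existing system interface. */
typedef enum {
    FT_RESTART_FROM_CHECKPOINT,   /* option 1 */
    FT_NOTIFY_APPLICATION,        /* option 2 */
    FT_MIGRATE_BEFORE_FAILURE,    /* option 3 */
    FT_REASSIGN_TO_SPARE,         /* option 4 */
    FT_REPLICATE_TASKS,           /* option 5 */
    FT_IGNORE_FAULT               /* option 6 */
} ft_policy_t;

typedef struct {
    ft_policy_t on_node_failure;   /* what the system should do when a node dies  */
    ft_policy_t on_network_fault;  /* policies could differ by fault class        */
    int         max_restarts;      /* give up after this many automatic recoveries */
} ft_job_policy_t;

/* An application (or component) would register its policy at startup, e.g.:
 *
 *     ft_job_policy_t p = { FT_NOTIFY_APPLICATION, FT_RESTART_FROM_CHECKPOINT, 3 };
 *     ft_register_policy(&p);   // hypothetical call into the runtime/scheduler
 */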

Slide 8: 5 recovery modes for MPI applications
The Harness project's FT-MPI explored 5 modes of recovery; these modes affect the size (extent) and ordering of the communicators.
1. ABORT: just do what vendor implementations do
2. BLANK: leave holes (but make sure collectives do the right thing afterwards)
3. SHRINK: reorder processes to make a contiguous communicator (some ranks change)
4. REBUILD: re-spawn lost processes and add them to MPI_COMM_WORLD
5. REBUILD_ALL: same as REBUILD, except it rebuilds all communicators and groups and resets all key values, etc.
It may be time to consider an MPI-3 standard that allows applications to recover from faults. A generic sketch of the REBUILD idea appears below.
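The following is a rough C sketch of the REBUILD idea using only standard MPI-2 dynamic-process calls (MPI_Comm_spawn plus MPI_Intercomm_merge). FT-MPI's actual API and implementation differ, so treat this purely as an illustration of re-spawning lost processes and folding them back into one working communicator; the function name and the 64-process limit are assumptions for the sketch.

/*
 * Illustrative REBUILD-style recovery with standard MPI-2 dynamic processes.
 * NOT the FT-MPI API.
 */
#include <mpi.h>

/* Called by the surviving processes once 'lost' ranks have been detected. */
static MPI_Comm rebuild_world(MPI_Comm survivors, int lost, const char *binary)
{
    MPI_Comm inter, rebuilt;
    int errcodes[64];                  /* assumes lost <= 64 for the sketch */

    /* Re-spawn replacements for the failed processes... */
    MPI_Comm_spawn((char *)binary, MPI_ARGV_NULL, lost, MPI_INFO_NULL,
                   0 /* root */, survivors, &inter, errcodes);

    /* ...and merge survivors + replacements into a new "world" communicator. */
    MPI_Intercomm_merge(inter, 0 /* survivors get the low ranks */, &rebuilt);
    MPI_Comm_free(&inter);

    /* The application would now restore state on the new ranks, e.g. from a
     * diskless checkpoint as sketched earlier, before resuming computation. */
    return rebuilt;
}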

Slide 9: Validation of the answer on such large systems: 4 ways to fail anyway
1. The fault may not be detected
2. Recovery introduces perturbations
3. The result may depend on which nodes fail
4. The result looks reasonable but is actually wrong
We can't afford to run every job three (or more) times; yearly allocations are like $5M–$10M grants.

Slide 10: 3 steps to fault tolerance
There are three main steps in fault tolerance:
1. Detection that something has gone wrong
   System: detection in hardware
   Framework: detection by the runtime environment
   Library: detection in a math or communication library
2. Notification of the application, runtime, and system components
   Interrupt: signal sent to the job or a system component
   Error code: returned by an application routine
   Subscription: notification delivered to subscribed components
3. Recovery of the application from the fault
   By the system
   By the application
   Neither: natural fault tolerance
A small MPI sketch of the "error code" notification path follows.
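As a concrete, minimal example of the error-code notification path, the sketch below installs the standard MPI error handler MPI_ERRORS_RETURN so that a failed communication comes back to the application as a return code instead of aborting the job; what the application does with that code (restart, rebuild, ignore) is one of the recovery choices above. The assumption that the MPI library itself survives the fault and returns an error is exactly what FT-MPI and later efforts had to provide.

/* Minimal sketch: receive fault notification as an MPI error code.
 * Run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* By default MPI aborts on error; ask for error codes instead. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank, peer_data = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* A receive from a (possibly failed) neighbor. */
        int rc = MPI_Recv(&peer_data, 1, MPI_INT, 1, 0,
                          MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING]; int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "rank 0 notified of fault: %s\n", msg);
            /* recovery decision goes here: restart, REBUILD, ignore, ... */
        }
    } else if (rank == 1) {
        int val = 42;
        MPI_Send(&val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}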

Slide 11: 2 reasons the problem is only going to get worse
The drive for large-scale simulations in biology, nanotechnology, medicine, chemistry, materials, etc.:
1. They require much larger problems (space), easily consuming the 2 GB per core in ORNL LCF systems.
2. They require much longer runs (time): science teams in climate, combustion, and fusion want to run dedicated for a couple of months.
From a fault tolerance perspective, space means that the job state to be recovered is huge, and time means that many faults will occur during a single run.

Slide 12: A holistic solution: the Fault Tolerance Backplane (CIFTS project)
We need coordinated fault awareness, prediction, and recovery across the entire HPC system, from the applications through the middleware and operating system down to the hardware.
"Prediction and prevention are critical because the best fault is the one that never happens."
Backplane services include detection (monitor, logger, configuration), notification (event manager), prediction and prevention, and recovery (autonomic actions, recovery services). A hypothetical sketch of the notification interface is below.
The project is under way at ANL, ORNL, LBL, UTK, IU, and OSU.
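To give a feel for what an event-manager-based backplane exposes, here is a deliberately simplified publish/subscribe interface in C. The real CIFTS Fault Tolerance Backplane defines its own API, so every type and function name below is an assumption made up for this sketch.

/* Hypothetical fault-event publish/subscribe interface (NOT the CIFTS API). */
#include <stdio.h>

typedef enum { EV_NODE_FAILED, EV_DISK_ERROR, EV_TEMP_WARNING } fault_event_t;

typedef void (*fault_cb_t)(fault_event_t ev, int source_node, void *ctx);

/* In a real backplane these would go through a system-wide event manager;
 * here a single static table stands in for it. */
static struct { fault_event_t ev; fault_cb_t cb; void *ctx; } subs[32];
static int nsubs;

void ftb_subscribe(fault_event_t ev, fault_cb_t cb, void *ctx)
{
    subs[nsubs].ev = ev; subs[nsubs].cb = cb; subs[nsubs].ctx = ctx; nsubs++;
}

void ftb_publish(fault_event_t ev, int source_node)
{
    for (int i = 0; i < nsubs; i++)
        if (subs[i].ev == ev)
            subs[i].cb(ev, source_node, subs[i].ctx);   /* notify subscribers */
}

/* Example: the scheduler subscribes so it can route new work around the node. */
static void on_node_failed(fault_event_t ev, int node, void *ctx)
{
    (void)ev; (void)ctx;
    printf("scheduler: marking node %d failed, rescheduling its tasks\n", node);
}

int main(void)
{
    ftb_subscribe(EV_NODE_FAILED, on_node_failed, NULL);
    ftb_publish(EV_NODE_FAILED, 1234);    /* the RAS monitor would publish this */
    return 0;
}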

Slide 13: Contact
Al Geist
Network and Cluster Computing, Computational Sciences and Mathematics Division
(865)