Fault Tolerance Challenges and Solutions
Presented by Al Geist
Computer Science Research Group, Computer Science and Mathematics Division
Research supported by the Department of Energy's Office of Science, Office of Advanced Scientific Computing Research

2 Geist_FT_SC07 Example: ORNL LCF hardware roadmap. Rapid growth in scale drives the fault tolerance need: a 20X scale change in 2.5 years.
- Jul 2006, Cray XT3: 54 TF, 56 cabinets, 5,294 nodes, 10,588 processors, 21 TB memory
- Nov 2006, Cray XT4: 119 TF, 124 cabinets, 11,706 nodes, 23,412 processors, 46 TB memory
- Dec 2007, Cray XT4 (84 quad-core cabinets): ~300 TF, 71 TB memory
- Late 2008, Cray Baker (128 new cabinets): 1 PF, 175 TB memory

3 Geist_FT_SC07 ORNL 1 PF Cray Baker system, 2009. Today's applications and their runtime libraries may scale, but they are not prepared for the failure rates of sustained petascale systems (assuming a linear model and a failure rate 20X what is seen today). The RAS system automatically configures around faults, keeping the machine up for days. But every one of these failures kills the application that was using that node!

4 Geist_FT_SC07 Good news: MTTI is better than expected for LLNL BG/L and ORNL XT4 (6–7 days, not minutes). But today's fault-tolerance paradigm (checkpoint/restart) ceases to be viable on large systems: time to checkpoint increases with problem size, while MTTI decreases as the number of parts increases, and the two curves cross over, estimated around 2009. [Chart: mean time to interrupt in hours, November–January, 2006 measurements vs. 2009 estimate, with the crossover point marked.]
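To make the crossover concrete, here is a minimal sketch, not from the slides, that estimates the optimal checkpoint interval with Young's well-known approximation T_opt = sqrt(2·C·MTTI), where C is the time to write one checkpoint. The per-node MTTI and checkpoint cost below are assumed, illustrative numbers only.

```c
/* Minimal sketch (not from the slides): as the machine grows, system MTTI
 * shrinks and eventually falls below the checkpoint time itself, at which
 * point checkpoint/restart stops being viable. Numbers are illustrative. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double node_mtti_hours = 5.0 * 365.0 * 24.0; /* assumed per-node MTTI: 5 years */
    double checkpoint_hours = 1.0;               /* assumed time to write one checkpoint */

    for (int nodes = 1000; nodes <= 128000; nodes *= 2) {
        double system_mtti = node_mtti_hours / nodes;          /* MTTI drops with part count */
        double t_opt = sqrt(2.0 * checkpoint_hours * system_mtti); /* Young's approximation   */
        printf("%7d nodes: system MTTI %6.2f h, optimal interval %5.2f h%s\n",
               nodes, system_mtti, t_opt,
               checkpoint_hours >= system_mtti ? "  <-- checkpoint no longer completes" : "");
    }
    return 0;
}
```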

5 Geist_FT_SC07 Need to develop new paradigms for applications to handle faults; need to develop a rich methodology to run through faults.
Some state saved:
1. Restart from checkpoint file [large apps today].
2. Restart from diskless checkpoint, i.e., store the checkpoint in memory [avoids stressing the I/O system and causing more faults].
3. Recalculate lost data from in-memory RAID.
4. Lossy recalculation of lost data [for iterative methods].
No checkpoint:
5. Recalculate lost data from initial and remaining data.
6. Replicate computation across the system.
7. Reassign lost work to another resource.
8. Use naturally fault-tolerant algorithms.
A sketch of the diskless-checkpoint idea (option 2) follows below.
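A minimal sketch of diskless checkpointing with RAID-style XOR parity, purely illustrative and not the FT-MPI or Harness implementation; the checkpoint size, the choice of a single dedicated parity rank, and single-failure recovery are all assumptions.

```c
/* Each rank keeps its checkpoint in memory; one rank additionally holds the
 * XOR of every rank's checkpoint. A single lost checkpoint equals the XOR of
 * the parity block with all surviving ranks' checkpoints.                   */
#include <mpi.h>
#include <string.h>

#define CKPT_WORDS  (1u << 20)  /* assumed checkpoint size per rank, in words */
#define PARITY_RANK 0           /* rank that stores the XOR parity block      */

static unsigned int local_ckpt[CKPT_WORDS]; /* in-memory checkpoint           */
static unsigned int parity[CKPT_WORDS];     /* meaningful on PARITY_RANK only */

/* Take a diskless checkpoint: snapshot local state in memory, then fold all
 * ranks' snapshots together with XOR onto the parity rank (no disk I/O).     */
void diskless_checkpoint(const unsigned int *state, MPI_Comm comm)
{
    memcpy(local_ckpt, state, sizeof(local_ckpt));
    MPI_Reduce(local_ckpt, parity, CKPT_WORDS, MPI_UNSIGNED,
               MPI_BXOR, PARITY_RANK, comm);
}

/* Rebuild the checkpoint of one lost rank. 'survivors' must contain the
 * parity rank, the surviving compute ranks, and one replacement process.
 * role: 0 = replacement (contributes zeros), 1 = parity rank (cancels its
 * own contribution), 2 = ordinary survivor. After the reduction, 'out'
 * holds the lost rank's checkpoint on every member of 'survivors'.          */
void rebuild_lost_ckpt(unsigned int *out, int role, MPI_Comm survivors)
{
    static unsigned int contrib[CKPT_WORDS];
    if (role == 0)
        memset(contrib, 0, sizeof(contrib));
    else if (role == 1)
        for (size_t i = 0; i < CKPT_WORDS; i++)
            contrib[i] = parity[i] ^ local_ckpt[i];
    else
        memcpy(contrib, local_ckpt, sizeof(contrib));

    MPI_Allreduce(contrib, out, CKPT_WORDS, MPI_UNSIGNED, MPI_BXOR, survivors);
    /* If the parity rank itself fails, parity can be recomputed from the
     * survivors' checkpoints with the same XOR reduction.                    */
}
```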

6 Geist_FT_SC07 Harness P2P control research: a 24/7 system can't ignore faults.
- The file system can't let data be corrupted by faults; I/O nodes must recover and cover failures.
- The heterogeneous OS must be able to tolerate failures of any of its node types and instances. For example, a failed service node shouldn't take out a bunch of compute nodes.
- The schedulers and other system components must be aware of dynamically changing system configurations, so that tasks get assigned around failed components.
- Goals: fast recovery from a fault, parallel recovery from multiple node failures, and support for simultaneous updates.

7 Geist_FT_SC07 6 options for the system to handle failures:
1. Restart, from checkpoint or from the beginning.
2. Notify the application and let it handle the problem.
3. Migrate the task to other hardware before failure.
4. Reassign work to spare processor(s).
5. Replicate tasks across the machine.
6. Ignore the fault altogether.
Need a mechanism for each application (or component) to specify to the system what to do if a fault occurs; a hypothetical registration sketch follows below.
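The slide asks for a policy mechanism but defines no API, so the sketch below is entirely hypothetical: ft_policy_t, ft_register_policy, and the callback type are invented names used only to illustrate what a per-application fault policy could look like.

```c
/* Hypothetical sketch only: every name below is invented for illustration. */
#include <stdio.h>

typedef enum {
    FT_POLICY_RESTART,    /* 1. restart from checkpoint or from the beginning */
    FT_POLICY_NOTIFY,     /* 2. notify the application, let it handle it      */
    FT_POLICY_MIGRATE,    /* 3. migrate the task before the failure           */
    FT_POLICY_SPARE,      /* 4. reassign work to spare processor(s)           */
    FT_POLICY_REPLICATE,  /* 5. replicate tasks across the machine            */
    FT_POLICY_IGNORE      /* 6. ignore the fault altogether                   */
} ft_policy_t;

/* Callback the runtime would invoke when FT_POLICY_NOTIFY is selected.      */
typedef void (*ft_fault_cb)(int failed_node, void *user_data);

/* Stub registration call: a real runtime would record the policy and act on
 * it when a fault is detected; here we only echo the request.               */
static int ft_register_policy(const char *component, ft_policy_t policy,
                              ft_fault_cb cb, void *user_data)
{
    (void)cb; (void)user_data;
    printf("registered policy %d for component %s\n", policy, component);
    return 0;
}

static void on_fault(int failed_node, void *user_data)
{
    (void)user_data;
    printf("application notified: node %d failed, redistributing work\n", failed_node);
}

int main(void)
{
    /* A solver asks to be notified; a component that can tolerate holes in
     * its data might instead register FT_POLICY_IGNORE.                     */
    ft_register_policy("solver", FT_POLICY_NOTIFY, on_fault, NULL);
    on_fault(17, NULL);   /* simulate the runtime delivering a notification  */
    return 0;
}
```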

8 Geist_FT_SC07 5 recovery modes for MPI applications. The Harness project's FT-MPI explored five modes of recovery; these modes affect the size (extent) and ordering of the communicators:
1. ABORT: just abort, as vendor implementations do.
2. BLANK: leave holes (but make sure collectives do the right thing afterward).
3. SHRINK: reorder processes to make a contiguous communicator (some ranks change).
4. REBUILD: respawn lost processes and add them to MPI_COMM_WORLD.
5. REBUILD_ALL: same as REBUILD except that it rebuilds all communicators and groups and resets all key values, etc.
It may be time to consider an MPI-3 standard that allows applications to recover from faults; a sketch of the application-side pattern this implies is shown below.
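A minimal sketch of the application-side pattern such recovery implies, using only standard MPI calls (MPI_ERRORS_RETURN and checked return codes); the FT-MPI mode selection itself (BLANK/SHRINK/REBUILD) is configured at job launch and is not shown here.

```c
/* The application asks MPI to return error codes instead of aborting, then
 * checks return codes so it gets a chance to react to a failed peer.        */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank, value = 42;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int rc = MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        /* Under FT-MPI (or a fault-aware MPI-3 standard), the application
         * would now shrink or rebuild the communicator and resume from its
         * last in-memory checkpoint; plain MPI can only report the error.   */
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "rank %d: broadcast failed: %s\n", rank, msg);
    }

    MPI_Finalize();
    return 0;
}
```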

9 Geist_FT_SC07 4 ways to fail anyway:
1. The fault may not be detected.
2. Recovery introduces perturbations.
3. The result may depend on which nodes fail.
4. The result looks reasonable, but it is actually wrong.
Validation of an answer on such large systems is a growing problem: simulations are more complex, and solutions are being sought in regions never before explored. We can't afford to run every job three (or more) times; yearly allocations are like $5M–$10M grants.

10 Geist_FT_SC07 3 steps to fault tolerance:
1. Detection that something has gone wrong.
   - System: detection in hardware.
   - Framework: detection by the runtime environment.
   - Library: detection in a math or communication library.
2. Notification of the application, runtime, or system components.
   - Interrupt: signal sent to the job or a system component.
   - Error code returned by an application routine.
   - Subscription notification.
3. Recovery of the application from the fault.
   - By the system.
   - By the application.
   - Neither: natural fault tolerance.
A heartbeat-style sketch of detection and notification is shown below.
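A hypothetical sketch, not from the slides, of steps 1 and 2: a monitor thread detects a missed heartbeat and notifies the application through a registered callback; the timeout value and all function names are assumptions, and step 3 (recovery) is left to the application.

```c
/* Heartbeat-based detection (step 1) and callback notification (step 2).   */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define HEARTBEAT_TIMEOUT 3   /* seconds without a heartbeat => assume a fault (assumed) */

static volatile time_t last_heartbeat;
static void (*notify_cb)(const char *what);   /* step 2: notification hook  */

/* Called periodically by the watched task to prove it is still alive.      */
void heartbeat(void) { last_heartbeat = time(NULL); }

/* Step 1: a monitor thread watches for a missed heartbeat.                 */
static void *monitor(void *arg)
{
    (void)arg;
    for (;;) {
        sleep(1);
        if (time(NULL) - last_heartbeat > HEARTBEAT_TIMEOUT && notify_cb) {
            notify_cb("missed heartbeat");    /* step 2: notify             */
            last_heartbeat = time(NULL);      /* avoid repeated notices     */
        }
    }
    return NULL;
}

void start_fault_monitor(void (*cb)(const char *))
{
    pthread_t tid;
    notify_cb = cb;
    last_heartbeat = time(NULL);
    pthread_create(&tid, NULL, monitor, NULL);
}

/* Demo: the "application" stops heartbeating and gets notified; recovery
 * (step 3) would start in on_fault and is left to the application.         */
static void on_fault(const char *what) { printf("notified: %s\n", what); }

int main(void)
{
    start_fault_monitor(on_fault);
    for (int i = 0; i < 3; i++) { heartbeat(); sleep(1); }  /* healthy      */
    sleep(HEARTBEAT_TIMEOUT + 2);                           /* go silent    */
    return 0;
}
```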

11 Geist_FT_SC07 2 reasons the problem is only going to get worse. The drive for large-scale simulations in biology, nanotechnology, medicine, chemistry, materials, etc. means applications will:
1. Require much larger problems (space): they easily consume the 2 GB per core in ORNL LCF systems.
2. Require much longer to run (time): science teams in climate, combustion, and fusion want to run for a dedicated couple of months.
From a fault-tolerance perspective, space means that the job state to be recovered is huge, and time means that many faults will occur during a single run.

12 Geist_FT_SC07 1 holistic solution: a Fault Tolerance Backplane (CIFTS project, under way at ANL, ORNL, LBL, UTK, IU, OSU). We need coordinated fault awareness, prediction, and recovery across the entire HPC system, from the application to the hardware (applications, middleware, operating system, hardware). Prediction and prevention are critical because the best fault is the one that never happens.
Backplane services:
- Detection: monitor, logger, configuration.
- Notification: event manager, prediction and prevention.
- Recovery: autonomic actions, recovery services.
A hypothetical publish/subscribe sketch of the backplane flow follows below.
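The CIFTS backplane has its own API, which is not reproduced on the slide; the sketch below therefore uses invented names (ftb_event, ftb_publish, ftb_subscribe) purely to illustrate how detection, notification, and recovery components could be wired together through a publish/subscribe event channel.

```c
/* Hypothetical sketch only: a toy event backplane connecting a detector
 * (publisher) to a recovery component (subscriber). Not the real FTB API.  */
#include <stdio.h>

typedef struct {
    char source[32];    /* e.g. "hardware", "os", "middleware", "app"       */
    char name[64];      /* e.g. "node_failure_predicted"                    */
    int  severity;
} ftb_event;

typedef void (*ftb_handler)(const ftb_event *ev);

static ftb_handler handlers[16];
static int n_handlers;

/* A component interested in faults registers a handler (notification).     */
void ftb_subscribe(ftb_handler h) { handlers[n_handlers++] = h; }

/* A component that detects or predicts a fault publishes an event.         */
void ftb_publish(const ftb_event *ev)
{
    for (int i = 0; i < n_handlers; i++)
        handlers[i](ev);
}

/* Example subscriber: a scheduler migrating work off a failing node.       */
static void scheduler_on_fault(const ftb_event *ev)
{
    printf("scheduler: reacting to %s from %s (severity %d)\n",
           ev->name, ev->source, ev->severity);
}

int main(void)
{
    ftb_subscribe(scheduler_on_fault);
    ftb_event ev = { "hardware", "node_failure_predicted", 2 };
    ftb_publish(&ev);           /* detection -> notification -> recovery    */
    return 0;
}
```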

13 Geist_FT_SC07 Contact: Al Geist, Computer Science Research Group, Computer Science and Mathematics Division, (865)