Presentation transcript:

Slide 1: Application Resilience: Making Progress in Spite of Failure
Nathan A. DeBardeleben and John T. Daly, High Performance Computing Division, Los Alamos National Laboratory
William M. Jones, Electrical and Computer Engineering Department, United States Naval Academy
LA-UR
Resilience 2008: Workshop on Resiliency in High Performance Computing

Slide 2: Applications WILL Fail
- In spite of improved fault tolerance, failures will inevitably occur
- Hardware failures
- Application and system software bugs
- We are moving to petaflop-scale supercomputers
- More software layers mean more points of failure
- Extreme temperature, extreme power, extreme scale
- The more computing power a machine has, the more money is wasted when it is not utilized as well as possible

Slide 3: Should We Even Try to Avoid Failure?
- How can failure be avoided?
- Dynamic process creation to recover from node failures
- Fault-tolerant MPI
- Periodic checkpoints (but how often?)
- System support to advise the application of imminent failure
- Spare processors held in reserve for use after a failure
- All of these are costly and complex.
- Instead, let us ask a simpler question: is my application performing useful work (making progress)?

Slide 4: Is My Application Making Progress?
- How do we ensure progress is made?
- Application monitoring frameworks
- Intelligent application checkpointing
- Analysis of checkpoint overhead
- So, what's the main problem?

Slide 5: Failures May Go Unnoticed
[Figure: timeline showing an application that silently stops making progress; the interval between the failure and its detection is wasted time]

Slide 6: There Are Many Ways to Monitor Application Progress
- It is surprisingly hard to determine whether an application has stopped making progress!
- Maybe it is just waiting on the network or disk
- Maybe it is computing, or maybe it is just spinning in an infinite loop
- Maybe a node is not responding, or maybe another task is simply switched in
- Let's take a look at a layered approach to monitoring progress

Slide 7: Node-Level System Monitoring
- Daemons
- Heartbeat mechanisms, sometimes coupled with useful performance data
- Are we willing to pay for daemon processing time? System "noise" is already considered too high
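No code appears in the slides; purely as an illustration of the heartbeat idea, here is a minimal Python sketch of how a node-level daemon and its monitor might cooperate. The file path, beat interval, and missed-beat limit are all hypothetical.

```python
import os
import time

HEARTBEAT_FILE = "/var/run/node_heartbeat"  # hypothetical path
BEAT_INTERVAL = 10       # seconds between beats (assumed)
MISSED_BEATS_LIMIT = 3   # beats missed before declaring the node suspect

def emit_heartbeat():
    """Daemon side: touch the heartbeat file to signal liveness."""
    with open(HEARTBEAT_FILE, "a"):
        pass                            # create the file if missing
    os.utime(HEARTBEAT_FILE, None)      # update its mtime to 'now'

def node_is_suspect():
    """Monitor side: a node is suspect if its heartbeat is stale."""
    try:
        age = time.time() - os.path.getmtime(HEARTBEAT_FILE)
    except FileNotFoundError:
        return True  # no heartbeat ever emitted
    return age > BEAT_INTERVAL * MISSED_BEATS_LIMIT
```

The slide's cost concern shows up directly here: emit_heartbeat() runs on the compute node and steals cycles from the application, which is exactly the "noise" tradeoff being questioned.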

Slide 8: Subsystem-Level System Monitoring
- Network heartbeat (e.g., InfiniBand)
- Fault-tolerant MPI
- Parallel file system fault tolerance: failover nodes, redundancy
- Kernel monitoring of power and heat: degrade performance but try to recover in some cases
- Helps pinpoint failures to specific subsystems

Slide 9: Application-Level System Monitoring
- Who better to know whether an application is making progress than the application itself?
- Source/binary instrumentation to emit heartbeats
- Kernel modifications to look for system call usage: does the application appear to be in a wait loop?
- Watch application output: is it being produced at a regular interval? (See the sketch below.)
- How does one determine these intervals?
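The "watch application output" idea above lends itself to a small sketch. Assuming, hypothetically, a single output file and a known expected production interval:

```python
import os
import time

def wait_for_stall(path, expected_interval, poll=30):
    """Block until the output file stops growing for longer than
    expected_interval seconds, then return (progress has stalled).

    path: output file to watch (assumed; real apps may write many files)
    expected_interval: seconds between expected output updates
    poll: how often to check, in seconds
    """
    last_size = -1
    last_change = time.time()
    while True:
        size = os.path.getsize(path) if os.path.exists(path) else 0
        if size != last_size:
            last_size = size
            last_change = time.time()
        elif time.time() - last_change > expected_interval:
            return  # no new output for too long: likely stalled
        time.sleep(poll)

# Usage (hypothetical): alert if simulation.out is quiet for 10 minutes
# wait_for_stall("simulation.out", expected_interval=600)
# print("application appears to have stopped making progress")
```

The slide's closing question is the hard part this sketch dodges: expected_interval has to come from somewhere, and choosing it wrong yields either false alarms or late detection.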

Slide 10: Suppose you could detect that an error occurred, migrate the job, and restart it from the last checkpoint. How quickly would you need to determine that an interrupt occurred?

Slide 11: Our Assumptions
- A coupled checkpoint/restart application
- Some tradeoff exists between checkpoint frequency and how far we have to back up after an interrupt
- R = f(detection latency + restart overhead)
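The slides leave R as an unspecified function. Purely as a hedged illustration, one common way to flesh it out is to account the expected loss per interrupt as rework back to the last checkpoint (about half a checkpoint interval, if failures land uniformly within an interval) plus the detection latency and restart overhead. All parameter values below are hypothetical.

```python
def expected_loss_per_interrupt(tau, detect_latency, restart_overhead):
    """Expected wall-clock time lost each time a job is interrupted.

    tau: checkpoint interval (s); on average a failure lands mid-
         interval, so ~tau/2 of work is redone (uniform assumption)
    detect_latency: time to notice the job stopped progressing
    restart_overhead: time to migrate and restart from the checkpoint
    """
    return tau / 2.0 + detect_latency + restart_overhead

# Example: 1 h checkpoints, 15 min detection, 5 min restart
# -> 30 + 15 + 5 = 50 minutes lost per interrupt
print(expected_loss_per_interrupt(3600, 900, 300) / 60, "minutes")
```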

Slide 12: Analytical Model

Slides 13-14: [Equations and plots for the analytical model; not captured in this transcript]
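The model itself did not survive transcription. For orientation only: co-author John Daly's published checkpoint/restart analysis gives the well-known first-order estimate of the optimum checkpoint interval, tau_opt = sqrt(2*delta*M) - delta, valid when the checkpoint cost delta is much smaller than the mean time between failures M. The slides may have used a different or higher-order form; this sketch implements only the first-order estimate.

```python
import math

def daly_opt_interval(delta, M):
    """First-order optimum compute time between checkpoints (Daly).

    delta: time to write one checkpoint (s)
    M: mean time between failures for the job's node set (s)
    Valid when delta << M; Daly's paper gives higher-order terms.
    """
    return math.sqrt(2.0 * delta * M) - delta

# Example (hypothetical values): 5-minute checkpoint cost, 24 h MTBF
# -> sqrt(2 * 300 * 86400) - 300 = 6900 s, roughly 1.9 hours
tau = daly_opt_interval(300, 24 * 3600)
print(tau / 3600, "hours between checkpoints")
```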

Slide 15: Compare Theory to Simulation
- How closely does real supercomputer usage match the theory?
- We need a simulator: BeoSim
- We need real data: Pink at Los Alamos

Slide 16: Workload Distribution
[Figure: workload distribution from an event-driven simulation of 4,000,000 jobs using BeoSim on a 1,926-node cluster model]

Slide 17: BeoSim: A Computational Grid Simulator
- Java front-end, C back-end
- Discrete-event simulator; single-threaded, with parameter studies run in parallel
- Used for parallel job scheduling research on single and multiple clusters, and for checkpointing studies
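BeoSim's internals are not shown in the slides. To make the method concrete, here is a heavily simplified, hypothetical discrete-event loop of the kind such a checkpointing study might use: a job accumulates work in checkpoint-sized segments, exponentially distributed failures roll it back to its last checkpoint, and every interrupt is charged the detection latency (the CPdelta of the later slides) plus a restart overhead.

```python
import random

def simulate_job(work, tau, delta, mtbf, cpdelta, restart):
    """Wall-clock time for one job under checkpoint/restart.

    work: total compute time the job needs (s)
    tau: checkpoint interval (s); delta: checkpoint write cost (s)
    mtbf: mean time between failures (exponential, assumed independent)
    cpdelta: time to detect that an interrupt occurred (s)
    restart: restart overhead after an interrupt (s)
    """
    done, clock = 0.0, 0.0
    next_fail = random.expovariate(1.0 / mtbf)
    while done < work:
        # One segment: up to tau of compute plus a checkpoint write
        # (we checkpoint after every segment, even the last: a simplification)
        seg = min(tau, work - done) + delta
        if clock + seg <= next_fail:
            clock += seg
            done += min(tau, work - done)   # segment committed
        else:
            # Failure mid-segment: lose the segment, pay detection + restart
            clock = next_fail + cpdelta + restart
            next_fail = clock + random.expovariate(1.0 / mtbf)
    return clock

# Example with hypothetical parameters: 8 h job, 30 min checkpoints,
# 5 min checkpoint cost, 12 h MTBF, 15 min detection, 5 min restart
random.seed(1)
runs = [simulate_job(8 * 3600, 1800, 300, 12 * 3600, 900, 300)
        for _ in range(1000)]
print(sum(runs) / len(runs) / 3600, "hours average wall-clock time")
```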

Slide 18: BeoSim Framework
[Figure: BeoSim framework architecture diagram; not captured in this transcript]

Slide 19: Impact of Increasing Failure Rates
[Figure: simulation results] The impact may seem negligible, but what matters is the effect of multiple interrupts on throughput, NOT the total number of failures.

Slide 20: Impact on Throughput for ALL Jobs
[Figure: throughput versus CPdelta, the time to determine that an interrupt occurred, in minutes; shows a significant reduction in queueing delays]

Slide 21: Impact on Execution Time
[Figure: execution time versus CPdelta (minutes); the impact is marginal (1.8%) in one scenario and significant (13.5%) in the other]

Slide 22: Keep in Mind That...
[Figure: results versus CPdelta (minutes); 6.5% of all jobs interrupted in one scenario versus 1.5% in the other]
So while the averages are relatively close for both scenarios, an increasing number of jobs are affected as the MTBF decreases, and therefore more resources are tied up in applications that are not making progress.
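A rough sanity check on why more jobs are affected as the MTBF drops: with independent, exponentially distributed failures, a job of runtime T is interrupted at least once with probability 1 - exp(-T/M). The runtime and MTBF values below are illustrative assumptions, not the workload's actual parameters.

```python
import math

def frac_interrupted(runtime, mtbf):
    """Probability a job of the given runtime sees >= 1 failure,
    assuming exponentially distributed, independent failures."""
    return 1.0 - math.exp(-runtime / mtbf)

# Hypothetical case: an 8 h job under two machine MTBFs (in hours)
for mtbf_h in (500, 120):
    pct = 100 * frac_interrupted(8, mtbf_h)
    print(f"{mtbf_h} h MTBF -> {pct:.1f}% of jobs interrupted")
```

With these illustrative numbers the fractions come out near 1.6% and 6.4%, the same order as the 1.5% and 6.5% annotated on the slide.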

Slide 23: Conclusions
- The simulation matches the theoretical approximation relatively closely
- The theory is simple but was applied to a complex system with features the theory does not include, and it still matches closely. Could it extend to even more complex systems?
- Application monitoring is paramount
- Immediate detection is not necessarily a hard requirement (for this system)
- This helps decision makers: with $100 million to spend, do I need to pay 5x the cost for a better detection system? What is my expected workload? Put it into the simulation!
- Pink is a general-purpose cluster with lots of different jobs of different runtimes and widths. We use averages, which tend to make the results "murky".

Slide 24: Future Work
- Factor in the time to fix a failure; hardware takes time to repair
- Relax the assumption of completely independent failures
- Look at different "classes" of jobs, or at a system less diverse than Pink
- How to estimate the MTBF, and how it affects the optimal checkpointing interval
- More work on determining the parameter M for systems where a job does not run across the entire machine

Slide 25: Thank You! Questions?
Nathan A. DeBardeleben