System-Directed Resilience for Exascale Platforms, LDRD Proposal 09-0016. Ron Oldfield (PI, 1423), Ron Brightwell (1423), Jim Laros (1422), Kevin Pedretti (1423), Rolf Riesen (1423)

Presentation transcript:

System-Directed Resilience for Exascale Platforms LDRD Proposal Ron Oldfield (PI, 1423), Ron Brightwell (1423), Jim Laros (1422), Kevin Pedretti (1423), Rolf Riesen (1423)

System-Directed Resilience for Exascale Platforms (09-0016) Ron Oldfield (1423), Neil Pundit (1423), FY09-11, Total Costs $1500
Problem: Current apps cannot survive a node failure.
Proposed Solution: Application-transparent resilience to node failures.
Approach: Design/develop system software to support:
–Application quiescence
–Efficient state management
–Automatic fault recovery
Significance of Results:
–Represents a fundamental change in the way HPC systems support resilience
–Significant impact on performance: less defensive I/O overhead for checkpoints
–Higher levels of reliability
–Improved productivity: developers worry less about resilience, more about core science
R&D Goals & Milestones:
–Investigate and develop new methods for quiescence that do not hinder other apps
–Identify critical application state and develop efficient methods to manage it
–Identify system software requirements for dynamic node allocation, network/OS virtualization, and MPI node recovery
Relationship to Other Work: Scalability and efficient resource utilization, particularly memory and storage, are key issues for this effort. Our team has R&D experience in:
–Scalable system software (LWK, Portals, LWFS)
–Smart memory-management techniques (SMARTMAP)
–RAS systems
All of these efforts developed “lightweight” approaches that are both resource-efficient and scalable.

Resilience Challenges for Exascale
Current application characteristics:
–Require large fractions of the system
–Long running
–Resource-constrained compute nodes
–Cannot survive component failure
Current options for fault tolerance:
–Application-directed checkpoints
–System-directed checkpoints
–System-directed incremental checkpoints
–Checkpoint in memory
–Others: virtualization, redundant computation, …
We propose to develop system software resilient to node failure:
–Support for application quiescence
–Efficient (diskless) state management
–Fast methods for fault recovery
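The “checkpoint in memory” option above can be sketched as XOR-parity diskless checkpointing: each node keeps its checkpoint in peer memory as part of a parity group, so any single failed node’s state can be rebuilt from the survivors without touching disk. The names and the single-process simulation below are illustrative only; a real system would exchange these blocks over the interconnect (e.g., via MPI or Portals).

```python
# Illustrative sketch of diskless (in-memory) checkpointing via XOR parity.
# A single process stands in for a parity group of compute nodes.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(checkpoints: list) -> bytes:
    """Parity block held on a spare node: XOR of all node checkpoints."""
    parity = bytes(len(checkpoints[0]))
    for ckpt in checkpoints:
        parity = xor_blocks(parity, ckpt)
    return parity

def recover(checkpoints: list, parity: bytes, failed: int) -> bytes:
    """Rebuild the failed node's checkpoint from the survivors plus parity."""
    rebuilt = parity
    for rank, ckpt in enumerate(checkpoints):
        if rank != failed:
            rebuilt = xor_blocks(rebuilt, ckpt)
    return rebuilt

# Example: four "nodes", node 2 fails.
ckpts = [bytes([rank] * 8) for rank in range(4)]
parity = make_parity(ckpts)
assert recover(ckpts, parity, failed=2) == ckpts[2]
```

The attraction for exascale is that recovery cost scales with the size of one node’s state, not with the aggregate bandwidth of a parallel file system.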

Application Quiescence
Goal: Develop methods to suspend application activity without hindering the progress of other applications.
Requires:
–Methods for accurate and efficient fault detection
–Mechanisms and interfaces for conveying node state to shared services (e.g., a functional RAS system)
Approach: Integrated system software for cooperation among shared services and applications:
–Network layer: deal with messages in transit
–File system: isolate and suspend in-progress I/O operations
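The network-layer part of quiescence can be sketched as a two-step drain: refuse new traffic, then complete messages already in transit before declaring the node quiescent. The class and method names below are hypothetical stand-ins for whatever the network stack would actually expose, not an interface from the proposal.

```python
# Illustrative sketch of network-layer quiescence: stop admitting new
# messages, then drain in-flight ones before reporting a quiescent state.

class NetworkLayer:
    def __init__(self):
        self.quiescing = False
        self.in_flight = []          # messages posted but not yet delivered

    def post_send(self, msg):
        if self.quiescing:
            raise RuntimeError("quiescing: new traffic refused")
        self.in_flight.append(msg)

    def deliver_one(self):
        return self.in_flight.pop(0) if self.in_flight else None

    def quiesce(self):
        """Refuse new sends, then drain messages already in transit."""
        self.quiescing = True
        delivered = []
        while self.in_flight:
            delivered.append(self.deliver_one())
        return delivered             # empty in_flight => node is quiescent

net = NetworkLayer()
net.post_send("halo-exchange")
net.post_send("reduction")
drained = net.quiesce()
assert drained == ["halo-exchange", "reduction"] and not net.in_flight
```

The same pattern applies at the file-system layer, where “drain” means letting in-progress I/O operations complete or be isolated before state is extracted.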

State Management
Goal: Efficient methods for extracting and managing application state.
Approach:
Identify critical state:
–Characterize memory usage
–Investigate resource-efficient methods for logging modified memory
–Application guidance to identify unnecessary data (e.g., ghost cells, caches)
–System guidance for when to extract state
Explore diskless methods to manage state.
Explore state compression to reduce resource requirements.
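“Logging modified memory” can be illustrated with page-level change detection: hash fixed-size pages and re-save only those whose hash differs from the previous checkpoint. This is a deliberately simplified user-space sketch (hypothetical names, tiny 4-byte pages); a production system would rely on kernel dirty-page tracking rather than hashing.

```python
# Illustrative sketch of incremental state capture: save only the memory
# pages that changed since the last checkpoint.
import hashlib

PAGE = 4  # toy page size for the example

def page_hashes(mem: bytes) -> list:
    return [hashlib.sha256(mem[i:i + PAGE]).digest()
            for i in range(0, len(mem), PAGE)]

def incremental_checkpoint(mem: bytes, prev_hashes):
    """Return (dirty pages to store, new hash table)."""
    new_hashes = page_hashes(mem)
    dirty = {i: mem[i * PAGE:(i + 1) * PAGE]
             for i, h in enumerate(new_hashes)
             if prev_hashes is None or h != prev_hashes[i]}
    return dirty, new_hashes

mem = bytearray(b"AAAABBBBCCCC")
full, hashes = incremental_checkpoint(bytes(mem), None)   # first ckpt: all pages
mem[4:8] = b"bbbb"                                        # app dirties one page
delta, hashes = incremental_checkpoint(bytes(mem), hashes)
assert len(full) == 3 and list(delta) == [1]              # only page 1 re-saved
```

Application guidance fits naturally here: pages known to hold recomputable data (ghost cells, caches) could simply be excluded from the hash table, shrinking every checkpoint.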

Fault Recovery
Goal: Dynamically recover a failed node without restarting the whole application.
Approach:
–Explore changes to system software to support dynamic node allocation (to swap in a replacement for a failed node)
–Develop network virtualization to abstract the physical node ID from software
–Develop efficient methods for state recovery: investigate roll-back and roll-forward techniques
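The roll-back/roll-forward distinction can be sketched in a few lines: roll-back restarts the replacement node from the last checkpoint and discards work done since, while roll-forward additionally replays a log of deterministic events to recover that work. The state and event log below are toy stand-ins (not from the proposal); a real system would restore memory images and replay logged messages.

```python
# Illustrative contrast of roll-back vs. roll-forward recovery for a
# replacement node taking over from a failed one.

def roll_back(checkpoint: dict) -> dict:
    """Restart from the last checkpoint; work since then is lost."""
    return dict(checkpoint)

def roll_forward(checkpoint: dict, event_log: list) -> dict:
    """Restore the checkpoint, then replay logged deterministic events."""
    state = dict(checkpoint)
    for key, value in event_log:     # events recorded after the checkpoint
        state[key] = value
    return state

ckpt = {"iteration": 100, "residual": 0.5}
log = [("iteration", 101), ("residual", 0.4)]   # progress made after ckpt

assert roll_back(ckpt)["iteration"] == 100           # post-checkpoint work lost
assert roll_forward(ckpt, log)["iteration"] == 101   # post-checkpoint work recovered
```

Roll-forward trades logging overhead during normal operation for less recomputation at recovery time, which is why the proposal treats the two as complementary techniques to investigate.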

Summary
Recovering from independent node failures is a critical issue for exascale systems.
We address that problem through modifications to system software:
–Support for application quiescence
–Efficient (diskless) state management
–Fast methods for fault recovery
Our approach represents a fundamental change in how systems support resilience.

Reviewer Questions
Programmatic:
–Are there firm commitments from the team if the LDRD goes forward?
–Why is funding flat for FY10 and FY11?
Technical:
–Is the assertion that “checkpoint overhead will exceed 50% beyond 100K nodes” too modest?
–Why use the term “components” instead of cores or processors?
Technical/Programmatic:
–Can the project really address all of the proposed work?
–Have we identified all the technical risks among the technical topics?