The Global View Resilience (GVR) Approach: GVR (Global View for Resilience) exploits a global-view data model, which enables irregular, adaptive algorithms.


GVR (Global View for Resilience) exploits a global-view data model, which enables irregular, adaptive algorithms and accommodates exascale variability. It provides an abstraction of data representation, global-view distributed arrays spanning processes, that offers resilience and seamless integration of the various components of the memory/storage hierarchy.

Non-uniform, Proportional Resilience
Applications can specify which data are more important in order to manage reliability overheads.

Goals
- Understand and create an application-system partnership for flexible resilience
- Explore efficient implementations of resilient, multi-version data
- Create an empirical understanding of GVR's effectiveness and performance requirements

Impact
- A resilient, globally visible data store
- An incremental, portable approach to resilience for large-scale applications
- Flexible, application-managed cost and coverage for resilience

Research Challenges
- Understand application needs for flexible, portable resilience and performance
- Design an API suitable for use by application/library developers and tools
- Achieve an efficient GVR runtime implementation for multi-version memory and flexible resilience
- Understand architecture support and its benefits
- Explore new opportunities created by GVR abstractions and their implementation technologies

(Supported by ASCR X-Stack Awards DE-SC /57K.)

[Figure: system stack — Applications, Programming Abstractions, Runtime, OS, Architecture — with open reliability and efficient implementation spanning the layers]

Multi-version Memory
- Parallel computation proceeds from phase to phase; computation phases form "versions" of the data
- Each phase creates a new logical version
- A program can obtain earlier versions and recover from them if needed
- On an uncorrected error: roll back and recompute, with recovery based on application semantics

Progress and Accomplishments
- Use cases and an initial design of the GVR API
- Design of the GVR runtime software architecture
- An initial research prototype of GVR, with multi-version arrays and application-managed error handling
- Functionality and performance explorations of user/kernel/hardware-based dirty-bit tracking within the Local Reliable Data Store
- GVR-enabled versions of two Mantevo mini-apps
- Modeling of a multi-version checkpoint scheme, showing that multi-version checkpoints are critical for latent ("silent") errors

Please come and see our poster "When is multi-version checkpointing needed?" in Poster Session 2 for details.

Future Efforts
- A fully capable, robust implementation
- Efficient implementation of a redundant, distributed global-view data structure
- Efficient multi-version snapshots (e.g., compression)
- Experiments with co-design applications
- Collaboration with the OS/runtime community on cross-layer error handling

Zachary Rubenstein, Hajime Fujita, Guoming Lu, Aiman Fang, Ziming Zheng, Andrew A. Chien, University of Chicago; Pavan Balaji, Kamil Iskra, Pete Beckman, James Dinan, Jeff Hammond, Argonne National Laboratory; Robert Schreiber, Hewlett-Packard Labs

Example: pseudo-code from a molecular dynamics application

```c
// Array creation and initialization
GDS_alloc(GDS_PRIORITY_HIGH, &gds);
GDS_register_global_error_handler(gds, errorhandler);
// full_check is a comprehensive and expensive
// check provided by the user
GDS_register_global_error_check(gds, full_check);

void mainloop() {
    do_computation_and_communication(atoms);
    // lightweight error checking
    if (atoms_out_of_box(atoms)) {
        // switch to error-handling mode
        GDS_raise_global_error(gds);
    }
    // preserve the important state
    GDS_put(atoms, gds);
    GDS_version_inc(gds);
}

GDS_status_t errorhandler(GDS_gds_t gds) {
    // find a good version in the history
    do {
        GDS_move_to_prev(gds);
    } while (GDS_check_global_error(gds) != OK);
    // restore the state of atoms and resume
    GDS_get(atoms, gds);
    GDS_resume(gds);
    return OK;
}
```

Cross-layer Partnership (Application, Runtime, OS, Architecture)
- Rich error checking and recovery, including application-managed mechanisms
- Efficient error-handling implementations at each layer
- Reliability needs flow down, while error signalling and recovery flow between the application, the runtime (e.g. MPI), the OS, and the hardware

[Figure: the traditional stack (scientist/programmer atop application, runtime, OS, hardware) versus the Global View Resilience stack, in which error detection, signalling, and recovery cross layers]

Background
- It is widely accepted that silicon scaling and low-voltage operation will produce rising error rates
- A new programming model and tools are needed to address resilience

Library Approach
- GVR is implemented as a library
- It can be used together with other libraries (e.g. MPI, Trilinos), allowing gradual migration of existing applications
- It can serve as a backend for other libraries and programming models (e.g. CnC, UPC, etc.)

Linear Solver Studies
- Inject errors of different severity at different points in the computation, for PCG and SOR
- Understand different methods for detecting the injected errors
- Understand the benefits of restoring an earlier version rather than ignoring the error