Fault Tolerance and Dynamic Process Control Working Group
Presented by Richard L Graham


Scope
- The focus of this group is to create additions and clarifications to the MPI standard so that an MPI application can run to completion in the presence of faults in its environment.
- MPI provides communication services and some process control services, so the FT aspects are aimed at restoring these services to a well-defined state, allowing applications to continue using them.
- MPI will enable FT algorithms and applications, not provide them.
- The closely related topic of dynamic communicators is also being considered.
- The goal is to change the standard so that an implementation can provide this support for applications that require it, without impacting those that do not want to use these services.

Activities
- Barely off the ground
- Conference calls every 2 weeks (except for weeks in which the Forum meets)
- Participants
  - HP
  - Indiana University
  - Intel
  - LLNL
  - Microsoft
  - ORNL
  - Sun
  - University of Houston
  - University of Wisconsin

Items being considered
- FT-MPI-like process fault tolerance
- Dynamic communicators
  - Size may change with time
  - Communicators may have sparse ranks
  - Communicator traits may be used to set communicator type
- Data piggybacking
  - OSU and LLNL are working up a prototype with a new API (collectives?)
  - Proposal to change data-type handling to make it simple and cheap to piggyback data on the application payload
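
The piggybacking item can be illustrated with a minimal sketch. It is not the OSU/LLNL prototype and not an MPI API: the header layout (an epoch number plus a payload length) and the function names are assumptions made only to show the idea of attaching a small amount of protocol data to an application message.

```python
import struct

# Hypothetical 8-byte header: a 4-byte epoch number and a 4-byte payload
# length, both unsigned and in network byte order. The layout is an
# assumption for illustration, not something the slides specify.
HEADER = struct.Struct("!II")


def pack_with_piggyback(epoch, payload):
    """Prepend the piggybacked epoch to the application payload."""
    return HEADER.pack(epoch, len(payload)) + payload


def unpack_with_piggyback(message):
    """Split a received message back into (epoch, payload)."""
    epoch, length = HEADER.unpack_from(message)
    payload = message[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated message")
    return epoch, payload


# The sender attaches epoch 7; the receiver recovers both parts.
message = pack_with_piggyback(7, b"application data")
epoch, payload = unpack_with_piggyback(message)
```

Copying the payload into a new buffer, as this sketch does, is the kind of cost the data-type handling proposal is meant to avoid.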

Items being considered (continued)
- API addition to bring network traffic to a well-defined state in support of Checkpoint/Restart
- Transactional messages (adding return codes)
- Defining fault/change handling mechanisms
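
As a rough illustration of what "a well-defined state" for network traffic could mean before a checkpoint, the toy model below drains every in-flight message so that sent and delivered counts match. The slides do not describe the mechanism; the `Channel` class, `quiesce`, and `is_quiescent` are hypothetical names, and nothing here is an MPI call.

```python
from collections import deque


class Channel:
    """Toy in-memory channel that counts messages sent and delivered."""

    def __init__(self):
        self.in_flight = deque()
        self.sent = 0
        self.delivered = 0

    def send(self, msg):
        self.in_flight.append(msg)
        self.sent += 1

    def deliver_one(self):
        msg = self.in_flight.popleft()
        self.delivered += 1
        return msg


def is_quiescent(channels):
    """True when no channel has messages still in flight."""
    return all(ch.sent == ch.delivered for ch in channels)


def quiesce(channels):
    """Deliver every in-flight message and return the drained messages."""
    drained = []
    for ch in channels:
        while ch.in_flight:
            drained.append(ch.deliver_one())
    return drained


a, b = Channel(), Channel()
a.send("x")
a.send("y")
b.send("z")
before = is_quiescent([a, b])   # messages are still in flight
drained = quiesce([a, b])
after = is_quiescent([a, b])    # safe point for a checkpoint in this model
```

In this model a checkpoint taken once `is_quiescent` holds does not need to record any in-flight messages, which is the property such an API addition would aim to provide.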