Fault Tolerance Working Group Update, presented by Rich Graham.

Presentation transcript:

Slide 2: Current Topics
  - Consistent error handling across the standard
      - Being brought for discussion at the full forum this time
  - Fault tolerance in master/slave type usage scenarios
      - Being brought for discussion at the full forum this time
  - Data piggybacking
      - Should move to the point-to-point working group
      - New proposed API
      - Proposal to accomplish this with changes to datatype support
  - Support for process recovery
      - Still in its infancy
  - Asynchronous dynamic process control
      - Proposal exists, but not yet discussed

Slide 3: Current Topics - Continued
  - Dynamic communicators
      - Should be ready for the Sept or Oct meeting
  - Transactional messages
      - Proposal exists, not yet discussed
  - API addition to bring network traffic to a well-defined state in support of Checkpoint/Restart