Causal Logging: Manetho
Rohit C Fernandes, 10/25/01


Manetho System Model
- Nondeterministic events:
  - Message receive
  - Internal event (kernel call)
  - Creation of a new process
- Output commit
- Stable storage + volatile memory

Manetho Properties
- Tolerates any number of simultaneous failures
- Low failure-free overhead
- Only failed processes roll back

Example Manetho Execution

Causal Logging: Intuition
- Piggyback the determinant of each nondeterministic event on outgoing messages
- Determinant? Piggyback antecedence graphs

Antecedence Graph
- Directed acyclic graph
- Nodes: state intervals
- Edges: happened-before (immediate)

Antecedence Graph

Receive Node
- Two incoming edges
- Fields:
  - Receiver ID
  - Sender ID
  - Index of the created state interval
  - Unique identifier of the message

Internal Event Node
- One incoming edge
- Fields:
  - Type of event
  - Replay information
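The two node types and their incoming edges can be sketched as a small data structure. This is a minimal illustration, not Manetho's actual representation: the class names, field names, and the `(process, index)` key used for state intervals are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

# Assumption: a state interval is identified by (process id, interval index).
Interval = Tuple[str, int]


@dataclass(frozen=True)
class ReceiveNode:
    receiver: str   # receiver ID
    sender: str     # sender ID
    index: int      # index of the state interval this receive creates
    msg_id: int     # unique identifier of the message


@dataclass(frozen=True)
class InternalEventNode:
    process: str
    index: int
    event_type: str   # type of event, e.g. a kernel call
    replay_info: str  # information needed to replay the event


@dataclass
class AntecedenceGraph:
    nodes: Dict[Interval, object] = field(default_factory=dict)
    edges: Set[Tuple[Interval, Interval]] = field(default_factory=set)

    def add_receive(self, node: ReceiveNode, sender_interval: Interval) -> None:
        key = (node.receiver, node.index)
        self.nodes[key] = node
        # Edge 1: from the receiver's previous state interval.
        self.edges.add(((node.receiver, node.index - 1), key))
        # Edge 2: from the sender's state interval that sent the message.
        self.edges.add((sender_interval, key))

    def add_internal(self, node: InternalEventNode) -> None:
        key = (node.process, node.index)
        self.nodes[key] = node
        # Single edge: from the process's previous state interval.
        self.edges.add(((node.process, node.index - 1), key))

    def in_degree(self, key: Interval) -> int:
        return sum(1 for (_, dst) in self.edges if dst == key)
```

In this sketch an edge may point from an interval that has no node of its own yet (for example the initial interval 0); a fuller model would add those nodes explicitly.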

Failure-Free Operation
- Each process maintains:
  - the AG of its current state interval
  - a log that contains the data and ID of each message sent
- Message send: piggyback the AG of the current state interval
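A rough sketch of failure-free operation follows. To keep it short, the AG is flattened to a set of receive determinants rather than a real graph, and the whole set is piggybacked on every send; the `Process` class, the message dictionary layout, and the determinant tuple are all assumptions made for illustration.

```python
class Process:
    """Hypothetical process that logs sends and piggybacks its AG."""

    def __init__(self, pid):
        self.pid = pid
        self.index = 0        # index of the current state interval
        self.ag = set()       # stand-in for the AG: a set of determinants
        self.send_log = {}    # message ID -> data, kept in volatile memory
        self.next_msg = 0

    def send(self, data):
        msg_id = (self.pid, self.next_msg)
        self.next_msg += 1
        # Log the data and ID of each message sent.
        self.send_log[msg_id] = data
        # Piggyback (a copy of) the AG of the current state interval.
        return {"id": msg_id, "sender": self.pid, "data": data, "ag": set(self.ag)}

    def receive(self, msg):
        # A receive is nondeterministic, so it starts a new state interval.
        self.index += 1
        # Merge the piggybacked AG, then record this receive's determinant.
        self.ag |= msg["ag"]
        self.ag.add((self.pid, self.index, msg["sender"], msg["id"]))
        return msg["data"]
```

After `q.receive(p.send("a"))` and `r.receive(q.send("b"))`, process `r` holds the determinant of `q`'s receive as well as its own, which is the causal-logging property the slide describes.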

Optimization
- Need not send the complete AG: incremental piggybacking
- $AG(\sigma_p^{i})$ is a proper subgraph of $AG(\sigma_p^{i+1})$
- Process q communicates to p the maximum j such that $\sigma_p^{j}$ is in q's AG
- p sends $AG(\sigma_p^{i}) - AG(\sigma_p^{j})$
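The incremental step amounts to a set difference. The sketch below assumes the sender keeps a snapshot of its AG for each of its state intervals (`ag_history`, a hypothetical name) and that a receiver who reports index j already holds everything in the AG of interval j.

```python
def incremental_piggyback(ag_history, i, j=None):
    """Return the part of the AG of interval i that the receiver still lacks.

    ag_history maps an interval index to the set of AG nodes known at that
    interval. j is the largest sender interval the receiver reported, or
    None if the receiver reported nothing.
    """
    if j is None:
        return set(ag_history[i])          # nothing known: send the full AG
    return set(ag_history[i]) - set(ag_history[j])


# Example: the AG grows by one node per interval.
history = {0: set(), 1: {"a"}, 2: {"a", "b"}, 3: {"a", "b", "c"}}
delta = incremental_piggyback(history, 3, 1)
```

Here `delta` holds only the nodes added after interval 1, and merging it into what the receiver already has reproduces the full AG of interval 3.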

Information on Stable Storage
- Checkpoints
- AG (asynchronously): need not piggyback the part of the AG that is already on disk
- Output commit: save the AG to disk

Incarnation Numbers
- Each process starts a new incarnation after recovery
- Integer stored in stable storage
- Tagged on outgoing messages
- Messages from old incarnations are discarded
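Discarding stale messages can be sketched as a simple filter over an incarnation vector. The `IncarnationFilter` class and the `"sender"`/`"inc"` message fields are assumptions for this example; the sketch also assumes that a message tagged with an incarnation newer than the one currently known is accepted rather than delayed.

```python
class IncarnationFilter:
    """Hypothetical receiver-side check of incarnation tags."""

    def __init__(self):
        self.incvec = {}   # process id -> latest known incarnation number

    def update(self, pid, incnum):
        # Incarnation numbers only grow; never move one backwards.
        self.incvec[pid] = max(incnum, self.incvec.get(pid, 0))

    def accept(self, msg):
        # Discard messages tagged with an older incarnation of the sender.
        return msg["inc"] >= self.incvec.get(msg["sender"], 0)


f = IncarnationFilter()
f.update("p", 2)   # p announced incarnation 2 after recovering
stale = f.accept({"sender": "p", "inc": 1})
fresh = f.accept({"sender": "p", "inc": 2})
```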

Recovery Protocol: Recover(p, c, INCNUM, S)
Step 1
  $INCNUM \leftarrow INCNUM + 1$; save INCNUM
  $INCVEC[p] \leftarrow INCNUM$
  $G \leftarrow AG(\sigma_p^{c})$ // stable storage

Recovery Protocol
Step 2
  For all $q \in S$, $q \neq p$:
    $(INQ, AGQ) \leftarrow$ remote call at q: GET_AG(p)
    $G \leftarrow G \cup AGQ$
    $INCVEC[q] \leftarrow INQ$
  For all $q \in S$, $q \neq p$:
    remote call at q: SEND_INC(p, INCVEC)

Recovery Protocol
Step 3
  $m \leftarrow \max j$ such that $\sigma_p^{j} \in G$
  Recover up to $\sigma_p^{m}$:
  - Do not send out application messages, but log them
  - For a receive, request the message from the sender's log
  - Replay internal events
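The three recovery steps can be put together in a short in-memory sketch. Everything beyond the names on the slides (`GET_AG`, `SEND_INC`, `INCNUM`, `INCVEC`, `G`) is an assumption: the `Peer` class replaces real remote calls, AG nodes are reduced to `(process, interval index)` pairs, and the replay itself is not modelled, only the computation of the target interval m.

```python
class Peer:
    """Hypothetical in-memory stand-in for a live process q."""

    def __init__(self, incnum, ag):
        self.incnum = incnum
        self.ag = set(ag)     # AG nodes as (process, interval index) pairs
        self.incvec = {}

    def get_ag(self, p):
        # Stand-in for GET_AG(p); this sketch returns q's whole AG.
        return self.incnum, set(self.ag)

    def send_inc(self, p, incvec):
        # Stand-in for SEND_INC(p, INCVEC): learn the new incarnations.
        self.incvec = dict(incvec)


def recover(p, checkpoint_ag, incnum, peers):
    # Step 1: start a new incarnation and load the checkpointed AG.
    incnum += 1               # a real system would save this to stable storage
    incvec = {p: incnum}
    g = set(checkpoint_ag)

    # Step 2: collect AGs and incarnation numbers from the other processes,
    # then tell each of them the resulting incarnation vector.
    for q, peer in peers.items():
        if q != p:
            inq, agq = peer.get_ag(p)
            g |= agq
            incvec[q] = inq
    for q, peer in peers.items():
        if q != p:
            peer.send_inc(p, incvec)

    # Step 3: the last interval of p that anyone knows about is the target.
    m = max((j for (proc, j) in g if proc == p), default=None)
    return incnum, incvec, m


peers = {
    "q": Peer(5, {("p", 4), ("q", 1)}),
    "r": Peer(1, {("p", 3)}),
}
new_inc, incvec, m = recover("p", {("p", 2)}, 1, peers)
```

In this example the checkpoint covers interval 2 of p, but q has seen interval 4, so p must replay up to interval 4 using the logged messages and determinants.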

Recovery Example

Available Antecedence Graphs

Application Characteristics

Performance Overhead

Coordinated vs. Uncoordinated