Presentation-2, Group-A1
Professor: Mohamed Khalil
Anita Kanuganti, Hemanth Rao


Fault Tolerance in Distributed Systems

Outline
--- Overview of Fault Tolerance
--- Importance in Distributed Systems
--- Types of Faults
--- Measurement of Faults
--- Failure Models
--- Redundancy and Forms of Redundancy
--- Software Fault Tolerance Techniques
--- Reliable Communication
--- Distributed Commit
--- Failure Recovery
--- On-going Research
--- References

Fault Tolerance
The ability of a system to respond gracefully to an unexpected hardware or software failure.
There are many levels of fault tolerance, the lowest being the ability to continue operation in the event of a power failure.

Importance in Distributed Systems
Computer systems are not very reliable
--- OS crashes frequently (Windows), buggy software, unreliable hardware, SW/HW incompatibilities
--- Growing popularity of the Internet/World Wide Web
--- Example: what if your TV (or car) broke down every day? Users don’t want to restart the TV or fix it by opening it up.
So we need to make our computer systems more reliable and dependable.

Types of Faults
Nature
--- Systematic
--- Random
Duration
--- Transient
--- Intermittent
--- Permanent
Extent
--- Global
--- Local

Measurement of Faults
--- Fault Removal Coverage
--- Fault Detection Coverage
--- Fault Tolerance Coverage

Failure Models
Type of Failure     Description
Crash Failure       Server halts but is working correctly until it halts
Omission Failure    Server fails to respond to incoming requests
Timing Failure      Server’s response lies outside the specified time interval
Response Failure    Server’s response is incorrect
Arbitrary Failure   Server may produce arbitrary responses at arbitrary times
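
As a rough illustration of the table, the sketch below classifies a single observed reply from a server. The names `Failure`, `classify`, `expected`, and `window_s` are hypothetical and not from the slides. Crash and arbitrary failures are left out on purpose: a single observation cannot distinguish a crashed server from one that merely omitted a reply, and arbitrary failures by definition follow no checkable pattern.

```python
from enum import Enum
from typing import Optional, Tuple


class Failure(Enum):
    NONE = "no failure observed"
    OMISSION = "omission failure"
    TIMING = "timing failure"
    RESPONSE = "response failure"


def classify(reply: Optional[int],
             elapsed_s: float,
             expected: int,
             window_s: Tuple[float, float]) -> Failure:
    """Classify one request/reply observation (illustrative only)."""
    earliest, latest = window_s
    if reply is None:
        # No reply at all: the server failed to respond to the request.
        return Failure.OMISSION
    if not (earliest <= elapsed_s <= latest):
        # A reply arrived, but outside the specified time interval.
        return Failure.TIMING
    if reply != expected:
        # A timely reply with the wrong value.
        return Failure.RESPONSE
    return Failure.NONE
```

In practice the expected value is rarely known in advance, which is one reason response failures are usually detected by comparing replicas rather than by a check like this.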

Redundancy and its Forms
Redundancy performs the same computation n times, so that if one instance fails another can take over.
Forms of Redundancy
--- Hardware Redundancy
--- Software Redundancy
--- Information Redundancy
--- Temporal (time) Redundancy
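
One common way to use redundant computations is to run every replica and take a majority vote over the results. The sketch below is a minimal illustration of that idea, not something taken from the slides; `run_redundant` and the replica functions are hypothetical names, and replicas are modeled as plain Python callables.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def run_redundant(replicas: Sequence[Callable[[int], Hashable]], x: int) -> Hashable:
    """Run the same computation on every replica and return the majority result."""
    results = []
    for replica in replicas:
        try:
            results.append(replica(x))
        except Exception:
            # A crashed replica simply contributes no vote.
            continue
    if not results:
        raise RuntimeError("all replicas failed")
    value, count = Counter(results).most_common(1)[0]
    # Require a strict majority of all replicas, not just of the survivors.
    if 2 * count <= len(replicas):
        raise RuntimeError("no majority among replicas")
    return value


def square(x: int) -> int:
    return x * x


def faulty_square(x: int) -> int:
    return x * x + 1  # deliberately wrong, to model a faulty replica


def crashed(x: int) -> int:
    raise RuntimeError("replica is down")
```

With three replicas this masks one faulty or crashed replica: `run_redundant([square, square, faulty_square], 3)` and `run_redundant([square, square, crashed], 3)` both return the value computed by the two healthy replicas.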

Software Fault Tolerance Techniques
N-Version Programming
--- Different implementations of the same program, in order to avoid identical design faults
Block Recovery
--- Duplication of various critical software modules
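
The block-recovery idea can be sketched in the style of a recovery block: try a primary module, check its result with an acceptance test, and fall back to an alternate module if the test fails. This is an illustrative sketch under simplifying assumptions; `recovery_block`, `buggy_sort`, and `is_sorted_permutation` are hypothetical names, and because the modules here are pure functions there is no state to restore before trying the alternate.

```python
from collections import Counter
from typing import Callable, List, Sequence


def recovery_block(x: List[int],
                   variants: Sequence[Callable[[List[int]], List[int]]],
                   acceptance_test: Callable[[List[int], List[int]], bool]) -> List[int]:
    """Return the first variant result that passes the acceptance test."""
    for variant in variants:
        try:
            result = variant(list(x))  # each variant works on its own copy
        except Exception:
            continue  # a crashing variant is treated like a failed test
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all variants failed the acceptance test")


def buggy_sort(xs: List[int]) -> List[int]:
    return xs  # primary module with a design fault: it does not sort


def is_sorted_permutation(original: List[int], result: List[int]) -> bool:
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return in_order and Counter(original) == Counter(result)
```

Calling `recovery_block([3, 1, 2], [buggy_sort, sorted], is_sorted_permutation)` rejects the primary's output and returns the alternate's. N-version programming differs in that all versions run and their outputs are compared by voting, rather than being tried one after another.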

Reliable Communication
One-one communication
--- Use reliable transport protocols (TCP) or handle at the application layer
--- Possibilities
1. Client unable to locate server
2. Lost request messages
3. Server crashes after receiving request
4. Lost reply messages
5. Client crashes after sending request
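
Handling lost replies at the application layer usually means the client retries, which in turn means the server must recognize a retried request so it is not executed twice. The sketch below simulates this with an in-process "channel" that drops replies; all class and function names are hypothetical, and a real system would use timeouts on a network socket instead of the simulated `TimeoutError`.

```python
from typing import Dict


class Server:
    """Applies each request at most once by caching replies per request ID."""

    def __init__(self) -> None:
        self.counter = 0
        self.replies: Dict[str, int] = {}

    def handle(self, request_id: str, amount: int) -> int:
        if request_id in self.replies:
            # Retransmission of a request already executed: resend the old reply.
            return self.replies[request_id]
        self.counter += amount
        self.replies[request_id] = self.counter
        return self.counter


class LossyChannel:
    """Delivers every request but drops the first `drop_replies` replies."""

    def __init__(self, server: Server, drop_replies: int) -> None:
        self.server = server
        self.drop_replies = drop_replies

    def call(self, request_id: str, amount: int) -> int:
        reply = self.server.handle(request_id, amount)
        if self.drop_replies > 0:
            self.drop_replies -= 1
            raise TimeoutError("reply lost")  # the client sees only a timeout
        return reply


def call_with_retry(channel: LossyChannel, request_id: str,
                    amount: int, max_attempts: int = 3) -> int:
    """Retry the same request ID until a reply arrives or attempts run out."""
    for _ in range(max_attempts):
        try:
            return channel.call(request_id, amount)
        except TimeoutError:
            continue
    raise TimeoutError("no reply after retries")
```

Because the client reuses the same request ID on each retry, a lost reply leads to a resend of the cached reply rather than a second execution of the operation.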

Reliable Communication (cont’d)
One-many communication
--- Reliable Multicast
1. Lost messages need to be retransmitted
--- Possibilities
1. ACK-based schemes: sender can become a bottleneck
2. NACK-based schemes
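
In a NACK-based scheme, receivers stay silent while messages arrive in order and only report the sequence numbers they are missing. The sketch below shows the receiver side under the assumption that the sender numbers messages consecutively from 0; `NackReceiver` is a hypothetical name, not an API from the slides.

```python
from typing import Dict, List


class NackReceiver:
    """Delivers messages in sequence order and reports gaps as NACKs."""

    def __init__(self) -> None:
        self.next_seq = 0                  # next sequence number to deliver
        self.buffer: Dict[int, str] = {}   # out-of-order messages held back
        self.delivered: List[str] = []

    def receive(self, seq: int, payload: str) -> List[int]:
        """Accept one message; return the sequence numbers to NACK."""
        if seq < self.next_seq or seq in self.buffer:
            return []  # duplicate, nothing to do
        self.buffer[seq] = payload
        # Deliver everything that is now contiguous.
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        # Anything between next_seq and the highest buffered message is missing.
        highest = max(self.buffer, default=self.next_seq)
        return [s for s in range(self.next_seq, highest) if s not in self.buffer]
```

A receiver can only notice a loss once a later message arrives, so this scheme alone does not detect the loss of the last message in a stream.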

Distributed Commit
--- Atomic multicast: either all processes in a group perform an operation, or none of them do
--- Problem of distributed commit: all-or-nothing operations in a group of processes
--- Possible approaches: 2-phase commit and 3-phase commit

Distributed Commit (cont’d)
2-Phase & 3-Phase commit
--- A coordinator process coordinates the operation
--- 2-phase commit involves 2 phases
1. Voting phase: processes vote on whether to commit
2. Decision phase: actually commit or abort
--- Problem: if the coordinator crashes, the processes block
--- 3-phase commit: variant of 2-phase commit that avoids blocking
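
The two phases can be sketched as a single-process simulation. This is a minimal illustration only: `Participant` and `two_phase_commit` are hypothetical names, message passing is replaced by direct method calls, and crashes and timeouts are not modeled.

```python
from typing import List


class Participant:
    def __init__(self, name: str, will_commit: bool = True) -> None:
        self.name = name
        self.will_commit = will_commit
        self.state = "INIT"

    def vote(self) -> bool:
        # Voting phase: a "yes" vote moves the participant to READY,
        # where it must wait for the coordinator's decision.
        self.state = "READY" if self.will_commit else "ABORT"
        return self.will_commit

    def decide(self, commit: bool) -> None:
        # Decision phase: apply the coordinator's global decision.
        self.state = "COMMIT" if commit else "ABORT"


def two_phase_commit(participants: List[Participant]) -> bool:
    """Coordinator logic: commit only if every participant votes yes."""
    votes = [p.vote() for p in participants]   # phase 1: collect votes
    decision = all(votes)
    for p in participants:                      # phase 2: broadcast decision
        p.decide(decision)
    return decision
```

The blocking problem corresponds to the coordinator crashing between the two loops: participants that voted yes are left in READY and cannot safely commit or abort on their own.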

Recovery
Techniques thus far allow failure handling.
Recovery means the operations that must be performed after a failure to return to a correct state.
Techniques:
1. Checkpointing
2. Message Logging

Checkpointing
--- Periodically checkpoint state
--- Upon a crash, roll back to a previous checkpoint with a consistent state
Types:
--- Independent Checkpointing
--- Coordinated Checkpointing
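
For a single process, checkpoint and rollback can be sketched as saving and restoring a copy of the state. The class below is a hypothetical illustration: the in-memory copy stands in for stable storage, and it does not address the harder distributed question of whether the checkpoints of different processes form a consistent global state.

```python
import copy
from typing import Dict


class CheckpointedProcess:
    def __init__(self) -> None:
        self.state: Dict[str, int] = {"total": 0, "steps": 0}
        # In a real system this copy would be written to stable storage.
        self._checkpoint = copy.deepcopy(self.state)

    def step(self, value: int) -> None:
        self.state["total"] += value
        self.state["steps"] += 1

    def checkpoint(self) -> None:
        """Save a snapshot of the current state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Discard all work since the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)
```

Everything done after the last `checkpoint()` call is lost on `rollback()` and has to be recomputed, which is the cost that message logging tries to reduce.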

Message Logging
Checkpointing is expensive
1. All processes restart from the previous consistent cut
2. Taking a snapshot is expensive
3. All computations from the previous snapshot have to be redone
Combine checkpointing (expensive) with message logging (cheap)
1. Take infrequent checkpoints
2. Log all messages between checkpoints to local stable storage
3. To recover: replay the logged messages from the previous checkpoint instead of redoing all computation since that checkpoint
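
The combination of infrequent checkpoints and a message log can be sketched as follows. This is an illustration under stated assumptions: `LoggingProcess` is a hypothetical name, the Python list stands in for local stable storage, and message handling is assumed to be deterministic so that replaying the same messages reproduces the same state.

```python
from typing import List, Optional


class LoggingProcess:
    def __init__(self) -> None:
        self.total: Optional[int] = 0
        self.checkpoint_total = 0
        self.log: List[int] = []  # messages received since the last checkpoint

    def _apply(self, msg: int) -> None:
        self.total += msg  # deterministic handler: same input, same effect

    def deliver(self, msg: int) -> None:
        self.log.append(msg)  # log first, so the message survives a crash
        self._apply(msg)

    def checkpoint(self) -> None:
        self.checkpoint_total = self.total
        self.log.clear()      # older messages are covered by the checkpoint

    def crash(self) -> None:
        self.total = None     # volatile state is lost

    def recover(self) -> None:
        """Restore the last checkpoint, then replay the logged messages."""
        self.total = self.checkpoint_total
        for msg in self.log:
            self._apply(msg)
```

After `recover()`, the process is back in the state it had just before the crash, not merely the state at the last checkpoint.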

On-going Research
--- Intelligent / Adaptive Fault Tolerance

Summary

References:
--- “Fault Tolerance in Distributed Systems” by Pankaj Jalote
--- “Adaptive Fault tolerance in Distributed Systems” by Roger Bharath, Melanie Dumas and Mevlut Erdem Kurul

Questions?