Middleware for Fault Tolerant Applications Lihua Xu and Sheng Liu Jun, 05, 2003.


Outline Basic technologies in fault tolerance Middleware for fault tolerant applications –Egida –AQuA

Why Fault Tolerance? “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” Leslie Lamport, May 1987

Basic Technologies in Fault Tolerant Distributed Systems Hardened hardware component technologies Fault detection and membership maintenance Log-based scheme and checkpointing

Hardened hardware component technologies Hardened processor modules: –Pair of self-checking processors (PSP) RAID (redundant array of inexpensive disks): –Popular even in database-centric business computing applications.

Fault detection and membership maintenance Timeout Comparison of the results of repeated or redundant executions Error-detection and error-correction code Acceptance test : Test reasonableness of intermediate computation results Membership maintenance –Simplest version: Master node makes a periodic roll-call of other nodes –Heartbeat message exchange
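The heartbeat and timeout ideas on this slide can be sketched as follows. This is only an illustrative sketch: the class name `HeartbeatMonitor`, its methods, and the explicit `now` argument (used instead of a real clock so the behaviour is deterministic) are assumptions, not part of the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each node and suspects silent ones."""
    timeout: float                                   # seconds of silence tolerated
    last_seen: dict = field(default_factory=dict)    # node name -> last heartbeat time

    def heartbeat(self, node: str, now: float) -> None:
        # Record that `node` reported in at time `now`.
        self.last_seen[node] = now

    def members(self, now: float) -> set:
        # Nodes whose most recent heartbeat falls within the timeout window.
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout}

    def suspected(self, now: float) -> set:
        # Nodes that have been silent for longer than the timeout.
        return set(self.last_seen) - self.members(now)
```

A master doing a periodic roll-call would call `suspected(now)` at each round and drop the returned nodes from the membership list. Note that a timeout can only suspect a node; a slow node and a crashed node look the same to this detector.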

Log-based scheme and checkpointing Log-based schemes record, on stable storage, information describing all the modifications by the transaction to the various data it accessed. Checkpointing is a technique to minimize the time taken to recover in the event of a system crash.
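To make the relationship between the log and the checkpoint concrete, here is a minimal sketch under stated assumptions: `LoggedStore` is a hypothetical key-value store, and its in-memory `stable_log` and `stable_checkpoint` attributes merely stand in for stable storage, which a real system would keep on disk.

```python
import copy


class LoggedStore:
    """Key-value store with a redo log and checkpoints (illustrative only)."""

    def __init__(self):
        self.data = {}                # volatile state, lost in a crash
        self.stable_log = []          # (key, new_value) records since the last checkpoint
        self.stable_checkpoint = {}   # last saved copy of the full state

    def write(self, key, value):
        # Log the modification first, then apply it to volatile state.
        self.stable_log.append((key, value))
        self.data[key] = value

    def checkpoint(self):
        # Save the full state; records logged before this point are no longer needed.
        self.stable_checkpoint = copy.deepcopy(self.data)
        self.stable_log.clear()

    def crash(self):
        # Simulate a crash: volatile state disappears, stable storage survives.
        self.data = {}

    def recover(self):
        # Restore the checkpoint, then replay only the records logged after it.
        self.data = copy.deepcopy(self.stable_checkpoint)
        for key, value in self.stable_log:
            self.data[key] = value
        return len(self.stable_log)   # number of records replayed
```

The value returned by `recover` shows why checkpointing shortens recovery: only the records written since the last checkpoint have to be replayed.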

Middleware for Fault Tolerant Applications

Egida It is an object-oriented toolkit designed to support transparent rollback recovery for low-overhead fault-tolerance.

Log-based rollback recovery protocols Log information is recorded on stable storage during failure-free execution. That information is used to recover after a failure. The protocols come in a set of variants, including checkpointing and message logging.

Checkpointing

Message Logging Pessimistic logging allows processes to communicate only from recoverable states. Optimistic logging allows processes to communicate with other processes even from states that are not yet recoverable. Causal logging allows the possibility that a state from which a process communicates may become unrecoverable because of a failure, but only if no correct process depends on that state. A correct process is one that exhibits no failures at any point in the execution under consideration. A process that crashes at some point is therefore “non-failed” up to that point, but it is not “correct”, even before the crash.
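The pessimistic rule can be illustrated with a small sketch. It assumes a receiver-based scheme in which each incoming message is logged synchronously before it is processed; the class `PessimisticProcess`, its integer messages, and the in-memory `stable_log` (a stand-in for stable storage) are all hypothetical.

```python
class PessimisticProcess:
    """Process that logs every received message before acting on it."""

    def __init__(self):
        self.stable_log = []   # stand-in for stable storage; survives a crash
        self.state = 0         # volatile state
        self.outbox = []       # messages this process has sent

    def receive(self, msg: int) -> None:
        # Pessimistic rule: log BEFORE processing, so every state from which
        # a message is later sent can be rebuilt from the log.
        self.stable_log.append(msg)
        self._apply(msg)

    def _apply(self, msg: int) -> None:
        self.state += msg
        self.outbox.append(self.state)   # the sent value depends on the state

    def crash(self) -> None:
        # Volatile state is lost; the log is not.
        self.state = 0
        self.outbox = []

    def recover(self) -> None:
        # Replay the logged messages in their original order.
        self.state = 0
        self.outbox = []
        for msg in self.stable_log:
            self._apply(msg)
```

Because the log is written before any dependent message leaves the process, replay regenerates exactly the messages sent before the crash. An optimistic variant would defer the append to the log, trading this guarantee for lower failure-free overhead.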

Deconstructing Log-Based Rollback-Recovery Protocols The diversity of rollback-recovery protocols reflects the heterogeneity in the requirements of applications. Beneath this diversity lies a simple event-driven structure that all these protocols share: every protocol is interested in the same set of “relevant” events.

Relevant Events Non-deterministic events –A non-deterministic event is an event whose outcome may change for different executions of the same program. Dependency-generating events –These events can increase the number of processes that depend on the non-deterministic events executed by a process. Output-commit events –These events can make the external environment depend on the non-deterministic events executed by a process. Checkpointing events –These events instruct the protocols to write to stable storage the state of one or more processes. Failure-detection events –These events are generated on detecting the failure of one or more processes.

A Simple Language Specifying Rollback-recovery Protocols A protocol is defined in terms of the actions it takes in response to non-deterministic events, dependency-generating events, output-commit events, checkpointing events and failure-detection events. Implementing a specific protocol amounts to selecting the set of actions performed in response to each relevant event. A simple language is used to specify the rollback-recovery protocols.
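The idea that a protocol is a choice of actions per relevant event can be sketched as a dispatch table. This is not Egida's specification language, which the slides do not show; the `Event` enum, `make_protocol`, and the four action functions are hypothetical names chosen for illustration.

```python
from enum import Enum, auto


class Event(Enum):
    """The five classes of relevant events named in the slides."""
    NON_DETERMINISTIC = auto()
    DEPENDENCY_GENERATING = auto()
    OUTPUT_COMMIT = auto()
    CHECKPOINT = auto()
    FAILURE_DETECTION = auto()


def make_protocol(actions):
    """Build a handler from a mapping Event -> list of action callables."""
    def handle(event, trace):
        # Run, in order, every action the protocol binds to this event.
        for action in actions.get(event, []):
            action(trace)
        return trace
    return handle


# Hypothetical actions; each one only records its name in a trace list.
def log_determinant(trace):
    trace.append("log determinant")

def flush_log(trace):
    trace.append("flush log to stable storage")

def save_state(trace):
    trace.append("write checkpoint")

def roll_back(trace):
    trace.append("restore checkpoint and replay log")


# One possible selection of actions, loosely modelled on pessimistic logging.
pessimistic_like = make_protocol({
    Event.NON_DETERMINISTIC: [log_determinant, flush_log],
    Event.CHECKPOINT: [save_state],
    Event.FAILURE_DETECTION: [roll_back],
})
```

Changing the protocol means changing the table, not the dispatch loop: dropping `flush_log` from the first entry, for example, would give a more optimistic flavour.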

Module Definitions To define a protocol completely, it is necessary to instantiate a set of variables which specify, for instance, the set of non-deterministic events, the form of their determinants, the implementation of stable storage, etc. Egida identifies a set of building blocks which, when incorporated into the protocol structure, yield different rollback recovery protocols.

Architecture

Synthesizing Protocols through Module Composition Egida allows the co-existence of multiple implementations for each of the modules. To synthesize a protocol, a specific implementation of each module must be selected. Egida maintains a binding between the values for the modules and their corresponding implementations. Therefore, synthesizing a protocol requires processing the specification along with the binding information to initialize the modules to their appropriate implementations.
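The binding step described here can be pictured as a lookup table from (module, value) pairs to implementation classes. The sketch below is an assumption-laden illustration, not Egida's actual API: the module names `storage` and `logging`, the four placeholder classes, and the `synthesize` function are all invented for this example.

```python
# Placeholder implementations; real modules would carry behaviour.
class MemoryStorage:
    name = "memory storage"

class DiskStorage:
    name = "disk storage"

class PessimisticLogging:
    name = "pessimistic logging"

class OptimisticLogging:
    name = "optimistic logging"


# Binding between the values a specification may use and their implementations.
BINDINGS = {
    ("storage", "memory"): MemoryStorage,
    ("storage", "disk"): DiskStorage,
    ("logging", "pessimistic"): PessimisticLogging,
    ("logging", "optimistic"): OptimisticLogging,
}


def synthesize(spec, bindings=BINDINGS):
    """Instantiate one implementation per module named in the specification."""
    protocol = {}
    for module, value in spec.items():
        try:
            implementation = bindings[(module, value)]
        except KeyError:
            raise ValueError(f"no implementation bound for {module}={value!r}") from None
        protocol[module] = implementation()
    return protocol
```

For instance, `synthesize({"storage": "disk", "logging": "optimistic"})` returns a dictionary holding one `DiskStorage` and one `OptimisticLogging` instance. Adding a new implementation only requires a new entry in the binding table.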

Advantages Promotes extensibility and flexibility by allowing multiple implementations of each of the core functionalities. Facilitates rapid implementation of rollback recovery protocols with minimal programming effort by gluing together objects from the available library of building blocks. Enables designers of fault-tolerance protocols to develop new rollback recovery protocols by combining different implementations of the core functionalities in novel ways.

AQuA: An Adaptive Architecture that provides dependable distributed objects

Overview AQuA allows distributed applications to request and obtain a desired level of availability using a QuO contract through a property manager. Fault tolerance in AQuA is provided by Proteus, which dynamically manages the replication of distributed objects to make them dependable.

Background Ensemble group communication system 1. ensures reliable communication between groups of processes, 2. ensures atomic delivery of multicasts to groups with changing membership, 3. detects and excludes from the group members that fail by crashing. Maestro: object-oriented interface to Ensemble

Background (cont) Quality Objects (QuO) 1. transmits applications’ availability requirements to Proteus, which attempts to configure the system to achieve the desired availability. 2. provides an adaptation mechanism that is used when Proteus is unable to provide a specified level of availability.

Background (cont) Proteus –Dependability manager: replicated; consists of an advisor and a protocol coordinator –Handlers: implement voters and monitors in the gateway –Object factories: implemented on each host

AQuA Architecture Overview

Group Structure in AQuA

Fault Tolerance in AQuA Fault model: crash failures, value faults, time faults. Error detection: Proteus, voter, monitor. Fault treatment: Proteus manager advisor.
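The role of a voter in detecting value faults can be illustrated with a simple majority vote over replica replies. This is a generic sketch, not the AQuA voter: the function names `majority_vote` and `suspects`, and the reply format (a dictionary from replica name to a hashable value), are assumptions.

```python
from collections import Counter


def majority_vote(replies):
    """Return the value reported by a strict majority of replicas, or None."""
    if not replies:
        return None
    value, count = Counter(replies.values()).most_common(1)[0]
    # Accept the value only if more than half of the replicas agree on it.
    return value if count > len(replies) / 2 else None


def suspects(replies):
    """Replicas whose reply differs from the majority value (possible value faults)."""
    winner = majority_vote(replies)
    if winner is None:
        return set()    # no majority, so no replica can be singled out
    return {replica for replica, value in replies.items() if value != winner}
```

With three replicas, one faulty value is both masked and attributed to a replica; with only two, a disagreement can be detected but not resolved, which is why `majority_vote` returns `None` in that case.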