Fault Tolerance https://store.theartofservice.com/the-fault-tolerance-toolkit.html.

Fault Tolerance

Brian Randell - Software fault tolerance: Beginning in the 1970s, Randell "set up the project that initiated research into the possibility of software fault tolerance, and introduced the 'recovery block' concept. Subsequent major developments included the Newcastle Connection, and the prototype distributed Secure System".

Byzantine fault tolerance

Byzantine fault tolerance: Byzantine fault tolerance is a sub-field of fault tolerance research inspired by the Byzantine Generals' Problem, which is a generalized version of the Two Generals' Problem.

Byzantine fault tolerance: The objective of Byzantine fault tolerance is to defend against Byzantine failures, in which components of a system fail in arbitrary ways (i.e., not just by stopping or crashing but by processing requests incorrectly, corrupting their local state, and/or producing incorrect or inconsistent outputs). The correctly functioning components of a Byzantine fault-tolerant system can still provide the system's service correctly, provided there are not too many Byzantine-faulty components.

Byzantine fault tolerance - Failure modes: When a Byzantine failure occurs, the system may respond in an arbitrary, unpredictable way unless it is designed to have Byzantine fault tolerance.

Byzantine fault tolerance - Origin: Byzantine fault tolerance can be achieved if the loyal (non-faulty) generals reach unanimous agreement on their strategy. Note that if the source general is correct, all loyal generals must agree on the value it sent. Otherwise, the choice of strategy agreed upon is irrelevant.
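The two conditions above can be written as a small checker. This is only a sketch: the function name `agreement_holds`, its parameters, and the string values "attack" and "retreat" are illustrative assumptions, not part of the original slides.

```python
from typing import Dict, Set


def agreement_holds(
    decisions: Dict[str, str],
    loyal: Set[str],
    source: str,
    source_value: str,
) -> bool:
    """Check one round of the generals' problem against the two conditions.

    decisions    -- value each lieutenant finally decided on (hypothetical input)
    loyal        -- names of the non-faulty generals (may include the source)
    source       -- name of the source (commanding) general
    source_value -- value the source sent, meaningful only if the source is loyal
    """
    loyal_decisions = {decisions[g] for g in loyal if g != source}

    # Condition 1: all loyal lieutenants reach the same decision.
    if len(loyal_decisions) > 1:
        return False

    # Condition 2: if the source is loyal, that decision must be its value.
    if source in loyal and loyal_decisions and loyal_decisions != {source_value}:
        return False

    # If the source is a traitor, any common decision is acceptable.
    return True
```

For example, `agreement_holds({"L1": "retreat", "L2": "retreat"}, {"L1", "L2"}, "G", "attack")` is true when the source `G` is a traitor, because the loyal lieutenants still agree with each other.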

Byzantine fault tolerance - Early solutions: A second solution requires unforgeable signatures (in modern computer systems, this may be achieved in practice using public-key cryptography), but maintains Byzantine fault tolerance in the presence of an arbitrary number of traitorous generals.

Byzantine fault tolerance - Practical Byzantine fault tolerance: Byzantine fault tolerant replication protocols were long considered too expensive to be practical. Then in 1999, Miguel Castro and Barbara Liskov introduced the "Practical Byzantine Fault Tolerance" (PBFT) algorithm, which provides high-performance Byzantine state machine replication, processing thousands of requests per second with sub-millisecond increases in latency.
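The slide does not give PBFT's sizing rule. As background from the Castro and Liskov paper rather than from this deck, PBFT uses n = 3f + 1 replicas to tolerate f Byzantine replicas and works with quorums of 2f + 1. The sketch below only does that arithmetic; the function names are hypothetical.

```python
def replicas_needed(f: int) -> int:
    """Replicas PBFT uses to tolerate f Byzantine replicas (n = 3f + 1)."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def quorum_size(f: int) -> int:
    """Size of a PBFT quorum (2f + 1 matching replicas)."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 2 * f + 1


def min_quorum_overlap(f: int) -> int:
    """Smallest possible intersection of two quorums among n = 3f + 1 replicas.

    Two sets of size 2f + 1 drawn from 3f + 1 replicas share at least
    2(2f + 1) - (3f + 1) = f + 1 members, so at least one is non-faulty.
    """
    return 2 * quorum_size(f) - replicas_needed(f)
```

With f = 1 this gives 4 replicas, quorums of 3, and an overlap of at least 2 replicas, of which at most one can be faulty.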

Distributed file system for cloud - Fault tolerance: For fault tolerance, a chunk is replicated onto multiple chunkservers, by default three. A chunk remains available as long as it is stored on at least one live chunkserver.
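A minimal sketch of this replication rule follows. The random placement policy, the function names, and the server names are assumptions made for illustration; the slides say only that the default is three replicas.

```python
import random
from typing import List, Optional, Set

DEFAULT_REPLICAS = 3  # default replication factor stated in the text


def place_chunk(
    servers: List[str],
    replicas: int = DEFAULT_REPLICAS,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick distinct chunkservers to hold one chunk (random placement, assumed)."""
    if len(servers) < replicas:
        raise ValueError("not enough chunkservers for the requested replication")
    rng = rng or random.Random()
    return rng.sample(servers, replicas)


def is_available(replica_locations: List[str], live_servers: Set[str]) -> bool:
    """A chunk is available if at least one replica is on a live chunkserver."""
    return any(server in live_servers for server in replica_locations)
```

With three replicas, a chunk survives the loss of any two of the chunkservers that hold it.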

Application delivery network - Fault tolerance: The ADN provides fault tolerance at the server level, within pools or farms. This is accomplished by designating specific servers as 'backup' servers, which the ADN activates automatically if the primary server(s) in the pool fail. [9/1219buyers2.html MacVittie, Lori: Content Switches, Network Computing, July 2001]
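The backup-server behaviour can be sketched as a selection rule over a pool. The `Server` class and `pick_server` function are hypothetical and do not correspond to any particular ADN product's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    healthy: bool = True
    backup: bool = False  # True for servers designated as backups


def pick_server(pool: List[Server]) -> Optional[Server]:
    """Return a healthy primary, or a healthy backup if every primary has failed."""
    primaries = [s for s in pool if not s.backup and s.healthy]
    if primaries:
        return primaries[0]
    # All primaries are down: activate a backup automatically.
    backups = [s for s in pool if s.backup and s.healthy]
    return backups[0] if backups else None
```

A real ADN would combine a rule like this with health checks and load balancing across the healthy primaries; the sketch shows only the failover decision.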

Application delivery network - Fault tolerance: The ADN also ensures application availability and reliability through its ability to fail over seamlessly to a secondary device in the event of a hardware or software failure. Traffic therefore continues to flow when one device fails, which provides fault tolerance for the applications. Fault tolerance is implemented in ADNs through either a network-based or a serial-based connection.

System Fault Tolerance: System Fault Tolerance (SFT) is a fault-tolerant system built into NetWare operating systems. There are three levels of fault tolerance:

* SFT I 'Hot Fix' maps out bad disk blocks at the file system level to help ensure data integrity (fault tolerance at the disk block level).
* SFT II is a disk mirroring or duplexing system based on RAID 1; mirroring refers to two disk drives holding the same data, while duplexing uses two data channels/controllers to connect the disks (fault tolerance at the disk and, optionally, data channel level).
* SFT III is a server duplexing scheme in which, if a server fails, a constantly synchronized server seamlessly takes its place (fault tolerance at the system level).
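To illustrate the mirroring idea behind SFT II, here is a toy in-memory model of two disks holding the same data. It is not NetWare code; the class `MirroredVolume` and its methods are assumptions, and it does not model resynchronizing a disk after repair.

```python
from typing import Dict, List


class MirroredVolume:
    """Two in-memory 'disks' that hold the same blocks (RAID 1 style mirroring)."""

    def __init__(self) -> None:
        self.disks: List[Dict[int, bytes]] = [{}, {}]
        self.failed: List[bool] = [False, False]

    def fail_disk(self, index: int) -> None:
        """Simulate the loss of one disk."""
        self.failed[index] = True

    def write(self, block: int, data: bytes) -> None:
        """Write the block to every disk that is still working."""
        working = [i for i in range(2) if not self.failed[i]]
        if not working:
            raise OSError("both mirrored disks have failed")
        for i in working:
            self.disks[i][block] = data

    def read(self, block: int) -> bytes:
        """Read from the first working disk; the mirror masks a single failure."""
        for i in range(2):
            if not self.failed[i]:
                return self.disks[i][block]
        raise OSError("both mirrored disks have failed")
```

After a block is written, either disk can fail and the read still succeeds; only the loss of both disks makes the data unavailable.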

For more information, visit: https://store.theartofservice.com/the-fault-tolerance-toolkit.html
The Art of Service