Fail-Stop Processors UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department CS 739 Distributed Systems Andrea C. Arpaci-Dusseau One paper: Byzantine.

Slides:



Advertisements
Similar presentations
Distributed Systems Major Design Issues Presented by: Christopher Hector CS8320 – Advanced Operating Systems Spring 2007 – Section 2.6 Presentation Dr.
Advertisements

Impossibility of Distributed Consensus with One Faulty Process
CS 542: Topics in Distributed Systems Diganta Goswami.
Byzantine Generals. Outline r Byzantine generals problem.
Agreement: Byzantine Generals UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department CS 739 Distributed Systems Andrea C. Arpaci-Dusseau Paper: “The.
CSE 486/586, Spring 2013 CSE 486/586 Distributed Systems Byzantine Fault Tolerance Steve Ko Computer Sciences and Engineering University at Buffalo.
Computer Science 425 Distributed Systems CS 425 / ECE 428 Consensus
The Byzantine Generals Problem Boon Thau Loo CS294-4.
The Byzantine Generals Problem Leslie Lamport, Robert Shostak, Marshall Pease Distributed Algorithms A1 Presented by: Anna Bendersky.
Byzantine Generals Problem Anthony Soo Kaim Ryan Chu Stephen Wu.
1 Principles of Reliable Distributed Systems Lecture 3: Synchronous Uniform Consensus Spring 2006 Dr. Idit Keidar.
CS 582 / CMPE 481 Distributed Systems Fault Tolerance.
Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring Principles of Reliable Distributed Systems Lecture 5: Synchronous (Uniform)
Copyright 2006 Koren & Krishna ECE655/ByzGen.1 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Fault Tolerant Computing ECE 655.
Computer Science Lecture 17, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.
EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 15 Wenbing Zhao Department of Electrical and Computer Engineering.
Replication Management using the State-Machine Approach Fred B. Schneider Summary and Discussion : Hee Jung Kim and Ying Zhang October 27, 2005.
2/23/2009CS50901 Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial Fred B. Schneider Presenter: Aly Farahat.
Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring Principles of Reliable Distributed Systems Lecture 5: Synchronous Uniform.
1 Principles of Reliable Distributed Systems Lecture 5: Failure Models, Fault-Tolerant Broadcasts and State-Machine Replication Spring 2005 Dr. Idit Keidar.
CS294, YelickConsensus, p1 CS Consensus
Last Class: Weak Consistency
Practical Byzantine Fault Tolerance (The Byzantine Generals Problem)
State Machines CS 614 Thursday, Feb 21, 2002 Bill McCloskey.
1 Fault Tolerance in Collaborative Sensor Networks for Target Detection IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 3, MARCH 2004.
Fault Tolerance via the State Machine Replication Approach Favian Contreras.
1 MSCS 237 Communication issues. 2 Colouris et al. (2001): Is a system in which hardware or software components located at networked computers communicate.
Distributed Systems: Concepts and Design Chapter 1 Pages
Chapter 19 Recovery and Fault Tolerance Copyright © 2008.
Reliable Communication in the Presence of Failures Based on the paper by: Kenneth Birman and Thomas A. Joseph Cesar Talledo COEN 317 Fall 05.
Replicated State Machines ITV Model-based Analysis and Design of Embedded Software Techniques and methods for Critical Software Anders P. Ravn Aalborg.
Practical Byzantine Fault Tolerance
Byzantine fault-tolerance COMP 413 Fall Overview Models –Synchronous vs. asynchronous systems –Byzantine failure model Secure storage with self-certifying.
1 CSE 8343 Presentation # 2 Fault Tolerance in Distributed Systems By Sajida Begum Samina F Choudhry.
1 Resilience by Distributed Consensus : Byzantine Generals Problem Adapted from various sources by: T. K. Prasad, Professor Kno.e.sis : Ohio Center of.
Agenda Fail Stop Processors –Problem Definition –Implementation with reliable stable storage –Implementation without reliable stable storage Failure Detection.
Copyright © George Coulouris, Jean Dollimore, Tim Kindberg This material is made available for private study and for direct.
Byzantine Fault Tolerance CS 425: Distributed Systems Fall 2012 Lecture 26 November 29, 2012 Presented By: Imranul Hoque 1.
CSE 60641: Operating Systems Implementing Fault-Tolerant Services Using the State Machine Approach: a tutorial Fred B. Schneider, ACM Computing Surveys.
Chap 15. Agreement. Problem Processes need to agree on a single bit No link failures A process can fail by crashing (no malicious behavior) Messages take.
UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department
Hwajung Lee.  Improves reliability  Improves availability ( What good is a reliable system if it is not available?)  Replication must be transparent.
Replication Improves reliability Improves availability ( What good is a reliable system if it is not available?) Replication must be transparent and create.
Ordering of Events in Distributed Systems UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department CS 739 Distributed Systems Andrea C. Arpaci-Dusseau.
Fault Tolerance
Faults and fault-tolerance One of the selling points of a distributed system is that the system will continue to perform even if some components / processes.
CSci8211: Distributed Systems: State Machines 1 Detour: Some Theory of Distributed Systems Supplementary Materials  Replicated State Machines Notion of.
Distributed Systems Lecture 9 Leader election 1. Previous lecture Middleware RPC and RMI – Marshalling 2.
Fundamentals of Fault-Tolerant Distributed Computing In Asynchronous Environments Paper by Felix C. Gartner Graeme Coakley COEN 317 November 23, 2003.
1 AGREEMENT PROTOCOLS. 2 Introduction Processes/Sites in distributed systems often compete as well as cooperate to achieve a common goal. Mutual Trust/agreement.
Chapter 8 Fault Tolerance. Outline Introductions –Concepts –Failure models –Redundancy Process resilience –Groups and failure masking –Distributed agreement.
Replication Hari Shreedharan. Papers! ● Implementing Fault-Tolerant Services using the State Machine Approach (Dec, 1990) Fred B. Schneider Cornell University.
Ordering of Events in Distributed Systems UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department CS 739 Distributed Systems Andrea C. Arpaci-Dusseau.
When Is Agreement Possible
Distributed Systems – Paxos
Providing Secure Storage on the Internet
Agreement Protocols CS60002: Distributed Systems
EECS 498 Introduction to Distributed Systems Fall 2017
Outline Announcements Fault Tolerance.
Jacob Gardner & Chuan Guo
Replication Improves reliability Improves availability
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department
EEC 688/788 Secure and Dependable Computing
Fault-Tolerant State Machine Replication
Distributed Systems CS
EEC 688/788 Secure and Dependable Computing
Abstractions for Fault Tolerance
Presentation transcript:

Fail-Stop Processors UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department CS 739 Distributed Systems Andrea C. Arpaci-Dusseau One paper: Byzantine Generals in Action: Implementing Fail-Stop Processors, Fred Schneider, TOCS, May 1984

Motivation Goal: Build systems that continue to work in presence of component failure Difficulty of building those systems depends upon how components can fail Fail-stop components make building reliable systems easier than components with byzantine failures

Fail-Stop Processors What is a failure? Output (or behavior) that is inconsistent with specification What is a Byzantine failure? Arbitrary, even malicious, behavior Components may collude with each other Cannot necessarily detect output is faulty What is a fail-stop processor? Halts instead of performing erroneous transformations Others can detect halted state Other can obtain uncorrupted stable storage Real processors are not fail-stop; How will we build one? Building with fail-stop processors is easier; Why?

Distributed State Machine Common approach for building reliable systems Idea: Replicate servers, coordinate client interactions with replicas Failure model of components determines how many replicas, R, are needed and their interactions RRR Client input sequence State machine output

How to build t-fault tolerant state machine? t-fault tolerant: Satisfies specification as long as no more than t components (servers) fail Inputs Key: All replicas receive and process same sequence of inputs 1) Agreement: Every nonfaulty replica receives sam request (interactive consistency or byzantine agreement) 2) Ordering: Every nonfaulty replica processes requests in same order (logical clocks) Outputs ByzantineFail-Stop Output of SM? Number of replicas?

How to build t-fault tolerant state machine? t-fault tolerant: Satisfies specification as long as no more than t components (servers) fail Inputs Key: All replicas receive and process same sequence of inputs 1) Agreement: Every nonfaulty replica receives sam request (interactive consistency or byzantine agreement) 2) Ordering: Every nonfaulty replica processes requests in same order (logical clocks) Outputs ByzantineFail-Stop Output of SM?Output of majority Number of replicas?

How to build t-fault tolerant state machine? t-fault tolerant: Satisfies specification as long as no more than t components (servers) fail Inputs Key: All replicas receive and process same sequence of inputs 1) Agreement: Every nonfaulty replica receives sam request (interactive consistency or byzantine agreement) 2) Ordering: Every nonfaulty replica processes requests in same order (logical clocks) Outputs ByzantineFail-Stop Output of SM?Output of majority Number of replicas?2t+1

How to build t-fault tolerant state machine? t-fault tolerant: Satisfies specification as long as no more than t components (servers) fail Inputs Key: All replicas receive and process same sequence of inputs 1) Agreement: Every nonfaulty replica receives sam request (interactive consistency or byzantine agreement) 2) Ordering: Every nonfaulty replica processes requests in same order (logical clocks) Outputs ByzantineFail-Stop Output of SM?Output of majority Output from any Number of replicas?2t+1

How to build t-fault tolerant state machine? t-fault tolerant: Satisfies specification as long as no more than t components (servers) fail Inputs Key: All replicas receive and process same sequence of inputs 1) Agreement: Every nonfaulty replica receives same request (interactive consistency or byzantine agreement) 2) Ordering: Every nonfaulty replica processes requests in same order (logical clocks) Outputs ByzantineFail-Stop Output of SM?Output of majority Output from any Number of replicas?2t+1t+1

Building a Fail-Stop Processor Assumption: Storage Volatile: Lost on failure Stable –Not affected (lost or corrupted) by failure –Can be read by any processor –Benefit: Recover work of failed process –Drawback: Minimize interactions since slow Can only build approximation of fail-stop processor Finite hardware -> Finite failures could disable all error detection hardware k-fail-stop processor: behaves fail-stop unless k+1 or more failures

Implementation of k-FSP: Overview Two components k+1 p-processes (program) 2k+1 s-processes (storage) Each process runs on own processor, all connected with network P-Processes (k+1) Each runs program for state machine Interacts with s-processes to read and write data If any fail (if any disagreement), then all STOP Cannot detect k+1 failures

Implementation Continued S-Processes (2k+1) Each contains contents of stable storage Provides reliable data with k failures (cannot just stop) Detects disagreements/failures across p-processes Assumptions Messages can be authenticated (digital signatures) Byzantine agreement protocol?? –Signed messages –Requires k+1 processes Synchronized clocks across all processes in FSP

FSP Algorithm: Writes Each p-process, on a write: Byzantine agreement across s-processes (agree on same input value) Each s-process, on a write: Ensure each p-process writes same value within time bound –If all okay, then update value in stable storage –If not, then halt all p-processes Set failed variable to true Do not allow future writes

FSP Algorithm: Reads Each p-process, on a read: Broadcast request to all s-processes Use result from majority (k+1 out of 2k+1) Other FSP can read as well Each p-process, determine if halted/failed: Read failed variable from s-process (use majority)

FSP Example k=2, SM code: “b=a+1” How do p-processes read a? What if 2 s-processes fail? What if 3 s-processes fail? How do p-processes write b? What if 1 p-process is very slow? What if 1 p-process gives incorrect results to all? What if 1 p-process gives incorrect results to some? What if 3 p-processes give bad result? a: b: failed: p: s:

Higher-Level Example Goal: Run app handling k faults, needs N good processors Solution: Use N+k k-failstop processors Example: N=2, k=3 FSP0FSP1FSP2FSP3FSP4 SS0SS1SS2SS3SS4 What happens if: 3 p-processes in FSP0 fail? 4 p-processes in FSP0 fail? 1 p-process in FSP0, FSP1, and FSP2 fail? also in FSP3? 2 p-processes in FSP0, FSP1, and FSP2 fail? 1 s-process in SS0 fails? also in SS1, SS2, and SS3? 4 s-processes in SS0 fail?

Should we use Fail Stop Processors? Metric: Hardware cost for state machines: Fail-stop components: –Worst-case (assuming 1 process per processor): (N+k) * (3k+2) processors –Best-case (assuming s-processes from different FSP share same processor) (N+k)(k+1) + (2k+1) processors Byzantine components: –N * (2k+1) Fail-stop can be better if s-processes share and N>k Metric: Frequency of byzantine agreement protocol Fail-Stop: On every access to stable storage Byzantine: On every input read Probably fewer input reads

Summary Why build fail-stop components? Easier for higher layers to model and deal with Matches assumptions of many distributed protocols Why not? Usually more hardware Usually more agreements needed Higher-levels may be able to cope with “slightly faulty” components End-to-end argument