The Case for Byzantine Fault Detection

© 2006 Andreas Haeberlen, MPI-SWS. The Case for Byzantine Fault Detection. Andreas Haeberlen (MPI-SWS / Rice University), Petr Kouznetsov (MPI-SWS), Peter Druschel.
Presentation transcript:

1 The Case for Byzantine Fault Detection

2 Challenge: Byzantine faults
- Distributed systems are subject to a variety of failures and attacks
  - Hacker break-in
  - Freeloading
  - Censorship
  - Data corruption
  - Software/hardware failure
- Byzantine failure model: Faulty nodes may exhibit arbitrary behavior
- Dependable systems must be protected against Byzantine faults

3 Existing approach: Fault tolerance
- Byzantine fault tolerance (BFT) can mask a limited number of Byzantine faults
- Example: Castro and Liskov [OSDI'99]
[Figure: a client and a set of server replicas]

4 Alternative approach: Fault detection
- Nodes monitor each other for faulty behavior
- When a fault occurs, the correct nodes
  - identify the faulty node(s)
  - distribute evidence of the fault
- Nodes can isolate the faulty node + initiate recovery
- Byzantine Fault Detection

5 Alternative approach: Fault detection
[Figure: nodes A, B, C, D, E. B answers "OK" to "Set X=5" but later answers "X=7" to "X=?". E presents the evidence ("X=5 ≠ 7!"), and B is removed from the group.]

6 Best approach depends on the application
Typical application for Fault Detection: Inter-domain routing
- [Figure: networks Level3, AT&T, Sprint]
- Best-effort service
- Goal: Find faulty components
- Wide-area delays, limited bandwidth, many nodes
Typical application for Fault Tolerance: Air traffic control
- [Figure: machine room]
- Failures may be fatal!
- Goal: Mask fault symptoms
- Delays negligible, bandwidth plentiful, few nodes

7 Detection can provide accountability
- In an accountable system:
  - Actions are undeniable
  - State is tamper-evident
  - Correctness can be certified
- Good nodes can provide evidence that they are good
- Bad nodes cannot hide evidence of misbehavior
- Proven concept in society
  - Banking, administration...
- Desirable for distributed systems [Yumerefendi05]
  - Example: Building trust in federated systems

8 What about performance?
- If up to f nodes can be faulty, we need f+1 replicas to guarantee detection (fault tolerance: 3f+1)
  - More throughput using the same resources
  - Works even when >33% of the nodes can become faulty
- Detection can defer overhead to periods of low load
  - System can deliver high peak throughput
- Detection does not require consensus
  - Potentially less expensive than BFT
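The replica counts on this slide can be checked with a little arithmetic. The sketch below only encodes the two formulas stated above (f+1 for detection, 3f+1 for tolerance); the function names are hypothetical and not taken from the talk.

```python
def replicas_for_detection(f: int) -> int:
    """Replicas needed to guarantee detection of up to f faults (f + 1, per the slide)."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return f + 1


def replicas_for_tolerance(f: int) -> int:
    """Replicas needed to mask up to f Byzantine faults (3f + 1, per the slide)."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def max_faults_detectable(n: int) -> int:
    """Largest f with f + 1 <= n."""
    return n - 1


def max_faults_tolerable(n: int) -> int:
    """Largest f with 3f + 1 <= n."""
    return (n - 1) // 3


if __name__ == "__main__":
    for f in (1, 2, 3):
        print(f, replicas_for_detection(f), replicas_for_tolerance(f))
```

With 7 nodes, for example, these formulas give at most 2 maskable faults but up to 6 detectable ones, which is the sense in which detection still works when more than a third of the nodes are faulty.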

9 Outline
- Introduction
- BFD abstraction
- PeerReview algorithm
- Conclusion

10 How is BFD used?
- Each correct node has state machine + detector
- Detector can inspect all messages at its local node
- When detector observes a fault on another node,
  - it informs its local application, and
  - it provides evidence of the fault to other detectors
- No assumptions about faulty nodes
[Figure: a node consisting of application, state machine, and detector, attached to the network; the detector reports "Node X is faulty!" to the application.]

11 Only observable faults can be detected
- Two classes of observable faults:
  - Detectable faultiness: Node breaks the protocol
  - Detectable ignorance: Node refuses to respond
- As long as the faulty node continues to follow the protocol, BFD cannot detect this!
[Figure: three runs among nodes A, B, C, each with "Set X=5", "OK", "Get X". Correct: the reply to "Get X" is 5. Detectably ignorant: "Get X" gets no reply. Detectably faulty: the reply to "Get X" is 7.]
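The three runs in the figure can be told apart purely from what an observer sees. The following is a minimal sketch under that assumption; the `Verdict` enum and `classify_reply` function are hypothetical names for illustration, not part of the BFD definitions in the paper.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    CORRECT = "correct"
    DETECTABLY_IGNORANT = "detectably ignorant"
    DETECTABLY_FAULTY = "detectably faulty"


def classify_reply(expected: str, reply: Optional[str]) -> Verdict:
    """Classify one observed reply against what the protocol prescribes.

    reply is None when the node never answered the request.
    """
    if reply is None:
        return Verdict.DETECTABLY_IGNORANT  # node refuses to respond
    if reply != expected:
        return Verdict.DETECTABLY_FAULTY    # node breaks the protocol
    return Verdict.CORRECT                  # indistinguishable from a correct node


if __name__ == "__main__":
    # After "Set X=5", the protocol prescribes "5" as the reply to "Get X".
    for reply in ("5", None, "7"):
        print(reply, classify_reply("5", reply).value)
```

A compromised node that still sends the prescribed reply lands in the `CORRECT` branch, which mirrors the slide's point that only observable faults can be detected.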

12 BFD can give strong guarantees
- Three types of detector output
  - Trusted, suspected, exposed
- Strong completeness
  - "No false negatives"
- Strong accuracy
  - "No false positives"
- Precise definitions are in the paper
[Figure: detector output states Trusted, Suspected, Exposed]

13 Outline
- Introduction
- BFD abstraction
- PeerReview algorithm
- Conclusion

14 Assumptions
1. Protocol can be modeled as a deterministic state machine
2. Each node has a strong identity, as well as a public/private keypair for signing messages
3. The faulty nodes cannot
   - prevent two correct nodes from communicating
   - break the cryptographic keys

15 Secure logging
- All messages are signed and acknowledged
- Each node keeps a log of all local inputs and outputs
- Nodes must commit to the contents of their log
  - Log is tamper-evident [Maniatis02]
[Figure: nodes A, B, C. B's log: Rcv(A, "Set X=5"), Send(A, "Okay"), Rcv(C, "Get X"), Send(C, "5"). Matching entries at the other nodes: Snd(B, "Set X=5"), Rcv(B, "Okay"), Snd(B, "Get X"), Rcv(B, "5").]
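One common way to make a log tamper-evident is a hash chain, in which each entry's hash covers the previous hash. The slide does not spell out the construction, so the sketch below is an assumption for illustration: it uses SHA-256, omits the signatures mentioned above, and `TamperEvidentLog` and `verify` are hypothetical names.

```python
import hashlib
from typing import List, Sequence, Tuple

Entry = Tuple[str, str, str]  # (kind, peer, payload), e.g. ("Rcv", "A", "Set X=5")
GENESIS = b"\x00" * 32        # assumed starting value for the chain


def chain_hash(prev: bytes, entry: Entry) -> bytes:
    """Hash of an entry bound to everything that precedes it."""
    return hashlib.sha256(prev + repr(entry).encode("utf-8")).digest()


class TamperEvidentLog:
    def __init__(self) -> None:
        self.entries: List[Entry] = []
        self.top: bytes = GENESIS

    def append(self, kind: str, peer: str, payload: str) -> bytes:
        """Append an entry and return the new top hash.

        The top hash is the commitment a node would sign and hand out
        (signing is omitted in this sketch).
        """
        entry = (kind, peer, payload)
        self.top = chain_hash(self.top, entry)
        self.entries.append(entry)
        return self.top


def verify(entries: Sequence[Entry], committed_top: bytes) -> bool:
    """Recompute the chain and compare it with a previously committed top hash."""
    h = GENESIS
    for entry in entries:
        h = chain_hash(h, entry)
    return h == committed_top


if __name__ == "__main__":
    log = TamperEvidentLog()
    log.append("Rcv", "A", "Set X=5")
    log.append("Send", "A", "Okay")
    log.append("Rcv", "C", "Get X")
    commitment = log.append("Send", "C", "5")

    forged = list(log.entries)
    forged[-1] = ("Send", "C", "7")  # rewrite history after committing
    print(verify(log.entries, commitment), verify(forged, commitment))
```

Because every hash depends on all earlier entries, a node that has handed out a commitment cannot later alter, reorder, or drop entries without the mismatch showing up in `verify`.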

16 Detecting ignorance
- If a node refuses to acknowledge a message
  - Send message as evidence to other nodes
  - Correct nodes will challenge the ignorant node to prove that its log contains a 'Rcv' entry for that message
  - A correct node can always respond
[Figure: nodes A, B, C. B's log: Rcv(A, "Set X=5"), Send(A, "Okay"), Recv(C, "Get X").]
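The challenge step can be sketched as a function that forwards the unacknowledged message and checks the returned log for a matching 'Rcv' entry. This is a simplified illustration, not the PeerReview protocol itself: it skips signature and log-commitment checks, and the names `challenge`, `make_correct_node`, and `ignorant_node` are hypothetical. The "trusted" and "suspected" labels follow the detector outputs named earlier in the talk.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Entry = Tuple[str, str, str]  # (kind, peer, payload)
Responder = Callable[[str, str], Optional[Sequence[Entry]]]


def has_rcv_entry(log: Sequence[Entry], sender: str, message: str) -> bool:
    """True if the log records receipt of this message from this sender."""
    return ("Rcv", sender, message) in log


def challenge(respond: Responder, sender: str, message: str) -> str:
    """Challenge a node with a message it allegedly ignored.

    Returns "trusted" if the node produces a log containing the Rcv entry,
    otherwise "suspected". A real detector would also check the log against
    the node's signed commitment; that step is omitted here.
    """
    log = respond(sender, message)
    if log is not None and has_rcv_entry(log, sender, message):
        return "trusted"
    return "suspected"


def make_correct_node() -> Responder:
    """A correct node accepts the forwarded message and can always respond."""
    log: List[Entry] = []

    def respond(sender: str, message: str) -> Sequence[Entry]:
        entry = ("Rcv", sender, message)
        if entry not in log:
            log.append(entry)
        return list(log)

    return respond


def ignorant_node(sender: str, message: str) -> Optional[Sequence[Entry]]:
    """An ignorant node never answers the challenge."""
    return None


if __name__ == "__main__":
    print(challenge(make_correct_node(), "C", "Get X"))
    print(challenge(ignorant_node, "C", "Get X"))
```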

17 Detecting faultiness
- Nodes can audit each other's log at any time
  - Auditors replay input in the log, compare output
- If a divergence is detected
  - Send log as evidence to other nodes
  - Other nodes can repeat the same procedure to check whether the node is really faulty (no he-said-she-said!)
[Figure: nodes A, B, C. The auditor replays B's log on B', the state machine B is expected to run, starting from snapshots. B's log: Rcv(A, "Set X=5"), Send(A, "Okay"), Rcv(B, "Get X"), Send(B, "7"). Replay on B': Rcv(A, "Set X=5"), Send(A, "Okay"), Rcv(B, "Get X"), Send(B, "5").]
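Auditing by replay relies on the assumption that the protocol is a deterministic state machine: feeding the logged inputs to a reference copy must reproduce the logged outputs. The sketch below assumes a toy key-value protocol matching the "Set X=5" / "Get X" example; `KVStateMachine` and `audit` are hypothetical names, and snapshots and signature checks are left out.

```python
from typing import Dict, List, Optional, Tuple

Entry = Tuple[str, str, str]  # (kind, peer, payload)


class KVStateMachine:
    """Deterministic reference: "Set K=v" replies "Okay", "Get K" replies the value."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}

    def handle(self, message: str) -> str:
        if message.startswith("Set "):
            key, value = message[4:].split("=", 1)
            self.store[key] = value
            return "Okay"
        if message.startswith("Get "):
            return self.store.get(message[4:], "")
        raise ValueError(f"unknown message: {message!r}")


def audit(log: List[Entry]) -> Optional[int]:
    """Replay the logged inputs and compare against the logged outputs.

    Returns the index of the first Send entry that diverges from the
    reference state machine, or None if the log is consistent.
    """
    machine = KVStateMachine()
    pending: Dict[str, str] = {}  # peer -> reply the reference machine produced
    for index, (kind, peer, payload) in enumerate(log):
        if kind == "Rcv":
            pending[peer] = machine.handle(payload)
        elif kind == "Send":
            if pending.pop(peer, None) != payload:
                return index
    return None


if __name__ == "__main__":
    faulty_log = [
        ("Rcv", "A", "Set X=5"),
        ("Send", "A", "Okay"),
        ("Rcv", "C", "Get X"),
        ("Send", "C", "7"),
    ]
    print(audit(faulty_log))
```

Since the check is deterministic, any other node given the same log reaches the same verdict, which is what makes the log usable as evidence.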

18 Summary
- New approach: Byzantine Fault Detection
  - Alternative to fault tolerance
  - Provides accountability
- Fault Detection can give strong guarantees
  - Eventual strong accuracy and completeness
- Early results indicate Fault Detection is practical
  - Example: PeerReview algorithm