Distributed Storage Systems: Data Replication using Quorums

Background
–software/object replication focuses on the dependability of computations
–what if we are primarily concerned with the integrity and availability (and perhaps confidentiality) of data?
–data replication has several advantages:
 –simpler protocols (no atomic-ordering requirement, no determinism assumptions)
 –a variety of consistency models provides efficiency/consistency tradeoffs
 –better performance than software replication: simpler protocols can be implemented efficiently; caching and hashing help

Distributed Storage Model
(figure: clients issue read/write operations over a LAN or WAN to a distributed data store)

Basic Quorum Systems
Gifford, "Weighted Voting for Replicated Data," Proc. SOSP, 1979
–each client reads from r servers and writes to w servers
–if r + w > n, then the intersection of every pair of read/write sets is non-empty ⇒ every read will see at least one copy of the latest value written
–r = 1 and w = n ⇒ full replication (write-all, read-one): undesirable when servers can be unavailable, because writes are not guaranteed to complete
–best performance (throughput/availability) when 1 < r < w < n, because reads are more frequent than writes in most applications
–generalization: r and w may vary across clients, as long as a non-empty intersection between all read/write sets is maintained
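
The r + w > n condition can be checked exhaustively for small configurations. A minimal Python sketch (the helper name is hypothetical, not from the slides):

```python
import itertools

def quorums_intersect(n, r, w):
    """Return True iff every read set of size r intersects every
    write set of size w drawn from n servers (brute force)."""
    servers = range(n)
    return all(set(rs) & set(ws)
               for rs in itertools.combinations(servers, r)
               for ws in itertools.combinations(servers, w))

# r + w > n guarantees a non-empty intersection
assert quorums_intersect(5, 3, 3)       # 3 + 3 > 5
assert not quorums_intersect(5, 2, 3)   # 2 + 3 = 5: disjoint sets exist
```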

Basic Quorum Systems: Protocols
–time stamps are maintained for each object in the store
–read protocol for a data object V:
 –read from all servers in some read set
 –select the value v with the latest time stamp t
–write protocol for a data object V:
 –read the value according to the above protocol to determine the current time stamp t
 –write to all servers in some write set with time stamp t' > t
–guarantees serial consistency (sometimes called safe semantics): a read operation that is not concurrent with any write operation on the same object returns the last value written in some serialization of the preceding writes to that object
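
The timestamped read/write protocol above can be sketched in a few lines of Python (an in-memory stand-in for the servers; all names are hypothetical):

```python
def quorum_read(servers, read_set, obj):
    """Read from every server in a read set and return the
    (value, timestamp) pair with the latest timestamp."""
    replies = [servers[s][obj] for s in read_set]
    return max(replies, key=lambda vt: vt[1])

def quorum_write(servers, read_set, write_set, obj, value):
    """Learn the current timestamp via a quorum read, then write
    the new value with a larger timestamp to a write set."""
    _, t = quorum_read(servers, read_set, obj)
    for s in write_set:
        servers[s][obj] = (value, t + 1)

# 5 servers, r = 3, w = 3 (r + w > n): a later read sees the new value
servers = {i: {"V": ("old", 0)} for i in range(5)}
quorum_write(servers, [0, 1, 2], [2, 3, 4], "V", "new")
assert quorum_read(servers, [0, 1, 2], "V") == ("new", 1)
```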

Serial Consistency
(figure: a timeline of writes W1, W2, W3, W4 followed by reads R1 and R2; R1 and R2 both read the value written by W3)

Serial Consistency, cont.
(figure: writes W1, W2, W3, W4 and reads R1, R2; R1 and R2 both read the same value, which could be the result of either W3 or W4, i.e., the serialization could be W1, W2, W3, W4 or W1, W2, W4, W3, but it must be seen the same way by R1 and R2)
–how can this be achieved?
–how are the timestamps for W3 and W4 set?

Serial Consistency, cont.
(figure: read R1 overlaps with write W4; R1 can return either the value written by W4 or the value written by W3 – serial consistency does not cover the case of a read that overlaps with a write)

Quorum Systems: Formalization
–assume a universe U of servers, where |U| = n
–an asymmetric quorum system Q1, Q2 ⊆ 2^U is a pair of non-empty sets of subsets of U, where ∀ Qr ∈ Q1, Qw ∈ Q2: Qr ∩ Qw ≠ Ø
–a symmetric quorum system Q ⊆ 2^U is a non-empty set of subsets of U, where ∀ Q1, Q2 ∈ Q: Q1 ∩ Q2 ≠ Ø (read sets and write sets are identical)
–each Q ∈ Q is called a quorum (for an asymmetric system, there are read quorums and write quorums)
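
The symmetric definition translates directly into a brute-force checker over explicit sets (a sketch; the helper name is hypothetical):

```python
import itertools

def is_symmetric_quorum_system(quorums):
    """Check the pairwise-intersection property of a symmetric
    quorum system given as an explicit collection of server sets."""
    qs = [set(q) for q in quorums]
    return bool(qs) and all(q1 & q2 for q1 in qs for q2 in qs)

# majorities of 5 servers form a quorum system; singletons do not
majorities = list(itertools.combinations(range(5), 3))
assert is_symmetric_quorum_system(majorities)
assert not is_symmetric_quorum_system([{0}, {1}])
```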

Byzantine Quorum Systems
Malkhi and Reiter, Distributed Computing, Vol. 11, No. 4, 1998
–uses an asynchronous system model and assumes servers can fail or be compromised, i.e., servers can experience Byzantine faults
–increase the size of the quorum intersections to mask responses from faulty servers
–if every intersection has size at least 2 f_max + 1, where f_max is the maximum number of faulty servers, then at least f_max + 1 correct servers will match in any quorum intersection
–problem: with an asynchronous system model, slow servers cannot be distinguished from faulty servers

Threshold Masking Quorum Systems
–|(Qw ∩ Qr) \ F| ≥ f_max + 1 servers are correct and up to date
–Qr \ Qw can be out of date, and F can be arbitrary
–if f_max faulty servers match out-of-date values in Qr \ Qw, then f_max + 1 or more matching old values can exist
–so it is not sufficient to accept the first f_max + 1 values that match
(figure: Venn diagram of Qr and Qw with the faulty set F, showing the regions Qw \ Qr, Qr \ Qw, and (Qw ∩ Qr) \ F)

Threshold Masking Quorums (cont.)
–suppose a read operation has returned f_max + 1 values that match, but some values have not yet been returned
–if the servers that have not responded are correct but slow, the result might change after more values arrive; if the non-responding servers are faulty, they might never respond
–if responses arrive from a full quorum, pick the newest value that has at least f_max + 1 representatives (in some cases, the correct result can be determined without waiting for all responses)
–the approach needs an additional constraint: a quorum consisting solely of correct servers must exist at all times to ensure progress
–can contact |Q| + f_max servers initially to guarantee a quorum of responses, or can contact different quorums sequentially until a full quorum responds

Threshold Masking Quorums (cont.)
–read protocol for a data object V:
 –read from all servers in U
 –wait until |Q| responses have been received from some quorum Q
 –select the value with the latest time stamp that has at least f_max + 1 representatives
–write protocol for a data object V:
 –read the value according to the above protocol to determine the current time stamp t
 –write to all servers in some quorum Q with time stamp t' > t
–guarantees serial consistency as long as a "fully correct" quorum exists at all times
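
The client-side selection step of this read protocol can be illustrated with a simplified Python sketch (names hypothetical, transport omitted):

```python
from collections import Counter

def select_masked_value(replies, f_max):
    """Given (value, timestamp) replies from a full quorum, return
    the newest value vouched for by at least f_max + 1 servers."""
    counts = Counter(replies)
    supported = [vt for vt, c in counts.items() if c >= f_max + 1]
    if not supported:
        return None          # no value has enough matching replies
    return max(supported, key=lambda vt: vt[1])

# f_max = 1: a single faulty reply cannot impose a bogus "newer" value
replies = [("v2", 2), ("v2", 2), ("bogus", 9), ("v1", 1)]
assert select_masked_value(replies, 1) == ("v2", 2)
```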

Threshold Masking Quorums: Sufficient Condition
–if n > 4 f_max and a symmetric quorum system is used with quorums of size |Q| = ⌈(n + 2 f_max + 1)/2⌉, then the preceding read/write protocols are correct and guarantee progress
–|Qw ∩ Qr| ≥ 2 f_max + 1
–n > 4 f_max ⇒ ⌈(n + 2 f_max + 1)/2⌉ ≤ n − f_max ⇒ at least one "fully correct" quorum exists at all times
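
These inequalities can be verified numerically for small n and f_max (a quick Python check, not from the slides):

```python
import math

def quorum_size(n, f):
    """Threshold-masking quorum size from the condition above."""
    return math.ceil((n + 2 * f + 1) / 2)

def condition_holds(n, f):
    q = quorum_size(n, f)
    intersection = 2 * q - n          # minimum pairwise intersection
    return intersection >= 2 * f + 1 and q <= n - f

# n > 4 f_max suffices for both masking and progress
assert all(condition_holds(n, f)
           for f in range(1, 6)
           for n in range(4 * f + 1, 4 * f + 30))
```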

Grid Masking Quorum Systems
–arrange the n servers in a √n × √n grid
–a quorum is any 2 f_max + 1 rows plus 1 column of servers ⇒ the intersection between any 2 quorums contains ≥ 2 f_max + 1 servers
(figure: a grid of servers with 2 f_max + 1 rows and one column highlighted)
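
The grid construction and its intersection bound can be checked with a small sketch (assumes n is a perfect square; names hypothetical):

```python
import itertools
import math

def grid_quorums(n, f):
    """Enumerate all grid-masking quorums for n = k*k servers:
    any 2f+1 rows plus any 1 column (servers numbered row-major)."""
    k = math.isqrt(n)
    assert k * k == n, "n must be a perfect square"
    rows = [set(range(r * k, (r + 1) * k)) for r in range(k)]
    cols = [set(range(c, n, k)) for c in range(k)]
    return [set().union(*chosen) | col
            for chosen in itertools.combinations(rows, 2 * f + 1)
            for col in cols]

# 6x6 grid, f_max = 1: every pair of quorums shares >= 2*f_max + 1 servers
qs = grid_quorums(36, 1)
assert all(len(q1 & q2) >= 3
           for q1, q2 in itertools.combinations(qs, 2))
```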

System Load
–load = average fraction of servers that must be contacted per read/write operation
–for symmetric quorum systems, load = |Q| / n
 –threshold masking: load = ⌈(n + 2 f_max + 1)/2⌉ / n ∈ (0.5, 0.75)
 –grid masking: load ≈ (2 f_max + 2) n^(1/2) / n = O(f_max / n^(1/2)) → 0 as n → ∞ (for small f_max)
–for asymmetric quorum systems, load depends on the frequency of reads vs. writes: load = [c |Qr| + (1 − c) |Qw|] / n, where c = fraction of accesses that are reads
 –write-all, read-one: load = 1 − c + c/n > 1 − c
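
The load formulas above can be evaluated directly (a small sketch; function names hypothetical):

```python
import math

def threshold_load(n, f):
    """Symmetric threshold-masking load: |Q| / n."""
    return math.ceil((n + 2 * f + 1) / 2) / n

def asymmetric_load(n, c, qr, qw):
    """Load with read fraction c, read quorum size qr, write quorum size qw."""
    return (c * qr + (1 - c) * qw) / n

assert 0.5 < threshold_load(100, 5) < 0.75
# write-all, read-one with 90% reads: load = 1 - c + c/n = 0.19 for n = 10
assert abs(asymmetric_load(10, 0.9, 1, 10) - 0.19) < 1e-9
```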

Byzantine Quorum System Variations
–dissemination quorum systems:
 –work with self-verifying data
 –sufficient that the quorums' intersection contains at least one non-faulty server, i.e., intersection size ≥ f_max + 1
 –quorum size = ⌈(n + f_max + 1)/2⌉, n > 3 f_max
–opaque masking quorum systems:
 –clients don't need to know f_max (harder for an adversary to attack the system)
 –read protocol simply does a majority vote among responses
 –n > 5 f_max (need to outvote faulty + out-of-date servers)
 –load for any opaque masking quorum system > 1/2
–dealing with faulty/malicious clients:
 –cannot prevent faulty but authorized clients from writing bad data to objects for which they have write permission
 –have servers execute an agreement protocol before installing written values, to prevent faulty clients from leaving the servers in an inconsistent state

Dynamic Byzantine Quorum Systems
Alvisi et al., Proceedings of DSN 2000
–allow f to vary dynamically but keep n fixed
–when f is small, the protocols are quite efficient; when f is large, the protocols are more expensive
–the variable f holds the current estimated number of faulty servers and can vary in the interval [f_min, f_max]
–f is read and written according to a quorum protocol (omitted) and is assumed to be correct at all times

Dynamic Byz. Quorum Systems (cont.)
–read protocol for a data object V:
 –read the current fault threshold f
 –read from all servers in U
 –wait until ⌈(n + 2f + 1)/2⌉ responses have been received
 –consider the value v with the latest time stamp that has at least f_min + 1 representatives
 –if v has at least f + 1 representatives, select v
 –else, wait for f − f_min additional responses and select the value with the latest time stamp that has at least f + 1 representatives
–write protocol for a data object V:
 –read the current fault threshold f
 –read the value according to the above protocol to determine the current time stamp t
 –write to a quorum Q of size ⌈(n + 2f + 1)/2⌉ with time stamp t' > t
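
The selection rule in the dynamic read protocol can be sketched as a pure function over the replies gathered so far (a simplification; it ignores the transport and the extra waiting round, and all names are hypothetical):

```python
from collections import Counter

def dynamic_select(replies, f, f_min):
    """Pick a value from quorum replies under the dynamic protocol:
    consider values with >= f_min + 1 matching replies, and accept
    the newest one only once it has >= f + 1 matching replies."""
    counts = Counter(replies)
    candidates = [vt for vt, c in counts.items() if c >= f_min + 1]
    if not candidates:
        return None
    newest = max(candidates, key=lambda vt: vt[1])
    return newest if counts[newest] >= f + 1 else None  # None: keep waiting

# f = 2, f_min = 1: three matching replies are enough support
replies = [("v3", 3)] * 3 + [("v2", 2)] * 2
assert dynamic_select(replies, 2, 1) == ("v3", 3)
```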

Dynamic Byz. Quorum Systems (cont.)
–|Qw| ≥ ⌈(n + 2 f_min + 1)/2⌉
–|Qr| = ⌈(n + 2f + 1)/2⌉
–|Qr ∩ Qw| ≥ f + f_min + 1
–⇒ at least f_min + 1 correct, up-to-date values in Qr ∩ Qw
(figure: Venn diagram of Qr and Qw with the faulty set F, showing the regions Qw \ Qr, Qr \ Qw, and (Qr ∩ Qw) \ F)