
1 Dynamic Atomic Storage Without Consensus
Alex Shraer (Technion)
Joint work with: Marcos K. Aguilera (MSR), Idit Keidar (Technion), Dahlia Malkhi (MSR)

2 The Goal
Reliable replicated storage using unreliable components.
Asynchrony: tolerate unpredictable network delays.
(Figure: a client issuing requests to a set of servers/processes.)

3 Designing an Asynchronous Replicated System
State machine replication (e.g., Paxos)
– Any object
– Impossible in asynchronous systems
Atomic R/W register [Attiya, Bar-Noy, Dolev 95]
– Simple object: read(), write(v)
– Possible in an asynchronous system
– Atomic (linearizable)
– Liveness: if #failures < #servers/2, then every operation invoked on a correct server eventually completes
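To make the static majority-quorum idea concrete, here is a minimal single-writer sketch in the style of ABD, simulating servers as in-process objects. This is illustrative only (the names Server, abd_write, abd_read, and the timestamp scheme are ours, not the paper's code); real ABD runs over message passing and waits for any responsive majority.

```python
# Minimal ABD-style atomic register sketch; servers simulated in-process.

class Server:
    def __init__(self):
        self.ts, self.val = 0, None   # highest timestamp seen, and its value

    def read(self):
        return self.ts, self.val

    def write(self, ts, val):
        if ts > self.ts:              # keep only the freshest (ts, val) pair
            self.ts, self.val = ts, val

def majority(servers):
    # A static majority quorum; any two majorities of the same set intersect.
    return servers[:len(servers) // 2 + 1]

def abd_write(servers, val):
    # Phase 1: learn the highest timestamp from a majority.
    ts = max(s.read()[0] for s in majority(servers))
    # Phase 2: store the new value with a larger timestamp at a majority.
    for s in majority(servers):
        s.write(ts + 1, val)

def abd_read(servers):
    # Phase 1: pick the freshest value seen at a majority.
    ts, val = max((s.read() for s in majority(servers)), key=lambda p: p[0])
    # Phase 2 (write-back): ensure a majority stores it, for atomicity.
    for s in majority(servers):
        s.write(ts, val)
    return val

servers = [Server() for _ in range(5)]
abd_write(servers, "x")
assert abd_read(servers) == "x"
```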

4 Breaking the Minority Barrier
Over a long period of time, "#failures < #servers/2" is not good enough.
Reconfiguration!
– Increase resilience by changing the set of servers
– Example: 3 failures out of 5 servers (A, B, C, D, E)
Semantics of a reconfigurable R/W register:
– Atomic (linearizable)
– Liveness: ?
Our first contribution: the first "black box" definition (in terms of the user interface).

Reconfigurable Register: User Interface
read() (returns a value)
write(value) (returns OK)
reconfig(c) (returns OK)
– c is a set of changes (relative to the current configuration)
– Each change is either (Add, pid) or (Remove, pid)
– Example: c = {+C, +E, –D}
Only processes that were successfully added can invoke operations.
Universe of processes (servers):
– Unknown, unbounded, possibly infinite
– At any given time, only a finite number have been added
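As a reading aid, the interface can be written down as a small abstract API. This is a sketch; the names ReconfigurableRegister and Change are ours, not the paper's.

```python
# Hedged sketch of the user interface; all names are illustrative.
from abc import ABC, abstractmethod
from typing import FrozenSet, Tuple

# A change is ('Add', pid) or ('Remove', pid), e.g. ('Add', 'E').
Change = Tuple[str, str]

class ReconfigurableRegister(ABC):
    @abstractmethod
    def read(self) -> object:
        """Return the register's current value."""

    @abstractmethod
    def write(self, value: object) -> None:
        """Store value; completes by returning OK."""

    @abstractmethod
    def reconfig(self, changes: FrozenSet[Change]) -> None:
        """Apply a set of changes relative to the current configuration,
        e.g. frozenset({('Add', 'C'), ('Add', 'E'), ('Remove', 'D')})."""
```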

Definitions
Current(t) – servers in the system at time t (the "current configuration")
AddPending(t) – servers whose Add is pending at time t
RemovePending(t) – servers whose Remove is pending at time t
Faulty(t) – servers that have crashed by time t
p_i is active in an execution if:
– p_i does not crash during the execution
– Some process invokes a reconfig adding p_i
– No process invokes a reconfig removing p_i

Dynamic System Liveness
Static system: operations complete if #failures < #servers/2. What should this be in a dynamic system?
Try #1: for every t, a minority of Current(t) is in Faulty(t).
– What if processes crash while others are removed? No operation is guaranteed to complete in the new configuration!
Try #2: for every t, a minority of Current(t) is in Faulty(t) ∪ RemovePending(t).
(Figure: configuration {A, B, C}; reconfig({–A}) returns OK.)

Adding Servers
Q: At time t0, who can crash from {A, B, ..., G}?
A: a minority of {A, B, ..., E}, and in addition:
– in this scenario, G can crash
– in a different scenario, F can crash
Simple condition: any 2 servers can fail (fewer than |Current(t)|/2).
(Figure: timeline up to t0; servers A–E, with reconfig({+F}) and reconfig({+G}) invoked.)

Dynamic Service Liveness
If the number of reconfigs invoked in the execution is finite, and at every time t in the execution fewer than |Current(t)|/2 processes out of Current(t) ∪ AddPending(t) are in Faulty(t) ∪ RemovePending(t),
Then:
– Eventually, every active process that was successfully added can invoke operations
– Every operation invoked by an active process eventually completes
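The failure condition is easy to state as a predicate over the sets defined above. A minimal sketch (the function and variable names are ours), checked against the "Adding Servers" scenario:

```python
# Hedged sketch: the dynamic liveness failure condition as a set predicate.

def condition_holds(current, add_pending, remove_pending, faulty):
    """True iff fewer than |Current(t)|/2 processes out of
    Current(t) union AddPending(t) are in Faulty(t) union RemovePending(t)."""
    suspects = (current | add_pending) & (faulty | remove_pending)
    return len(suspects) < len(current) / 2

# The "Adding Servers" scenario: {A..E} current, F and G being added.
current = {'A', 'B', 'C', 'D', 'E'}
print(condition_holds(current, {'F', 'G'}, set(), {'D', 'G'}))  # True: 2 < 2.5
```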

10 Reconfigurable Solutions
Many previous solutions; all use consensus (or something similar):
– State machine replication (Paxos): use the state machine to agree on the set of servers, i.e., consensus to agree on the next configuration
– Virtual synchrony based solutions, e.g., [Yeger-Lotem, Keidar, Dolev 97]: a membership service stronger than consensus (equivalent to ◊P)
– R/W register + reconfiguration service: [Lynch, Shvartsman 97] and [Englert, Shvartsman 00] use one designated "reconfigurer"; Rambo [Lynch, Shvartsman 02], Rambo II [Gilbert, Lynch, Shvartsman 03], and Long-Lived Rambo [Georgiou, Musial, Shvartsman 04] use consensus on the next configuration
Is consensus really necessary?
Our second contribution: consensus is NOT needed! DynaStore is an algorithm for a completely asynchronous system.

"Old" and "New" Configurations
A reconfiguration transfers the state from a majority of the old configuration to a majority of the new configuration.
What if there are concurrent reconfigurations?
Suppose the initial configuration is {A, B, C, D}:
– A invokes reconfig({+E}); C invokes reconfig({–D})
– A writes to {A, D, E}, a majority of {A, B, C, D, E}
– C reads from {B, C}, a majority of {A, B, C}
– No intersection ⇒ atomicity is violated!
Simple solution: consensus on the sequence of configurations. But how can we do this without consensus?
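The violation is easy to check mechanically; a small sketch of the quorum-intersection test for the scenario above (names are ours):

```python
# Hedged sketch: majorities of two *different* configurations need not intersect.

def is_majority(quorum, config):
    return quorum <= config and len(quorum) > len(config) / 2

after_add    = {'A', 'B', 'C', 'D', 'E'}   # A applied reconfig({+E})
after_remove = {'A', 'B', 'C'}             # C applied reconfig({-D})

write_quorum = {'A', 'D', 'E'}
read_quorum  = {'B', 'C'}

assert is_majority(write_quorum, after_add)
assert is_majority(read_quorum, after_remove)
assert write_quorum & read_quorum == set()  # disjoint: the read misses the write
```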

The Approach in DynaStore
For each configuration c, we use a (weak) snapshot object nextConfig(c) to store the next configuration.
– Weak snapshot objects are (easily) implemented in an asynchronous environment
Processes update nextConfig(c) to suggest the next configuration after c (concurrent updates are possible).
Sequence of established configurations (simplified):
– The initial configuration is established
– If c is established, then the first snapshot update to nextConfig(c), i.e., the one included in every scan of nextConfig(c), is the next established configuration after c
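A shared-memory toy model of the weak snapshot object, assuming a single address space (the real object is implemented over asynchronous message passing; the class and method names are ours):

```python
# Hedged toy model of a weak snapshot: update() proposes a configuration,
# scan() returns a set of proposals. The property DynaStore relies on:
# some fixed proposal ("the first") appears in every non-empty scan.

class WeakSnapshot:
    def __init__(self):
        self._proposals = []            # in a real system: one register per process

    def update(self, config):
        self._proposals.append(config)  # concurrent updates may all be retained

    def scan(self):
        return list(self._proposals)    # here, the first proposal is in every scan

ns = WeakSnapshot()
ns.update(frozenset({'A', 'B', 'C', 'D', 'E'}))   # A suggests C1
ns.update(frozenset({'A', 'B', 'C'}))             # C suggests C2
print(ns.scan())  # both suggestions; the first becomes the established one
```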

Transferring the State
A scan of nextConfig(c) returns a set of configurations that follow c.
– If c is established, one configuration in the returned set is the next established configuration after c
Scanning nextConfig for each returned configuration returns a further set, and so on; this creates a DAG of configurations.
– The DAG contains the sequence of established configurations
A reconfiguration transfers state along all paths in the DAG.
– This guarantees that state is transferred along the sequence of established configurations
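The traversal itself is ordinary graph exploration. A minimal sketch, under the assumption that scanning nextConfig(c) is modeled as a function from a configuration to the set of configurations found (names are ours):

```python
# Hedged sketch: explore the configuration DAG rooted at c0 and visit every
# reachable configuration, so state is transferred along *all* paths, and in
# particular along the (unknown) sequence of established configurations.

def traverse(c0, scan):
    """scan(c) models a weak-snapshot scan of nextConfig(c)."""
    visited, frontier = set(), [c0]
    while frontier:
        c = frontier.pop()
        if c in visited:
            continue
        visited.add(c)
        for nxt in scan(c):
            # Real DynaStore reads from a majority of c and writes the latest
            # value found to a majority of each configuration it moves to.
            frontier.append(nxt)
    return visited

C0, C1, C2, C3 = (frozenset(s) for s in ("ABCD", "ABCDE", "ABC", "ABCE"))
dag = {C0: {C1, C2}, C1: {C3}, C2: {C3}, C3: set()}
print(len(traverse(C0, dag.get)))  # 4: every configuration on every path
```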

Example
Suppose the initial configuration is C0 = {A, B, C, D}.
A invokes reconfig({+E}); C invokes reconfig({–D}).
– A updates nextConfig(C0) to C1 = {A, B, C, D, E}
– A scans nextConfig(C0) to check for concurrent updates; the scan returns {C1}, i.e., no concurrent updates are detected
– C1 is the next established configuration after C0
A's state transfer:
– Read from a majority of C0 and a majority of C1
– Write the latest value found to a majority of C1
(Figure: C0 = {A, B, C, D} with an edge to C1 = {A, B, C, D, E}.)

Example (continued)
Suppose the initial configuration is C0 = {A, B, C, D}; A invokes reconfig({+E}); C invokes reconfig({–D}).
– Concurrently, C updates nextConfig(C0) to C2 = {A, B, C} and scans it; the scan returns {C1, C2}, implying that A's update was concurrent
– C updates nextConfig(C1) and nextConfig(C2) to C3 = {A, B, C, E}; no concurrent updates are detected
– C3 is an established configuration
C's state transfer:
– Read from a majority of each configuration on every path found from C0 to C3
– Write the latest value found to a majority of C3
(Figure: the DAG C0 → {C1, C2}, C1 → C3, C2 → C3.)

Example (continued)
Suppose the initial configuration is C0 = {A, B, C, D}; A invokes reconfig({+E}); C invokes reconfig({–D}); A invokes a write(newValue) operation in C1.
In this scenario, DynaStore guarantees:
1. Either C's state transfer finds newValue in C1, or
2. A's write operation discovers C3 and ends only after writing newValue to a majority of C3.
3. Read operations also traverse the DAG, and will find newValue on the path of established configurations, intersecting the write.
(Figure: the same DAG, C0 → {C1, C2} → C3.)
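Replaying the example with the toy WeakSnapshot from above (a sketch; majorities and value transfer are elided, and next_config is our name for the per-configuration snapshot objects):

```python
# Hedged replay of the example DAG using the toy WeakSnapshot above.
C0 = frozenset("ABCD"); C1 = frozenset("ABCDE")
C2 = frozenset("ABC");  C3 = frozenset("ABCE")

next_config = {c: WeakSnapshot() for c in (C0, C1, C2, C3)}

next_config[C0].update(C1)             # A suggests C1...
assert next_config[C0].scan() == [C1]  # ...and sees no concurrent update

next_config[C0].update(C2)             # C suggests C2...
assert set(next_config[C0].scan()) == {C1, C2}  # ...and detects A's update
next_config[C1].update(C3)             # so C converges both branches on C3
next_config[C2].update(C3)

# Both paths C0->C1->C3 and C0->C2->C3 now lead to the established C3.
```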

17 Conclusions
First "black box" definition of a dynamic R/W register:
– In terms of events visible to the user
– A natural failure model: resilience changes dynamically
– Possibly useful for specifying other dynamic problems
DynaStore: the first asynchronous dynamic storage protocol:
– Implements a reconfigurable atomic MWMR register
– In a completely asynchronous system (where consensus is impossible)
– Proves that R/W storage is really easier than consensus (not only in a static system)