EECS 498 Introduction to Distributed Systems Fall 2017

Slides:



Advertisements
Similar presentations
Remus: High Availability via Asynchronous Virtual Machine Replication
Advertisements

Replication techniques Primary-backup, RSM, Paxos Jinyang Li.
1 Cheriton School of Computer Science 2 Department of Computer Science RemusDB: Transparent High Availability for Database Systems Umar Farooq Minhas 1,
EEC 688/788 Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Distributed Systems Fall 2010 Replication Fall 20105DV0203 Outline Group communication Fault-tolerant services –Passive and active replication Highly.
CS 582 / CMPE 481 Distributed Systems Fault Tolerance.
EEC 688/788 Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering.
EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering.
Distributed Systems Fall 2009 Replication Fall 20095DV0203 Outline Group communication Fault-tolerant services –Passive and active replication Highly.
EEC 688/788 Secure and Dependable Computing Lecture 13 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Remus: High Availability via Asynchronous Virtual Machine Replication.
EEC-681/781 Distributed Computing Systems Lecture 11 Wenbing Zhao Cleveland State University.
Configuration.
Remus: VM Replication Jeff Chase Duke University.
EEC 688/788 Secure and Dependable Computing Lecture 7 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Chapter 2 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University Building Dependable Distributed Systems.
EEC 688/788 Secure and Dependable Computing Lecture 6 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Lecture 3 – MapReduce: Implementation CSE 490h – Introduction to Distributed Computing, Spring 2009 Except as otherwise noted, the content of this presentation.
Primary-Backup Replication COS 418: Distributed Systems Lecture 5 Kyle Jamieson.
Primary-Backup Replication
Distributed Systems – Paxos
CPS 512 midterm exam #1, 10/7/2016 Your name please: ___________________ NetID:___________ /60 /40 /10.
EEC 688/788 Secure and Dependable Computing
CSE 486/586 Distributed Systems Global States
Replication State Machines via Primary-Backup
View Change Protocols and Reconfiguration
Introduction to Operating Systems
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
湖南大学-信息科学与工程学院-计算机与科学系
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Replication Improves reliability Improves availability
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EECS 498 Introduction to Distributed Systems Fall 2017
EEC 688/788 Secure and Dependable Computing
Components of a CPU AS Computing - F451.
Lecture 21: Replication Control
Replication State Machines via Primary-Backup
EEC 688/788 Secure and Dependable Computing
Fault-Tolerant State Machine Replication
EECS 498 Introduction to Distributed Systems Fall 2017
Replication State Machines via Primary-Backup
Lecture 10: Coordination and Agreement
EEC 688/788 Secure and Dependable Computing
Distributed Systems CS
View Change Protocols and Reconfiguration
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Lecture 11: Coordination and Agreement
EEC 688/788 Secure and Dependable Computing
CSE 486/586 Distributed Systems Global States
Lecture 21: Replication Control
IS 698/800-01: Advanced Distributed Systems State Machine Replication
Presentation transcript:

EECS 498 Introduction to Distributed Systems Fall 2017 Harsha V. Madhyastha

RSMs and Failures Logical clocks enable all replicas in an RSM to execute updates in same order But, cannot handle failure of any replica! To deal with crash failures: Chandy-Lamport snapshot for coordinated checkpoint Log sources of non-determinism between checkpoints Today: Dealing with fail-stop failures September 20, 2017 EECS 498 – Lecture 5

Types of failures Crash failures Fail stop Can resume with saved state All state is lost upon failure Need to replicate state September 20, 2017 EECS 498 – Lecture 5

Primary Backup Replication Client Primary Backup Backup September 20, 2017 EECS 498 – Lecture 5

Primary Backup Replication How to handle primary failure? Promote one of the backups as the primary How to handle backup failure? Add another machine as a backup September 20, 2017 EECS 498 – Lecture 5

Primary Backup Sync When should primary sync up with backups? What should be transferred when syncing? September 20, 2017 EECS 498 – Lecture 5

Example: MapReduce Master RegisterServer() { while (1) { receive msg and parse addr pick task to assign mark task as assigned respond with task assignment } Primary & Backups in sync When to sync again? September 20, 2017 EECS 498 – Lecture 5

Takeaways Cannot tolerate primary failure after update to its state is externally visible Corollary: Okay for primary to be out of sync with backup until change is externally visible External consistency Implications for when primary should sync with backups? September 20, 2017 EECS 498 – Lecture 5

Primary-Backup Sync C1 C2 P B September 20, 2017 EECS 498 – Lecture 5

Reads vs. Updates For operations that do not update state, primary need not consult backups When is this true? If backup is externally consistent with primary  If backup takes over as primary, it will generate identical response as primary may have September 20, 2017 EECS 498 – Lecture 5

What to transfer in sync? Snapshot of primary’s state Slow When is this necessary? Necessary when bootstrapping new backup Every operation Why is this okay? Leverage determinism of state machine September 20, 2017 EECS 498 – Lecture 5

Service Development Getting coordination right between primary and backups is tricky Easy to mess up Must make replication transparent to developer September 20, 2017 EECS 498 – Lecture 5

RSM with Logical Clocks Application Application Ordered Updates Server1 Server2 Replicated State Machine Replicated State Machine Updates Updates September 20, 2017 EECS 498 – Lecture 5

Transparent Primary Backup Application relies on library to keep primary and backups in sync Receive message from client Sync with backups before sending response to client Will this solution work? Does not capture non-determinism in execution September 20, 2017 EECS 498 – Lecture 5

Virtual Machine Monitor Virtual Machines Applications Process File system Virtual memory Virtual Machine Operating System Hardware Virtual Machine Monitor CPU Disk RAM September 20, 2017 EECS 498 – Lecture 5

VMM-based Primary Backup Primary and backup execute on two virtual machines Primary logs inputs and outputs Backup applies inputs from log Primary waits for backup output Primary-backup monitor each other If primary fails, backup takes over September 20, 2017 EECS 498 – Lecture 5

Log-based VM replication VMM at primary logs external inputs and causes of non-determinism Example: Log results of non-deterministic instructions e.g., timestamp counter read (RDTSC) Random number generation? September 20, 2017 EECS 498 – Lecture 5

Log-based VM replication VMM at primary sends log entries to VMM at backup over the logging channel Backup hypervisor replays log entries Stops backup VM at next input event or non-deterministic instruction Delivers same input as pimary Delivers same result to non-deterministic instruction as primary September 20, 2017 EECS 498 – Lecture 5

Virtual Machine I/O VM inputs Incoming network packets Disk reads Keyboard and mouse events Clock timer interrupt events Results of non-deterministic instructions VM outputs Outgoing network packets Disk writes Why log outputs? September 20, 2017 EECS 498 – Lecture 5

Example Scenario Primary Backup Write input to log Apply input Produce output Fail! Take over as primary Read from log and apply input Timer interrupt fires Produce output Backup does not know timing of output relative to timer interrupt Execution of interrupt handler may affect output September 20, 2017 EECS 498 – Lecture 5

Optimizing Performance Primary must buffer output until ACK from backup Why? Slow from client perspective! How to optimize performance? Pipeline sync with backup and execution at primary September 20, 2017 EECS 498 – Lecture 5

VMM-based Replication September 20, 2017 EECS 498 – Lecture 5