Consistency David E. Culler CS162 – Operating Systems and Systems Programming Lecture 35 Nov 19, 2014 Read:

Recap: TCP Flow Control
(Diagram: sender state LastByteAcked(200), LastByteSent(350), LastByteWritten(350); receiver state LastByteRead(100), NextByteExpected(201), LastByteRcvd(350); segments Data[1,100], Data[101,200], Data[201,300], Data[301,350] and Ack=201, AdvWin=50 in flight.)
AdvertisedWindow = MaxRcvBuffer – (LastByteRcvd – LastByteRead)
SenderWindow = AdvertisedWindow – (LastByteSent – LastByteAcked)
WriteWindow = MaxSendBuffer – (LastByteWritten – LastByteAcked)
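As a quick illustration of these window formulas, here is a minimal sketch in Python (not from the slides; the buffer sizes are assumptions loosely based on the diagram's numbers) that computes the three windows from the sender and receiver state above.

```python
def advertised_window(max_rcv_buffer, last_byte_rcvd, last_byte_read):
    # How much new data the receiver can still buffer
    return max_rcv_buffer - (last_byte_rcvd - last_byte_read)

def sender_window(adv_window, last_byte_sent, last_byte_acked):
    # How much more the sender may transmit; never negative,
    # the sender simply may not send anything new.
    return max(0, adv_window - (last_byte_sent - last_byte_acked))

def write_window(max_send_buffer, last_byte_written, last_byte_acked):
    # How much more the sending application can hand to the sending TCP
    return max(0, max_send_buffer - (last_byte_written - last_byte_acked))

# Example with hypothetical numbers based on the diagram:
adv = advertised_window(max_rcv_buffer=300, last_byte_rcvd=350, last_byte_read=100)  # 50
snd = sender_window(adv, last_byte_sent=350, last_byte_acked=200)                    # 0
wrt = write_window(max_send_buffer=400, last_byte_written=350, last_byte_acked=200)  # 250
print(adv, snd, wrt)
```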

Summary: Reliability & Flow Control Flow control: three pairs of producer consumers –Sending process  sending TCP –Sending TCP  receiving TCP –Receiving TCP  receiving process AdvertisedWindow: tells sender how much new data the receiver can buffer SenderWindow: specifies how more the sender can transmit. Depends on AdvertisedWindow and on data sent since sender received AdvertisedWindow WriteWindow: How much more the sending application can send to the sending OS

Discussion
Why not have a huge buffer at the receiver (memory is cheap!)?
The sending window (SndWnd) also depends on network congestion
–Congestion control ensures that a fast sender doesn't overwhelm a router in the network
–Discussed in detail in CS168
In practice there are other sets of buffers in the protocol stack, e.g., at the link layer (the Network Interface Card)

Internet Layering – engineering for intelligence and change
–Application Layer: any distributed protocol (e.g., HTTP, Skype, p2p, the KV protocol in your project)
–Transport Layer: send segments to another process running on the same or a different node (adds Trans. Hdr.)
–Network Layer: send packets to another node, possibly located in a different network (adds Net. Hdr.)
–Datalink Layer: send frames to another node directly connected to the same physical network (adds Frame Hdr.)
–Physical Layer: send bits to another node directly connected to the same physical network
(Diagram: each layer prepends its header to the data handed down from the layer above.)
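To make the encapsulation in the diagram concrete, here is a minimal sketch (Python, with invented header formats and addresses, not part of the slides) in which each layer wraps the payload from the layer above with its own header.

```python
# Hypothetical illustration of layered encapsulation: each layer prepends a
# header to whatever the layer above handed it. Header contents are invented.

def transport_send(app_data: bytes, src_port: int, dst_port: int) -> bytes:
    trans_hdr = f"TCP {src_port}->{dst_port}|".encode()
    return trans_hdr + app_data                      # segment

def network_send(segment: bytes, src_ip: str, dst_ip: str) -> bytes:
    net_hdr = f"IP {src_ip}->{dst_ip}|".encode()
    return net_hdr + segment                         # packet

def datalink_send(packet: bytes, src_mac: str, dst_mac: str) -> bytes:
    frame_hdr = f"ETH {src_mac}->{dst_mac}|".encode()
    return frame_hdr + packet                        # frame

app_data = b"GET /index.html"                        # application-layer payload
frame = datalink_send(
    network_send(
        transport_send(app_data, 51000, 80),
        "10.0.0.1", "93.184.216.34"),
    "aa:bb:cc:dd:ee:01", "aa:bb:cc:dd:ee:02")
print(frame)
```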

The Shared Storage Abstraction
Information (and therefore control) is communicated from one point of computation to another by
–the former storing/writing/sending to a location in a shared address space
–and the latter later loading/reading/receiving the contents of that location
Examples: the memory (address) space of a process, file systems, Dropbox, Google Docs, Facebook, …

What are you assuming?
Writes happen
–Eventually a write will become visible to readers
–Until another write happens to that location
Within a sequential thread, a read following a write returns the value written by that write
–Dependences are respected; here a control dependence
–Each read returns the most recent value written to the location

For example
Write: A := 162
Read: print(A)

What are you assuming?
Writes happen
–Eventually a write will become visible to readers
–Until another write happens to that location
Within a sequential thread, a read following a write returns the value written by that write
–Dependences are respected; here a control dependence
–Each read returns the most recent value written to the location
A sequence of writes will be visible in order
–Control dependences
–Data dependences

For example
Write: A := 162
Write: A := A + 1
Read: print(A)
Observed sequences: 162, 163, 170, 171, … versus 162, 163, 170, 164, 171, …
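As a rough illustration of this example (a Python sketch, not from the slides; thread scheduling is nondeterministic, so the printed samples will vary from run to run): a writer thread increments a shared variable while a reader thread samples it. Under the ordering assumption above, the reader may skip values but should never see them out of order.

```python
import threading
import time

A = 162                      # shared location

def writer():
    global A
    for _ in range(1000):
        A = A + 1            # a sequence of writes to the same location
        time.sleep(0.0001)

def reader():
    samples = []
    for _ in range(50):
        samples.append(A)    # may miss values, but sees them in write order here
        time.sleep(0.001)
    print(samples)           # e.g. 162, 185, 210, ... (monotonically increasing)

t1 = threading.Thread(target=writer)
t2 = threading.Thread(target=reader)
t1.start(); t2.start()
t1.join(); t2.join()
```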

What are you assuming?
Writes happen
–Eventually a write will become visible to readers
–Until another write happens to that location
Within a sequential thread, a read following a write returns the value written by that write
–Dependences are respected; here a control dependence
–Each read returns the most recent value written to the location
A sequence of writes will be visible in order
–Control dependences
–Data dependences
–A reader may not see every write, but the ones it sees are consistent with the order written
All readers see a consistent order
–It is as if the total order were visible to all and each reader took samples of it

For example
Write: A := 162
Write: A := A + 1
Read: print(A) at one reader: 162, 163, 170, 171, …
Read: print(A) at another reader: 164, 170, 186, …

Demo
heets/d/1INjjYqUnFurPLKnnWrexx09Ww5LS5BhNxKt3BoJY6Eg/edit

For example
A := 162 initially
Write: A := 199 (one client)
Write: A := 61 (another client)
Read: print(A), repeatedly
Possible observed sequences: 162, 199, 199, 61, 61, …; 162, 61, 199, …; 61, 199, …; 162, 199, 61, 199, …

For example
A := 162 initially
Write: A := 199 (one client)
Write: A := 61 (another client)
Read: print(A) at reader 1: 162, 199, 199, 61, 61, …
Read: print(A) at reader 2: 162, 199, 61, …; or 162, 61, …; or 162, 61, 199, …
Different readers may observe the two writes in different orders.

What is the key to performance AND reliability? Replication

What is the source of inconsistency? Replication

Any Storage Abstraction
Client / Storage Server pairs:
–Processor / Memory
–Process address space / File system
–NFS client / NFS server
–Browser / Server

Multiple Clients access server: OK, but slow
(Diagram: several clients sharing one storage server.)

Multi-level Storage Hierarchy: OK
Replication within the storage hierarchy (a cache between client and storage server) to make it fast

Multiple Clients and Multi-Level: fast, but not OK
(Diagram: several clients, each with its own cache, in front of one storage server.)

Multiple Servers
What happens if you cannot update all the replicas?
Availability => Inconsistency
(Diagram: one client in front of two replicated storage servers.)

Basic solution to multiple client replicas
Enforce a single-writer, multiple-reader discipline
Allow readers to cache copies
Before an update is performed, the writer must gain exclusive access
Simple approach: invalidate all the copies, then update
Who keeps track of what?
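One way to picture the single-writer, multiple-reader discipline is with a readers-writer lock. This is a minimal sketch in Python (my own illustration, not code from the course): readers share access to the value, and a writer must first gain exclusive access before updating.

```python
import threading

class SWMRLocation:
    """Single-writer / multiple-reader discipline for one shared location (sketch)."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()           # protects the reader count
        self._readers = 0
        self._no_readers = threading.Condition(self._lock)

    def read(self):
        with self._lock:
            self._readers += 1                  # many readers may be active at once
        try:
            return self._value
        finally:
            with self._lock:
                self._readers -= 1
                if self._readers == 0:
                    self._no_readers.notify_all()

    def write(self, new_value):
        with self._lock:
            # Writer waits until it has exclusive access (no active readers).
            while self._readers > 0:
                self._no_readers.wait()
            self._value = new_value

loc = SWMRLocation(162)
loc.write(199)
print(loc.read())   # 199
```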

The Multi-processor/Core case
(Diagram: several processors, each with its own cache, attached to a shared memory over an interconnect.)
The interconnect is a broadcast medium
All clients can observe all writes and invalidate their local replicas (write-through invalidate protocol)

The Multi-processor/Core case
Write-back via read-exclusive
Atomic read-modify-write
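The following is a toy sketch of the write-through invalidate idea (Python, invented for illustration; real coherence protocols are implemented in hardware): every write goes through to memory and is broadcast so that the other caches drop their copies.

```python
class Memory:
    def __init__(self):
        self.data = {}

class Cache:
    """Toy write-through invalidate cache (illustrative sketch only)."""
    def __init__(self, memory, bus):
        self.memory = memory
        self.lines = {}
        self.bus = bus
        bus.append(self)                           # snoop on the broadcast medium

    def read(self, addr):
        if addr not in self.lines:                 # miss: fetch from memory
            self.lines[addr] = self.memory.data.get(addr)
        return self.lines[addr]

    def write(self, addr, value):
        self.memory.data[addr] = value             # write-through to memory
        self.lines[addr] = value
        for other in self.bus:                     # broadcast invalidation
            if other is not self:
                other.lines.pop(addr, None)

mem, bus = Memory(), []
c0, c1 = Cache(mem, bus), Cache(mem, bus)
c1.read("A")            # c1 caches A (nothing written yet)
c0.write("A", 162)      # invalidates c1's copy
print(c1.read("A"))     # 162: c1 misses and refetches the new value
```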

NFS “Eventual” Consistency
(Diagram: several clients with caches in front of one NFS storage server.)
A stateless server allows multiple cached copies
–Files are written locally (at your own risk)
Update visibility by “flush on close”
Clients issue GetAttr on file operations to check whether the file has been modified since it was cached
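A rough sketch of the revalidation idea (Python; the cache layout and the server interface here are simplified stand-ins I invented, not the real NFS client code): the client keeps a cached copy plus the attributes it last saw, flushes on close, and refetches when a GetAttr shows a newer modification time.

```python
import time

class NFSClientCache:
    """Simplified sketch of NFS-style cache revalidation via GetAttr."""
    def __init__(self, server):
        self.server = server
        self.cache = {}          # path -> (data, mtime_seen)

    def open_read(self, path):
        mtime = self.server.getattr(path)                 # GetAttr on each open
        if path in self.cache and self.cache[path][1] == mtime:
            return self.cache[path][0]                    # cached copy still valid
        data = self.server.read(path)                     # refetch from the server
        self.cache[path] = (data, mtime)
        return data

    def close_write(self, path, data):
        self.cache[path] = (data, time.time())            # written locally first
        self.server.write(path, data)                     # flush on close

class FakeServer:
    def __init__(self): self.files = {}
    def getattr(self, path): return self.files.get(path, (None, 0))[1]
    def read(self, path): return self.files.get(path, (None, 0))[0]
    def write(self, path, data): self.files[path] = (data, time.time())

srv = FakeServer()
a, b = NFSClientCache(srv), NFSClientCache(srv)
a.close_write("/f", b"v1")
print(b.open_read("/f"))      # b sees v1 after a's flush-on-close
```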

Other Options
The server can keep a “directory” of cached copies
–On an update, it sends invalidations to the clients holding copies
–Or it can send the updates themselves to those clients
Pros and cons?
OS consistency = architecture coherence
–Coherence requires invalidating copies prior to a write
–The write buffer has to be treated as the primary copy, like a transaction log
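Here is a minimal sketch of the directory idea (Python, with invented class and method names, not the course's code): the server remembers which clients cache each item and, on an update, either invalidates their copies or pushes the new value to them.

```python
class DirectoryServer:
    """Sketch of a server that tracks cached copies and invalidates on update."""
    def __init__(self):
        self.store = {}
        self.directory = {}                      # key -> set of clients caching it

    def read(self, key, client):
        self.directory.setdefault(key, set()).add(client)   # remember the copy
        return self.store.get(key)

    def write(self, key, value, push_updates=False):
        self.store[key] = value
        for client in self.directory.get(key, set()):
            if push_updates:
                client.update(key, value)        # option 2: send the new value
            else:
                client.invalidate(key)           # option 1: invalidate the copy
        if not push_updates:
            self.directory[key] = set()          # those copies are gone

class CachingClient:
    def __init__(self, server):
        self.server, self.cache = server, {}
    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.server.read(key, self)
        return self.cache[key]
    def invalidate(self, key): self.cache.pop(key, None)
    def update(self, key, value): self.cache[key] = value

srv = DirectoryServer()
c1, c2 = CachingClient(srv), CachingClient(srv)
srv.write("A", 162)
c1.get("A"); c2.get("A")
srv.write("A", 199)           # invalidates both cached copies
print(c1.get("A"))            # 199, refetched from the server
```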

Multiple Servers
What happens if you cannot update all the replicas?
Availability => Inconsistency
(Diagram: one client in front of two replicated storage servers.)

Durability and Atomicity
How do you make sure transaction results persist in the face of failures (e.g., server node failures)?
Replicate the store / database
–Commit the transaction to each replica
What happens if you have failures during a transaction commit?
–Need to ensure atomicity: either the transaction is committed on all replicas or on none at all

Two-Phase Commit (2PC)
2PC is a distributed protocol
High-level problem statement:
–If no node fails and all nodes are ready to commit, then all nodes COMMIT
–Otherwise, ABORT at all nodes
Developed by Turing Award winner Jim Gray (first Berkeley CS PhD, 1969)

2PC Algorithm
One coordinator, N workers (replicas)
High-level algorithm description:
–Coordinator asks all workers if they can commit
–If all workers reply VOTE-COMMIT, then the coordinator broadcasts GLOBAL-COMMIT; otherwise the coordinator broadcasts GLOBAL-ABORT
–Workers obey the GLOBAL-* messages

Detailed Algorithm
Coordinator algorithm:
–Send VOTE-REQ to all workers
–If it receives VOTE-COMMIT from all N workers, send GLOBAL-COMMIT to all workers
–If it doesn't receive VOTE-COMMIT from all N workers, send GLOBAL-ABORT to all workers
Worker algorithm:
–Wait for VOTE-REQ from the coordinator
–If ready, send VOTE-COMMIT to the coordinator
–If not ready, send VOTE-ABORT to the coordinator and immediately abort
–If it receives GLOBAL-COMMIT, then commit
–If it receives GLOBAL-ABORT, then abort
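A compact sketch of this exchange (Python, using in-process queues as stand-ins for the network; the message names follow the slides, everything else is invented for illustration). The coordinator also shows the timeout-to-GLOBAL-ABORT behavior discussed a few slides below.

```python
import queue
import threading

def coordinator(worker_queues, reply_q, timeout=1.0):
    # Phase 1: ask every worker to vote.
    for wq in worker_queues:
        wq.put("VOTE-REQ")
    votes = []
    try:
        for _ in worker_queues:
            votes.append(reply_q.get(timeout=timeout))   # WAIT state
    except queue.Empty:
        votes.append("VOTE-ABORT")                       # a missing vote counts as abort
    # Phase 2: broadcast the global decision.
    decision = "GLOBAL-COMMIT" if all(v == "VOTE-COMMIT" for v in votes) else "GLOBAL-ABORT"
    for wq in worker_queues:
        wq.put(decision)
    return decision

def worker(my_q, reply_q, ready):
    assert my_q.get() == "VOTE-REQ"
    if ready:
        reply_q.put("VOTE-COMMIT")
        decision = my_q.get()                            # READY: block for GLOBAL-*
        print("worker:", "commit" if decision == "GLOBAL-COMMIT" else "abort")
    else:
        reply_q.put("VOTE-ABORT")                        # vote abort and abort immediately
        my_q.get()                                       # drain the GLOBAL-ABORT still sent
        print("worker: abort immediately")

reply_q = queue.Queue()
worker_queues = [queue.Queue() for _ in range(3)]
threads = [threading.Thread(target=worker, args=(wq, reply_q, True)) for wq in worker_queues]
for t in threads: t.start()
print("coordinator:", coordinator(worker_queues, reply_q))
for t in threads: t.join()
```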

Failure-Free Example Execution
(Timeline diagram: the coordinator sends VOTE-REQ to workers 1, 2, and 3; each replies VOTE-COMMIT; the coordinator then sends GLOBAL-COMMIT to all three.)

State Machine of Coordinator
The coordinator implements a simple state machine:
–INIT → WAIT: on receiving START, send VOTE-REQ
–WAIT → ABORT: on receiving VOTE-ABORT, send GLOBAL-ABORT
–WAIT → COMMIT: on receiving VOTE-COMMIT (from all workers), send GLOBAL-COMMIT

State Machine of Workers
–INIT → ABORT: on receiving VOTE-REQ, send VOTE-ABORT
–INIT → READY: on receiving VOTE-REQ, send VOTE-COMMIT
–READY → ABORT: on receiving GLOBAL-ABORT
–READY → COMMIT: on receiving GLOBAL-COMMIT
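The two state machines can be written down directly as transition tables. This is a small sketch (Python dictionaries of my own devising; the event labels such as "ALL-VOTE-COMMIT" and "VOTE-REQ/ready" are invented shorthands) mapping (state, received message) to (next state, message to send), mirroring the diagrams above.

```python
# (state, received event) -> (next state, message to send or None)
COORDINATOR_FSM = {
    ("INIT", "START"):           ("WAIT",   "VOTE-REQ"),
    ("WAIT", "VOTE-ABORT"):      ("ABORT",  "GLOBAL-ABORT"),
    ("WAIT", "ALL-VOTE-COMMIT"): ("COMMIT", "GLOBAL-COMMIT"),
}

WORKER_FSM = {
    ("INIT",  "VOTE-REQ/not-ready"): ("ABORT",  "VOTE-ABORT"),
    ("INIT",  "VOTE-REQ/ready"):     ("READY",  "VOTE-COMMIT"),
    ("READY", "GLOBAL-ABORT"):       ("ABORT",  None),
    ("READY", "GLOBAL-COMMIT"):      ("COMMIT", None),
}

def step(fsm, state, event):
    """Advance one transition; unknown events leave the state unchanged."""
    next_state, to_send = fsm.get((state, event), (state, None))
    return next_state, to_send

state = "INIT"
state, out = step(WORKER_FSM, state, "VOTE-REQ/ready")
print(state, out)                       # READY VOTE-COMMIT
state, out = step(WORKER_FSM, state, "GLOBAL-COMMIT")
print(state, out)                       # COMMIT None
```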

Dealing with Worker Failures
How to deal with worker failures?
–Failure only affects states in which a node is waiting for messages
–The coordinator only waits for votes in the WAIT state
–In WAIT, if it doesn't receive N votes, it times out and sends GLOBAL-ABORT
(The coordinator state machine from the earlier slide is shown again.)

Example of Worker Failure
(Timeline diagram: the coordinator sends VOTE-REQ; some workers reply VOTE-COMMIT but one worker fails and never replies; the coordinator times out in WAIT and sends GLOBAL-ABORT to all workers.)

Dealing with Coordinator Failure
How to deal with coordinator failures?
–A worker waiting for VOTE-REQ in INIT
»can time out and abort (the coordinator handles it)
–A worker waiting for a GLOBAL-* message in READY
»if the coordinator fails, workers must BLOCK, waiting for the coordinator to recover and send the GLOBAL-* message
(The worker state machine from the earlier slide is shown again.)

Example of Coordinator Failure #1
(Timeline diagram: the coordinator fails; workers waiting in INIT time out, send VOTE-ABORT, and abort.)

Example of Coordinator Failure #2
(Timeline diagram: the coordinator sends VOTE-REQ, the workers reply VOTE-COMMIT and enter READY, then the coordinator fails; the workers block waiting for it, and when the coordinator is restarted it sends GLOBAL-ABORT.)

Durability
All nodes use stable storage* to record which state they are in
Upon recovery, a node can restore its state and resume:
–Coordinator aborts if it was in INIT, WAIT, or ABORT
–Coordinator commits if it was in COMMIT
–Worker aborts if it was in INIT or ABORT
–Worker commits if it was in COMMIT
–Worker asks the coordinator if it was in READY
* Stable storage is non-volatile storage (e.g., backed by disk) that guarantees atomic writes.
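A minimal sketch of these recovery rules (Python; writing a small state file with fsync and an atomic rename stands in for stable storage, and the function names are mine): each node persists its protocol state before acting on it, and on restart applies the table above.

```python
import os

def persist_state(path, state):
    """Record the protocol state on 'stable storage' (atomic-rename + fsync sketch)."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(state)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)              # atomic rename on POSIX filesystems

def recover(path, role):
    """Apply the 2PC recovery rules from the slide after a restart."""
    if os.path.exists(path):
        with open(path) as f:
            state = f.read()
    else:
        state = "INIT"
    if role == "coordinator":
        return "commit" if state == "COMMIT" else "abort"   # INIT/WAIT/ABORT -> abort
    # worker
    if state == "COMMIT":
        return "commit"
    if state == "READY":
        return "ask coordinator"       # must learn the global decision
    return "abort"                     # INIT or ABORT

persist_state("worker1.state", "READY")
print(recover("worker1.state", "worker"))       # ask coordinator
```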

Blocking for Coordinator to Recover
A worker waiting for the global decision can ask its fellow workers about their state
–If another worker is in the ABORT or COMMIT state, then the coordinator must have sent the corresponding GLOBAL-* message
»so the worker can safely abort or commit, respectively
–If another worker is still in the INIT state, then both workers can decide to abort
–If all workers are in READY, the worker needs to BLOCK (it doesn't know whether the coordinator wanted to abort or commit)
(The worker state machine from the earlier slide is shown again.)
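To close, here is a short sketch of that cooperative termination check (Python; the helper name and the peer-state list are my own simplification of "ask fellow workers"): a worker stuck in READY examines its peers' states and decides whenever any peer's state reveals the global outcome.

```python
def decide_from_peers(peer_states):
    """Termination rule for a worker blocked in READY (sketch of the slide's logic).

    peer_states: list of the other workers' protocol states.
    Returns 'commit', 'abort', or 'block'.
    """
    if "COMMIT" in peer_states:
        return "commit"            # coordinator must have sent GLOBAL-COMMIT
    if "ABORT" in peer_states:
        return "abort"             # coordinator must have sent GLOBAL-ABORT
    if "INIT" in peer_states:
        return "abort"             # that peer never voted commit; safe to abort
    return "block"                 # everyone is READY: must wait for the coordinator

print(decide_from_peers(["READY", "COMMIT"]))   # commit
print(decide_from_peers(["READY", "READY"]))    # block
print(decide_from_peers(["INIT", "READY"]))     # abort
```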