CS 162 Section 10: Two-Phase Commit, Fault-Tolerant Computing


Sec /14/2014, CS162 ©UCB Spring 2014

Administrivia
- Project 3 code due Thursday 4/17 by 11:59PM
- Project 4 design due Thursday 4/24 by 11:59PM
- Midterm II is April 28th, 4-5:30pm in 245 Li Ka Shing and 100 GPB
  - Covers Lectures #13-24
  - Closed book and notes, no calculators
  - One double-sided handwritten page of notes allowed
  - Review session: Fri Apr pm in 245 Li Ka Shing

QUIZ

Quiz

True/False
1. Two-phase commit is used to guarantee consistency. False
2. You always need 2PC if you want multiple systems to stay consistent. False (ex: use consistent client-side hashing to partition keys among slaves, with no replication.)
3. If a master server comes awake after crashing in the WAIT state, it resends VOTE-REQ in order to recount the slaves. False
4. Increasing mean-time-to-repair decreases availability (as defined in lecture). True
5. Bohrbugs won't be fixed by restarting a task or system. True

Short answer
6. What was the purpose of the TLS heartbeat, target of the Heartbleed attack? To make sure NATs/firewalls in the middle don't shut down the connection.

TWO-PHASE COMMIT

Durability and Atomicity, Distributed

How do you make sure transaction results persist in the face of failures (e.g., disk failures)?
- Replicate the database: commit the transaction to each replica.

What happens if you have failures during a transaction commit?
- Need to ensure atomicity: either the transaction is committed on all replicas or on none at all.

How can we replicate with atomicity?

Two-Phase Commit

Coordinator Algorithm
- Send VOTE-REQ to all workers.
- If it receives VOTE-COMMIT from all N workers, send GLOBAL-COMMIT to all workers.
- If it does not receive VOTE-COMMIT from all N workers, send GLOBAL-ABORT to all workers.

Worker Algorithm
- Wait for VOTE-REQ from the coordinator.
- If ready, send VOTE-COMMIT to the coordinator.
- If not ready, send VOTE-ABORT to the coordinator, and immediately abort.
- If it receives GLOBAL-COMMIT, commit.
- If it receives GLOBAL-ABORT, abort.
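The two algorithms above can be sketched as a single-process simulation. This is a hypothetical illustration, not the Project 4 code; class and method names (`Coordinator`, `Worker`, `vote_req`, ...) are mine, and the "network" is just direct method calls.

```python
# Minimal 2PC sketch: one coordinator, N workers, no failures.

class Worker:
    def __init__(self, ready):
        self.ready = ready
        self.state = "INIT"

    def vote_req(self):
        # Phase 1: on VOTE-REQ, vote commit if ready; else vote abort
        # and abort locally right away.
        if self.ready:
            self.state = "READY"
            return "VOTE-COMMIT"
        self.state = "ABORT"
        return "VOTE-ABORT"

    def global_msg(self, msg):
        # Phase 2: obey the coordinator's global decision.
        self.state = "COMMIT" if msg == "GLOBAL-COMMIT" else "ABORT"


class Coordinator:
    def __init__(self, workers):
        self.workers = workers

    def run(self):
        votes = [w.vote_req() for w in self.workers]      # phase 1
        decision = ("GLOBAL-COMMIT"
                    if all(v == "VOTE-COMMIT" for v in votes)
                    else "GLOBAL-ABORT")                   # unanimous or abort
        for w in self.workers:
            if w.state == "READY":    # aborted workers already stopped
                w.global_msg(decision)
        return decision
```

A single VOTE-ABORT is enough to force GLOBAL-ABORT: commit requires a unanimous vote.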

Master and Worker States

Master state machine:
- INIT → WAIT: receive START, send VOTE-REQ
- WAIT → ABORT: receive VOTE-ABORT, send GLOBAL-ABORT
- WAIT → COMMIT: receive VOTE-COMMIT (from all workers), send GLOBAL-COMMIT

Worker state machine:
- INIT → ABORT: receive VOTE-REQ, send VOTE-ABORT
- INIT → READY: receive VOTE-REQ, send VOTE-COMMIT
- READY → ABORT: receive GLOBAL-ABORT
- READY → COMMIT: receive GLOBAL-COMMIT

Failure-Free Example Execution

[Message diagram: the coordinator sends VOTE-REQ to workers 1-3; each worker replies VOTE-COMMIT; the coordinator then sends GLOBAL-COMMIT to all three workers.]

Dealing with Worker Failures

How to deal with worker failures?
- Failure only affects states in which the node is waiting for messages.
- The coordinator only waits for votes in the WAIT state.
- In WAIT, if it doesn't receive N votes, it times out and sends GLOBAL-ABORT.

[Master state diagram, as on the earlier slide.]
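The timeout rule above can be sketched as follows. This is an illustrative stand-in, using a `queue.Queue` in place of the network; the function name and parameters are mine.

```python
# Coordinator vote collection with a timeout: if any of the N votes fails
# to arrive in time (a crashed or slow worker), decide GLOBAL-ABORT.
import queue

def collect_decision(inbox, n_workers, timeout_s=1.0):
    votes = []
    try:
        for _ in range(n_workers):
            votes.append(inbox.get(timeout=timeout_s))
    except queue.Empty:
        return "GLOBAL-ABORT"          # missing vote: time out and abort
    if all(v == "VOTE-COMMIT" for v in votes):
        return "GLOBAL-COMMIT"
    return "GLOBAL-ABORT"
```

Note that timing out is always safe here: no worker has committed yet, so aborting is a valid global decision.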

Dealing with Coordinator Failure

How to deal with coordinator failures?
- A worker waits for VOTE-REQ in INIT.
  - The worker can time out and abort (the coordinator handles it).
- A worker waits for a GLOBAL-* message in READY.
  - If the coordinator fails, workers must BLOCK, waiting for the coordinator to recover and send the GLOBAL-* message.

[Worker state diagram, as on the earlier slide.]

Example of Coordinator Failure

[Message diagram: the coordinator sends VOTE-REQ; workers reply VOTE-COMMIT; the coordinator then crashes, and the workers block in READY waiting for it. Once the coordinator is restarted, it sends GLOBAL-ABORT and the workers abort.]

Remembering Where We Were (Durability)

All nodes use stable storage* to store which state they are in. Upon recovery, a node can restore its state and resume:
- The coordinator aborts in INIT, WAIT, or ABORT.
- The coordinator commits in COMMIT.
- A worker aborts in INIT or ABORT.
- A worker commits in COMMIT.
- A worker asks the coordinator in READY.

* Stable storage is non-volatile storage (e.g., backed by disk) that guarantees atomic writes.
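The recovery table above is mechanical enough to write down directly. A minimal sketch, with names of my own choosing ("ask-coordinator" marks the one case where a recovering worker must query the coordinator for the outcome):

```python
# Recovery decision from the state found in stable storage after a crash.

def recover(role, logged_state):
    if role == "coordinator":
        # Coordinator: commit only if GLOBAL-COMMIT was already decided
        # and logged; in INIT, WAIT, or ABORT it is safe to abort.
        return "commit" if logged_state == "COMMIT" else "abort"
    # Worker: READY is the uncertain state -- the global decision may
    # already have been made, so the worker must ask the coordinator.
    if logged_state == "COMMIT":
        return "commit"
    if logged_state == "READY":
        return "ask-coordinator"
    return "abort"   # INIT or ABORT
```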

FAULT-TOLERANT COMPUTING

Dependability: The 3 ITIES

- Reliability / Integrity: does the right thing. (Needs a large MTBF.)
- Availability: does it now. (Needs a small MTTR.)

  Availability = MTBF / (MTBF + MTTR)

- System availability: if 90% of terminals are up & 99% of the DB is up, then ~89% of transactions are serviced on time.
- Security (shown in the slide's diagram alongside Integrity, Reliability, and Availability).

MTBF or MTTF = Mean Time Between (To) Failure
MTTR = Mean Time To Repair (see next slide)
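Working the formula and the system-availability example through (the MTBF/MTTR values here are made-up round numbers for illustration):

```python
# Availability = MTBF / (MTBF + MTTR).

def availability(mtbf, mttr):
    return mtbf / (mtbf + mttr)

# e.g. MTBF = 999 hours between failures, MTTR = 1 hour to repair
# gives 99.9% availability.
a = availability(999.0, 1.0)

# System availability with independent components: 90% of terminals up
# times 99% of the DB up gives ~89% of transactions serviced on time.
system = 0.90 * 0.99
```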

Mean Time to Recovery

A critical time, as further failures can occur during recovery.

Total outage duration (MTTR) =
  Time to Detect (need good monitoring)
+ Time to Diagnose (need good docs/ops, best practices)
+ Time to Decide (need good org/leader, best practices)
+ Time to Act (need good execution!)

Traditional Fault-Tolerance Techniques

- Fail-fast modules: work or stop.
- Spare modules: yield instant repair time.
- Process/server pairs: mask HW and SW faults.
- Transactions: yield ACID semantics (simple fault model).

Fail-Fast is Good, but Repair is Needed

- Improving either MTTR or MTBF gives benefit.
- Simple redundancy does not help much (it can actually hurt!).
- Lifecycle of a module: Fault → Detect → Repair → Return. Fail-fast gives short fault latency.
- High availability is low UN-availability:

  Unavailability ≈ MTTR / MTBF
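A quick numerical check of why the approximation above holds when MTTR is much smaller than MTBF (the hour values are made-up for illustration): exact unavailability is MTTR / (MTBF + MTTR), and the MTBF in the denominator dominates.

```python
# Unavailability: exact value vs. the slide's approximation.
mtbf, mttr = 1000.0, 1.0

exact = mttr / (mtbf + mttr)   # = 1 - MTBF/(MTBF+MTTR), about 0.000999
approx = mttr / mtbf           # slide's rule of thumb, 0.001
```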

Fail-Fast and High-Availability Execution

Process pairs give instant repair:
- Use defensive programming to make a process fail-fast.
- Have a separate backup process ready to "take over" if the primary faults.

If the SW fault is a Bohrbug → no repair: "wait for the next release," "get an emergency bug fix," or "get a new vendor."
If the SW fault is a Heisenbug → restart the process: "reboot and retry."

Yields millisecond repair times. Tolerates some HW faults.
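The Heisenbug rule above ("reboot and retry") can be sketched as a supervisor loop. This is an illustrative stand-in, not a real process pair: the function names and the use of an exception to model a fail-fast crash are mine.

```python
# Restart-on-failure supervisor: a transient (Heisenbug) fault usually
# disappears on retry, so just rerun the task; if it keeps failing, it
# is likely a Bohrbug, which a restart cannot fix.

def run_with_restart(task, max_restarts=3):
    for _ in range(max_restarts + 1):
        try:
            return task()
        except RuntimeError:   # fail-fast process died
            continue           # Heisenbug: restart and retry
    raise RuntimeError("still failing after restarts: likely a Bohrbug")
```

A task that fails twice and then succeeds models a Heisenbug; a task that fails deterministically every time models a Bohrbug and exhausts the restart budget.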