Computer Science Lecture 16, page 1 CS677: Distributed OS
Last Class: Consistency Semantics
Consistency models
–Data-centric consistency models
–Client-centric consistency models
Eventual consistency and epidemic protocols

Computer Science Lecture 16, page 2 CS677: Distributed OS
Today: More on Consistency
Consistency protocols
–Primary-based
–Replicated-write
Putting it all together
–Final thoughts
Fault tolerance: introduction
Project 2

Computer Science Lecture 16, page 3 CS677: Distributed OS
Implementation Issues
Two techniques to implement consistency models:
–Primary-based protocols
  Assume a primary replica for each data item
  Primary is responsible for coordinating all writes
–Replicated-write protocols
  No primary is assumed for a data item
  Writes can take place at any replica

Computer Science Lecture 16, page 4 CS677: Distributed OS
Remote-Write Protocols
Traditionally used in client-server systems

Computer Science Lecture 16, page 5 CS677: Distributed OS
Remote-Write Protocols (2)
Primary-backup protocol
–Allow local reads; send writes to the primary
–Block on a write until all replicas have been notified
–Implements sequential consistency
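The primary-backup write path above can be sketched as follows. This is a minimal in-memory sketch, not a real protocol implementation: the class and method names are hypothetical, and network transport and failure handling are omitted.

```python
class Replica:
    """One replica of the data store; reads are served locally."""
    def __init__(self):
        self.store = {}

    def read(self, key):
        return self.store.get(key)

    def apply(self, key, value):
        self.store[key] = value


class PrimaryBackup:
    """The primary coordinates all writes; backups only apply updates."""
    def __init__(self, primary, backups):
        self.primary = primary
        self.backups = backups

    def write(self, key, value):
        # Forward the write to the primary...
        self.primary.apply(key, value)
        # ...and block until every backup has applied it.
        # Because all writes are serialized at the primary and
        # propagated before the write returns, reads at any replica
        # observe writes in the same order: sequential consistency.
        for b in self.backups:
            b.apply(key, value)
        return "ack"


replicas = [Replica() for _ in range(3)]
group = PrimaryBackup(replicas[0], replicas[1:])
group.write("x", 42)
print([r.read("x") for r in replicas])  # every replica sees the write
```

Note the cost this sketch makes visible: the writer blocks for one round to every backup, which is exactly why primary-backup gives strong consistency at the price of write latency.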

Computer Science Lecture 16, page 6 CS677: Distributed OS
Local-Write Protocols (1)
Primary-based local-write protocol in which a single copy is migrated between processes
–Limitation: need to track the primary for each data item

Computer Science Lecture 16, page 7 CS677: Distributed OS
Local-Write Protocols (2)
Primary-backup protocol in which the primary migrates to the process wanting to perform an update
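The migration idea can be sketched in a few lines; the names here are hypothetical and the network transfer of the copy is reduced to a field update:

```python
class Item:
    """A data item whose primary copy migrates to whichever process writes."""
    def __init__(self, value, primary):
        self.value = value
        self.primary = primary  # id of the process currently holding the primary


def local_write(item, writer, value):
    # Step 1: migrate the primary to the writing process
    # (in a real system: locate the current primary and transfer the copy).
    if item.primary != writer:
        item.primary = writer
    # Step 2: the write itself is now purely local, hence "local-write".
    item.value = value


item = Item(0, primary="P1")
local_write(item, "P2", 99)
print(item.primary, item.value)  # P2 99
```

The sketch shows why the protocol favors write locality (a process that writes repeatedly pays the migration cost once) and why tracking the current primary per item is the hard part.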

Computer Science Lecture 16, page 8 CS677: Distributed OS
Replicated-Write Protocols
Relax the assumption of a single primary
–No primary; any replica is allowed to update
–Consistency is more complex to achieve
Quorum-based protocols
–Use voting to request/acquire permissions from replicas
–Consider a file replicated on N servers
–Update: contact at least (N/2 + 1) servers and get them to agree to perform the update (associate a version number with the file)
–Read: contact a majority of servers and obtain the version number; if a majority agree on a version number, perform the read
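A simplified sketch of the majority-quorum scheme follows. It is illustrative only: the names are made up, servers never fail here, and the read returns the value with the highest version seen in its majority, relying on the fact that any two majorities of N servers intersect.

```python
import random


class Server:
    def __init__(self):
        self.value = None
        self.version = 0


def quorum_write(servers, value):
    """Contact a write quorum of at least N/2 + 1 servers."""
    n = len(servers)
    quorum = random.sample(servers, n // 2 + 1)
    # The quorum intersects every previous write quorum, so the
    # highest version it holds is the current one; go one past it.
    new_version = max(s.version for s in quorum) + 1
    for s in quorum:
        s.value = value
        s.version = new_version


def quorum_read(servers):
    """Contact a majority; it must overlap the latest write quorum."""
    n = len(servers)
    quorum = random.sample(servers, n // 2 + 1)
    latest = max(quorum, key=lambda s: s.version)
    return latest.value


servers = [Server() for _ in range(5)]
quorum_write(servers, "v1")
quorum_write(servers, "v2")
print(quorum_read(servers))  # v2: read and write majorities always intersect
```

The intersection argument is the whole point: with quorums of size N/2 + 1, no read can miss the most recent write, no matter which majority it happens to contact.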

Computer Science Lecture 16, page 9 CS677: Distributed OS
Gifford's Quorum-Based Protocol
Three examples of the voting algorithm:
a) A correct choice of read and write set
b) A choice that may lead to write-write conflicts
c) A correct choice, known as ROWA (read one, write all)
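Gifford's scheme generalizes the simple majority: pick a read quorum N_R and a write quorum N_W subject to N_R + N_W > N (so read and write sets always intersect) and N_W > N/2 (so two write sets always intersect). A small checker makes the three cases concrete; the N = 12 quorum sizes below are illustrative choices matching the three categories on the slide, not figures taken from this document:

```python
def valid_quorum(n, n_read, n_write):
    """Check Gifford's two constraints on quorum sizes."""
    no_rw_conflict = n_read + n_write > n   # every read sees the latest write
    no_ww_conflict = n_write > n / 2        # no two writes can both succeed unseen
    return no_rw_conflict and no_ww_conflict


# Illustrative choices for N = 12 replicas:
print(valid_quorum(12, 3, 10))   # True:  a correct choice of read and write set
print(valid_quorum(12, 7, 6))    # False: N_W <= N/2, write-write conflicts possible
print(valid_quorum(12, 1, 12))   # True:  ROWA (read one, write all)
```

ROWA sits at one extreme of the trade-off the constraints allow: reads are as cheap as possible (one server) at the price of writes touching every replica.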

Computer Science Lecture 16, page 10 CS677: Distributed OS
Final Thoughts
Replication and caching improve performance in distributed systems
Consistency of replicated data is crucial
Many consistency semantics (models) are possible
–Need to pick an appropriate model depending on the application
–Example: web caching; weak consistency is OK since humans are tolerant of stale information (they can reload the page)
–Implementation overhead and complexity grow if stronger guarantees are desired

Computer Science Lecture 16, page 11 CS677: Distributed OS
Fault Tolerance
Single-machine systems
–Failures are all or nothing: OS crash, disk failure
Distributed systems: multiple independent nodes
–Partial failures are also possible (some nodes fail)
Question: can we automatically recover from partial failures?
–Important issue, since the probability of failure grows with the number of independent components (nodes) in the system
–Prob(failure) = Prob(any one component fails) = 1 − P(no component fails)
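To make the growth concrete: if each of n components fails independently with probability p, then P(no component fails) = (1 − p)^n, so the system-level failure probability is 1 − (1 − p)^n. A quick calculation with an illustrative per-component probability:

```python
def system_failure_prob(p, n):
    """P(at least one of n independent components fails) = 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n


# Even a very reliable component (p = 0.1%) adds up across many nodes.
for n in (1, 10, 100, 1000):
    print(n, system_failure_prob(0.001, n))
```

With p = 0.001, the failure probability climbs from 0.1% for one node to well over half for a thousand nodes, which is why partial-failure recovery is a first-class concern in large distributed systems.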

Computer Science Lecture 16, page 12 CS677: Distributed OS
A Perspective
Computing systems are not very reliable
–OS crashes (Windows), buggy software, unreliable hardware, software/hardware incompatibilities
–Until recently, computer users were "tech savvy": could depend on users to reboot and troubleshoot problems
–Growing popularity of the Internet/World Wide Web brings "novice" users
Need to build more reliable/dependable systems
–Example: what if your TV (or car) broke down every day? Users don't want to "restart" a TV or fix it (by opening it up)
Need to make computing systems more reliable

Computer Science Lecture 16, page 13 CS677: Distributed OS
Basic Concepts
Need to build dependable systems. Requirements for dependable systems:
–Availability: the system should be available for use at any given time; 99.999% availability (five 9s) => very small downtime
–Reliability: the system should run continuously without failure
–Safety: temporary failures should not result in a catastrophe (example: computing systems controlling an airplane or a nuclear reactor)
–Maintainability: a failed system should be easy to repair
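Translating "five 9s" into time shows how demanding the requirement is (ignoring leap years for simplicity):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability):
    """Minutes of allowed downtime per year for a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


print(round(downtime_minutes_per_year(0.99999), 2))  # five 9s: ~5.26 min/year
print(round(downtime_minutes_per_year(0.999), 1))    # three 9s: ~525.6 min/year
```

Five 9s leaves barely five minutes of total downtime per year, which is less than a single unplanned reboot; each additional 9 cuts the budget by a factor of ten.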

Computer Science Lecture 16, page 14 CS677: Distributed OS
Basic Concepts (contd)
Fault tolerance: the system should provide services despite faults
–Transient faults: occur once, then disappear
–Intermittent faults: occur, vanish, then reappear
–Permanent faults: persist until the faulty component is repaired or replaced

Computer Science Lecture 16, page 15 CS677: Distributed OS
Failure Models
Different types of failures:

Type of failure              Description
Crash failure                A server halts, but is working correctly until it halts
Omission failure             A server fails to respond to incoming requests
  Receive omission             A server fails to receive incoming messages
  Send omission                A server fails to send messages
Timing failure               A server's response lies outside the specified time interval
Response failure             The server's response is incorrect
  Value failure                The value of the response is wrong
  State-transition failure     The server deviates from the correct flow of control
Arbitrary failure            A server may produce arbitrary responses at arbitrary times

Computer Science Lecture 16, page 16 CS677: Distributed OS
Failure Masking by Redundancy
Triple modular redundancy
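Triple modular redundancy masks a single faulty component by running three copies of it and taking a majority vote over their outputs. A minimal voter sketch (the function names are hypothetical):

```python
from collections import Counter


def tmr_vote(outputs):
    """Majority-vote over replica outputs; masks one faulty replica of three."""
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        # All three disagree: more than one replica has failed,
        # and a single voting stage can no longer mask the fault.
        raise RuntimeError("no majority among replica outputs")
    return value


# One replica returns a corrupted value; the voter masks it.
print(tmr_vote([42, 42, 7]))  # 42
```

Note that the voter itself is now a single point of failure, which is why real TMR designs (as in the figure this slide refers to) replicate the voters as well.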

Computer Science Lecture 16, page 17 CS677: Distributed OS
Project 2
Online banking using a distributed database
–Database distributed across 3 machines: set of accounts split across the three disks
–Replication: each account is also replicated on a second machine

Computer Science Lecture 16, page 18 CS677: Distributed OS
Project 2 (contd)
Load balancer/request redirector
–All client requests arrive at this component
–Forwards each request to an "appropriate" server
–Two policies: per-account round-robin, least-loaded
Distributed locks: Ricart and Agrawala algorithm
–Can use logical clocks, or simplify using physical clocks
Consistency: strict/release; before releasing a lock, propagate changes to the replica
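The two redirector policies might be sketched as below. Server names and the load metric are hypothetical; a real Project 2 implementation would add the RPC layer and the Ricart-Agrawala locking described above.

```python
import itertools


class Redirector:
    """Request redirector supporting the two policies from the project spec."""
    def __init__(self, servers):
        self.servers = servers
        self.rr_cursor = {}                    # per-account round-robin state
        self.load = {s: 0 for s in servers}    # outstanding requests per server

    def round_robin(self, account):
        # Each account cycles through the servers independently,
        # so one hot account cannot skew every other account's rotation.
        if account not in self.rr_cursor:
            self.rr_cursor[account] = itertools.cycle(self.servers)
        return next(self.rr_cursor[account])

    def least_loaded(self):
        # Pick the server with the fewest outstanding requests.
        return min(self.servers, key=lambda s: self.load[s])


r = Redirector(["s1", "s2", "s3"])
print([r.round_robin("acct-7") for _ in range(4)])  # ['s1', 's2', 's3', 's1']
r.load["s1"] = 5
print(r.least_loaded())  # s2
```

Per-account round-robin keeps a request stream for one account spread evenly, while least-loaded reacts to actual server load; the project asks for both so the two policies can be compared.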