The CAP Theorem Tomer Gabel, Wix BuildStuff 2014

Credits Originally a talk by Yoav Abrahami (Wix) Based on “Call Me Maybe” by Kyle “Aphyr” Kingsbury

Brewer’s CAP Theorem: Consistency, Availability, Partition Tolerance

Brewer’s CAP Theorem: Consistency, Availability, Partition Tolerance. Pick two!

By Example I want this book! –I add it to the cart –Then continue browsing There’s only one copy in stock!

By Example I want this book! –I add it to the cart –Then continue browsing There’s only one copy in stock! … and someone else just bought it.

Consistency

Consistency: Defined In a consistent system: All participants see the same value at the same time “Do you have this book in stock?”

Consistency: Defined If our book store is an inconsistent system: –Two customers may buy the book –But there’s only one item in inventory! We’ve just violated a business constraint.

Availability

Availability: Defined An available system: –Is reachable –Responds to requests (within SLA) Availability does not guarantee success! –The operation may fail –“This book is no longer available”

Availability: Defined What if the system is unavailable? –I complete the checkout –And click on “Pay” –And wait –And wait some more –And… Did I purchase the book or not?!

Partition Tolerance: Defined Partition: one or more nodes are unreachable This is distinct from a dead node… … but observably the same No practical system runs on a single node So all systems are susceptible!

“The Network is Reliable” All of these failures happen in real IP networks To a client, delays and drops look the same Perfect failure detection is provably impossible [1] [1] “Impossibility of Distributed Consensus with One Faulty Process”, Fischer, Lynch and Paterson

Partition Tolerance: Reified External causes: –Bad network config –Faulty equipment –Scheduled maintenance Even software causes partitions: –Bad network config. –GC pauses –Overloaded servers Plenty of war stories! –Netflix –Twilio –GitHub –Wix :-) Some hard numbers [1]: –5.2 failed devices/day –59K lost packets/day –Adding redundancy only improves by 40% [1] “Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications”, Gill et al

“Proving” CAP

In Pictures Let’s consider a simple system: –Service A writes values –Service B reads values –Values are replicated between nodes These are “ideal” systems –Bug-free, predictable (Diagram: node 1 holds V0 and serves writer A; node 2 holds V0 and serves reader B)

In Pictures “Sunny day scenario”: –A writes a new value V1 –The value is replicated to node 2 –B reads the new value (Diagram: V1 propagates from node 1 to node 2, and B reads V1)

In Pictures What happens if the network drops? –A writes a new value V1 –Replication fails –B still sees the old value –The system is inconsistent (Diagram: node 1 holds V1 while node 2 still holds V0)

In Pictures Possible mitigation is synchronous replication –A writes a new value V1 –Cannot replicate, so the write is rejected –Both A and B still see V0 –The system is logically unavailable (Diagram: replication of V1 fails, so both nodes keep V0)
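
To make the trade-off above concrete, here is a minimal Python sketch of the same two-node system; the Node class, the link_up flag and the V0/V1 values are purely illustrative. Asynchronous replication keeps accepting writes but lets the replica go stale, while synchronous replication stays consistent by rejecting the write.

```python
# Toy model of the two-node scenario above: asynchronous replication can
# leave node 2 stale (inconsistent), while synchronous replication must
# reject the write when the link is down (unavailable).

class Node:
    def __init__(self, value="V0"):
        self.value = value

def write_async(primary, replica, value, link_up):
    """Accept the write locally; replicate only if the link is up."""
    primary.value = value
    if link_up:
        replica.value = value
    return "accepted"

def write_sync(primary, replica, value, link_up):
    """Reject the write unless it can be replicated immediately."""
    if not link_up:
        return "rejected"              # logically unavailable
    primary.value = value
    replica.value = value
    return "accepted"

n1, n2 = Node(), Node()
print(write_async(n1, n2, "V1", link_up=False), n1.value, n2.value)  # accepted V1 V0 -> inconsistent
n1, n2 = Node(), Node()
print(write_sync(n1, n2, "V1", link_up=False), n1.value, n2.value)   # rejected V0 V0 -> consistent but unavailable
```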

What does it all mean?

The network is not reliable Distributed systems must handle partitions Any modern system runs on more than one node… … and is therefore distributed Ergo, you have to choose: –Consistency over availability –Availability over consistency

Granularity Real systems comprise many operations –“Add book to cart” –“Pay for the book” Each has different properties It’s a spectrum, not a binary choice! (Diagram: a spectrum from Consistency to Availability, with operations like Checkout and Shopping Cart placed at different points along it)

CAP IN THE REAL WORLD Kyle “Aphyr” Kingsbury Breaking consistency guarantees since 2013

PostgreSQL Traditional RDBMS –Transactional –ACID compliant Primarily a CP system –Writes against a master node “Not a distributed system” –Except with a client at play!

PostgreSQL Writes are a simplified 2PC: –Client votes to commit –Server validates transaction –Server stores changes –Server acknowledges commit –Client receives acknowledgement (Diagram: Client sends Commit → Server stores → Server sends Ack)

PostgreSQL But what if the ack is never received? The commit is already stored… … but the client has no indication! The system is in an inconsistent state (Diagram: the Ack from server to client is lost)

PostgreSQL Let’s experiment! 5 clients write to a PostgreSQL instance We then drop the server from the network Results: –1000 writes –950 acknowledged –952 survivors

So what can we do? 1.Accept false-negatives –May not be acceptable for your use case! 2.Use idempotent operations 3.Apply unique transaction IDs –Query state after partition is resolved These strategies apply to any RDBMS
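
One way to make strategies 2 and 3 concrete: the client generates a unique transaction ID, writes it alongside the data, and simply queries for it once the partition is resolved. A minimal sketch using Python's built-in sqlite3 module as a stand-in for any RDBMS; the purchases table and its column names are invented for illustration.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (tx_id TEXT PRIMARY KEY, book TEXT)")

def buy_book(conn, book, tx_id=None):
    """Idempotent write: retrying with the same tx_id never duplicates the row."""
    tx_id = tx_id or str(uuid.uuid4())   # client-generated, unique per logical operation
    conn.execute("INSERT OR IGNORE INTO purchases (tx_id, book) VALUES (?, ?)",
                 (tx_id, book))
    conn.commit()
    return tx_id

tx = buy_book(conn, "Distributed Systems")
# Suppose the ack was lost: after the partition is resolved, just ask the server.
committed = conn.execute("SELECT 1 FROM purchases WHERE tx_id = ?", (tx,)).fetchone() is not None
print(committed)  # True -> the write survived even though the client never saw the ack
```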

MongoDB A document-oriented database Availability/scale via replica sets –Client writes to a master node –Master replicates writes to n replicas User-selectable consistency guarantees

MongoDB When a partition occurs: –If the master is in the minority, it is demoted –The majority promotes a new master… –… selected by the highest optime

MongoDB The cluster “heals” after partition resolution: –The “old” master rejoins the cluster –Acknowledged minority writes are reverted!

MongoDB Let’s experiment! Set up a 5-node MongoDB cluster 5 clients write to the cluster We then partition the cluster … and restore it to see what happens

MongoDB With write concern unacknowledged: –Server does not ack writes (except TCP) –The default prior to November 2012 Results: –6000 writes –5700 acknowledged –3319 survivors –42% data loss!

MongoDB With write concern acknowledged: –Server acknowledges writes (after store) –The default guarantee Results: –6000 writes –5900 acknowledged –3692 survivors –37% data loss!

MongoDB With write concern replica acknowledged: –Client specifies minimum replicas –Server acks after writes to replicas Results: –6000 writes –5695 acknowledged –3768 survivors –33% data loss!

MongoDB With write concern majority: –For an n-node cluster, requires acknowledgement from more than n/2 nodes –Also called “quorum” Results: –6000 writes –5700 acknowledged –5701 survivors –2 false positives :-( –No data loss

So what can we do? 1.Keep calm and carry on –As Aphyr puts it, “not all applications need consistency” –Have a reliable backup strategy –… and make sure you drill restores! 2.Use write concern majority –And take the performance hit
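
For reference, requesting majority acknowledgement from a Python client might look like the sketch below, using the PyMongo driver; the connection URI, database and collection names are placeholders, and a real replica set must be reachable for the write to succeed.

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")  # illustrative URI
db = client.get_database("shop")

# Writes through this handle are only acknowledged once a majority of
# replica-set members have stored them, trading latency for durability.
orders = db.get_collection("orders", write_concern=WriteConcern(w="majority"))
orders.insert_one({"book": "Distributed Systems", "qty": 1})
```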

The prime suspects Aphyr’s Jepsen tests include: –Redis –Riak –Zookeeper –Kafka –Cassandra –RabbitMQ –etcd (and consul) –ElasticSearch If you’re considering them, go read his posts In fact, go read his posts regardless

Immutable Data Immutable (adj.): “Unchanging over time or unable to be changed.” Meaning: –No deletes –No updates –No merge conflicts –Replication is trivial
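
A tiny sketch of the idea, assuming an append-only event log from which the current state is derived rather than updated in place; the event names are made up for illustration.

```python
# Append-only event log: facts are never updated or deleted, so replicating
# a node is just copying (a suffix of) the log and replaying it.
events = []

def record(event):
    events.append(event)               # the only write operation

def copies_in_stock(book):
    """Current state is derived by folding over the immutable history."""
    delta = {"restocked": +1, "sold": -1}
    return sum(delta[kind] for kind, b in events if b == book)

record(("restocked", "Distributed Systems"))
record(("sold", "Distributed Systems"))
print(copies_in_stock("Distributed Systems"))  # 0
```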

Idempotence An idempotent operation: –Can be applied one or more times with the same effect Enables retries Not always possible –Side-effects are key –Consider: payments
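
A minimal illustration of the difference: a write phrased as “set the value” can be retried safely after a lost ack, while “add one to the value” cannot. The cart dictionary and function names are invented for this example.

```python
# "Set the quantity to 3" can be retried safely; "add 1 to the quantity" cannot.
cart = {"book": 1}

def set_quantity(cart, item, qty):     # idempotent: same effect however many times it runs
    cart[item] = qty

def add_one(cart, item):               # not idempotent: each retry changes the result
    cart[item] += 1

for _ in range(3):                     # e.g. a client retrying after lost acks
    set_quantity(cart, "book", 3)
print(cart["book"])                    # 3

for _ in range(3):
    add_one(cart, "book")
print(cart["book"])                    # 6 -- the retries compounded
```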

Eventual Consistency A design which prefers availability … but guarantees that clients will eventually see consistent reads Consider git: –Always available locally –Converges via push/pull –Human conflict resolution

Eventual Consistency The system expects data to diverge … and includes mechanisms to regain convergence –Partial ordering to minimize conflicts –A merge function to resolve conflicts

Vector Clocks A technique for partial ordering Each node has a logical clock –The clock increases on every write –Track the last observed clocks for each item –Include this vector on replication When neither the observed nor the inbound vector descends from the other, we have a conflict This lets us know when history diverged
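
A minimal sketch of the bookkeeping described above, representing each vector clock as a dict from node name to counter; the node names are illustrative.

```python
def increment(clock, node):
    """A node bumps its own entry on every local write."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    """True if clock a has seen everything clock b has (b <= a)."""
    return all(a.get(node, 0) >= ticks for node, ticks in b.items())

def conflict(a, b):
    """Neither clock descends from the other: the histories diverged."""
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "node1")            # {'node1': 1}
v2 = increment(v1, "node2")            # {'node1': 1, 'node2': 1}
v3 = increment(v1, "node3")            # concurrent with v2
print(conflict(v2, v3))                # True  -- a merge function must resolve this
print(conflict(v1, v2))                # False -- v2 simply supersedes v1
```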

Vector Clocks: Example [1] [1] “Why Vector Clocks Are Hard”, Justin Sheehy on the Basho Blog

CRDTs Commutative Replicated Data Types [1] A CRDT is a data structure that: –Eventually converges to a consistent state –Guarantees no conflicts on replication [1] “A comprehensive study of Convergent and Commutative Replicated Data Types”, Shapiro et al

CRDTs CRDTs provide specialized semantics: –G-Counter: Monotonically increasing counter –PN-Counter: Also supports decrements –G-Set: A set that only supports adds –2P-Set: Supports removals but only once OR-Sets are particularly useful –Keep track of both additions and removals –Can be used for shopping carts
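
As a concrete example of the simplest of these, here is a sketch of a G-Counter in Python: each replica increments only its own slot, and merging takes an element-wise maximum, so replicas converge regardless of how often or in what order they exchange state. The class and node names are illustrative.

```python
class GCounter:
    """Grow-only counter: one slot per node, merged by element-wise max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, n=1):
        # A node only ever increments its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Commutative, associative and idempotent: replication order doesn't matter.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment()
b.increment()
a.merge(b); b.merge(a)                 # exchange state in either order, any number of times
print(a.value(), b.value())            # 3 3 -- both replicas converge
```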

Questions? Complaints?

WE’RE DONE HERE! Thank you! Aphyr’s “Call Me Maybe” blog posts: