Distributed Data, Replication and NoSQL

Presentation on theme: "Distributed Data, Replication and NoSQL"— Presentation transcript:

1 Distributed Data, Replication and NoSQL
The slides for this text are organized into chapters. This lecture covers Chapter 20. Chapter 1: Introduction to Database Systems Chapter 2: The Entity-Relationship Model Chapter 3: The Relational Model Chapter 4 (Part A): Relational Algebra Chapter 4 (Part B): Relational Calculus Chapter 5: SQL: Queries, Programming, Triggers Chapter 6: Query-by-Example (QBE) Chapter 7: Storing Data: Disks and Files Chapter 8: File Organizations and Indexing Chapter 9: Tree-Structured Indexing Chapter 10: Hash-Based Indexing Chapter 11: External Sorting Chapter 12 (Part A): Evaluation of Relational Operators Chapter 12 (Part B): Evaluation of Relational Operators: Other Techniques Chapter 13: Introduction to Query Optimization Chapter 14: A Typical Relational Optimizer Chapter 15: Schema Refinement and Normal Forms Chapter 16 (Part A): Physical Database Design Chapter 16 (Part B): Database Tuning Chapter 17: Security Chapter 18: Transaction Management Overview Chapter 19: Concurrency Control Chapter 20: Crash Recovery Chapter 21: Parallel and Distributed Databases Chapter 22: Internet Databases Chapter 23: Decision Support Chapter 24: Data Mining Chapter 25: Object-Database Systems Chapter 26: Spatial Data Management Chapter 27: Deductive Databases Chapter 28: Additional Topics

2 Outline
- Partitioned Data: Classical Distributed Databases; Two Phase Commit (2PC); Logging and Recovery for 2PC
- Replicated Data: Transactional replication
- "Relaxing" transactions: Weak Isolation
- Loose Consistency & NoSQL: Replica "consistency" & linearizability; Quorums; Eventual Consistency
- The Programmer's View?

3 Distributed Concurrency Control
- Consider a parallel or distributed database, one that handles updates (unlike, say, Spark).
- Data is partitioned (sharded) across nodes. This is how we scale up data volume!
- Assume only one copy of each record (for now).
- Assume each transaction is assigned to a node that serves as its "coordinator".
- Where does the concurrency control state live? Locks for 2PL? Timestamps/versions for MVCC?
- Easy! Partition it like the data (a sketch follows).
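
A minimal sketch of this idea (my own illustration, not from the lecture; home_node, lock_tables, and request_lock are made-up names): 2PL lock state is partitioned the same way the data is sharded, so each lock request is routed to, and recorded at, the node that owns the record.

```python
from collections import defaultdict

N_NODES = 4

def home_node(key: str) -> int:
    """The node that owns this key's shard (and hence its lock-table entry)."""
    return hash(key) % N_NODES

# One lock table per node: key -> (mode, set of holders). "free" means unlocked.
lock_tables = [defaultdict(lambda: ("free", set())) for _ in range(N_NODES)]

def request_lock(xact: str, key: str, mode: str) -> bool:
    """Route a 2PL lock request to the key's home node; grant it if compatible."""
    node = home_node(key)
    held_mode, holders = lock_tables[node][key]
    if held_mode == "free" or (held_mode == "S" and mode == "S"):
        lock_tables[node][key] = (mode if held_mode == "free" else "S", holders | {xact})
        return True    # granted
    return False       # conflict: caller must wait (and watch for deadlock)

print(request_lock("T1", "user:42", "S"))   # True: granted at user:42's home node
print(request_lock("T2", "user:42", "X"))   # False: conflicts with T1's S lock
```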

4 Partitioned Data, Partitioned CC
Every node does its own CC What could go wrong?

5 Partitioned Data, Partitioned CC
- Every node does its own CC. What could go wrong?
- Distributed deadlock: easy enough. Gather up all the waits-for graphs and union them together.
- Message failure/delay, node failure/delay: we cannot distinguish delay from failure! A classical distributed systems challenge.
- How do we decide to commit/abort in the face of message/node failure/delay? Hold a vote.
- How many votes does a commit need to win? How do we implement distributed voting?!

6 2-Phase Commit
- Phase 1: Coordinator tells participants to "prepare". Participants respond with yes/no votes; unanimity is required for "yes"!
- Phase 2: Coordinator disseminates the result of the vote.
- Need to do some logging for failure handling! More on this in a couple of slides. (A toy sketch of the two phases follows.)
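
A hedged, in-memory toy of the two phases (no networking, no logging, no timeouts; the Participant class and its methods are illustrative, not any real system's API):

```python
class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name, self.will_vote_yes = name, will_vote_yes
        self.state = "RUNNING"

    def prepare(self, xact_id):
        # Phase 1: vote. (A real participant would force a prepare record first.)
        self.state = "PREPARED" if self.will_vote_yes else "ABORTED"
        return "YES" if self.will_vote_yes else "NO"

    def decide(self, xact_id, decision):
        # Phase 2: act on the coordinator's decision and acknowledge it.
        self.state = "COMMITTED" if decision == "COMMIT" else "ABORTED"
        return "ACK"

def two_phase_commit(participants, xact_id):
    votes = [p.prepare(xact_id) for p in participants]           # Phase 1
    decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
    for p in participants:                                       # Phase 2
        p.decide(xact_id, decision)
    return decision

print(two_phase_commit([Participant("A"), Participant("B")], "T1"))         # COMMIT
print(two_phase_commit([Participant("A"), Participant("B", False)], "T2"))  # ABORT
```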

7 2-Phase Commit: Like a wedding ceremony!
- Phase 1: "do you take this man/woman..." Coordinator tells participants to "prepare"; participants respond with yes/no votes; unanimity required for "yes"!
- Phase 2: "I now pronounce you..." Coordinator disseminates the result of the vote.
- Need to do some logging for failure handling... more on this in a couple of slides.

8 2PC: Messages
[Diagram: a timeline between Coordinator and Participant, time flowing downward. Coordinator sends Prepare; Participant replies with its Yes/No vote; Coordinator sends Commit/Abort; Participant replies with an Ack.]

9 2PC + Concurrency Control
- Straightforward for Strict 2PL: if messages are ordered, we hold all the locks we'll ever need by the time we receive the commit request.
- Message ordering is not hard: use sequence numbers. TCP does this for you...
- On abort, it's safe to abort: no cascading aborts.
- Somewhat trickier for other CC schemes; we'll skip the details.

10 2PC and Recovery
- What if a participant crashes? How does it know what the coordinator "believes"? Simpler: how does it know what messages it sent to, and received from, the coordinator?
- What if the coordinator crashes? How does it know what the participants "believe"? How does it know what messages it sent/received, and to/from whom?
- WAL for 2PC: write log records ahead of messages. Recovery deals with continuing the 2PC protocol.

11 2PC: Messages
NOTE the asterisks*: wait for a log flush before sending the next message.
[Diagram: the same Coordinator/Participant timeline as before, annotated with log records.]
- Coordinator sends Prepare.
- Participant writes prepare* or abort (with coordinator ID), then sends its Yes/No vote.
- Coordinator writes commit* or abort (the commit record includes participant IDs), then sends Commit/Abort.
- Participant writes commit* or abort, then sends Ack (on commit).
- Coordinator writes end (on commit).
(A participant-side sketch of this log-then-send discipline follows.)
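
A hedged participant-side sketch of the write-ahead rule above, where flush() marks the forced records (the asterisks) that must hit disk before the corresponding message goes out; the class and its in-memory "log" are illustrative only:

```python
class Participant2PC:
    def __init__(self, node_id):
        self.node_id = node_id
        self.log = []          # in-memory WAL tail
        self.durable = []      # what has actually reached disk

    def flush(self):
        self.durable = list(self.log)                  # force the log to disk

    def on_prepare(self, xact_id, coord_id, can_commit=True):
        if can_commit:
            self.log.append(("prepare", xact_id, coord_id))
            self.flush()                   # prepare* forced BEFORE voting yes
            return "YES"
        self.log.append(("abort", xact_id, coord_id))  # abort need not be forced
        return "NO"

    def on_decision(self, xact_id, decision):
        self.log.append((decision, xact_id))
        if decision == "commit":
            self.flush()                   # commit* forced BEFORE sending the ack
            return "ACK"
        return None                        # no ack needed on abort

p = Participant2PC("node3")
print(p.on_prepare("T7", coord_id="node0"))    # YES (prepare record is durable)
print(p.on_decision("T7", "commit"))           # ACK (commit record is durable)
```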

12 Upon Communication Failure
- Assume everybody recovers eventually. Big assumption! Depends on WAL (and short downtimes).
- Coordinator notices a Participant is down? If the participant hasn't voted yet, abort the transaction locally. If waiting for a commit Ack, hand the transaction to the "recovery process".
- Participant notices the Coordinator is down? If it hasn't yet logged prepare, abort unilaterally. If it has logged prepare, hand the transaction to the "recovery process".
- Note: thinking a node is "down" may be incorrect!

13 Integration with ARIES Recovery
- On recovery, assume there's a "Recovery Process" at each node. It will be given tasks to do by the Analysis phase of ARIES; these tasks can run in the background (asynchronously).
- Note: a single node plays multiple roles: Coordinator for some xacts, Participant for others.

14 Integration with ARIES: Analysis
- Recall the transaction table states: Running, Committing, Aborting.
- On seeing a prepare log record (participant role): change the state to Committing; tell the recovery process to ask the coordinator's recovery process for the status. When the coordinator responds, the recovery process handles commit/abort as usual. (Note: during REDO, Strict 2PL locks will be acquired.)
- On seeing a commit/abort log record (coordinator role): change the state to Committing/Aborting respectively; tell the recovery process to send commit/abort messages to the participants. Once all participants ack the commit, the recovery process writes end and forgets the transaction.
- If at the end of Analysis there are no 2PC log records for xact X: simply set it to Aborting locally and let ToUndo handle it. A.k.a. "Presumed Abort". (A sketch of this bookkeeping follows.)
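
A hedged sketch of the Analysis-phase bookkeeping described above; the log-record shapes, state names, and task tuples are made up for illustration:

```python
def analyze_2pc(log_records, coordinated_here):
    """Classify transactions from the 2PC-related log records found at this node."""
    xact_table, tasks, done = {}, [], set()
    for rec in log_records:
        kind, xid = rec["type"], rec["xid"]
        if kind == "update":
            xact_table.setdefault(xid, "RUNNING")
        elif kind == "prepare":
            # Participant role: in doubt. Ask the coordinator's recovery process.
            xact_table[xid] = "COMMITTING"
            tasks.append(("ask_coordinator", xid, rec["coord_id"]))
        elif kind in ("commit", "abort") and xid in coordinated_here:
            # Coordinator role: keep pushing the decision out to the participants.
            xact_table[xid] = {"commit": "COMMITTING", "abort": "ABORTING"}[kind]
            tasks.append(("send_decision_to_participants", xid, kind))
        elif kind == "end":
            done.add(xid)
    for xid in done:
        xact_table.pop(xid, None)            # finished transactions are forgotten
    # Presumed Abort: no 2PC records means we just undo the transaction locally.
    for xid, state in xact_table.items():
        if state == "RUNNING":
            xact_table[xid] = "ABORTING"     # ToUndo handles these as usual
    return xact_table, tasks

log = [{"type": "update", "xid": "T1"},
       {"type": "prepare", "xid": "T1", "coord_id": "node0"},
       {"type": "update", "xid": "T2"}]
print(analyze_2pc(log, coordinated_here=set()))
# T1 is in doubt (ask node0); T2 has no 2PC records, so it is presumed aborted.
```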

15 Recovery Process
Inquiry from a "prepared" participant:
- If the transaction table says aborting/committing: send the appropriate response and continue the protocol on both sides.
- If the transaction table says nothing: send ABORT. This only happens if the Coordinator had also crashed before writing commit/abort. The inquirer then aborts on its end.

16 Notes
- We are glossing over some optimizations that further reduce waits for log flush (the asterisks), esp. for read-only participants.
- Commit is the expensive case, but should be the common case. Can we design a "Presumed Commit" protocol? Yes! But it doesn't always win if we handle read-only transactions with Presumed Abort.

17 Availability Concerns
- What happens while a node is down? Other nodes may be in limbo, holding locks, so certain data is unavailable. This may be bad...
- Dead Participants? Respawned by the coordinator; recover from the log. And if the old participant comes back from the dead, just ignore it.
- Dead Coordinator? This is a problem. See 3PC, but that has assumptions; see also Paxos Commit.

18 Outline
- Partitioned Data: Classical Distributed Databases; Two Phase Commit (2PC); Logging and Recovery for 2PC
- Replicated Data: Transactional replication
- "Relaxing" transactions: Weak Isolation
- Loose Consistency & NoSQL: Replica "consistency" & linearizability; Quorums; Eventual Consistency
- The Programmer's View?

19 Why Replicate Data?
- Unlike partitioning, replication may not help with scale! At least not data scale.
- Replication increases availability: survive catastrophic failures (avoid correlation: different rack, different datacenter); bypass transient failures.
- Replication reduces latency: choose a "nearby" server (particularly for geo-distributed DBs), or ask many servers and take the 1st response.
- Replication makes load balancing possible.
- Other reasons: facilitate sharing, autonomy, system mgmt, etc.

20 Traditional DB Replication Mechanisms
- Assume disk mirroring at each machine, so don't worry about local media failure.
- Small number of big databases (say one per continent), each replicated elsewhere, mostly for failure recovery.
- Data-shipping: expensive (deal with update volumes at both ends); expensive to get transactional guarantees (even single-site transactions would require 2PC!); but easy interoperability across different systems.
- Log-shipping: cheap/free at the source node; modest bandwidth (diffs); tricky to do across different systems (replaying Oracle logs into an MS SQL Server DB? Ick.).

21 Single- vs. Multi-Master Writes
- Single-master: every data item has one appointed Master node; writes are first performed at the Master; replication propagates the writes to the others; 2PL/2PC can be used for transactions. Easy to understand, but defeats many benefits for writers, esp. reducing latency, and also load balancing.
- Multi-master: writes can happen anywhere; replication among all copies. Allows even writers to enjoy lower latency and load balancing. Fast, but hard to make sense of things: the same item can be updated in two places; which update "wins"? 2PL/2PC defeat the positive effects of multi-master.

22 A Compromise: Quorums
- Suppose you have N nodes replicating each item.
- Send each write request to W <= N nodes; send each read request to R <= N nodes.
- How to ensure readers see the latest data: ensure W + R > N (a "strict" quorum). E.g. W = R > N/2 (majority quorum). E.g. W = N, R = 1 (writer broadcasts).
- Advantages: not at the mercy of a single slow "master"; relatively easy to recover from a failed node.
- But... transactional semantics still require 2PC. (A toy quorum read/write sketch follows.)
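
A hedged toy of strict-quorum reads and writes over N in-memory replicas (no failures, no concurrency; the version counter stands in for a real timestamp scheme):

```python
import itertools, random

N, W, R = 5, 3, 3                       # W + R > N, so read and write sets overlap
replicas = [dict() for _ in range(N)]   # each replica: key -> (version, value)
clock = itertools.count(1)              # stand-in for a version/timestamp source

def write(key, value):
    version = next(clock)
    for node in random.sample(range(N), W):      # any W replicas acknowledge
        replicas[node][key] = (version, value)
    return version

def read(key):
    answers = [replicas[node].get(key, (0, None))
               for node in random.sample(range(N), R)]
    # Because W + R > N, at least one of the R replicas saw the latest write;
    # the highest version number wins.
    return max(answers, key=lambda vv: vv[0])[1]

write("cart:alice", ["apple"])
write("cart:alice", ["apple", "orange"])
print(read("cart:alice"))               # ['apple', 'orange'] on every run
```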

23 Is 2PC really so expensive?
- How expensive is it to hold a unanimous vote? Raw latencies depend on your network: geo-replication and the speed of light vs. modern datacenter switches.
- But best-case behavior is not a good metric! Much depends on your machines' delay distribution. Hardware: network switches, magnetic disk vs. flash, etc. Software: GC in the JVM, carefully-crafted event handlers...

24 Google does 2PC; it must be good?!
- Google Spanner uses 2PC and a host of other machinery.
- But the latencies! Speed of light: a signal can circle the globe only about 7 times per second, so transactions that must coordinate around the globe, one after another, run on the order of 10 TPS!

25 Hamilton Quote on Coordination
"The first principle of successful scalability is to batter the consistency mechanisms down to a minimum, move them off the critical path, hide them in a rarely visited corner of the system, and then make it as hard as possible for application developers to get permission to use them"
-- James Hamilton (IBM, MS, Amazon)
Only relevant to massive scalability!?

26 Outline
- Partitioned Data: Classical Distributed Databases; Two Phase Commit (2PC); Logging and Recovery for 2PC
- Replicated Data: Transactional replication
- "Relaxing" transactions: Weak Isolation
- Loose Consistency & NoSQL: Replica "consistency" & linearizability; Quorums; Eventual Consistency
- The Programmer's View?

27 Weak Isolation: Motivation
- Even on a single node, sometimes transactions seem too restrictive: the Chancellor requests the average GPA of all students while various profs want to make individual updates to grades. Can't we all just get along?
- Sometimes transactions seem too expensive: e.g. 2PC requires computers to wait for each other.
- Must transactions be "all or nothing"? Can't we have "loose" transactions, or "a little bit of" transactions?
- Short answer (tl;dr): yes, but the API for the programmer is hard to reason about. Still, many people adopt a "don't worry, be happy" attitude.

28 SQL Isolation Levels (Lock-based)
- Read Uncommitted. Idea: can read dirty data. Implementation: no locks on read.
- Read Committed. Idea: only read committed items. Implementation: can unlock immediately after read.
- Cursor Stability. Idea: ensure reads are consistent while the app "thinks". Implementation: unlock an object when moving to the next.
- Repeatable Read. Idea: if you read an item twice in a transaction, you see the same committed version. Implementation: hold read locks until end of transaction. No phantom protection.
- Serializable. Idea: full isolation. Implementation: Strict 2PL, plus phantom protection.

29 Snapshot Isolation (SI)
- All reads made in a transaction are from the same point in (transactional) time, typically the time when the transaction starts: it's like a "snapshot" taken at start time.
- A transaction aborts if its writes conflict with any writes committed since its snapshot.
- When implemented on a multiversion system, this can run very efficiently! Oracle pioneered this; Postgres also implements it.
- Fact 1: This is not equivalent to serializability. Fact 2: Oracle calls this "serializable" mode.

30 SI Problem: Write Skew
Checking (Ci) and Savings (Si) accounts for customer i:
- Constraint: Ci + Si >= 0
- Begin: Ci = Si = 100
- T1: withdraw $200 from Ci
- T2: withdraw $200 from Si
- Serial schedules: T1; T2. Outcome: ?? T2; T1. Outcome: ??
- SI schedule: ?? !! (A toy simulation follows.)
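
To make the anomaly concrete, here is a hedged toy simulation of the SI schedule: both transactions read from the same start-time snapshot, each checks the constraint against that snapshot, and both commit because they write different items (no write-write conflict).

```python
# Toy write-skew demonstration under snapshot isolation (illustrative only).
db = {"C": 100, "S": 100}             # Ci = Si = 100, constraint C + S >= 0

snapshot_t1 = dict(db)                # T1 starts: sees (100, 100)
snapshot_t2 = dict(db)                # T2 starts concurrently: same snapshot

# T1: withdraw 200 from checking, checking the constraint against ITS snapshot.
if snapshot_t1["C"] + snapshot_t1["S"] - 200 >= 0:
    db["C"] = snapshot_t1["C"] - 200  # commits: no write-write conflict on C

# T2: withdraw 200 from savings, also against the pre-T1 snapshot.
if snapshot_t2["C"] + snapshot_t2["S"] - 200 >= 0:
    db["S"] = snapshot_t2["S"] - 200  # commits: no write-write conflict on S

print(db, "constraint holds?", db["C"] + db["S"] >= 0)
# {'C': -100, 'S': -100} constraint holds? False -- neither serial order allows this
```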

31 Bad News
- The lock-based implementations don't exactly match the SQL standard. They do uphold the ANSI guarantees, but the official SQL definitions are (unintentionally) somewhat more general.
- Upshot: it's very hard to reason about the meaning of weak isolation. Usually people resort to thinking about the implementation, which provides little help for the app developer!

32 Worse News!

33 Worse News! With thanks to Peter Bailis

34 Outline
- Partitioned Data: Classical Distributed Databases; Two Phase Commit (2PC); Logging and Recovery for 2PC
- Replicated Data: Transactional replication
- "Relaxing" transactions: Weak Isolation
- Loose Consistency & NoSQL: Replica "consistency" & linearizability; Quorums; Eventual Consistency
- The Programmer's View?

35 A Definition: Replica Consistency
- Illusion: all copies of item X are updated atomically. Really: readers see values of X that they'd see without concurrency.
- Linearizability (a.k.a. "atomic consistency"). Can play with multi-versioning to get some more concurrency.
- Not to be confused with the A or C in ACID! This overloading has caused endless headaches.
- Not to be confused with serializability: it concerns individual actions (single-object writes), not transactions. You can layer serializability on top of linearizable stores.

36 Digging Deeper: Linearizability
- History: a set of request/response pairs.
- Linear history: a history with immediate responses to requests.
- Linearizable equivalence: history h1 is equivalent to h2 if they contain the same set of requests/responses, and whenever response_i precedes request_j in h1, the same holds in h2.
- Linearizable history: one that is linearizably equivalent to a linear history, i.e., to a fixed order of atomic request/response pairs.

37 Update-heavy Global-Scale Services
- E.g. Amazon, Facebook, LinkedIn, Twitter: shopping, posting and connecting.
- Latency and availability are both paramount, so replication is critical. Favors multi-master solutions, or loose quora.
- Replica consistency becomes frustrating (see the Hamilton quote). It becomes quite natural to ditch consistency! Even linearizability is too expensive (?), all the more so serializability!!
- The rise of NoSQL.

38 NoSQL
- Born in Amazon Dynamo. Mostly "NoACID", but often drops rich schemas and queries too.
- Typical NoSQL approach: simple key/value data model: put(key, val), get(key) -> val.
- Replicated data (think 3+ replicas). Multi-master: write anywhere, with a writer timestamp. An update is acked by a single node: no 2PC!
- Updates are "gossiped" lazily among the replicas in the background, a.k.a. "anti-entropy", "hinted handoff" and other fancy names.
- Reads can consult 1 or more replicas; quorums can optionally be used to achieve linearizability. (A toy sketch follows.)
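
A hedged toy of this style of store: three in-memory replicas, single-replica acks, lazy anti-entropy, and last-writer-wins by wall-clock timestamp. All names and structure are illustrative, not Dynamo's actual design.

```python
import time, random

REPLICAS = [dict() for _ in range(3)]       # each replica: key -> (timestamp, value)

def put(key, value):
    node = random.choice(REPLICAS)          # multi-master: write anywhere
    node[key] = (time.time(), value)        # writer timestamp; single-node ack
    return "ack"

def gossip():
    # Anti-entropy: replicas exchange versions; the last writer wins on conflict.
    for key in {k for r in REPLICAS for k in r}:
        newest = max((r[key] for r in REPLICAS if key in r), key=lambda t: t[0])
        for r in REPLICAS:
            r[key] = newest

def get(key, consult=1):
    nodes = random.sample(REPLICAS, consult)             # 1 replica, or a quorum
    hits = [r[key] for r in nodes if key in r]
    return max(hits, key=lambda t: t[0])[1] if hits else None

put("user:7:cart", ["apple"])
print(get("user:7:cart"))   # may be None or stale until gossip() runs
gossip()
print(get("user:7:cart"))   # ['apple'] once anti-entropy has propagated it
```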

39 NoSQL Ambiguities
- Atypical to bother with linearizability.
- Per-object ambiguities: you may not read the latest version; writers can conflict (write the same key in different places). "Conflict resolution" or "merge" rules apply: e.g. last/first writer wins (but what clock do you use?), or a semantic merge (e.g. increment).
- Across objects: typically no guarantees; no notion of "trans"-actional semantics.

40 “Eventual Consistency”
“if no new updates are made to the object, eventually all accesses will return the last updated value” -- Werner Vogels, “Eventually Consistent”, CACM 2009 Which in practice means...??

41 Problems with E.C.
Two kinds of properties people discuss in distributed systems:
- Safety: nothing bad ever happens. False! At any given time, the state of the DB may be "bad".
- Liveness: a good thing eventually happens. Sort of! Only in a "quiescent" eventuality.

42 Outline
- Partitioned Data: Classical Distributed Databases; Two Phase Commit (2PC); Logging and Recovery for 2PC
- Replicated Data: Transactional replication
- "Relaxing" transactions: Weak Isolation
- Loose Consistency & NoSQL: Replica "consistency" & linearizability; Quorums; Eventual Consistency
- The Programmer's View?

43 How do programmers deal with EC?
- "What, me worry?"
- Application-level coordination
- Provably consistent code: monotonicity

44 Case Study: The Shopping Cart
Based on Amazon Dynamo. First, a version using the global 2PC commit order as the "clock".

45-51 [Animation frames: two replicas of the cart, each shown as a Key/Value table, processing a sequence of adds and removes. With a single global commit order as the "clock", both replicas step through the same values.]

52 Eventually Consistent Version
What could go wrong?

53 [Animation frame: the same workload in the eventually consistent version. The two replicas' Item/Count tables diverge: after concurrent updates and a removal (-1), one replica shows a count of 2 where the other shows 1.]

54 What, me worry?
- How often does inconsistency occur? E.g. what are the odds of a "stale read"? The P.B.S. study at Berkeley (pbs.cs.berkeley.edu), based on LinkedIn and Yammer traces: stale reads are pretty rare, esp. in a single datacenter using flash disks. But they happen!
- How bad are the effects of inconsistency? In terms of application semantics; in terms of exception detection/handling ("apologies").

55 Application-level Reasoning
- The most interesting part of the Dynamo paper is its application-level smarts!
- Basic idea: don't mutate values in the key-value store. Accumulate logs at each node monotonically; union up a set of items (the set grows monotonically). Union is commutative/associative!
- Only coordinate for checkout. (A minimal sketch follows.)
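
A hedged sketch of the idea: a grow-only set per replica with set union as the merge. (Real Dynamo-style carts are more elaborate, e.g. they also track removals; the class below is my own illustration.)

```python
class GrowOnlyCart:
    def __init__(self):
        self.items = set()

    def add(self, item):
        self.items.add(item)                # local, coordination-free write

    def merge(self, other):
        self.items |= other.items           # union: merge order doesn't matter

replica_a, replica_b = GrowOnlyCart(), GrowOnlyCart()
replica_a.add("apple"); replica_a.add("apple")   # concurrent, possibly duplicated
replica_b.add("orange")

replica_a.merge(replica_b); replica_b.merge(replica_a)
print(replica_a.items == replica_b.items)   # True: replicas agree without 2PC
# Removal is the non-monotonic part: defer it (or use tombstones) and only
# coordinate at checkout, as the next slides discuss.
```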

56 Barrier at Checkout

57 Monotonic Code
Merge things "upward":
- sets grow bigger (merge: union)
- counters go up (merge: MAX)
- booleans go from false to true (merge: OR)

58 Intuition from the Integers
VON NEUMANN:
    int ctr;
    operator:=(x) {   // assign
        ctr = x;
    }

MONOTONIC:
    int ctr;
    operator<=(x) {   // merge
        ctr = MAX(ctr, x);
    }

DISORDERLY INPUT STREAMS:
    2, 5, 6, 7, 11, 22, 44, 91
    5, 7, 2, 11, 44, 6, 22, 91, 5

60 Intuition from the Integers
Feeding the disorderly input streams (2, 5, 6, 7, 11, 22, 44, 91 and 5, 7, 2, 11, 44, 6, 22, 91, 5) to the MONOTONIC version gives:
+ monotonic "progress"
+ order-insensitive outcome

61 General Monotonicity
- Merge things "upward": sets grow bigger (merge: union); counters go up (merge: MAX); booleans go from false to true (merge: OR).
- Values can be partially ordered, e.g. sets growing.
- Note why this works: merges are associative, commutative, and idempotent. (See the small illustration below.)
- Core mathematical objects: lattices & monotonic logic. E.g. select/project/join/union, but not set-difference or negation.
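
A small, hedged illustration of why these merges behave well: each one is associative, commutative, and idempotent, so duplicated or reordered deliveries converge to the same result.

```python
from functools import reduce

def merge_set(a, b):  return a | b            # sets grow bigger
def merge_ctr(a, b):  return max(a, b)        # counters go up
def merge_bool(a, b): return a or b           # booleans go from false to true

updates = [{"apple"}, {"orange"}, {"apple"}]  # disorderly, duplicated deliveries
print(reduce(merge_set, updates))             # {'apple', 'orange'}
print(reduce(merge_set, reversed(updates)))   # same set, different delivery order

print(reduce(merge_ctr, [2, 5, 6, 7, 11, 22, 44, 91]))      # 91
print(reduce(merge_ctr, [5, 7, 2, 11, 44, 6, 22, 91, 5]))   # 91: order-insensitive
```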

62 CALM Theorem
- When can you have consistency without coordination? Really. I mean really. Like "complexity theory" really. I.e., given application X, is there any way to code it up to run correctly without coordination?
- CALM: Consistency As Logical Monotonicity. Programs are eventually consistent (without coordination) iff they are monotonic.
- Monotonicity => coordination can be avoided (somehow). Non-monotonicity => coordination required.
- Hellerstein: The Declarative Imperative, 2010.

63 Example The fully monotone shopping cart

64 A “seal” or “manifest” Barrier at Checkout

65 The Burden on App Developers
- Given the tools you have (Java/Eclipse, C++/gdb, etc.): convince yourself your app is correct. No race conditions? Fault tolerant?
- Convince yourself your app will remain correct, even after you are replaced.
- Convince yourself the app plays well with others. Does your eventual consistency taint somebody else's serializable data?
- Wow. OK. Do you miss transactions yet? Amazon reportedly replaced Dynamo. Transactions could be practical in a datacenter; MS Azure makes use of this. But what about global, high-performance systems?

66

67 [Diagram: an application's intent, "Buy two apples and an orange", must be funneled to the database through a narrow Read/Write interface: Read, Write, Read, Write, Read, Write... With thanks to Peter Bailis.]

68 Shameless plug: <~ bloom
- If you know your application, you can avoid coordination. Really good programmers do this! Maybe.
- Everyone else needs a better programming model, one that encourages monotonicity and analyzes for it: e.g. Bloom, plus confluence analysis (Blazes), fault-tolerance analysis (Molly), etc.
- We are still working on these ideas in my group, now in the high-performance main-memory database setting.

69 Data Models
- Key/Value stores: opaque values.
- "Document" stores: transparent values (JSON or XML), amenable to indexing by tags. Vanilla SQL can't "unpack" the JSON, so you get funky query languages: e.g. XQuery for XML (a rich one), or MongoDB's find() syntax (illustrated below).
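
For instance, a hedged PyMongo snippet: the database name, collection name, and fields are made up, and it assumes a MongoDB server is reachable at localhost.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
carts = client["shop"]["carts"]

carts.insert_one({"user": "alice", "items": [{"sku": "apple", "qty": 2}]})

# find() takes a query document: match on nested fields, no fixed schema required.
for doc in carts.find({"items.sku": "apple"}, {"user": 1, "_id": 0}):
    print(doc)                       # {'user': 'alice'}
```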

70 Popular NoSQL Systems
- HBase, Cassandra, Voldemort, Riak, Couchbase, CouchDB, MongoDB, Redis, etc., etc.
- Why so many? Not all that hard to build; lack of standards; design alternatives.
- Interesting to compare to Hadoop: why is there no single "NoSQL core"? Cause or effect?

71 Should you be using NoSQL?
- Data model and query support: Do you want/need the power of something like SQL? Are your queries canned, or ad hoc? Do you want/need fixed or flexible schemas? Note that you can put flexible data into a SQL database; see for example PostgreSQL's JSON support.
- Scale: Do you want/need massive scalability and high availability? What's your data volume? Update/query workload? Will you need geo-replication? Are you willing to sacrifice replica consistency?
- Agility and growth: Are you building a service that could grow exponentially? Optimizing for quick, simple coding? Or maintainability?

72 Summing Up Distributed Databases
- A central aspect of distributed systems: partitioning for scale-out; replication for fault-tolerance and performance.
- Challenges with the Read-Write API: serializability is very expensive (witness weak isolation); even linearizability is considered expensive (witness NoSQL).
- This is all being revisited today. Programmability and maintenance are key issues; linearizability and ACID are creeping back into favor; new programming models/languages/libraries aim to do better than the Read-Write API.

73 Backup material

74 Brewer's CAP Theorem
- C = (replica) Consistency, A = Availability, P = (tolerance of network) Partitions.
- Basic formulation: a system can only provide 2 of {C, A, P}.
- More practical formulation: partition ~= high latency (the "CAL theorem"?). If you want High Availability and Low Latency, give up on Consistency. If you want Consistency, give up on High Availability and Low Latency.

75 Additional concern: finding replicas
- Centralized directory: a single node that maps keys to machines.
- Hierarchical directory: the namespace is partitioned and sub-partitioned; ask "up the tree". DNS works this way.
- Hashing: a static, well-known hash function is not very flexible. More adaptive schemes are based on "consistent hashing" (sketched below). Truly distributed hashing: e.g. Chord.
- All of these should be performant and fault-tolerant too! Caching of locations is important throughout. A variety of trickiness here.
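
A hedged sketch of consistent hashing (no virtual nodes, replication, or location caching; the class and names are illustrative): nodes and keys hash onto a ring, a key lives on the first node clockwise from its hash, and adding a node only moves the keys in one arc of the ring.

```python
import bisect, hashlib

def h(s: str) -> int:
    """Hash a string to a point on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)      # (position, node) pairs

    def lookup(self, key: str) -> str:
        positions = [p for p, _ in self.ring]
        i = bisect.bisect(positions, h(key)) % len(self.ring)   # wrap around
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.lookup("user:42"))          # some node, stable across lookups
ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.lookup("user:42"))          # usually unchanged: only ~1/4 of keys move
```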

76 Fault Tolerance and Load Balancing
- Nodes may fail. They can be recovered from replicas, but in the interim the replica nodes may be overloaded. Interesting questions on how to get graceful degradation. Some process needs to oversee this; various solutions here, both centralized and distributed.
- Load balancing: even without failures, keys may have skewed access. Consistent hashing helps adapt placement while still providing directories.

