CSL 771: Database Implementation Transaction Processing Maya Ramanath All material (including figures) from: Concurrency Control and Recovery in Database Systems Phil Bernstein, Vassos Hadzilacos and Nathan Goodman (http://research.microsoft.com/en-us/people/philbe/ccontrol.aspx)
A transaction is a unit of interaction Transactions Interaction with the DBMS through SQL update Airlines set price = price - price*0.1, status = “cheap” where price < 5000 A transaction is a unit of interaction
Database system must ensure ACID properties Atomicity Consistency Isolation Durability Database system must ensure ACID properties
Atomicity and Consistency Single transaction Execution of a transaction: “all-or-nothing” Either a transaction completes in its entirety Or it “does not even start” As if the transaction never existed No partial effect must be visible 2 outcomes: A transaction COMMITs or ABORTs
Consistency and Isolation Multiple transactions Concurrent execution can cause an inconsistent database state Each transaction executed as if isolated from the others
Durability If a transaction commits the effects are permanent But, durability has a bigger scope Catastrophic failures (floods, fires, earthquakes)
What we will study… Concurrency Control Recovery Ensuring atomicity, consistency and isolation when multiple transactions are executed concurrently Recovery Ensuring durability and consistency in case of software/hardware failures
Terminology Data item Read (x) Write (x, 5) Start (T) Commit (T) A tuple, table, block Read (x) Write (x, 5) Start (T) Commit (T) Abort (T) Active Transaction A transaction which has neither committed nor aborted Commit: guarantees that the transaction will not be aborted! (Bank transaction, for example) Abort: can occur because the transaction itself encounters an error condition, or by system failures beyond its control. The DB itself may abort for some reason.
High level model Transaction Manager Scheduler Recovery Manager Transaction n Transaction Manager Scheduler Disk Recovery Manager Cache Manager
Recoverability (1/2) Transaction T Aborts DBMS has to… T wrote some data items T’ read items that T wrote DBMS has to… Undo the effects of T Undo effects of T’ But, T’ has already committed T T’ Read (x) Write (x, k) Read (y) Write (y, k’) Commit Abort
Ti commits after all Tk commit Recoverability (2/2) Let T1,…,Tn be a set of transactions Ti reads a value written by Tk, k < i An execution of transactions is recoverable if Ti commits after all Tk commit T1 T2 Write (x,2) Read (x) Write (y,2) Commit T1 T2 Write (x,2) Read (x) Write (y,2) Commit
Cascading Aborts (1/2) Because T was aborted, T1,…, Tk also have to be aborted T T’ T’’ Read (x) Write (x, k) Read (y) Write (y, k’) Abort
Cascading Aborts (2/2) Recoverable executions do not prevent cascading aborts How can we prevent them then ? T1 T2 Write (x,2) Read (x) Write (y,2) Commit T1 T2 Write (x,2) Commit Read (x) Write (y,2)
What we learnt so far… Reading a value, committing a transaction Recoverable with cascading aborts Recoverable without cascading aborts Not recoverable T1 T2 Write (x,2) Read (x) Write (y,2) Commit T1 T2 Write (x,2) Read (x) Write (y,2) Commit T1 T2 Write (x,2) Commit Read (x) Write (y,2)
Strict Schedule (1/2) “Undo”-ing the effects of a transaction Restore the before image of the data item T1 T2 Write (x,1) Write (y,3) Write (y,1) Commit Read (x) Abort T1 T2 Write (x,1) Write (y,3) Commit Equivalent to Final value of y: 3
Strict Schedule (2/2) Initial value of x: 1 T1 T2 Write (x,2) Write (x,3) Abort T1 T2 Write (x,2) Write (x,3) Abort T1 T2 Write (x,2) Abort Write (x,3) Should x be restored to 1 or 3? T1 restores x to 3? T2 restores x to 2? Do not read or write a value which has been written by an active transaction until that transaction has committed or aborted
The Lost Update Problem Read (x) Write (x, 200,000) Commit Write (x, 200) Assume x is your account balance
Serializable Schedules Serial schedule Simply execute transactions one after the other A serializable schedule is one which equivalent to some serial schedule
Serializability Theory
Serializable Schedules T1: op11, op12, op13 T2: op21, op22, op23, op24 Serial schedule Simply execute transactions one after the other op11, op12, op13 op21, op22, op23, op24 op21, op22, op23, op24 op11, op12, op13 Serializable schedule Interleave operations Ensure end result is equivalent to some serial schedule
Notation r1[x] = Transaction 1, Read (x) w1[x] = Transaction 1, Write (x) c1 = Transaction 1, Commit a1= Transaction 1, Abort r1[x], r1[y], w2[x], r2[y], c1, c2
Histories (1/3) Operations of transaction T can be represented by a partial order. r1[x] r1[y] w1[z] c1
Histories (2/3) Conflicting operations Of two ops operating on the same data item, if one of them is a write, then the ops conflict An order has to be specified for conflicting operations
Histories (3/3) Complete History
Serializable Histories The goal: Ensure that the interleaving operations guarantee a serializable history. The method When are two histories equivalent? When is a history serial?
Equivalence of Histories (1/2) H ≅ H’ if they are defined over the same set of transactions and they have the same operations they order conflicting operations the same way
Equivalence of Histories (2/2) y Source: Concurrency Control and Recovery in Database Systems: Bernstein, Hadzilacos and Goodman
Serial History A complete history is serial if for every pair of transactions Ti and Tk, all operations of Ti occur before Tk OR all operations of Tk occur before Ti A history is serializable if its committed projection is equivalent to a serial history.
Serialization Graph T1 T3 T2
Serializability Theorem A history H is serializable if its serialization graph SG(H) is acyclic On your own How do recoverability, strict schedules, cascading aborts fit into the big picture?
Locking
High level model Transaction Manager Scheduler Recovery Manager Transaction n Transaction Manager Scheduler Disk Recovery Manager Cache Manager
Transaction Management . Transaction n Transaction Manager Receives Transactions Sends operations to scheduler Read1(x) Write2(y,k) Read2(x) Commit1 Scheduler Execute op Reject op Delay op Disk
Locking is used by the scheduler to ensure serializability Each data item x has a lock associated with it If T wants to access x Scheduler first acquires a lock on x Only one transaction can hold a lock on x T releases the lock after processing Locking is used by the scheduler to ensure serializability
Notation Read lock and write lock Obtaining read and write locks rl[x], wl[x] Obtaining read and write locks rli[x], wli[x] Lock table Entries of the form [x, r, Ti] Conflicting locks pli[x], qlk[y], x = y and p,q conflict Unlock rui[x], wui[x]
Basic 2-Phase Locking (2PL) RULE 1 RULE 2 pli[x] cannot be released until pi[x] is completed Receive pi[x] is qlk[x] set such that p and q conflict? NO Acquire pli[x] RULE 3 (2 Phase Rule) Once a lock is released no other locks may be obtained. YES pi[x] scheduled pi[x] delayed
The 2-phase rule Once a lock is released no other locks may be obtained. T1: r1[x] w1[y] c1 T2: w2[x] w2[y] c2 H = rl1[x] r1[x] ru1[x] wl2[x] w2[x] wl2[y] w2[y] wu2[x] wu2[y] c2 wl1[y] w1[y] wu1[y] c1 T1 T2
Correctness of 2PL 2PL always produces serializable histories Proof outline STEP 1: Characterize properties of the scheduler STEP 2: Prove that any history with these properties is serializable (That is, SG(H) is acyclic)
Deadlocks (1/2) T1: r1[x] w1[y] c1 T2: w2[y] w2[x] c2 Scheduler rl1[x] wl2[y] r1[x] w2[y] <cannot proceed>
Deadlocks (2/2) Strategies to deal with deadlocks Timeouts Leads to inefficiency Detecting deadlocks Maintain a wait-for graph, cycle indicates deadlock Once a deadlock is detected, break the cycle by aborting a transaction New problem: Starvation
Conservative 2PL Avoids deadlocks altogether T declares its readset and writeset Scheduler tries to acquire all required locks If not all locks can be acquired, T waits in a queue T never “starts” until all locks are acquired Therefore, it can never be involved in a deadlock On your own Strict 2PL (2PL which ensures only strict schedules)
Extra Information Assumption: Data items are organized in a tree Can we come up with a better (more efficient) protocol?
Tree Locking Protocol (1/3) RULE 2 if x is an intermediate node, and y is a parent of x, the ali[x] is possible only if ali[y] RULE 1 Receive ai[x] is alk[x] ? NO RULE 2 RULE 3 ali[x] cannot be released until ai[x] is completed YES pi[x] scheduled ai[x] delayed RULE 4 Once a lock is released the same lock may not be re-obtained.
Tree Locking Protocol (2/3) Proposition: If Ti locks x before Tk, then for every v which is a descendant of x, if both Ti and Tk lock v, then Ti locks v before Tk. Theorem: Tree Locking Protocol always produces Serializable Schedules
Tree Locking Protocol (3/3) Tree Locking Protocol avoids deadlock Releases locks earlier than 2PL BUT Needs to know the access pattern to be effective Transactions should access nodes from root-to-leaf
Multi-granularity Locking (1/3) Refers to the relative size of the data item Attribute, tuple, table, page, file, etc. Efficiency depends on granularity of locking Allow transactions to lock at different granularities
Multi-granularity Locking (2/3) Lock Instance Graph Explicit and Implicit Locks Intention read and intention write locks Intention locks conflict with explicit read and write locks but not with other intention locks Source: Concurrency Control and Recovery in Database Systems: Bernstein, Hadzilacos and Goodman
Multi-granularity Locking (3/3) To set rli[x] or irli[x], first hold irli[y] or iwli[y], such that y is the parent of x. To set wli[x] or iwli[x], first hold iwli[y], such that y is the parent of x. To schedule ri[x] (or wi[x]), Ti must hold rli[y] (or wli[y]) where y = x, or y is an ancestor of x. To release irli[x] (or iwli[x]) no child of x can be locked by Ti
The Phantom Problem How to lock a tuple, which (currently) does not exist? T1: r1[x1], r1[x2], r1[X], c1 T2: w[x3], w[X], c2 rl1[x1], r1[x1], rl1[x2], r1[x2], wl2[x3], wl[X], w2[x3], wu2[x3,X], c2, rl1[X], ru1[x1,x2,X], c1
Non-lock-based schedulers
Timestamp Ordering (1/3) Each transaction is associated with a timestamp Ti indicates Transaction T with timestamp i. Each operation in the transaction has the same timestamp
Timestamp Ordering (2/3) TO Rule If pi[x] and qk[x] are conflicting operations, then pi[x] is processed before qk[x] iff i < k Theorem: If H is a history representing an execution produced by a TO scheduler, then H is serializable.
Timestamp Ordering (3/3) For each data item x, maintain: max-rt(x), max-wt(x), c(x) Request ri[x] Grant request if TS (i) >= max-wt (x) and c(x), update max-rt (x) Delay if TS(i) > max-wt(x) and !c(x) Else abort and restart Ti Request wi[x] Grant request if TS (i) >= max-wt (x) and TS (i) >= max-rt (x), update max-wt (x), set c(x) = false ON YOUR OWN: Thomas write rule, actions taken when a transaction has to commit or abort
Validation Aggressively schedule all operations Do not commit until the transaction is “validated” ON YOUR OWN
Summary Lock-based Schedulers Non-lock-based Schedulers 2-Phase Locking Tree Locking Protocol Multi-granularity Locking Locking in the presence of updates Non-lock-based Schedulers Timestamp Ordering Validation-based Concurrency Control (on your own)
SOURCE: Database System: The complete book SOURCE: Database System: The complete book. Garcia-Molina, Ullman and Widom Recovery
Logging Log the operations in the transaction(s) Believe the log Does the log say transaction T has committed? Or does it say aborted? Or has only a partial trace (implicit abort)? In case of failures, reconstruct the DB from its log
The basic setup Buffer Space for each transaction Buffer Space for data and log Transactions LOG T1 The Disk T2 T3 Tk
Terminology Data item: an element which can be read or written tuple, relation, B+-tree index, etc Input x: fetch x from the disk to buffer Read x,t: read x into variable local variable t Write x,t: write value of t into x Output x: write x to disk
Example update Airlines set price = price - price*0.1, status = “cheap” where price < 5000 Read P, x x -= x* 0.1 Write x,P Read S, y y = “CHEAP” Write y, S Output P Output S System fails here System fails here System fails here
Logs Sequence of log records Need to keep track of Start of transaction Update operations (Write operations) End of transaction (COMMIT or ABORT) “Believe” the log, use the log to reconstruct a consistent DB state
All 3 logging styles ensure atomicity and durability Types of logs Undo logs Ensure that uncommitted transactions are rolled back (or undone) Redo logs Ensure that committed transactions are redone Undo/Redo logs Both of the above All 3 logging styles ensure atomicity and durability
Undo Logging (1/3) <START T>: Start of transaction T <COMMIT T> <ABORT T> <T, A, x>: Transaction T modified A whose before-image is x.
Undo Logging (2/3) Read P, x x -= x* 0.1 Write x,P Read S, y y = “CHEAP” Write y, S FLUSH LOG Output P Output S <START T> U1: <T, X, v> should be flushed before Output X U2: <COMMIT T> should be flushed after all OUTPUTs <T, P, x> <T, S, y> <COMMIT T>
Undo Logging (3/3) Recovery with Undo log If T has a <COMMIT T> entry, do nothing If T has a <START T> entry, but no <COMMIT T> T is incomplete and needs to be undone Restore old values from <T,X,v> records There may be multiple transactions Start scanning from the end of the log
Redo Logging (1/3) All incomplete transactions can be ignored Redo all completed transactions <T, A, x>: Transaction T modified A whose after-image is x.
Redo Logging (2/3) Read P, x x -= x* 0.1 Write x,P Read S, y y = “CHEAP” Write y, S FLUSH LOG Output P Output S <START T> R1: <T, X, v> and <COMMIT T> should be flushed before Output X <T, P, x> <T, S, y> <COMMIT T> Write-ahead Logging
Redo Logging (3/3) Recovery with Redo Logging If T has a <COMMIT T> entry, redo T If T is incomplete, do nothing (add <ABORT T>) For multiple transactions Scan from the beginning of the log
Undo/Redo Logging (1/3) Undo logging: Cannot COMMIT T unless all updates are written to disk Redo logging: Cannot release memory unless transaction commits Undo/Redo logs attempt to strike a balance
Undo/Redo Logging (2/3) Read P, x x -= x* 0.1 Write x,P Read S, y y = “CHEAP” Write y, S FLUSH LOG Output P Output S <START T> UR1: <T, X, a, b> should be flushed before Output X U1: <T, X, v> should be flushed before Output X U2: <COMMIT T> should be flushed after all OUTPUTs <T, P, x, a> <T, S, y, b> R1: <T, X, v> and <COMMIT T> should be flushed before Output X <COMMIT T>
Undo/Redo Logging (3/3) Recovery with Undo/Redo Logging Redo all committed transactions (earliest-first) Undo all uncommitted transactions (latest-first) What happens if there is a crash when you are writing a log? What happens if there is a crash during recovery?
Checkpointing Logs can be huge…can we throw away portions of it? Can we avoid processing all of it when there is a crash? ON YOUR OWN