Deferred Runtime Pipelining for contentious multicore transactions

Shuai Mu, Sebastian Angel, Dennis Shasha

Multicore programming today…
- Dangerous atomics
- Confusing semaphores
- Too many locks

Yet developers use transactions to interact with databases and distributed systems
Simpler, less error-prone, efficient enough...

What keeps multicore devs in the dark ages?
multicore txs have very high overheads (historically)

Three main points in this talk
- STM is not crazy expensive anymore
  - Recent proposals (STO [EuroSys ‘16], TDSL [PLDI ‘16]) use type information
  - Competitive with fine-grained locking for many workloads
- DB community has for decades leveraged workload knowledge
- We extend STO to support more workloads efficiently
  - New concurrency control protocol called DRP
  - DRP is inspired by work in DB, but avoids static analysis
  - Can handle arbitrary transactions defined at runtime

- STM is not crazy expensive anymore
- DB community has for decades leveraged workload knowledge
- We extend STO to support many more workloads efficiently

Software Transactional Objects [EuroSys ‘16]
  void transfer(TArray<int>& bal, TBox<int>& num, int src, int dst, int amt) {
    TRANSACTION {                    // begin transaction
      bal[src] = bal[src] - amt;     // read operation, then write operation
      bal[dst] = bal[dst] + amt;
      num = num + 1;
    } RETRY(true);                   // try to commit
  }

STO uses OCC (or TL2) to execute transactions
- Reads performed without locks and writes buffered locally
- Locks are acquired on all objects in write set
- Reads are certified: check values read are still valid (abort otherwise)
- Install writes and release locks
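The commit steps listed above can be sketched in a few lines of C++. This is a minimal, single-threaded illustration and not the STO implementation: `Cell` and `OccTx` are hypothetical names, versions stand in for whatever validation metadata STO really keeps, and the unlocked reads of `value` and `version` would need atomics in real concurrent code.

```cpp
#include <cassert>   // assert (used by the usage checks)
#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical shared object: a value plus a version bumped on every commit.
struct Cell {
    int value = 0;
    std::uint64_t version = 0;
    std::mutex lock;
};

// Hypothetical OCC transaction over Cells (an assumption, not the STO API).
class OccTx {
public:
    // Read without taking a lock; remember the version that was observed.
    int read(Cell& c) {
        auto w = writes_.find(&c);
        if (w != writes_.end()) return w->second;  // read our own buffered write
        reads_.emplace(&c, c.version);             // keeps the first version seen
        return c.value;
    }

    // Buffer the write locally; shared state is untouched until commit.
    void write(Cell& c, int v) { writes_[&c] = v; }

    // Returns true if the transaction committed, false if it must abort.
    bool commit() {
        // 1. Lock the write set. std::map iterates in key (address) order,
        //    so every committer locks in the same order.
        for (auto& w : writes_) w.first->lock.lock();

        // 2. Certify reads: every version must be unchanged since the read.
        bool ok = true;
        for (auto& r : reads_) {
            if (r.first->version != r.second) { ok = false; break; }
        }

        // 3. Install the buffered writes only if certification succeeded.
        if (ok) {
            for (auto& w : writes_) {
                w.first->value = w.second;
                ++w.first->version;
            }
        }

        // 4. Release locks and reset the transaction.
        for (auto& w : writes_) w.first->lock.unlock();
        reads_.clear();
        writes_.clear();
        return ok;
    }

private:
    std::map<Cell*, std::uint64_t> reads_;  // read set: cell -> version seen
    std::map<Cell*, int> writes_;           // write set: cell -> buffered value
};
```

A transaction whose read was overwritten by a concurrent commit fails certification and leaves shared state untouched; the caller would then retry, which is the role `RETRY(true)` plays in the slide's code.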

STO inherits the performance profile of OCC
[Figure: throughput vs. workload contention (probability of conflicts). At low contention: no aborts + low overhead = life is good (we want to be here!). At high contention: many aborts = wasted work.]

Transaction chopping [Shasha et al., TODS ‘95]
Assume Tx 1 and Tx 2 conflict.
[Figure: timelines of Tx 1 (R; W; W; R; R; R; W; R; W;) and Tx 2 (W; R; W; R; R; R; W; R; W;), shown before and after static analysis]

Runtime Pipelining [Xie et al., SOSP ‘15]
Static analysis + runtime checks → finer chopping → more concurrency under contention
[Figure: timelines of Tx 1 (R; W; W; R; R; R; W; R; W;) and Tx 2 (W; R; W; R; R; R; W; R; W;)]

Runtime Pipelining is good at high contention
[Figure: throughput vs. workload contention (probability of conflicts) for Runtime Pipelining (RP) and OCC]

Problem: hard to port RP to STO
Transactions are defined at runtime and may have unknown read/write sets

  void credit(TArray<int>& bal, int clients, int amt) {
    TRANSACTION {
      // The write set depends on the runtime value of clients.
      for (int i = 0; i < clients; i++) {
        bal[i] = bal[i] + amt;
      }
    } RETRY(true);
  }

Strawman
- Give each object a unique rank (for example its memory address)
- Acquire locks in increasing order of rank
- Release locks when all operations on a rank are done

When program order agrees with rank order, this works:

  TRANSACTION {
    bal[src] = bal[src] - amt; // rank 1
    bal[dst] = bal[dst] + amt; // rank 2
  }

Locks currently held: none at first, then 1 (while updating bal[src]), then 2 (while updating bal[dst]).
[Figure: pipelined timelines of Tx 1: R; W; R; W; and Tx 2: R; W; R; W;]

But what if program order, control flow, or data dependencies disagree with ranks?

  TRANSACTION {
    bal[dst] = bal[dst] + amt; // rank 2
    bal[src] = bal[src] - amt; // rank 1
  }

Locks currently held: 2 (while updating bal[dst]), then none once rank 2 is done.
- Cannot lock 1 (not in rank order!)
- Cannot abort either! The write to bal[dst] is an in-place update to the shared state (not a local update as in OCC), and we have already released its lock... oops.
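The failure mode of the strawman can be checked mechanically. The sketch below is only an illustration under a stated assumption: a lock is released as soon as the transaction moves on to a higher rank, so any later access to a lower rank cannot be served. The function name `first_rank_violation` is hypothetical and does not come from the talk.

```cpp
#include <cassert>   // assert (used by the usage checks)
#include <cstddef>
#include <limits>
#include <vector>

// Given the ranks of a transaction's accesses in program order, return the
// index of the first access the strawman cannot serve, or -1 if there is none.
// Assumption: moving to a higher rank releases every lower-ranked lock.
long first_rank_violation(const std::vector<int>& access_ranks) {
    int highest = std::numeric_limits<int>::min();  // highest rank locked so far
    for (std::size_t i = 0; i < access_ranks.size(); ++i) {
        if (access_ranks[i] < highest) {
            // This access needs a lock below one that was already acquired
            // (and possibly released): out of rank order.
            return static_cast<long>(i);
        }
        highest = access_ranks[i];
    }
    return -1;
}
```

With the two transfers above, the access sequence `{1, 1, 2, 2}` (read and write bal[src], then bal[dst]) passes, while `{2, 2, 1, 1}` fails at index 2, the first access to rank 1.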

Deferred Runtime Pipelining (DRP)
During execution:
- Asynchronous read operations + buffered write logic
After call to commit:
- Acquire locks in rank order
- Enqueue write logic onto objects in a pipeline

Similar to lazy evaluation
Similar to OCC (but for logic instead of values)
Similar to RP (but enqueues logic instead of performing operations)

Example of DRP (actual syntax is less verbose)
  TRANSACTION {
    auto bal_dst = bal.async_read(dst);   // rank 2
    bal.enqueue(dst, { bal_dst + amt });  // rank 2
    auto bal_src = bal.async_read(src);   // rank 1
    bal.enqueue(src, { bal_src - amt });  // rank 1
  }

During execution, no locks are held and the shared global state is untouched. The write logic accumulates in thread-local state:
  bal[dst]: ({bal_dst + amt})
  bal[src]: ({bal_src - amt})

At the call to commit, we know the write set! Locks are then acquired in rank order:
- Lock 1 held: append bal[src]: ({bal_src - amt}) to the shared global state
- Lock 2 held: append bal[dst]: ({bal_dst + amt}) to the shared global state
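The walk-through above can be approximated with a small C++ sketch. It is a simplification made for illustration, not the DRP implementation: `RankedCell` and `DeferredTx` are hypothetical names, the write logic may only depend on the current value of the object it updates, and the logic is applied immediately under the lock rather than appended to a per-object pipeline.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical shared object with a fixed, unique rank.
struct RankedCell {
    RankedCell(int r, int v) : rank(r), value(v) {}
    int rank;
    int value;
    std::mutex lock;
};

// Hypothetical deferred transaction: buffers write logic, applies it at commit.
class DeferredTx {
public:
    using Logic = std::function<int(int)>;  // maps current value to new value

    // During execution: buffer the logic in thread-local state only.
    void enqueue(RankedCell& c, Logic logic) {
        assert(logic && "write logic must be callable");
        pending_[c.rank].emplace_back(&c, std::move(logic));
    }

    // At commit: the write set is known, so visit ranks in increasing order
    // no matter in which order the program enqueued the writes.
    // Returns the ranks in the order their locks were taken.
    std::vector<int> commit() {
        std::vector<int> lock_order;
        for (auto& entry : pending_) {          // std::map iterates by rank
            for (auto& op : entry.second) {
                std::lock_guard<std::mutex> guard(op.first->lock);
                op.first->value = op.second(op.first->value);
            }
            lock_order.push_back(entry.first);
        }
        pending_.clear();
        return lock_order;
    }

private:
    std::map<int, std::vector<std::pair<RankedCell*, Logic>>> pending_;
};
```

Enqueuing the rank 2 update before the rank 1 update, as in the slide, still locks rank 1 first at commit. That ordering is what lets the strawman's rank mismatch disappear without asking the programmer to reorder the transaction body.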

Wait a second… This requires the developer to rewrite transactions...
And some txs can’t be expressed with async reads + deferred writes

DRP also supports legacy transactions
- Legacy transactions execute with an OCC-ish protocol
- New protocol allows async + legacy transactions to coexist
  - Cool and very efficient mechanism. See paper.
- DRP automatically detects if a transaction is legacy or not!
  - Read set of legacy tx = not empty
  - Read set of async transaction = empty (only promises)

Evaluation questions (more in the paper)
- How does DRP perform on standard benchmarks?
- How does DRP perform as contention varies?

How does DRP perform on standard benchmarks?
- Silo [SOSP ‘13] multicore database ported to STO and DRP
  - TPC-C benchmark: new order, payment, delivery, stock level, order status
- STAMP: suite of multicore applications
  - See paper or ask me about it

[Figure: TPC-C workload results; legend residue: STO, (transaction chopping)]

TPC-C with varying contention (32 threads)
[Figure: throughput as workload contention increases; legend residue: STO, (transaction chopping)]
- TPC-C has 5 transaction types. We only make 3 of them async. The other 2 types abort a lot.
- The amount of data decreases to the right, so there is less to keep in memory, which is why IC3 performance increases.

Summary
- DRP expands the workloads that can benefit from using STO
- DRP works transparently with async and legacy transactions
- DRP guarantees opacity and avoids deadlock and aborts

STAMP Results (32 threads)
STO uses workload-specific optimizations called predicates (DRP does not implement these)

Impact of async txs at high contention
[Figure: x-axis is the percentage of async transactions]

Actual syntax

  void transfer(TArray<int>& bal, TBox<int>& num, int src, int dst, int amt) {
    TRANSACTION {
      // the next two lines explicitly use deferred interfaces to buffer intentions
      Intention<int>* bal_src = bal.defer_at(src);
      bal.defer_update(src, new Intention<int>([&](int& val){ return bal_src->result - amt; }, {bal_src}));
      // the next three lines use syntactic sugar based on C++'s operator
      // overloading and implicit type conversion to achieve the same effects
      auto bal_dst = bal[dst];
      bal[dst] = bal_dst + amt;
      num += 1;
    } RETRY(true);
  }

Lots more in the paper
- DRP guarantees opacity and deadlock freedom
- DRP can be implemented incrementally
- Details on how to implement DRP into STO with low overhead

Avoiding rank mismatch
If we know read/write set
- Acquire all locks ahead of time (predeclaration locking)
- If you ever need to acquire a lock out of rank order, acquire all prior locks

Ask programmer to predefine accesses

  ACCESS([&bal[dst], 2], [&bal[src], 2], [&num, 1]);
  TRANSACTION {
    bal[dst] = bal[dst] + amt; // rank 2
    bal[src] = bal[src] - amt; // rank 1
    num = num + 1;             // rank 3
  }

This is a nightmare: error-prone, hard to reason about, etc.
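Predeclaration locking, the first option above, can be sketched as follows. The names `Declared` and `run_predeclared` are hypothetical, ranks are assumed to be unique, and this is not the `ACCESS` macro from the slide. The sketch only shows that once every access is declared up front, all locks can be taken in rank order before the body runs, so the body's program order no longer matters.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <mutex>
#include <vector>

// Hypothetical declaration of one access: the object's rank and its lock.
struct Declared {
    int rank;
    std::mutex* lock;
};

// Acquire every declared lock in increasing rank order, run the body, then
// release in reverse order. Returns the ranks in acquisition order.
// Note: this sketch does not release the locks if body throws.
std::vector<int> run_predeclared(std::vector<Declared> accesses,
                                 const std::function<void()>& body) {
    std::sort(accesses.begin(), accesses.end(),
              [](const Declared& a, const Declared& b) { return a.rank < b.rank; });

    std::vector<int> order;
    for (auto& d : accesses) {
        assert(d.lock != nullptr);
        d.lock->lock();
        order.push_back(d.rank);
    }

    body();  // may touch the declared objects in any program order

    for (auto it = accesses.rbegin(); it != accesses.rend(); ++it) {
        it->lock->unlock();
    }
    return order;
}
```

The cost is visible in the call site: the caller must list every object correctly before the transaction starts, which is the burden the slide calls a nightmare and which is not possible at all when the write set depends on runtime values.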
