Deferred Runtime Pipelining for contentious multicore transactions


Deferred Runtime Pipelining for contentious multicore transactions Shuai Mu Sebastian Angel Dennis Shasha

Multicore programming today… dangerous atomics, confusing semaphores, too many locks.

Yet developers use transactions to interact with databases and distributed systems Simpler, less error-prone, efficient enough...

What keeps multicore devs in the dark ages? Multicore txs have (historically) very high overheads.

Three main points in this talk
STM is not crazy expensive anymore: recent proposals (STO [EuroSys ‘16], TDSL [PLDI ‘16]) use type information and are competitive with fine-grained locking for many workloads.
The DB community has for decades leveraged workload knowledge.
We extend STO to support more workloads efficiently: a new concurrency control protocol called DRP. DRP is inspired by work in DBs but avoids static analysis, so it can handle arbitrary transactions defined at runtime.

Three main points in this talk STM is not crazy expensive anymore DB community has for decades leveraged workload knowledge We extend STO to support many more workloads efficiently

Software Transactional Objects [EuroSys ‘16]

void transfer(TArray<int>& bal, TBox<int>& num, int src, int dst, int amt) {
  TRANSACTION {                  // begin transaction
    bal[src] = bal[src] - amt;   // read and write operations
    bal[dst] = bal[dst] + amt;
    num = num + 1;
  } RETRY(true);                 // try to commit
}

STO uses OCC (or TL2) to execute transactions:
Reads are performed without locks, and writes are buffered locally.
At commit, locks are acquired on all objects in the write set.
Reads are certified: check that the values read are still valid (abort otherwise).
Install the writes and release the locks.
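The commit protocol above can be sketched as follows. This is a simplified single-word-per-object illustration, not STO's actual implementation; `VersionedBox`, `occ_commit`, and the read/write-set representations are invented names for this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>

// A word of shared state guarded by a lock and a version number
// (hypothetical type; STO's real metadata is per-object and typed).
struct VersionedBox {
    std::mutex lock;
    uint64_t version = 0;
    int value = 0;
};

// Sketch of an OCC commit: lock the write set, validate the read set
// against the versions observed during execution, then install writes.
bool occ_commit(std::map<VersionedBox*, int>& write_set,
                std::map<VersionedBox*, uint64_t>& read_set) {
    // 1. Acquire locks on all objects in the write set. The map is keyed by
    //    address, so locks are taken in a global order, avoiding deadlock.
    for (auto& [box, val] : write_set) box->lock.lock();

    // 2. Certify reads: every object read must still carry the version
    //    we saw when we read it.
    bool valid = true;
    for (auto& [box, seen_version] : read_set)
        if (box->version != seen_version) { valid = false; break; }

    // 3. Install the buffered writes and bump versions, or abort.
    if (valid)
        for (auto& [box, new_value] : write_set) {
            box->value = new_value;
            box->version++;
        }
    for (auto& [box, val] : write_set) box->lock.unlock();
    return valid;
}
```

On abort (a failed certification), the buffered writes are simply discarded and the transaction retries, which is exactly the wasted work that grows with contention.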

STO inherits the performance profile of OCC. [Figure: throughput vs. workload contention (probability of conflicts). At low contention, no aborts + low overhead = life is good (we want to be here!); at high contention, many aborts = wasted work.]

Three main points in this talk STM is not crazy expensive anymore DB community has for decades leveraged workload knowledge We extend STO to support many more workloads efficiently

Transaction chopping [Shasha et al., TODS ‘95]
Static analysis splits each transaction into smaller pieces that can safely interleave. [Figure: Tx 1 (R; W; W; R; R; R; W; R; W;) and Tx 2 (W; R; W; R; R; R; W; R; W;), assumed to conflict, must run serially over time; after chopping, their pieces interleave.]

Runtime Pipelining [Xie et al., SOSP ‘15]
Static analysis + runtime checks → finer chopping → more concurrency under contention. [Figure: the chopped pieces of Tx 1 and Tx 2 pipelined over time.]

Runtime Pipelining is good at high contention. [Figure: throughput vs. workload contention (probability of conflicts); Runtime Pipelining (RP) degrades more gracefully than OCC.]

Problem: hard to port RP to STO
Transactions are defined at runtime and may have unknown read/write sets:

void credit(TArray<int>& bal, int clients, int amt) {
  TRANSACTION {
    for (int i = 0; i < clients; i++) {
      bal[i] = bal[i] + amt;
    }
  }
}

Three main points in this talk STM is not crazy expensive anymore DB community has for decades leveraged workload knowledge We extend STO to support many more workloads efficiently

Strawman
Give each object a unique rank (for example, its memory address).
Acquire locks in increasing order of rank; release locks when all operations on a rank are done.

TRANSACTION {
  bal[src] = bal[src] - amt; // rank 1
  bal[dst] = bal[dst] + amt; // rank 2
}

Here the transaction locks rank 1, writes bal[src], releases rank 1, then locks rank 2 and writes bal[dst]. But what if program order, control flow, or data dependencies disagree with ranks?

Now consider the same transaction with the accesses reordered:

TRANSACTION {
  bal[dst] = bal[dst] + amt; // rank 2
  bal[src] = bal[src] - amt; // rank 1
}

After writing bal[dst], the transaction cannot lock rank 1 (not in rank order!). It cannot abort either: the write to bal[dst] was an in-place update to shared state (not a local, buffered update as in OCC), and we have already released its lock… oops.
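When the full set of objects is known up front, the rank discipline is easy to enforce by sorting; a toy sketch, assuming addresses serve as ranks (the `Account`, `ranked_lock_all`, and `transfer` names are invented, and unlike the strawman this sketch holds all locks until the end rather than releasing each rank as it finishes):

```cpp
#include <algorithm>
#include <functional>
#include <mutex>
#include <vector>

// Toy account protected by its own lock; its address serves as its rank.
struct Account {
    std::mutex lock;
    int balance = 0;
};

// Acquire every needed lock in increasing rank (address) order. Because all
// transactions follow the same global order, no deadlock cycle can form.
void ranked_lock_all(std::vector<Account*>& objs) {
    std::sort(objs.begin(), objs.end(), std::less<Account*>{});
    for (Account* a : objs) a->lock.lock();
}

void ranked_unlock_all(const std::vector<Account*>& objs) {
    for (Account* a : objs) a->lock.unlock();
}

// A transfer that respects the rank discipline by locking both
// accounts up front, regardless of the order the body touches them.
void transfer(Account& src, Account& dst, int amt) {
    std::vector<Account*> objs{&src, &dst};
    ranked_lock_all(objs);
    src.balance -= amt;
    dst.balance += amt;
    ranked_unlock_all(objs);
}
```

The strawman's problem is precisely that it cannot sort like this: it discovers the objects one at a time, in program order.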

Deferred Runtime Pipelining (DRP)
During execution: asynchronously issue reads + buffer write logic.
After the call to commit: acquire locks in rank order and enqueue the write logic onto objects in a pipeline.
Similar to lazy evaluation. Similar to OCC (but buffers logic instead of values). Similar to RP (but enqueues logic instead of performing operations).
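The core idea of buffering write logic and replaying it in rank order can be sketched in a single-threaded toy. All names here (`Obj`, `DeferredTx`, `enqueue`) are invented, the sketch allows one deferred write per object, and real DRP pipelines the enqueued logic across concurrent transactions rather than applying it eagerly at commit:

```cpp
#include <functional>
#include <map>
#include <mutex>
#include <utility>

// A shared object with an explicitly assigned rank.
struct Obj {
    int rank;
    std::mutex lock;
    int value = 0;
};

// Thread-local transaction context: write logic is buffered as closures,
// keyed by rank, instead of being executed in place.
struct DeferredTx {
    std::map<int, std::pair<Obj*, std::function<int(int)>>> deferred;

    // Record "apply f to obj later"; nothing touches shared state yet.
    void enqueue(Obj* obj, std::function<int(int)> f) {
        deferred[obj->rank] = {obj, std::move(f)};
    }

    // Commit: std::map iterates in increasing key (rank) order, so locks
    // are acquired in rank order regardless of program order in the body.
    void commit() {
        for (auto& [rank, entry] : deferred) {
            auto& [obj, f] = entry;
            obj->lock.lock();
            obj->value = f(obj->value);
            obj->lock.unlock();
        }
        deferred.clear();
    }
};
```

Because nothing is installed during execution, a rank mismatch in program order is harmless: the ordering problem is resolved once, at commit, when the full write set is known.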

Example of DRP (actual syntax is less verbose)

TRANSACTION {
  auto bal_dst = bal.async_read(dst);   // rank 2
  bal.enqueue(dst, { bal_dst + amt });  // rank 2
  auto bal_src = bal.async_read(src);   // rank 1
  bal.enqueue(src, { bal_src - amt });  // rank 1
}

During execution, nothing touches shared state: the write logic is buffered in thread-local state as bal[dst]: ({bal_dst + amt}) and bal[src]: ({bal_src - amt}). At this point we know the write set!
At commit, locks are acquired in rank order and the buffered logic is appended to the shared global state: first lock rank 1 and append ({bal_src - amt}) to bal[src], then lock rank 2 and append ({bal_dst + amt}) to bal[dst].

Wait a second… This requires the developer to rewrite transactions... And some txs can’t be expressed with async reads + deferred writes

DRP also supports legacy transactions
Legacy transactions execute with an OCC-ish protocol.
A new protocol allows async and legacy transactions to coexist (a cool and very efficient mechanism; see the paper).
DRP automatically detects whether a transaction is legacy: the read set of a legacy tx is not empty, while the read set of an async transaction is empty (it contains only promises).
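The detection rule reduces to a trivial predicate over the transaction's tracked state; a sketch with an invented representation (`TxState`, `is_legacy` are not DRP's real names):

```cpp
#include <vector>

// Minimal view of a transaction's tracked state (invented representation).
struct TxState {
    std::vector<int> read_set;     // values actually read during execution
    std::vector<int> promise_set;  // async reads produce promises, not values
};

// A transaction that performed no direct reads (only async promises)
// can commit under DRP; otherwise it falls back to the OCC-ish legacy path.
bool is_legacy(const TxState& tx) {
    return !tx.read_set.empty();
}
```

The appeal of this rule is that it needs no annotations: the runtime already tracks the read set, so legacy code runs unchanged while rewritten async code automatically gets the DRP path.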

Evaluation questions (more in the paper) How does DRP perform on standard benchmarks? How does DRP perform as contention varies?

How does DRP perform on standard benchmarks?
Silo [SOSP ‘13]: a multicore database, ported to STO and DRP.
TPC-C benchmark: new order, payment, delivery, stock level, order status.
STAMP: a suite of multicore applications (see paper or ask me about it).

TPC-C workload. [Figure: throughput; DRP compared against STO and transaction chopping.]

Evaluation questions (more in the paper) How does DRP perform on standard benchmarks? How does DRP perform as contention varies?

TPC-C with varying contention (32 threads)
[Figure: throughput as workload contention increases to the right; DRP compared against STO and transaction chopping.]
TPC-C has 5 transaction types; we make only 3 of them async (the other 2 types abort a lot). The amount of state decreases to the right, so there is less to keep in memory, which is why IC3's performance increases.

Summary DRP expands the workloads that can benefit from using STO DRP works transparently with async and legacy transactions DRP guarantees opacity and avoids deadlock and aborts

STAMP Results (32 threads) STO uses workload-specific optimizations called predicates (DRP does not implement these)

Impact of async txs at high contention. [Figure: throughput vs. percentage of async transactions.]

Actual syntax

void transfer(TArray<int>& bal, TBox<int>& num, int src, int dst, int amt) {
  TRANSACTION {
    // the next two lines explicitly use deferred interfaces to buffer intentions
    Intention<int>* bal_src = bal.defer_at(src);
    bal.defer_update(src, new Intention<int>([&](int& val) {
      return bal_src->result - amt;
    }, {bal_src}));

    // the next three lines use syntactic sugar based on C++'s operator
    // overloading and implicit type conversion to achieve the same effects
    auto bal_dst = bal[dst];
    bal[dst] = bal_dst + amt;
    num += 1;
  }
}

Lots more in the paper
DRP guarantees opacity and deadlock freedom.
DRP can be implemented incrementally.
Details on how to integrate DRP into STO with low overhead.

Avoiding rank mismatch
If we know the read/write set: acquire all locks ahead of time (predeclaration locking). If you ever need to acquire a lock out of rank order, acquire all prior locks first.
Or ask the programmer to predeclare accesses:

ACCESS([&bal[dst], 2], [&bal[src], 1], [&num, 3]);
TRANSACTION {
  bal[dst] = bal[dst] + amt; // rank 2
  bal[src] = bal[src] - amt; // rank 1
  num = num + 1;             // rank 3
}

This is a nightmare: error-prone, hard to reason about, etc.