Paul E. McKenney, IBM Linux Technology Center

Why The Grass May Not Be Greener On The Other Side: A Comparison of Locking vs. Transactional Memory. Paul E. McKenney, IBM Linux Technology Center; Maged M. Michael, IBM TJ Watson Research; Jonathan Walpole, Portland State University. Presented by Vidhya Priyadharshnee Palaniswamy Gnanam.

Outline: Concurrency Control Techniques Review; Objective; Locking Critique; TM Critique; Where do Locking and TM fit in?; Conclusion; Recent Work; Future Work.

Concurrency Control Techniques Review

Multicore Computing. With the speed of individual cores no longer increasing at the rate it used to, we have turned to larger numbers of CPU cores to speed up our ever more complicated applications. To use these extra cores, programs must be parallelized. Synchronizing access to shared data is critical to the correctness of these programs.

Lock Based Synchronization. The “traditional” pessimistic synchronization approach. Simple: partition the shared data and protect each partition with a separate lock. Locks prevent concurrent access and enable sequential reasoning about critical-section code. Reader-writer locking: allows multiple readers to gain access concurrently; improves scalability if used correctly. Priority inversion: a high-priority thread or process cannot proceed while a low-priority thread or process holds a lock that it needs. Convoying: all other threads must wait when the thread holding a lock is descheduled, for example by a time-slice interrupt or a page fault. Deadlock: each of two tasks waits for a lock that the other task holds. Composability: the property that enables arbitrary atomic operations to be composed into larger atomic operations.

Lock Based Synchronization: Downsides. Lock-based synchronization opens a whole new can of worms, though. High contention on non-partitionable data structures. Coarse-grained locking limits concurrency: lock contention rises and the program scales poorly. Fine-grained locking is hard to design, and lock acquisition overhead affects performance. Locking introduces dependencies among threads: the failure of one thread propagates to others, which affects the fault tolerance of the system.

Non Blocking Synchronization. Lock-free, “optimistic” synchronization. Execute the critical section unconstrained, then check at the end whether you were the only one. If so, continue; if not, roll back and retry. Optimistic synchronization keeps threads independent and, depending on the implementation, provides different levels of progress guarantee: wait freedom, lock freedom, or obstruction freedom. It avoids common problems of conventional techniques, such as priority inversion, convoying, and deadlock.
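The "execute, check, retry" pattern described on this slide can be sketched with a C11 compare-and-swap loop. This is a minimal illustration, not code from the paper: the counter `hits` and the function `add_hits` are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical shared counter, used only for illustration.
 * Static storage means it starts at zero. */
static atomic_long hits;

/* Optimistically compute a new value, then try to publish it with CAS.
 * Returns the number of retries that were needed. */
static int add_hits(long delta)
{
    int retries = 0;
    long seen = atomic_load(&hits);

    /* If another thread changed 'hits' in the meantime, the CAS fails and
     * reloads 'seen' with the current value, so the next attempt starts
     * from fresh data. The weak form may also fail spuriously, which
     * simply costs one more trip around the loop. */
    while (!atomic_compare_exchange_weak(&hits, &seen, seen + delta))
        retries++;
    return retries;
}
```

No thread ever blocks here: a failed attempt means some other thread's update succeeded, which is the lock-free guarantee. Under heavy contention the retry count grows, which is the cost the next slide discusses.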

Non Blocking Synchronization: Downsides. The programming logic is difficult. Heavy use of atomic operations such as compare-and-swap (CAS), which combine verification with finalization (the update is applied only if verification passes). The impact of contention can be quite severe: more retries cause heavy bus and cache contention, which slows down even the threads that are making progress. May not perform as well as a lock-based approach in a non-preemptible kernel.

Objective. Each technique has both green and dry areas. The goal of the paper is to spot the green and dry areas of lock-based synchronization and of transactional memory (a non-blocking synchronization technique), and to critique both constructively in order to understand where each technique fits.

Locking Critique

Locking Strengths. A simple and elegant idea: allow only one CPU at a time to access a given piece of data. Provides disjoint-access parallelism, though it takes design effort. Requires no specialized hardware support, so it can be used on existing commodity hardware. Supported on many platforms, because it is widely used and well-defined, standardized locking APIs such as the POSIX pthread API exist. Much legacy code uses locking, and many programmers have experience with it. Contention effects are concentrated within the locking primitives, allowing critical sections to run at full speed.

Locking Strengths (continued). Some CPUs have special instructions that reduce the power-consumption impact of waiting on locks, so waiting on a lock degrades performance only minimally. Good for protecting non-idempotent operations such as I/O, thread creation, memory remapping, and system reboot. Interacts naturally with other synchronization mechanisms, including reference counting, atomic operations, non-blocking synchronization, and RCU. Interacts naturally with debuggers.

Locking: Problems & Improvements. Problem: lock contention. Some data structures, such as unstructured graphs and trees, are difficult to partition. We may have to settle for coarse-grained locking, which leads to high contention and reduced scalability. Solution: redesign algorithms to use partitionable data structures, for example by replacing trees and graphs with hash tables and radix trees. The problem remains for non-partitionable data structures!

Locking: Problems & Improvements. Problem: lock overhead. Lock granularity determines scalability: can we partition the shared data as finely as possible and protect each partition with a separate lock? Locking uses expensive instructions and creates high synchronization overhead. Locking also introduces communication-related cache misses into read-mostly workloads that would otherwise run entirely within the CPU cache. Solution: lock overhead cannot be eliminated, but it can often be avoided. In read-mostly situations, locked updates may be paired with read-copy-update (RCU) or hazard pointers, which reduces lock overhead in the common case and increases read-side performance and scalability. The problem remains for update-heavy workloads!

Locking: Problems & Improvements. Performance vs. scalability: we need the right lock granularity!

Locking: Problems & Improvements. Problem: deadlock. Multiple threads acquire the same set of locks in different orders. Self-deadlock: an interrupt occurs while a thread holds a lock, and the interrupt handler also needs that lock. Solutions: Require a clear locking hierarchy, so that multiple locks are always acquired in a pre-specified order. If a lock is not free, the thread surrenders its conflicting locks and retries. Detect deadlock and break the cycle by terminating selected threads, chosen by priority or by work done. Track lock acquisitions to detect potential deadlock dynamically and prevent it before it occurs. To avoid self-deadlock, disable interrupts on entering the critical section, or avoid acquiring locks in interrupt handlers.
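One of the remedies listed on this slide, surrendering held locks and retrying when a further lock is busy, can be sketched with POSIX mutexes as follows. The `account` type and the `lock_both` and `transfer` helpers are hypothetical, introduced only for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical account type, used only for illustration. */
struct account {
    pthread_mutex_t lock;
    long balance;
};

/* Acquire both locks without risking deadlock: if the second lock is
 * busy, release the first one and start over. */
static void lock_both(struct account *a, struct account *b)
{
    for (;;) {
        pthread_mutex_lock(&a->lock);
        if (pthread_mutex_trylock(&b->lock) == 0)
            return;                      /* both locks are now held */
        pthread_mutex_unlock(&a->lock);  /* surrender and retry */
        sched_yield();                   /* give the other holder a chance */
    }
}

/* Move 'amount' between two distinct accounts while holding both locks. */
static void transfer(struct account *from, struct account *to, long amount)
{
    assert(from != to);
    lock_both(from, to);
    from->balance -= amount;
    to->balance += amount;
    pthread_mutex_unlock(&to->lock);
    pthread_mutex_unlock(&from->lock);
}
```

Two threads calling `transfer(&x, &y, ...)` and `transfer(&y, &x, ...)` at the same time cannot deadlock, because neither ever waits while holding a lock. The trade-off is that they can, in principle, keep backing off from each other (livelock), which is one reason a fixed locking hierarchy is usually preferred where one exists.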

Locking: Problems & Improvements. Problem: priority inversion. Priority inversion can cause a high-priority thread to miss its real-time scheduling deadline, which is unacceptable in safety-critical systems. Solutions: Priority inheritance: a low-priority thread holding the lock temporarily inherits the priority of the blocked high-priority thread, so that no medium-priority thread can preempt it. Priority ceiling: the lock holder is assigned the priority of the highest-priority task that might acquire that lock. Alternatively, preemption is disabled entirely while locks are held.

Locking: Problems & Improvements. Problem: convoying. Preemption or blocking of the lock holder (due to I/O, a page fault, etc.) can block other threads. This effectively lengthens the critical section, makes lock acquisition latency non-deterministic, and may lead to starvation of large critical sections. It is a problem for real-time workloads. Solutions: Use scheduler-conscious synchronization to keep the scheduler from preempting a thread that holds a lock. Use RCU for read-side critical sections to avoid non-deterministic lock acquisition latency on the read side. To avoid starvation, use FCFS lock acquisition primitives that limit the number of threads, e.g. semaphores.

Locking: Problems & Improvements. Problem: lack of composability and modularity. Composing atomic operations into larger atomic operations is difficult. Self-deadlock results if the inner critical section tries to acquire the same lock that the outer critical section already holds. Solution: find out which locks other modules use before calling or composing them. Abstraction is lost!

Locking: Problems & Improvements. Problem: indefinite blocking, caused by termination of the lock holder. This creates problems for fault-tolerant software. Solutions: Abort and restart the entire application (simple and reliable). Identify the terminated lock holder and clean up its state (extremely complex). Either way, the fault tolerance of the software still suffers!

Transactional Memory Critique

Composability. With locking, operations may be thread-safe individually but not when composed. Consider popping from one stack and pushing onto another.

T1:
    struct foo *pop(struct foo_stack *src)
    {
        struct foo *q;

        lock(src);
        q = src->head;          /* assumes the stack keeps a head pointer */
        if (q != NULL)
            src->head = q->next;
        unlock(src);
        return q;
    }

T2:
    void push(struct foo_stack *dst, struct foo *q)
    {
        lock(dst);
        q->next = dst->head;
        dst->head = q;
        unlock(dst);
    }

Between the pop and the push, the intermediate state (the item is in neither stack) is visible to other threads!
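For comparison, a lock-based version that hides the intermediate state has to hold both stacks' locks at once. The sketch below is one way to do that with POSIX mutexes; the `head` and `val` fields and the address-based lock order are assumptions made for this example, not details from the slides.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts, used only for illustration. */
struct foo {
    struct foo *next;
    int val;
};

struct foo_stack {
    pthread_mutex_t lock;
    struct foo *head;
};

/* Atomically move the top item of 'src' onto 'dst'.
 * Returns the moved item, or NULL if 'src' was empty. */
static struct foo *pop_push(struct foo_stack *src, struct foo_stack *dst)
{
    struct foo_stack *first = src, *second = dst;
    struct foo *q;

    assert(src != dst);
    /* Always lock the lower address first, so that two callers moving
     * items in opposite directions cannot deadlock. */
    if ((uintptr_t)first > (uintptr_t)second) {
        first = dst;
        second = src;
    }
    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);

    q = src->head;
    if (q != NULL) {
        src->head = q->next;
        q->next = dst->head;
        dst->head = q;
    }

    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
    return q;
}
```

The composed operation is correct, but only because it reaches inside both stacks and knows their locking rules. It cannot be built from the existing `pop` and `push` alone, which is the loss of abstraction the slides refer to.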

TM Approach.

    struct foo *pop_push(struct foo_stack *src, struct foo_stack *dst)
    {
        struct foo *q;

        begin_txn;
        q = src->head;          /* assumes the stack keeps a head pointer */
        if (q != NULL) {
            src->head = q->next;
            q->next = dst->head;
            dst->head = q;
        }
        end_txn;
        return q;
    }

Let the TM system take care of the rest!

Transactional Memory. A solution, adopted from the database world, to the problem of consistency in the face of concurrency: transactions. Simple, composable, scalable. Atomic blocks == transactions. Atomicity: all-or-nothing execution of a transaction (tx). Isolation: partial results are invisible to other transactions and threads.

Transactional Memory. TM is a non-blocking synchronization mechanism: at least one thread will succeed. It can be constructed to be either optimistic or pessimistic. Optimistic: speculate on concurrency without waiting for permission (acquire no locks on reads or writes). Performs well when critical regions rarely interfere with each other. Pessimistic: “always ask for permission”, acquiring locks on reads and writes (blocking), as in databases. Good when conflicts are frequent.

HW Transactional Memory. New instructions (LT, LTX, ST, Abort, Commit, Validate). A fully associative transactional cache for buffering updates. Piggybacks on the multiprocessor cache-coherence protocol to detect transaction conflicts.

SW Transactional Memory. Obstruction-free: introduce a level of indirection. Log the modifications to memory locations in descriptors. Based on the transaction's outcome, commit by atomically writing the new values to the memory locations, or abort by reverting to the old values. Non-obstruction-free: Revocable two-phase locking for writes: a transaction locks all objects that it writes and does not release these locks until it terminates. If deadlock occurs, one transaction aborts, releasing its locks and reverting its writes. Optimistic concurrency control for reads: whenever a transaction reads from an object, it logs the version it read. When the transaction commits, it verifies that these are still the current versions of the objects.
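The optimistic read path described above (log the version at read time, re-check it at commit time) can be sketched as follows. This is a deliberately single-threaded illustration: a real STM also needs write locks, atomic accesses and memory ordering, all of which are omitted. The names `vobj`, `read_log`, `txn_read`, `committed_write` and `txn_validate` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_READS 8

/* Hypothetical versioned object: every committed write bumps 'version'. */
struct vobj {
    unsigned version;
    int value;
};

/* Per-transaction log of the objects read and the versions seen. */
struct read_log {
    const struct vobj *obj[MAX_READS];
    unsigned version[MAX_READS];
    size_t n;
};

/* Transactional read: remember which version was observed. */
static int txn_read(struct read_log *log, const struct vobj *o)
{
    assert(log->n < MAX_READS);
    log->obj[log->n] = o;
    log->version[log->n] = o->version;
    log->n++;
    return o->value;
}

/* Stand-in for another transaction committing a write to 'o'. */
static void committed_write(struct vobj *o, int value)
{
    o->value = value;
    o->version++;
}

/* Commit-time check: every object read must still be at the logged
 * version; otherwise the transaction has to abort and retry. */
static bool txn_validate(const struct read_log *log)
{
    for (size_t i = 0; i < log->n; i++)
        if (log->obj[i]->version != log->version[i])
            return false;
    return true;
}
```

A transaction that reads two objects validates successfully if nothing changed, and fails validation as soon as either object has been rewritten in between. The logging and the commit-time scan are part of the consistency-validation cost that the later performance slides attribute to STM.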

TM Strengths. Non-blocking: the system as a whole makes progress. Familiar to many users from database systems; LL/SC can be seen as a trivial hardware implementation. Scalable: allows multiple non-interfering threads to execute concurrently in a critical section. Automatic disjoint-access parallelism: achieved without having to design a complex fine-grained locking solution. Modular and composable: transactions may be nested or composed.

TM Strengths (continued). Deadlock-free: avoids common pitfalls of lock composition, such as deadlock. Fault tolerance: the failure of one transaction does not affect others. Non-partitionable data structures: TM can be used with difficult-to-partition data structures such as unstructured graphs.

TM Problems & Improvements. Problem: portability of hardware TM. HTM needs special hardware. Transaction size is limited by the transactional cache; newer implementations address cache overflow through virtualization. Solution: use HTM for small transactions and fall back to STM otherwise, with language support. For this to be transparent to the application, the semantics of HTM and STM must be identical.

TM Problems & Improvements. Problem: performance of software TM. Poor performance compared to locking, even at low levels of contention, due to: atomic operations for acquiring shared object handles; the cost of consistency validation; the effect of shared object metadata on the cache; and dynamic allocation, data copying, and memory reclamation. Solution: STM performance can be improved by relaxing the non-blocking property, which eliminates the overheads of indirection, dynamic allocation, data copying, and memory reclamation. But this reintroduces many of the problems of locking!

TM Problems & Improvements. Problem: non-idempotent operations such as I/O. A transaction cannot perform any operation that cannot be undone, such as I/O, memory remapping, or thread creation and destruction. Such an operation cannot simply be repeated when the transaction retries; for I/O, that would send the request multiple times. Common solution: postpone I/O until the outcome of the transaction is known, which avoids I/O retries. Problematic scenario: the I/O waits for the commit, and the commit waits for the I/O to complete. Self-deadlock!
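The "postpone I/O until the outcome is known" approach can be sketched as a per-transaction output buffer that is discarded on abort and flushed on commit. The `txn_out` type and its three helpers are hypothetical names for this illustration and do not come from any particular TM system.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-transaction output buffer. */
struct txn_out {
    char buf[128];
    size_t len;
};

/* Inside the transaction: queue the output instead of sending it.
 * Returns 0 on success, -1 if the buffer would overflow. */
static int txn_write(struct txn_out *t, const char *msg)
{
    size_t n = strlen(msg);

    if (t->len + n > sizeof t->buf)
        return -1;
    memcpy(t->buf + t->len, msg, n);
    t->len += n;
    return 0;
}

/* On abort: drop everything queued, so a retry cannot send it twice. */
static void txn_abort(struct txn_out *t)
{
    t->len = 0;
}

/* On commit: the outcome is known, so perform the real I/O once.
 * Returns the number of bytes handed to 'out'. */
static size_t txn_commit(struct txn_out *t, FILE *out)
{
    size_t sent = fwrite(t->buf, 1, t->len, out);

    t->len = 0;
    return sent;
}
```

An aborted attempt leaves no trace on the output, and the retried attempt sends the request exactly once. The sketch also makes the slide's problematic scenario visible: if the transaction needed to read a reply to that request before committing, the request would still be sitting in the buffer, and the transaction would wait forever.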

TM Problems & Improvements. Solutions for non-idempotent operations (I/O): Buffered I/O might be handled by including the buffering mechanism within the scope of the transactions doing I/O, but this cannot handle the scenario shown. Both sender and receiver could be enclosed in one transaction, but transactions are currently limited to a single system. Transactions performing non-idempotent operations can be executed in “inevitable” mode, in which they are guaranteed to commit; this avoids the irreversibility problem of I/O and similar operations. But it does not scale, because at most one transaction can be inevitable at a time.

TM Problems & Improvements. Problem: contention management. When transactions collide, only one can proceed; the others must be rolled back. Consequences: starvation of large transactions by smaller ones, and delay of a high-priority thread when its transactions are rolled back due to conflicts with those of a lower-priority thread. Solutions: Communication between the scheduler and the transaction contention manager. Carefully select the transactions to roll back, based on priority, amount of work done, etc. Convert read-only transactions to non-transactional form, in a manner similar to the pairing of locking with RCU; writers must then have the primitives needed to support non-transactional readers. E.g. “A Relativistic Enhancement to Software Transactional Memory” by Philip Howard and Jonathan Walpole. HTM: lack of support for large transactions in commodity hardware; portability issues in software relying on HTM. STM: poor contention-free performance compared to locking, due to atomic operations, consistency validation, indirection, dynamic allocation, data copying, memory reclamation, bookkeeping and over-instrumentation, false conflicts, the cost of privatization safety, and poor amortization.
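A contention manager that selects the transaction to roll back "based on priority, amount of work done" could follow a policy like the one sketched below. The `txn_info` fields, the convention that a larger number means higher priority, and the tie-breaking rule are all assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical per-transaction bookkeeping kept by a contention manager.
 * Assumption: a larger 'priority' value means a more important thread. */
struct txn_info {
    int priority;
    unsigned long work_done;   /* e.g. number of accesses performed so far */
};

/* Given two conflicting transactions, return the one to roll back:
 * the lower-priority one, or on a tie the one that has done less work. */
static const struct txn_info *pick_victim(const struct txn_info *a,
                                          const struct txn_info *b)
{
    if (a->priority != b->priority)
        return a->priority < b->priority ? a : b;
    return a->work_done <= b->work_done ? a : b;
}
```

Rolling back the lower-priority transaction protects high-priority threads from the delay mentioned on the slide, and preferring to keep the transaction with more completed work reduces the chance that a large transaction is starved by small ones.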

TM Problems: Privatization. An optimization technique that allows some data to be accessed non-transactionally. Need: improve performance by temporarily exempting objects from the overhead of transactional access, trading strong isolation for performance. Problem: privatization can break isolation guarantees, causing inconsistent concurrent access (performance vs. correctness). Weak isolation (weak atomicity): transactions are isolated from each other. Strong isolation (strong atomicity): in addition, transactions are isolated from non-transactional operations.

TM Problems: Privatization. Certain STM optimizations can allow concurrent access to privatized data! T1 intends to insert A1; T2 intends to privatize the list. Sequence over time: T1 reads A. T1 locks A. T2 locks head. T2 commits by setting local = head, head = null, then unlocks. T2 has privatized the list and operates on it. T1 commits by setting A1->next = B, A->next = A1, then unlocks.

TM Problems & Improvements. Problem: ratio of data-operation to control-operation overhead. DBMS: data operations usually include reads and writes to mass storage, so transaction overhead is comparatively negligible. TM: data operations almost always consist only of reads and writes to memory, so transaction overhead looms large. Solution: use TM for heavyweight operations, such as groups of system calls. Problem: debuggability. Transactions are difficult to debug, because breakpoints cause unconditional aborts. Solution: debug using STM, which requires a high degree of compatibility between STM and HTM.

TM Problems & Improvements. Other problems: Interaction with other systems is important, but in practice it is complicated and expensive. Conflict-prone variables: unavoidable data structures that appear in every critical section cause excessive conflicts. Performance overhead from conflict resolution and excessive restarts under high conflict rates.

Where do Locking and TM fit in?
Scenario | Best Technique | Why?
Partitionable data structures | Locking | Disjoint-access parallelism
Large non-partitionable data structures | TM | Automatic disjoint-access parallelism
Read-mostly situations | Locking/TM with hazard pointers/RCU | Readers scalable
Update-heavy situations | | Writers scalable
Complex fine-grained locking design, no clear lock hierarchy exists | | Deadlock avoidance
Atomic operations spanning multiple independent data structures, e.g. pop from one stack and push to another | | Composability
Single-threaded software having an embarrassingly parallel core containing only idempotent operations | | Performance benefits without much programming effort
Non-idempotent operations | | Supportability of non-idempotent operations
Large critical sections | | Lock acquisition cost small compared to retry
Commodity hardware | | Commodity HW suffices. HTM requires specialized H/W and depends on cache geometry details. Else performance limited by STM

Conclusion: Use the Right Tool For The Job! There is no silver bullet: successful adoption of multithreaded, multi-core CPUs will require a combination of techniques. Analogy with engineering: how many types of fasteners are there? How many subtypes? Nail, screw, clip, bolt, glue, joint, magnet... Neither locking nor TM solves the fundamental performance and scalability problems. Combine the strengths of various synchronization mechanisms according to need. Integrate with other techniques: “use the right tool for the job”. TM's applicability may increase if STM performance improves. Formalize and generalize existing techniques such as RCU.

Recent Work. cx_spinlocks: a new hybrid TM and locking primitive. “TxLinux: Using and Managing Hardware Transactional Memory in an Operating System” by Christopher J. Rossbach, Owen S. Hofmann, Donald E. Porter, Hany E. Ramadan, Aditya Bhandari, and Emmett Witchel. “Inevitable transactions”: special transactions that contain non-idempotent operations (I/O). Such transactions unconditionally abort any conflicting transactions, so non-idempotence is OK. Allowing more than one concurrent inevitable transaction is necessary to achieve reasonable I/O performance, but its feasibility is an open question.

Recent Work. Gluing together relativistic programming and transactional memory to gain scalability for both readers and writers: “A Relativistic Enhancement to Software Transactional Memory” by Philip Howard and Jonathan Walpole.

Future Work. Expand the comparison to include other synchronization mechanisms (message passing, deferred reclamation, RCU). Investigate combining different mechanisms: TM and locking (much work in this area); RCU and locking (the typical use of RCU); TM and RCU (very little work done here). There might still be hope for a “silver bullet”. But until then, it would be quite foolish to ignore combinations of existing mechanisms.

References
Lecture slides from Winter 2008, by the authors.
“Parallel Programming with Transactional Memory” by Ulrich Drepper, Red Hat.
“Software Transactional Memory: Why Is It Only a Research Toy?” by Calin Cascaval, Colin Blundell, Maged Michael, Harold W. Cain, Peng Wu, Stefanie Chiras, and Siddhartha Chatterjee.
“Privatization Techniques for Software Transactional Memory” by Michael F. Spear, Virendra J. Marathe, Luke Dalessandro, and Michael L. Scott.
“Inevitability Mechanisms for Software Transactional Memory” by Michael F. Spear, Maged M. Michael, and Michael L. Scott.
http://en.wikipedia.org/wiki/Software_transactional_memory

Thank you!