1 Why The Grass May Not Be Greener On The Other Side: A Comparison of Locking vs. Transactional Memory
Paul E. McKenney, IBM Linux Technology Center Maged M. Michael, IBM TJ Watson Research Jonathan Walpole, Portland State University Presented by Vidhya Priyadharshnee Palaniswamy Gnanam

2 Outline
Concurrency Control Techniques Review
Objective
Locking Critique
TM Critique
Where do Locking and TM fit in?
Conclusion
Recent Work
Future Work

3 Concurrency Control Techniques Review

4 Multicore Computing
With the speed of individual cores no longer increasing at the rate it once did, we now rely on growing numbers of CPU cores to speed up our ever-more-complicated applications. To use these extra cores, programs must be parallelized, and synchronization of shared-data accesses is critical to the correctness of these programs.

5 Lock Based Synchronization
“Traditional” pessimistic synchronization approach.
Simple: partition the shared data and protect each partition with a separate lock (a per-partition-lock sketch follows below).
Locks prevent concurrent access and enable sequential reasoning about critical-section code.
Reader-writer locking: allows multiple readers to gain access concurrently; improves scalability if used correctly.
Priority inversion: high-priority threads/processes cannot proceed while a low-priority thread/process holds the common lock.
Convoying: all other threads must wait if a thread holding a lock is descheduled due to a time-slice interrupt or page fault.
Deadlock: the situation in which each of two tasks is waiting for a lock that the other task holds.
Composability: the property that enables arbitrary atomic operations to be composed into larger atomic operations.
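As a concrete illustration of the partition-and-lock idea, here is a minimal sketch (not code from the paper; the hash-table layout, NBUCKETS, and function names are illustrative assumptions) of one pthread mutex per hash bucket:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64

struct node {
        struct node *next;
        char key[32];
        int value;
};

struct bucket {
        pthread_mutex_t lock;   /* one lock per partition (bucket) */
        struct node *head;
};

static struct bucket table[NBUCKETS];

void table_init(void)
{
        for (int i = 0; i < NBUCKETS; i++)
                pthread_mutex_init(&table[i].lock, NULL);
}

static unsigned hash(const char *key)
{
        unsigned h = 5381;

        while (*key)
                h = h * 33 + (unsigned char)*key++;
        return h % NBUCKETS;
}

/* Threads inserting into different buckets take different locks and never
 * contend: disjoint-access parallelism obtained by partitioning the data. */
void table_insert(const char *key, int value)
{
        struct bucket *b = &table[hash(key)];
        struct node *n = malloc(sizeof(*n));

        if (n == NULL)
                return;
        strncpy(n->key, key, sizeof(n->key) - 1);
        n->key[sizeof(n->key) - 1] = '\0';
        n->value = value;

        pthread_mutex_lock(&b->lock);
        n->next = b->head;
        b->head = n;
        pthread_mutex_unlock(&b->lock);
}

The same pattern also shows why non-partitionable structures are hard: when an operation may touch any bucket, per-partition locks stop helping.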

6 Lock Based Synchronization: Downsides
Lock-based synchronization opens a whole new can of worms, though:
High contention on non-partitionable data structures.
Coarse-grained locking limits concurrency, causes lock contention, and scales poorly.
Fine-grained locking is hard to get right.
Lock-acquisition overhead affects performance.
Locks introduce dependencies among threads, so the failure of one thread can propagate to others, hurting the fault tolerance of the system.

7 Non Blocking Synchronization
Lock-free, “optimistic” synchronization: execute the critical section unconstrained, then check at the end whether you were the only one. If so, continue; if not, roll back and retry (see the sketch below).
Optimistic synchronization keeps threads independent, giving different progress/fault-tolerance guarantees (lock freedom, wait freedom, obstruction freedom) depending on the implementation.
Avoids common problems of conventional techniques such as priority inversion, convoying, and deadlock.
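To make the check-and-retry pattern concrete, here is a minimal sketch (not from the paper; the lf_stack/lf_push names are illustrative) of a lock-free stack push using C11 atomics:

#include <stdatomic.h>

struct node {
        struct node *next;
        int value;
};

struct lf_stack {
        _Atomic(struct node *) top;
};

/* Optimistically prepare the update, then use compare-and-swap to verify
 * that no other thread changed the top in the meantime; retry on failure. */
void lf_push(struct lf_stack *s, struct node *n)
{
        struct node *old = atomic_load(&s->top);

        do {
                n->next = old;  /* old is refreshed by a failed CAS */
        } while (!atomic_compare_exchange_weak(&s->top, &old, n));
}

The matching pop is considerably trickier (ABA hazards, safe memory reclamation), which is part of the difficult programming logic noted on the next slide.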

8 Non Blocking Synchronization: Downsides
Difficult programming logic: heavy use of atomic operations such as CAS to combine verification with finalization (if the check passes).
The impact of contention can be quite severe: a large number of retries causes heavy bus and cache contention, which slows down the threads that are making progress.
May not perform as well as a lock-based approach in a non-preemptible kernel.

9 Objective
Each technique has both green and dry areas.
The goal of the paper is to:
Spot the green and dry areas of lock-based synchronization and transactional memory (a form of NBS).
Constructively criticize them to understand where each technique fits.

10 Locking Critique

11 Locking Strengths
Simple and elegant idea: allow only one CPU at a time to access a given piece of data.
Provides disjoint-access parallelism, though with more effort.
Does not require any specialized hardware support; can be used on existing commodity hardware.
Supported on multiple platforms: locking is widely used, and well-defined standardized locking APIs such as the POSIX pthread API exist.
Much legacy code uses locking, and many programmers are experienced with it.
Contention effects are concentrated within the locking primitives, allowing critical sections to run at full speed.

12 Locking Strengths
Some CPUs have special instructions that reduce the power consumed while waiting on a lock, so the performance degradation of lock waiting can be minimized.
Good for protecting non-idempotent operations such as I/O, thread creation, memory remapping, and system rebooting.
Interacts naturally with other synchronization mechanisms, including reference counting, atomic operations, non-blocking synchronization, and RCU.
Interacts in a natural manner with debuggers.

13 Locking: Problems & Improvements
Problem: Lock Contention
Some data structures, such as unstructured graphs and trees, are difficult to partition; one may have to settle for coarse-grained locking, leading to high contention and reduced scalability.
Solution
Redesign algorithms to use partitionable data structures, for example replacing trees and graphs with hash tables and radix trees.
The problem remains for non-partitionable data structures!

14 Locking: Problems & Improvements
Problem: Lock Overhead
Lock granularity determines scalability: can we partition the shared data as finely as possible and protect each partition with a separate lock?
Locking uses expensive instructions and creates high synchronization overhead.
Locking introduces communication-related cache misses into read-mostly workloads that would otherwise run entirely within the CPU cache.
Solution
While lock overhead cannot be completely eliminated, it can be avoided in the common case. In read-mostly situations, locked updates may be paired with read-copy update (RCU) or hazard pointers, reducing lock overhead, increasing read-side performance and scalability (see the sketch below).
The problem remains for update-heavy workloads!
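A Linux-kernel-style sketch of this pairing, assuming a kernel-like environment (the config structure and function names are illustrative): readers use RCU and take no lock, while updates still serialize on a spinlock and defer freeing until pre-existing readers finish.

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct config {
        int threshold;
        int limit;
};

static struct config __rcu *cur_config;  /* RCU-protected pointer, assumed initialized */
static DEFINE_SPINLOCK(config_lock);     /* serializes updaters only */

/* Read side: no lock acquisition, no writes to shared cache lines. */
int read_threshold(void)
{
        int t;

        rcu_read_lock();
        t = rcu_dereference(cur_config)->threshold;
        rcu_read_unlock();
        return t;
}

/* Update side: still lock-based, but publishes a new copy and waits for
 * all pre-existing readers before freeing the old one. */
void update_threshold(int new_threshold)
{
        struct config *newc = kmalloc(sizeof(*newc), GFP_KERNEL);
        struct config *oldc;

        if (!newc)
                return;
        spin_lock(&config_lock);
        oldc = rcu_dereference_protected(cur_config,
                                         lockdep_is_held(&config_lock));
        *newc = *oldc;                   /* copy */
        newc->threshold = new_threshold; /* update */
        rcu_assign_pointer(cur_config, newc);
        spin_unlock(&config_lock);

        synchronize_rcu();               /* wait for pre-existing readers */
        kfree(oldc);
}

The reads involve no atomic instructions and no lock-word cache misses; lock overhead remains only on the (rare) update path.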

15 Locking: Problems & Improvements
Performance vs. scalability: the right lock granularity is needed!

16 Locking: Problems & Improvements
Problem: Deadlock
Multiple threads acquire the same set of locks in different orders.
Self-deadlock: an interrupt occurs while a thread holds a lock and the interrupt handler also needs that lock.
Solution
Require a clear locking hierarchy: multiple locks are acquired in a pre-specified order (see the sketch below).
If a lock is not free, the thread surrenders its conflicting locks and retries.
Detect deadlock and break the cycle by terminating selected threads based on priority or the amount of work done.
Track lock acquisitions, dynamically detect potential deadlocks, and prevent them before they occur.
To avoid self-deadlock, disable interrupts on entering the critical section, or avoid lock acquisition in interrupt handlers.
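A minimal sketch of the locking-hierarchy rule (the account/transfer names are illustrative, not from the paper): any thread that needs both locks acquires them in the same global order, here by address, so no cycle of waiters can form.

#include <pthread.h>
#include <stdint.h>

struct account {
        pthread_mutex_t lock;
        long balance;
};

/* Acquire the two per-account locks in a fixed global order (lowest address
 * first), assuming from != to.  Every thread follows the same order, so two
 * concurrent transfers can never deadlock against each other. */
void transfer(struct account *from, struct account *to, long amount)
{
        struct account *first  = (uintptr_t)from < (uintptr_t)to ? from : to;
        struct account *second = (uintptr_t)from < (uintptr_t)to ? to : from;

        pthread_mutex_lock(&first->lock);
        pthread_mutex_lock(&second->lock);

        from->balance -= amount;
        to->balance   += amount;

        pthread_mutex_unlock(&second->lock);
        pthread_mutex_unlock(&first->lock);
}

The surrender-and-retry strategy in the list above can be built similarly with pthread_mutex_trylock().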

17 Locking: Problems & Improvements
Problem: Priority Inversion
Priority inversion can cause a high-priority thread to miss its real-time scheduling deadline, which is unacceptable in safety-critical systems.
Solution
The low-priority thread holding the lock temporarily inherits the priority of the blocked high-priority thread, so that no medium-priority thread can preempt it (priority inheritance).
The lock holder is assigned the priority of the highest-priority task that might acquire that lock (priority ceiling).
Preemption is disabled entirely while locks are held.
(A POSIX sketch of the first two options follows below.)
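POSIX exposes these mutex protocols directly; a minimal sketch (error handling elided, the rt_lock name is illustrative):

#include <pthread.h>

static pthread_mutex_t rt_lock;

/* PTHREAD_PRIO_INHERIT: the holder temporarily inherits the priority of the
 * highest-priority thread blocked on the mutex.  PTHREAD_PRIO_PROTECT would
 * instead run the holder at a fixed priority ceiling. */
void rt_lock_init(void)
{
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        pthread_mutex_init(&rt_lock, &attr);
        pthread_mutexattr_destroy(&attr);
}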

18 Locking: Problems & Improvements
Problem: Convoying
Preemption or blocking (due to I/O, page faults, etc.) of the lock holder can block other threads, unrealistically increasing the effective critical-section length.
Non-deterministic lock-acquisition latency; may lead to starvation of large critical sections. A problem for real-time workloads.
Solution
Use scheduler-conscious synchronization so that the scheduler avoids preempting a thread that holds a lock.
Use RCU for read-side critical sections to avoid non-deterministic lock-acquisition latency on the read side.
To avoid starvation, use FCFS lock-acquisition primitives with a limit on the number of threads, e.g., semaphores.

19 Locking: Problems & Improvements
Problem: Lack of Composability and Modularity
Enabling atomic operations to be composed into larger atomic operations is difficult.
Composition leads to self-deadlock if the inner critical section tries to acquire the same lock that the outer critical section is holding.
Solution
One must know which locks other modules use before calling or composing them. Abstraction is lost!

20 Locking: Problems & Improvements
Problem: Indefinite Blocking
Due to termination of the lock holder; creates problems for fault-tolerant software.
Solution
Abort and restart the entire application: simple and reliable.
Identify the terminated lock holder and clean up its state: extremely complex.
The fault tolerance of the software is still affected!

21 Transactional memory Critique

22 Composability
In locking, operations may be thread-safe individually yet not composable. Consider popping an item from one stack and pushing it onto another:

struct foo *pop(struct foo_stack *src)
{
        struct foo *q;

        lock(&src->lock);
        q = src->top;
        if (q != NULL)
                src->top = q->next;
        unlock(&src->lock);
        return q;
}

void push(struct foo_stack *dst, struct foo *q)
{
        lock(&dst->lock);
        q->next = dst->top;
        dst->top = q;
        unlock(&dst->lock);
}

Each function is thread-safe on its own, but when one thread calls pop() and then push(), the intermediate state (the item is on neither stack) is visible to other threads!

23 TM Approach

struct foo *pop_push(struct foo_stack *src, struct foo_stack *dst)
{
        struct foo *q;

        begin_txn();
        q = src->top;
        if (q != NULL) {
                src->top = q->next;
                q->next = dst->top;
                dst->top = q;
        }
        end_txn();
        return q;
}

Let the TM system take care of the rest!
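For comparison, the same operation written with GCC's experimental language-level TM support; a hedged sketch, assuming GCC with -fgnu-tm (the libitm runtime) and the illustrative struct definitions shown:

struct foo {
        struct foo *next;
};

struct foo_stack {
        struct foo *top;
};

struct foo *pop_push(struct foo_stack *src, struct foo_stack *dst)
{
        struct foo *q;

        __transaction_atomic {          /* the whole move commits atomically */
                q = src->top;
                if (q != NULL) {
                        src->top = q->next;
                        q->next = dst->top;
                        dst->top = q;
                }
        }
        return q;
}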

24 Transactional Memory
A solution to the problem of consistency in the face of concurrency, adopted from the database world: transactions.
Simple, composable, scalable.
Atomic blocks == transactions.
Atomicity: all-or-nothing execution of a transaction.
Isolation: partial results are invisible to other transactions/threads.

25 Transactional Memory
TM is a non-blocking synchronization mechanism: at least one thread will succeed.
It can be constructed either way:
Optimistic: speculate on concurrency without waiting for permission (acquire no locks on reads/writes). Performs well when critical regions rarely interfere with each other.
Pessimistic: "always ask for permission", acquiring locks on reads/writes (blocking), as used in databases. Good when conflicts are frequent.

26 HW Transactional Memory
New instructions (LT, LTX, ST, Abort, Commit, Validate).
A fully associative transactional cache for buffering updates.
Piggybacks on the multiprocessor cache-coherence protocol to detect transaction conflicts.
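Those instructions come from Herlihy and Moss's original HTM proposal; on more recent commodity hardware the same begin/commit/abort structure appears as, for example, Intel TSX/RTM. A hedged sketch (assumes an x86 CPU with RTM, compiled with -mrtm; the counter and fallback lock are illustrative), including the customary lock fallback:

#include <immintrin.h>
#include <stdatomic.h>

static atomic_int fallback_lock;        /* 0 = free, 1 = held */
static long counter;

static void locked_increment(void)
{
        while (atomic_exchange(&fallback_lock, 1))
                ;                       /* spin until the fallback lock is free */
        counter++;
        atomic_store(&fallback_lock, 0);
}

void htm_increment(void)
{
        unsigned status = _xbegin();    /* start a hardware transaction */

        if (status == _XBEGIN_STARTED) {
                /* Subscribe to the fallback lock: if anyone holds it now, or
                 * takes it later, the conflict aborts this transaction. */
                if (atomic_load(&fallback_lock))
                        _xabort(0xff);
                counter++;              /* speculative update, buffered in cache */
                _xend();                /* commit; conflicts detected via coherence */
        } else {
                locked_increment();     /* abort path: fall back to the lock */
        }
}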

27 SW Transactional Memory
Obstruction-free
Introduce a level of indirection: log modifications to memory locations in descriptors. Based on the transaction outcome, commit by writing the new values to the memory locations atomically, or abort by reverting to the old values.
Non-obstruction-free
Revocable two-phase locking for writes: a transaction locks all objects that it writes and does not release these locks until the transaction terminates. If deadlock occurs, one transaction aborts, releasing its locks and reverting its writes.
Optimistic concurrency control for reads: whenever a transaction reads from an object, it logs the version it read. When the transaction commits, it verifies that these are still the current versions of the objects (a read-validation sketch follows below).
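A minimal sketch of the read-validation idea in the optimistic-read scheme above (all names are illustrative; the write side, locking, and retry loop are omitted): each object carries a version, reads log the version seen, and commit re-checks the log.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct tm_object {
        atomic_uint version;            /* bumped by every committed write */
        int payload;
};

struct read_entry {
        struct tm_object *obj;
        unsigned seen_version;
};

#define READ_SET_MAX 64

struct read_set {
        struct read_entry entries[READ_SET_MAX];
        size_t n;
};

/* Transactional read: return the value and log which version was observed. */
static int txn_read(struct read_set *rs, struct tm_object *obj)
{
        unsigned v = atomic_load(&obj->version);
        int value = obj->payload;

        rs->entries[rs->n].obj = obj;
        rs->entries[rs->n].seen_version = v;
        rs->n++;
        return value;
}

/* Commit-time validation: every object read must still be at the version
 * that was observed; otherwise the transaction must abort and retry. */
static bool txn_validate(const struct read_set *rs)
{
        for (size_t i = 0; i < rs->n; i++)
                if (atomic_load(&rs->entries[i].obj->version) !=
                    rs->entries[i].seen_version)
                        return false;
        return true;
}

The bookkeeping visible here (per-read logging plus a commit-time scan) is exactly the overhead criticized on the STM-performance slide later.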

28 TM Strengths
Non-blocking: the system as a whole makes progress.
Familiar: many users know transactions from database systems, and a trivial hardware implementation already exists in the form of LL/SC.
Scalable: allows multiple non-interfering threads to execute concurrently in a critical section.
Automatic disjoint-access parallelism: achieved without having to design a complex fine-grained locking solution.
Modular and composable: transactions may be nested or composed.

29 TM Strengths
Deadlock-free: avoids common pitfalls of lock composition such as deadlock.
Fault tolerance: failure of one transaction does not affect others.
Non-partitionable data structures: can be used with difficult-to-partition data structures such as unstructured graphs.

30 TM Problems & Improvements
Problem: Portability of Hardware TM
Portability: HTM needs special hardware, and transaction size is limited by the transactional cache. Overflow of the transactional cache is addressed by virtualization in newer implementations.
Solution
Use HTM for small transactions, but fall back to STM otherwise, with language support. Transparency to the application requires the semantics of HTM and STM to be identical.

31 TM Problems & Improvements
Problem: Performance of Software TM
Poor performance compared to locking, even at low levels of contention, due to:
Atomic operations for acquiring shared-object handles.
The cost of consistency validation.
The cache footprint of shared-object metadata.
Dynamic allocation, data copying, and memory reclamation.
Solution
STM performance can be improved by eliminating the overheads of indirection, dynamic allocation, data copying, and memory reclamation by relaxing the non-blocking property.
But this reintroduces many of the problems of locking!

32 TM Problems & Improvements
Problem: Non-Idempotent Operations (I/O)
A transaction cannot perform any operation that cannot be undone, such as I/O, memory remapping, or thread creation and destruction: such an operation must not be performed multiple times on transaction retry, since that would, for example, lead to multiple send requests.
Common solution
Postpone the I/O until the outcome of the transaction is known, avoiding I/O retries.
Problematic scenario
The I/O waits until commit, and the commit waits for I/O completion: self-deadlock!

33 TM Problems & Improvements
Solutions: Non-Idempotent Operations (I/O)
Buffered I/O might be handled by including the buffering mechanism within the scope of the transaction doing the I/O (see the sketch below), but this cannot handle the scenario just shown.
Both sender and receiver could be enclosed in one transaction, but transactions are currently limited to a single system.
Transactions performing non-idempotent operations can be executed in “inevitable” mode, in which the transaction is guaranteed to commit, avoiding the irreversibility problem of I/O and the like. But this does not scale, as at most one transaction can be inevitable at a time.
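A hedged sketch of the buffered-I/O idea, reusing the slides' hypothetical begin_txn()/end_txn() primitives (the buffer handling and counter are illustrative assumptions): the message is composed inside the transaction but written only after a successful commit, so a retry never duplicates the output.

#include <stdio.h>
#include <unistd.h>

static long items_moved;                /* shared, transactionally updated */

void move_item_and_log(int log_fd)
{
        char log_buf[64];               /* transaction-local buffer */
        size_t log_len;

        begin_txn();                    /* slides' hypothetical TM primitive */
        items_moved++;
        /* Record the message but do not perform the I/O; if the transaction
         * aborts and retries, the buffer is simply overwritten. */
        log_len = (size_t)snprintf(log_buf, sizeof(log_buf),
                                   "moved item %ld\n", items_moved);
        end_txn();                      /* slides' hypothetical TM primitive */

        /* I/O happens only after commit, so the non-idempotent write runs once. */
        write(log_fd, log_buf, log_len);
}

As the slide notes, this helps only when nothing inside the transaction must wait for the I/O to complete.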

34 TM Problems & Improvements
Problem: Contention Management
When transactions collide, only one can proceed; the others must be rolled back. This can starve large transactions in favor of smaller ones, and can delay a high-priority thread via rollback of its transactions due to conflicts with those of a lower-priority thread.
Solution
Communication between the scheduler and the transaction contention manager: carefully select the transactions to roll back based on priority, amount of work done, etc.
Convert read-only transactions to non-transactional form, in a manner similar to the pairing of locking with RCU; the writer must provide the primitives needed to support non-transactional readers (e.g., "A Relativistic Enhancement to Software Transactional Memory" by Philip Howard and Jonathan Walpole).
Summary of remaining problems
HTM: lack of support for large transactions in commodity hardware; portability issues for software relying on HTM.
STM: poor contention-free performance compared to locking, due to atomic operations, consistency validation, indirection, dynamic allocation, data copying, memory reclamation, bookkeeping and over-instrumentation, false conflicts, privatization-safety cost, and poor amortization.

35 TM Problems- Privatization
An optimization technique that allows some data to be accessed non-transactionally.
Need: improve performance by temporarily exempting objects from the overhead of transactional access, trading strong isolation for performance.
Problems: can break isolation guarantees, causing inconsistent concurrent access. Performance vs. correctness.
Weak isolation (weak atomicity): isolation of transactions from each other.
Strong isolation (strong atomicity): additionally, isolation of transactions from non-transactional operations.

36 TM Problems- Privatization
Certain STM optimizations can end up allowing concurrent access to privatized data!
Example timeline (T1 intends to insert A1 after A; T2 intends to privatize the list):
T1 reads A
T1 locks A
T2 locks the list head
T2 commits (local = head, head = NULL) and unlocks
T2, having privatized the list, operates on it non-transactionally
T1 commits (A1->next = B, A->next = A1) and unlocks, writing into the now-privatized list

37 TM Problems & Improvements
Problem: Ratio of data-operation to control-operation overheads
DBMS: a data operation usually includes reads/writes to a mass-storage device, so transaction overhead is comparatively negligible.
TM: data operations almost always consist only of reads/writes to memory, so transaction overhead looms large.
Solution: use TM for heavyweight operations, such as grouping system calls.
Problem: Debuggability
Transactions are difficult to debug: breakpoints cause unconditional aborts.
The debugging issue can be addressed by using STM, but a high degree of compatibility between STM and HTM is needed.

38 TM Problems & Improvements
Other problems:
Interaction with other systems is important; in practice it is complicated and expensive.
Conflict-prone variables: data structures that unavoidably appear in every critical section cause excessive conflicts.
Performance overhead due to conflict resolution and excessive restarts in the face of high conflict rates.

39 Where do Locking and TM fit in?
Scenario | Best Technique | Why?
Partitionable data structures | Locking | Disjoint-access parallelism
Large non-partitionable data structures | TM | Automatic disjoint-access parallelism
Read-mostly situations | Locking/TM with hazard pointers/RCU | Readers scalable
Update-heavy situations | Locking | Writers scalable
Complex fine-grained locking design, no clear lock hierarchy exists | TM | Deadlock avoidance
Atomic operations spanning multiple independent data structures (e.g., pop from one stack and push onto another) | TM | Composability
Single-threaded software with an embarrassingly parallel core containing only idempotent operations | TM | Performance benefits without much programming effort
Non-idempotent operations | Locking | Supports non-idempotent operations
Large critical sections | Locking | Lock-acquisition cost small compared to retry cost
Commodity hardware | Locking | Commodity hardware suffices; HTM requires specialized hardware and depends on cache-geometry details, and otherwise performance is limited by STM

40 Conclusion: Use the Right Tool For The Job!
There is no silver bullet: successful adoption of multithreaded/multi-core CPUs will require a combination of techniques.
Analogy with engineering: how many types of fasteners are there, and how many subtypes? Nail, screw, clip, bolt, glue, joint, magnet...
Neither locking nor TM solves the fundamental performance and scalability problems.
Combine the strengths of the various synchronization mechanisms according to need, and integrate them with other techniques: “use the right tool for the job”.
TM's applicability may increase if STM performance improves.
Formalize and generalize existing techniques such as RCU.

41 Recent Work
cxspinlocks: a new hybrid TM-and-locking primitive, from "TxLinux: Using and Managing Hardware Transactional Memory in an Operating System" by Christopher J. Rossbach, Owen S. Hofmann, Donald E. Porter, Hany E. Ramadan, Aditya Bhandari, and Emmett Witchel.
“Inevitable transactions”: special transactions containing non-idempotent operations (I/O). Such transactions unconditionally abort any conflicting transactions, so non-idempotence is acceptable. Allowing more than one concurrent inevitable transaction is necessary to achieve reasonable I/O performance, but its feasibility is an open question.

42 Recent Work
Glue together relativistic programming and transactional memory to gain scalability for both readers and writers: "A Relativistic Enhancement to Software Transactional Memory" by Philip Howard and Jonathan Walpole.

43 Future Work
Expand the comparison to include other synchronization mechanisms (message passing, deferred reclamation, RCU).
Investigate combining different mechanisms: TM and locking (much work in this area), RCU and locking (the typical use of RCU), TM and RCU (very little work done here).
There might still be hope for a “silver bullet”, but until then it would be quite foolish to ignore combinations of existing mechanisms.


45 References
Lecture slides from Winter 2008 by the authors.
"Parallel Programming with Transactional Memory" by Ulrich Drepper, Red Hat.
"Software Transactional Memory: Why Is It Only a Research Toy?" by Calin Cascaval, Colin Blundell, Maged Michael, Harold W. Cain, Peng Wu, Stefanie Chiras, and Siddhartha Chatterjee.
"Privatization Techniques for Software Transactional Memory" by Michael F. Spear, Virendra J. Marathe, Luke Dalessandro, and Michael L. Scott.
"Inevitability Mechanisms for Software Transactional Memory" by Michael F. Spear, Maged M. Michael, and Michael L. Scott.

46 Thank you!

