Spin Locks and Contention

Spin Locks and Contention
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Focus so far: Correctness and Progress
Models Accurate (we never lied to you) But idealized (so we forgot to mention a few things) Protocols Elegant Important But naïve So far we have focused on idealized models because we were interesting in understanding the foundations of the field. Understanding how to build a bakery lock isn’t all that useful if you really want to implement a lock directly, but you now know more about the underlying problems than you would if we had started looking at complicated practical techniques right away. Art of Multiprocessor Programming

New Focus: Performance
Models More complicated (not the same as complex!) Still focus on principles (not soon obsolete) Protocols Elegant (in their fashion) Important (why else would we pay attention) And realistic (your mileage may vary) No we are going to transition over to the “real” world and look at the kinds of protocols you might actually use. These models are necessarily more complicated, in the sense of having more details and exceptions and things to remember, but not necessarily more complex, in the sense of encompassing deeper ideas. We are still going to focus on issues that are important, that is, things that matter for all kinds of platforms and applications, and avoid other optimizations that might sometimes work in practice but that don’t teach us anyting. Art of Multiprocessor Programming

Kinds of Architectures
SISD (Uniprocessor) Single instruction stream Single data stream SIMD (Vector) Single instruction Multiple data MIMD (Multiprocessors) Multiple instruction Multiple data. You can classify processors by how they manage data and instruction streams. A single-instruction single-data stream architecture is just a uniprocessor. A few years ago, single-instruction multiple-data stream architectures were popular, for example the Connection Machine CM2. These have fallen out of favor, at least for the time being. Art of Multiprocessor Programming

Kinds of Architectures
SISD (Uniprocessor) Single instruction stream Single data stream SIMD (Vector) Single instruction Multiple data MIMD (Multiprocessors) Multiple instruction Multiple data. Our space Instead, most modern multiprocessors provide multiple instruction streams, meaning that processors execute independent sequences of instructions, and multiple data streams, meaning that processors issue independent sequences of memory reads and writes. Such architectures are usually called ``MIMD''. (1) Art of Multiprocessor Programming

Art of Multiprocessor Programming
MIMD Architectures memory Shared Bus Distributed There are two basic kinds of MIMD architectures. In a shared bus architecture, processors and memory are connected by a shared broadcast medium called a bus (like a tiny Ethernet). Both the processors and the memory controller can broadcast on the bus. Only one processor (or memory) can broadcast on the bus at a time, but all processors (and memory) can listen. Bus-based architectures are the most common today because they are easy to build. The principal things that affect performance are (1) contention for the memory: not all the processors can usually get at the same memory location at the same times, and if they try, they will have to queue up, (2) contention for the communication medium. If everyone wants to communicate at the same time, or to the same processor, then the processors will have to wait for one another. Finally, there is a growing communication latency, the time it takes for a processor to communicate with memory or with another processor. Memory Contention Communication Contention Communication Latency Art of Multiprocessor Programming

Today: Revisit Mutual Exclusion
Performance, not just correctness Proper use of multiprocessor architectures A collection of locking algorithms… When programming uniprocessors, one can generally ignore the exact structure and properties of the underlying system. Unfortunately, multiprocessor programming has yet to reach that state, and at present it is crucial that the programmer understand the properties and limitations of the underlying machine architecture. This will be our goal in this chapter. We revisit the familiar mutual exclusion problem, this time with the aim of devising mutual exclusion protocols that work well on today’s multiprocessors. Art of Multiprocessor Programming (1)

What Should you do if you can’t get a lock?
Keep trying “spin” or “busy-wait” Good if delays are short Give up the processor Good if delays are long Always good on uniprocessor Any mutual exclusion protocol poses the question: what do you do if you cannot acquire the lock? There are two alternatives. If you keep trying, the lock is called a spin lock, and repeatedly testing the lock is called spinning. or busy-waiting. We will use these terms interchangeably. The filter and bakery algorithms are spin-locks. Spinning is sensible when you expect the lock delay to be short. For obvious reasons, spinning makes sense only on multiprocessors. The alternative is to suspend yourself and ask the operating system's scheduler to schedule another thread on your processor, which is sometimes called blocking. Java’s built-in synchronization is blocking. Because switching from one thread to another is expensive, blocking makes sense only if you expect the lock delay to be long. Many operating systems mix both strategies, spinning for a short time and then blocking. Art of Multiprocessor Programming (1)

What Should you do if you can’t get a lock?
Keep trying “spin” or “busy-wait” Good if delays are short Give up the processor Good if delays are long Always good on uniprocessor Both spinning and blocking are important techniques. In this lecture, we turn our attention to locks that use spinning. our focus Art of Multiprocessor Programming

Basic Spin-Lock . CS With spin locks, synchronization usually looks like this. Some set of threads contend for the lock. One of them acquires it, and the others spin. The winner enters the critical section, does its job, and releases the lock on exit. spin lock critical section Resets lock upon exit Art of Multiprocessor Programming

Basic Spin-Lock …lock introduces sequential bottleneck . CS Contention occurs when multiple threads try to acquire a lock at the same time. High contention means there are many such threads, and low contention means the opposite. spin lock critical section Resets lock upon exit Art of Multiprocessor Programming

Basic Spin-Lock …lock suffers from contention . CS Contention occurs when multiple threads try to acquire a lock at the same time. High contention means there are many such threads, and low contention means the opposite. spin lock critical section Resets lock upon exit Art of Multiprocessor Programming

Basic Spin-Lock …lock suffers from contention . CS Our goal is to understand how contention works, and to develop a set of techniques that can avoid or alleviate it. These techniques provide a useful toolkit for all kinds of synchronization problems. spin lock critical section Resets lock upon exit Notice: these are distinct phenomena Art of Multiprocessor Programming

Basic Spin-Lock …lock suffers from contention . CS Our goal is to understand how contention works, and to develop a set of techniques that can avoid or alleviate it. These techniques provide a useful toolkit for all kinds of synchronization problems. spin lock critical section Resets lock upon exit Seq Bottleneck  no parallelism Art of Multiprocessor Programming

Basic Spin-Lock …lock suffers from contention . CS Our goal is to understand how contention happens, and to develop a set of techniques that can avoid or alleviate it. These techniques provide a useful toolkit for all kinds of synchronization problems. spin lock critical section Resets lock upon exit Contention  ??? Art of Multiprocessor Programming

Review: Test-and-Set Boolean value Test-and-set (TAS) Swap true with current value Return value tells if prior value was true or false Can reset just by writing false TAS aka “getAndSet” The test-and-set machine instruction atomically stores true in that word, and returns that word's previous value, swapping the value true for the word's current value. You can reset the word just by writing false to it. Note in Java TAS is called getAndSet. We will use the terms interchangeably. Art of Multiprocessor Programming

Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; } The test-and-set machine instruction, with consensus number two, was the principal synchronization instruction provided by many early multiprocessors. Art of Multiprocessor Programming (5)

java.util.concurrent.atomic
Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; } Here we are going to build one out of the atomic Boolean class, which is provided as part of Java’s standard library of atomic primitives. You can think of this object as a box holding a Boolean value. Package java.util.concurrent.atomic Art of Multiprocessor Programming

Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; } The getAndSet method swaps a Boolean value with the current contents of the box. Swap old and new values Art of Multiprocessor Programming

Review: Test-and-Set AtomicBoolean lock = new AtomicBoolean(false) … boolean prior = lock.getAndSet(true) At first glance, the AtomicBoolean class seems ideal for implementing a spin lock. Art of Multiprocessor Programming

Swapping in true is called “test-and-set” or TAS
Review: Test-and-Set AtomicBoolean lock = new AtomicBoolean(false) … boolean prior = lock.getAndSet(true) Swapping in true is called “test-and-set” or TAS If we call getAndSet(true), then we have a test-and-set. Art of Multiprocessor Programming (5)

Test-and-Set Locks Locking Lock is free: value is false Lock is taken: value is true Acquire lock by calling TAS If result is false, you win If result is true, you lose Release lock by writing false The lock is free when the word's value is false, and busy when it is true. Art of Multiprocessor Programming

Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }} Here is what it looks like in more detail. Art of Multiprocessor Programming

Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }} The lock is just an AtomicBoolean initialized to false. Lock state is AtomicBoolean Art of Multiprocessor Programming

Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }} The lock() method repeatedly applies test-and-set to the location until that instruction returns false (that is, until the lock is free). Keep trying until lock acquired Art of Multiprocessor Programming

Release lock by resetting state to false
Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }} Release lock by resetting state to false The unlock() method simply writes the value false to that word. Art of Multiprocessor Programming

Space Complexity TAS spin-lock has small “footprint” N thread spin-lock uses O(1) space As opposed to O(n) Peterson/Bakery How did we overcome the W(n) lower bound? We used a RMW operation… We call real world space complexity the “footprint”. By using TAS we are able to reduce the footprint from linear to constant. Art of Multiprocessor Programming

Performance Experiment n threads Increment shared counter 1 million times How long should it take? How long does it take? Let’s do an experiment on a real machine. Take n threads and have them collectively acquire a lock, increment a counter, and release the lock. Have them do it collectively, say, one million times. Before we look at any curves, let’s try to reason about how long it should take them. Art of Multiprocessor Programming

Graph time threads ideal no speedup because of sequential bottleneck
Ideally the curve should stay flat after that. Why? Because we have a sequential bottleneck so no matter How many threads we add running in parallel, we will not get any speedup (remember Amdahl’s law). threads Art of Multiprocessor Programming

Mystery #1 TAS lock Ideal time What is going on? The curve for TAS lock looks like this. In fact, if you do the experiment you have to give up because it takes so long beyond a certain number of processors? What is happening? Let us try this again. threads Art of Multiprocessor Programming

Test-and-Test-and-Set Locks
Lurking stage Wait until lock “looks” free Spin while read returns true (lock taken) Pouncing state As soon as lock “looks” available Read returns false (lock free) Call TAS to acquire lock If TAS loses, back to lurking Let’s try a slightly different approach. Instead of repeatedly trying to test-and-set the lock, let’s split the locking method into two phases. In the lurking phase, we wait until the lock looks like it’s free, and when it’s free, we pounce, attempting to acquire the lock by a call to test-and-set. If we win, we’re in, if we lose, we go back to lurking. Art of Multiprocessor Programming

Test-and-test-and-set Lock
class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; } Here is what the code looks like. Art of Multiprocessor Programming

class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; } First we spin on the value, repeatedly reading it until it looks like the lock is free. We don’t try to modify it, we just read it. Wait until lock looks free Art of Multiprocessor Programming

class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; } Then try to acquire it As soon as it looks like the lock is free, we call test-and-set to try to acquire it. If we are first and we succeed, the lock() method returns, and otherwise, if someone else got there before us, we go back to lurking, to repeatedly rereading the variable. Art of Multiprocessor Programming

Mystery #2 TAS lock TTAS lock Ideal time The difference is dramatic. The TTAS lock performs much better than the TAS lock, but still much worse than we expected from an ideal lock. threads Art of Multiprocessor Programming

Mystery Both TAS and TTAS Do the same thing (in our model) Except that TTAS performs much better than TAS Neither approaches ideal There are two mysteries here: why is the TTAS lock so good (that is, so much better than TAS), and why is it so bad (so much worse than ideal)? Art of Multiprocessor Programming

Opinion Our memory abstraction is broken TAS & TTAS methods Are provably the same (in our model) Except they aren’t (in field tests) Need a more detailed model … From everything we told you in the earlier section on mutual exclusion, you would expect the TAS and TTAS locks to be the same --- after all, they are logically equivalent programs. In fact, they are equivalent with respect to correctness (they both work), but very different with respect to performance. The problem here is that the shared memory abstraction is broken with respect to performance. If you don’t understand the underlying architecture, you will never understand why your reasonable-looking programs are so slow. It’s a little like not being able to drive a car unless you know how an automatic transmission works, but there it is. Art of Multiprocessor Programming

Bus-Based Architectures
cache cache cache We are going to do yet another review of multiprocessor architecture. Bus memory Art of Multiprocessor Programming

Random access memory (10s of cycles) cache cache cache The processors share a memory that has a high latency, say 50 to 100 cycles to read or write a value. This means that while you are waiting for the memory to respond, you can execute that many instructions. Bus memory Art of Multiprocessor Programming

Shared Bus Broadcast medium One broadcaster at a time Processors and memory all “snoop” cache cache cache Processors communicate with the memory and with one another over a shared bus. The bus is a broadcast medium, meaning that only one processor at a time can send a message, although everyone can passively listen. Bus memory Art of Multiprocessor Programming

Per-Processor Caches Small Fast: 1 or 2 cycles Address & state information Bus-Based Architectures cache cache cache Each processor has a cache, a small high-speed memory where the processor keeps data likely to be of interest. A cache access typically requires one or two machine cycles, while a memory access typically requires tens of machine cycles. Technology trends are making this contrast more extreme: although both processor cycle times and memory access times are becoming faster, the cycle times are improving faster than the memory access times, so cache performance is critical to the overall performance of a multiprocessor architecture. Bus memory Art of Multiprocessor Programming

Jargon Watch Cache hit “I found what I wanted in my cache” Good Thing™ If a processor finds data in its cache, then it doesn’t have to go all the way to memory. This is a very good thing, which we call a cache hit. Art of Multiprocessor Programming

Jargon Watch Cache hit “I found what I wanted in my cache” Good Thing™ Cache miss “I had to shlep all the way to memory for that data” Bad Thing™ If the processor doesn’t find what it wants in its cache, then we have a cache miss, which is very time-consuming. How well a synchronization protocol or concurrent algorithm performs is largely determined by its cache behavior: how many hits and misses. Art of Multiprocessor Programming

Cave Canem This model is still a simplification But not in any essential way Illustrates basic principles Will discuss complexities later Real cache coherence protocols can be very complex. For example, modern multiprocessors have multi-level caches, where each processor has an on-chip \textit{level-one} (L1) cache, and clusters of processors share a \textit{level-two} (L2) cache. The L2 cache is on-chip in some modern architectures, and off chip in others, a detail that greatly changes the observed performance. We are going to avoid going into too much detail here because the basic principles don't depend on that level of detail. Art of Multiprocessor Programming

Processor Issues Load Request
cache cache cache Here is one thing that can happen when a processor issues a load request. Bus memory data Art of Multiprocessor Programming

Gimme data cache cache cache It broadcasts a message asking for the data it needs. Notice that while it is broadcasting, no one else can use the bus. Bus Bus memory data Art of Multiprocessor Programming

Memory Responds memory cache cache cache data data Bus Bus
In this case, the memory responds to the request, also over the bus. Bus Bus Got your data right here memory data data Art of Multiprocessor Programming

Gimme data data cache cache Now suppose another processor issues a load request for the same data. Bus memory data Art of Multiprocessor Programming

Gimme data data cache cache It broadcasts its request over the bus. Bus Bus memory data Art of Multiprocessor Programming

I got data data cache cache This time, however, the request is picked up by the first processor, which has the data in its cache. Usually, when a processor has the data cached, it, rather than the memory, will respond to load requests. Bus Bus memory data Art of Multiprocessor Programming

Other Processor Responds
I got data data data cache cache The processor puts the data on the bus. Bus Bus memory data Art of Multiprocessor Programming

Other Processor Responds
data data cache cache Now both processors have the same data cached. Bus Bus memory data Art of Multiprocessor Programming

Modify Cached Data data data cache Now what happens if the red processor decides to modify the cached data? Bus memory data Art of Multiprocessor Programming (1)

Modify Cached Data data data data cache It changes the copy in its cache (from blue to white). Bus memory data Art of Multiprocessor Programming (1)

Modify Cached Data data data cache Bus memory data Art of Multiprocessor Programming

Modify Cached Data memory What’s up with the other copies? data data
Now we have a problem. The data cached at the red processor disagrees with the same copy of that memory location stored both at the other processors and in the memory itself. Bus What’s up with the other copies? memory data Art of Multiprocessor Programming

Cache Coherence We have lots of copies of data Original copy in memory Cached copies at processors Some processor modifies its own copy What do we do with the others? How to avoid confusion? The problem of keeping track of multiple copies of the same data is called the cache coherence problem, and ways to accomplish it are called cache coherence protocols. Art of Multiprocessor Programming

Write-Back Caches Accumulate changes in cache Write back when needed Need the cache for something else Another processor wants it On first modification Invalidate other entries Requires non-trivial protocol … A write-back coherence protocol sends out an invalidation message when the value is first modified, instructing the other processors to discard that value from their caches. Once the processor has invalidated the other cached values, it can make subsequent modifications without further bus traffic. A value that has been modified in the cache but not written back is called dirty. If the processor needs to use the cache for another value, however, it must remember to write back any dirty values. Art of Multiprocessor Programming

Write-Back Caches Cache entry has three states Invalid: contains raw seething bits Valid: I can read but I can’t write Dirty: Data has been modified Intercept other load requests Write back to memory before using cache Instead, we keep track of each cache’s state as follows. If the cache is invalid, then its contents are meaningless. If it is valid, then the processor can read the value, but does not have permission to write it because it may be cached elsewhere. If the value is dirty, then the processor has modified the value and it must be written back before that cache can be reused. Art of Multiprocessor Programming

Invalidate data data cache Let’s rewind back to the moment when the red processor updated its cached data. Bus memory data Art of Multiprocessor Programming

Invalidate Mine, all mine! data data cache It broadcasts an invalidation message warning the other processors to invalidate, or discard their cached copies of that data. Bus Bus memory data Art of Multiprocessor Programming

Invalidate Uh,oh cache data data cache When the other processors hear the invalidation message, they set their caches to the invalid state. Bus Bus memory data Art of Multiprocessor Programming

Invalidate memory Other caches lose read permission cache data cache
From this point on, the red processor can update that data value without causing any bus traffic, because it knows that it has the only cached copy. This is much more efficient than a write-through cache because it produces much less bus traffic. Bus memory data Art of Multiprocessor Programming

Invalidate memory Other caches lose read permission
data cache From this point on, the red processor can update that data value without causing any bus traffic, because it knows that it has the only cached copy. This is much more efficient than a write-through cache because it produces much less bus traffic. Bus This cache acquires write permission memory data Art of Multiprocessor Programming

Invalidate Memory provides data only if not present in any cache, so no need to change it now (expensive) cache data cache Finally, there is no need to update memory until the processor wants to use that cache space for something else. Any other processor that asks for the data will get it from the red processor. Bus memory data Art of Multiprocessor Programming

Another Processor Asks for Data
cache data cache If another processor wants the data, it asks for it over the bus. Bus Bus memory data Art of Multiprocessor Programming

Owner Responds Here it is! cache data data cache And the owner responds by sending the data over. Bus Bus memory data Art of Multiprocessor Programming

End of the Day … data data data cache Bus memory data Reading OK, no writing Art of Multiprocessor Programming

Mutual Exclusion What do we want to optimize? Bus bandwidth used by spinning threads Release/Acquire latency Acquire latency for idle lock Note that optimizing a spin lock is not a simple question, because we have to figure out exactly what we want to optimize: whether it’s the bus bandwidth used by spinning threads, the latency of lock acquisition or release, or whether we mostly care about uncontended locks. Art of Multiprocessor Programming

Simple TASLock TAS invalidates cache lines Spinners Miss in cache Go to bus Thread wants to release lock delayed behind spinners We now consider how the simple test-and-set algorithm performs using a bus-based write-back cache (the most common case in practice). Each test-and-set call goes over the bus, and since all of the waiting threads are continually using the bus, all threads, even those not waiting for the lock, must wait to use the bus for each memory access. Even worse, the test-and-set call invalidates all cached copies of the lock, so every spinning thread encounters a cache miss almost every time, and has to use the bus to fetch the new, but unchanged value. Adding insult to injury, when the thread holding the lock tries to release it, it may be delayed waiting to use the bus that is monopolized by the spinners. We now understand why the TAS lock performs so poorly. Art of Multiprocessor Programming

Test-and-test-and-set
Wait until lock “looks” free Spin on local cache No bus use while lock busy Problem: when lock is released Invalidation storm … Now consider the behavior of the TAS lock algorithm while the lock is held by a thread A. The first time thread B reads the lock it takes a cache miss, forcing B to block while the value is loaded into B's cache. As long as A holds the lock, B repeatedly rereads the value, but each time B hits in its cache (finding the desired value). B thus produces no bus traffic, and does not slow down other threads' memory accesses. Moreover, a thread that releases a lock is not delayed by threads spinning on that lock. Art of Multiprocessor Programming

Local Spinning while Lock is Busy
While the lock is held, all the contenders spin in their caches, rereading cached data without causing any bus traffic. Bus memory busy Art of Multiprocessor Programming

On Release invalid invalid free Things deteriorate, however, when the lock is released. The lock holder releases the lock by writing false to the lock variable… Bus memory free Art of Multiprocessor Programming

On Release memory Everyone misses, rereads miss miss free invalid
… which immediately invalidates the spinners' cached copies. Each one takes a cache miss, rereads the new value, Bus memory free Art of Multiprocessor Programming (1)

On Release Everyone tries TAS TAS(…) TAS(…) invalid invalid free and they all (more-or-less simultaneously) call test-and-set to acquire the lock. The first to succeed invalidates the others, who must then reread the value, causing a storm of bus traffic. Bus memory free Art of Multiprocessor Programming (1)

Problems Everyone misses Reads satisfied sequentially Everyone does TAS Invalidates others’ caches Eventually quiesces after lock acquired How long does this take? Eventually, the processors settle down once again to local spinning. This could explain why TATAS lock cannot get to the ideal lock. How long does this take? Art of Multiprocessor Programming

Measuring Quiescence Time
Acquire lock Pause without using bus Use bus heavily P1 P2 Pn If pause > quiescence time, critical section duration independent of number of threads This is the classical experiment conducted by Anderson. We decrease X until the bus intensive operations in Y interleave with the quiescing of the lock release opeartion, at which point we will see a drop in throughput or an increase in latency. If pause < quiescence time, critical section duration slower with more threads Art of Multiprocessor Programming

Quiescence Time Increses linearly with the number of processors for bus architecture time threads Art of Multiprocessor Programming

Mystery Explained time threads
TAS lock TTAS lock Ideal time So now we understand why the TTAS lock performs much better than the TAS lock, but still much worse than an ideal lock. Better than TAS but still not as good as ideal threads Art of Multiprocessor Programming

Solution: Introduce Delay
If the lock looks free But I fail to get it There must be contention Better to back off than to collide again Another approach is to introduce a delay. If the lock looks free but you can’t get it, then instead of trying again right away, it makes sense to back off a little and let things calm down. There are several places we could introduce a delay --- after the lock rl time spin lock r2d r1d d Art of Multiprocessor Programming

Dynamic Example: Exponential Backoff
time spin lock 4d 2d d If I fail to get lock wait random duration before retry Each subsequent failure doubles expected wait For how long should the thread back off before retrying? A good rule of thumb is that the larger the number of unsuccessful tries, the higher the likely contention, and the longer the thread should back off. Here is a simple approach. Whenever the thread sees the lock has become free but fails to acquire it, it backs off before retrying. To ensure that concurrent conflicting threads do not fall into ``lock-step'', all trying to acquire the lock at the same time, the thread backs off for a random duration. Each time the thread tries and fails to get the lock, it doubles the expected time it backs off, up to a fixed maximum. Art of Multiprocessor Programming

Exponential Backoff Lock
public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Here is the code for an exponential back-off lock. Art of Multiprocessor Programming

public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} The constant MIN_DELAY indicates the initial, shortest limit (it makes no sense for the thread to back off for too short a duration), Fix minimum delay Art of Multiprocessor Programming

public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} As in the TTAS algorithm, the thread spins testing the lock until the lock appears to be free. Wait until lock looks free Art of Multiprocessor Programming

public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Then the thread tries to acquire the lock. If we win, return Art of Multiprocessor Programming

public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Back off for random duration If it fails, then it computes a random delay between zero and the current limit. and then sleeps for that delay before retrying. Art of Multiprocessor Programming

public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Double max delay, within reason It doubles the limit for the next back-off, up to MAX_DELAY. It is important to note that the thread backs off only when it fails to acquire a lock that it had immediately before observed to be free. Observing that the lock is held by another thread says nothing about the level of contention. Art of Multiprocessor Programming

Spin-Waiting Overhead
TTAS Lock time Backoff lock The graph shows that the backoff lock outperforms the TTAS lock, though it is far from the ideal curve which is flat. The slope of the backoff curve varies greatly from one machine to another, but is invariably better than that of a TTAS lock. threads Art of Multiprocessor Programming

Backoff: Other Issues Good Easy to implement Beats TTAS lock Bad Must choose parameters carefully Not portable across platforms The Backoff lock is easy to implement, and typically performs significantly better than TTAS lock on many architectures. Unfortunately, its performance is sensitive to the choice of minimum and maximum delay constants. To deploy this lock on a particular architecture, it is easy to experiment with different values, and choose the ones that work best. Experience shows, however, that these optimal values are sensitive to the number of processors and their speed, so it is not easy to tune the back-off lock class to be portable across a range of different machines. Art of Multiprocessor Programming

Idea Avoid useless invalidations By keeping a queue of threads Each thread Notifies next in line Without bothering the others We now explore a different approach to implementing spin locks, one that is a little more complicated than backoff locks, but inherently more portable. One can overcome these drawbacks by having threads form a line, or queue. In a queue, each thread can learn if its turn has arrived by checking whether its predecessor has been served. Invalidation traffic is reduced by having each thread spin on a different location. A queue also allows for better utilization of the critical section since there is no need to guess when to attempt to access it: each thread is notified directly by its predecessor in the queue. Finally, a queue provides first-come-first-served fairness, the same high level of fairness achieved by the Bakery algorithm. We now explore different ways to implement queue locks, a family of locking algorithms that exploit these insights. Art of Multiprocessor Programming

Anderson Queue Lock idle next flags Here is the Anderson queue lock, a simple array-based queue lock. The threads share an atomic integer tail field, initially zero. To acquire the lock, each thread atomically increments the tail field. Call the resulting value the thread's slot. The slot is used as an index into a Boolean flag array. If flag[j] is true, then the thread with slotj has permission to acquire the lock. Initially, flag[0] is true. T F F F F F F F Art of Multiprocessor Programming

Anderson Queue Lock acquiring next getAndIncrement flags To acquire the lock, each thread atomically increments the tail field. Call the resulting value the thread's slot. T F F F F F F F Art of Multiprocessor Programming

Anderson Queue Lock acquiring next getAndIncrement flags T F F F F F F F Art of Multiprocessor Programming

Anderson Queue Lock acquired next Mine! flags The slot is used as an index into a Boolean flag array. If flag[j] is true, then the thread with slot j has permission to acquire the lock. Initially, flag[0] is true. To acquire the lock, a thread spins until the flag at its slot becomes true. T F F F F F F F Art of Multiprocessor Programming

Anderson Queue Lock acquired acquiring next flags Here another thread wants to acquire the lock. T F F F F F F F Art of Multiprocessor Programming

Anderson Queue Lock acquired acquiring next getAndIncrement flags It applies get-and-increment to the next pointer. T F F F F F F F Art of Multiprocessor Programming

Anderson Queue Lock acquired acquiring next getAndIncrement flags Advances the next pointer to acquire its own slot. T F F F F F F F Art of Multiprocessor Programming

Anderson Queue Lock acquired acquiring next flags Then it spins until the flag variable at that slot becomes true. T F F F F F F F Art of Multiprocessor Programming

Anderson Queue Lock released acquired next flags The first thread releases the lock by setting the next slot to true. T T F F F F F F Art of Multiprocessor Programming

Anderson Queue Lock released acquired next Yow! flags The second thread notices the change, and enters its critical section. T T F F F F F F Art of Multiprocessor Programming

Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,…,false}; AtomicInteger next = new AtomicInteger(0); ThreadLocal<Integer> mySlot; Here is what the code looks like. Art of Multiprocessor Programming

Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,…,false}; AtomicInteger next = new AtomicInteger(0); ThreadLocal<Integer> mySlot; One flag per thread We have one flag per thread (this means we have to know how many threads there are). Art of Multiprocessor Programming

Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,…,false}; AtomicInteger next = new AtomicInteger(0); ThreadLocal<Integer> mySlot; The next field tells us which flag to use. Next flag to use Art of Multiprocessor Programming

Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,…,false}; AtomicInteger next = new AtomicInteger(0); ThreadLocal<Integer> mySlot; Each thread has a thread-local variable that keeps track of its slot. Thread-local variable Art of Multiprocessor Programming

Anderson Queue Lock public lock() { mySlot = next.getAndIncrement(); while (!flags[mySlot % n]) {}; flags[mySlot % n] = false; } public unlock() { flags[(mySlot+1) % n] = true; Here is the code for the lock and unlock methods. Art of Multiprocessor Programming

Anderson Queue Lock public lock() { mySlot = next.getAndIncrement(); while (!flags[mySlot % n]) {}; flags[mySlot % n] = false; } public unlock() { flags[(mySlot+1) % n] = true; First, claim a slot by atomically incrementing the next field. Take next slot Art of Multiprocessor Programming

Anderson Queue Lock public lock() { mySlot = next.getAndIncrement(); while (!flags[mySlot % n]) {}; flags[mySlot % n] = false; } public unlock() { flags[(mySlot+1) % n] = true; Next wait until my predecessor has released the lock. Spin until told to go Art of Multiprocessor Programming

Anderson Queue Lock public lock() { myslot = next.getAndIncrement(); while (!flags[myslot % n]) {}; flags[myslot % n] = false; } public unlock() { flags[(myslot+1) % n] = true; Reset my slot to false so that it can be used the next time around. Prepare slot for re-use Art of Multiprocessor Programming

Anderson Queue Lock public lock() { mySlot = next.getAndIncrement(); while (!flags[mySlot % n]) {}; flags[mySlot % n] = false; } public unlock() { flags[(mySlot+1) % n] = true; Tell next thread to go To release the lock, set the slot after mine to true, being careful to wrap around. Art of Multiprocessor Programming

Local Spinning released acquired T F F F F F F F Spin on my next bit
flags Is spinning really local if multiple threads’ bits sit in the same cache line? T F F F F F F F Unfortunately many bits share cache line Art of Multiprocessor Programming

False Sharing released acquired Spin on my bit next Result: contention Spinning thread gets cache invalidation on account of store by threads it is not waiting for flags False sharing occurs when a thread spinning on one bit incurs an invalidation of the line it is spinning on even though the actual bit it is interested in hs not changed. T F F F F F F F Line 1 Line 2 Art of Multiprocessor Programming

The Solution: Padding released acquired Spin on my line next flags To avoid false sharing each bit sits in a separate cache line. T / / / F / / / Line 1 Line 2 Art of Multiprocessor Programming

Performance TTAS Shorter handover than backoff Curve is practically flat Scalable performance The Anderson queue lock improves on Back-off locks because it reduces invalidations to a minimum and schedules access to the critical section tightly, minimizing the interval between when a lock is freed by one thread and when is acquired by another. There is also a theoretical benefit: unlike the TTAS and backoff lock, this algorithm guarantees that there is no lockout, and in fact, provides first-come-first-served fairness, which we actually loose in TTAS and TTAS with backoff. queue Art of Multiprocessor Programming

Anderson Queue Lock Good First truly scalable lock Simple, easy to implement Back to FIFO order (like Bakery) Art of Multiprocessor Programming

Anderson Queue Lock Bad Space hog… One bit per thread  one cache line per thread What if unknown number of threads? What if small number of actual contenders? The Anderson lock has several disadvantages. First, it is not space-efficient. It requires knowing a bound N on the maximum number of concurrent threads, and it allocates an array of that size per lock. Thus, L locks will require O(LN) space even if a thread accesses only one lock at a given time. Second, the lock is poorly suited for uncached architectures, since any thread may end up spinning on any array location, and in the absence of caches, spinning on a remote location may be very expensive. Finally, even with caching one must use some form of padding to avoid false sharing which increases the space complexity further. Art of Multiprocessor Programming

CLH Lock FIFO order Small, constant-size overhead per thread The CLH queue lock (by Travis Craig, Erik Hagersten, and Anders Landin) is much more space-efficient, since it incurs a small constant-size overhead per thread. Art of Multiprocessor Programming

Initially idle This algorithm records each thread's status in a QNode object, which has a Boolean locked field. If that field is true, then the corresponding thread either has acquired the lock or is waiting for the lock. If that field is false, then the thread has released the lock. The lock itself is represented as a virtual linked list of QNodes objects. We use the term ``virtual'' because the list is implicit: each thread points to its predecessor through a thread-local predvariable. The public tail variable points to the last node in the queue. tail false Art of Multiprocessor Programming

Initially idle tail false Queue tail Art of Multiprocessor Programming

Initially idle Lock is free tail false Art of Multiprocessor Programming

Initially idle tail false Art of Multiprocessor Programming

Purple Wants the Lock acquiring tail false Art of Multiprocessor Programming

Purple Wants the Lock acquiring To acquire the lock, a thread sets the locked field of its QNode to true, meaning that the thread is not ready to release the lock. tail false true Art of Multiprocessor Programming

Purple Wants the Lock acquiring Swap The thread applies swap to the tail to make its own node the tail of the queue, simultaneously acquiring a reference to its predecessor's QNode. tail false true Art of Multiprocessor Programming

Purple Has the Lock acquired Because it sees that its predecessor’s QNode is false, this thread now has the lock. tail false true Art of Multiprocessor Programming

Red Wants the Lock acquired acquiring Another thread that wants the lock does the same sequence of steps … tail false true true Art of Multiprocessor Programming

Red Wants the Lock acquired acquiring Swap tail false true true Art of Multiprocessor Programming

Red Wants the Lock acquired acquiring Notice that the list is ordered implictely, there are no real pointers between the nodes… tail false true true Art of Multiprocessor Programming

Red Wants the Lock acquired acquiring Implicit Linked list Notice that the list is ordered implictely, there are no real pointers between the nodes… tail false true true Art of Multiprocessor Programming

Red Wants the Lock acquired acquiring The thread then spins on the predecessor's QNode until the predecessor releases the lock. tail false true true Art of Multiprocessor Programming

Red Wants the Lock acquired acquiring Actually, it spins on cached copy true The thread then spins actually spins on a chaced copy of purples node which is very efficient in terms of interconnect traffic. tail false true true Art of Multiprocessor Programming

Purple Releases release acquiring Bingo! false The thread then spins actually spins on a cached copy of purples node which is very efficient in terms of interconnect traffic. In fact, in some coherence protocols shared memory might not be updated at all, only the cached copy. This is very efficient. tail false false true Art of Multiprocessor Programming

Purple Releases released acquired When a thread acquires a lock it can reuses its predecessor's QNode as its new node for future lock accesses. Note that it can do so since at this point the thread's predecessor's QNode will no longer be used by the predecessor, and the thread's old QNode is pointed-to either by the thread's successor or by the tail. tail true Art of Multiprocessor Programming

Space Usage Let L = number of locks N = number of threads ALock O(LN) CLH lock O(L+N) The reuse of the QNode's means that for $L$ locks, if each thread accesses at most one lock at a time, we only need O(L+N) space as compared with O(LN) for the ALock. Art of Multiprocessor Programming

CLH Queue Lock class Qnode { AtomicBoolean locked = new AtomicBoolean(true); } Here is what the node looks like as a Java object. Art of Multiprocessor Programming

CLH Queue Lock class Qnode { AtomicBoolean locked = new AtomicBoolean(true); } If the locked field is true, the lock has not been released yet (it may also not have been acquired yet either). Not released yet Art of Multiprocessor Programming

CLH Queue Lock class CLHLock implements Lock { AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} Here is the code for lock(). Art of Multiprocessor Programming

CLH Queue Lock class CLHLock implements Lock { AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} Here is the code for lock(). Queue tail Art of Multiprocessor Programming

CLH Queue Lock class CLHLock implements Lock { AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} Here is the code for lock(). Thread-local Qnode Art of Multiprocessor Programming

CLH Queue Lock class CLHLock implements Lock { AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} Swap in my node Here is the code for lock(). Art of Multiprocessor Programming

CLH Queue Lock Spin until predecessor releases lock
class CLHLock implements Lock { AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} Spin until predecessor releases lock Here is the code for lock(). Art of Multiprocessor Programming

CLH Queue Lock Class CLHLock implements Lock { … public void unlock() { myNode.locked.set(false); myNode = pred; } Art of Multiprocessor Programming

CLH Queue Lock Class CLHLock implements Lock { … public void unlock() { myNode.locked.set(false); myNode = pred; } Notify successor Art of Multiprocessor Programming

CLH Queue Lock Recycle predecessor’s node
Class CLHLock implements Lock { … public void unlock() { myNode.locked.set(false); myNode = pred; } Recycle predecessor’s node Art of Multiprocessor Programming

(we don’t actually reuse myNode. Code in book shows how it’s done.)
CLH Queue Lock Class CLHLock implements Lock { … public void unlock() { myNode.locked.set(false); myNode = pred; } (we don’t actually reuse myNode. Code in book shows how it’s done.) Art of Multiprocessor Programming

CLH Lock Good Lock release affects predecessor only Small, constant-sized space Bad Doesn’t work for uncached NUMA architectures Like the ALock, this algorithm has each thread spin on a distinct location, so when one thread releases its lock, it invalidates its successor's cache only, and does not invalidate any other thread's cache. It does so with a lower space overhead, and, importantly, without requiring prior knowledge of the number of threads accessing a lock It also provides first-come-first-served fairness. Art of Multiprocessor Programming

NUMA Architecturs Acronym: Non-Uniform Memory Architecture Illusion: Flat shared memory Truth: No caches (sometimes) Some memory regions faster than others The principal disadvantage of this lock algorithm is that it performs poorly on cacheless NUMA architectures. Each thread spins waiting for its predecessor's node's to become false. If this memory location is remote, then performance will suffer. On cache-coherent architectures, however, this approach should work well. Art of Multiprocessor Programming

Spinning on local memory is fast
NUMA Machines Spinning on local memory is fast Art of Multiprocessor Programming

Spinning on remote memory is slow
NUMA Machines Spinning on remote memory is slow Art of Multiprocessor Programming

CLH Lock Each thread spins on predecessor’s memory Could be far away … Art of Multiprocessor Programming

MCS Lock FIFO order Spin on local memory only Small, Constant-size overhead The MCS lock is another kind of queue lock that ensures that processes always spin on a fixed location, so this one works well for cacheless architectures. Like the CLH lock, it uses only a small fixed-size overhead per thread. Art of Multiprocessor Programming

Initially idle Here, too, the lock is represented as a linked list of QNodes, where each QNode represents either a lock holder or a thread waiting to acquire the lock. Unlike the CLH lock, the list is explicit, not virtual. tail false false Art of Multiprocessor Programming

Acquiring acquiring (allocate Qnode) true To acquire the lock, a thread places its own QNode} at the tail of the list. (Notice that the Setting of the value to true is a simplification of the presentation. In the code, we allocate the node with its value false so that we can get out of the with loop immediately if there is no predecessor. But if there is a predecessor, a thread sets the value of its flag to true before it redirects the other thread's pointer to it, so in effect its true whenever its used...thus, in order not to confuse the viewer with the minor issue, we diverge slightly from the code and describe the node as being true here). tail false false Art of Multiprocessor Programming

Acquiring acquired true swap The node then swaps in a reference to its own QNode. (Notice that the Setting of the value to true is a simplification of the presentation. In the code, we allocate the node with its value false so that we can get out of the with loop immediately if there is no predecessor. But if there is a predecessor, a thread sets the value of its flag to true before it redirects the other thread's pointer to it, so in effect its true whenever its used...thus, in order not to confuse the viewer with the minor issue, we diverge slightly from the code and describe the node as being true here). tail false false Art of Multiprocessor Programming

Acquiring acquired true At this point the swap is completed, and the queue variable points to the tail of the queue. tail false false Art of Multiprocessor Programming

Acquired acquired true To acquire the lock, a thread places its own QNode} at the tail of the list. If it has a predecessor, it modifies the predecessor's node to refer back to its own QNode. tail false false Art of Multiprocessor Programming

Acquiring acquiring acquired false tail true swap Art of Multiprocessor Programming

Acquiring acquiring acquired false tail true Art of Multiprocessor Programming

Acquiring acquiring acquired true tail false true Art of Multiprocessor Programming

Acquiring acquiring acquired Yes! true tail false true Art of Multiprocessor Programming

MCS Queue Lock class Qnode { boolean locked = false; qnode next = null; } Art of Multiprocessor Programming

MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} Art of Multiprocessor Programming

MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} Make a QNode Art of Multiprocessor Programming

add my Node to the tail of queue
MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} add my Node to the tail of queue Art of Multiprocessor Programming

Fix if queue was non-empty
MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} Fix if queue was non-empty Art of Multiprocessor Programming

MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} Wait until unlocked Art of Multiprocessor Programming

MCS Queue Unlock class MCSLock implements Lock { AtomicReference tail; public void unlock() { if (qnode.next == null) { if (tail.CAS(qnode, null) return; while (qnode.next == null) {} } qnode.next.locked = false; }} Art of Multiprocessor Programming

MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void unlock() { if (qnode.next == null) { if (tail.CAS(qnode, null) return; while (qnode.next == null) {} } qnode.next.locked = false; }} Missing successor? Art of Multiprocessor Programming

If really no successor, return
MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void unlock() { if (qnode.next == null) { if (tail.CAS(qnode, null) return; while (qnode.next == null) {} } qnode.next.locked = false; }} If really no successor, return Art of Multiprocessor Programming

Otherwise wait for successor to catch up
MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void unlock() { if (qnode.next == null) { if (tail.CAS(qnode, null) return; while (qnode.next == null) {} } qnode.next.locked = false; }} Otherwise wait for successor to catch up Art of Multiprocessor Programming

MCS Queue Lock class MCSLock implements Lock { AtomicReference queue; public void unlock() { if (qnode.next == null) { if (tail.CAS(qnode, null) return; while (qnode.next == null) {} } qnode.next.locked = false; }} Pass lock to successor Art of Multiprocessor Programming

Purple Release releasing swap false false Art of Multiprocessor Programming

Purple Release releasing swap
By looking at the queue, I see another thread is active releasing swap false false Art of Multiprocessor Programming

Purple Release releasing swap
By looking at the queue, I see another thread is active releasing swap false false I have to wait for that thread to finish Art of Multiprocessor Programming

Purple Release releasing prepare to spin true false Art of Multiprocessor Programming

Purple Release releasing spinning true false Art of Multiprocessor Programming

Purple Release releasing spinning true false false Art of Multiprocessor Programming

Purple Release releasing Acquired lock true false false Art of Multiprocessor Programming

Abortable Locks What if you want to give up waiting for a lock? For example Timeout Database transaction aborted by user The spin-lock constructions we have studied up till now provide first-come-first-served access with little contention, making them useful in many applications. In real-time systems, however, threads may require the ability to \emph{abort}, canceling a pending request to acquire a lock in order to meet some other real-time deadline. Art of Multiprocessor Programming

Back-off Lock Aborting is trivial Just return from lock() call Extra benefit: No cleaning up Wait-free Immediate return Aborting a backoff lock request is trivial: if the request takes too long, simply return from the lock() method. The lock's state is unaffected by any thread's abort. A nice property of this lock is that the abort itself is immediate: it is wait-free, requiring only a constant number of steps. Art of Multiprocessor Programming

Queue Locks Can’t just quit Thread in line behind will starve Need a graceful way out Art of Multiprocessor Programming

Queue Locks spinning spinning spinning true true true Art of Multiprocessor Programming

Queue Locks locked spinning spinning false true true Art of Multiprocessor Programming

Queue Locks locked spinning false true Art of Multiprocessor Programming

Queue Locks locked false Art of Multiprocessor Programming

Queue Locks spinning spinning spinning true true true Art of Multiprocessor Programming

Queue Locks spinning spinning true true true Art of Multiprocessor Programming

Queue Locks locked spinning false true true Art of Multiprocessor Programming

Queue Locks spinning false true Art of Multiprocessor Programming

Queue Locks pwned false true Art of Multiprocessor Programming

Abortable CLH Lock When a thread gives up Removing node in a wait-free way is hard Idea: let successor deal with it. Removing a node in a wait-free manner from the middle of the list is difficult if we want that thread to reuse its own node. Instead, the aborting thread marks the node so that the successor in the list will reuse the abandoned node, and wait for that node's predecessor. Art of Multiprocessor Programming

Initially idle Pointer to predecessor (or null) tail A
Art of Multiprocessor Programming

Initially idle Distinguished available node means lock is free tail A
Art of Multiprocessor Programming

Acquiring acquiring tail A Art of Multiprocessor Programming

Acquiring Null predecessor means lock not released or aborted
Each thread allocates a new QNode for each lock access. A Null predecessor value means that the QNode is not aborted and has not yet completed executing the critical section. A Art of Multiprocessor Programming

Acquiring acquiring Swap Each thread allocates a new QNode for each lock access. A Null predecessor value means that the QNode is not aborted and has not yet completed executing the critical section. A Art of Multiprocessor Programming

Acquiring acquiring Each thread allocates a new QNode for each lock access. A Null predecessor value means that the QNode is not aborted and has not yet completed executing the critical section. A Art of Multiprocessor Programming

Reference to AVAILABLE means lock is free.
Acquired Reference to AVAILABLE means lock is free. locked Each thread allocates a new QNode for each lock access. A Null predecessor value means that the QNode is not aborted and has not yet completed executing the critical section. A Art of Multiprocessor Programming

Null means lock is not free & request not aborted
Normal Case spinning locked Null means lock is not free & request not aborted Art of Multiprocessor Programming

One Thread Aborts locked Timed out spinning Art of Multiprocessor Programming

Non-Null means predecessor aborted
Successor Notices locked Timed out spinning Non-Null means predecessor aborted Art of Multiprocessor Programming

Recycle Predecessor’s Node
locked spinning Art of Multiprocessor Programming

Spin on Earlier Node locked spinning Art of Multiprocessor Programming

Spin on Earlier Node released spinning A The lock is now mine Art of Multiprocessor Programming

Time-out Lock public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode; Art of Multiprocessor Programming

AVAILABLE node signifies free lock
Time-out Lock public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode; AVAILABLE node signifies free lock Art of Multiprocessor Programming

Time-out Lock public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode; Tail of the queue Art of Multiprocessor Programming

Time-out Lock public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode; Remember my node … Art of Multiprocessor Programming

Time-out Lock public boolean lock(long timeout) { Qnode qnode = new Qnode(); myNode.set(qnode); qnode.prev = null; Qnode myPred = tail.getAndSet(qnode); if (myPred== null || myPred.prev == AVAILABLE) { return true; } … Art of Multiprocessor Programming

Create & initialize node
Time-out Lock public boolean lock(long timeout) { Qnode qnode = new Qnode(); myNode.set(qnode); qnode.prev = null; Qnode myPred = tail.getAndSet(qnode); if (myPred == null || myPred.prev == AVAILABLE) { return true; } Create & initialize node Art of Multiprocessor Programming

Time-out Lock public boolean lock(long timeout) { Qnode qnode = new Qnode(); myNode.set(qnode); qnode.prev = null; Qnode myPred = tail.getAndSet(qnode); if (myPred == null || myPred.prev == AVAILABLE) { return true; } Swap with tail Art of Multiprocessor Programming

If predecessor absent or released, we are done
Time-out Lock public boolean lock(long timeout) { Qnode qnode = new Qnode(); myNode.set(qnode); qnode.prev = null; Qnode myPred = tail.getAndSet(qnode); if (myPred == null || myPred.prev == AVAILABLE) { return true; } ... If predecessor absent or released, we are done Art of Multiprocessor Programming

Time-out Lock spinning locked … long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } Art of Multiprocessor Programming

Keep trying for a while …
Time-out Lock … long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } Keep trying for a while … Art of Multiprocessor Programming

Spin on predecessor’s prev field
Time-out Lock … long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } Spin on predecessor’s prev field Art of Multiprocessor Programming

Predecessor released lock
Time-out Lock … long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } Predecessor released lock Art of Multiprocessor Programming

Predecessor aborted, advance one
Time-out Lock … long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } Predecessor aborted, advance one Art of Multiprocessor Programming

What do I do when I time out?
Time-out Lock … if (!tail.compareAndSet(qnode, myPred)) qnode.prev = myPred; return false; } What do I do when I time out? Art of Multiprocessor Programming

Time-out Lock … if (!tail.compareAndSet(qnode, myPred)) qnode.prev = myPred; return false; } In case a node is timing out it first checks if tail points to its node, if Do I have a successor? If CAS fails, I do. Tell it about myPred Art of Multiprocessor Programming

If CAS succeeds: no successor, simply return false
Time-out Lock … if (!tail.compareAndSet(qnode, myPred)) qnode.prev = myPred; return false; } Redirecting to predecessor tells the successors that node belongs to aborted lock() request and that they must spin on its predecessor. If CAS succeeds: no successor, simply return false Art of Multiprocessor Programming

Time-Out Unlock public void unlock() { Qnode qnode = myNode.get(); if (!tail.compareAndSet(qnode, null)) qnode.prev = AVAILABLE; } Art of Multiprocessor Programming

Time-out Unlock public void unlock() { Qnode qnode = myNode.get(); if (!tail.compareAndSet(qnode, null)) qnode.prev = AVAILABLE; } If the CAS() fails, the condition is true, there is a successor and so I must notify it what to wait on since I am timing out. If CAS failed: successor exists, notify it can enter Art of Multiprocessor Programming

Timing-out Lock public void unlock() { Qnode qnode = myNode.get(); if (!tail.compareAndSet(qnode, null)) qnode.prev = AVAILABLE; } If the tail points to me, then no one is waiting and putting a null completes the unlock() CAS successful: set tail to null, no clean up since no successor waiting Art of Multiprocessor Programming

One Lock To Rule Them All?
TTAS+Backoff, CLH, MCS, ToLock… Each better than others in some way There is no one solution Lock we pick really depends on: the application the hardware which properties are important In this lecture we saw a variety of spin locks that vary in characteristics and performance. Such a variety is useful, because no single algorithm is ideal for all applications. For some applications, complex algorithms work best, and for others, simple algorithms are preferable. The best choice usually depends on specific aspects of the application and the target architecture. Art of Multiprocessor Programming

This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License. You are free: to Share — to copy, distribute and transmit the work to Remix — to adapt the work Under the following conditions: Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work). Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights. Art of Multiprocessor Programming

Spin Locks and Contention

Similar presentations

Presentation on theme: "Spin Locks and Contention"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Spin Locks and Contention

Similar presentations

Presentation on theme: "Spin Locks and Contention"— Presentation transcript:

Similar presentations

About project

Feedback