Spin Locks and Contention Management The Art of Multiprocessor Programming Spring 2007.

© Herlihy-Shavit Focus so far: Correctness Models –Accurate (we never lied to you) –But idealized (so we forgot to mention a few things) Protocols –Elegant –Important –But naïve

© Herlihy-Shavit New Focus: Performance Models –More complicated (not the same as complex!) –Still focus on principles (not soon obsolete) Protocols –Elegant (in their fashion) –Important (why else would we pay attention) –And realistic (your mileage may vary)

© Herlihy-Shavit Kinds of Architectures SISD (Uniprocessor) –Single instruction stream –Single data stream SIMD (Vector) –Single instruction –Multiple data MIMD (Multiprocessors) –Multiple instruction –Multiple data.

© Herlihy-Shavit Kinds of Architectures SISD (Uniprocessor) –Single instruction stream –Single data stream SIMD (Vector) –Single instruction –Multiple data MIMD (Multiprocessors) –Multiple instruction –Multiple data. Our space (1)

© Herlihy-Shavit MIMD Architectures Memory Contention Communication Contention Communication Latency Shared Bus memory Distributed

© Herlihy-Shavit Today: Revisit Mutual Exclusion Think of performance, not just correctness and progress Begin to understand how performance depends on our software properly utilizing the multiprocessor machine's hardware And get to know a collection of locking algorithms… (1)

© Herlihy-Shavit What Should you do if you can’t get a lock? Keep trying –“spin” or “busy-wait” –Good if delays are short Give up the processor –Good if delays are long –Always good on uniprocessor (1)

© Herlihy-Shavit What Should you do if you can’t get a lock? Keep trying –“spin” or “busy-wait” –Good if delays are short Give up the processor –Good if delays are long –Always good on uniprocessor our focus

© Herlihy-Shavit Basic Spin-Lock CS Resets lock upon exit spin lock critical section...

© Herlihy-Shavit Basic Spin-Lock CS Resets lock upon exit spin lock critical section... …lock introduces sequential bottleneck

© Herlihy-Shavit Basic Spin-Lock CS Resets lock upon exit spin lock critical section... …lock suffers from contention

© Herlihy-Shavit Basic Spin-Lock CS Resets lock upon exit spin lock critical section... Notice: these are two different phenomena …lock suffers from contention Seq bottleneck ⇒ no parallelism

© Herlihy-Shavit Basic Spin-Lock CS Resets lock upon exit spin lock critical section... Contention ⇒ ??? …lock suffers from contention

© Herlihy-Shavit Review: Test-and-Set Boolean value Test-and-set (TAS) –Swap true with current value –Return value tells if prior value was true or false Can reset just by writing false TAS aka “getAndSet”

© Herlihy-Shavit Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; } (5)

© Herlihy-Shavit Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; } Package java.util.concurrent.atomic

© Herlihy-Shavit Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; } Swap old and new values

© Herlihy-Shavit Review: Test-and-Set AtomicBoolean lock = new AtomicBoolean(false) … boolean prior = lock.getAndSet(true)

© Herlihy-Shavit Review: Test-and-Set AtomicBoolean lock = new AtomicBoolean(false) … boolean prior = lock.getAndSet(true) (5) Swapping in true is called “test-and-set” or TAS

© Herlihy-Shavit Test-and-Set Locks Locking –Lock is free: value is false –Lock is taken: value is true Acquire lock by calling TAS –If result is false, you win –If result is true, you lose Release lock by writing false

© Herlihy-Shavit Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }}

© Herlihy-Shavit Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }} Lock state is AtomicBoolean

© Herlihy-Shavit Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }} Keep trying until lock acquired

© Herlihy-Shavit Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }} Release lock by resetting state to false
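The TAS lock above can be exercised with a small driver. Below is a minimal sketch (the class and driver names are ours, not from the slides): four threads share a counter protected by the lock, and with correct mutual exclusion every increment survives.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Test-and-set lock exactly as on the slides.
class TASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);
    public void lock()   { while (state.getAndSet(true)) {} }  // spin until we swap in true
    public void unlock() { state.set(false); }                 // reset lock upon exit
}

public class TASDemo {
    static int counter = 0;  // shared data protected by the lock

    public static void main(String[] args) throws InterruptedException {
        TASLock lock = new TASLock();
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 10_000; i++) {
                    lock.lock();
                    try { counter++; } finally { lock.unlock(); }
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(counter);  // 4 threads x 10,000 increments each
    }
}
```

The volatile semantics of AtomicBoolean make the release visible to the next acquirer, so the plain `counter` field is safe here.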

© Herlihy-Shavit Space Complexity TAS spin-lock has small “footprint” N thread spin-lock uses O(1) space As opposed to O(n) Peterson/Bakery How did we overcome the Ω(n) lower bound? We used an RMW operation…

© Herlihy-Shavit Performance Experiment –n threads –Increment shared counter 1 million times How long should it take? How long does it take?
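The experiment on this slide can be sketched directly in Java (names such as `NUM_OPS` and `runExperiment` are ours). Ideally, doubling the threads would halve the time; in practice the lock serializes the increments, so the time never drops, and under a TAS lock it grows with contention:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class CounterBench {
    static final int NUM_OPS = 1_000_000;                         // total increments per run
    static final AtomicBoolean state = new AtomicBoolean(false);  // TAS lock state
    static long counter = 0;

    static long runExperiment(int nThreads) throws InterruptedException {
        counter = 0;
        final int opsPerThread = NUM_OPS / nThreads;
        Thread[] workers = new Thread[nThreads];
        long start = System.nanoTime();
        for (int t = 0; t < nThreads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < opsPerThread; i++) {
                    while (state.getAndSet(true)) {}  // TAS lock
                    counter++;                        // critical section
                    state.set(false);                 // release
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int n : new int[]{1, 2, 4}) {
            long ms = runExperiment(n) / 1_000_000;
            System.out.println(n + " threads: " + ms + " ms, counter=" + counter);
        }
    }
}
```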

© Herlihy-Shavit Graph ideal time threads no speedup because of sequential bottleneck

© Herlihy-Shavit Mystery #1 time threads TAS lock Ideal (1) What is going on? Lets try and fix it…

© Herlihy-Shavit Test-and-Test-and-Set Locks Lurking stage –Wait until lock “looks” free –Spin while read returns true (lock taken) Pouncing state –As soon as lock “looks” available –Read returns false (lock free) –Call TAS to acquire lock –If TAS loses, back to lurking

© Herlihy-Shavit Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; }

© Herlihy-Shavit Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; } Wait until lock looks free

© Herlihy-Shavit Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; } Then try to acquire it
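The slides show only lock(); a complete TTAS lock, with the unlock the later slides assume, might look like the following sketch (driver names are ours):

```java
import java.util.concurrent.atomic.AtomicBoolean;

class TTASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);
    public void lock() {
        while (true) {
            while (state.get()) {}        // lurk: spin on a (locally cached) read
            if (!state.getAndSet(true))   // pounce: one TAS once the lock looks free
                return;                   // TAS won: lock acquired
        }                                 // TAS lost: back to lurking
    }
    public void unlock() { state.set(false); }
}

public class TTASDemo {
    static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        TTASLock lock = new TTASLock();
        Runnable work = () -> {
            for (int i = 0; i < 10_000; i++) {
                lock.lock();
                try { counter++; } finally { lock.unlock(); }
            }
        };
        Thread a = new Thread(work), b = new Thread(work);
        a.start(); b.start(); a.join(); b.join();
        System.out.println(counter);  // 2 threads x 10,000 increments
    }
}
```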

© Herlihy-Shavit Mystery #2 TAS lock TTAS lock Ideal time threads

© Herlihy-Shavit Mystery Both –TAS and TTAS –Do the same thing (in our model) Except that –TTAS performs much better than TAS –Neither approaches ideal

© Herlihy-Shavit Opinion Our memory abstraction is broken TAS & TTAS methods –Are provably the same (in our model) –Except they aren’t (in field tests) Need a more detailed model …

© Herlihy-Shavit Bus-Based Architectures Bus cache memory cache

© Herlihy-Shavit Bus-Based Architectures Bus cache memory cache Random access memory (10s of cycles)

© Herlihy-Shavit Bus-Based Architectures cache memory cache Shared Bus broadcast medium One broadcaster at a time Processors and memory all “snoop” Bus

© Herlihy-Shavit Bus-Based Architectures Bus cache memory cache Per-Processor Caches Small Fast: 1 or 2 cycles Address & state information

© Herlihy-Shavit Jargon Watch Cache hit –“I found what I wanted in my cache” –Good Thing™

© Herlihy-Shavit Jargon Watch Cache hit –“I found what I wanted in my cache” –Good Thing™ Cache miss –“I had to shlep all the way to memory for that data” –Bad Thing™

© Herlihy-Shavit Cave Canem This model is still a simplification –But not in any essential way –Illustrates basic principles Will discuss complexities later

© Herlihy-Shavit Bus Processor Issues Load Request cache memory cache data

© Herlihy-Shavit Bus Processor Issues Load Request Bus cache memory cache data Gimme data

© Herlihy-Shavit cache Bus Memory Responds Bus memory cache data Got your data right here data

© Herlihy-Shavit Bus Processor Issues Load Request memory cache data Gimme data

© Herlihy-Shavit Bus Processor Issues Load Request Bus memory cache data Gimme data

© Herlihy-Shavit Bus Processor Issues Load Request Bus memory cache data I got data

© Herlihy-Shavit Bus Other Processor Responds memory cache data I got data data Bus

© Herlihy-Shavit Bus Other Processor Responds memory cache data Bus

© Herlihy-Shavit Modify Cached Data Bus data memory cachedata (1)

© Herlihy-Shavit Modify Cached Data Bus data memory cachedata (1)

© Herlihy-Shavit memory Bus data Modify Cached Data cachedata

© Herlihy-Shavit memory Bus data Modify Cached Data cache What’s up with the other copies? data

© Herlihy-Shavit Cache Coherence We have lots of copies of data –Original copy in memory –Cached copies at processors Some processor modifies its own copy –What do we do with the others? –How to avoid confusion?

© Herlihy-Shavit Write-Back Caches Accumulate changes in cache Write back when needed –Need the cache for something else –Another processor wants it On first modification –Invalidate other entries –Requires non-trivial protocol …

© Herlihy-Shavit Write-Back Caches Cache entry has three states –Invalid: contains raw seething bits –Valid: I can read but I can’t write –Dirty: Data has been modified Intercept other load requests Write back to memory before using cache

© Herlihy-Shavit Bus Invalidate memory cachedata

© Herlihy-Shavit Bus Invalidate Bus memory cachedata Mine, all mine!

© Herlihy-Shavit Bus Invalidate Bus memory cachedata cache Uh,oh

© Herlihy-Shavit cache Bus Invalidate memory cachedata Other caches lose read permission

© Herlihy-Shavit cache Bus Invalidate memory cachedata Other caches lose read permission This cache acquires write permission

© Herlihy-Shavit cache Bus Invalidate memory cachedata Memory provides data only if not present in any cache, so no need to change it now (expensive) (2)

© Herlihy-Shavit cache Bus Another Processor Asks for Data memory cachedata (2) Bus

© Herlihy-Shavit cache data Bus Owner Responds memory cachedata (2) Bus Here it is!

© Herlihy-Shavit Bus End of the Day … memory cachedata (1) Reading OK, no writing data

© Herlihy-Shavit Mutual Exclusion What do we want to optimize? –Bus bandwidth used by spinning threads –Release/Acquire latency –Acquire latency for idle lock

© Herlihy-Shavit Simple TASLock TAS invalidates cache lines Spinners –Miss in cache –Go to bus Thread wants to release lock –delayed behind spinners

© Herlihy-Shavit Test-and-test-and-set Wait until lock “looks” free –Spin on local cache –No bus use while lock busy Problem: when lock is released –Invalidation storm …

© Herlihy-Shavit Local Spinning while Lock is Busy Bus memory busy

© Herlihy-Shavit Bus On Release memory freeinvalid free

© Herlihy-Shavit On Release Bus memory freeinvalid free miss Everyone misses, rereads (1)

© Herlihy-Shavit On Release Bus memory freeinvalid free TAS(…) Everyone tries TAS (1)

© Herlihy-Shavit Problems Everyone misses –Reads satisfied sequentially Everyone does TAS –Invalidates others’ caches Eventually quiesces after lock acquired –How long does this take?

© Herlihy-Shavit Measuring Quiescence Time P₁ P₂ … Pₙ X = time of ops that don’t use the bus Y = time of ops that cause intensive bus traffic In critical section, run ops X then ops Y. As long as quiescence time is less than X, no drop in performance. By gradually varying X, can determine the exact time to quiesce.

© Herlihy-Shavit Quiescence Time Increases linearly with the number of processors for a bus architecture time threads

© Herlihy-Shavit Mystery Explained TAS lock TTAS lock Ideal time threads Better than TAS but still not as good as ideal

© Herlihy-Shavit Solution: Introduce Delay spin lock time d r₁d r₂d If the lock looks free But I fail to get it There must be lots of contention Better to back off than to collide again

© Herlihy-Shavit Dynamic Example: Exponential Backoff time d 2d4d spin lock If I fail to get lock –wait random duration before retry –Each subsequent failure doubles expected wait

© Herlihy-Shavit Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}}

© Herlihy-Shavit Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Fix minimum delay

© Herlihy-Shavit Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Wait until lock looks free

© Herlihy-Shavit Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} If we win, return

© Herlihy-Shavit Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Back off for random duration

© Herlihy-Shavit Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Double the delay, within reason

© Herlihy-Shavit Spin-Waiting Overhead TTAS Lock Backoff lock time threads

© Herlihy-Shavit Backoff: Other Issues Good –Easy to implement –Beats TTAS lock Bad –Must choose parameters carefully –Not portable across platforms
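Putting the pieces together, a complete backoff lock might look like the sketch below. The MIN_DELAY/MAX_DELAY values are arbitrary placeholders; as the slide warns, tuning them is platform-specific.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;

class BackoffLock {
    private static final int MIN_DELAY = 1;   // ms; placeholder, tune per platform
    private static final int MAX_DELAY = 32;  // ms; placeholder, tune per platform
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        int delay = MIN_DELAY;
        while (true) {
            while (state.get()) {}               // wait until lock looks free
            if (!state.getAndSet(true)) return;  // we won the TAS
            try {                                // we lost: back off for a random duration
                Thread.sleep(ThreadLocalRandom.current().nextInt(delay) + 1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            if (delay < MAX_DELAY) delay = 2 * delay;  // double expected wait
        }
    }
    public void unlock() { state.set(false); }
}

public class BackoffDemo {
    static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        BackoffLock lock = new BackoffLock();
        Runnable work = () -> {
            for (int i = 0; i < 500; i++) {
                lock.lock();
                try { counter++; } finally { lock.unlock(); }
            }
        };
        Thread a = new Thread(work), b = new Thread(work);
        a.start(); b.start(); a.join(); b.join();
        System.out.println(counter);  // 2 threads x 500 increments
    }
}
```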

© Herlihy-Shavit Idea Avoid useless invalidations –By keeping a queue of threads Each thread –Notifies next in line –Without bothering the others

© Herlihy-Shavit Anderson Queue Lock flags next TFFFFFFF idle

© Herlihy-Shavit Anderson Queue Lock flags next TFFFFFFF acquiring getAndIncrement

© Herlihy-Shavit Anderson Queue Lock flags next TFFFFFFF acquiring getAndIncrement

© Herlihy-Shavit Anderson Queue Lock flags next TFFFFFFF acquired Mine!

© Herlihy-Shavit Anderson Queue Lock flags next TFFFFFFF acquired acquiring

© Herlihy-Shavit Anderson Queue Lock flags next TFFFFFFF acquired acquiring getAndIncrement

© Herlihy-Shavit Anderson Queue Lock flags next TFFFFFFF acquired acquiring getAndIncrement

© Herlihy-Shavit acquired Anderson Queue Lock flags next TFFFFFFF acquiring

© Herlihy-Shavit released Anderson Queue Lock flags next TTFFFFFF acquired

© Herlihy-Shavit released Anderson Queue Lock flags next TTFFFFFF acquired Yow!

© Herlihy-Shavit Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,…,false}; AtomicInteger next = new AtomicInteger(0); int[] slot = new int[n];

© Herlihy-Shavit Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,…,false}; AtomicInteger next = new AtomicInteger(0); int[] slot = new int[n]; One flag per thread

© Herlihy-Shavit Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,…,false}; AtomicInteger next = new AtomicInteger(0); int[] slot = new int[n]; Next flag to use

© Herlihy-Shavit Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,…,false}; AtomicInteger next = new AtomicInteger(0); ThreadLocal mySlot; Thread-local variable

© Herlihy-Shavit Anderson Queue Lock public lock() { int mySlot = next.getAndIncrement(); while (!flags[mySlot % n]) {}; flags[mySlot % n] = false; } public unlock() { flags[(mySlot+1) % n] = true; }

© Herlihy-Shavit Anderson Queue Lock public lock() { int mySlot = next.getAndIncrement(); while (!flags[mySlot % n]) {}; flags[mySlot % n] = false; } public unlock() { flags[(mySlot+1) % n] = true; } Take next slot

© Herlihy-Shavit Anderson Queue Lock public lock() { int mySlot = next.getAndIncrement(); while (!flags[mySlot % n]) {}; flags[mySlot % n] = false; } public unlock() { flags[(mySlot+1) % n] = true; } Spin until told to go

© Herlihy-Shavit Anderson Queue Lock public lock() { int mySlot = next.getAndIncrement(); while (!flags[mySlot % n]) {}; flags[mySlot % n] = false; } public unlock() { flags[(mySlot+1) % n] = true; } Prepare slot for re-use

© Herlihy-Shavit Anderson Queue Lock public lock() { int mySlot = next.getAndIncrement(); while (!flags[mySlot % n]) {}; flags[mySlot % n] = false; } public unlock() { flags[(mySlot+1) % n] = true; } Tell next thread to go
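Assembled into one runnable class (the constructor and ThreadLocal bookkeeping are ours; we also use AtomicBoolean entries where the slides use a plain boolean[], since in Java plain array writes are not guaranteed visible across threads):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class ALock {
    private final AtomicBoolean[] flags;                      // one flag per slot; true = "go"
    private final AtomicInteger next = new AtomicInteger(0);  // next slot to hand out
    private final ThreadLocal<Integer> mySlot = new ThreadLocal<>();
    private final int size;

    public ALock(int capacity) {  // capacity = upper bound on number of threads
        size = capacity;
        flags = new AtomicBoolean[capacity];
        for (int i = 0; i < capacity; i++)
            flags[i] = new AtomicBoolean(i == 0);  // slot 0 starts as "go"
    }
    public void lock() {
        int slot = next.getAndIncrement() % size;  // take next slot
        mySlot.set(slot);
        while (!flags[slot].get()) {}              // spin until told to go
        flags[slot].set(false);                    // prepare slot for re-use
    }
    public void unlock() {
        flags[(mySlot.get() + 1) % size].set(true);  // tell next thread to go
    }
}

public class ALockDemo {
    static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        ALock lock = new ALock(8);  // bound must cover all contending threads
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 5_000; i++) {
                    lock.lock();
                    try { counter++; } finally { lock.unlock(); }
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(counter);  // 4 threads x 5,000 increments
    }
}
```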

© Herlihy-Shavit Performance Curve is practically flat Scalable performance FIFO fairness queue TTAS

© Herlihy-Shavit Anderson Queue Lock Good –First truly scalable lock –Simple, easy to implement Bad –Space hog –One bit per thread Unknown number of threads? Small number of actual contenders?

© Herlihy-Shavit CLH Lock FIFO order Small, constant-size overhead per thread

© Herlihy-Shavit Initially false tail idle

© Herlihy-Shavit Initially false tail idle Queue tail

© Herlihy-Shavit Initially false tail idle Lock is free

© Herlihy-Shavit Initially false tail idle

© Herlihy-Shavit Purple Wants the Lock false tail acquiring

© Herlihy-Shavit Purple Wants the Lock false tail acquiring true

© Herlihy-Shavit Purple Wants the Lock false tail acquiring true Swap

© Herlihy-Shavit Purple Has the Lock false tail acquired true

© Herlihy-Shavit Red Wants the Lock false tail acquired acquiring true

© Herlihy-Shavit Red Wants the Lock false tail acquired acquiring true Swap true

© Herlihy-Shavit Red Wants the Lock false tail acquired acquiring true

© Herlihy-Shavit Red Wants the Lock false tail acquired acquiring true

© Herlihy-Shavit Red Wants the Lock false tail acquired acquiring true Actually, it spins on cached copy

© Herlihy-Shavit Purple Releases false tail release acquiring false true false Bingo!

© Herlihy-Shavit Purple Releases tail released acquired true

© Herlihy-Shavit Space Usage Let –L = number of locks –N = number of threads ALock –O(LN) CLH lock –O(L+N)

© Herlihy-Shavit CLH Queue Lock class Qnode { AtomicBoolean locked = new AtomicBoolean(true); }

© Herlihy-Shavit CLH Queue Lock class Qnode { AtomicBoolean locked = new AtomicBoolean(true); } Not released yet

© Herlihy-Shavit CLH Queue Lock class CLHLock implements Lock { AtomicReference tail; ThreadLocal myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} (3)

© Herlihy-Shavit CLH Queue Lock class CLHLock implements Lock { AtomicReference tail; ThreadLocal myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} (3) Tail of the queue

© Herlihy-Shavit CLH Queue Lock class CLHLock implements Lock { AtomicReference tail; ThreadLocal myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} (3) Thread-local Qnode

© Herlihy-Shavit CLH Queue Lock class CLHLock implements Lock { AtomicReference tail; ThreadLocal myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} (3) Swap in my node

© Herlihy-Shavit CLH Queue Lock class CLHLock implements Lock { AtomicReference tail; ThreadLocal myNode = new Qnode(); public void lock() { Qnode pred = tail.getAndSet(myNode); while (pred.locked) {} }} (3) Spin until predecessor releases lock

© Herlihy-Shavit CLH Queue Lock class CLHLock implements Lock { … public void unlock() { myNode.locked.set(false); myNode = pred; } (3)

© Herlihy-Shavit CLH Queue Lock class CLHLock implements Lock { … public void unlock() { myNode.locked.set(false); myNode = pred; } (3) Notify successor

© Herlihy-Shavit CLH Queue Lock class CLHLock implements Lock { … public void unlock() { myNode.locked.set(false); myNode = pred; } (3) Recycle predecessor’s node
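The slide code elides some bookkeeping (saving pred in a thread-local, initializing the tail with a released node). A complete sketch that fills those gaps, with a small driver of our own:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

class CLHLock {
    static class QNode {
        final AtomicBoolean locked = new AtomicBoolean(true);  // not released yet
    }
    private final AtomicReference<QNode> tail;
    private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);
    private final ThreadLocal<QNode> myPred = new ThreadLocal<>();

    public CLHLock() {
        QNode dummy = new QNode();
        dummy.locked.set(false);            // initial node: lock is free
        tail = new AtomicReference<>(dummy);
    }
    public void lock() {
        QNode node = myNode.get();
        node.locked.set(true);              // not released yet
        QNode pred = tail.getAndSet(node);  // swap my node into the tail
        myPred.set(pred);                   // remember pred for unlock
        while (pred.locked.get()) {}        // spin until predecessor releases
    }
    public void unlock() {
        myNode.get().locked.set(false);     // notify successor
        myNode.set(myPred.get());           // recycle predecessor's node
    }
}

public class CLHDemo {
    static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        CLHLock lock = new CLHLock();
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 5_000; i++) {
                    lock.lock();
                    try { counter++; } finally { lock.unlock(); }
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(counter);  // 4 threads x 5,000 increments
    }
}
```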

© Herlihy-Shavit CLH Lock Good –Lock release affects predecessor only –Small, constant-sized space Bad –Doesn’t work for uncached NUMA architectures

© Herlihy-Shavit NUMA Architectures Acronym: –Non-Uniform Memory Architecture Illusion: –Flat shared memory Truth: –No caches (sometimes) –Some memory regions faster than others

© Herlihy-Shavit NUMA Machines Spinning on local memory is fast

© Herlihy-Shavit NUMA Machines Spinning on remote memory is slow

© Herlihy-Shavit CLH Lock Each thread spins on its predecessor’s memory Could be far away …

© Herlihy-Shavit MCS Lock FIFO order Spin on local memory only Small, Constant-size overhead

© Herlihy-Shavit Initially false tail false idle

© Herlihy-Shavit Acquiring false queue false true acquiring (allocate Qnode)

© Herlihy-Shavit Acquiring false tail false true acquired swap

© Herlihy-Shavit Acquiring false tail false true acquired

© Herlihy-Shavit Acquired false tail false true acquired

© Herlihy-Shavit Acquiring tail true acquired acquiring true swap

© Herlihy-Shavit Acquiring tail acquired acquiring true

© Herlihy-Shavit Acquiring tail acquired acquiring true

© Herlihy-Shavit Acquiring tail acquired acquiring true

© Herlihy-Shavit Acquiring tail acquired acquiring true false

© Herlihy-Shavit Acquiring tail acquired acquiring true Yes! false

© Herlihy-Shavit MCS Queue Lock class Qnode { boolean locked = false; qnode next = null; }

© Herlihy-Shavit MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} (3)

© Herlihy-Shavit MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} (3) Make a QNode

© Herlihy-Shavit MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} (3) add my Node to the tail of queue

© Herlihy-Shavit MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} (3) Fix if queue was non-empty

© Herlihy-Shavit MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} (3) Wait until unlocked

© Herlihy-Shavit MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void unlock() { if (qnode.next == null) { if (tail.compareAndSet(qnode, null)) return; while (qnode.next == null) {} } qnode.next.locked = false; }} (3)

© Herlihy-Shavit MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void unlock() { if (qnode.next == null) { if (tail.compareAndSet(qnode, null)) return; while (qnode.next == null) {} } qnode.next.locked = false; }} (3) Missing successor?

© Herlihy-Shavit MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void unlock() { if (qnode.next == null) { if (tail.compareAndSet(qnode, null)) return; while (qnode.next == null) {} } qnode.next.locked = false; }} (3) If really no successor, return

© Herlihy-Shavit MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void unlock() { if (qnode.next == null) { if (tail.compareAndSet(qnode, null)) return; while (qnode.next == null) {} } qnode.next.locked = false; }} (3) Otherwise wait for successor to catch up

© Herlihy-Shavit MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void unlock() { if (qnode.next == null) { if (tail.compareAndSet(qnode, null)) return; while (qnode.next == null) {} } qnode.next.locked = false; }} (3) Pass lock to successor

© Herlihy-Shavit Purple Release false releasing swap false (2)

© Herlihy-Shavit Purple Release false releasing swap false By looking at the queue, I see another thread is active (2)

© Herlihy-Shavit Purple Release false releasing swap false By looking at the queue, I see another thread is active I have to wait for that thread to finish (2)

© Herlihy-Shavit Purple Release false releasing spinning true

© Herlihy-Shavit Purple Release false releasing spinning true false

© Herlihy-Shavit Purple Release false releasing spinning true locked false
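Collecting the MCS slides into one runnable sketch (we write tail.compareAndSet where the slides abbreviate CAS, and use volatile fields for cross-thread visibility; the driver is ours):

```java
import java.util.concurrent.atomic.AtomicReference;

class MCSLock {
    static class QNode {
        volatile boolean locked = false;
        volatile QNode next = null;
    }
    private final AtomicReference<QNode> tail = new AtomicReference<>(null);
    private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);

    public void lock() {
        QNode qnode = myNode.get();
        QNode pred = tail.getAndSet(qnode);  // enqueue my node at the tail
        if (pred != null) {                  // queue was non-empty
            qnode.locked = true;
            pred.next = qnode;               // link behind predecessor
            while (qnode.locked) {}          // spin on my OWN node (local memory)
        }
    }
    public void unlock() {
        QNode qnode = myNode.get();
        if (qnode.next == null) {                         // missing successor?
            if (tail.compareAndSet(qnode, null)) return;  // really none: lock free
            while (qnode.next == null) {}                 // wait for successor to link in
        }
        qnode.next.locked = false;  // pass lock to successor
        qnode.next = null;          // reset node for re-use
    }
}

public class MCSDemo {
    static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        MCSLock lock = new MCSLock();
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 5_000; i++) {
                    lock.lock();
                    try { counter++; } finally { lock.unlock(); }
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(counter);  // 4 threads x 5,000 increments
    }
}
```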

© Herlihy-Shavit Abortable Locks What if you want to give up waiting for a lock? For example –Timeout –Database transaction aborted by user

© Herlihy-Shavit Back-off Lock Aborting is trivial –Just return from lock() call Extra benefit: –No cleaning up –Wait-free –Immediate return

© Herlihy-Shavit Queue Locks Can’t just quit –Thread in line behind will starve Need a graceful way out

© Herlihy-Shavit Queue Locks (animation) Normal hand-off: each thread spins on its predecessor's flag; when the holder sets its flag from locked to false, its successor sees the change and enters.

© Herlihy-Shavit Queue Locks (animation) If a waiting thread just quits, its successor is left spinning forever on the abandoned node.

© Herlihy-Shavit Abortable CLH Lock When a thread gives up –Removing its node in a wait-free way is hard Idea: –Let the successor deal with it

© Herlihy-Shavit Initially tail idle Pointer to predecessor (or null) A

© Herlihy-Shavit Initially tail idle Distinguished available node means lock is free A

© Herlihy-Shavit Acquiring tail acquiring A

© Herlihy-Shavit Acquiring acquiring A Null predecessor means lock not released or aborted

© Herlihy-Shavit Acquiring acquiring A Swap

© Herlihy-Shavit Acquiring acquiring A

© Herlihy-Shavit Acquired locked A Pointer to AVAILABLE means lock is free.

© Herlihy-Shavit Normal Case spinning locked Null means lock is not free & request not aborted

© Herlihy-Shavit One Thread Aborts spinning Timed out locked

© Herlihy-Shavit Successor Notices spinning Timed out locked Non-Null means predecessor aborted

© Herlihy-Shavit Recycle Predecessor’s Node spinning locked

© Herlihy-Shavit Spin on Earlier Node spinning locked

© Herlihy-Shavit Spin on Earlier Node spinning released A The lock is now mine

© Herlihy-Shavit Time-out Lock
public class TOLock implements Lock {
  static Qnode AVAILABLE = new Qnode();   // (3) Distinguished node to signify free lock
  AtomicReference<Qnode> tail;            // (3) Tail of the queue
  ThreadLocal<Qnode> myNode;              // (3) Remember my node …

© Herlihy-Shavit Time-out Lock
public boolean lock(long timeout) {
  Qnode qnode = new Qnode();              // (3) Create & initialize node
  myNode.set(qnode);
  qnode.prev = null;
  Qnode pred = tail.getAndSet(qnode);     // (3) Swap with tail
  if (pred == null || pred.prev == AVAILABLE) {
    return true;                          // (3) If predecessor absent or released, we are done
  }
  …

© Herlihy-Shavit Time-out Lock
…
long start = now();
while (now() - start < timeout) {         // (3) Keep trying for a while …
  Qnode predPred = pred.prev;             // (3) Spin on predecessor's prev field
  if (predPred == AVAILABLE) {
    return true;                          // (3) Predecessor released lock
  } else if (predPred != null) {
    pred = predPred;                      // (3) Predecessor aborted, advance one
  }
}
…

© Herlihy-Shavit Time-out Lock
…
if (!tail.compareAndSet(qnode, pred))     // (3) Try to put predecessor back
  qnode.prev = pred;                      // (3) Otherwise, if it's too late, redirect successor to predecessor
return false;
}

© Herlihy-Shavit Time-out Lock
public void unlock() {
  Qnode qnode = myNode.get();
  if (!tail.compareAndSet(qnode, null))   // (3) Clean up if no one waiting
    qnode.prev = AVAILABLE;               // (3) Notify successor by pointing to AVAILABLE node
}
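The fragments on the preceding slides can be assembled into one self-contained sketch. The Qnode definition and the use of System.currentTimeMillis() for now() are assumptions supplied here; the lock and unlock logic follows the slides.

```java
import java.util.concurrent.atomic.AtomicReference;

// Consolidated TOLock: an abortable CLH-style queue lock where a timed-out
// thread leaves its node behind and its successor follows prev pointers.
class TOLock {
  static class Qnode { volatile Qnode prev = null; }
  static final Qnode AVAILABLE = new Qnode();   // distinguished "lock free" node
  private final AtomicReference<Qnode> tail = new AtomicReference<>(null);
  private final ThreadLocal<Qnode> myNode = new ThreadLocal<>();

  public boolean lock(long timeout) {
    Qnode qnode = new Qnode();                  // fresh node per attempt
    myNode.set(qnode);
    qnode.prev = null;
    Qnode pred = tail.getAndSet(qnode);         // swap with tail
    if (pred == null || pred.prev == AVAILABLE)
      return true;                              // no predecessor, or it released
    long start = System.currentTimeMillis();
    while (System.currentTimeMillis() - start < timeout) {
      Qnode predPred = pred.prev;               // spin on predecessor's prev field
      if (predPred == AVAILABLE)
        return true;                            // predecessor released the lock
      else if (predPred != null)
        pred = predPred;                        // predecessor aborted: advance
    }
    // Timed out: try to remove our node from the tail; if we are too late,
    // point our node at the predecessor so our successor can skip past us.
    if (!tail.compareAndSet(qnode, pred))
      qnode.prev = pred;
    return false;
  }

  public void unlock() {
    Qnode qnode = myNode.get();
    if (!tail.compareAndSet(qnode, null))       // clean up if no one is waiting
      qnode.prev = AVAILABLE;                   // else notify successor
  }
}
```

Note the trade-off versus CLH: nodes of aborted threads are reclaimed lazily by successors, and an aborted attempt allocates a fresh node rather than recycling one.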

© Herlihy-Shavit One Lock To Rule Them All? TTAS+Backoff, CLH, MCS, TOLock… Each is better than the others in some way There is no single best solution The lock we pick really depends on: – the application – the hardware – which properties are important