Multiprocessors – Locks

Oracle SPARC M7: 32 cores, 64 MB L3 cache (8 x 8 MB), 1.6 TB/s. 256 KB 4-way set-associative L2 I-cache, 0.5 TB/s per cluster. Two cores share a 256 KB, 8-way set-associative L2 D-cache, 0.5 TB/s.

Outline
  Locks: issues
  Test and Set
  Cached Locks
  Test and Test and Set
  Load Linked, Store Conditional

Implementing Locks
Processes must be synchronized so that they access a shared variable one at a time, inside a critical section. This property is called mutual exclusion.
Mutex lock: a synchronization primitive with two operations.
  AcquireLock(L): done before the critical section of code; returns when it is safe for the process to enter the critical section.
  ReleaseLock(L): done after the critical section; allows another process to acquire the lock.

Implementing Locks
    int L = 0;

    AcquireLock(L):
        while (L == 1) ;    /* BUSY WAITING */
        L = 1;

    ReleaseLock(L):
        L = 0;

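Problem in Implementing Locks
AcquireLock(L) is not a single instruction. It becomes a separate load, test, and store:
    AcquireLock(L):
        while (L == 1) ;
        L = 1;

    wait: LW   R1, Addr(L)    # load the lock value
          BNEZ R1, wait       # lock is held: keep waiting
          ADDI R1, R1, 1      # R1 = 1
          SW   R1, Addr(L)    # store 1 into the lock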

Problem in Implementing Locks
Initially L=0. P1 and P2 are in contention to acquire the lock.

    Process 1                        Process 2
    LW   R1, Addr(L)   # reads 0
    (context switch)
                                     LW   R1, Addr(L)   # reads 0
                                     BNEZ R1, wait
                                     ADDI R1, R1, 1
                                     SW   R1, Addr(L)
                                     # Critical Section #
                                     (context switch)
    BNEZ R1, wait      # R1 is still 0
    ADDI R1, R1, 1
    SW   R1, Addr(L)
    # Critical Section #

Both P1 and P2 are executing in the critical section.
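The interleaving above can be replayed deterministically. The following is a minimal single-threaded sketch, not part of the original slides: the `Process` struct and the helpers `load_lock`, `test_and_store`, and `replay_race` are hypothetical names that model each process's private R1 and the shared variable L.

```c
#include <stdbool.h>

/* Hypothetical model of one process: its private register R1 and a flag
   recording whether it has entered the critical section. */
typedef struct {
    int r1;
    bool in_critical_section;
} Process;

static int L = 0;   /* shared lock variable: 0 = free, 1 = held */

/* LW R1, Addr(L) */
static void load_lock(Process *p) { p->r1 = L; }

/* BNEZ / ADDI / SW: returns true if the process got past the test. */
static bool test_and_store(Process *p) {
    if (p->r1 != 0) return false;   /* BNEZ R1, wait: lock looked held */
    p->r1 += 1;                     /* ADDI R1, R1, 1 */
    L = p->r1;                      /* SW R1, Addr(L) */
    p->in_critical_section = true;
    return true;
}

/* Replays the schedule from the slide and returns how many processes
   ended up inside the critical section. */
int replay_race(void) {
    Process p1 = {0, false}, p2 = {0, false};
    L = 0;
    load_lock(&p1);         /* P1 reads L == 0, then is switched out */
    load_lock(&p2);         /* P2 also reads L == 0 */
    test_and_store(&p2);    /* P2 sets L = 1 and enters */
    test_and_store(&p1);    /* P1 resumes with a stale R1 == 0 and enters too */
    return (int)p1.in_critical_section + (int)p2.in_critical_section;
}
```

With this schedule `replay_race()` reports two processes in the critical section, because P1 tests a value it loaded before P2's store.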

Atomic Exchange
Hardware support for lock implementation: an atomic read-modify-write (RMW) instruction.
Atomic exchange: swap contents between a register and memory.
Test&Set: takes one memory operand and one register operand.
    Test&Set Lock:
        tmp = Lock
        Lock = 1
        return tmp
Test&Set occurs atomically (indivisibly).
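In portable C, the Test&Set semantics above can be approximated with the C11 `atomic_exchange` function. This is a sketch under that assumption rather than the hardware instruction itself; `test_and_set` and `tas_demo` are illustrative names.

```c
#include <stdatomic.h>

/* Returns the previous value of *lock and leaves *lock == 1,
   as one indivisible read-modify-write. */
int test_and_set(atomic_int *lock) {
    return atomic_exchange(lock, 1);
}

/* Calls test_and_set twice on a free lock and packs both results
   into one integer: first * 10 + second. */
int tas_demo(void) {
    atomic_int lock = 0;
    int first  = test_and_set(&lock);   /* lock was free: old value 0 */
    int second = test_and_set(&lock);   /* lock now held: old value 1 */
    return first * 10 + second;
}
```

The first call observes 0 (the lock was free and is now taken); the second observes 1 (the lock was already held).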

Lock Implementation
    lock: Test&Set R1, L      # R1 = old value of L; L = 1
          BNZ  R1, lock       # old value was 1: lock held, retry
          # Critical Section #
          SW   R0, L          # release: store 0 into L
The atomic read-modify-write hardware primitive facilitates synchronization implementations (locks, barriers, etc.).
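A minimal sketch of the same spin lock in C11, assuming `atomic_flag_test_and_set_explicit` as a stand-in for the Test&Set instruction and POSIX threads for the contending processes. The names `acquire_lock`, `release_lock`, `worker`, and `run_tas_counter` are hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;  /* clear = free, set = held */
static long shared_counter = 0;                   /* protected by lock_flag */

static void acquire_lock(void) {
    /* Test&Set loop: retry while the previous value was "set". */
    while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire))
        ;  /* busy wait */
}

static void release_lock(void) {
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}

static void *worker(void *arg) {
    long iterations = *(long *)arg;
    for (long k = 0; k < iterations; k++) {
        acquire_lock();
        shared_counter++;          /* critical section */
        release_lock();
    }
    return NULL;
}

/* Hypothetical driver: four threads each increment the counter
   `iterations` times under the lock; returns the final count. */
long run_tas_counter(long iterations) {
    enum { NUM_THREADS = 4 };
    pthread_t threads[NUM_THREADS];
    shared_counter = 0;
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, &iterations);
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    return shared_counter;
}
```

Because every increment happens inside the lock, no update is lost and the final count equals the number of threads times `iterations`.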

Test&Set Implementation
    Test&Set R1, L:
        t  = L;     # Store a copy of Lock
        L  = 1;     # Set Lock
        R1 ← t;     # Write original value of the Lock in R1
2 stores (t, L) and 1 load (R1), executed atomically (practically 1 instruction).

Test&Set Example
Threads T0, T1, T2 run on processors P0, P1, P2, which reach main memory through an interconnect. Initially L: 0 in main memory and R: 1 in P0.
1. T0 executes T&S (tmp ← L; L ← 1; R ← tmp). T0 observes R: 0 and realizes the lock was free before its T&S. T0 has acquired the lock; main memory now holds L: 1.
2. T1 executes T&S (tmp ← L; L ← 1; R ← tmp) while T0 holds the lock. T1 observes R: 1 and realizes the lock was occupied before its T&S. T1 retries T&S till the lock is free.
All threads retry T&S till the lock is free.

Test&Set Timeline
T0's T&S execution: L goes from 0 to 1 and T0 gets R: 0, so T0 enters the CS.
T1's T&S executions while T0 is in the CS: each one finds L: 1, leaves L: 1, and returns R: 1, so T1 executes T&S again, and so on.

Lock Performance Issues
Spin lock: a process may loop on the read-modify-write indefinitely until it succeeds.
Atomicity ensures that the process is not switched out, so other processes do not progress.
If the lock is in memory, the repeated attempts generate heavy memory traffic.
Solution: move lock variables from memory to caches.

Improvements on T&S
  Cached locks
  Test and Test and Set
  LL-SC

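Caching Locks
Locks can be cached. The atomic exchange happens between the register file (RF) and the local copy in the cache. Coherence ensures that a lock update is seen by other processors.
(Figure: P1 and P2 with caches C1 and C2 on an interconnect to main memory; the atomic exchange is between register R and the cached copy of L.)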

Caching Locks: Atomic Exchange, Coherence Events
Setup: P1, P2, ..., Pn each hold the lock block in the S state.
  Read: read the latest value of the lock variable (coherence events if the lock is in the I state).
  Modify: modify the lock variable (coherence events).
  Write: write the lock value to the register.

Caching Locks – Example
Threads T0, T1, ..., Tn run on P1, P2, ..., Pn. Cache states of the lock block are given as (P1, P2, Pn).

T0 executes T&S. States: (S, S, S).
  Read: lock already in shared state. No coherence traffic.
  Modify: write-invalidate message (Inv L) to all caches having a copy of the lock. States: (M, I, I).
  Write: write lock value to register.

T1 executes T&S. States: (M, I, I).
  Read: lock in Invalid state. Read miss (Read Miss L). Cache block updated. States: (S, S, I).
  Modify: write-invalidate message (Inv L). States: (I, M, I).
  Write: write lock value to register.
  T0 might be in the CS all this while.

T0 executes ReleaseLock. States: (I, M, I).
  T0 sends Inv L. States: (M, I, I).

T1 and Tn execute T&S. States: (M, I, I).
  Read: locks in Invalid state. 2 read misses (bus vs. interconnection networks). States: (S, S, S). P2 and Pn contain L=0.
  Modify: write-invalidate messages by P2 and Pn (contention in buses, delay in interconnection networks).

Contention in buses: P2 and Pn both intend to send write invalidates.
  P2 wins bus arbitration and sends Inv L. Pn loses bus arbitration and tries later. Pn invalidates its lock copy. P2 updates L: L=1.
  Pn wins bus arbitration and sends Inv L. P2 updates its cache block in MM and other caches: M → I. Pn gets an updated cache block. Pn modifies the lock variable: I → M.

Caching Locks – Messages
Lock release: 1 write invalidate.
i threads execute T&S:
  Read misses: i. Responses: 1. All threads have a copy in the S state.
  1 write invalidate by the arbitration winner: S → I in i-1 copies, S → M in 1 cache. 1 T&S completes (lock acquired).
  i-1 T&S yet to complete: i-1 invalidates. For each one, S → I in i-1 copies and S → M in 1 cache.
  Each cache sends the same lock value. Every thread sees that the lock has been acquired.

Caching Locks
If two processes share a lock variable, T&S generates huge amounts of coherence traffic. Each cache sends the same lock value: after the first T&S, every thread fails to acquire the lock.
Example: P1 holds the lock. Pi and Pj are in an infinite T&S loop, and the lock block in their caches (C1, C2) keeps switching between the I and M states.
Observation: once L=1 and all copies are in the S state, each thread can test its local copy and decide not to execute T&S.

Coherence Traffic for a Lock: Test and Test and Set
    lockloop: test R1, Lock       # read the lock from the local cache
              bnz  R1, lockloop   # inner loop: spin while the lock is held
              t&s  R1, Lock       # lock looked free: try to acquire it
              bnz  R1, lockloop   # T&S lost the race: back to the inner loop
              # Critical Section #
              st   Lock, #0       # release
Threads T0, T1, T2 run on P0, P1, P2.
1. T0 holds the lock. All three caches hold Lock = 1 in the S state. Contending threads stay in lockloop as long as the lock has been acquired.
2. T0 releases the lock: Write Invalidate Lock. P0: S → M. P1 and P2: S → I.
3. T1 tests the lock: Read Miss Lock. P0: M → S. P1: I → S. T1 exits the inner loop.
4. T1 tests-and-sets the lock: Write Miss Lock. P0: S → I. P1: S → M. The atomic read-modify-write completes in P1's cache.

Test and Test and Set – Messages
i threads execute T&T&S in the inner loop. Assume Lock = 1.
None of the threads exit the inner loop while the lock is set, so there is no coherence traffic.
Compare with T&S, where every iteration of the T&S loop generates coherence traffic.

Test and Test and Set – Messages
i threads do Test and T&S, each in its inner loop (L=1).
Lock release (L=0): i copies go S → I.
Next iteration of the inner loop: i read misses. Responses: 1. All threads have a copy in the S state (L=0).
All i threads: the inner-loop Test succeeds.
i threads execute T&S: the sequence of events is identical to the T&S example.
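A minimal C11 sketch of a Test and Test and Set lock, assuming a plain `atomic_load_explicit` for the inner Test and `atomic_exchange_explicit` for the T&S. The names `ttas_acquire`, `ttas_release`, `ttas_worker`, and `run_ttas_counter` are hypothetical, and the thread count is arbitrary.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int ttas_lock = 0;   /* 0 = free, 1 = held */
static long ttas_counter = 0;      /* protected by ttas_lock */

static void ttas_acquire(void) {
    for (;;) {
        /* Test: spin on an ordinary load, which is served from the local
           cached copy while the lock stays held. */
        while (atomic_load_explicit(&ttas_lock, memory_order_relaxed) == 1)
            ;
        /* Test&Set: attempt the atomic exchange only when the lock looked free. */
        if (atomic_exchange_explicit(&ttas_lock, 1, memory_order_acquire) == 0)
            return;    /* old value was 0: lock acquired */
        /* Another thread won the race: go back to the inner loop. */
    }
}

static void ttas_release(void) {
    atomic_store_explicit(&ttas_lock, 0, memory_order_release);
}

static void *ttas_worker(void *arg) {
    long iterations = *(long *)arg;
    for (long k = 0; k < iterations; k++) {
        ttas_acquire();
        ttas_counter++;            /* critical section */
        ttas_release();
    }
    return NULL;
}

/* Hypothetical driver: four threads, returns the final counter value. */
long run_ttas_counter(long iterations) {
    enum { NUM_THREADS = 4 };
    pthread_t threads[NUM_THREADS];
    ttas_counter = 0;
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, ttas_worker, &iterations);
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    return ttas_counter;
}
```

The only difference from a plain T&S lock is the read-only inner loop, which keeps waiting threads from issuing exchanges while the lock is held.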

T&T&S – Room for Improvement
As soon as the lock is released, each waiting thread sends a write invalidate: huge coherence traffic. How can this coherence traffic be reduced?
After the inner loop exits, snoop on the bus to track writes to the lock variable (a write here is an AcquireLock setting the lock variable to 1).
If any process completes a write, do not send a write-invalidate request on the bus.

Load Linked and Store Conditional
T&S is an atomic read-modify-write instruction, so its coherence traffic cannot be avoided. Use LL-SC instead.
LL records the lock variable's address in a table.
Snoop on the bus; record the first write-invalidate request on the lock variable in the table.
Before the thread updates its cache copy, SC checks the table.
If the flag is set, SC fails (no coherence traffic). Failure is indicated by a special value in the register.
If the flag is not set, SC succeeds and the lock is acquired.
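The table-and-flag mechanism described above can be modeled in a few lines. This is a single-threaded software sketch, not a description of real hardware: `LinkEntry`, `load_linked`, `snoop_write`, `store_conditional`, and `llsc_demo` are hypothetical names, and the return convention (1 for success, 0 for failure) is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-processor table entry used by LL and SC. */
typedef struct {
    const int *linked_addr;   /* address recorded by LL, or NULL */
    bool invalidated;         /* set when a snooped write hits linked_addr */
} LinkEntry;

/* LL: record the address, clear the flag, return the current value. */
int load_linked(LinkEntry *e, const int *addr) {
    e->linked_addr = addr;
    e->invalidated = false;
    return *addr;
}

/* Models snooping another processor's write invalidate on the bus. */
void snoop_write(LinkEntry *e, const int *addr) {
    if (e->linked_addr == addr) e->invalidated = true;
}

/* SC: returns 1 and stores on success; returns 0 without writing on failure. */
int store_conditional(LinkEntry *e, int *addr, int value) {
    if (e->linked_addr != addr || e->invalidated) return 0;
    *addr = value;
    e->linked_addr = NULL;
    return 1;
}

/* Two processors race for a free lock; returns ok1 * 10 + ok2. */
int llsc_demo(void) {
    int lock = 0;
    LinkEntry p1 = {NULL, false}, p2 = {NULL, false};
    load_linked(&p1, &lock);                      /* both see lock == 0 */
    load_linked(&p2, &lock);
    int ok1 = store_conditional(&p1, &lock, 1);   /* P1's SC succeeds */
    snoop_write(&p2, &lock);                      /* P2 snoops P1's write */
    int ok2 = store_conditional(&p2, &lock, 1);   /* P2's SC fails, no write */
    return ok1 * 10 + ok2;
}
```

In the demo, the second processor's SC fails purely from its local flag, which is the property the slide relies on: a failed SC needs no bus transaction.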

LL-SC Example
A load/store pair on X requires atomic execution. Replace it with LL/SC:
    LD R2, X                LL R2, X
    ....            =>      ....
    SW R2, X                SC R2, X
                            if (R2 == ) ...
If some other thread has modified X, R2 is filled with a special value indicating failure of the SC.

Spin Lock using LL-SC
Spin lock with lower coherence traffic.
    lockit: LL     R2, 0(R1)     ; no coherence traffic
            BNEZ   R2, lockit    ; not available, keep spinning
            DADDUI R2, R0, #1    ; put value 1 in R2
            SC     R2, 0(R1)     ; store-conditional succeeds if no one
                                 ; updated the lock since the last LL
            BEQZ   R2, lockit    ; confirm that SC succeeded, else keep trying
If there are i processes waiting for the lock, how many bus transactions happen?
1 write by the releaser + i read-miss requests + i responses + 1 write by acquirer + 0 (i-1 failed SCs) + i-1 read-miss requests + i-1 responses.
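Taking the per-step counts listed above at face value, the terms can be added up as follows. The symbol N(i) for the total is introduced here for convenience and is not in the original slides.

```latex
\[
N(i) = \underbrace{1}_{\text{release}}
     + \underbrace{i + i}_{\text{read misses, responses}}
     + \underbrace{1}_{\text{acquire}}
     + \underbrace{0}_{\text{failed SCs}}
     + \underbrace{(i-1) + (i-1)}_{\text{read misses, responses}}
     = 4i
\]
```

For example, with i = 4 waiting processes this gives 16 bus transactions per lock handoff.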

Load Linked and Store Conditional
LL-SC is an implementation of atomic read-modify-write.
LL: record the loaded address in a table. The table updates a flag if any other process has modified the value pointed to by the address.
Perform any number of instructions between LL and SC.
SC: the store succeeds only if the flag in the table is clear, that is, no other process attempted a store since the local LL (success only if the operation was "effectively" atomic).
LL-SC does not generate bus traffic if the SC fails, so it is more efficient than test&test&set.
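C has no direct LL/SC operations. The closest portable analogue is `atomic_compare_exchange_weak`, which the C11 standard allows to fail spuriously so that it can be built on load-linked/store-conditional hardware. The sketch below uses it as a swapped-in technique, not as LL/SC itself; `try_acquire`, `release`, and `cas_demo` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One acquisition attempt: succeeds only if the lock goes from 0 to 1
   with no intervening change, mirroring "SC succeeds only if effectively
   atomic". */
bool try_acquire(atomic_int *lock) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(lock, &expected, 1)) {
        /* On failure, expected holds the value actually observed. */
        if (expected != 0)
            return false;   /* lock really is held: give up */
        /* expected == 0 means the failure was spurious: retry */
    }
    return true;
}

void release(atomic_int *lock) {
    atomic_store(lock, 0);
}

/* Returns a * 100 + b * 10 + c for three successive attempts. */
int cas_demo(void) {
    atomic_int lock = 0;
    int a = try_acquire(&lock);   /* free lock: succeeds */
    int b = try_acquire(&lock);   /* already held: fails */
    release(&lock);
    int c = try_acquire(&lock);   /* free again: succeeds */
    return a * 100 + b * 10 + c;
}
```

A blocking lock would wrap `try_acquire` in a loop; the single-attempt form is used here so that the success and failure cases are easy to see.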

Locks – Summary
  Locks: issues
  Test and Set
  Cached Locks
  Test and Test and Set
  Load Linked, Store Conditional
