Multiprocessors – Locks


1 Multiprocessors – Locks
Oracle SPARC M core, 64MB L3 cache (8 x 8 MB), 1.6TB/s. 256 KB of 4-way SA L2 ICache, 0.5 TB/s per cluster. 2 cores share 256 KB, 8-way SA L2 DCache, 0.5 TB/s.

2 Outline
Locks – Issues
Test and Set
Cached Locks
Test and Test and Set
Load Linked, Store Conditional

3 Implementing Locks
Processes must be synchronized so that they access a shared variable one at a time, inside a critical section. This is called mutual exclusion.
Mutex lock: a synchronization primitive.
AcquireLock(L): done before the critical section of code. Returns when it is safe for the process to enter the critical section.
ReleaseLock(L): done after the critical section. Allows another process to acquire the lock.

4 Implementing Locks
int L = 0;
AcquireLock(L):
    while (L == 1) ;    /* BUSY WAITING */
    L = 1;
ReleaseLock(L):
    L = 0;

5 Problem in Implementing Locks
AcquireLock(L):
    while (L == 1) ;
    L = 1;
The corresponding instruction sequence:
wait: LW   R1, Addr(L)
      BNEZ R1, wait
      ADDI R1, R1, 1
      SW   R1, Addr(L)

6 Problem in Implementing Locks
Initially L=0. P1 and P2 are in contention to acquire the lock.
Process 1: wait: LW   R1, Addr(L)
Context switch
Process 2: wait: LW   R1, Addr(L)
                 BNEZ R1, wait
                 ADDI R1, R1, 1
                 SW   R1, Addr(L)
                 # Critical Section #
Context switch
Process 1:       BNEZ R1, wait
                 ADDI R1, R1, 1
                 SW   R1, Addr(L)
                 # Critical Section #
Both P1 and P2 are executing in the critical section!
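The interleaving above can be replayed deterministically. The following is a minimal sketch in C, not a real scheduler: the `Proc` type and the helper names are hypothetical, and each helper stands in for one instruction of the `wait` loop.

```c
#include <stdbool.h>

/* Hypothetical model: each process has a private copy of register R1. */
typedef struct { int r1; bool in_cs; } Proc;

static int L = 0;  /* shared lock variable, 0 = free */

static void load_lock(Proc *p)       { p->r1 = L; }          /* LW R1, Addr(L) */
static bool lock_busy(const Proc *p) { return p->r1 != 0; }  /* BNEZ R1, wait  */
static void store_lock(Proc *p) {                            /* ADDI + SW      */
    p->r1 += 1;
    L = p->r1;
    p->in_cs = true;   /* the process now believes it owns the lock */
}

/* Replays the schedule from the slide and returns how many
   processes end up inside the critical section. */
int replay_race(void) {
    Proc p1 = {0, false}, p2 = {0, false};
    L = 0;
    load_lock(&p1);                        /* P1 reads L = 0, then is switched out */
    load_lock(&p2);                        /* P2 also reads L = 0 */
    if (!lock_busy(&p2)) store_lock(&p2);  /* P2 sets L = 1 and enters the CS */
    if (!lock_busy(&p1)) store_lock(&p1);  /* P1 resumes with its stale R1 = 0 */
    return (int)p1.in_cs + (int)p2.in_cs;
}
```

In this model the failure comes from the gap between the load and the store: P1 tests a value of L that is already out of date when it resumes.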

7 Atomic Read-Modify-Write (RMW) instruction
Atomic Exchange
Hardware support for lock implementation.
Atomic exchange: swap contents between a register and memory.
Test&Set: takes one memory operand and one register operand.
Test&Set Lock:
    tmp = Lock
    Lock = 1
    return tmp
Test&Set occurs atomically (indivisibly). It is an atomic Read-Modify-Write (RMW) instruction.
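In C11, the Test&Set semantics described above can be approximated with `atomic_exchange` from `<stdatomic.h>`. This is a sketch under that assumption; the function name `test_and_set` is chosen here for illustration, and which hardware instruction the compiler emits depends on the target.

```c
#include <stdatomic.h>

/* Returns the previous value of *lock and leaves *lock == 1.
   The read and the write happen as one indivisible step. */
int test_and_set(atomic_int *lock) {
    return atomic_exchange(lock, 1);
}
```

A return value of 0 means the lock was free and the caller now holds it; a return value of 1 means it was already held.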

8 Lock Implementation
lock: Test&Set R1, L
      BNZ R1, lock
      # Critical Section #
      SW R0, L
[Figure: register R1 and the lock variable L in main memory (MM)]
The atomic read-modify-write hardware primitive facilitates synchronization implementations (locks, barriers, etc.).
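The acquire and release sequence above could look as follows in C11. This is a sketch rather than a production lock: the `SpinLock` type and function names are hypothetical, and the acquire/release memory orders are an assumption about what a critical section needs.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int held; } SpinLock;   /* 0 = free, 1 = held */

void spin_init(SpinLock *l) { atomic_init(&l->held, 0); }

/* One Test&Set attempt: true if the lock was free and is now ours. */
bool spin_try_acquire(SpinLock *l) {
    return atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0;
}

/* lock: Test&Set R1, L ; BNZ R1, lock */
void spin_acquire(SpinLock *l) {
    while (!spin_try_acquire(l)) {
        /* busy wait: every iteration issues an atomic read-modify-write */
    }
}

/* SW R0, L */
void spin_release(SpinLock *l) {
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

Every failed iteration of `spin_acquire` is still a full atomic read-modify-write.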

9 Test&Set Implementation
Test&Set R1, L:
    t = L;     # Store a copy of Lock
    L = 1;     # Set Lock
    R1 ← t;    # Write original value of the Lock in R1
2 stores (t, L) and 1 load (R1) (atomic – practically 1 instruction).
[Figure: register R1 and the lock variable L in main memory (MM)]

10 Test&Set Example
T0 executes T&S.
[Figure: threads T0, T1, T2 on processors P0, P1, P2, connected through the interconnect to main memory. R: 1; main memory L: 0]

11 Test&Set Example
T0 executes T&S: tmp ← L; L ← 1; R ← tmp;
T0 observes R and realizes the lock was free before its T&S. T0 has acquired the lock.
[Figure: T0, T1, T2 on P0, P1, P2; interconnect; R: 0; main memory L: 1]

12 Test&Set Example
T0 has acquired the lock.
T1 executes T&S: tmp ← L; L ← 1; R ← tmp;
T1 observes R and realizes the lock was occupied before its T&S.
T1 retries T&S until the lock is free. All threads retry T&S until the lock is free.
[Figure: T0, T1, T2 on P0, P1, ...; interconnect; main memory L: 1]

13 Test&Set Timeline
[Figure: timeline for P0, P1 and main memory (MM). T0 executes T&S: L goes from 0 to 1 and T0 reads R: 0, so T0 is in the CS. While T0 is in the CS, T1 repeatedly executes T&S; each execution finds L: 1, leaves L: 1 and returns R: 1.]

14 Lock Performance Issues
Spin lock: a process may loop indefinitely on the read-modify-write until it succeeds.
Atomicity ensures that the process is not switched out; other processes do not progress.
If the lock is in memory: heavy traffic.
Solution: move lock variables from memory to caches.

15 Improvements on T&S
Cached locks
Test and Test and Set
LL-SC

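16 Caching Locks
Locks can be cached.
Atomic exchange happens between the register file (RF) and the local copy in the cache.
Coherence ensures that a lock update is seen by other processors.
[Figure: P1 and P2 with caches C1 and C2 on the interconnect to main memory; atomic exchange between register R and lock L in the cache]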

17 Caching Locks
[Figure: processors P1, P2, ..., Pn, each holding the lock block in S state; interconnect; main memory]
Atomic Exchange – Coherence Events
Read: read the latest value of the lock variable (coherence events if the lock is in I state).
Modify: modify the lock variable (coherence events).
Write: write the lock value to the register.

18 Caching Locks – Example
T0 (on P1) executes T&S. Cache states of the lock block: P1 = S, P2 = S, Pn = S.
Read: lock already in shared state. No coherence traffic.

19 Caching Locks – Example
P1 sends Inv L on the interconnect. Cache states: P1 = S, P2 = S, Pn = S.
Modify: write-invalidate message to all caches having a copy of the lock.

20 Caching Locks – Example
Cache states after the invalidate: P1 = M, P2 = I, Pn = I.

21 Caching Locks – Example
Cache states: P1 = M, P2 = I, Pn = I.
Write: write lock value to register.

22 Caching Locks – Example
T1 (on P2) executes T&S. Cache states: P1 = M, P2 = I, Pn = I.

23 Caching Locks – Example
P2 sends Read Miss L on the interconnect.
Read: lock in Invalid state. Read miss.

24 Caching Locks – Example
Cache block updated. Cache states: P1 = S, P2 = S, Pn = I.

25 Caching Locks – Example
P2 sends Inv L on the interconnect.
Modify: write-invalidate message.

26 Caching Locks – Example
Cache states after the invalidate: P1 = I, P2 = M, Pn = I.

27 Caching Locks – Example
Cache states: P1 = I, P2 = M, Pn = I.
Write: write lock value to register.
T0 might be in the CS all this while!
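The state transitions in this example can be traced with a small model. The sketch below assumes a simple MSI write-invalidate protocol with three caches; the `Bus` type, the `ST_*` names and the two counters are hypothetical and only track the read misses and invalidates named on the slides.

```c
enum { NPROC = 3 };
typedef enum { ST_I, ST_S, ST_M } State;   /* MSI state of the lock's cache block */

typedef struct {
    State st[NPROC];   /* per-cache state of the lock block */
    int read_misses;   /* Read Miss L messages */
    int invalidates;   /* Inv L messages */
} Bus;

/* Processor p performs one Test&Set on the cached lock block. */
void tas_coherence(Bus *b, int p) {
    /* Read: a miss only if the local copy is invalid. */
    if (b->st[p] == ST_I) {
        b->read_misses++;
        for (int q = 0; q < NPROC; q++)
            if (b->st[q] == ST_M) b->st[q] = ST_S;  /* owner supplies data, downgrades */
        b->st[p] = ST_S;
    }
    /* Modify: needs exclusive ownership, so invalidate all other copies. */
    if (b->st[p] != ST_M) {
        b->invalidates++;
        for (int q = 0; q < NPROC; q++)
            if (q != p) b->st[q] = ST_I;
        b->st[p] = ST_M;
    }
    /* Write: the old lock value goes to the register; no bus traffic. */
}
```

Starting from all copies in S, a T&S by index 0 (P1) leaves the states at M, I, I after one invalidate. A following T&S by index 1 (P2) costs one read miss and one invalidate and ends in I, M, I, even though it does not acquire the lock.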

28 Caching Locks – Example
T0 (on P1) executes ReleaseLock. Cache states: P1 = I, P2 = M, Pn = I.

29 Caching Locks – Example
T0 executes ReleaseLock. P1 sends Inv L on the interconnect.

30 Caching Locks – Example
T0 executes ReleaseLock. Cache states after the release: P1 = M, P2 = I, Pn = I.

31 Caching Locks – Example
T1 (on P2) and Tn (on Pn) both execute T&S. Cache states: P1 = M, P2 = I, Pn = I.

32 Caching Locks – Example
P2 and Pn send Read Miss L on the interconnect.
Read: lock in Invalid state in both caches. 2 read misses (bus vs. interconnection networks).

33 Caching Locks – Example
Cache states: P1 = S, P2 = S, Pn = S. P2 and Pn contain L=0.

34 Caching Locks – Example
P2 and Pn want to send Inv L.
Modify: write-invalidate messages by P2 and Pn (contention in buses, delay in interconnection networks).

35 Caching Locks – Example
Contention in buses: P2 and Pn intend to send Write Invalidates.
P2 wins bus arbitration. Sends Inv L. Updates L. L=1.
Pn loses bus arbitration. Tries later. Pn invalidates its lock copy.
Pn wins bus arbitration. Sends Inv L.
P2 updates its cache block in MM and other caches. M → I.
Pn gets an updated cache block. Pn modifies the lock variable. I → M.

36 Caching Locks – Messages
[Figure: threads T1, T2, ..., Tn on P1, P2, ..., Pn; interconnect; MM]
Lock release: 1 Write Invalidate.
i threads execute T&S.
Read misses: i. Responses: 1. All threads have a copy in S state.
1 Write Invalidate by the arbitration winner. S → I in i-1 copies. S → M in 1 cache.
1 T&S completes (lock acquired).
i-1 T&S yet to complete: i-1 invalidates. All copies: S → I in i-1 copies. S → M in 1 cache.
Each cache sends the same lock value. Every thread sees that the lock has been acquired.

37 Caching Locks
If two processes share a lock variable, T&S generates huge amounts of coherence traffic.
Each cache sends the same lock value. After the 1st T&S, every thread fails to acquire the lock.
After L=1, and all copies are in S state, each thread can test its local copy and decide not to execute T&S.
[Figure: Pi and Pj each execute T&S; the lock block L alternates between I and M in caches C1 and C2; main memory]
P1 holds the lock. Pi and Pj are in an infinite T&S loop.

38 Coherence Traffic for a Lock
Test and Test and Set
lockloop: test R1, Lock
          bnz  R1, lockloop
          t&s  R1, Lock
          bnz  R1, lockloop    # T&S found the lock already set: spin again
          # Critical Section #
          st   Lock, #0
T0, T1, T2 run on P0, P1, P2. Cache states: P0 = S, P1 = S, P2 = S, each copy holding Lock = 1. Main memory: Lock = 1.
Contending threads stay in lockloop as long as the lock is held.

39 Coherence Traffic for a Lock
T0 releases Lock: Write Invalidate Lock.
Cache states: P0 S → M, P1 S → I, P2 S → I.

40 Coherence Traffic for a Lock
T1 tests Lock: Read Miss Lock. T1 exits the inner loop.
Cache states: P0 M → S, P1 I → S, P2 = I.

41 Coherence Traffic for a Lock
T1 tests-and-sets Lock: Write Miss Lock.
Cache states before the write: P0 = S, P1 = S, P2 = I.

42 Coherence Traffic for a Lock
T1 tests-and-sets Lock: Write Miss Lock.
Cache states after the write miss: P0 = I, P1 = M, P2 = I.

43 Coherence Traffic for a Lock
T1 tests-and-sets Lock: atomic Read-Modify-Write.
Cache states: P0 = I, P1 = M, P2 = I.

44 Test and Test and Set – Messages
i threads execute T&T&S in the inner loop. Assume Lock = 1.
None of the threads exits the inner loop while Lock is set. No coherence traffic.
vs. T&S: every iteration of the T&S loop generates coherence traffic.

45 Test and Test and Set – Messages
i threads do Test and T&S, each in its inner loop (L=1).
Lock release (L=0). i copies: S → I.
Next iteration of the inner loop: i read misses. Responses: 1. All threads have a copy in S state (L=0).
All i threads: inner-loop Test succeeds.
i threads execute T&S: the sequence of events is identical to the T&S example.
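A Test and Test and Set lock can be sketched in C11 as below. The names `TtasLock`, `ttas_acquire` and so on are hypothetical, and the memory orders are an assumption. The inner loop uses plain atomic loads, which a coherent cache can serve from the local copy while the lock stays set.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int held; } TtasLock;   /* 0 = free, 1 = held */

void ttas_init(TtasLock *l) { atomic_init(&l->held, 0); }

/* Single attempt: skips the atomic RMW entirely when the lock looks held. */
bool ttas_try_acquire(TtasLock *l) {
    if (atomic_load_explicit(&l->held, memory_order_relaxed) != 0)
        return false;                                   /* "test" failed */
    return atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0;
}

void ttas_acquire(TtasLock *l) {
    for (;;) {
        /* Inner loop: read-only spinning on the cached copy. */
        while (atomic_load_explicit(&l->held, memory_order_relaxed) != 0) { }
        /* Lock looked free: now issue the Test&Set. */
        if (atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0)
            return;
        /* Another thread won the race; go back to the inner loop. */
    }
}

void ttas_release(TtasLock *l) {
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

The burst described above still occurs at release time, because every waiter that sees the lock free goes on to issue the exchange.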

46 T&T&S – Room for Improvement
As soon as the lock is released, each waiting thread sends a Write Invalidate: huge coherence traffic.
How to reduce this coherence traffic?
After the inner loop exits, snoop on the bus to track writes to the lock variable.
Write: AcquireLock by setting the lock variable to 1.
If any other process completes a write first, do not send a write-invalidate request on the bus.

47 Load Linked and Store Conditional
T&S is an atomic Read-Modify-Write instruction: coherence traffic cannot be avoided.
Use LL-SC.
LL records the lock variable address in a table.
Snoop on the bus; record the first Write Invalidate request on the lock variable in the table.
Before the thread updates its cache copy, SC checks the table.
If the flag is set, SC fails (no coherence traffic). Failure is indicated by a special value in the register.
If the flag is not set, SC succeeds. Lock acquired.

48 LL-SC Example
Atomic execution required!
With ordinary load and store:
    LD R2, X
    ....
    SW R2, X
With LL-SC:
    LL R2, X
    ....
    SC R2, X
    if (R2 == ) ...
Some other thread has modified X: R2 is filled with a special value indicating failure of SC.
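C has no direct LL/SC operations, so the retry pattern can only be approximated. The sketch below uses `atomic_compare_exchange_weak` as a stand-in for SC; the function name `atomic_add_retry` is hypothetical. The analogy is loose: a compare-exchange fails when the value differs from the one read earlier, whereas SC fails when the location was written at all since the LL. Like SC, the weak form is allowed to fail spuriously.

```c
#include <stdatomic.h>

/* Atomically adds delta to *x and returns the value that was replaced. */
int atomic_add_retry(atomic_int *x, int delta) {
    int seen = atomic_load(x);   /* plays the role of LL */
    /* Plays the role of SC: on failure, `seen` is refreshed with the
       current value of *x and the update is attempted again. */
    while (!atomic_compare_exchange_weak(x, &seen, seen + delta)) {
        /* another thread changed *x (or a spurious failure): retry */
    }
    return seen;
}
```

The arbitrary instructions between LL and SC correspond here to computing `seen + delta`.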

49 Spin Lock using LL-SC
Spin lock with lower coherence traffic.
lockit: LL     R2, 0(R1)     ; no coherence traffic
        BNEZ   R2, lockit    ; not available, keep spinning
        DADDUI R2, R0, #1    ; put value 1 in R2
        SC     R2, 0(R1)     ; store-conditional succeeds if no one
                             ; updated the lock since the last LL
        BEQZ   R2, lockit    ; confirm that SC succeeded, else keep trying
If there are i processes waiting for the lock, how many bus transactions happen?
1 write by the releaser + i read-miss requests + i responses + 1 write by acquirer + 0 (i-1 failed SCs) + i-1 read-miss requests + i-1 responses.
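Adding up the terms listed above gives 4i bus transactions per lock hand-off. The short C function below only spells out that arithmetic; the function and variable names are hypothetical, and it assumes i ≥ 1.

```c
/* Bus transactions for one lock hand-off with i waiting processes,
   using the per-term counts from the slide. Assumes i >= 1. */
unsigned llsc_bus_transactions(unsigned i) {
    unsigned release_write  = 1;      /* write by the releaser */
    unsigned read_misses    = i;      /* every waiter re-reads the lock */
    unsigned responses      = i;
    unsigned acquire_write  = 1;      /* the one successful SC */
    unsigned failed_scs     = 0;      /* i-1 failed SCs generate no traffic */
    unsigned reread_misses  = i - 1;  /* losers reload the lock */
    unsigned reread_replies = i - 1;
    return release_write + read_misses + responses + acquire_write
         + failed_scs + reread_misses + reread_replies;   /* equals 4 * i */
}
```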

50 Load Linked and Store Conditional
LL-SC is an implementation of atomic read-modify-write.
LL: record the loaded address in a table. The table updates a flag if any other process has modified the contents of the value pointed to by the address.
Perform any number of instructions.
SC: the store succeeds only if the flag in the table is clear, i.e. no other process attempted a store since the local LL (success only if the operation was "effectively" atomic).
LL-SC does not generate bus traffic if the SC fails.
More efficient than test&test&set.
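A spin lock in the LL-SC style can be approximated in portable C11 by pairing a plain load with a weak compare-exchange. This is a sketch under that assumption, with hypothetical function names. On LL/SC machines compilers commonly build the compare-exchange from an LL/SC pair, but that is target-dependent and not guaranteed by the C standard.

```c
#include <stdatomic.h>

void llsc_style_acquire(atomic_int *lock) {
    for (;;) {
        /* LL + BNEZ: spin on plain loads while the lock is held. */
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0) { }
        /* SC + BEQZ: try to store 1; a failure sends us back to spinning. */
        int expected = 0;
        if (atomic_compare_exchange_weak_explicit(lock, &expected, 1,
                memory_order_acquire, memory_order_relaxed))
            return;
    }
}

void llsc_style_release(atomic_int *lock) {
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

Whether a failed attempt avoids bus traffic, as a failed SC does, depends on how the hardware implements the compare-exchange.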

51 Locks – Summary
Locks – Issues
Test and Set
Cached Locks
Test and Test and Set
Load Linked, Store Conditional

52 References
D. J. Sorin, M. D. Hill, D. A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture, Morgan & Claypool.
Michael L. Scott. Shared Memory Synchronization. Synthesis Lectures on Computer Architecture, Morgan & Claypool.
Rajeev Balasubramonian. CS6810, University of Utah.
Matthew T Jacob. "High Performance Computing", IISc/NPTEL.
Hennessy and Patterson. Computer Architecture: A Quantitative Approach, 5th ed.

