CS 252 Graduate Computer Architecture Lecture 11: Multiprocessors-II Krste Asanovic Electrical Engineering and Computer Sciences University of California,

1 CS 252 Graduate Computer Architecture Lecture 11: Multiprocessors-II Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs252

2 10/16/2007 2 Recap: Sequential Consistency, A Memory Model
“A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program” (Leslie Lamport)
Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs
[Figure: processors P sharing a single memory M]

3 Recap: Sequential Consistency
Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies. What are these in our example?
T1:
  Store (X), 1     (X = 1)
  Store (Y), 11    (Y = 11)
T2:
  Load R1, (Y)
  Store (Y’), R1   (Y’ = Y)
  Load R2, (X)
  Store (X’), R2   (X’ = X)
[Figure: arrows mark the uniprocessor program dependencies and the additional SC requirements]
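The SC definition can be checked by brute force on this example. The sketch below enumerates every interleaving of T1 and T2 that preserves each thread's program order and collects the resulting (Y’, X’) pairs. The initial values X = 0 and Y = 10 are taken from the write-back cache slide later in the lecture; the helper names (`interleavings`, `run`) are illustrative, not part of the lecture.

```python
from itertools import combinations

# Each operation is (kind, destination, source); register names are strings.
T1 = [("store", "X", 1), ("store", "Y", 11)]
T2 = [("load", "R1", "Y"), ("store", "Y'", "R1"),
      ("load", "R2", "X"), ("store", "X'", "R2")]

def interleavings(a, b):
    """Yield every merge of a and b that preserves each thread's program order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if k in slots else next(ib) for k in range(n)]

def run(schedule):
    """Execute one interleaving against a single shared memory."""
    mem = {"X": 0, "Y": 10, "X'": None, "Y'": None}  # assumed initial values
    regs = {}
    for kind, dst, src in schedule:
        if kind == "load":
            regs[dst] = mem[src]
        else:
            # A string source names a register; anything else is an immediate.
            mem[dst] = regs[src] if isinstance(src, str) else src
    return mem["Y'"], mem["X'"]

sc_outcomes = {run(s) for s in interleavings(T1, T2)}
print(sorted(sc_outcomes))
```

Under these assumptions the pair (Y’, X’) = (11, 0) never appears: seeing the new Y implies the earlier store to X is also visible.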

4 Recap: Mutual Exclusion and Locks
Want to guarantee only one process is active in a critical section
Blocking atomic read-modify-write instructions, e.g., Test&Set, Fetch&Add, Swap
vs
Non-blocking atomic read-modify-write instructions, e.g., Compare&Swap, Load-reserve/Store-conditional
vs
Protocols based on ordinary Loads and Stores

5 Issues in Implementing Sequential Consistency
Implementation of SC is complicated by two issues:
Out-of-order execution capability
  Access pair            Reordering allowed?
  Load(a); Load(b)       yes
  Load(a); Store(b)      yes if a ≠ b
  Store(a); Load(b)      yes if a ≠ b
  Store(a); Store(b)     yes if a ≠ b
Caches
  Caches can prevent the effect of a store from being seen by other processors
[Figure: processors P sharing a single memory M]
SC complications motivate architects to consider weak or relaxed memory models

6 Memory Fences: Instructions to sequentialize memory accesses
Processors with relaxed or weak memory models (i.e., permit Loads and Stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses
Examples of processors with relaxed memory models:
  Sparc V8 (TSO, PSO): Membar
  Sparc V9 (RMO): Membar #LoadLoad, Membar #LoadStore, Membar #StoreLoad, Membar #StoreStore
  PowerPC (WO): Sync, EIEIO
Memory fences are expensive operations; however, one pays the cost of serialization only when it is required

7 Using Memory Fences
Producer posting Item x:
  Load Rtail, (tail)
  Store (Rtail), x
  Membar SS                  ensures that tail ptr is not updated before x has been stored
  Rtail = Rtail + 1
  Store (tail), Rtail
Consumer:
  Load Rhead, (head)
  spin: Load Rtail, (tail)
  if Rhead == Rtail goto spin
  Membar LL                  ensures that R is not loaded before x has been stored
  Load R, (Rhead)
  Rhead = Rhead + 1
  Store (head), Rhead
  process(R)
[Figure: queue between Producer and Consumer with tail and head pointers; registers Rtail, Rhead, R]
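To see why both fences matter, here is a deliberately small model, not a description of any real processor. It assumes that, without a fence, a thread's two memory operations may take effect in either order, and it enumerates every interleaving of the producer's two stores with the consumer's two loads. The names `orders`, `interleavings`, and `consumed_values`, and the placeholder values "x" and "stale", are invented for this sketch.

```python
from itertools import combinations, permutations

PRODUCER = [("store", "buf", "x"), ("store", "tail", 1)]  # Store (Rtail), x ; Store (tail), Rtail
CONSUMER = [("load", "tail"), ("load", "buf")]            # Load Rtail, (tail) ; Load R, (Rhead)

def orders(ops, fenced):
    """Program order only when a fence separates the ops, otherwise any order."""
    return [tuple(ops)] if fenced else list(permutations(ops))

def interleavings(a, b):
    """Yield every merge of a and b that keeps each sequence's own order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if k in slots else next(ib) for k in range(n)]

def consumed_values(membar_ss, membar_ll):
    """Values the consumer can read from the slot after it sees the new tail."""
    seen = set()
    for p in orders(PRODUCER, membar_ss):
        for c in orders(CONSUMER, membar_ll):
            for schedule in interleavings(p, c):
                mem = {"buf": "stale", "tail": 0}
                loaded = {}
                for op in schedule:
                    if op[0] == "store":
                        mem[op[1]] = op[2]
                    else:
                        loaded[op[1]] = mem[op[1]]
                if loaded["tail"] != 0:   # head is 0, so the queue looks non-empty
                    seen.add(loaded["buf"])
    return seen

print(consumed_values(True, True))    # both fences present
print(consumed_values(False, True))   # Membar SS removed
print(consumed_values(True, False))   # Membar LL removed
```

In this model the consumer can only ever process "x" when both fences are present; dropping either one lets it process the stale slot contents.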

8 Data-Race Free Programs (a.k.a. Properly Synchronized Programs)
Process 1
  ...
  Acquire(mutex);
  (critical section)
  Release(mutex);
Process 2
  ...
  Acquire(mutex);
  (critical section)
  Release(mutex);
Synchronization variables (e.g. mutex) are disjoint from data variables
Accesses to writable shared data variables are protected in critical regions
  no data races except for locks (Formal definition is elusive)
In general, it cannot be proven if a program is data-race free.

9 Fences in Data-Race Free Programs
Process 1
  ...
  Acquire(mutex);
  membar;
  (critical section)
  membar;
  Release(mutex);
Process 2
  ...
  Acquire(mutex);
  membar;
  (critical section)
  membar;
  Release(mutex);
Relaxed memory model allows reordering of instructions by the compiler or the processor as long as the reordering is not done across a fence
The processor also should not speculate or prefetch across fences

10 Mutual Exclusion Using Load/Store
A protocol based on two shared variables c1 and c2. Initially, both c1 and c2 are 0 (not busy)
Process 1
  ...
  c1=1;
  L: if c2=1 then go to L
  (critical section)
  c1=0;
Process 2
  ...
  c2=1;
  L: if c1=1 then go to L
  (critical section)
  c2=0;
What is wrong?

11 Mutual Exclusion: second attempt
To avoid deadlock, let a process give up the reservation (i.e. Process 1 sets c1 to 0) while waiting.
Process 1
  ...
  L: c1=1;
  if c2=1 then { c1=0; go to L }
  (critical section)
  c1=0
Process 2
  ...
  L: c2=1;
  if c1=1 then { c2=0; go to L }
  (critical section)
  c2=0
What can go wrong now?

12 A Protocol for Mutual Exclusion (T. Dekker, 1966)
A protocol based on 3 shared variables c1, c2 and turn. Initially, both c1 and c2 are 0 (not busy)
Process 1
  ...
  c1=1;
  turn = 1;
  L: if c2=1 & turn=1 then go to L
  (critical section)
  c1=0;
Process 2
  ...
  c2=1;
  turn = 2;
  L: if c1=1 & turn=2 then go to L
  (critical section)
  c2=0;
turn = i ensures that only process i can wait
variables c1 and c2 ensure mutual exclusion
Solution for n processes was given by Dijkstra and is quite tricky!

13 Analysis of Dekker’s Algorithm
Process 1
  ...
  c1=1;
  turn = 1;
  L: if c2=1 & turn=1 then go to L
  (critical section)
  c1=0;
Process 2
  ...
  c2=1;
  turn = 2;
  L: if c1=1 & turn=2 then go to L
  (critical section)
  c2=0;
[Figure: Scenario 1 and Scenario 2, each showing the code above for both processes]
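The two-process protocols on these slides are small enough to check exhaustively. The sketch below assumes sequential consistency and treats each statement, including the whole test at label L, as one atomic step; both are simplifying assumptions. It explores every reachable state and reports whether both processes can be in the critical section at once and whether a state exists in which neither process can make progress. Passing `use_turn=False` drops the turn variable, which gives the first attempt with only c1 and c2. The function names and the state encoding are invented for this sketch.

```python
def step(state, i, use_turn):
    """Successor state when process i (1 or 2) executes one atomic step."""
    pc, c, turn = list(state[0]), list(state[1]), state[2]
    me, other = i - 1, 2 - i
    if pc[me] == 0:                       # c_i = 1
        c[me], pc[me] = 1, 1
    elif pc[me] == 1:                     # turn = i (a no-op without turn)
        if use_turn:
            turn = i
        pc[me] = 2
    elif pc[me] == 2:                     # L: if c_j = 1 & turn = i then go to L
        waiting = c[other] == 1 and (turn == i or not use_turn)
        if not waiting:
            pc[me] = 3
    else:                                 # pc 3: critical section, then c_i = 0
        c[me], pc[me] = 0, 0
    return (tuple(pc), tuple(c), turn)

def check(use_turn):
    """Return (mutual exclusion violated?, stuck state reachable?)."""
    start = ((0, 0), (0, 0), 1)           # initial value of turn is arbitrary
    seen, frontier = {start}, [start]
    violation = stuck = False
    while frontier:
        s = frontier.pop()
        if s[0] == (3, 3):                # both inside the critical section
            violation = True
        succs = [step(s, 1, use_turn), step(s, 2, use_turn)]
        if all(t == s for t in succs):    # both processes spin forever
            stuck = True
        for t in succs:
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return violation, stuck

print("with turn:   ", check(True))
print("without turn:", check(False))
```

In this model the three-variable protocol yields no violation and no stuck state, while the two-variable first attempt keeps mutual exclusion but can deadlock with both processes spinning at L.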

14 N-process Mutual Exclusion: Lamport’s Bakery Algorithm
Initially num[j] = 0, for all j
Process i
Entry Code:
  choosing[i] = 1;
  num[i] = max(num[0], …, num[N-1]) + 1;
  choosing[i] = 0;
  for(j = 0; j < N; j++) {
    while( choosing[j] );
    while( num[j] && ( ( num[j] < num[i] ) || ( num[j] == num[i] && j < i ) ) );
  }
Exit Code:
  num[i] = 0;
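One way to experiment with the bakery algorithm without real threads is a seeded random scheduler over Python generators. This is a sketch under stated assumptions: each `yield` marks a point where the scheduler may switch to another process, everything between two yields is treated as atomic, and memory is sequentially consistent. The function name `simulate_bakery` and its parameters are invented for illustration.

```python
import random

def simulate_bakery(n=3, rounds=4, seed=0):
    """Run n bakery processes under a random scheduler and return the peak
    number of processes observed inside the critical section."""
    choosing, num = [0] * n, [0] * n
    stats = {"inside": 0, "peak": 0}

    def process(i):
        for _ in range(rounds):
            # Entry code
            choosing[i] = 1
            yield
            highest = 0
            for j in range(n):            # read num[0..n-1] one element at a time
                highest = max(highest, num[j])
                yield
            num[i] = highest + 1
            yield
            choosing[i] = 0
            yield
            for j in range(n):
                while choosing[j]:
                    yield
                # Wait while j holds a smaller ticket (ties broken by index).
                while num[j] and (num[j], j) < (num[i], i):
                    yield
            # Critical section
            stats["inside"] += 1
            stats["peak"] = max(stats["peak"], stats["inside"])
            yield
            stats["inside"] -= 1
            # Exit code
            num[i] = 0
            yield

    rng = random.Random(seed)
    live = [process(i) for i in range(n)]
    while live:
        p = rng.choice(live)              # pick the next process to run one step
        try:
            next(p)
        except StopIteration:
            live.remove(p)
    return stats["peak"]

print([simulate_bakery(seed=s) for s in range(5)])
```

A peak of 1 for a given seed means that schedule never had two processes in the critical section together. Random schedules only sample the interleavings, so this is evidence rather than proof.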

15 CS252 Administrivia
Project meetings next week (10/23-25), same schedule as before (M 1-3PM, Tu/Th 9:40-11AM)
  – Schedule on website
  – All in 645 Soda, 20mins/group
Hope to see:
  – Project web site
  – At least one initial result (some delta from “hello world”)
  – Grasp of related work
Midterm review

21 EECS Graduate Grading Guidelines
A+, A, A-: Quality expected from PhD student
B+, B: Quality expected from MS student, not PhD
B- or below: Less than the quality expected from MS student
Class average somewhere in range 3.2 - 3.6
http://www.eecs.berkeley.edu/Policies/grad.grading.shtml

22 Memory Consistency in SMPs
[Figure: CPU-1 with cache-1 and CPU-2 with cache-2 on the CPU-Memory bus; cache-1, cache-2, and memory each hold A = 100]
Suppose CPU-1 updates A to 200.
  write-back: memory and cache-2 have stale values
  write-through: cache-2 has a stale value
Do these stale values matter?
What is the view of shared memory for programming?

23 Write-back Caches & SC
prog T1:  ST X, 1;  ST Y, 11
prog T2:  LD Y, R1;  ST Y’, R1;  LD X, R2;  ST X’, R2
Step                          cache-1      memory                     cache-2
T1 is executed                X=1, Y=11    X=0, Y=10, X’= , Y’=       Y= , Y’= , X= , X’=
cache-1 writes back Y         X=1, Y=11    X=0, Y=11, X’= , Y’=       Y= , Y’= , X= , X’=
T2 executed                   X=1, Y=11    X=0, Y=11, X’= , Y’=       Y=11, Y’=11, X=0, X’=0
cache-1 writes back X         X=1, Y=11    X=1, Y=11, X’= , Y’=       Y=11, Y’=11, X=0, X’=0
cache-2 writes back X’ & Y’   X=1, Y=11    X=1, Y=11, X’=0, Y’=11     Y=11, Y’=11, X=0, X’=0
The final memory state (Y’ = 11 but X’ = 0) is inconsistent with SC
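The write-back scenario can be replayed step by step. The sketch below models each cache as a private dictionary with no coherence protocol at all, which is the situation the slide describes; the helper names `load`, `store`, and `write_back` are invented for this illustration.

```python
memory = {"X": 0, "Y": 10}
cache1, cache2 = {}, {}          # write-back caches with no coherence protocol

def load(cache, addr):
    if addr not in cache:        # miss: fetch the (possibly stale) memory copy
        cache[addr] = memory[addr]
    return cache[addr]

def store(cache, addr, value):
    cache[addr] = value          # write-back: memory is not updated yet

def write_back(cache, addr):
    memory[addr] = cache[addr]

# T1 is executed
store(cache1, "X", 1)
store(cache1, "Y", 11)
write_back(cache1, "Y")          # cache-1 writes back Y
# T2 is executed
store(cache2, "Y'", load(cache2, "Y"))
store(cache2, "X'", load(cache2, "X"))
write_back(cache1, "X")          # cache-1 writes back X
write_back(cache2, "X'")         # cache-2 writes back X' & Y'
write_back(cache2, "Y'")
print(memory)
```

Memory ends with Y’ = 11 and X’ = 0: T2 saw the new Y but the old X, an outcome that no sequentially consistent interleaving of T1 and T2 can produce.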

24 Write-through Caches & SC
prog T1:  ST X, 1;  ST Y, 11
prog T2:  LD Y, R1;  ST Y’, R1;  LD X, R2;  ST X’, R2
Step           cache-1      memory                     cache-2
Initially      X=0, Y=10    X=0, Y=10, X’= , Y’=       Y= , Y’= , X=0, X’=
T1 executed    X=1, Y=11    X=1, Y=11, X’= , Y’=       Y= , Y’= , X=0, X’=
T2 executed    X=1, Y=11    X=1, Y=11, X’=0, Y’=11     Y=11, Y’=11, X=0, X’=0
Write-through caches don’t preserve sequential consistency either

25 Maintaining Sequential Consistency
SC is sufficient for correct producer-consumer and mutual exclusion code (e.g., Dekker)
Multiple copies of a location in various caches can cause SC to break down.
Hardware support is required such that
  only one processor at a time has write permission for a location
  no processor can load a stale copy of the location after a write
⇒ cache coherence protocols

26 Cache Coherence Protocols for SC
write request: the address is invalidated (updated) in all other caches before (after) the write is performed
read request: if a dirty copy is found in some cache, a write-back is performed before the memory is read
We will focus on Invalidation protocols as opposed to Update protocols

27 Warmup: Parallel I/O
(DMA stands for Direct Memory Access)
Either Cache or DMA can be the Bus Master and effect transfers
Page transfers occur while the Processor is running
[Figure: Proc. and Cache connected by Address (A), Data (D), and R/W lines to the Memory Bus, which also connects Physical Memory and the DMA controller for the DISK]

28 Problems with Parallel I/O
Memory → Disk: Physical memory may be stale if Cache copy is dirty
Disk → Memory: Cache may hold stale data and not see memory writes
[Figure: DMA transfers between DISK and Physical Memory over the Memory Bus while the Cache holds cached portions of the page]

29 Snoopy Cache (Goodman 1983)
Idea: Have cache watch (or snoop upon) DMA transfers, and then “do the right thing”
Snoopy cache tags are dual-ported
[Figure: Cache holding Tags and State plus Data (lines); the A, D, R/W port is used to drive the Memory Bus when the Cache is Bus Master; a snoopy read port (A, R/W) is attached to the Memory Bus]

30 Snoopy Cache Actions for DMA
Observed Bus Cycle            Cache State            Cache Action
DMA Read (Memory → Disk)      Address not cached
                              Cached, unmodified
                              Cached, modified
DMA Write (Disk → Memory)     Address not cached
                              Cached, unmodified
                              Cached, modified

31 Shared Memory Multiprocessor
Use snoopy mechanism to keep all processors’ view of memory coherent
[Figure: processors M1, M2, M3, each with a Snoopy Cache on the Memory Bus, together with Physical Memory and a DMA controller for the DISKS]

32 Cache State Transition Diagram: The MSI protocol
Each cache line has a tag: address tag plus state bits
  M: Modified
  S: Shared
  I: Invalid
Transitions (cache state in processor P1):
  I → M   Write miss
  I → S   Read miss
  S → M   P1 intent to write
  S → S   Read by any processor
  S → I   Other processor intent to write
  M → M   P1 reads or writes
  M → S   Other processor reads, P1 writes back
  M → I   Other processor intent to write

33 Two Processor Example (Reading and writing the same cache line)
Transitions for P1’s cache (P2’s cache is symmetric, with P1 and P2 exchanged):
  I → M   Write miss
  I → S   Read miss
  S → M   P1 intent to write
  S → I   P2 intent to write
  M → M   P1 reads or writes
  M → S   P2 reads, P1 writes back
  M → I   P2 intent to write
Sequence of accesses: P1 reads, P1 writes, P2 reads, P2 writes, P1 writes, P2 writes, P1 reads, P1 writes
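The access sequence on this slide can be traced with a minimal MSI model for a single cache line. This is a sketch, not the lecture's implementation: the class `MSICache`, the shared `bus` list, and the method names are invented here, and write-backs are only noted in comments rather than modelled.

```python
class MSICache:
    """State of one cache line under MSI, snooping a shared bus."""
    def __init__(self, name, bus):
        self.name, self.state = name, "I"
        self.bus = bus
        bus.append(self)

    def snoop(self, intent):
        """React to another cache's bus request for the line."""
        if intent == "write":
            self.state = "I"      # M or S -> I (an M line is written back first)
        elif self.state == "M":
            self.state = "S"      # other processor reads: write back, M -> S

    def _broadcast(self, intent):
        for cache in self.bus:
            if cache is not self:
                cache.snoop(intent)

    def read(self):
        if self.state == "I":     # read miss: I -> S
            self._broadcast("read")
            self.state = "S"

    def write(self):
        if self.state != "M":     # write miss or intent to write: -> M
            self._broadcast("write")
            self.state = "M"

bus = []
p1, p2 = MSICache("P1", bus), MSICache("P2", bus)
trace = []
for cache, op in [(p1, "read"), (p1, "write"), (p2, "read"), (p2, "write"),
                  (p1, "write"), (p2, "write"), (p1, "read"), (p1, "write")]:
    getattr(cache, op)()
    trace.append((p1.state, p2.state))
print(trace)
```

Each entry of `trace` is the pair (P1 state, P2 state) after one access. In this model the line is never in M in one cache while the other cache holds a copy.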

34 Observation
If a line is in the M state then no other cache can have a copy of the line!
  – Memory stays coherent, multiple differing copies cannot exist
[Figure: the MSI state transition diagram from slide 32]

35 MESI: An Enhanced MSI protocol
increased performance for private data
Each cache line has a tag: address tag plus state bits
  M: Modified Exclusive
  E: Exclusive, unmodified
  S: Shared
  I: Invalid
Transitions (cache state in processor P1):
  I → M   Write miss
  I → S   Read miss, shared
  I → E   Read miss, not shared
  E → E   P1 read
  E → M   P1 write
  S → M   P1 intent to write
  S → S   Read by any processor
  M → M   P1 write or read
  M → S   Other processor reads, P1 writes back
  to I    Other processor intent to write
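The benefit for private data can be illustrated by counting bus transactions for one processor's accesses to one line. This is a rough sketch under simplifying assumptions: no other processor touches the line during the sequence, and every miss or upgrade costs exactly one bus transaction. The function `bus_transactions` and its parameters are hypothetical names.

```python
def bus_transactions(protocol, ops, others_have_copy=False):
    """Count bus transactions issued by one cache for a sequence of accesses."""
    state, count = "I", 0
    for op in ops:
        if op == "read" and state == "I":
            count += 1                    # read miss
            if protocol == "MSI" or others_have_copy:
                state = "S"
            else:
                state = "E"               # MESI: read miss, not shared
        elif op == "write":
            if state in ("I", "S"):
                count += 1                # write miss or intent to write
            state = "M"                   # E -> M needs no bus transaction
    return count

print(bus_transactions("MSI", ["read", "write"]))
print(bus_transactions("MESI", ["read", "write"]))
print(bus_transactions("MESI", ["read", "write"], others_have_copy=True))
```

For a read followed by a write to private data, MSI pays for the read miss and then for the upgrade from S to M, whereas MESI pays only for the read miss because the E → M transition is silent.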

36 Optimized Snoop with Level-2 Caches
Processors often have two-level caches: small L1, large L2 (usually both on chip now)
Inclusion property: entries in L1 must be in L2
  invalidation in L2 ⇒ invalidation in L1
Snooping on L2 does not affect CPU-L1 bandwidth
What problem could occur?
[Figure: four CPUs, each with an L1 $ and an L2 $; a Snooper sits at each L2]

37 Intervention
[Figure: CPU-1 with cache-1 (A = 200) and CPU-2 with cache-2 on the CPU-Memory bus; memory holds stale data (A = 100)]
When a read-miss for A occurs in cache-2, a read request for A is placed on the bus
  Cache-1 needs to supply & change its state to shared
  The memory may respond to the request also!
Does memory know it has stale data?
Cache-1 needs to intervene through memory controller to supply correct data to cache-2

38 False Sharing
[Figure: cache block layout: state | blk addr | data0 | data1 | ... | dataN]
A cache block contains more than one word
Cache-coherence is done at the block-level and not word-level
Suppose M1 writes word i and M2 writes word k and both words have the same block address. What can happen?
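Whether two words collide in one block is a matter of address arithmetic. The sketch below assumes 64-byte blocks and 4-byte words; both sizes and the example addresses are arbitrary choices for illustration.

```python
BLOCK_SIZE = 64   # bytes per cache block (assumed)
WORD_SIZE = 4     # bytes per word (assumed)

def block_address(byte_addr):
    """Block address = byte address with the block-offset bits dropped."""
    return byte_addr // BLOCK_SIZE

def falsely_shared(addr_a, addr_b):
    """Different words, same block: coherence traffic without true sharing."""
    return addr_a != addr_b and block_address(addr_a) == block_address(addr_b)

word_i = 0x1000 + 2 * WORD_SIZE     # written only by M1
word_k = 0x1000 + 9 * WORD_SIZE     # written only by M2
print(falsely_shared(word_i, word_k))                # True: both lie in block 0x40
print(falsely_shared(word_i, word_i + BLOCK_SIZE))   # False: a full block apart
```

Because coherence is tracked per block, each write by M1 to word i invalidates M2's copy of the block holding word k, and vice versa, even though the processors never touch each other's data. Padding the two words into separate blocks avoids the collision.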

39 Synchronization and Caches: Performance Issues
Processor 1, Processor 2, and Processor 3 each run:
  R ← 1
  L: swap (mutex), R;
     if R then goto L;
     (critical section)
     M[mutex] ← 0;
[Figure: three processors with caches on the CPU-Memory Bus; one cache holds mutex=1]
Cache-coherence protocols will cause mutex to ping-pong between P1’s and P2’s caches.
Ping-ponging can be reduced by first reading the mutex location (non-atomically) and executing a swap only if it is found to be zero.

40 Performance Related to Bus Occupancy
In general, a read-modify-write instruction requires two memory (bus) operations without intervening memory operations by other processors
In a multiprocessor setting, bus needs to be locked for the entire duration of the atomic read and write operation
  expensive for simple buses
  very expensive for split-transaction buses
modern ISAs use load-reserve/store-conditional

41 Load-reserve & Store-conditional
Special register(s) to hold reservation flag and address, and the outcome of store-conditional
Load-reserve R, (a):
  (flag, adr) ← (1, a);
  R ← M[a];
Store-conditional (a), R:
  if (flag, adr) == (1, a)
  then cancel other procs’ reservation on a;
       M[a] ← R;
       status ← succeed;
  else status ← fail;
If the snooper sees a store transaction to the address in the reserve register, the reserve bit is set to 0
Several processors may reserve ‘a’ simultaneously
These instructions are like ordinary loads and stores with respect to the bus traffic
Can implement reservation by using cache hit/miss, no additional hardware required (problems?)
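The reservation semantics can be mimicked in a few lines. This is a software sketch of the behaviour described on the slide, not a hardware model: the class `Machine`, its methods, and `atomic_increment` are invented names, untouched memory is assumed to read as 0, and a successful store-conditional here clears the storing processor's own reservation as well as the others'.

```python
class Machine:
    """Shared memory plus one reservation register (flag, adr) per processor."""
    def __init__(self, n_procs):
        self.mem = {}
        self.reservation = [(0, None)] * n_procs

    def load_reserve(self, proc, addr):
        self.reservation[proc] = (1, addr)
        return self.mem.get(addr, 0)

    def store_conditional(self, proc, addr, value):
        if self.reservation[proc] != (1, addr):
            return False                       # status <- fail
        # Cancel every reservation on addr (assumption: including our own).
        self.reservation = [(0, None) if r == (1, addr) else r
                            for r in self.reservation]
        self.mem[addr] = value
        return True                            # status <- succeed

def atomic_increment(machine, proc, addr):
    """Retry loop: an atomic read-modify-write built from the two halves."""
    while True:
        old = machine.load_reserve(proc, addr)
        if machine.store_conditional(proc, addr, old + 1):
            return old

m = Machine(2)
a = 0x100
r0 = m.load_reserve(0, a)                      # both processors reserve 'a'
r1 = m.load_reserve(1, a)
first = m.store_conditional(0, a, r0 + 1)
second = m.store_conditional(1, a, r1 + 1)     # its reservation was cancelled
print(first, second, m.mem[a])                 # True False 1
```

Processor 1's failed store-conditional is what prevents a lost update: it must reload the new value and try again, which is exactly what `atomic_increment` does.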

42 Performance: Load-reserve & Store-conditional
The total number of memory (bus) transactions is not necessarily reduced, but splitting an atomic instruction into load-reserve & store-conditional:
  increases bus utilization (and reduces processor stall time), especially in split-transaction buses
  reduces cache ping-pong effect because processors trying to acquire a semaphore do not have to perform a store each time

43 Out-of-Order Loads/Stores & CC
Blocking caches: one request at a time + CC ⇒ SC
Non-blocking caches: multiple requests (different addresses) concurrently + CC ⇒ relaxed memory models
CC ensures that all processors observe the same order of loads and stores to an address
[Figure: CPU with load/store buffers, Cache (I/S/E) with snooper, and Memory; messages across the CPU/Memory Interface include S-req, E-req, S-rep, E-rep, Wb-req, Inv-req, Inv-rep, and pushout (Wb-rep)]

