Lecture 7. Multiprocessor and Memory Coherence Prof. Taeweon Suh Computer Science Education Korea University COM515 Advanced Computer Architecture.

1 Lecture 7. Multiprocessor and Memory Coherence Prof. Taeweon Suh Computer Science Education Korea University COM515 Advanced Computer Architecture

2 Korea Univ Memory Hierarchy in a Multiprocessor [Figure: three organizations of processors (P) with private caches ($): bus-based shared memory; fully-connected shared memory (dancehall), with an interconnection network between the caches and memory; distributed shared memory, with a memory at each node and an interconnection network between nodes]

3 Korea Univ Why Cache Coherency? The closest cache level is private. Multiple copies of a cache line can be present across different processor nodes. Local updates (writes) lead to an incoherent state; the problem arises in both write-through and writeback caches. Slide from Prof. H.H. Lee in Georgia Tech

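4 Korea Univ Writeback Cache w/o Coherence [Figure: processors with private caches above a shared memory holding X=100. One processor writes X=505 into its own cache only; reads of X elsewhere ("read?") can still return the stale X=100.] Slide from Prof. H.H. Lee in Georgia Tech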

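5 Korea Univ Writethrough Cache w/o Coherence [Figure: processors with private caches above a shared memory. One processor writes X=505, while another cache still holds X=100; a read of X from that cache ("Read?") can return the stale value.] Slide from Prof. H.H. Lee in Georgia Tech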

6 Korea Univ Definition of Coherence. A multiprocessor memory system is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order. Two properties are implicit in this definition. Write propagation: writes become visible to other processes. Write serialization: all writes to the same location are seen in the same order by all processes. For example, if reads by P1 to a location see the value produced by write w1 (say, from P2) before the value produced by write w2 (say, from P3), then reads by another process P4 (or P2 or P3) must not be able to see w2 before w1. Slide from Prof. H.H. Lee in Georgia Tech

7 Korea Univ Sounds Easy? [Figure: timeline T1 to T3 for processors P0 to P3, starting from A=0, B=0, with the writes A=1 and B=2 propagating to the other processors. One processor sees A's update before B's; another sees B's update before A's.]

8 Korea Univ Cache Coherence Protocols According to Caching Policies. Write-through cache: update-based protocol or invalidation-based protocol. Writeback cache: update-based protocol or invalidation-based protocol.

9 Korea Univ Bus Snooping based on Write-Through Cache. Every write appears as a transaction on the shared bus to memory. Two protocols: update-based and invalidation-based. Slide from Prof. H.H. Lee in Georgia Tech

10 Korea Univ Bus Snooping Update-based Protocol on Write-Through cache 10 P Cache Memory P X= 100 Cache P X= 505 Bus transaction Bus snoop X= 505 X= 100 write Slide from Prof. H.H. Lee in Georgia Tech

11 Korea Univ Bus Snooping Invalidation-based Protocol on Write-Through cache 11 P Cache Memory P X= 100 Cache P X= 505 Bus transaction Bus snoop X= 505 Load X X= 505 write X= 100 Slide from Prof. H.H. Lee in Georgia Tech

12 Korea Univ A Simple Snoopy Coherence Protocol for a WT, No Write-Allocate Cache. States: Invalid, Valid. Edge labels read Observed / Transaction. Processor-initiated transitions: Invalid to Valid on PrRd / BusRd; Invalid stays Invalid on PrWr / BusWr; Valid stays Valid on PrRd / --- and on PrWr / BusWr. Bus-snooper-initiated transition: Valid to Invalid on BusWr / ---. Slide from Prof. H.H. Lee in Georgia Tech
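
The two-state protocol on this slide can be written down as a lookup table. The following is a minimal sketch, not the lecture's own code; the names `TRANSITIONS` and `step` are hypothetical, and the entry for a snooped BusWr in the Invalid state is an assumption (the diagram shows no edge for it).

```python
# Sketch of the Valid/Invalid snoopy protocol for a write-through,
# no-write-allocate cache. Names here are illustrative assumptions.

# (current state, observed event) -> (next state, bus transaction or None)
TRANSITIONS = {
    ("Invalid", "PrRd"):  ("Valid",   "BusRd"),  # read miss fetches the line
    ("Invalid", "PrWr"):  ("Invalid", "BusWr"),  # no write-allocate: write goes to memory only
    ("Valid",   "PrRd"):  ("Valid",   None),     # read hit, no bus traffic
    ("Valid",   "PrWr"):  ("Valid",   "BusWr"),  # write-through: every write goes on the bus
    ("Valid",   "BusWr"): ("Invalid", None),     # snooped write by another cache invalidates
    ("Invalid", "BusWr"): ("Invalid", None),     # assumption: nothing cached, nothing to do
}

def step(state, event):
    """Return (next_state, bus_transaction) for one cache line."""
    return TRANSITIONS[(state, event)]
```

For instance, `step("Valid", "BusWr")` models another processor's write being snooped and drops the local copy.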

13 Korea Univ How about Writeback Cache? WB cache to reduce bandwidth requirement The majority of local writes are hidden behind the processor nodes How to snoop? Write Ordering 13 Slide from Prof. H.H. Lee in Georgia Tech

14 Korea Univ Cache Coherence Protocols for WB Caches A cache has an exclusive copy of a line if  It is the only cache having a valid copy  Memory may or may not have it Modified (dirty) cache line  The cache having the line is the owner of the line, because it must supply the block 14 Slide from Prof. H.H. Lee in Georgia Tech

15 Korea Univ Update-based Protocol on WB Cache 15 P Cache Memory P Cache P Bus transaction X= 100 Store X X= 505 update X= 505 Update data for all processor nodes who share the same data Because a processor node keeps updating the memory location, a lot of traffic will be incurred Slide from Prof. H.H. Lee in Georgia Tech

16 Korea Univ Update-based Protocol on WB Cache 16 P Cache Memory P Cache P Bus transaction X= 505 Load X Hit ! Store X X= 333 update X= 333 Update data for all processor nodes who share the same data Because a processor node keeps updating the memory location, a lot of traffic will be incurred Slide from Prof. H.H. Lee in Georgia Tech

17 Korea Univ Invalidation-based Protocol on WB Cache Invalidate the data copies for the sharing processor nodes Reduced traffic when a processor node keeps updating the same memory location 17 P Cache P P Bus transaction X= 100 Store X invalidate X= 505 Memory Slide from Prof. H.H. Lee in Georgia Tech

18 Korea Univ Invalidation-based Protocol on WB Cache 18 P Cache P P Bus transaction X= 505 Load X Bus snoop Miss ! Snoop hit X= 505 Memory Invalidate the data copies for the sharing processor nodes Reduced traffic when a processor node keeps updating the same memory location Slide from Prof. H.H. Lee in Georgia Tech

19 Korea Univ Invalidation-based Protocol on WB Cache 19 P Cache P P Bus transaction X= 505 Store X Bus snoop X= 505X= 333 Store X X= 987 Store X X= 444 Invalidate the data copies for the sharing processor nodes Reduced traffic when a processor node keeps updating the same memory location Memory Slide from Prof. H.H. Lee in Georgia Tech
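
The traffic argument on these slides can be checked with a small counter. This is a rough sketch under simplifying assumptions: a single line X, every miss, update broadcast, or invalidation costs exactly one bus transaction, and replacements are ignored. The function name `count_bus_traffic` is hypothetical.

```python
def count_bus_traffic(trace, protocol):
    """Count bus transactions for accesses to one line X.

    trace: list of (processor, "R" or "W").
    protocol: "update" or "invalidate".
    """
    holders = set()    # caches that currently hold a valid copy
    exclusive = None   # invalidate protocol: cache holding the only copy
    traffic = 0
    for proc, op in trace:
        if op == "R":
            if proc not in holders:
                traffic += 1          # read miss
                holders.add(proc)
                exclusive = None
        elif protocol == "update":
            if proc not in holders:
                traffic += 1          # write miss fetches the line
                holders.add(proc)
            if len(holders) > 1:
                traffic += 1          # broadcast the new value to the sharers
        else:
            if exclusive != proc:
                traffic += 1          # invalidate the other copies once
            holders = {proc}
            exclusive = proc          # later writes by proc hit locally
    return traffic
```

With the trace P1 reads, P2 reads, then P1 writes three times, this model counts 5 transactions for the update protocol and 3 for the invalidation protocol, which matches the slide's point about a processor that keeps updating the same location.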

20 Korea Univ MSI Writeback Invalidation Protocol Modified  Dirty  Only this cache has a valid copy Shared  Memory is consistent  One or more caches have a valid copy Invalid Writeback protocol: A cache line can be written multiple times before the memory is updated 20 Slide from Prof. H.H. Lee in Georgia Tech

21 Korea Univ MSI Writeback Invalidation Protocol. Two types of request from the processor: PrRd and PrWr. Three types of bus transaction posted by the cache controller: BusRd (a PrRd misses the cache; memory or another cache supplies the line); BusRdX, i.e. BusRd eXclusive or read-to-own (a PrWr is issued to a line that is not in the Modified state); BusWB (writeback due to replacement; the processor is not directly involved in initiating this operation). Slide from Prof. H.H. Lee in Georgia Tech

22 Korea Univ MSI Writeback Invalidation Protocol (Processor Request). Processor-initiated transitions: Invalid to Shared on PrRd / BusRd; Invalid to Modified on PrWr / BusRdX; Shared stays Shared on PrRd / ---; Shared to Modified on PrWr / BusRdX; Modified stays Modified on PrRd / --- and on PrWr / ---. Slide from Prof. H.H. Lee in Georgia Tech

23 Korea Univ MSI Writeback Invalidation Protocol (Bus Transaction). Bus-snooper-initiated transitions: Modified to Shared on BusRd / Flush; Modified to Invalid on BusRdX / Flush; Shared stays Shared on BusRd / ---; Shared to Invalid on BusRdX / ---. Flush puts the data on the bus, and both memory and the requestor grab the copy. The requestor gets the data either by cache-to-cache transfer or from memory. Slide from Prof. H.H. Lee in Georgia Tech

24 Korea Univ MSI Writeback Invalidation Protocol (Bus transaction): Another Possible Implementation. Bus-snooper-initiated transitions are as before (BusRd / ---, BusRd / Flush, BusRdX / Flush, BusRdX / ---), except that on a snooped BusRd a Modified line may flush and go to Invalid instead of Shared (BusRd / Flush), anticipating no more reads from this processor. This is a performance concern: it saves an “invalidation” trip if the requesting cache writes the shared line later. Slide from Prof. H.H. Lee in Georgia Tech

25 Korea Univ MSI Writeback Invalidation Protocol 25 Modified Invalid Shared Bus-snooper-initiated BusRd / --- PrRd / BusRd PrRd / --- PrWr / BusRdX PrWr / --- PrRd / --- PrWr / BusRdX Processor-initiated BusRd / Flush BusRdX / FlushBusRdX / --- Slide from Prof. H.H. Lee in Georgia Tech

26 Korea Univ MSI Example 26 P1 Cache P2P3 Bus Cache MEMORY BusRd Processor Action State in P1 State in P2State in P3Bus TransactionData Supplier S --- BusRdMemory P1 reads X X=10 S Slide from Prof. H.H. Lee in Georgia Tech

27 Korea Univ MSI Example 27 P1 Cache P2P3 Bus Cache MEMORY X=10S Processor Action State in P1 State in P2State in P3Bus TransactionData Supplier S --- BusRdMemory P1 reads X P3 reads X BusRd X=10S S ---SBusRdMemory X=10 Slide from Prof. H.H. Lee in Georgia Tech

28 Korea Univ MSI Example 28 P1 Cache P2P3 Bus Cache MEMORY X=10S Processor Action State in P1 State in P2State in P3Bus TransactionData Supplier S --- BusRdMemory P1 reads X P3 reads X X=10 S S ---SBusRdMemory P3 writes X BusRdX ---I M I MBusRdX X=10 X=-25 Slide from Prof. H.H. Lee in Georgia Tech

29 Korea Univ MSI Example 29 P1 Cache P2P3 Bus Cache MEMORY Processor Action State in P1 State in P2State in P3Bus TransactionData Supplier S --- BusRdMemory P1 reads X P3 reads X X=-25M S ---SBusRdMemory P3 writes X ---I I MBusRdX P1 reads X BusRd X=-25 S S S ---SBusRdP3 Cache X=10X=-25 Slide from Prof. H.H. Lee in Georgia Tech

30 Korea Univ MSI Example 30 P1 Cache P2P3 Bus Cache MEMORY Processor Action State in P1 State in P2State in P3Bus TransactionData Supplier S --- BusRdMemory P1 reads X P3 reads X X=-25M S ---SBusRdMemory P3 writes X I ---MBusRdX P1 reads X X=-25S S S ---SBusRdP3 Cache X=10X=-25 P2 reads X BusRd X=-25S S SSBusRdMemory Slide from Prof. H.H. Lee in Georgia Tech
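
The table built up over the MSI example slides can be replayed with a few lines of code. This is a sketch for a single line X only; the function name `access` is hypothetical, the slides' "---" (line not present) is modeled as "I", and the data supplier is reported as `None` when the writer already holds the line in S (the slide leaves that cell blank).

```python
def access(states, proc, op):
    """Apply one processor read ("R") or write ("W") to line X.

    states: dict mapping processor name -> "M", "S" or "I"; updated in place.
    Returns (bus_transaction, data_supplier); both are None on a hit.
    """
    mine = states[proc]
    if op == "R":
        if mine in ("M", "S"):
            return None, None                  # PrRd hit
        supplier = "Memory"
        for other, st in states.items():
            if other != proc and st == "M":
                states[other] = "S"            # owner flushes and keeps a shared copy
                supplier = other + " Cache"
        states[proc] = "S"
        return "BusRd", supplier
    if mine == "M":
        return None, None                      # PrWr hit in Modified
    supplier = None if mine == "S" else "Memory"
    for other, st in states.items():
        if other != proc:
            if st == "M":
                supplier = other + " Cache"    # dirty owner flushes the line
            states[other] = "I"                # BusRdX invalidates all other copies
    states[proc] = "M"
    return "BusRdX", supplier
```

Replaying the slides' sequence (P1 reads, P3 reads, P3 writes, P1 reads, P2 reads) from an all-Invalid start gives BusRd from memory twice, a BusRdX that leaves P3 in M, a BusRd supplied by P3's cache, and a final BusRd from memory with all three caches in S.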

31 Korea Univ MESI Writeback Invalidation Protocol. The goal is to reduce two types of unnecessary bus transaction: a BusRdX that converts a block from S to M even though this cache is the only one holding the block, and a BusRd that brings the line in S state when there are no other sharers (which leads to the overhead above). MESI introduces the Exclusive state: a cache can write to an Exclusive copy without generating a BusRdX. Illinois Protocol: proposed by Papamarcos and Patel in 1984. Employed in Intel, PowerPC, MIPS. Slide from Prof. H.H. Lee in Georgia Tech

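32 Korea Univ MESI Writeback Invalidation Protocol, Processor Request (Illinois Protocol). S denotes the shared signal. Processor-initiated transitions: Invalid to Exclusive on PrRd / BusRd (not-S); Invalid to Shared on PrRd / BusRd (S); Invalid to Modified on PrWr / BusRdX; Exclusive stays Exclusive on PrRd / ---; Exclusive to Modified on PrWr / ---; Shared stays Shared on PrRd / ---; Shared to Modified on PrWr / BusRdX; Modified stays Modified on PrRd, PrWr / ---. Slide from Prof. H.H. Lee in Georgia Tech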

33 Korea Univ MESI Writeback Invalidation Protocol, Bus Transactions (Illinois Protocol). States: Invalid, Exclusive, Modified, Shared. Bus-snooper-initiated edge labels: BusRd / Flush; BusRdX / Flush; BusRd / Flush*; BusRdX / Flush*; BusRd / Flush (or ---); BusRdX / ---. Flush*: flush for the data supplier; no action for other sharers. Whenever possible, the Illinois protocol performs a $-to-$ transfer rather than having memory supply the data. It uses a selection algorithm if there are multiple suppliers (alternative: add an O state or force a memory update). Most MESI implementations simply write to memory. Slide from Prof. H.H. Lee in Georgia Tech

34 Korea Univ MESI Writeback Invalidation Protocol (Illinois Protocol) 34 Invalid ExclusiveModified Shared Bus-snooper-initiated BusRd / Flush BusRdX / Flush BusRd / Flush* BusRdX / Flush* BusRdX / --- PrRd / BusRd (not-S) PrWr / --- Processor-initiated PrRd / --- PrRd, PrWr / --- PrRd / --- PrWr / BusRdX S: Shared Signal PrWr / BusRdX BusRd / Flush (or ---) Flush*: Flush for data supplier; no action for other sharers Slide from Prof. H.H. Lee in Georgia Tech
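
The processor-initiated half of the MESI diagram is small enough to write as a function. This is a sketch only: the name `mesi_processor` and the `shared_signal` parameter are illustrative assumptions, and the snooper-initiated half is left out on purpose.

```python
def mesi_processor(state, op, shared_signal=False):
    """Return (next_state, bus_transaction) for a processor request.

    state: "M", "E", "S" or "I"; op: "PrRd" or "PrWr".
    shared_signal: whether another cache asserted the shared line (S)
    during the BusRd; only consulted on a read miss.
    """
    if op == "PrRd":
        if state == "I":
            # A read miss lands in S if someone else has the line, else in E.
            return ("S" if shared_signal else "E"), "BusRd"
        return state, None                     # read hit in M, E or S
    if op == "PrWr":
        if state in ("I", "S"):
            return "M", "BusRdX"               # must gain exclusive ownership first
        return "M", None                       # E -> M silently; M stays M
    raise ValueError("unknown request: " + op)
```

The Exclusive-to-Modified case is the one that saves a bus transaction compared with MSI: a read miss with no sharers followed by a write costs one BusRd and nothing more.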

35 Korea Univ MOESI Protocol 35 Add one additional state ─ Owner state Similar to Shared state The O state processor will be responsible for supplying data (copy in memory may be stale) Employed by  Sun UltraSparc  AMD Opteron In dual-core Opteron, cache-to-cache transfer is done through a system request interface (SRI) running at full CPU speed CPU0 L2 CPU1 L2 System Request Interface Crossbar Hyper- Transport Mem Controller Slide from Prof. H.H. Lee in Georgia Tech

36 Korea Univ Implication on Multi-Level Caches. How to guarantee coherence in a multi-level cache hierarchy? Snoop all cache levels? Intel’s 8870 chipset has a “snoop filter” for quad-core. Alternatively, maintain the inclusion property: ensure that data in the inner level is also present in the outer level, and snoop only the outermost level (e.g. L2). L2 then needs to know when L1 has write hits: either use a write-through L1 cache, or use a write-back L1 and maintain an additional “modified-but-stale” bit in L2. Slide from Prof. H.H. Lee in Georgia Tech

37 Korea Univ Inclusion Property. Not so easy … Replacement: each level observes different access activity, e.g. L2 may replace a line that is frequently accessed in L1. Example: L1 and L2 are 2-way set associative; blocks m1 and m2 go to the same set in L1 and L2; a new block m3 mapped to the same entry replaces m1 in L1 and m2 in L2 due to the LRU scheme. Split L1 caches and unified L2: imagine all caches are direct-mapped; m1 (instruction block) and m2 (data block) map to the same entry in L2. Different cache line sizes: what happens if L1’s block size is smaller than L2’s? Modified Slide from Prof. H.H. Lee in Georgia Tech

38 Korea Univ 38 Inclusion Property Use specific cache configurations to maintain the inclusion property automatically  E.g., DM L1 + bigger DM or set-associative L2 with the same cache line size (#sets in L2 ≥ #sets in L1) Explicitly propagate L2 action to L1  L2 replacement will flush the corresponding L1 line  Observed BusRdX bus transaction will invalidate the corresponding L1 line  To avoid excess traffic, L2 maintains an Inclusion bit for filtering (to indicate in L1 or not) Modified Slide from Prof. H.H. Lee in Georgia Tech
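
The "explicitly propagate L2 action to L1" option can be sketched as follows. This is a toy model with unbounded caches, not a cache simulator; the class name `InclusiveHierarchy` is hypothetical, and it assumes L1 notifies L2 on eviction so that the inclusion bit stays exact.

```python
class InclusiveHierarchy:
    """Toy L1/L2 pair that keeps L1 a subset of L2."""

    def __init__(self):
        self.l1 = set()   # addresses of lines present in L1
        self.l2 = {}      # address -> inclusion bit (True if the line is also in L1)

    def fill(self, addr):
        # A miss brings the line into both levels.
        self.l2[addr] = True
        self.l1.add(addr)

    def l1_evict(self, addr):
        # Assumption: L1 tells L2, so the inclusion bit can be cleared.
        self.l1.discard(addr)
        if addr in self.l2:
            self.l2[addr] = False

    def l2_evict(self, addr):
        # L2 replacement flushes the corresponding L1 line.
        if self.l2.pop(addr, False):
            self.l1.discard(addr)

    def snoop_busrdx(self, addr):
        """Invalidate on an observed BusRdX; return True if L1 had to be probed."""
        in_l1 = self.l2.pop(addr, False)
        if in_l1:
            self.l1.discard(addr)
        return in_l1
```

The return value of `snoop_busrdx` shows the filtering effect: a snoop only reaches L1 when the inclusion bit says the line is there.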

39 Korea Univ Directory-based Coherence Protocol. Snooping-based protocol: N transactions for an N-node MP; all caches need to watch every memory request from each processor; not a scalable solution for maintaining coherence in large shared-memory systems. Directory protocol: directory-based control of who has what; HW overhead to keep the directory (~ # lines * # processors). [Figure: processors with caches connected to memory through an interconnection network; each directory entry holds a modified bit and presence bits, one for each node.] Slide from Prof. H.H. Lee in Georgia Tech

40 Korea Univ 40 Directory-based Coherence Protocol P $ P $ P $ P $ Memory Interconnection Network P $ C(k) C(k+1) C(k+j) 1 presence bit for each processor, each cache block in memory 1 modified bit for each cache block in memory Slide from Prof. H.H. Lee in Georgia Tech

41 Korea Univ 41 Directory-based Coherence Protocol (Limited Dir) Encoded Present bits (log 2 N), each cache line can reside in 2 processors in this example 1 modified bit for each cache block in memory P0 $ P13 $ P14 $ P15 $ Memory Interconnection Network P1 $ Presence encoding is NULL or not Slide from Prof. H.H. Lee in Georgia Tech
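
The storage cost of the two directory organizations can be compared with a little arithmetic. The sketch below counts bits per memory block; the function names are hypothetical, and the limited-directory count ignores any extra encoding needed to mark a pointer as NULL.

```python
def full_map_bits(num_procs):
    """Full bit-vector directory: one presence bit per processor plus one modified bit."""
    return num_procs + 1

def limited_dir_bits(num_procs, pointers):
    """Limited directory: `pointers` encoded processor IDs plus one modified bit."""
    bits_per_pointer = (num_procs - 1).bit_length()   # ceil(log2(num_procs))
    return pointers * bits_per_pointer + 1
```

For the 16-processor, 2-pointer example on the slide, this gives 17 bits per block for the full map and 9 bits per block for the limited directory; the gap widens as the processor count grows.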

42 Korea Univ 42 Distributed Directory Coherence Protocol Centralized directory is less scalable (contention) Distributed shared memory (DSM) for a large MP system Interconnection network is no longer a shared bus Maintain cache coherence (CC-NUMA) Each address has a “home” P $ Memory Interconnection Network P $ Memory P $ P $ P $ P $ Directory Slide from Prof. H.H. Lee in Georgia Tech

43 Korea Univ Stanford DASH. Stanford DASH (1992): 4 CPUs in each cluster, 16 clusters in total. Invalidation-based cache coherence. The directory keeps one of three states for a cache block at its home node: Uncached, Shared (unmodified state), or Dirty. [Figure: clusters of processors with caches on a snoop bus, each cluster with its own memory and directory, connected by an interconnection network.] Modified Slide from Prof. H.H. Lee in Georgia Tech

44 Korea Univ 44 DASH Memory Hierarchy (1992) Processor Level Local Cluster Level Home Cluster Level (address is at home) If dirty, needs to get it from remote node which owns it Remote Cluster Level Interconnection Network Modified Slide from Prof. H.H. Lee in Georgia Tech P $ Memory Directory Snoop bus P $ P $ Memory Directory Snoop bus P $ Processor Level Local Cluster Level

45 Korea Univ Stanford DASH 45 MIPS R MHz 64KB L1 I$ 64KB L1 D$ 256KB L2 MESI

46 Korea Univ 46 $ Directory Coherence Protocol: Read Miss Interconnection Network 001 P $ Memory P Miss Z (read) Go to Home Node Memory P Z Z 1 Data Z is shared (clean) Home of Z $ Z Modified from Prof. H.H. Lee’ slide in Georgia Tech

47 Korea Univ Memory 47 Directory Coherence Protocol: Read Miss Interconnection Network 1010 P Memory P Miss Z (read) P Data Z is Dirty Go to Home Node Respond with Owner Info Data Request Z 01 Data Z is Clean, Shared by 2 nodes $$$ Z Z Memory Home of Z Modified from Prof. H.H. Lee’ slide in Georgia Tech

48 Korea Univ Memory 48 Directory Coherence Protocol: Write Miss Interconnection Network 001 PP Miss Z (write) P 1 Z Respond w/ sharers Invalidate ACK 0011 Write Z can proceed in P0 $ $ $ ZZ Z Slide from Prof. H.H. Lee in Georgia Tech Memory Invalidate Go to Home Node
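
The read-miss and write-miss flows on the last three slides can be sketched from the home node's point of view. This is an illustration under assumptions, not the DASH implementation: the class `DirectoryEntry`, its method names, and the message tuples are all hypothetical, and acknowledgements are not modeled.

```python
class DirectoryEntry:
    """Full-map directory entry for one block at its home node."""

    def __init__(self, num_nodes):
        self.presence = [False] * num_nodes   # one presence bit per node
        self.modified = False                 # set when exactly one node owns a dirty copy

    def read_miss(self, requester):
        """Return the home node's reply to a read miss."""
        if self.modified:
            owner = self.presence.index(True)
            reply = ("owner_info", owner)     # requester then asks the owner for the data
            self.modified = False             # block ends up clean, shared by both nodes
        else:
            reply = ("data", None)            # memory copy is valid, home supplies it
        self.presence[requester] = True
        return reply

    def write_miss(self, requester):
        """Return the sharers that must be invalidated before the write proceeds."""
        sharers = [n for n, p in enumerate(self.presence) if p and n != requester]
        self.presence = [False] * len(self.presence)
        self.presence[requester] = True
        self.modified = True
        return sharers
```

In this model the requester may proceed with its write once every node in the returned sharer list has been invalidated and has acknowledged, as on the write-miss slide.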

49 Korea Univ Memory Consistency Issue. What do you expect from the following code? Initial values: A=0, B=0. Example 1: P1 runs A=1; Flag = 1; and P2 runs while (Flag==0) {}; print A; Is it possible that P2 prints A=0? Example 2: P1 runs A=1; B=1; and P2 runs print B; print A; Is it possible that P2 prints B=1, A=0? Slide from Prof. H.H. Lee in Georgia Tech

50 Korea Univ Memory Consistency Model. Programmers anticipate certain memory ordering and program behavior. This becomes very complex when running shared-memory programs and when a processor supports out-of-order execution. A memory consistency model specifies the legal ordering of memory events when several processors access shared memory locations. Slide from Prof. H.H. Lee in Georgia Tech

51 Korea Univ 51 Sequential Consistency (SC) [Leslie Lamport] An MP is Sequentially Consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. Two properties  Program ordering  Write atomicity (All writes to any location should appear to all processors in the same order) Intuitive to programmers PPP Memory Slide from Prof. H.H. Lee in Georgia Tech

52 Korea Univ SC Example. Two processors write A=1 and A=2 respectively; a third executes T=A then U=A; a fourth executes Y=A then Z=A. Outcome T=1, U=2, Y=1, Z=2 is sequentially consistent. Outcome T=1, U=2, Y=2, Z=1 violates sequential consistency (but is possible in the processor consistency model). Slide from Prof. H.H. Lee in Georgia Tech
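
Whether an outcome is sequentially consistent can be checked by brute force for a program this small: enumerate every interleaving that respects each processor's program order and collect the results. The sketch below does that; the op encoding and the names `interleavings` and `sc_outcomes` are illustrative assumptions, and the initial value A=0 is assumed.

```python
def interleavings(threads):
    """Yield every merge of the per-thread op lists that keeps program order."""
    if all(not t for t in threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + [t[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [t[0]] + tail

def sc_outcomes(threads):
    """Return the set of register outcomes reachable under sequential consistency."""
    outcomes = set()
    for order in interleavings(threads):
        mem = {"A": 0}                      # assumed initial value
        regs = {}
        for kind, dst, src in order:
            if kind == "st":
                mem[dst] = src              # store constant src to location dst
            else:
                regs[dst] = mem[src]        # load location src into register dst
        outcomes.add(tuple(sorted(regs.items())))
    return outcomes

# The four processors from the slide.
program = [
    [("st", "A", 1)],
    [("st", "A", 2)],
    [("ld", "T", "A"), ("ld", "U", "A")],
    [("ld", "Y", "A"), ("ld", "Z", "A")],
]
```

Under this model, the outcome T=1, U=2, Y=1, Z=2 appears in `sc_outcomes(program)`, while T=1, U=2, Y=2, Z=1 does not, because the two readers would have to disagree on the order of the two writes to A.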

53 Korea Univ Maintain Program Ordering (SC). Dekker’s algorithm: only one processor is allowed to enter the critical section (CS). Initially Flag1 = Flag2 = 0. P1: Flag1 = 1; if (Flag2 == 0) enter Critical Section. P2: Flag2 = 1; if (Flag1 == 0) enter Critical Section. Caveat: an implementation that is fine on a uniprocessor can violate the ordering above. With write buffers, Flag1=1 and Flag2=1 sit in each processor’s write buffer while memory still holds Flag1: 0 and Flag2: 0, so both loads return 0. INCORRECT!! BOTH ARE IN CRITICAL SECTION! Slide from Prof. H.H. Lee in Georgia Tech
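
The write-buffer caveat can be reproduced with a toy model in which stores wait in a per-processor buffer and loads to other addresses go straight to memory. This is a sketch under those assumptions; the `Processor` class and the `run` helper are hypothetical names.

```python
class Processor:
    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = []            # pending (address, value) stores, oldest first

    def store(self, addr, value):
        self.write_buffer.append((addr, value))   # not yet visible to other processors

    def load(self, addr):
        for a, v in reversed(self.write_buffer):
            if a == addr:
                return v                  # forward own youngest pending store
        return self.memory[addr]          # otherwise bypass the buffered stores

    def drain(self):
        for a, v in self.write_buffer:
            self.memory[a] = v
        self.write_buffer.clear()

def run(drain_before_load):
    """Return (P1 enters CS, P2 enters CS) for the flag protocol on the slide."""
    memory = {"Flag1": 0, "Flag2": 0}
    p1, p2 = Processor(memory), Processor(memory)
    p1.store("Flag1", 1)
    p2.store("Flag2", 1)
    if drain_before_load:                 # models a fence between store and load
        p1.drain()
        p2.drain()
    return p1.load("Flag2") == 0, p2.load("Flag1") == 0
```

Without the drain, both loads read the stale 0 from memory and both processors enter the critical section; draining the buffers before the loads restores the program order that SC requires.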

54 Korea Univ Atomic and Instantaneous Update (SC). An update (of A) must take place atomically with respect to all processors. A read cannot return the value of another processor’s write until the write is made visible to “all” processors. Initially A = B = 0. P1: A = 1. P2: if (A==1) B = 1. P3: if (B==1) R1 = A. Slide from Prof. H.H. Lee in Georgia Tech

55 Korea Univ Atomic and Instantaneous Update (SC). An update (of A) must take place atomically with respect to all processors. A read cannot return the value of another processor’s write until the write is made visible to “all” processors. Initially A = B = 0. P1: A = 1. P2: if (A==1) B = 1. P3: if (B==1) R1 = A. [Figure: processors P0 to P4, with the updates A=1 and B=1 reaching different processors at different times.] Caveat when an update is not atomic to all … R1=0? Slide from Prof. H.H. Lee in Georgia Tech
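
The R1=0 caveat can be illustrated by giving each processor its own view of memory and delivering a write to each view separately. This is a deliberately crude sketch; the function `run` and its `a_reaches_p3_first` flag are hypothetical, and real hardware propagates writes through caches and networks rather than explicit copies.

```python
def run(a_reaches_p3_first):
    """Return R1 as read by P3, or None if P3 never sees B == 1."""
    # Each processor sees its own copy of A and B; a write is "delivered"
    # to another processor by updating that processor's copy.
    views = {p: {"A": 0, "B": 0} for p in ("P1", "P2", "P3")}

    views["P1"]["A"] = 1                  # P1: A = 1
    views["P2"]["A"] = 1                  # the update reaches P2 ...
    if a_reaches_p3_first:
        views["P3"]["A"] = 1              # ... and, if atomic, P3 as well

    if views["P2"]["A"] == 1:             # P2: if (A==1) B = 1
        views["P2"]["B"] = 1
        views["P3"]["B"] = 1              # B's update reaches P3

    if views["P3"]["B"] == 1:             # P3: if (B==1) R1 = A
        return views["P3"]["A"]
    return None
```

When the write to A is visible everywhere before P2 acts on it, P3 reads R1=1; when it reaches P2 but is still in flight to P3, P3 sees B=1 and reads the stale A=0.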

56 Korea Univ Relaxed Memory Models. Relax the program order requirement (this applies only to operation pairs accessing different locations): load bypass store, load bypass load, store bypass store, store bypass load. Relax the write atomicity requirement? Read others’ write early: allow a read to return the value of another processor’s write before all cached copies of the accessed location receive the invalidation or update messages generated by the write. Read own write early: allow a read to return the value of its own previous write before the write is made visible to other processors. Modified Slide from Prof. H.H. Lee in Georgia Tech

57 Korea Univ Relaxed Memory Models 57

58 Korea Univ Relaxed Consistency. Processor Consistency: used in P6. Writes may become visible to different processors in different orders (write atomicity is not guaranteed). Loads are allowed to bypass independent stores within each individual processor. To achieve SC, explicit synchronization operations need to be substituted or inserted: read-modify-write instructions or memory fence instructions. Slide from Prof. H.H. Lee in Georgia Tech

59 Korea Univ Processor Consistency: Load Bypassing Stores. One processor executes F1=1; A=1; R1=A; R2=F2. The other executes F2=1; A=2; R3=A; R4=F1. R1=1; R3=2; R2=0; R4=0 is a possible outcome. Modified Slide from Prof. H.H. Lee in Georgia Tech

60 Korea Univ Processor Consistency. PC allows a load to bypass a store to a different address. Unlike SC, it cannot guarantee mutual exclusion in the critical section. Initially Flag1 = Flag2 = 0. P1: Flag1 = 1; if (Flag2 == 0) enter Critical Section. P2: Flag2 = 1; if (Flag1 == 0) enter Critical Section. Slide from Prof. H.H. Lee in Georgia Tech

61 Korea Univ Processor Consistency. Initially A = B = 0. P1: A = 1. P2: if (A==1) B = 1. P3: if (B==1) R1 = A. B=1; R1=0 is a possible outcome, since PC allows A=1 to become visible in P2 before it is visible in P3. Slide from Prof. H.H. Lee in Georgia Tech

62 Korea Univ Backup Slides 62

63 Korea Univ Dekker’s Algorithm Dekker’s algorithm guarantees mutual exclusion, freedom from deadlock, and freedom from starvation 63 Source: Wikipedia

