
1 Lecture 2. Snoop-based Cache Coherence Protocols Prof. Taeweon Suh Computer Science Education Korea University COM503 Parallel Computer Architecture & Programming

2 Korea Univ Flynn’s Taxonomy A classification of computers proposed by Michael J. Flynn in 1966  Characterizes computer designs by the number of distinct instructions issued at a time and the number of data elements they operate on

                 Single Instruction   Multiple Instruction
Single Data      SISD                 MISD
Multiple Data    SIMD                 MIMD

Source: Wikipedia

3 Korea Univ Flynn’s Taxonomy (Cont.) SISD  Single Instruction Single Data  Uniprocessor  Example: your desktop (notebook) computer before the spread of multi-core CPUs SIMD  Single Instruction Multiple Data  Each processor works on its own data stream, but all processors execute the same instruction in lockstep  Examples: MMX and GPUs Picture sources: Wikipedia

4 Korea Univ SIMD Example MMX (Multimedia Extension)  64-bit registers hold 2 32-bit integers, 4 16-bit integers, or 8 8-bit integers, processed concurrently SSE (Streaming SIMD Extensions)  128-bit registers hold 4 single-precision (or 2 double-precision) floating-point values; the later AVX extension widens this to 256-bit registers (4 DP floating-point operations)
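The packed, lane-wise arithmetic that MMX performs can be sketched in plain Python. This is a conceptual model of one 64-bit register split into 8-bit lanes, not real SIMD hardware:

```python
def packed_add(a, b, lane_bits):
    """Add two registers lane by lane, wrapping within each lane, the way
    MMX packed-integer addition does (e.g. 8 x 8-bit lanes in 64 bits)."""
    mask = (1 << lane_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]

# One 64-bit register treated as 8 x 8-bit lanes; note lane 2 wraps (260 -> 4).
print(packed_add([10, 20, 250, 0, 1, 2, 3, 4],
                 [ 5,  5,  10, 0, 1, 2, 3, 4], 8))
```

The point of the hardware is that all eight lane additions happen in one instruction; the list comprehension above only models the semantics.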

5 Korea Univ Flynn’s Taxonomy (Cont.) MISD  Multiple Instruction Single Data  Each processor executes different instructions on the same data  Not used much MIMD  Multiple Instruction Multiple Data  Each processor executes its own instructions on its own data  Virtually all multiprocessor systems are based on MIMD Picture sources: Wikipedia

6 Korea Univ Multiprocessor Systems Shared memory systems  Bus-based shared memory  Distributed shared memory Current server systems (for example, Xeon-based servers) Cluster-based systems  Supercomputers and datacenters 6

7 Korea Univ Clusters 7 Supercomputer dubbed 7N (Cluster computer), 95th fastest in the world on the TOP500 in 2007 https://www.jlab.org/news/releases/jefferson-lab-boasts-virginias-fastest-computer

8 Korea Univ Shared Memory Multiprocessor Models [Figure: three organizations. (1) Bus-based shared memory: processors with private caches on a shared bus to memory (our focus today). (2) Fully-connected shared memory (dancehall): processors and memory modules joined by an interconnection network. (3) Distributed shared memory: a memory module attached to each processor node, with nodes joined by an interconnection network]

9 Korea Univ Some Terminologies Shared memory systems can be classified into  UMA (Uniform Memory Access) architecture  NUMA (Non-Uniform Memory Access) architecture SMP (Symmetric Multiprocessor) is a UMA example  Don’t confuse it with SMT (Simultaneous Multithreading)

10 Korea Univ SMP (UMA) Systems [Figures: an antique (?) Pentium III-based SMP and a Sandy Bridge-based motherboard]

11 Korea Univ DSM (NUMA) Machine Examples Nehalem-based systems with QPI 11 Nehalem-based Xeon 5500 QPI: QuickPath Interconnect

12 Korea Univ More Recent NUMA System 12

13 Korea Univ Amdahl’s Law (Law of Diminishing Returns) Amdahl’s law is named after computer architect Gene Amdahl It is used to find the maximum expected improvement to an overall system The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program

Maximum speedup = 1 / ((1 - P) + P / N)

P: parallelizable portion of the program N: # processors Source: Wikipedia
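The formula above is easy to evaluate directly; a short sketch showing how a 95%-parallel program saturates near 20x no matter how many processors are added:

```python
def max_speedup(p, n):
    """Amdahl's law: 1 / ((1 - p) + p / n) for parallel fraction p, n processors."""
    return 1.0 / ((1.0 - p) + p / n)

# The sequential 5% dominates as n grows; the limit is 1 / (1 - p) = 20.
for n in (2, 8, 64, 4096):
    print(n, round(max_speedup(0.95, n), 2))
```

Even with 4096 processors the speedup stays just under 20x, which is the "diminishing returns" in the slide title.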

14 Korea Univ WB & WT Caches [Figure: a CPU core writes X=300. With a writeback cache, only the cached copy becomes X=300 while memory still holds X=100; with a writethrough cache, the write updates memory to X=300 as well]

15 Korea Univ Definition of Coherence Coherence is a property of a shared-memory architecture giving the illusion to the software that there is a single copy of every memory location, even if multiple copies exist A multiprocessor memory system is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order Modified Slide from Prof. H.H. Lee in Georgia Tech

16 Korea Univ Definition of Coherence A multiprocessor memory system is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order Implicit definition of coherence  Write propagation Writes are visible to other processes  Write serialization All writes to the same location are seen in the same order by all processes 16 Slide from Prof. H.H. Lee in Georgia Tech

17 Korea Univ Why Cache Coherency? The closest cache level is private Multiple copies of a cache line can be present across different processor nodes Local updates (writes) lead to an incoherent state  The problem appears in both write-through and writeback caches [Figure: Core i7, per core: register file, 32KB L1 I$ and D$, 256KB L2; 8MB shared L3] Slide from Prof. H.H. Lee in Georgia Tech

18 Korea Univ Writeback Cache w/o Coherence [Figure: two processors read X=100 into their caches; one then writes X=505, which stays in its writeback cache. Later reads by the other processors still see the stale X=100 from memory] Slide from Prof. H.H. Lee in Georgia Tech

19 Korea Univ Writethrough Cache w/o Coherence [Figure: two processors cache X=100; one writes X=505, which writes through to memory, but the other cache still holds the stale X=100] Slide from Prof. H.H. Lee in Georgia Tech

20 Korea Univ Cache Coherence Protocols According to Caching Policies Write-through cache  Update-based protocol  Invalidation-based protocol Writeback cache  Update-based protocol  Invalidation-based protocol 20

21 Korea Univ Bus Snooping based on Write-Through Cache Every write appears as a transaction on the shared bus to memory Two protocols  Update-based Protocol  Invalidation-based Protocol Slide from Prof. H.H. Lee in Georgia Tech

22 Korea Univ Bus Snooping Update-based Protocol on Write-Through cache [Figure: a processor writes X=505; the write-through generates a bus transaction, and the other caches snoop it and update their copies of X from 100 to 505] Slide from Prof. H.H. Lee in Georgia Tech

23 Korea Univ Bus Snooping Invalidation-based Protocol on Write-Through cache [Figure: a processor writes X=505; the snooping caches invalidate their copies of X instead of updating them. A later Load X in another processor misses and fetches X=505 from memory] Slide from Prof. H.H. Lee in Georgia Tech

24 Korea Univ A Simple Snoopy Coherence Protocol for a WT, No Write-Allocate Cache Notation: observed event / bus transaction generated

Processor-initiated transitions:
 Invalid → Valid: PrRd / BusRd
 Invalid → Invalid: PrWr / BusWr (no write allocate)
 Valid → Valid: PrRd / ---
 Valid → Valid: PrWr / BusWr (write-through)

Bus-snooper-initiated transitions:
 Valid → Invalid: BusWr / ---

Slide from Prof. H.H. Lee in Georgia Tech
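The two-state diagram above is small enough to write out as a transition table; a minimal sketch, with state and event names taken from the slide:

```python
# Transition table for the two-state write-through, no-write-allocate protocol:
# key = (current state, observed event) -> (next state, bus transaction issued)
WT_PROTOCOL = {
    ("I", "PrRd"):  ("V", "BusRd"),
    ("I", "PrWr"):  ("I", "BusWr"),  # no write allocate: stay Invalid
    ("V", "PrRd"):  ("V", None),
    ("V", "PrWr"):  ("V", "BusWr"),  # write-through: every write goes to the bus
    ("V", "BusWr"): ("I", None),     # snooped remote write invalidates our copy
    ("I", "BusWr"): ("I", None),
}

def step(state, event):
    return WT_PROTOCOL[(state, event)]

print(step("V", "BusWr"))  # another cache wrote the line: we drop our copy
```

Representing the protocol as a (state, event) table is also roughly how a snoop controller's next-state logic is specified.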

25 Korea Univ How about Writeback Cache? A WB cache reduces the bandwidth requirement The majority of local writes are hidden behind the processor nodes Open issues: How to snoop? Write ordering? Slide from Prof. H.H. Lee in Georgia Tech

26 Korea Univ Cache Coherence Protocols for WB Caches A cache has an exclusive copy of a line if  It is the only cache having a valid copy  Memory may or may not have it Modified (dirty) cache line  The cache having the line is the owner of the line, because it must supply the block 26 Slide from Prof. H.H. Lee in Georgia Tech

27 Korea Univ Update-based Protocol on WB Cache [Figure: Store X causes X=505 to be broadcast on the bus; the sharing caches update their copies] Update data for all processor nodes that share the same data Because a processor node keeps updating the memory location, a lot of traffic will be incurred Slide from Prof. H.H. Lee in Georgia Tech

28 Korea Univ Update-based Protocol on WB Cache [Figure: a Load X in another processor now hits (X=505); a further Store X=333 is again broadcast and all sharers update] Update data for all processor nodes that share the same data Because a processor node keeps updating the memory location, a lot of traffic will be incurred Slide from Prof. H.H. Lee in Georgia Tech

29 Korea Univ Invalidation-based Protocol on WB Cache Invalidate the data copies in the sharing processor nodes Reduced traffic when a processor node keeps updating the same memory location [Figure: Store X=505 sends an invalidation on the bus; the sharers drop their copies of X] Slide from Prof. H.H. Lee in Georgia Tech

30 Korea Univ Invalidation-based Protocol on WB Cache [Figure: a later Load X in another processor misses; the owning cache snoop-hits and supplies X=505] Invalidate the data copies in the sharing processor nodes Reduced traffic when a processor node keeps updating the same memory location Slide from Prof. H.H. Lee in Georgia Tech

31 Korea Univ Invalidation-based Protocol on WB Cache [Figure: the owning processor stores X=333, X=987, X=444 in succession; all are local hits, with no bus traffic after the initial invalidation] Invalidate the data copies in the sharing processor nodes Reduced traffic when a processor node keeps updating the same memory location Slide from Prof. H.H. Lee in Georgia Tech

32 Korea Univ MSI Writeback Invalidation Protocol Modified  Dirty  Only this cache has a valid copy Shared  Memory is consistent  One or more caches have a valid copy Invalid Writeback protocol: A cache line can be written multiple times before the memory is updated 32 Slide from Prof. H.H. Lee in Georgia Tech

33 Korea Univ MSI Writeback Invalidation Protocol Two types of request from the processor  PrRd  PrWr Three types of bus transactions posted by the cache controller  BusRd PrRd misses the cache Memory or another cache supplies the line  BusRdX (read-to-own) PrWr is issued to a line which is not in the Modified state  BusWB Writeback due to replacement The processor is not directly involved in initiating this operation Slide from Prof. H.H. Lee in Georgia Tech

34 Korea Univ MSI Writeback Invalidation Protocol (Processor Request)

Processor-initiated transitions:
 Invalid → Shared: PrRd / BusRd
 Invalid → Modified: PrWr / BusRdX
 Shared → Shared: PrRd / ---
 Shared → Modified: PrWr / BusRdX
 Modified → Modified: PrRd / ---, PrWr / ---

Slide from Prof. H.H. Lee in Georgia Tech

35 Korea Univ MSI Writeback Invalidation Protocol (Bus Transaction) Flush puts the data on the bus Both memory and the requestor grab the copy The requestor gets data from either  cache-to-cache transfer; or  memory

Bus-snooper-initiated transitions:
 Shared → Shared: BusRd / ---
 Shared → Invalid: BusRdX / ---
 Modified → Shared: BusRd / Flush
 Modified → Invalid: BusRdX / Flush

Slide from Prof. H.H. Lee in Georgia Tech

36 Korea Univ MSI Writeback Invalidation Protocol (Bus Transaction) Another possible implementation: Modified → Invalid on BusRd / Flush  Anticipates no more reads from this processor  A performance concern  Saves an “invalidation” trip if the requesting cache writes the shared line later Slide from Prof. H.H. Lee in Georgia Tech

37 Korea Univ MSI Writeback Invalidation Protocol

Processor-initiated:
 Invalid → Shared: PrRd / BusRd
 Invalid → Modified: PrWr / BusRdX
 Shared: PrRd / ---; Shared → Modified: PrWr / BusRdX
 Modified: PrRd / ---, PrWr / ---

Bus-snooper-initiated:
 Shared: BusRd / ---; Shared → Invalid: BusRdX / ---
 Modified → Shared: BusRd / Flush; Modified → Invalid: BusRdX / Flush

Slide from Prof. H.H. Lee in Georgia Tech

38 Korea Univ MSI Example

Processor Action   State in P1   State in P2   State in P3   Bus Transaction   Data Supplier
P1 reads X         S             ---           ---           BusRd             Memory

(P1’s cache now holds X=10 in state S)

Slide from Prof. H.H. Lee in Georgia Tech

39 Korea Univ MSI Example

Processor Action   State in P1   State in P2   State in P3   Bus Transaction   Data Supplier
P1 reads X         S             ---           ---           BusRd             Memory
P3 reads X         S             ---           S             BusRd             Memory

(P1 and P3 both hold X=10 in state S)

Slide from Prof. H.H. Lee in Georgia Tech

40 Korea Univ MSI Example

Processor Action   State in P1   State in P2   State in P3   Bus Transaction   Data Supplier
P1 reads X         S             ---           ---           BusRd             Memory
P3 reads X         S             ---           S             BusRd             Memory
P3 writes X        I             ---           M             BusRdX            ---

(P3’s copy becomes X=-25 in state M; P1’s copy is invalidated; memory still holds X=10)

Slide from Prof. H.H. Lee in Georgia Tech

41 Korea Univ MSI Example

Processor Action   State in P1   State in P2   State in P3   Bus Transaction   Data Supplier
P1 reads X         S             ---           ---           BusRd             Memory
P3 reads X         S             ---           S             BusRd             Memory
P3 writes X        I             ---           M             BusRdX            ---
P1 reads X         S             ---           S             BusRd             P3 Cache

(P3 flushes X=-25 on the bus; both P1 and memory pick it up, so memory now holds X=-25)

Slide from Prof. H.H. Lee in Georgia Tech

42 Korea Univ MSI Example

Processor Action   State in P1   State in P2   State in P3   Bus Transaction   Data Supplier
P1 reads X         S             ---           ---           BusRd             Memory
P3 reads X         S             ---           S             BusRd             Memory
P3 writes X        I             ---           M             BusRdX            ---
P1 reads X         S             ---           S             BusRd             P3 Cache
P2 reads X         S             S             S             BusRd             Memory

Slide from Prof. H.H. Lee in Georgia Tech
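The five-step trace above can be replayed with a toy single-line MSI model. This is a sketch only: a real controller tracks many lines and handles transient states and writebacks.

```python
# Minimal MSI snooping model for one cache line; states are "M", "S", "I".
def msi_access(states, requester, op):
    """Apply one processor read or write; return the bus transaction used."""
    if op == "read":
        if states[requester] in ("M", "S"):
            return None                      # cache hit: no bus traffic
        for p, s in states.items():          # snoopers: an M copy flushes, goes S
            if s == "M":
                states[p] = "S"
        states[requester] = "S"
        return "BusRd"
    else:  # write
        if states[requester] == "M":
            return None                      # already exclusive and dirty
        for p in states:                     # invalidate every other copy
            if p != requester:
                states[p] = "I"
        states[requester] = "M"
        return "BusRdX"

states = {"P1": "I", "P2": "I", "P3": "I"}
for who, op in [("P1", "read"), ("P3", "read"), ("P3", "write"),
                ("P1", "read"), ("P2", "read")]:
    msi_access(states, who, op)
print(states)  # all three caches end in S, matching the last row of the table
```

Replaying the slide's sequence ends with P1, P2, and P3 all in Shared, exactly as in the final trace row.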

43 Korea Univ MESI Writeback Invalidation Protocol To reduce two types of unnecessary bus transactions  BusRdX that snoops and converts the block from S to M when you are the sole owner of the block  BusRd that gets the line in S state when there are no sharers (which leads to the overhead above) Introduce the Exclusive state  One can write to the copy without generating BusRdX Illinois Protocol: proposed by Papamarcos and Patel in 1984 Employed in Intel, PowerPC, and MIPS processors Slide from Prof. H.H. Lee in Georgia Tech

44 Korea Univ MESI Writeback Invalidation (Processor Request) S: shared signal on the bus

Processor-initiated transitions:
 Invalid → Exclusive: PrRd / BusRd (not-S)
 Invalid → Shared: PrRd / BusRd (S)
 Invalid → Modified: PrWr / BusRdX
 Exclusive → Modified: PrWr / --- (silent upgrade, no bus transaction)
 Exclusive: PrRd / ---
 Shared: PrRd / ---; Shared → Modified: PrWr / BusRdX
 Modified: PrRd, PrWr / ---

Slide from Prof. H.H. Lee in Georgia Tech

45 Korea Univ MESI Writeback Invalidation Protocol (Bus Transactions) Flush*: Flush by the data supplier only; no action for other sharers

Bus-snooper-initiated transitions:
 Modified → Shared: BusRd / Flush
 Modified → Invalid: BusRdX / Flush
 Exclusive → Shared: BusRd / Flush (or ---)
 Exclusive → Invalid: BusRdX / ---
 Shared: BusRd / Flush*
 Shared → Invalid: BusRdX / Flush*

Whenever possible, the Illinois protocol performs $-to-$ transfer rather than having memory supply the data Use a selection algorithm if there are multiple suppliers (alternative: add an O state, or force an update of memory) Modified Slide from Prof. H.H. Lee in Georgia Tech

46 Korea Univ MESI Writeback Invalidation Protocol (Illinois Protocol) Notation: S = shared signal; Flush* = flush by the data supplier only, no action for other sharers

Processor-initiated:
 Invalid → Exclusive: PrRd / BusRd (not-S)
 Invalid → Shared: PrRd / BusRd (S)
 Invalid → Modified: PrWr / BusRdX
 Exclusive → Modified: PrWr / ---
 Exclusive: PrRd / ---
 Shared: PrRd / ---; Shared → Modified: PrWr / BusRdX
 Modified: PrRd, PrWr / ---

Bus-snooper-initiated:
 Modified → Shared: BusRd / Flush; Modified → Invalid: BusRdX / Flush
 Exclusive → Shared: BusRd / Flush (or ---); Exclusive → Invalid: BusRdX / ---
 Shared: BusRd / Flush*; Shared → Invalid: BusRdX / Flush*

Slide from Prof. H.H. Lee in Georgia Tech
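The payoff of the E state, a silent upgrade to M with no bus traffic, can be sketched as follows. The helper names are illustrative, not from any real API:

```python
# MESI processor-side sketch: where does a read miss land, and what does a
# later write cost?
def mesi_read_miss(shared_signal):
    """BusRd result: Shared if another cache asserted the S signal, else Exclusive."""
    return "S" if shared_signal else "E"

def mesi_write(state):
    """Return (next_state, bus_transaction) for a PrWr in the given state."""
    if state in ("E", "M"):
        return ("M", None)            # E -> M is silent: no BusRdX needed
    return ("M", "BusRdX")            # S or I: must read-to-own / invalidate

state = mesi_read_miss(shared_signal=False)  # sole reader lands in E
print(mesi_write(state))                     # the write costs no bus transaction
```

In plain MSI the same read would land in S, and the write would still pay a BusRdX even with no sharers; that is exactly the overhead MESI removes.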

47 Korea Univ MOESI Protocol Introduces a notion of ownership: the Owned state  Similar to the Shared state, but the O-state processor is responsible for supplying the data (the copy in memory may be stale) Employed by  Sun UltraSparc  AMD Opteron In the dual-core Opteron, cache-to-cache transfer is done through a system request interface (SRI) running at full CPU speed [Figure: two CPU cores with private L2s connect to the System Request Interface, then a crossbar to HyperTransport and the memory controller] Modified Slide from Prof. H.H. Lee in Georgia Tech

48 Korea Univ MOESI Writeback Invalidation Protocol (Processor Request) S: shared signal

Processor-initiated transitions:
 Invalid → Exclusive: PrRd / BusRd (not-S)
 Invalid → Shared: PrRd / BusRd (S)
 Invalid → Modified: PrWr / BusRdX
 Exclusive → Modified: PrWr / ---
 Exclusive: PrRd / ---
 Shared: PrRd / ---; Shared → Modified: PrWr / BusRdX
 Owned: PrRd / ---; Owned → Modified: PrWr / BusRdX
 Modified: PrRd, PrWr / ---

49 Korea Univ MOESI Writeback Invalidation Protocol (Bus Transactions) Flush*: Flush by the data supplier; no action for other sharers

Bus-snooper-initiated transitions:
 Modified → Owned: BusRd / Flush
 Modified → Invalid: BusRdX / Flush
 Owned: BusRd / Flush (the owner keeps supplying the data)
 Owned → Invalid: BusRdX / Flush
 Exclusive → Shared: BusRd / Flush (or ---)
 Exclusive → Invalid: BusRdX / ---
 Shared: BusRd / Flush*; Shared → Invalid: BusRdX / Flush*

50 Korea Univ MOESI Writeback Invalidation Protocol (Bus Transactions)

Bus-snooper-initiated transitions:
 Modified → Owned: BusRd / Flush
 Modified → Invalid: BusRdX / Flush
 Owned: BusRd / Flush; Owned → Invalid: BusRdX / Flush
 Exclusive → Shared: BusRd / Flush; Exclusive → Invalid: BusRdX / ---
 Shared: BusRd / ---; Shared → Invalid: BusRdX / --- (with an owner present, plain sharers never need to flush)

51 Korea Univ Transient States in MSI Design issue: a coherence transaction is not atomic  In MESI, I → E or S (?) depending on the Shared signal  The next state cannot be determined until the request is launched on the bus and the snoop result is available BusRdX reads a memory block and invalidates other copies BusUpgr invalidates potential remote cache copies (no data fetch is needed, since the requester already holds a valid copy)

52 Korea Univ Atomic & Non-atomic Buses A split-transaction bus increases the available bus bandwidth by breaking up a transaction into subtransactions

Atomic bus: addr1 → read data, then addr2 → read data (the bus is held for the whole transaction)
Non-atomic bus (pipelined or split-transaction): addr1, addr2, addr3, addr4 are issued back-to-back, with each read’s data returning later, overlapped with the other requests

53 Korea Univ Issues with Pipelined Buses On a non-atomic bus, multiple outstanding requests to the same address (e.g., several reads and a write to addr1) can be in flight at once The SGI Challenge (mid-1990s) had a system-wide table in each node to bookkeep all outstanding requests  A request is launched only if no entry in the table matches its address Silicon Graphics, Inc. was an American manufacturer of high-performance computing solutions, including computer hardware and software. -wiki

54 Korea Univ SGI Challenge 54

55 Korea Univ Inclusion & Exclusion Properties Inclusion property  A block in L1 must be in L2 as well  An L2 block eviction causes invalidation of the L1 block  An L1 write causes an L2 update  Effective cache size is equal to the L2 size  Desirable for cache coherence Exclusion property  A block is located in either L1 or L2, not both  When an L1 block is replaced, it may be placed in L2  Better utilization of hardware resources

56 Korea Univ Cache Hierarchies “Achieving Non-Inclusive Cache Performance with Inclusive Caches”, MICRO, 2010 Effective cache sizes  Inclusive: LLC  Non-Inclusive: between LLC and LLC + L1s  Exclusive: LLC + L1s
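The effective-size bounds above are simple arithmetic; a sketch with made-up but typical sizes (four cores with 32KB L1s and an 8MB LLC, chosen only for illustration):

```python
KB, MB = 1024, 1024 * 1024
l1_total = 4 * 32 * KB          # assumption: four cores, 32KB L1 each
llc = 8 * MB                    # assumption: 8MB last-level cache

effective = {
    "inclusive": llc,               # every L1 line is duplicated inside the LLC
    "exclusive": llc + l1_total,    # no duplication: capacities add up
}
# a non-inclusive hierarchy falls between these two bounds
print(effective)
```

With these numbers the exclusive hierarchy holds 128KB more unique data than the inclusive one, which is why exclusion is attractive when the LLC is not much larger than the combined L1s.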

57 Korea Univ Coherency in Multi-level Cache Hierarchy 57 L2 Cache CPU Core Reg File L1 I$L1 D$ Main Memory L2 Cache CPU Core Reg File L1 I$L1 D$ L2 is exclusive All incoming bus requests contend with CPU core for L1

58 Korea Univ Coherency in Multi-level Cache Hierarchy L2 is inclusive  L2 is used as a snoop filter  An L2 line eviction forces the L1 line eviction If L1 is a writeback cache, the blocks in L1 and L2 are not consistent  A writethrough policy in L1 is desirable  Otherwise, L1 must also be snooped

59 Korea Univ Nehalem Case Study [Figure: per core, register file, 32KB L1 I$, 32KB L1 D$ (4-cycle), 8-way 256KB L2 (non-inclusive); 8MB shared L3 (inclusive); writeback to main memory] In the L1 data cache and in the L2/L3 unified caches, the MESI (modified, exclusive, shared, invalid) cache protocol maintains consistency with caches of other processors. The L1 data cache and the L2/L3 unified caches have two MESI status flags per cache line.

60 Korea Univ Nehalem Uncore 60

61 Korea Univ Sandy Bridge Transactions from each core travel along the ring LLC slices (2MB each) are connected to the ring

62 Korea Univ TLB and Virtual Memory [Figure: the CPU core issues a virtual (linear) address; the TLB in the MMU translates it to a physical address used to access main memory; the virtual memory space is backed by the hard disk] MMU: Memory Management Unit

63 Korea Univ TLB with a Cache The TLB is a cache for the page table The cache is a cache (?) for instructions and data  Modern processors typically use the physical address to access caches [Figure: the CPU core sends a virtual address to the MMU (TLB, backed by the page table in main memory); the resulting physical address accesses the CPU cache for instructions or data]

64 Korea Univ Core i7 Case Study [Figure: per core, register file, ITLB and DTLB, 32KB L1 I$, 8-way 32KB L1 D$, 256KB L2 (non-inclusive); 8MB shared L3 (inclusive)] L1: VIPT L2: PIPT L3: PIPT

65 Korea Univ TLB Shootdown TLB inconsistency arises when a PTE in a TLB is modified  The PTE copies in other TLBs and in main memory become stale  Two cases Virtual-to-physical mapping change by the OS Page access right change by the OS TLB shootdown procedure (similar to page fault handling)  A processor invokes the virtual memory manager, which generates an IPI (Inter-processor Interrupt)  Each processor invokes a software handler to remove the stale PTE and invalidate all the block copies in private caches

66 Korea Univ False Sharing Data is loaded into a cache at block granularity (for example, 64B) CPUs share a block, but each CPU never uses the data modified by the other CPUs [Figure/timeline: #1 CPU0 write, #2 CPU1 write, #3 CPU0 write, #4 CPU1 read; the block ping-pongs between the two caches even though the CPUs touch different words]
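Whether two variables false-share is just block-index arithmetic; a sketch assuming a 64-byte block and illustrative addresses:

```python
BLOCK_SIZE = 64  # bytes per cache block (the 64B granularity from the slide)

def same_block(addr_a, addr_b):
    """True if two byte addresses fall in the same cache block."""
    return addr_a // BLOCK_SIZE == addr_b // BLOCK_SIZE

# Two per-thread counters packed 8 bytes apart: they false-share one block,
# so every write by one CPU invalidates the other CPU's copy.
print(same_block(0x1000, 0x1008))

# Padding each counter to its own 64B block removes the ping-ponging.
print(same_block(0x1000, 0x1040))
```

This is why concurrent code often pads or aligns per-thread data to the cache-block size: the coherence protocol tracks blocks, not individual words.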

67 Korea Univ Backup Slides 67

68 Korea Univ Intel Core 2 Duo Homogeneous cores Bus-based on-chip interconnect Shared on-die cache Traditional memory and I/O Classic OOO: reservation stations, issue ports, schedulers, etc. Large, shared, set-associative cache with prefetch, etc. Source: Intel Corp.

69 Korea Univ Core 2 Duo Microarchitecture 69

70 Korea Univ Why Sharing on-die L2? 70

71 Korea Univ Intel Quad-Core Processor (Kentsfield, Clovertown) 71

72 Korea Univ AMD Barcelona’s Cache Architecture Source: AMD

