Lecture 2. Snoop-based Cache Coherence Protocols
COM503 Parallel Computer Architecture & Programming
Prof. Taeweon Suh, Computer Science Education, Korea University
Flynn's Taxonomy
A classification of computers, proposed by Michael J. Flynn in 1966
Characterizes computer designs in terms of the number of distinct instructions issued at a time and the number of data elements they operate on
                Single Instruction   Multiple Instruction
Single Data     SISD                 MISD
Multiple Data   SIMD                 MIMD
Source: Wikipedia
Flynn's Taxonomy (Cont.)
SISD (Single Instruction, Single Data)
Uniprocessor
Example: your desktop (notebook) computer before dual- and multi-core CPUs became widespread
SIMD (Single Instruction, Multiple Data)
Each processor works on its own data stream, but all processors execute the same instruction in lockstep
Examples: MMX and GPUs
Picture sources: Wikipedia
SIMD Example
MMX (Multimedia Extension): a 64-bit register holds 2 32-bit integers, 4 16-bit integers, or 8 8-bit integers, processed concurrently
SSE (Streaming SIMD Extensions): 128-bit XMM registers hold, e.g., 4 SP or 2 DP floating-point values; AVX widens these to 256-bit YMM registers (4 DP floating-point operations)
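To make the lockstep data parallelism concrete, here is a minimal sketch in C using the SSE intrinsics from <xmmintrin.h> (the function and array names are illustrative, not from the slides); a single _mm_add_ps performs four single-precision additions at once:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Adds two float arrays four elements at a time.
 * Assumes n is a multiple of 4 and the pointers are 16-byte aligned. */
void add_arrays(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);   /* load 4 packed floats */
        __m128 vb = _mm_load_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);   /* one instruction, 4 adds */
        _mm_store_ps(&c[i], vc);
    }
}
```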
Flynn's Taxonomy (Cont.)
MISD (Multiple Instruction, Single Data)
Each processor executes different instructions on the same data
Not used much
MIMD (Multiple Instruction, Multiple Data)
Each processor executes its own instructions on its own data
Virtually all multiprocessor systems are based on MIMD
Picture sources: Wikipedia
Multiprocessor Systems
Shared memory systems: bus-based shared memory; distributed shared memory (current server systems, for example Xeon-based servers)
Cluster-based systems: supercomputers and datacenters
Clusters Supercomputer dubbed 7N (Cluster computer), 95th fastest in the world on the TOP500 in 2007 http://www.tik.ee.ethz.ch/~ddosvax/cluster/ https://www.jlab.org/news/releases/jefferson-lab-boasts-virginias-fastest-computer
Shared Memory Multiprocessor Models
Bus-based shared memory (our focus today): processors with private caches ($) share a common bus to memory
Fully-connected shared memory (dancehall): processors with caches reach the memory modules through an interconnection network
Distributed shared memory: each node pairs a processor and cache with local memory; nodes communicate over an interconnection network
Some Terminologies
Shared memory systems can be classified into
UMA (Uniform Memory Access) architecture
NUMA (Non-Uniform Memory Access) architecture
SMP (Symmetric Multiprocessor) is a UMA example
Don't confuse SMP with SMT (Simultaneous Multithreading)
SMP (UMA) Systems
Examples: a Sandy Bridge based motherboard and an antique (?) P-III based SMP, in which all processors access one shared memory
http://www.evga.com/forums/tm.aspx?m=1897631&mpage=1
http://news.softpedia.com/newsImage/Gigabyte-Also-Details-Its-Sandy-Bridge-Motherboard-Replacement-Program-2.jpg/
DSM (NUMA) Machine Examples Nehalem-based systems with QPI Nehalem-based Xeon 5500 QPI: QuickPath Interconnect http://www.qdpma.com/systemarchitecture/SystemArchitecture_QPI.html
More Recent NUMA System http://www.anandtech.com/show/6533/gigabyte-ga7pesh1-review-a-dual-processor-motherboard-through-a-scientists-eyes http://ark.intel.com/products/64596/Intel-Xeon-Processor-E5-2690-20M-Cache-2_90-GHz-8_00-GTs-Intel-QPI http://www.intel.in/content/www/in/en/intelligent-systems/crystal-forest-server/xeon-e5-2600-e5-2400-89xx-ibd.html
Amdahl's Law (Law of Diminishing Returns)
Amdahl's law is named after computer architect Gene Amdahl
It is used to find the maximum expected improvement to an overall system
The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program
Maximum speedup = 1 / ((1 - P) + P / N)
P: parallelizable portion of a program; N: number of processors
Source: Wikipedia
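A quick worked example (the numbers are chosen for illustration): a program whose parallelizable portion is 90% cannot be sped up by more than 10x, no matter how many processors are used. A minimal sketch in C:

```c
#include <stdio.h>

/* Amdahl's law: maximum speedup with parallel fraction p on n processors */
double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    printf("%.2f\n", amdahl(0.9, 8));        /* ~4.71 */
    printf("%.2f\n", amdahl(0.9, 1000000));  /* ~10.00: limited by the serial 10% */
    return 0;
}
```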
WB & WT Caches
[Figure: a CPU core writes X=300 to a cached copy of X, initially 100 in memory. Writeback: the cache holds X=300 while memory still holds the stale X=100 until the line is written back. Writethrough: every write also updates memory, so cache and memory both hold X=300.]
Definition of Coherence
Coherence is a property of a shared-memory architecture giving the illusion to the software that there is a single copy of every memory location, even if multiple copies exist
A multiprocessor memory system is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order
Modified Slide from Prof. H.H. Lee in Georgia Tech
Definition of Coherence (Cont.)
A multiprocessor memory system is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order
Implicit definition of coherence:
Write propagation: writes are visible to other processes
Write serialization: all writes to the same location are seen in the same order by all processes
Slide from Prof. H.H. Lee in Georgia Tech
Why Cache Coherency?
The closest cache level is private
Multiple copies of a cache line can be present across different processor nodes
Local updates (writes) lead to an incoherent state
The problem manifests in both write-through and writeback caches
[Figure: Core i7 hierarchy: per-core Reg File, L1 I$ (32KB), L1 D$ (32KB), L2 cache (256KB); shared L3 cache (8MB)]
Slide from Prof. H.H. Lee in Georgia Tech
Writeback Cache w/o Coherence
[Figure: three processors cache X=100. One processor writes X=505; with a writeback cache the new value stays in its cache, while memory and the other caches still hold X=100, so the other processors' reads return stale data.]
Slide from Prof. H.H. Lee in Georgia Tech
Writethrough Cache w/o Coherence
[Figure: one processor writes X=505, and the writethrough cache updates memory to X=505; but the other caches still hold the stale X=100, so their reads hit locally and return the old value.]
Slide from Prof. H.H. Lee in Georgia Tech
Cache Coherence Protocols According to Caching Policies
Write-through cache: update-based protocol; invalidation-based protocol
Writeback cache: update-based protocol; invalidation-based protocol
Bus Snooping based on Write-Through Cache
Every write appears as a transaction on the shared bus to memory
Two protocols: update-based protocol; invalidation-based protocol
Slide from Prof. H.H. Lee in Georgia Tech
Bus Snooping: Update-based Protocol on Write-Through Cache
[Figure: a processor writes X=505; the bus transaction updates memory, and the other caches snoop the bus and update their copies from X=100 to X=505.]
Slide from Prof. H.H. Lee in Georgia Tech
Bus Snooping: Invalidation-based Protocol on Write-Through Cache
[Figure: a processor writes X=505; memory is updated, and the snooping caches invalidate their stale copies of X. A later Load X by another processor misses and fetches X=505 from memory.]
Slide from Prof. H.H. Lee in Georgia Tech
A Simple Snoopy Coherence Protocol for a WT, No Write-Allocate Cache
Notation: observed event / generated transaction
Valid: PrRd / ---; PrWr / BusWr; snooped BusWr / --- → Invalid
Invalid: PrRd / BusRd → Valid; PrWr / BusWr (stays Invalid, since writes do not allocate)
Processor-initiated and bus-snooper-initiated transitions
Slide from Prof. H.H. Lee in Georgia Tech
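This two-state controller is small enough to write out. A minimal sketch in C (the names and structure are mine, not from the slide); each cache applies processor events locally and reacts to BusWr transactions snooped from other caches:

```c
#include <stddef.h>

typedef enum { INVALID, VALID } State;
typedef enum { PR_RD, PR_WR, BUS_WR } Event;

/* Returns the next state; *bus_op receives the transaction to post, if any. */
State next_state(State s, Event e, const char **bus_op)
{
    *bus_op = NULL;
    switch (e) {
    case PR_RD:
        if (s == INVALID) { *bus_op = "BusRd"; return VALID; }
        return VALID;               /* read hit: no bus transaction */
    case PR_WR:
        *bus_op = "BusWr";          /* write-through: every write goes on the bus */
        return s;                   /* no write-allocate: Invalid stays Invalid */
    case BUS_WR:
        return INVALID;             /* another cache wrote: invalidate our copy */
    }
    return s;
}
```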
How about Writeback Caches?
A WB cache reduces the bandwidth requirement: the majority of local writes are hidden behind the processor nodes
Two questions arise: how to snoop, and how to preserve write ordering
Slide from Prof. H.H. Lee in Georgia Tech
Cache Coherence Protocols for WB Caches
A cache has an exclusive copy of a line if it is the only cache having a valid copy (memory may or may not have an up-to-date copy)
For a modified (dirty) cache line, the cache holding it is the owner of the line, because it must supply the block
Slide from Prof. H.H. Lee in Georgia Tech
Update-based Protocol on WB Cache
[Figure: on Store X, the writing processor broadcasts the new value X=505 on the bus, and the sharers' caches update their copies from X=100.]
Update the data for all processor nodes who share the same data
Because a processor node keeps updating the memory location, a lot of traffic will be incurred
Slide from Prof. H.H. Lee in Georgia Tech
Update-based Protocol on WB Cache (Cont.)
[Figure: another Store X (X=333) is again broadcast and snooped into the sharers' caches, so a subsequent Load X on another processor hits locally.]
Update the data for all processor nodes who share the same data
Because a processor node keeps updating the memory location, a lot of traffic will be incurred
Slide from Prof. H.H. Lee in Georgia Tech
Invalidation-based Protocol on WB Cache
[Figure: on Store X (X=505), the writing cache posts a bus transaction that invalidates the other copies (X=100); memory is not updated yet.]
Invalidate the data copies for the sharing processor nodes
Reduced traffic when a processor node keeps updating the same memory location
Slide from Prof. H.H. Lee in Georgia Tech
Invalidation-based Protocol on WB Cache (Cont.)
[Figure: a later Load X by another processor misses in its own cache; the snoop hits in the owner's cache, which supplies X=505 on the bus.]
Invalidate the data copies for the sharing processor nodes
Reduced traffic when a processor node keeps updating the same memory location
Slide from Prof. H.H. Lee in Georgia Tech
Invalidation-based Protocol on WB Cache (Cont.)
[Figure: after the other copies are invalidated, the owning processor keeps storing to X (X=333, X=444, X=987) with no further bus transactions.]
Invalidate the data copies for the sharing processor nodes
Reduced traffic when a processor node keeps updating the same memory location
Slide from Prof. H.H. Lee in Georgia Tech
MSI Writeback Invalidation Protocol
Modified: dirty; only this cache has a valid copy
Shared: memory is consistent; one or more caches have a valid copy
Invalid
Writeback protocol: a cache line can be written multiple times before the memory is updated
Slide from Prof. H.H. Lee in Georgia Tech
MSI Writeback Invalidation Protocol (Cont.)
Two types of requests from the processor: PrRd and PrWr
Three types of bus transactions posted by the cache controller:
BusRd: a PrRd misses the cache; memory or another cache supplies the line
BusRdX (read-to-own): a PrWr is issued to a line that is not in the Modified state
BusWB: writeback due to replacement; the processor is not directly involved in initiating this operation
Slide from Prof. H.H. Lee in Georgia Tech
MSI Writeback Invalidation Protocol (Processor Request)
Modified: PrRd / ---; PrWr / ---
Shared: PrRd / ---; PrWr / BusRdX → Modified
Invalid: PrRd / BusRd → Shared; PrWr / BusRdX → Modified
Processor-initiated transitions
Slide from Prof. H.H. Lee in Georgia Tech
MSI Writeback Invalidation Protocol (Bus Transaction)
Modified: BusRd / Flush → Shared; BusRdX / Flush → Invalid
Shared: BusRd / ---; BusRdX / --- → Invalid
Flush puts the data on the bus; both memory and the requestor grab the copy
The requestor gets the data from either a cache-to-cache transfer or memory
Bus-snooper-initiated transitions
Slide from Prof. H.H. Lee in Georgia Tech
MSI Writeback Invalidation Protocol (Bus Transaction): Another Possible Implementation
Same as above, except Modified: BusRd / Flush → Invalid
Rationale: anticipate no more reads from this processor
A performance concern: it saves the later "invalidation" trip if the requesting cache writes the shared line afterwards
Bus-snooper-initiated transitions
Slide from Prof. H.H. Lee in Georgia Tech
MSI Writeback Invalidation Protocol (Complete)
Modified: PrRd, PrWr / ---; BusRd / Flush → Shared; BusRdX / Flush → Invalid
Shared: PrRd / ---; PrWr / BusRdX → Modified; BusRd / ---; BusRdX / --- → Invalid
Invalid: PrRd / BusRd → Shared; PrWr / BusRdX → Modified
Processor-initiated and bus-snooper-initiated transitions
Slide from Prof. H.H. Lee in Georgia Tech
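The whole MSI protocol likewise fits in one transition function. A minimal sketch in C (state and event names are mine, not from the slides); it returns the next state and reports the bus transaction or flush to perform:

```c
#include <stddef.h>

typedef enum { MSI_I, MSI_S, MSI_M } MsiState;
typedef enum { E_PR_RD, E_PR_WR, E_BUS_RD, E_BUS_RDX } MsiEvent;

/* One MSI transition. *action receives "BusRd", "BusRdX", "Flush", or NULL. */
MsiState msi_next(MsiState s, MsiEvent e, const char **action)
{
    *action = NULL;
    switch (s) {
    case MSI_M:
        if (e == E_BUS_RD)  { *action = "Flush"; return MSI_S; }
        if (e == E_BUS_RDX) { *action = "Flush"; return MSI_I; }
        return MSI_M;                      /* PrRd/PrWr hit: no bus traffic */
    case MSI_S:
        if (e == E_PR_WR)   { *action = "BusRdX"; return MSI_M; }
        if (e == E_BUS_RDX) { return MSI_I; }
        return MSI_S;                      /* PrRd hit or snooped BusRd */
    case MSI_I:
        if (e == E_PR_RD)   { *action = "BusRd";  return MSI_S; }
        if (e == E_PR_WR)   { *action = "BusRdX"; return MSI_M; }
        return MSI_I;                      /* snooped traffic is ignored */
    }
    return s;
}
```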
MSI Example
X is initially 10 in memory; the slides build this table one row at a time as P1, P2, and P3 access X:

Processor Action  | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
P1 reads X        | S           | ---         | ---         | BusRd           | Memory
P3 reads X        | S           | ---         | S           | BusRd           | Memory
P3 writes X (-25) | I           | ---         | M           | BusRdX          | ---
P1 reads X        | S           | ---         | S           | BusRd           | P3 Cache
P2 reads X        | S           | S           | S           | BusRd           | Memory

On P3's write, P1's copy is invalidated; on P1's later read, P3 flushes X=-25, updating memory, and supplies the data cache-to-cache.
Slide from Prof. H.H. Lee in Georgia Tech
MESI Writeback Invalidation Protocol
Goal: reduce two types of unnecessary bus transactions
A BusRdX that snoops and converts the block from S to M when you are already the sole owner of the block
A BusRd that gets the line in the S state when there are no sharers (which leads to the overhead above)
Introduce the Exclusive state: one can write to the copy without generating a BusRdX
Illinois Protocol: proposed by Papamarcos and Patel in 1984
Employed in Intel, PowerPC, and MIPS processors
Slide from Prof. H.H. Lee in Georgia Tech
MESI Writeback Invalidation (Processor Request)
Invalid: PrRd / BusRd (S) → Shared; PrRd / BusRd (not-S) → Exclusive; PrWr / BusRdX → Modified
Shared: PrRd / ---; PrWr / BusRdX → Modified
Exclusive: PrRd / ---; PrWr / --- → Modified (no bus transaction)
Modified: PrRd, PrWr / ---
S: shared signal
Processor-initiated transitions
Slide from Prof. H.H. Lee in Georgia Tech
MESI Writeback Invalidation Protocol (Bus Transactions)
Whenever possible, the Illinois protocol performs a $-to-$ transfer rather than having memory supply the data
Use a selection algorithm if there are multiple suppliers (alternative: add an O state, or force an update of memory)
Modified: BusRd / Flush → Shared; BusRdX / Flush → Invalid
Exclusive: BusRd / Flush (or ---) → Shared; BusRdX / --- → Invalid
Shared: BusRd / Flush*; BusRdX / Flush* → Invalid
Flush*: Flush for the data supplier; no action for the other sharers
Bus-snooper-initiated transitions
Modified Slide from Prof. H.H. Lee in Georgia Tech
MESI Writeback Invalidation Protocol (Illinois Protocol)
Invalid: PrRd / BusRd (S) → Shared; PrRd / BusRd (not-S) → Exclusive; PrWr / BusRdX → Modified
Shared: PrRd / ---; PrWr / BusRdX → Modified; BusRd / Flush*; BusRdX / Flush* → Invalid
Exclusive: PrRd / ---; PrWr / --- → Modified; BusRd / Flush (or ---) → Shared; BusRdX / --- → Invalid
Modified: PrRd, PrWr / ---; BusRd / Flush → Shared; BusRdX / Flush → Invalid
S: shared signal; Flush*: Flush for the data supplier; no action for the other sharers
Processor-initiated and bus-snooper-initiated transitions
Slide from Prof. H.H. Lee in Georgia Tech
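A sketch of what MESI adds on top of MSI, again in C with made-up names: the shared signal sampled on a read miss decides between E and S, and the E-to-M upgrade is silent (no BusRdX):

```c
#include <stddef.h>

typedef enum { ST_I, ST_S, ST_E, ST_M } MesiState;

/* Read miss: the snooped shared signal picks the destination state. */
MesiState mesi_read_miss(int shared_signal, const char **action)
{
    *action = "BusRd";
    return shared_signal ? ST_S : ST_E;   /* sharers exist? S : E */
}

/* Processor write: only S (and I) pay a BusRdX to reach M. */
MesiState mesi_write(MesiState s, const char **action)
{
    *action = NULL;
    if (s == ST_E) return ST_M;                       /* silent upgrade: sole owner */
    if (s == ST_S || s == ST_I) { *action = "BusRdX"; return ST_M; }
    return ST_M;                                      /* already Modified: plain hit */
}
```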
MOESI Protocol
Introduce a notion of ownership: the Owned state
Similar to the Shared state, but the processor in the O state is responsible for supplying the data (the copy in memory may be stale)
Employed by Sun UltraSparc and AMD Opteron
In the dual-core Opteron, cache-to-cache transfer is done through a System Request Interface (SRI) running at full CPU speed
[Figure: CPU0 and CPU1, each with an L2, connect to the System Request Interface, then through a crossbar to HyperTransport and the memory controller]
Modified Slide from Prof. H.H. Lee in Georgia Tech
MOESI Writeback Invalidation Protocol (Processor Request)
Invalid: PrRd / BusRd (S) → Shared; PrRd / BusRd (not-S) → Exclusive; PrWr / BusRdX → Modified
Shared: PrRd / ---; PrWr / BusRdX → Modified
Owned: PrRd / ---; PrWr / BusRdX → Modified
Exclusive: PrRd / ---; PrWr / --- → Modified
Modified: PrRd, PrWr / ---
S: shared signal
Processor-initiated transitions
MOESI Writeback Invalidation Protocol (Bus Transactions)
Modified: BusRd / Flush → Owned; BusRdX / Flush → Invalid
Owned: BusRd / Flush (stays Owned, supplying the data); BusRdX / Flush → Invalid
Exclusive: BusRd / Flush (or ---) → Shared; BusRdX / --- → Invalid
Shared: BusRd / Flush* (or ---); BusRdX / Flush* (or ---) → Invalid
Flush*: Flush for the data supplier; no action for the other sharers
Bus-snooper-initiated transitions
MOESI Writeback Invalidation Protocol (Bus Transactions, Cont.)
A variant without the Flush* option, since the owner supplies the data:
Modified: BusRd / Flush → Owned; BusRdX / Flush → Invalid
Owned: BusRd / Flush (stays Owned); BusRdX / Flush → Invalid
Exclusive: BusRd / Flush → Shared; BusRdX / --- → Invalid
Shared: BusRd / ---; BusRdX / --- → Invalid
Bus-snooper-initiated transitions
Transient States in MSI
Design issue: a coherence transaction is not atomic
In MESI, for example, I → E or S depending on the shared signal; the next state cannot be determined until the request is launched on the bus and the snoop result is available
BusRdX reads a memory block and invalidates the other copies
BusUpgr invalidates potential remote cache copies (no data fetch is needed)
Atomic & Non-atomic Buses
A split-transaction bus increases the available bus bandwidth by breaking up a transaction into subtransactions
[Figure: on an atomic bus, each transaction occupies the bus from its address phase until its data returns (addr1/read/data, then addr2/read/data). On a non-atomic bus (pipelined or split-transaction bus), several address phases (addr1..addr4) are issued before the corresponding data phases complete.]
Issues with Pipelined Buses
[Figure: several transactions to the same address (addr1) are outstanding at once, e.g., three reads and a write, which would conflict.]
The SGI Challenge (mid-1990s) has a system-wide table in each node to book-keep all outstanding requests
A request is launched only if no entry in the table matches the address of the request
Silicon Graphics, Inc. was an American manufacturer of high-performance computing solutions, including computer hardware and software. (Wikipedia)
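A toy sketch of that book-keeping in C (the table structure is invented for illustration): before a new transaction is launched, its block address is checked against the table of outstanding requests.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_OUTSTANDING 8

/* Block addresses with transactions still in flight on the split bus. */
static unsigned long outstanding[MAX_OUTSTANDING];
static size_t n_outstanding = 0;

/* Launch only if no in-flight transaction targets the same block. */
bool try_launch(unsigned long block_addr)
{
    for (size_t i = 0; i < n_outstanding; i++)
        if (outstanding[i] == block_addr)
            return false;          /* address conflict: retry later */
    if (n_outstanding == MAX_OUTSTANDING)
        return false;              /* table full: stall the request */
    outstanding[n_outstanding++] = block_addr;
    return true;
}
```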
SGI Challenge http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/0620/bks/SGI_Developer/books/REACT_PG/sgi_html/ch02.html http://www.computinghistory.org.uk/det/11263/SGI-Challenge-10000/ http://en.wikipedia.org/wiki/SGI_Challenge
Inclusion & Exclusion Properties
[Figure: two CPU cores, each with a Reg File, L1 I$, L1 D$, and L2, above main memory]
Inclusion property:
A block in L1 must be in L2 as well
An L2 block eviction causes the invalidation of the corresponding L1 block
An L1 write causes an L2 update
The effective cache size is equal to the L2 size
Desirable for cache coherence
Exclusion property:
A block is located in either L1 or L2, not both
When an L1 block is replaced, it can be placed in L2
Better utilization of hardware resources
Cache Hierarchies
Effective cache sizes:
Inclusive: LLC
Non-inclusive: between LLC and LLC + L1s
Exclusive: LLC + L1s
We assume a CMP when computing the effective cache size, so there are several L1s
In the non-inclusive case, LLC lines can be victimized by the bring-in of new lines from other L1s' misses
"Achieving Non-Inclusive Cache Performance with Inclusive Caches", MICRO, 2010
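A quick illustrative calculation (the sizes are a plausible CMP configuration, not taken from the slides): with four cores, each having a 32KB L1 I$ and a 32KB L1 D$, and an 8MB LLC:
Inclusive: 8MB effective (every L1 line is duplicated in the LLC)
Exclusive: 8MB + 4 x (32KB + 32KB) = 8.25MB effective
Non-inclusive: between 8MB and 8.25MB, depending on how much L1 content also happens to reside in the LLC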
Coherency in a Multi-level Cache Hierarchy
If L2 is exclusive, L1 must be snooped directly: all incoming bus requests contend with the CPU core for L1
[Figure: two CPU cores, each with a Reg File, L1 I$, L1 D$, and L2 cache, above main memory]
Coherency in a Multi-level Cache Hierarchy (Cont.)
If L2 is inclusive, L2 can be used as a snoop filter
An L2 line eviction forces the eviction of the corresponding L1 line
If L1 is a writeback cache, the blocks in L1 and L2 are not consistent, so L1 would still have to be snooped; a writethrough policy in L1 is therefore desirable
[Figure: same two-core hierarchy as above]
Nehalem Case Study
[Figure: per core, a Reg File with L1 I$ (32KB) and L1 D$ (32KB), 4-cycle, writeback; a private L2 cache (8-way, 256KB), writeback, non-inclusive; a shared L3 cache (8MB), writeback, inclusive; then main memory]
"In the L1 data cache and in the L2/L3 unified caches, the MESI (modified, exclusive, shared, invalid) cache protocol maintains consistency with caches of other processors. The L1 data cache and the L2/L3 unified caches have two MESI status flags per cache line."
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
Nehalem Uncore
Sandy Bridge
Transactions from each core travel along the ring
LLC slices (2MB each) are connected to the ring
http://www.behardware.com/articles/815-3/intel-core-i7-and-core-i5-lga-1155-sandy-bridge.html
TLB and Virtual Memory
[Figure: each process (e.g., Hello world, MS Word, Windows XP) has its own virtual (linear) address space divided into pages. The CPU core issues virtual addresses; the TLB in the MMU translates them to physical addresses in main memory, and pages not resident in memory live on the hard disk.]
MMU: Memory Management Unit
TLB with a Cache
The TLB is a cache for the page table
A cache is a cache (?) for instructions and data
Modern processors typically use the physical address to access caches
[Figure: the CPU core issues a virtual address; the MMU's TLB produces the physical address, which accesses the cache; misses go to main memory, which holds the page table as well as instructions and data.]
Core i7 Case Study
[Figure: per core, a Reg File with ITLB/DTLB, L1 I$ (32KB) and L1 D$ (8-way, 32KB), VIPT; a private L2 cache (8-way, 256KB), PIPT, non-inclusive; a shared L3 cache (8MB), PIPT, inclusive; then main memory]
L1: VIPT (virtually indexed, physically tagged); L2 and L3: PIPT (physically indexed, physically tagged)
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
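A short sanity check on why this L1 D$ can be VIPT (my arithmetic, assuming 64B lines and 4KB pages, which the slide does not state): 32KB / 8 ways = 4KB per way, so 6 block-offset bits + 6 set-index bits = 12 bits, exactly the 4KB page offset. The index therefore comes entirely from address bits that are identical in the virtual and physical address, and the cache can be indexed in parallel with the TLB lookup without aliasing.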
TLB Shootdown
TLB inconsistency arises when a PTE cached in a TLB is modified: the PTE copies in other TLBs and in main memory become stale
Two cases: a virtual-to-physical mapping change by the OS, and a page access-right change by the OS
TLB shootdown procedure (similar to page-fault handling):
A processor invokes the virtual memory manager, which generates an IPI (inter-processor interrupt)
Each processor invokes a software handler to remove the stale PTE and invalidate all the block copies in its private caches
[Figure: two CPU cores, each with a Reg File, TLB, and cache, above main memory]
False Sharing
Data is loaded into a cache at block granularity (for example, 64B)
CPUs can share a block even though each CPU never uses the data modified by the other CPUs
[Figure timeline: #1 CPU0 writes, #2 CPU1 writes, #3 CPU0 writes, #4 CPU1 reads; the block ping-pongs between the two caches although the cores touch different words.]
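A minimal sketch of false sharing in C with pthreads (the counter layout and iteration count are invented for illustration): both counters of shared_block live in one 64B cache block, so each increment invalidates the other core's copy, while padded_block gives each counter its own block.

```c
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

/* Both counters share one 64B block: writes by the two threads make the
 * block ping-pong between the cores' caches under an invalidation protocol. */
struct { long a, b; } shared_block;

/* Padded version: a and b land in different 64B blocks. */
struct { _Alignas(64) long a; char pad[56]; long b; } padded_block;

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) padded_block.a++;  /* try shared_block.a */
    return NULL;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) padded_block.b++;  /* try shared_block.b */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld %ld\n", padded_block.a, padded_block.b);
    return 0;
}
```
Switching the loops to shared_block typically slows this benchmark down severalfold, even though the two threads never touch the same variable.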
Backup Slides
Intel Core 2 Duo
Homogeneous cores
Bus-based on-chip interconnect
Shared on-die cache: large, shared, set-associative, with prefetching, etc.
Memory and traditional I/O
Classic OOO cores: reservation stations, issue ports, schedulers, etc.
Source: Intel Corp.
Core 2 Duo Microarchitecture
Why Share the On-die L2?
Intel Quad-Core Processor (Kentsfield, Clovertown)
AMD Barcelona’s Cache Architecture Source: AMD