Cache Coherence CS433 Spring 2001 Laxmikant Kale.

2 Designing a shared memory machine
The architecture must support sequential consistency:
–Programs must behave as if multiple sequential executions are interleaved (w.r.t. memory accesses),
–even in the presence of out-of-order execution by individual processors.
This is not hard to do if you have a serializing component such as a bus (or the memory itself): all accesses go through the same bus. But that is not all:
–Processors have caches: cache coherence
–Machines are not bus-based: large scalable machines with complex interconnection networks make it harder to satisfy sequential consistency.
Is sequential consistency really necessary?
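The "interleaved sequential executions" idea can be made concrete with a small sketch. The two-processor program below (a hypothetical Dekker-style example, not from the slides) enumerates every interleaving that preserves each processor's program order and collects the outcomes sequential consistency allows; the outcome (0, 0) never appears.

```python
from itertools import permutations  # unused; interleavings are built recursively

# Hypothetical program (names illustrative):
#   P0: x = 1; r0 = y        P1: y = 1; r1 = x
# Each op is (proc, kind, var). We enumerate every interleaving that
# preserves each processor's program order and collect the (r0, r1)
# outcomes sequential consistency allows.
P0 = [("w", "x"), ("r", "y")]
P1 = [("w", "y"), ("r", "x")]

def interleavings(a, b):
    if not a: yield list(b); return
    if not b: yield list(a); return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

outcomes = set()
for order in interleavings([("P0",) + op for op in P0],
                           [("P1",) + op for op in P1]):
    mem = {"x": 0, "y": 0}
    regs = {}
    for proc, kind, var in order:
        if kind == "w":
            mem[var] = 1
        else:
            regs[proc] = mem[var]
    outcomes.add((regs["P0"], regs["P1"]))

print(sorted(outcomes))  # (0, 0) is never among the SC outcomes
```

An out-of-order machine that lets each processor's read bypass its earlier write can produce (0, 0), which is exactly the behavior sequential consistency forbids.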

3 Topic outline
Review:
–Cache coherence problem
–Bus-based “snooping” protocols for guaranteeing cache coherence and sequential consistency
Directory-based protocols for large machines
–Origin 2000, …
Relaxed consistency models

4 Cache coherence problem
Each processor maintains a cache:
–Some locations are stored in two places: cache and memory
–Not a problem on uniprocessors: cache controllers know where to look
–Multiple processors: if a cache line is in two processors’ caches at the same time, a write from one won’t be seen by the other
–If a 3rd processor wants to read, should it get the value from memory, or from the cache of another processor?
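A minimal sketch of the problem, with no coherence protocol at all (all names are illustrative): two write-back caches each pull a copy of location A, one processor writes its private copy, and the other keeps reading the stale value.

```python
# Tiny model of the coherence problem (no protocol; names illustrative).
# Memory holds location A; each cache copies it on first read, and a
# write-back cache updates only the local copy.
memory = {"A": 5}

class Cache:
    def __init__(self):
        self.lines = {}
    def read(self, addr):
        if addr not in self.lines:          # miss: fetch from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]
    def write(self, addr, value):
        self.read(addr)                     # bring the line in
        self.lines[addr] = value            # write-back: memory untouched

p0, p1 = Cache(), Cache()
p0.read("A")         # both caches now hold A = 5
p1.read("A")
p0.write("A", 42)    # P0 updates only its own copy
print(p1.read("A"))  # P1 still sees the stale value 5
```

This is exactly the "write from one won't be seen by the other" case: nothing ever tells p1 that its copy is out of date.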

5 Formal definition of coherence
Results of a program: the values returned by its read operations.
A memory system is coherent if the results of any execution of a program are such that, for each location, it is possible to construct a hypothetical serial order of all operations to that location that is consistent with the results of the execution and in which:
1. operations issued by any particular process occur in the order issued by that process, and
2. the value returned by a read is the value written by the last write to that location in the serial order.
Two necessary features:
–Write propagation: a written value must become visible to others
–Write serialization: writes to a location are seen in the same order by all (if I see w1 after w2, you should not see w2 before w1); no analogous read serialization is needed, since reads are not visible to others
(From the Culler–Singh textbook/slides)
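The definition is operational: a trace is coherent iff some serial order with those two properties exists. For tiny single-location traces this can be checked by brute force; the sketch below (illustrative, not from the slides) searches all serial orders.

```python
from itertools import permutations

# Brute-force check of the definition above for a single location.
# ops[i] = (kind, proc, value): a processor's program order is the order
# in which its ops appear in the list. Illustrative sketch only.
def coherent(ops, initial=0):
    for perm in permutations(range(len(ops))):
        pos = {i: p for p, i in enumerate(perm)}
        # 1. respect each processor's program order
        by_proc = {}
        for i, (_, proc, _) in enumerate(ops):
            by_proc.setdefault(proc, []).append(i)
        if any(pos[a] > pos[b]
               for seq in by_proc.values()
               for a, b in zip(seq, seq[1:])):
            continue
        # 2. each read returns the last write before it in the serial order
        val, ok = initial, True
        for i in perm:
            kind, _, v = ops[i]
            if kind == "w":
                val = v
            elif v != val:
                ok = False
                break
        if ok:
            return True
    return False

trace_ok  = [("w", "P0", 1), ("r", "P1", 1), ("r", "P1", 1)]
trace_bad = [("w", "P0", 1), ("r", "P1", 1), ("r", "P1", 0)]  # stale re-read
print(coherent(trace_ok), coherent(trace_bad))   # True False
```

The second trace is incoherent because P1 first observes the write and then re-reads the old value: no serial order can place the same read both after and before the write.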

6 Snooping protocols
A solution for bus-based multiprocessors: have all cache controllers monitor the bus,
–so each one knows (or can find out) where every cache line is.
Different protocols exist:
–Maintain a state for each cache line
–Take an action based on the state and on an access by my processor, or by another
[Figure: processors PE0 … PE p-1, each with a cache, connected by a bus to memories Mem0 … Mem p-1]

7 Write-through vs. write-back caches
When a processor writes to a location that is in its cache, should it also change the memory?
–Yes: write-through cache
–No: write-back cache

8 Simple protocol for write-through caches
There is one state bit (valid or invalid) for each cache block.
–If there are multiple readers, they can all have private copies.
–If you see anyone else doing a write (BusWr), invalidate your copy.
What hardware support do you need?
(From the Culler–Singh–Gupta textbook)
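The two-state protocol above can be sketched in a few lines (class and method names are illustrative): every write goes through to memory and appears on the bus, and snooping caches discard their copies.

```python
# Sketch of the valid/invalid write-through snooping protocol described
# above. Each cache tracks one valid bit per block; a BusWr snooped from
# another cache invalidates the local copy. Names are illustrative.
memory = {}

class WTCache:
    def __init__(self, bus):
        self.valid = set()
        self.bus = bus
        bus.append(self)
    def pr_rd(self, addr):
        self.valid.add(addr)                # fetch on miss; block now valid
        return memory.get(addr, 0)
    def pr_wr(self, addr, value):
        memory[addr] = value                # write-through: memory updated
        self.valid.add(addr)
        for other in self.bus:              # BusWr: others snoop and invalidate
            if other is not self:
                other.valid.discard(addr)

bus = []
c0, c1 = WTCache(bus), WTCache(bus)
c0.pr_rd("A"); c1.pr_rd("A")             # both copies valid
c0.pr_wr("A", 7)                         # c1's copy is invalidated
print("A" in c1.valid, c1.pr_rd("A"))    # False, then the re-read gets 7
```

The only hardware support this needs is a per-block valid bit plus an address comparator that watches bus writes, which is why the protocol is attractive despite write-through's bandwidth cost.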

9 Write-back caches
Write-through caches are not used much, because of their disadvantages compared with write-back caches:
–Performance: every write goes to memory; bus accesses use memory bandwidth, limiting scalability. Writing to memory is often unnecessary.
–The processor waits for writes to complete before issuing the next instruction, to satisfy sequential consistency, but memory is slow to respond.
–Other solutions? Some reordering may be OK, but memory operations cannot be pipelined.

10 SC in the write-through example
The protocol provides SC, not just coherence. Extend the arguments used for coherence:
–Writes and read misses to all locations are serialized by the bus into bus order
–If a read obtains the value of write W, W is guaranteed to have completed, since it caused a bus transaction
–When write W is performed w.r.t. any processor, all previous writes in bus order have completed

11 Design space for snooping protocols
No need to change the processor, main memory, or cache:
–Extend the cache controller and exploit the bus (which provides serialization)
Focus on protocols for write-back caches. The dirty state now also indicates exclusive ownership:
–Exclusive: the only cache with a valid copy (main memory may have one too)
–Owner: responsible for supplying the block upon a request for it
Design space:
–Invalidation- versus update-based protocols
–Set of states

12 Invalidation-based protocols
Exclusive means the block can be modified without notifying anyone else, i.e. without a bus transaction.
–Must first get the block in exclusive state before writing into it
–Even if it is already in valid state, a transaction is needed, so this is called a write miss
A store to non-dirty data generates a read-exclusive bus transaction:
–Tells others about the impending write and obtains exclusive ownership
–Makes the write visible, i.e. the write is performed; it may actually be observed (by a read miss) only later
–A write hit is made visible (performed) when the block is updated in the writer’s cache
–Only one RdX can succeed at a time for a block: serialized by the bus
Read and read-exclusive bus transactions drive coherence actions.
–Writeback transactions do too, but they are not caused by a memory operation and are quite incidental to the coherence protocol; note that a replaced block that is not in modified state can be dropped.

13 Update-based protocols
A write operation updates values in other caches, via a new "update" bus transaction.
Advantages:
–Other processors don’t miss on their next access: reduced latency. In invalidation protocols, they would miss and cause more transactions.
–A single bus transaction updating several caches can save bandwidth; also, only the word written is transferred, not the whole block.
Disadvantages:
–Multiple writes by the same processor cause multiple update transactions. In invalidation, the first write gets exclusive ownership and the later ones stay local.
The detailed tradeoffs are more complex.

14 Invalidate versus update
The basic question is one of program behavior: is a block written by one processor read by others before it is rewritten?
Invalidation:
–Yes => readers will take a miss
–No => multiple writes without additional traffic; also clears out copies that won’t be used again
Update:
–Yes => readers will not miss if they had a copy previously; a single bus transaction updates all copies
–No => multiple useless updates, even to dead copies
Need to look at program behavior and hardware complexity. Invalidation protocols are much more popular (more later); some systems provide both, or even a hybrid.
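The tradeoff can be made concrete with a back-of-the-envelope traffic count. The sketch below is a deliberately simplified model (the cost rules and traces are illustrative, not from the slides): it charges one bus transaction per miss, per invalidating write, and per update that reaches another copy.

```python
# Back-of-the-envelope traffic comparison (simplified, illustrative).
# Trace entries are (proc, op, block). Invalidate: a write needing
# ownership costs one BusRdX/BusUpgr and empties other copies, so their
# next read misses (BusRd). Update: every write to a shared block costs
# one update transaction, but readers keep their copies.
def count_transactions(trace, policy):
    holders = {}          # block -> set of procs holding a valid copy
    bus = 0
    for proc, op, blk in trace:
        h = holders.setdefault(blk, set())
        if op == "r":
            if proc not in h:
                bus += 1                      # BusRd on a miss
            h.add(proc)
        else:  # write
            others = h - {proc}
            if policy == "invalidate":
                if others or proc not in h:
                    bus += 1                  # BusRdX/BusUpgr
                holders[blk] = {proc}         # other copies invalidated
            else:  # update
                if others:
                    bus += 1                  # one update reaches all copies
                h.add(proc)
    return bus

# Producer-consumer: P0 writes, P1 reads each value -> update wins.
pc = [("P0", "w", "B"), ("P1", "r", "B")] * 4
# Private reuse: P0 rewrites a block P1 read once -> invalidate wins.
priv = [("P1", "r", "B")] + [("P0", "w", "B")] * 4
for name, t in [("producer-consumer", pc), ("private reuse", priv)]:
    print(name, count_transactions(t, "invalidate"),
          count_transactions(t, "update"))
```

On the producer-consumer trace the update policy generates less bus traffic, and on the private-reuse trace invalidation does, which is the program-behavior dependence the slide describes.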

15 Basic MSI writeback invalidation protocol
States:
–Invalid (I)
–Shared (S): one or more copies
–Dirty or Modified (M): one copy only
Processor events:
–PrRd (read)
–PrWr (write)
Bus transactions:
–BusRd: asks for a copy with no intent to modify
–BusRdX: asks for a copy with intent to modify
–BusWB: updates memory
Actions:
–Update state, perform a bus transaction, flush the value onto the bus

16 State transition diagram
–Write to a shared block: we already have the latest data, so an upgrade (BusUpgr) can be used instead of BusRdX
–Replacement changes the state of two blocks: outgoing and incoming
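The diagram referred to above can be approximated as two lookup tables, one keyed by processor events and one by snooped bus transactions. This is only a sketch of the control logic (the real controller also moves the data and handles replacements).

```python
# Sketch of the MSI transitions as lookup tables (illustrative).
# Each entry maps to (next state, bus transaction or snoop action).
M, S, I = "M", "S", "I"

# (state, processor event) -> (next state, bus transaction issued)
proc_t = {
    (I, "PrRd"): (S, "BusRd"),
    (I, "PrWr"): (M, "BusRdX"),
    (S, "PrRd"): (S, None),
    (S, "PrWr"): (M, "BusRdX"),   # or BusUpgr, since the data is present
    (M, "PrRd"): (M, None),
    (M, "PrWr"): (M, None),
}
# (state, snooped bus transaction) -> (next state, action)
snoop_t = {
    (S, "BusRd"):  (S, None),
    (S, "BusRdX"): (I, None),
    (M, "BusRd"):  (S, "Flush"),  # supply the dirty block, demote to S
    (M, "BusRdX"): (I, "Flush"),
    (I, "BusRd"):  (I, None),
    (I, "BusRdX"): (I, None),
}

state = I
state, bus = proc_t[(state, "PrRd")];   print(state, bus)   # S BusRd
state, bus = proc_t[(state, "PrWr")];   print(state, bus)   # M BusRdX
state, act = snoop_t[(state, "BusRd")]; print(state, act)   # S Flush
```

The short trace at the bottom walks one block through a read miss, a write upgrade, and a snooped read from another processor that forces the dirty copy back to S.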

17 Satisfying coherence
Write propagation is clear. Write serialization?
–All writes that appear on the bus (BusRdX) are ordered by the bus. A write is performed in the writer’s cache before it handles other transactions, so writes are ordered the same way even w.r.t. the writer.
–Reads that appear on the bus are ordered w.r.t. these.
–Writes that don’t appear on the bus: a sequence of such writes between two bus transactions for the block must come from the same processor, say P. In the serialization, the sequence appears between these two bus transactions; reads by P will see them in this order w.r.t. other bus transactions; reads by other processors are separated from the sequence by a bus transaction, which places them in the serialized order w.r.t. the writes; so reads by all processors see the writes in the same order.

18 Satisfying sequential consistency
1. Appeal to the definition:
–The bus imposes a total order on bus transactions for all locations
–Between transactions, processors perform reads/writes locally in program order
–So any execution defines a natural partial order: Mj is subsequent to Mi if (i) Mj follows Mi in program order on the same processor, or (ii) Mj generates a bus transaction that follows the memory operation for Mi
–In a segment between two bus transactions, any interleaving of operations from different processors leads to a consistent total order
–In such a segment, writes observed by processor P are serialized as follows: writes from other processors, by the previous bus transaction P issued; writes from P, by program order
2. Show that the sufficient conditions are satisfied:
–Write completion: we can detect when a write appears on the bus
–Write atomicity: if a read returns the value of a write, that write has already become visible to all others (one can reason through the different cases)

19 Lower-level protocol choices
A BusRd is observed in the M state: what transition should we make? It depends on expectations of access patterns.
–S: assumes that I will read again soon, rather than the other processor writing; good for mostly-read data
–What about “migratory” data? (I read and write, then you read and write, then X reads and writes, ...) Then it is better to go to the I state, so I don’t have to be invalidated on your write
–Synapse transitioned to the I state
–Sequent Symmetry and MIT Alewife use adaptive protocols
These choices can affect the performance of the memory system (more later).

20 MESI (4-state) invalidation protocol
Problem with the MSI protocol:
–Reading and then modifying data takes 2 bus transactions, even if no one is sharing, e.g. even in a sequential program: BusRd (I -> S) followed by BusRdX or BusUpgr (S -> M)
Add an exclusive state: write locally without a transaction, but the block is not modified.
–Main memory is up to date, so the cache is not necessarily the owner
–States: invalid; exclusive or exclusive-clean (only this cache has a copy, but it is not modified); shared (two or more caches may have copies); modified (dirty)
–I -> E on PrRd if no one else has a copy; needs a “shared” signal on the bus: a wired-OR line asserted in response to BusRd
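The E-state optimization hinges on that wired-OR shared line. The sketch below (class and method names are illustrative) models just that choice: a read miss goes to E when no peer asserts the line, which later lets the write upgrade E -> M silently, with no bus transaction.

```python
# Sketch of the MESI read-miss choice described above: on PrRd from I,
# the wired-OR "shared" signal decides between E (no other copy) and S.
# Illustrative only; data movement and writebacks are omitted.
class MESICache:
    def __init__(self, peers):
        self.state = {}
        self.peers = peers
        peers.append(self)
    def pr_rd(self, blk):
        if self.state.get(blk, "I") == "I":
            # BusRd: peers assert the shared line if they hold the block
            shared = any(p.state.get(blk, "I") != "I"
                         for p in self.peers if p is not self)
            for p in self.peers:              # snoop: M/E holders drop to S
                if p is not self and p.state.get(blk, "I") in ("M", "E"):
                    p.state[blk] = "S"
            self.state[blk] = "S" if shared else "E"
        return self.state[blk]
    def pr_wr(self, blk):
        if self.state.get(blk, "I") == "E":
            self.state[blk] = "M"             # silent upgrade: no transaction
        else:
            for p in self.peers:              # BusRdX invalidates other copies
                if p is not self:
                    p.state[blk] = "I"
            self.state[blk] = "M"

peers = []
c0, c1 = MESICache(peers), MESICache(peers)
print(c0.pr_rd("A"))  # E: no one else had a copy
c0.pr_wr("A")         # E -> M with no bus transaction
print(c1.pr_rd("A"))  # S, and c0 is demoted M -> S
print(c0.state["A"])  # S
```

Under MSI, the sequential-program pattern at the top (read then write with no sharer) would have cost a BusRdX or BusUpgr; here the E state absorbs it.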

21 MESI state transition diagram
–BusRd(S) means the shared line was asserted on the BusRd transaction
–Flush’: if cache-to-cache sharing (see next slide), only one cache flushes the data
–The MOESI protocol adds an Owned state: exclusive, but memory is not valid

22 Lower-level protocol choices
Who supplies the data on a miss when the block is not in the M state: memory or a cache?
–The original Illinois MESI used the cache, since it was assumed faster than memory (cache-to-cache sharing)
–This is not true in modern systems: intervening in another cache is more expensive than getting the data from memory
Cache-to-cache sharing also adds complexity:
–How does memory know it should supply the data? (It must wait for the caches)
–A selection algorithm is needed if multiple caches have valid data
But it is valuable for cache-coherent machines with distributed memory:
–It may be cheaper to obtain the block from a nearby cache than from distant memory
–Especially when the machine is constructed out of SMP nodes (Stanford DASH)