1 COMP 206: Computer Architecture and Implementation Montek Singh Mon, Nov 14, 2005 Topic: Cache Coherence.



2 Outline
- Cache Coherence
- Reading: HP3 Section 6.3 & Appendix I

3 Cache Coherence
- Common problem with multiple copies of mutable information (in both hardware and software)
  - “If a datum is copied and the copy is to match the original at all times, then all changes to the original must cause the copy to be immediately updated or invalidated.” (Richard L. Sites, co-architect of DEC Alpha)
- [Figure: four cases for an original and its copy: copy becomes stale; copies diverge (hard to recover from); write update; write invalidate]
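
The two remedies named in the quotation, update and invalidate, can be sketched in a few lines. This is only an illustration under simple assumptions: the `Copy` class and both function names are hypothetical, and a real cache tracks blocks and tags rather than single values.

```python
# Hypothetical illustration: one "original" value with several copies.
class Copy:
    def __init__(self, value):
        self.value = value
        self.valid = True   # False means the copy must be refetched before use


def write_update(copies, new_value):
    """Write update: push the new value into every copy."""
    for c in copies:
        c.value = new_value


def write_invalidate(copies):
    """Write invalidate: mark every copy unusable until it is refetched."""
    for c in copies:
        c.valid = False


original = "A"
copies_u = [Copy(original), Copy(original)]
copies_i = [Copy(original), Copy(original)]

original = "B"                      # the original changes
write_update(copies_u, original)    # copies follow the original
write_invalidate(copies_i)          # copies are discarded instead

updated_values = [c.value for c in copies_u]
invalidated_flags = [c.valid for c in copies_i]
```

Without either call, the copies would keep holding "A" after the original changed, which is the stale-copy case.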

4 Example of Cache Coherence
- I/O in uniprocessor with primary unified cache
  - MM copy and cache copy of memory block not always coherent
    - WT cache: MM copy stale while write update to MM in transit
    - WB cache: MM copy stale while cache copy Dirty
  - Inconsistency of no concern if no one reads/writes MM copy
  - If I/O directed to main memory, need to maintain coherence
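
A minimal sketch of the write-back case, assuming a one-value "block" and a hypothetical `io_read` that stands in for a device reading main memory directly. The class and method names are illustrative, not taken from the slides.

```python
class WriteBackCache:
    def __init__(self, memory):
        self.memory = memory   # dict: address -> value, stands in for MM
        self.lines = {}        # address -> (value, dirty)

    def write(self, addr, value):
        # Write-back: only the cache copy changes; MM is not updated yet.
        self.lines[addr] = (value, True)

    def flush(self, addr):
        # Write a Dirty copy back to MM so that MM is up to date again.
        value, dirty = self.lines.get(addr, (None, False))
        if dirty:
            self.memory[addr] = value
            self.lines[addr] = (value, False)


def io_read(memory, addr):
    """Hypothetical I/O device that reads main memory, bypassing the cache."""
    return memory[addr]


mm = {0x100: 1}
cache = WriteBackCache(mm)
cache.write(0x100, 2)          # cache copy is now Dirty
stale = io_read(mm, 0x100)     # device sees the old MM copy
cache.flush(0x100)             # coherence restored by writing back
fresh = io_read(mm, 0x100)
```

The stale read is harmless only as long as nobody reads the MM copy; once I/O is directed to main memory, something like the flush step is needed.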

5 Example of Cache Coherence (contd)
- Uniprocessor with a split primary cache
  - I-cache contains instructions
  - D-cache contains data
  - Often contents are disjoint
  - If self-modifying code (SMC) is allowed, then same cache block may appear in both caches, and consistency must be enforced
  - MS-DOS allows self-modifying code
    - Strong motivation for unified caches in Intel i386 and i486
    - Pentium has split primary cache, and supports SMC by enforcing coherence between I and D caches
- Coordinating primary and secondary caches in uniprocessor
- Shared memory multiprocessors

6 Two “Snoopy” Protocols
- We will discuss two protocols
  - A simple three-state protocol
    - Section 6.3 & Appendix I of HP3
  - The MESI protocol
    - IEEE standard
    - Used by many machines, including Pentium and PowerPC 601
- Snooping:
  - monitor memory bus activity by individual caches
  - taking some actions based on this activity
  - introduces a fourth category of miss to the 3C model: coherence misses
- First, we need some notation to discuss the protocols

7 Notation: Write-Through Cache

8 Notation: Write-Back Cache

9 Three-State Write-Invalidate Protocol
- Minor modification of WB cache
- Assumptions
  - Single bus and MM
  - Two or more CPUs, each with WB cache
  - Every cache block in one of three states: Invalid, Clean, Dirty (called Invalid, Shared, Exclusive in Figure 6.10 of HP3)
  - MM copies of blocks have no state
  - At any moment, a single cache owns bus (is bus master)
  - Bus master does not obey bus command
  - All misses (reads or writes) serviced by
    - MM if all cache copies are Clean
    - the only Dirty cache copy (which is no longer Dirty), and MM copy is written instead of being read
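
The assumptions above can be turned into a small simulation. This is a sketch, not the HP3 state machine verbatim: it models a single memory block, uses the hypothetical bus command names `BusReadMiss` and `BusWriteMiss`, and folds "serviced by the Dirty copy, and MM is written" into one step by writing the Dirty value to MM and then returning it.

```python
INVALID, CLEAN, DIRTY = "Invalid", "Clean", "Dirty"


class Cache:
    def __init__(self):
        self.state = INVALID
        self.value = None


class System:
    """One memory block shared by several write-back caches on a single bus."""

    def __init__(self, n_caches, initial=0):
        self.mm = initial
        self.caches = [Cache() for _ in range(n_caches)]
        self.bus_log = []   # (bus master index, bus command)

    def _snoop(self, master, bus_cmd):
        # Every cache except the bus master obeys the bus command.
        for i, c in enumerate(self.caches):
            if i == master:
                continue
            if c.state == DIRTY:
                self.mm = c.value          # the only Dirty copy is written to MM
            if bus_cmd == "BusReadMiss":
                if c.state == DIRTY:
                    c.state = CLEAN        # supplier is no longer Dirty
            else:                          # BusWriteMiss
                c.state = INVALID
        return self.mm                     # equals the Dirty copy if one existed

    def read(self, i):
        c = self.caches[i]
        if c.state == INVALID:             # read miss: visible on bus
            self.bus_log.append((i, "BusReadMiss"))
            c.value = self._snoop(i, "BusReadMiss")
            c.state = CLEAN
        return c.value                     # read hit: invisible on bus

    def write(self, i, value):
        c = self.caches[i]
        if c.state != DIRTY:               # write miss, or write hit on Clean
            self.bus_log.append((i, "BusWriteMiss"))
            self._snoop(i, "BusWriteMiss")
        c.value = value                    # write hit on Dirty: invisible on bus
        c.state = DIRTY


s = System(2, initial=10)
s.read(0)
s.read(1)            # both caches now hold Clean copies
s.write(0, 20)       # cache 0 becomes Dirty, cache 1 is invalidated
v = s.read(1)        # serviced from cache 0's copy; MM is written
states = [c.state for c in s.caches]
```

In this run every access except a hit on an already-owned block shows up in `bus_log`, including the write hit on a Clean block.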

10 Understanding the Protocol
[Figure: contents of MM and of caches C1 and C2 in three situations]
- Bus owner Clean, another Clean copy exists: can read without notifying other caches
- Bus owner Clean, no other cache copies: can read without notifying other caches
- Bus owner Dirty, no other cache copies: can read or write without notifying other caches
- Only two global states
  - Most up-to-date copy is MM copy, and all cache copies are Clean
  - Most up-to-date copy is a single unique cache copy in state Dirty

11 State Diagram of Cache Block (Part 1)

12 State Diagram of Cache Block (Part 2)

13 Comparison with Single WB Cache
- Similarities
  - Read hit invisible on bus
  - All misses visible on bus
- Differences
  - In single WB cache, all misses are serviced by MM; in three-state protocol, misses are serviced either by MM or by unique cache block holding only Dirty copy
  - In single WB cache, write hit is invisible on bus; in three-state protocol, write hit of Clean block:
    - invalidates all other Clean blocks by a Bus Write Miss (necessary action)

14 Correctness of Three-State Protocol
- Problem: State transitions of the FSM are supposed to be atomic, but they are not in this protocol, because of the bus
- Example: CPU read miss in Dirty state
  1. CPU access to cache detects a miss
  2. Request bus
  3. Acquire bus, and change state of cache block
  4. Evict dirty block to MM
  5. Put Bus Read Miss on bus
  6. Receive requested block from MM or another cache
  7. Release bus, and read from cache block just received
- Bus arbitration may cause gap between steps 2 and 3
  - Whole sequence of operations no longer atomic
  - App. I.1 argues that protocol will work correctly if steps 3-7 are atomic, i.e., bus is not a split-transaction bus

15 Adding More Bits to Protocols
- Add third bit, called Shared, to Valid and Dirty bits
  - Get five states (M, O, E, S, I)
  - Developed in context of Futurebus+, with intention of explaining all snoopy protocols, all of which use 3, 4, or 5 states
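
One way to see why three bits give five states rather than eight: once Valid is clear, the other two bits carry no information. The sketch below assumes the conventional reading of the bits (Dirty and Shared gives Owned, Dirty and not Shared gives Modified, and so on). The slide names only the bits and the states, so the exact mapping and the function name `moesi_state` are assumptions.

```python
from itertools import product


def moesi_state(valid, dirty, shared):
    """Map (Valid, Dirty, Shared) bits to a MOESI state letter (assumed mapping)."""
    if not valid:
        return "I"                       # Dirty and Shared bits are don't-cares
    if dirty:
        return "O" if shared else "M"    # dirty copy, with or without other sharers
    return "S" if shared else "E"        # clean copy, with or without other sharers


# All eight bit patterns collapse to five distinct states.
all_states = {moesi_state(v, d, s) for v, d, s in product([False, True], repeat=3)}
```

Dropping the Owned state from this mapping leaves the four MESI states.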

16 MESI Protocol
- Four-state, write-invalidate
- Improved version of three-state protocol
  - Clean state split into Exclusive and Shared states
  - Dirty state equivalent to Modified state
- Several slightly different versions of MESI protocol
  - Will describe version implemented by Futurebus+
  - PowerPC 601 MESI protocol does not support cache-to-cache transfer of blocks

17 State Diag. of MESI Cache Block (Part 1)

18 State Diag. of MESI Cache Block (Part 2)

19 Comparison with Three-State Protocol
- Similarities
  - Read hit invisible on bus
  - All misses handled the same way
- Differences
  - Big improvement in handling write hits
    - Write hit in Exclusive state invisible on bus
    - Write hit in Shared state involves no block transfer, only a control signal
- [Figure: example block contents for each state]
  - Exclusive state: can be read or written
  - Shared state: can be read only
  - Modified state: can be read and written
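
The write-hit improvement can be checked with a state-only sketch of MESI for a single block. Data values and write-backs are omitted, and the bus command names (`BusReadMiss`, `BusWriteMiss`, `Invalidate`) are assumptions chosen for illustration rather than names from the Futurebus+ version the slides describe.

```python
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"


class MesiSystem:
    """Tracks only the MESI state of one block in each of n caches."""

    def __init__(self, n):
        self.state = [I] * n
        self.bus_log = []   # (bus master index, bus command)

    def read(self, i):
        if self.state[i] != I:
            return                                   # read hit: invisible on bus
        self.bus_log.append((i, "BusReadMiss"))
        others = [j for j in range(len(self.state))
                  if j != i and self.state[j] != I]
        for j in others:
            self.state[j] = S                        # M/E/S holders drop to Shared
        self.state[i] = S if others else E           # Exclusive if nobody else has it

    def write(self, i):
        st = self.state[i]
        if st == I:
            self.bus_log.append((i, "BusWriteMiss"))
        elif st == S:
            self.bus_log.append((i, "Invalidate"))   # control signal, no block transfer
        if st in (I, S):                             # E and M write hits stay silent
            for j in range(len(self.state)):
                if j != i:
                    self.state[j] = I
        self.state[i] = M


m = MesiSystem(2)
m.read(0)                       # no other copy exists, so cache 0 gets Exclusive
m.write(0)                      # write hit in Exclusive: nothing on the bus
private_log = list(m.bus_log)   # only the initial read miss was visible

m.read(1)                       # cache 0 drops to Shared, cache 1 loads Shared
m.write(1)                      # write hit in Shared: Invalidate signal only
```

Under the three-state protocol the same private read-then-write would put two commands on the bus, because the write hit on a Clean block must issue a Bus Write Miss.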

20 Comments on Write-Invalidate Protocols
- Performance
  - Processor can lose cache block through invalidation by another processor
  - Average memory access time goes up, since writes to shared blocks take more time (other copies have to be invalidated)
- Implementation
  - Bus and CPU want to simultaneously access same cache
    - Either same block or different blocks, but conflict nonetheless
  - Three possible solutions
    - Use a single tag array, and accept structural hazards
    - Use two separate tag arrays for bus and CPU, which must now be kept coherent at all times
    - Use a multiported tag array (both Intel Pentium and PowerPC 601 use this solution)