4/13/2016 CS152, Spring 2016 CS 152 Computer Architecture and Engineering Lecture 18: Snoopy Caches Dr. George Michelogiannakis EECS, University of California.

Slides:



Advertisements
Similar presentations
Cache Coherence. Memory Consistency in SMPs Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale values write-through: cache-2 has.
Advertisements

Symmetric Multiprocessors: Synchronization and Sequential Consistency.
1 Lecture 20: Synchronization & Consistency Topics: synchronization, consistency models (Sections )
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Cache III Steve Ko Computer Sciences and Engineering University at Buffalo.
4/16/2013 CS152, Spring 2013 CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Cache Protocols Krste Asanovic Electrical Engineering.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Snoopy Caches I Steve Ko Computer Sciences and Engineering University at Buffalo.
4/4/2013 CS152, Spring 2013 CS 152 Computer Architecture and Engineering Lecture 17: Synchronization and Sequential Consistency Krste Asanovic Electrical.
CS 162 Memory Consistency Models. Memory operations are reordered to improve performance Hardware (e.g., store buffer, reorder buffer) Compiler (e.g.,
© Krste Asanovic, 2014CS252, Spring 2014, Lecture 12 CS252 Graduate Computer Architecture Spring 2014 Lecture 12: Synchronization and Memory Models Krste.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Snoopy Caches II Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Directory-Based Caches II Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Directory-Based Caches I Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Synchronization and Consistency II Steve Ko Computer Sciences and Engineering University at.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Synchronization and Consistency I Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Multithreading II Steve Ko Computer Sciences and Engineering University at Buffalo.
Computer Architecture 2011 – coherency & consistency (lec 7) 1 Computer Architecture Memory Coherency & Consistency By Dan Tsafrir, 11/4/2011 Presentation.
CS 152 Computer Architecture and Engineering Lecture 19: Synchronization and Sequential Consistency Krste Asanovic Electrical Engineering and Computer.
1 Lecture 21: Synchronization Topics: lock implementations (Sections )
CS 152 Computer Architecture and Engineering Lecture 21: Directory-Based Cache Protocols Scott Beamer (substituting for Krste Asanovic) Electrical Engineering.
1 Lecture 15: Consistency Models Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models.
CS 152 Computer Architecture and Engineering Lecture 19: Synchronization and Sequential Consistency Krste Asanovic Electrical Engineering and Computer.
CS 252 Graduate Computer Architecture Lecture 11: Multiprocessors-II Krste Asanovic Electrical Engineering and Computer Sciences University of California,
April 13, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 18: Snoopy Caches Krste Asanovic Electrical Engineering and Computer.
CPE 731 Advanced Computer Architecture Snooping Cache Multiprocessors Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
April 8, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 19: Synchronization and Sequential Consistency Krste Asanovic Electrical.
CS 152 Computer Architecture and Engineering Lecture 20: Snoopy Caches Krste Asanovic Electrical Engineering and Computer Sciences University of California,
April 4, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 17: Synchronization and Sequential Consistency Krste Asanovic Electrical.
1 Lecture 20: Protocols and Synchronization Topics: distributed shared-memory multiprocessors, synchronization (Sections )
April 18, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Cache Protocols Krste Asanovic Electrical Engineering.
April 15, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 20: Snoopy Caches Krste Asanovic Electrical Engineering and Computer.
1 Shared-memory Architectures Adapted from a lecture by Ian Watson, University of Machester.
Multi-core systems System Architecture COMP25212 Daniel Goodman Advanced Processor Technologies Group.
Shared Address Space Computing: Hardware Issues Alistair Rendell See Chapter 2 of Lin and Synder, Chapter 2 of Grama, Gupta, Karypis and Kumar, and also.
Computer Architecture 2015 – Cache Coherency & Consistency 1 Computer Architecture Memory Coherency & Consistency By Yoav Etsion and Dan Tsafrir Presentation.
Ch4. Multiprocessors & Thread-Level Parallelism 2. SMP (Symmetric shared-memory Multiprocessors) ECE468/562 Advanced Computer Architecture Prof. Honggang.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 12: May 3, 2003 Shared Memory.
December 1, 2006©2006 Craig Zilles1 Threads and Cache Coherence in Hardware  Previously, we introduced multi-cores. —Today we’ll look at issues related.
1 Lecture 19: Scalable Protocols & Synch Topics: coherence protocols for distributed shared-memory multiprocessors and synchronization (Sections )
1 Lecture 3: Coherence Protocols Topics: consistency models, coherence protocol examples.
CS267 Lecture 61 Shared Memory Hardware and Memory Consistency Modified from J. Demmel and K. Yelick
The University of Adelaide, School of Computer Science
EE 382 Processor DesignWinter 98/99Michael Flynn 1 EE382 Processor Design Winter 1998 Chapter 8 Lectures Multiprocessors, Part I.
Multiprocessors – Locks
Outline Introduction (Sec. 5.1)
Symmetric Multiprocessors: Synchronization and Sequential Consistency
COSC6385 Advanced Computer Architecture
CS 152 Computer Architecture and Engineering Lecture 18: Snoopy Caches
The University of Adelaide, School of Computer Science
The University of Adelaide, School of Computer Science
Dr. George Michelogiannakis EECS, University of California at Berkeley
Krste Asanovic Electrical Engineering and Computer Sciences
The University of Adelaide, School of Computer Science
Symmetric Multiprocessors: Synchronization and Sequential Consistency
Krste Asanovic Electrical Engineering and Computer Sciences
Krste Asanovic Electrical Engineering and Computer Sciences
Symmetric Multiprocessors: Synchronization and Sequential Consistency
Dr. George Michelogiannakis EECS, University of California at Berkeley
The University of Adelaide, School of Computer Science
Lecture 17 Multiprocessors and Thread-Level Parallelism
CS 152 Computer Architecture and Engineering Lecture 20: Snoopy Caches
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 18 Cache Coherence Krste Asanovic Electrical Engineering and.
Lecture 24: Multiprocessors
Lecture 17 Multiprocessors and Thread-Level Parallelism
Lecture: Coherence and Synchronization
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 22 Synchronization Krste Asanovic Electrical Engineering and.
Lecture 18: Coherence and Synchronization
The University of Adelaide, School of Computer Science
CSE 486/586 Distributed Systems Cache Coherence
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 19 Memory Consistency Models Krste Asanovic Electrical Engineering.
Lecture 17 Multiprocessors and Thread-Level Parallelism
Presentation transcript:

4/13/2016 CS152, Spring 2016 CS 152 Computer Architecture and Engineering Lecture 18: Snoopy Caches Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

4/13/2016 CS152, Spring 2016 Administrivia  Lab 4 due now  PS 5 due next week Wednesday (20 th )  Lecture 20 will be on datacenters  Quiz 4 and PS 3 will be returned tomorrow 2

4/13/2016 CS152, Spring 2016 Last time in Lecture 17 Two kinds of synchronization between processors:  Producer-Consumer –Consumer must wait until producer has produced value –Software version of a read-after-write hazard  Mutual Exclusion –Only one processor can be in a critical section at a time –Critical section guards shared data that can be written  Producer-consumer synchronization implementable with just loads and stores, but need to know ISA’s memory model!  Mutual-exclusion can also be implemented with loads and stores, but tricky and slow, so ISAs add atomic read-modify- write instructions to implement locks 3

4/13/2016 CS152, Spring 2016 Sequential Consistency 4 Cost: Prevents aggressive compiler reordering optimizations Constrains hardware utilization (e.g., store buffer)

4/13/2016 CS152, Spring 2016 Sequential Consistency 5 Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies ( ) What are these in our example ? T1:T2: Store (X), 1 (X = 1) Load R 1, (Y) Store (Y), 11 (Y = 11) Store (Y’), R 1 (Y’= Y) Load R 2, (X) Store (X’), R 2 (X’= X) additional SC requirements

4/13/2016 CS152, Spring 2016 Load-reserve & Store-conditional 6 Special register(s) to hold reservation flag and address, and the outcome of store-conditional try: Load-reserve R head, (head) spin:Load R tail, (tail) if R head ==R tail goto spin Load R, (R head ) R head = R head + 1 Store-conditional (head), R head if (status==fail) goto try process(R) Load-reserve R, (m):  ; R  M[m]; Store-conditional (m), R: if == then cancel other procs’ reservation on m; M[m] R; status succeed; else status fail;

4/13/2016 CS152, Spring 2016 Performance of Locks 7 Blocking atomic read-modify-write instructions e.g., Test&Set, Fetch&Add, Swap vs Non-blocking atomic read-modify-write instructions e.g., Compare&Swap, Load-reserve/Store-conditional vs Protocols based on ordinary Loads and Stores Performance depends on several interacting factors: degree of contention, caches, out-of-order execution of Loads and Stores later...

4/13/2016 CS152, Spring 2016 Amdahl’s Law 8 Begins with Simple Software Assumption (Limit Arg.) Fraction F of execution time perfectly parallelizable No Overhead for Scheduling Communication, Synchronization, etc. F is the Parallel Part Fraction 1 – F Completely Serial Time on 1 core = (1 – F) / 1 + F / 1 = 1 Time on N cores = (1 – F) / 1 + F / N

4/13/2016 CS152, Spring 2016 Strong Consistency 9 An execution is strongly consistent (linearizable) if the method calls can be correctly arranged retaining the mutual order of calls that do not overlap in time, regardless of what thread calls them.

4/13/2016 CS152, Spring 2016 Quiescent Consistency 10 An execution is quiescently consistent if the method calls can be correctly arranged retaining the mutual order of calls separated by quiescence, a period of time where no method is being called in any thread.

4/13/2016 CS152, Spring 2016 Relaxed Memory Models Need Fences 11 Producer posting Item x: Load R tail, (tail) Store (R tail ), x Membar SS R tail =R tail +1 Store (tail), R tail Consumer: Load R head, (head) spin:Load R tail, (tail) if R head ==R tail goto spin Membar LL Load R, (R head ) R head =R head +1 Store (head), R head process(R) Producer Consumer tailhead R tail R head R ensures that tail ptr is not updated before x has been stored ensures that R is not loaded before x has been stored

4/13/2016 CS152, Spring 2016 Memory Coherence in SMPs 12 Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale values write-through: cache-2 has a stale value Do these stale values matter? What is the view of shared memory for programming? cache-1 A100 CPU-Memory bus CPU-1 CPU-2 cache-2 A100 memory A100

4/13/2016 CS152, Spring 2016 Write-back Caches & SC 13 T1 is executed prog T2 LD Y, R1 ST Y’, R1 LD X, R2 ST X’,R2 prog T1 ST X, 1 ST Y,11 cache-2 cache-1memory X = 0 Y =10 X’= Y’= X= 1 Y=11 Y = Y’= X = X’= cache-1 writes back Y X = 0 Y =11 X’= Y’= X= 1 Y=11 Y = Y’= X = X’= X = 1 Y =11 X’= Y’= X= 1 Y=11 Y’= 11 X = 0 X’= 0 cache-1 writes back X X = 0 Y =11 X’= Y’= X= 1 Y=11 Y’= 11 X = 0 X’= 0 T2 executed X = 1 Y =11 X’= 0 Y’=11 X= 1 Y=11 Y’=11 X = 0 X’= 0 cache-2 writes back X’ & Y’ inconsistent

4/13/2016 CS152, Spring 2016 Write-through Caches & SC 14 cache-2 Y = Y’= X = 0 X’= memory X = 0 Y =10 X’= Y’= cache-1 X= 0 Y=10 prog T2 LD Y, R1 ST Y’, R1 LD X, R2 ST X’,R2 prog T1 ST X, 1 ST Y,11 Write-through caches don’t preserve sequential consistency either T1 executed Y = Y’= X = 0 X’= X = 1 Y =11 X’= Y’= X= 1 Y=11 T2 executed Y = 11 Y’= 11 X = 0 X’= 0 X = 1 Y =11 X’= 0 Y’=11 X= 1 Y=11

4/13/2016 CS152, Spring 2016 Maintaining Cache Coherence  Hardware support is required such that – only one processor at a time has write permission for a location – no processor can load a stale copy of the location after a write -> cache coherence protocols 15

4/13/2016 CS152, Spring 2016 Cache Coherence vs. Memory Consistency  A cache coherence protocol ensures that all writes by one processor are eventually visible to other processors, for one memory address –i.e., updates are not lost  No guarantee of when an update should be seen  No guarantee of what order of updates (of different addresses) should be seen  A cache coherence protocol is not enough to ensure sequential consistency –But if sequentially consistent, then caches must be coherent 16

4/13/2016 CS152, Spring 2016 Cache Coherence vs. Memory Consistency  A memory consistency model gives the rules on when a write by one processor can be observed by a read on another, across different addresses –As previously seen with examples  Combination of cache coherence protocol plus processor memory reorder buffer used to implement a given architecture’s memory consistency model 17

4/13/2016 CS152, Spring 2016 Warmup: Parallel I/O 18 (DMA stands for “Direct Memory Access”, means the I/O device can read/write memory autonomous from the CPU) Either Cache or DMA can be the Bus Master and effect transfers DISK DMA Physical Memory Proc. R/W Data (D) Cache Address (A) A D R/W Page transfers occur while the Processor is running Memory Bus

4/13/2016 CS152, Spring 2016 Problems with Parallel I/O 19 Memory Disk: Physical memory may be stale if cache copy is dirty Disk Memory: Cache may hold stale data and not see memory writes DISK DMA Physical Memory Proc. Cache Memory Bus Cached portions of page DMA transfers

4/13/2016 CS152, Spring 2016 Snoopy Cache, Goodman 1983  Idea: Have cache watch (or snoop upon) DMA transfers, and then “do the right thing”  Snoopy cache tags are dual-ported 20 Proc. Cache Snoopy read port attached to Memory Bus Data (lines) Tags and State A D R/W Used to drive Memory Bus when Cache is Bus Master A R/W

4/13/2016 CS152, Spring 2016 Snoopy Cache Actions for DMA 21 Observed Bus Cycle Cache State Cache Action Address not cached DMA Read Cached, unmodified Memory Disk Cached, modified Address not cached DMA Write Cached, unmodified Disk Memory Cached, modified No action Cache intervenes Cache purges its copy ???

4/13/2016 CS152, Spring 2016 Shared Memory Multiprocessor 22 Use snoopy mechanism to keep all processors’ view of memory coherent M1M1 M2M2 M3M3 Snoopy Cache DMA Physical Memory Bus Snoopy Cache Snoopy Cache DISKS

4/13/2016 CS152, Spring 2016 Snoopy Cache Coherence Protocols 23 write miss: the address is invalidated in all other caches before the write is performed read miss: if a dirty copy is found in some cache, a write- back is performed before the memory is read

4/13/2016 CS152, Spring 2016 Cache State Transition Diagram The MSI protocol 24 M SI M: Modified S: Shared I: Invalid Each cache line has state bits Address tag state bits Write miss (P1 gets line from memory) Other processor intents to write (P 1 writes back) Read miss (P1 gets line from memory) P 1 intent to write Other processor intents to write Read by any processor P 1 reads or writes Cache state in processor P 1 Other processor reads (P 1 writes back)

4/13/2016 CS152, Spring 2016 Two Processor Example (Reading and writing the same cache line) 25 M SI Write miss Read miss P 1 intent to write P 2 intent to write P 2 reads, P 1 writes back P 1 reads or writes P 2 intent to write P1P1 M SI Write miss Read miss P 2 intent to write P 1 intent to write P 1 reads, P 2 writes back P 2 reads or writes P 1 intent to write P2P2 P 1 reads P 1 writes P 2 reads P 2 writes P 1 writes P 2 writes P 1 reads P 1 writes

4/13/2016 CS152, Spring 2016 Observation  If a line is in the M state then no other cache can have a copy of the line!  Memory stays coherent, multiple differing copies cannot exist 26 M SI Write miss Other processor intent to write Read miss P 1 intent to write Other processor intent to write Read by any processor P 1 reads or writes Other processor reads P 1 writes back

4/13/2016 CS152, Spring 2016 MESI: An Enhanced MSI protocol increased performance for private data 27 ME SI M: Modified Exclusive E: Exclusive but unmodified S: Shared I: Invalid Each cache line has a tag Address tag state bits Write miss Other processor intent to write Read miss, shared Other processor intent to write P 1 write Read by any processor Other processor reads P 1 writes back P 1 read P 1 write or read Cache state in processor P 1 P 1 intent to write Read miss, not shared Other processor reads Other processor intent to write, P1 writes back

4/13/2016 CS152, Spring 2016 Optimized Snoop with Level-2 Caches 28 Snooper Processors often have two-level caches small L1, large L2 (on chip) Inclusion property: entries in L1 must be in L2 invalidation in L2  invalidation in L1 Snooping on L2 does not affect CPU-L1 bandwidth What problem could occur? CPU L1 $ L2 $ CPU L1 $ L2 $ CPU L1 $ L2 $ CPU L1 $ L2 $

4/13/2016 CS152, Spring 2016 Intervention 29 When a read-miss for A occurs in cache-2, a read request for A is placed on the bus Cache-1 needs to supply & change its state to shared The memory may respond to the request also! Does memory know it has stale data? Cache-1 needs to intervene through memory controller to supply correct data to cache-2 cache-1 A200 CPU-Memory bus CPU-1 CPU-2 cache-2 memory (stale data) A100

4/13/2016 CS152, Spring 2016 False Sharing 30 state line addr data0data1... dataN A cache line contains more than one word Cache-coherence is done at the line-level and not word-level Suppose M 1 writes word i and M 2 writes word k and both words have the same line address. What can happen?

4/13/2016 CS152, Spring 2016 Synchronization and Caches: Performance Issues 31 Cache-coherence protocols will cause mutex to ping-pong between P1’s and P2’s caches. Ping-ponging can be reduced by first reading the mutex location (non-atomically) and executing a swap only if it is found to be zero. cache Processor 1 R  1 L: swap (mutex), R; if then goto L; M[mutex]  0; Processor 2 R  1 L: swap (mutex), R; if then goto L; M[mutex]  0; Processor 3 R  1 L: swap (mutex), R; if then goto L; M[mutex]  0; CPU-Memory Bus mutex=1 cache

4/13/2016 CS152, Spring 2016 Load-reserve & Store-conditional 32 If the snooper sees a store transaction to the address in the reserve register, the reserve bit is set to 0 Several processors may reserve ‘a’ simultaneously These instructions are like ordinary loads and stores with respect to the bus traffic Special register(s) to hold reservation flag and address, and the outcome of store-conditional Load-reserve R, (a):  ; R M[a]; Store-conditional (a), R: if == then cancel other procs’ reservation on a; M[a]  ; status succeed; else status fail;

4/13/2016 CS152, Spring 2016 Out-of-Order Loads/Stores & CC 33 Blocking caches One request at a time + CC  SC Non-blocking caches Multiple requests (different addresses) concurrently + CC  Relaxed memory models CC ensures that all processors observe the same order of loads and stores to an address Cache Memory pushout (Wb-rep) load/store buffers CPU (S-req, E-req) (S-rep, E-rep) Wb-req, Inv-req, Inv-rep snooper (I/S/E) CPU/Memory Interface

4/13/2016 CS152, Spring 2016 Acknowledgements  These slides contain material developed and copyright by: –Arvind (MIT) –Krste Asanovic (MIT/UCB) –Joel Emer (Intel/MIT) –James Hoe (CMU) –John Kubiatowicz (UCB) –David Patterson (UCB)  MIT material derived from course  UCB material derived from course CS252  “New microprocessor claims 10x energy improvement”, from extremetech  “Exploring the diverse world of programming” by Pavel Shved  “Amdahl’s Law in the Multicore Era” b Mark D. Hill and Michael R. Marty 34