CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Directory-Based Caches I Steve Ko Computer Sciences and Engineering University at Buffalo.

Slides:



Advertisements
Similar presentations
Cache Coherence. Memory Consistency in SMPs Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale values write-through: cache-2 has.
Advertisements

The University of Adelaide, School of Computer Science
Multi-core systems System Architecture COMP25212 Daniel Goodman Advanced Processor Technologies Group.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Cache III Steve Ko Computer Sciences and Engineering University at Buffalo.
4/16/2013 CS152, Spring 2013 CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Cache Protocols Krste Asanovic Electrical Engineering.
1 Lecture 4: Directory Protocols Topics: directory-based cache coherence implementations.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Snoopy Caches I Steve Ko Computer Sciences and Engineering University at Buffalo.
© Krste Asanovic, 2014CS252, Spring 2014, Lecture 12 CS252 Graduate Computer Architecture Spring 2014 Lecture 12: Synchronization and Memory Models Krste.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Complex Pipelining II Steve Ko Computer Sciences and Engineering University at Buffalo.
Cache Optimization Summary
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Snoopy Caches II Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Directory-Based Caches II Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Cache II Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Synchronization and Consistency II Steve Ko Computer Sciences and Engineering University at.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Virtual Memory I Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Cache IV Steve Ko Computer Sciences and Engineering University at Buffalo.
The University of Adelaide, School of Computer Science
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Putting it all together: Intel Nehalem Steve Ko Computer Sciences and Engineering University.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Synchronization and Consistency I Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Multithreading II Steve Ko Computer Sciences and Engineering University at Buffalo.
CIS629 Coherence 1 Cache Coherence: Snooping Protocol, Directory Protocol Some of these slides courtesty of David Patterson and David Culler.
Computer Architecture 2011 – coherency & consistency (lec 7) 1 Computer Architecture Memory Coherency & Consistency By Dan Tsafrir, 11/4/2011 Presentation.
CS252/Patterson Lec /23/01 CS213 Parallel Processing Architecture Lecture 7: Multiprocessor Cache Coherency Problem.
EECC756 - Shaaban #1 lec # 13 Spring Scalable Cache Coherent Systems Scalable distributed shared memory machines Assumptions: –Processor-Cache-Memory.
CS 152 Computer Architecture and Engineering Lecture 21: Directory-Based Cache Protocols Scott Beamer (substituting for Krste Asanovic) Electrical Engineering.
CS252/Patterson Lec /28/01 CS 213 Lecture 9: Multiprocessor: Directory Protocol.
CS 252 Graduate Computer Architecture Lecture 11: Multiprocessors-II Krste Asanovic Electrical Engineering and Computer Sciences University of California,
April 13, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 18: Snoopy Caches Krste Asanovic Electrical Engineering and Computer.
CPE 731 Advanced Computer Architecture Snooping Cache Multiprocessors Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
CS 152 Computer Architecture and Engineering Lecture 20: Snoopy Caches Krste Asanovic Electrical Engineering and Computer Sciences University of California,
CS252/Patterson Lec /28/01 CS 213 Lecture 10: Multiprocessor 3: Directory Organization.
Snooping Cache and Shared-Memory Multiprocessors
April 18, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Cache Protocols Krste Asanovic Electrical Engineering.
April 15, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 20: Snoopy Caches Krste Asanovic Electrical Engineering and Computer.
February 11, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering.
1 Shared-memory Architectures Adapted from a lecture by Ian Watson, University of Machester.
Multiprocessor Cache Coherency
Shared Address Space Computing: Hardware Issues Alistair Rendell See Chapter 2 of Lin and Synder, Chapter 2 of Grama, Gupta, Karypis and Kumar, and also.
Computer Architecture 2015 – Cache Coherency & Consistency 1 Computer Architecture Memory Coherency & Consistency By Yoav Etsion and Dan Tsafrir Presentation.
CS492B Analysis of Concurrent Programs Coherence Jaehyuk Huh Computer Science, KAIST Part of slides are based on CS:App from CMU.
EECS 252 Graduate Computer Architecture Lec 13 – Snooping Cache and Directory Based Multiprocessors David Patterson Electrical Engineering and Computer.
Ch4. Multiprocessors & Thread-Level Parallelism 2. SMP (Symmetric shared-memory Multiprocessors) ECE468/562 Advanced Computer Architecture Prof. Honggang.
Effects of wrong path mem. ref. in CC MP Systems Gökay Burak AKKUŞ Cmpe 511 – Computer Architecture.
Computer Science and Engineering Copyright by Hesham El-Rewini Advanced Computer Architecture CSE 8383 March 20, 2008 Session 9.
1 Lecture 19: Scalable Protocols & Synch Topics: coherence protocols for distributed shared-memory multiprocessors and synchronization (Sections )
4/13/2016 CS152, Spring 2016 CS 152 Computer Architecture and Engineering Lecture 18: Snoopy Caches Dr. George Michelogiannakis EECS, University of California.
The University of Adelaide, School of Computer Science
Outline Introduction (Sec. 5.1)
CS 152 Computer Architecture and Engineering Lecture 18: Snoopy Caches
The University of Adelaide, School of Computer Science
The University of Adelaide, School of Computer Science
12.4 Memory Organization in Multiprocessor Systems
Multiprocessor Cache Coherency
Dr. George Michelogiannakis EECS, University of California at Berkeley
The University of Adelaide, School of Computer Science
Krste Asanovic Electrical Engineering and Computer Sciences
The University of Adelaide, School of Computer Science
11 – Snooping Cache and Directory Based Multiprocessors
The University of Adelaide, School of Computer Science
Lecture 17 Multiprocessors and Thread-Level Parallelism
Lecture 24: Virtual Memory, Multiprocessors
CS 152 Computer Architecture and Engineering Lecture 20: Snoopy Caches
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 18 Cache Coherence Krste Asanovic Electrical Engineering and.
Lecture 23: Virtual Memory, Multiprocessors
Coherent caches Adapted from a lecture by Ian Watson, University of Machester.
Lecture 17 Multiprocessors and Thread-Level Parallelism
Lecture 19: Coherence and Synchronization
The University of Adelaide, School of Computer Science
CSE 486/586 Distributed Systems Cache Coherence
Lecture 17 Multiprocessors and Thread-Level Parallelism
Presentation transcript:

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Directory-Based Caches I Steve Ko Computer Sciences and Engineering University at Buffalo

CSE 490/590, Spring Last time… Snoopy Cache Coherence Protocol –MSI & MESI

CSE 490/590, Spring MESI: An Enhanced MSI protocol increased performance for private data ME SI M: Modified Exclusive E: Exclusive but unmodified S: Shared I: Invalid Each cache line has a tag Address tag state bits Write miss Other processor intent to write Read miss, shared Other processor intent to write P 1 write Read by any processor Other processor reads P 1 writes back P 1 read P 1 write or read Cache state in processor P 1 P 1 intent to write Read miss, not shared Other processor reads Other processor intent to write, P1 writes back

CSE 490/590, Spring Snooper Optimized Snoop with Level-2 Caches Processors often have two-level caches small L1, large L2 (usually both on chip now) Inclusion property: entries in L1 must be in L2 invalidation in L2  invalidation in L1 Snooping on L2 does not affect CPU-L1 bandwidth What problem could occur? CPU L1 $ L2 $ CPU L1 $ L2 $ CPU L1 $ L2 $ CPU L1 $ L2 $

CSE 490/590, Spring Intervention When a read-miss for A occurs in cache-2, a read request for A is placed on the bus Cache-1 needs to supply & change its state to shared The memory may respond to the request also! Does memory know it has stale data? Cache-1 needs to intervene through memory controller to supply correct data to cache-2 cache-1 A200 CPU-Memory bus CPU-1 CPU-2 cache-2 memory (stale data) A100

CSE 490/590, Spring False Sharing state blk addr data0data1... dataN A cache block contains more than one word Cache-coherence is done at the block-level and not word-level Suppose M 1 writes word i and M 2 writes word k and both words have the same block address. What can happen?

CSE 490/590, Spring Synchronization and Caches: Performance Issues Cache-coherence protocols will cause mutex to ping-pong between P1’s and P2’s caches. Ping-ponging can be reduced by first reading the mutex location (non-atomically) and executing a swap only if it is found to be zero. cache Processor 1 R  1 L: swap (mutex), R; if then goto L; M[mutex]  0; Processor 2 R  1 L: swap (mutex), R; if then goto L; M[mutex]  0; Processor 3 R  1 L: swap (mutex), R; if then goto L; M[mutex]  0; CPU-Memory Bus mutex=1 cache

CSE 490/590, Spring Blocking caches One request at a time + CC  SC Non-blocking caches Multiple requests (different addresses) concurrently + CC  Relaxed memory models CC ensures that all processors observe the same order of loads and stores to an address Out-of-Order Loads/Stores & CC Cache Memory pushout (Wb-rep) load/store buffers CPU (S-req, M-req) (S-rep, M-rep) Wb-req, Inv-req, Inv-rep snooper (I/S/M) CPU/Memory Interface

CSE 490/590, Spring Performance of Symmetric Shared-Memory Multiprocessors Cache performance is combination of: 1.Uniprocessor cache miss traffic 2.Traffic caused by communication –Results in invalidations and subsequent cache misses Adds 4 th C: coherence miss –Joins Compulsory, Capacity, Conflict –(Sometimes called a Communication miss)

CSE 490/590, Spring Coherency Misses 1.True sharing misses arise from the communication of data through the cache coherence mechanism Invalidates due to 1 st write to shared block Reads by another CPU of modified block in different cache Miss would still occur if block size were 1 word 2.False sharing misses when a block is invalidated because some word in the block, other than the one being read, is written into Invalidation does not cause a new value to be communicated, but only causes an extra cache miss Block is shared, but no word in block is actually shared  miss would not occur if block size were 1 word

CSE 490/590, Spring CSE 490/590 Administrivia Keyboards available for pickup at my office Project 2: less than 2 weeks left (Deadline 5/2) –Will have demo sessions No class on 5/2 (finish the project!) Final exam: Thursday 5/5, 11:45pm – 2:45pm Project 2 + Final = 55%

CSE 490/590, Spring Example: True v. False Sharing v. Hit? TimeP1P2 True, False, Hit? Why? 1Write x1 2Read x2 3Write x1 4Write x2 5Read x2 Assume x1 and x2 in same cache block. P1 and P2 both read x1 and x2 before. True miss; invalidate x1 in P2 False miss; x1 irrelevant to P2 True miss; invalidate x2 in P1

CSE 490/590, Spring MP Performance 4 Processor Commercial Workload: OLTP, Decision Support (Database), Search Engine True sharing and false sharing unchanged going from 1 MB to 8 MB (L3 cache) Uniprocessor cache misses improve with cache size increase (Instruction, Capacity/Conflict, Compulsory)

CSE 490/590, Spring MP Performance 2MB Cache Commercial Workload: OLTP, Decision Support (Database), Search Engine True sharing, false sharing increase going from 1 to 8 CPUs

CSE 490/590, Spring A Cache Coherent System Must: Provide set of states, state transition diagram, and actions Manage coherence protocol –(0) Determine when to invoke coherence protocol –(a) Find info about state of address in other caches to determine action »whether need to communicate with other cached copies –(b) Locate the other copies –(c) Communicate with those copies (invalidate/update) (0) is done the same way on all systems –state of the line is maintained in the cache –protocol is invoked if an “access fault” occurs on the line Different approaches distinguished by (a) to (c)

CSE 490/590, Spring Bus-based Coherence All of (a), (b), (c) done through broadcast on bus –faulting processor sends out a “search” –others respond to the search probe and take necessary action Could do it in scalable network too –broadcast to all processors, and let them respond Conceptually simple, but broadcast doesn’t scale with number of processors, P –on bus, bus bandwidth doesn’t scale –on scalable network, every fault leads to at least P network transactions Scalable coherence: –can have same cache states and state transition diagram –different mechanisms to manage protocol

CSE 490/590, Spring Scalable Approach: Directories Every memory block has associated directory information –keeps track of copies of cached blocks and their states –on a miss, find directory entry, look it up, and communicate only with the nodes that have copies if necessary –in scalable networks, communication with directory and copies is through network transactions Many alternatives for organizing directory information

CSE 490/590, Spring Acknowledgements These slides heavily contain material developed and copyright by –Krste Asanovic (MIT/UCB) –David Patterson (UCB) And also by: –Arvind (MIT) –Joel Emer (Intel/MIT) –James Hoe (CMU) –John Kubiatowicz (UCB) MIT material derived from course UCB material derived from course CS252