Token Coherence: Decoupling Performance and Correctness
Article by: Martin, Hill & Wood
Presented by: Michael Tabet
University of Utah, CS 7698, March 24, 2005



A Tale of Two Methods
- Snooping based
  - Uses totally ordered broadcasts to preserve correctness
  - Uses lots of bandwidth
  - Big (large busses) = BAD!
- Directory based
  - Uses indirection to preserve bandwidth
  - Indirection adds latency
  - Needs a directory controller

Potential Workarounds
- Snooping
  - Snooping is fast, but requires a bus. Big, fast busses are complex
  - -> Use a virtual bus to do a virtual broadcast!
- Directory
  - Networks require lots of logic (especially big ones)
  - -> Use glueless networks!

Token Coherence
- Provides for both indirection and speedup through unordered broadcasts
- Two components:
  - Correctness substrate
  - Performance protocol

Correctness
- Speed is good, correctness is better!
- Need to guarantee ordered reads/writes!
- Thus, use a correctness “substrate”

Correctness Invariants
1. At all times, each block has T tokens
2. A processor can only write a block if it holds all T tokens
3. A processor can read a block only if it holds at least one token
4. If a coherence message contains one or more tokens, it must contain data
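
The four invariants can be written down as small checks. This is only a sketch under assumptions: the `Holder` class, the function names, and the choice T = 4 are made up for illustration and do not come from the paper or the slides.

```python
from dataclasses import dataclass

T = 4  # assumed number of tokens per block (illustrative value)


@dataclass
class Holder:
    """A cache, the memory, or an in-flight message, for a single block."""
    name: str
    tokens: int = 0
    has_data: bool = False


def tokens_conserved(holders, total=T):
    """Invariant 1: tokens held by caches, memory and messages sum to T."""
    return sum(h.tokens for h in holders) == total


def can_write(holder, total=T):
    """Invariant 2: writing requires all T tokens."""
    return holder.tokens == total


def can_read(holder):
    """Invariant 3: reading requires at least one token."""
    return holder.tokens >= 1


def valid_message(msg):
    """Invariant 4: a message that carries tokens must also carry data."""
    return msg.tokens == 0 or msg.has_data
```

For example, when memory holds all T tokens it may write; once it sends one token (with data) in a message, nobody can write until some node collects all T again.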

Invariant 1 Implications
- Allows for precise control of blocks of data

Invariant 2 Implications
- Enables a write-control mechanism that allows in-order writes

Invariant 3 Implications
- Restricts reads

Invariant 4 Implications
- Provides a method to ensure cache coherence

Starvation
- The invariants allow for ordered reads/writes, but how do we prevent starvation?
- Persistent requests:
  1. A processor times out on transient requests
  2. It raises a persistent request (only one per block)
  3. All nodes must forward the block (and its tokens) to that node
- But repeated and persistent requests make up only 1-3% of the messages
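
The escalation from transient to persistent requests can be sketched as below. Everything here is an assumption for illustration: `PersistentTable`, `forward_all`, `acquire`, the retry count, and the `try_transient` callback that stands in for the performance protocol are hypothetical names, and real hardware would use timeouts and an arbiter rather than a loop.

```python
class PersistentTable:
    """Tracks the single persistent request allowed per block."""

    def __init__(self):
        self.active = {}  # block address -> name of the starving node

    def activate(self, block, node):
        """Return True if `node` holds the persistent request for `block`."""
        if block in self.active:
            return self.active[block] == node
        self.active[block] = node
        return True

    def deactivate(self, block, node):
        if self.active.get(block) == node:
            del self.active[block]


def forward_all(tokens_by_node, starver):
    """Every other node hands its tokens for the block to the starving node."""
    for node in tokens_by_node:
        if node != starver:
            tokens_by_node[starver] += tokens_by_node[node]
            tokens_by_node[node] = 0


def acquire(node, block, tokens_by_node, table, needed, try_transient, max_tries=3):
    """Try transient requests first; escalate after `max_tries` failures."""
    for _ in range(max_tries):
        try_transient(tokens_by_node)  # may or may not move tokens
        if tokens_by_node[node] >= needed:
            return "transient"
    if not table.activate(block, node):
        return "wait"  # another node already has the persistent request
    forward_all(tokens_by_node, node)
    table.deactivate(block, node)
    return "persistent"
```

With a `try_transient` that never delivers tokens, `acquire` ends in the persistent path and the requester collects every token for the block.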

Persistent Request State Diagram
[Figure: persistent request state diagram]

Performance Protocol
- But if you always follow the rules, it can get slow and tedious!
- Tokens allow for unordered responses to requests
- This opens the door for all sorts of optimizations

TokenB: A New Contender
- Akin to the MSI snooping protocol: requests are broadcast
- Data exists in one of:
  - Modified (all tokens)
  - Shared (some tokens)
  - Invalid (no tokens)
- But: the performance protocol allows for better performance!
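
The MSI analogy amounts to reading a state off the token count. A minimal sketch, where the function name `msi_state` is an assumption made for illustration:

```python
def msi_state(tokens, total):
    """Map a token count for one block to the MSI-like state on the slide."""
    if not 0 <= tokens <= total:
        raise ValueError("token count must be between 0 and total")
    if tokens == total:
        return "M"  # Modified: all tokens, may write
    if tokens > 0:
        return "S"  # Shared: some tokens, may read
    return "I"      # Invalid: no tokens
```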

TokenB: Optimized Token Counting
MSI was a bit of a lie; token counting can be optimized by altering invariants 1, 3, and 4:
1. At all times, each block has T tokens, one of which is the owner token
3. A processor can read a block only if it holds at least one token for that block and has valid data
4. If a coherence message contains the owner token, it must contain data
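
The practical effect of the altered invariants is that non-owner tokens may travel without data. A sketch of the relaxed checks, where `TokenMessage` and both function names are hypothetical and chosen only for this example:

```python
from dataclasses import dataclass


@dataclass
class TokenMessage:
    tokens: int              # total tokens carried, owner token included
    has_owner: bool = False  # True if one of the tokens is the owner token
    has_data: bool = False


def valid_optimized_message(msg):
    """Altered invariant 4: only the owner token forces data into the message."""
    if msg.has_owner and msg.tokens < 1:
        return False  # the owner token is itself one of the tokens
    return msg.has_data or not msg.has_owner


def can_read_optimized(tokens, has_valid_data):
    """Altered invariant 3: at least one token and valid data are both needed."""
    return tokens >= 1 and has_valid_data
```

Under these rules a sharer giving up its copy can return a plain token in a small data-less message, while the owner token always moves together with the block's data.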

TokenB Continued: The Good Stuff
- Performance in: Tokens allow replies to be sent unordered, and indirectly (no broadcast)
- This means:
  - 15-28% faster than snooping
  - 17-54% faster than directory
  - 21-25% less bandwidth than snooping

An Example
- P1 reads, then P2 writes, then P1 reads
- Assume a 4-node system, where P1 has an invalid copy, P2 has a shared copy, and P3 is the “home/owner” node

Example: The Snooping Way
[Figure: nodes P1, P2, P3, P4]
- All messages broadcast!

Example: The Directory Way
[Figure: nodes P1, P2, P3, P4 and the directory]
- The directory processes messages!

Example: The Token Way
[Figure: nodes P1, P2, P3, P4 with numbered messages; surviving labels: 1 (broadcast), 2, 3 (broadcast), one unnumbered (broadcast), 6]
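
The read, write, read sequence from the example can be traced with token counts alone. This is a rough sketch, not the message flow in the figure: the starting distribution (P2 holds 1 token, P3 holds 3, T = 4), the rule that the node with the most tokens answers a read, and the function names are all assumptions.

```python
def broadcast_read(tokens, requester):
    """Requester broadcasts a read; one holder answers with one token plus data."""
    if tokens[requester] >= 1:
        return  # already readable
    donor = max((n for n in tokens if n != requester), key=lambda n: tokens[n])
    if tokens[donor] == 0:
        raise RuntimeError("no tokens anywhere: invariant 1 is violated")
    tokens[donor] -= 1
    tokens[requester] += 1


def broadcast_write(tokens, requester):
    """Requester broadcasts a write; every holder sends all of its tokens."""
    for node in tokens:
        if node != requester:
            tokens[requester] += tokens[node]
            tokens[node] = 0


# Assumed start: P1 invalid, P2 shared, P3 home/owner, P4 uninvolved.
tokens = {"P1": 0, "P2": 1, "P3": 3, "P4": 0}
broadcast_read(tokens, "P1")    # P3 gives one token to P1
after_first_read = dict(tokens)
broadcast_write(tokens, "P2")   # P2 collects all four tokens
after_write = dict(tokens)
broadcast_read(tokens, "P1")    # P2 gives one token back to P1
```

At every step the counts sum to 4, which is invariant 1 at work: tokens only move, they are never created or destroyed.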

Real World Results
- Examined on a tree structure (virtual broadcast), and on a 2D torus
- Migratory optimization: a read request after a write is forwarded all tokens
- Benchmarked on OLTP, SPECjbb, Apache
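
The migratory optimization can be sketched as a rule for how many tokens a holder sends in reply to a read. The function name and the `has_written` flag are assumptions for illustration; the slide only states that a read following a write is forwarded all tokens.

```python
def tokens_to_send_on_read(held, total, has_written):
    """How many tokens a holder sends in response to another node's read."""
    if held == 0:
        return 0      # nothing to give
    if held == total and has_written:
        return total  # migratory case: hand over everything
    return 1          # ordinary case: share a single token
```

Handing over all tokens lets the reader write next without a second request, which suits read-then-write sharing patterns.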

Results
- Token vs Snooping: TOKEN wins!

Results
- Directory vs Token: Token mostly wins!

Conclusion
- TokenB offers good performance for small to mid-sized parallel systems
- Broadcast limits scalability past 16 nodes
- But other performance implementations could be scaled larger!