A New Coherence Method Using A Multicast Address Network

Multicast Snooping: A New Coherence Method Using A Multicast Address Network

Outline
- Multicast Snooping View from Above
- Multicast Snooping Details
- Experimental Implementation and Methodology
- Questions Posed by Multicast Snooping

View from Above
- Multicast Snooping: a hybrid of snooping and directory protocols
  - Performance benefits of snooping
  - Scalability of directories
  - Graceful degradation to directory-like behavior as the system grows
- Multicast groups, à la networks
  - Processor "guesses" which peer(s) need to see a message, then multicasts to those targets
  - Memory has a directory to evaluate guesses and acts upon those that are wrong

View from Above, Continued
- Advantages of Multicast Snooping
- When the multicast "guess" is right, the speed of a snooping protocol is achieved, with greater scaling than traditional snooping systems
  - "Guessing" right isn't that hard
  - Net result is support for larger systems with snooping (better snooping scalability)
- When the multicast "guess" is wrong, directory-based mechanisms maintain correctness
  - As the system scales, behavior degrades toward that of a directory protocol

Multicast Snooping Coherence
- Logically separate multicast address and data buses
  - Authors model physical separation for simplicity
- MOSI protocol
  - Why MOSI? Why not MSI, MESI, or MOESI?

MOSI appears to have been chosen so that a getx with an incomplete mask can transition the requestor's block to O rather than having the transaction invalidated outright. This lets the processors in the first mask act on the transaction at the time it is issued. An upgrade transaction with the proper mask can then be issued, so the processors in the first mask do not have to process two messages, the first of which would be invalidated by the directory at memory. An E state is clearly unfavorable for reasons previously discussed in class (E generally needs a single shared line that every processor must monitor).

Multicast Snooping Protocol
- Broadcast-like, with three major differences
  - Coherence transactions carry a mask
    - Mask: specifies which processors should receive the transaction; always includes the source processor and the memory that owns the requested block
  - Memory carries a simplified directory entry
    - Verifies that the mask is correct and reacts appropriately
    - If incorrect, sends the correct mask back with a semi-ack or nack
  - Processor actions carry additional complexity
    - Needed to support semi-acks/nacks on getx
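
The mask check that memory performs can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `DirEntry` and `check_mask` are hypothetical, and the response rule is an assumption: a nack when the mask misses the current owner, a semi-ack when a getx mask reaches the owner but misses some sharers. Memory itself is not modeled as a mask member, since the slide states it is always included.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple


@dataclass
class DirEntry:
    """Simplified directory entry kept at memory (hypothetical layout)."""
    owner: Optional[int] = None            # None means memory owns the block
    sharers: Set[int] = field(default_factory=set)


def check_mask(kind: str, requester: int, mask: Set[int],
               entry: DirEntry) -> Tuple[str, Set[int]]:
    """Return (response, correct_mask) for a 'gets' or 'getx' request.

    Assumed rule: the mask must reach the owner; a getx must also reach
    every sharer so that all copies can be invalidated.
    """
    needed = {requester}
    owner_covered = True
    if entry.owner is not None:
        needed.add(entry.owner)
        owner_covered = entry.owner in mask or entry.owner == requester
    if kind == "getx":
        needed |= entry.sharers

    if needed <= (mask | {requester}):
        return "ack", needed        # the guess was sufficient
    if not owner_covered:
        return "nack", needed       # owner never saw the request: retry
    return "semi-ack", needed       # owner saw it, some sharers did not


entry = DirEntry(owner=2, sharers={3, 5})
print(check_mask("getx", 0, {0, 2, 3}, entry))
```

In this sketch the requester would reissue the request (or an upgrade) using the returned mask whenever the response is not an ack.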

Multicast Snooping Mask Prediction
- Each processor maintains a local mini-directory
  - Tracks locality of block access and the last invalidator, arranged by block tag
- Builds the mask using the "StickySpatial(k)" predictor
  - ORs the mask for the block with the masks for its k nearest neighbors in the table
  - Nearest neighbors may not be related blocks
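
A rough Python sketch of the OR-with-neighbors idea follows. Several details are assumptions made only for illustration and are not taken from the slides: the table is direct-mapped by block number, "k nearest neighbors" means k entries on each side with wraparound, entries store no tag, and training simply ORs in the nodes observed for a block (the "sticky" part). The class name `StickySpatialPredictor` is hypothetical.

```python
class StickySpatialPredictor:
    """Toy mask predictor; masks are integers used as bit vectors."""

    def __init__(self, entries: int = 8, k: int = 1) -> None:
        self.k = k
        self.masks = [0] * entries          # one mask per table entry

    def _index(self, block: int) -> int:
        # Assumed direct-mapped indexing on the block number.
        return block % len(self.masks)

    def train(self, block: int, nodes) -> None:
        """Accumulate nodes seen for this block (bits are never cleared)."""
        i = self._index(block)
        for node in nodes:
            self.masks[i] |= 1 << node

    def predict(self, block: int, self_id: int) -> int:
        """OR the block's entry with its k neighbors on each side."""
        i = self._index(block)
        mask = 1 << self_id                 # the source is always included
        for offset in range(-self.k, self.k + 1):
            mask |= self.masks[(i + offset) % len(self.masks)]
        return mask


p = StickySpatialPredictor(entries=8, k=1)
p.train(10, [3])
p.train(11, [5])
print(bin(p.predict(10, self_id=0)))
```

Because the neighboring entries are untagged in this sketch, an unrelated block that maps near a trained entry inherits its mask bits, which mirrors the slide's remark that nearest neighbors may not be related blocks.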

Multicast Address Networks
- For now, consider it as a cloud
- Notable use of a fat-tree network (recall the CM-5)
- Important properties for a supporting multicast network include:
  - Total ordering; delivery need not be simultaneous to all destinations
  - Capable of multiple deliveries per cycle
  - Low latency: avoid bottlenecks, exploit locality

Performance Evaluation
- Not the focus of the paper
  - Preliminary evidence suggests that further detailed evaluation of multicast snooping is warranted
- Simulated 32-processor CC-NUMA
  - Used the Wisconsin Wind Tunnel II (WWT II) simulator
  - MSI only: pessimistic for multicast snooping
- Benchmarks mainly derived from SPLASH-2

Evaluating The Pieces
- Generated traces fed through the mask predictor
  - Prediction accuracy range: 73-95%
  - Extra nodes predicted still leave the multicast group size much smaller than the system size
  - Would these results scale with smaller system sizes? What's affecting the number of extra nodes?
- Network results show multiple messages per cycle are possible (~50% of optimal)

The paper implies that multicast group size depends mostly on the mean number of sharers encountered by a coherence transaction. This suggests that the multicast traffic ratio (the average number of multicast group members divided by the perfect number of multicast group members) would be nearly fixed across a range of related system sizes under similar workloads. This would start to change once the number of nodes drops below the perfect number, that is, when there is more parallelism present than there are processors to exploit it. The authors suggest that there is further opportunity to improve the mask predictor and bring the multicast traffic ratio closer to the ideal (1.0).
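
As a small worked illustration of the multicast traffic ratio, the Python sketch below computes it from a made-up trace of (predicted mask, perfect mask) pairs. The function names and the trace are hypothetical. The accuracy definition, counting a prediction as correct when it covers every node in the perfect mask, is an assumption and may differ from the paper's.

```python
def multicast_traffic_ratio(trace) -> float:
    """Average predicted group size divided by average perfect group size.

    `trace` is a list of (predicted_mask, perfect_mask) pairs of sets.
    Both averages share the same count, so the ratio of sums is used.
    """
    predicted_total = sum(len(predicted) for predicted, _ in trace)
    perfect_total = sum(len(perfect) for _, perfect in trace)
    return predicted_total / perfect_total


def prediction_accuracy(trace) -> float:
    """Fraction of transactions whose predicted mask covers the perfect mask
    (assumed definition of a correct prediction)."""
    hits = sum(1 for predicted, perfect in trace if perfect <= predicted)
    return hits / len(trace)


# Made-up example trace, not data from the paper.
trace = [
    ({0, 1, 2, 3}, {0, 1}),   # two extra nodes predicted
    ({0, 4}, {0, 4}),         # perfect prediction
    ({0, 5}, {0, 5, 6}),      # node 6 missed: prediction is wrong
]
print(multicast_traffic_ratio(trace), prediction_accuracy(trace))
```

A ratio of 1.0 with full accuracy would correspond to the ideal predictor mentioned above; extra predicted nodes push the ratio above 1.0.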

Questions To Consider
- Is multicast snooping a good idea?
  - If so, would it be better in some scenarios than in others?
  - If not, why not?
- Is multicast snooping optimal?
- Are the noted drawbacks regarding the evaluation back-breakers?

In general, multicast snooping appears to be a good idea for classes of systems where performance and scalability are important. Retaining snooping performance in larger-scale systems is attractive. It is unclear how much additional hardware a snoop-based system would need and whether this would become a significant cost factor. For very large systems running workloads that share data across nearly all processors, multicast snooping would degrade to a directory-style system, which is reasonable. As for optimality, I believe this is an "it depends" issue, with dependencies on workload, system size, and customer prioritization of performance, cost, and availability. The noted drawbacks need to be addressed in future work, particularly the approximations and the lack of full timing simulation. They are not, in my opinion, a valid reason to dismiss multicast snooping altogether.