CS 258 Spring 2008. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. Per Stenström, Mats Brorsson, and Lars Sandberg. Presented by Allen Lee.

CS 258 Spring 2008
An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing
Per Stenström, Mats Brorsson, and Lars Sandberg
Presented by Allen Lee, April 2, 2008

Observations
- Data structures accessed within critical sections are read and modified by exactly one processor at any given time. These are also called migratory objects [Gupta and Weber].
- Each processor that accesses these data structures typically causes a cache miss followed by an invalidation request. These two transactions can be combined.

Migratory Sharing
- R_i denotes a read access by processor i.
- W_i denotes a write access by processor i.
- Consider the access pattern, where i != j:
  $(R_i)^{+}(W_i)(R_i \mid W_i)^{*}(R_j)^{+}(W_j)(R_j \mid W_j)^{*}\dots$
- A block that adheres to the above pattern is called migratory; otherwise it is ordinary.
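The pattern above can be checked mechanically. The following is a minimal sketch, not taken from the paper: the function name `is_migratory` and the trace encoding as `(op, processor)` pairs are assumptions made for illustration. It groups the trace into maximal runs by processor and requires each run to start with a read and contain at least one write, which is what one "(R)+ (W) (R | W)*" episode amounts to. A trace whose last episode has not yet reached its write is treated as not matching.

```python
from itertools import groupby


def is_migratory(trace):
    """Return True if the trace follows the migratory pattern.

    trace is a list of (op, proc) pairs, op being "R" or "W".
    This encoding is a hypothetical one chosen for the sketch.
    """
    if not trace:
        return False
    # Consecutive accesses by the same processor form one episode,
    # so neighbouring episodes always belong to different processors.
    for _proc, run in groupby(trace, key=lambda access: access[1]):
        ops = [op for op, _ in run]
        # One episode must look like R+ W (R|W)*:
        # it starts with a read and contains at least one write.
        if ops[0] != "R" or "W" not in ops:
            return False
    return True


migratory_trace = [("R", 0), ("W", 0), ("R", 1), ("R", 1), ("W", 1), ("W", 1)]
read_shared_trace = [("R", 0), ("R", 1), ("R", 0)]
print(is_migratory(migratory_trace))    # True
print(is_migratory(read_shared_trace))  # False
```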

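Detecting Migratory Sharing
A block is nominated as migratory if
1. A read-exclusive (invalidation) request comes from a processor different from the one that most recently wrote to it, and
2. There are exactly two copies of the block.
In the pattern $(R_i)^{+}(W_i)(R_i \mid W_i)^{*}(R_j)^{+}(W_j)(R_j \mid W_j)^{*}\dots$, the reads $(R_j)^{+}$ are what leave two copies of the block (Condition 2), and the write $W_j$ is the read-exclusive request from a processor other than the last writer (Condition 1).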
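The two conditions can be sketched as a check the home node performs when a read-exclusive request arrives. This is an illustrative model only: `DirectoryEntry`, `on_read`, and `on_read_exclusive` are hypothetical names, and the choice not to nominate a block that has never been written is an assumption, since the slide does not cover that case.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class DirectoryEntry:
    """Hypothetical per-block directory state kept at the home node."""
    sharers: Set[int] = field(default_factory=set)
    last_writer: Optional[int] = None
    migratory: bool = False


def on_read(entry, requester):
    # A read miss adds the requester to the set of cached copies.
    entry.sharers.add(requester)


def on_read_exclusive(entry, requester):
    # Condition 1: the request comes from a processor other than the
    # most recent writer (assumption: no nomination without a prior writer).
    different_writer = (
        entry.last_writer is not None and requester != entry.last_writer
    )
    # Condition 2: exactly two copies of the block exist.
    two_copies = len(entry.sharers) == 2
    if different_writer and two_copies:
        entry.migratory = True
    # Write-invalidate: the requester ends up with the only copy.
    entry.sharers = {requester}
    entry.last_writer = requester
    return entry.migratory


entry = DirectoryEntry()
on_read(entry, 0)
on_read_exclusive(entry, 0)          # first writer, nothing to nominate yet
on_read(entry, 1)                    # copies now held by processors 0 and 1
print(on_read_exclusive(entry, 1))   # True
```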

Review: Stanford DASH
- Directory-based write-invalidate protocol
- Home keeps track of the global state of each block:
  - Uncached
  - Shared-Remote
  - Dirty-Remote
- A block in a local node is either:
  - Invalid
  - Shared
  - Dirty

Review: Stanford DASH
[Figure: message flows for a read miss and for a read-exclusive request; L = local, H = home, R = remote]
- Rr = Read miss
- Rp = Response from remote to local
- Sw = Response from remote to home
- Rxq = Read-exclusive request
- Rxp = Read-exclusive response
- Inv = Invalidate
- Iack = Invalidation acknowledgment

Actions for Migratory Blocks
[Figure: message flow for a read miss to a migratory block; L = local, H = home, R = remote]
- Rr = Read miss
- DT = Notification of ownership transfer
- Mr = Migratory read-miss request
- MIAck = Home has updated directory
- Mack = Remote gives up ownership and sends copy to local (migratory ack)

Detecting Migratory Blocks

Migratory to Ordinary
- Consider the following access pattern: Rr_i, Rr_j, Rr_i, Rr_j, …
- To avoid ping-pong, introduce a Migrating local cache state:
  - The initial read causes the local state to be Migrating instead of Dirty.
  - A subsequent write changes the state to Dirty.
  - Receiving a migratory read-miss request (Mr) while in Migrating generates a NoMig signal, and the block is thereafter considered ordinary.
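The role of the Migrating state can be illustrated with a small local-cache state machine. This is a sketch under stated assumptions rather than the protocol as specified in the paper: the function names are hypothetical, and the state a cache falls back to after signalling NoMig (Shared here) is an assumption, because the slide only says the block becomes ordinary.

```python
from enum import Enum


class State(Enum):
    INVALID = "Invalid"
    SHARED = "Shared"
    DIRTY = "Dirty"
    MIGRATING = "Migrating"


def on_migratory_fill():
    # A read miss to a migratory block installs the copy as Migrating,
    # not Dirty, so a read-only use can still be told apart later.
    return State.MIGRATING


def on_local_write(state):
    # The write that normally follows the read needs no new request.
    if state in (State.MIGRATING, State.DIRTY):
        return State.DIRTY
    raise ValueError(f"write not modelled in state {state.value}")


def on_incoming_mr(state):
    """Return (new_state, reply) when a migratory read-miss request arrives."""
    if state is State.DIRTY:
        # The block was written as expected: hand over ownership.
        return State.INVALID, "Mack"
    if state is State.MIGRATING:
        # The block was only read: the migratory guess was wrong.
        # Falling back to Shared is an assumption of this sketch.
        return State.SHARED, "NoMig"
    raise ValueError(f"Mr not modelled in state {state.value}")


# Read then write, then another processor asks for the block.
state = on_local_write(on_migratory_fill())
print(on_incoming_mr(state)[1])                # Mack

# Read only (the Rr_i, Rr_j, ... pattern), then another processor asks.
print(on_incoming_mr(on_migratory_fill())[1])  # NoMig
```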

Architectural Model
- 16 nodes interconnected by two 4x4 wormhole-routed meshes
- 64-Kbyte cache with 16-byte line size
- 100 MHz processor clock

Results

Application   Read-excl. req. reduction
MP3D          87%
Cholesky      69%
Water         96%
LU            5%

Results – Traffic Reduction
- 704 bits are sent under write-invalidate (W-I) to serve a read miss and the subsequent read-exclusive request.
- 328 bits are sent for the same pair of accesses under the adaptive protocol.

Application   Traffic reduction
MP3D          32%
Cholesky      22%
Water         31%
LU            1%
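As a quick check on the two bit counts above, the per-transaction saving can be worked out directly. The variable names are illustrative; only the figures 704 and 328 come from the slide. The per-application reductions in the table are lower than this figure, presumably because only part of each application's traffic comes from migratory blocks.

```python
# Bits sent to serve a read miss plus the following read-exclusive request.
WI_BITS = 704        # write-invalidate protocol
ADAPTIVE_BITS = 328  # adaptive protocol

saved_bits = WI_BITS - ADAPTIVE_BITS
reduction = saved_bits / WI_BITS

print(saved_bits)            # 376
print(f"{reduction:.1%}")    # 53.4%
```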

Results – Effect on Consistency Model
- Relaxed consistency models require high bandwidth for global requests.
- Traffic reduction results in better performance.

Results – Impact of Cache Size
- Smaller caches have higher replacement miss rates.
- Under the adaptive protocol (AD), a smaller cache yields a smaller reduction in write penalty.

Conclusion
- The adaptive protocol optimizes coherence actions due to migratory sharing.
- The technique requires a simple extension of a directory-based write-invalidate protocol.
- It improves performance by reducing write stall time and overall network traffic.