Evaluating the Performance of Four Snooping Cache Coherency Protocols Susan J. Eggers, Randy H. Katz.

Example Cache Coherence Problem

Solutions – Protocols
Snooping protocols – suitable for bus-based architectures; require broadcast
Directory-based protocols – sharing information stored separately (in directories); suit non-bus-based architectures

Snooping Protocols
Suitable for bus-based architectures. Two types:
* Write-invalidate – the writing processor invalidates all other cached copies of the shared data; it can then update its own copy with no further bus operations
* Write-broadcast – the writing processor broadcasts updates to shared data to the other caches, so all copies stay identical
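The difference between the two families can be sketched in a few lines; this is a toy model (caches as dicts, no bus timing), not an implementation of any particular protocol:

```python
# Toy contrast of the two snooping policy families.
# Each "cache" is a dict mapping address -> value; names are illustrative.

def write_invalidate(caches, writer, addr, value):
    """Writer updates its own copy; all other cached copies are invalidated."""
    caches[writer][addr] = value
    for i, cache in enumerate(caches):
        if i != writer and addr in cache:
            del cache[addr]          # invalidation signal on the bus

def write_broadcast(caches, writer, addr, value):
    """Writer broadcasts the new value; every cached copy is updated."""
    caches[writer][addr] = value
    for i, cache in enumerate(caches):
        if i != writer and addr in cache:
            cache[addr] = value      # update broadcast on the bus
```

Under write-invalidate, other sharers lose their copies and must re-fetch; under write-broadcast, every sharer always sees the latest value.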

Case Studies
Architecture:
– shared-memory, with 5–12 processors connected on a single bus
– one-cycle instruction execution
– direct-mapped cache; one-cycle reads, two-cycle writes
Applications:
– traces gathered from four parallel CAD programs developed for single-bus, shared-memory multiprocessors
– granularity of parallelism is a process
– single-program, multiple-data

Write-Invalidate Protocols
The writing processor invalidates all other shared (cached) copies of the data; any subsequent writes by the same processor require no bus utilization. The caches of the other processors "snoop" on the bus.
Example – Berkeley Ownership (states: Invalid, Valid, Shared Dirty, Dirty)
Sources of overhead: invalidation signals and invalidation misses
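The write path above can be sketched as follows. This is a minimal illustration using the Berkeley Ownership state names, not the full protocol: the read path and the Shared Dirty transitions are omitted.

```python
# Minimal sketch of the write-invalidate write path, loosely following the
# Berkeley Ownership state names; read path and Shared Dirty handling omitted.
INVALID, VALID, SHARED_DIRTY, DIRTY = "I", "V", "SD", "D"

def local_write(states, writer):
    """Processor `writer` writes its cached block.

    Returns True if an invalidation signal had to go out on the bus.
    A Dirty owner already holds the only valid copy, so it writes silently.
    """
    needs_bus = states[writer] != DIRTY
    if needs_bus:
        for i in range(len(states)):
            if i != writer:
                states[i] = INVALID   # snoop hit: invalidate other copies
    states[writer] = DIRTY
    return needs_bus
```

The second consecutive write by the same processor returns False, matching the slide's point that repeated writes by one owner cost no bus operations.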

Write-Invalidate Protocols (contd.)
Cache coherency overhead is minimized by:
– sequential sharing (multiple consecutive writes to a block by a single processor)
– fine-grain sharing (little inter-processor contention for shared data)
Trouble spots:
– high contention for shared data, which results in "ping-ponging"
– large block sizes
Simulation results: the proportion of invalidation misses to total misses increases with larger block sizes.

Read-Broadcast: Enhancement to Write-Invalidate
Designed to reduce invalidation misses: an invalidated block is updated with data whenever there is a read bus operation for the block's address.
Required hardware:
– a buffer to hold the data
– control to implement read-interference
Improvement: at most one invalidation miss per invalidation signal
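The read-broadcast fill can be sketched as below; this is a hypothetical model (invalidated entries are represented as the value None) meant only to show the "one miss serves all invalidated sharers" effect:

```python
# Sketch of the read-broadcast fill. While one cache's read miss is serviced
# on the bus, every other cache holding an invalidated copy of that block
# takes the data too, avoiding a later invalidation miss.

def bus_read(caches, reader, addr, memory):
    data = memory[addr]
    caches[reader][addr] = data           # normal miss fill
    for i, cache in enumerate(caches):
        # an invalidated (but still tagged) entry is modeled as value None
        if i != reader and cache.get(addr, "absent") is None:
            cache[addr] = data            # read-interference: grab it off the bus
    return data
```

Caches that never held the block are untouched; only previously invalidated copies are refreshed.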

Performance Analysis of Read-Broadcast
Benefits:
– reduces the number of invalidation misses
– the ratio of invalidation misses to total misses still increases with block size, but the proportion is lower than with Berkeley Ownership
Side-effects:
– increased processor lockout from the cache: the CPU and the snoop contend for the shared cache resource, and snoop-related cache activity is higher than with Berkeley Ownership; for 3 of the traces, the increase in processor lockout wiped out the gain in total execution cycles from fewer invalidation misses
– an increase in the average number of cycles per bus transfer: an additional cycle is required for the snoops to acknowledge completion of the operation, and the processor's state must be updated on read-broadcasts and simple state invalidations

Write-Invalidate/Read-Broadcast Comparison
If the reduction in invalidation misses outweighs the added processor lockout, Read-Broadcast yields a net gain in total execution cycles. Read-Broadcast is most beneficial in the "one producer, several consumers" situation. An optimized cache controller would also improve Read-Broadcast's performance.

Write-Broadcast Protocols
The writing processor broadcasts updates to shared addresses; a special bus line indicates that a block is shared.
Example – Firefly protocol (states: Valid Exclusive, Shared, Dirty; memory is updated simultaneously with each write to shared data)
Sources of overhead:
– sequential sharing (each processor accesses the data many times before another processor begins)
– bus broadcasts to shared data

Write-Broadcast Protocols (contd.)
Cache coherency overhead minimized: avoids the "ping-ponging" of shared data that occurs under write-invalidate.
Trouble spot – large cache sizes: the lifetime of cache blocks increases, so write-broadcasts continue for data that is no longer actively shared.
Simulation results: the traces confirm this analysis; the proportion of write-broadcast cycles within total cycles increases with cache size.

Competitive Snooping: Enhancement to Write-Broadcast
Switches to write-invalidate when the breakeven point in bus-related coherency overhead is reached. Breakeven point: the sum of write-broadcast cycles issued for an address equals the number of cycles that would be needed to reread the data had it been invalidated.
Improvement: limits coherency overhead to twice that of an optimal algorithm.
Two algorithms: Standard-Snoopy-Caching and Snoopy-Reading

Standard-Snoopy-Caching
A counter, initialized to the cost in cycles of a data transfer, is assigned to each cache block in every cache. On a write broadcast, one cache containing the broadcast address is (arbitrarily) chosen and its counter is decremented; when a counter reaches zero, that cache block is invalidated. When all counters for an address (other than the writer's) are zero, write broadcasts for it cease. A reaccess by a processor resets its cache's counter to the initial value. The algorithm's lower-bound proof shows that the total cost of invalidating is balanced against the total cost of rereading.
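The counter bookkeeping above can be sketched as follows. TRANSFER_COST and all class/function names are illustrative assumptions, not taken from the paper:

```python
# Sketch of the Standard-Snoopy-Caching counter scheme for one shared block.
import random

TRANSFER_COST = 4   # assumed cost, in cycles, of one block transfer

class Sharer:
    """One cache's counter state for a single shared block."""
    def __init__(self):
        self.counter = TRANSFER_COST
        self.valid = True

def broadcast_step(sharers):
    """Process one write broadcast.

    One valid sharer is chosen arbitrarily and its counter decremented;
    at zero its copy is invalidated. Returns True while broadcasts must
    continue, i.e. while some sharer still holds a valid copy.
    """
    live = [s for s in sharers if s.valid]
    if not live:
        return False
    victim = random.choice(live)   # the "arbitrarily" chosen cache
    victim.counter -= 1
    if victim.counter == 0:
        victim.valid = False       # breakeven reached: invalidate this copy
    return any(s.valid for s in sharers)

def reaccess(sharer):
    """A reread by this processor resets its counter to the initial value."""
    sharer.counter = TRANSFER_COST
    sharer.valid = True
```

Once every sharer's counter has been spent, the cycles already paid for broadcasts equal the cycles the sharers would have paid to reread, which is exactly the breakeven point where broadcasting stops.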

Snoopy-Reading
The adversary is allowed to read-broadcast on rereads: all other caches holding invalidated copies take the data and reset their counters. When a cache's counter reaches zero it invalidates the block containing the address, and write broadcasts are discontinued once all caches but the writer's have been invalidated.
Other changes:
– on a write broadcast, all caches containing the address decrement their counters
– decrementing is done on consecutive write broadcasts by a particular processor

Snoopy-Reading vs. Standard-Snoopy-Caching
Advantages of Snoopy-Reading:
– well suited to workloads with few rereads
– does not require hardware to "arbitrarily" choose a cache
– invalidates more quickly than Standard-Snoopy-Caching

Performance Analysis of Competitive Snooping
Simulation results:
– decreases the number of write broadcasts
– the benefit is greater when there is sequential sharing

Write-Broadcast/Competitive Snooping Comparison
Competitive snooping is beneficial under sequential sharing, decreasing bus utilization and total execution time. As inter-processor contention increases, however, competitive snooping increases bus utilization and total execution time.

Conclusion
Write-Invalidate/Read-Broadcast: read-broadcast is not suitable for sequential sharing, but may prove beneficial in the single-producer, multiple-consumer situation.
Write-Broadcast/Competitive Snooping: competitive snooping is advantageous when there is sequential sharing.

References
S.J. Eggers and R.H. Katz, "Evaluating the Performance of Four Snooping Cache Coherency Protocols," Proc. 16th International Symposium on Computer Architecture (ISCA), 1989.

MSI State Transition Diagram
States: Modified, Shared, Invalid. A similar protocol was used in the Silicon Graphics 4D series multiprocessor machines.
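The MSI transitions can be written as a small lookup table; this is a sketch that abstracts away data movement, using the conventional processor-side (PrRd/PrWr) and bus-side (BusRd/BusRdX/Flush) event names:

```python
# Compact MSI transition table: (state, event) -> (next_state, bus_action).
MSI = {
    # processor-side events
    ("I", "PrRd"): ("S", "BusRd"),    # read miss: fetch shared copy
    ("I", "PrWr"): ("M", "BusRdX"),   # write miss: fetch exclusive copy
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),   # upgrade: invalidate other sharers
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
    # snoop-side events
    ("S", "BusRdX"): ("I", None),     # another writer: invalidate
    ("M", "BusRd"):  ("S", "Flush"),  # supply dirty data, keep shared copy
    ("M", "BusRdX"): ("I", "Flush"),  # supply dirty data, give up copy
}

def step(state, event):
    """Look up the next state and bus action for one cache line."""
    return MSI[(state, event)]
```

For example, a write hit in Shared state returns ('M', 'BusRdX'): the line upgrades to Modified and an invalidating bus transaction goes out.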

MESI State Transition Diagram
States: Modified, Exclusive, Shared, Invalid. Variants are used in the Intel Pentium, PowerPC 601, and MIPS R4400.
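What MESI adds over MSI is the Exclusive state; the sketch below isolates just that difference (function names are illustrative, and the full snoop-side behavior is omitted):

```python
# Sketch of the MESI Exclusive state: a read miss with no other sharers
# installs the line Exclusive, so a later write upgrades to Modified
# silently, without any bus traffic.

def read_miss(other_sharers):
    """Return the state the line is installed in; a shared bus line
    tells the reader whether any other cache holds the block."""
    return "S" if other_sharers else "E"

def write_hit(state):
    """Return (new_state, bus_transaction) for a write hit."""
    if state == "E":
        return ("M", None)        # silent upgrade: no one else has a copy
    if state == "S":
        return ("M", "BusRdX")    # must invalidate the other sharers
    return ("M", None)            # already Modified
```

Under MSI the unshared read-then-write pattern costs two bus transactions; under MESI it costs one, which is the motivation for the extra state.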

MOESI Protocol
Adds an Owned state (Shared Modified): the block may be shared, but memory does not hold a valid copy, so the owning cache supplies the data. Used in the Athlon MP.

Write-Once Protocol
States: I – Invalid, V – Valid, R – Reserved, D – Dirty
Processor-side transitions:
– I: PrRd/BusRd → V; PrWr/BusRdX → D
– V: PrRd/– → V; PrWr/BusWrOnce → R (the first write is written through)
– R: PrRd/– → R; PrWr/– → D (subsequent writes stay local)
– D: PrRd/–, PrWr/– → D
Snoop-side transitions:
– D: BusRd/BusWB → V; BusRdX/BusWB → I
– V or R: BusRdX/– or BusWrOnce/– → I