“An Evaluation of Directory Schemes for Cache Coherence” Presented by Scott Weber.


Outline
  Motivation and goals for directory schemes
  Directory schemes
  Schemes evaluated
    –Directory-based
    –Snoopy
  Insights from the evaluation
  Directory scheme alternatives
  Conclusions and Retrospective

Motivation and Goals
  Snooping does not scale past ~20 processors
    –Protocol depends on low-latency broadcasts
  Snooping interferes with the processor-cache connection
  Avoid broadcast nature of snooping
  Directory-based protocols should be competitive with snoopy protocols
  Access to a directory cannot be a bottleneck

Directory Schemes
  Tang scheme (Dir n NB)
    –Multiple clean blocks, one dirty block
    –Copy of tags, dirty bits for each cache in directory
  Read miss: check directory, if dirty then write dirty back, supply the data, update directory.
  Write miss: check directory, if dirty then flush dirty back, invalidate clean copies, perform the write, update directory.
  Write hit (dirty): if dirty bit set then write.
  Write hit (clean): if dirty bit not set, notify directory, invalidate clean copies, update directory, update dirty bit.
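The four cases on this slide can be sketched as a toy directory. This is only an illustration under stated assumptions, not Tang's actual design: the `Directory` class, its `sharers`/`dirty` maps, and the message names in `log` are hypothetical, and the directory's dirty flag stands in for the per-cache dirty bit that a real cache would check locally on a write hit.

```python
class Directory:
    """Toy directory: per block, the set of caches holding it and a dirty flag.

    Hypothetical sketch; names and message labels are assumptions.
    """

    def __init__(self):
        self.sharers = {}  # block address -> set of cache ids holding a copy
        self.dirty = {}    # block address -> True if the single holder modified it
        self.log = []      # messages the directory would send, kept for inspection

    def _write_back_if_dirty(self, block):
        # If some cache holds the block dirty, ask it to write the data back.
        if self.dirty.get(block, False):
            (owner,) = self.sharers[block]
            self.log.append(("write_back", owner, block))
            self.dirty[block] = False

    def _invalidate_others(self, cache, block):
        # Invalidate every copy except the writer's own.
        for other in sorted(self.sharers.get(block, set()) - {cache}):
            self.log.append(("invalidate", other, block))
        self.sharers[block] = {cache}
        self.dirty[block] = True

    def read_miss(self, cache, block):
        self._write_back_if_dirty(block)
        self.log.append(("supply_data", cache, block))
        self.sharers.setdefault(block, set()).add(cache)

    def write_miss(self, cache, block):
        self._write_back_if_dirty(block)       # flush: write back, then invalidate
        self._invalidate_others(cache, block)  # writer becomes the only (dirty) holder

    def write_hit(self, cache, block):
        if self.dirty.get(block, False) and self.sharers[block] == {cache}:
            return  # dirty hit: write locally, no directory traffic
        self._invalidate_others(cache, block)  # clean hit: notify and invalidate


d = Directory()
d.read_miss(0, 0x40)
d.read_miss(1, 0x40)   # two clean copies
d.write_hit(0, 0x40)   # clean hit: cache 1 is invalidated
```

In this sketch a second `d.write_hit(0, 0x40)` adds nothing to `d.log`, which mirrors the slide's point that a write hit on a dirty block needs no directory access.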

Directory Schemes
  Modifications to Tang’s scheme
    –Censier and Feautrier (Dir n NB)
        Vector of valid bits for each cache and dirty bit
        Use the address of the data to access directory
    –Yen and Fu (Dir n NB) refines C & F
        Single bit in each cache to indicate only copy
        When set, do not have to access directory
        Requires more bandwidth to update single bits

Directory Schemes
  Archibald and Baer (Dir 0 B)
    –Four states:
        block not cached
        block clean in exactly one cache
        block clean in an unknown number of caches
        block dirty in exactly one cache
    –Requires broadcasts to do invalidations and write backs
    –Organization is still centralized
    –Easy to add more caches to the systems
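A rough sketch of how a four-state, pointer-free directory entry might evolve is given below. The transitions are a simplified reading of the slide, not a transcription of the Archibald and Baer protocol; the `State` enum, the two functions, and the broadcast labels are hypothetical names. Because the entry stores no cache pointers, any action that must reach other caches is a broadcast.

```python
from enum import Enum


class State(Enum):
    NOT_CACHED = 0   # block not cached
    CLEAN_ONE = 1    # block clean in exactly one cache
    CLEAN_MANY = 2   # block clean in an unknown number of caches
    DIRTY_ONE = 3    # block dirty in exactly one cache


def on_read_miss(state):
    """Return (next_state, broadcast), where broadcast is None if not needed."""
    if state is State.NOT_CACHED:
        return State.CLEAN_ONE, None
    if state is State.DIRTY_ONE:
        # Unknown owner: broadcast a write-back request; owner keeps a clean copy.
        return State.CLEAN_MANY, "write_back"
    return State.CLEAN_MANY, None


def on_write(state, requester_has_copy):
    """Handle a write miss, or a write hit on a clean block, at the directory.

    A write hit on a dirty block never reaches the directory, so it is not
    modelled here.
    """
    if state is State.NOT_CACHED:
        return State.DIRTY_ONE, None
    if state is State.CLEAN_ONE and requester_has_copy:
        # The only copy belongs to the writer, so nobody else needs a message.
        return State.DIRTY_ONE, None
    if state is State.DIRTY_ONE:
        return State.DIRTY_ONE, "write_back_and_invalidate"
    return State.DIRTY_ONE, "invalidate"
```

The "exactly one cache" states are what let this sketch skip a broadcast in the common case where the writer already holds the sole clean copy.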

Schemes Evaluated
  Classification
    –Dir(cache pointers)[Broadcast|No Broadcast]
        Dir 1 NB: Tang (with n = 1) and variants
        Dir 0 B: Archibald and Baer
  Alternatives attempted based on results
    –Dir i NB, Dir n NB, Dir 1 B, Dir i B
  Write-Through-With-Invalidate (WTI)
  Dragon Update Protocol

Evaluation Methodology
  Trace-driven simulation
  Interested in memory traffic for CC (use infinite cache)
  Machine-independent metric: communication cost per memory reference
  Assume bus for comparison
  Measure event frequencies for various types of memory accesses (differ for each protocol)
  Weight event frequencies in terms of bus cycles
    –Non-pipelined shared bus model
    –Pipelined split address/data bus model
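The weighting step amounts to a sum of per-event frequency times per-event bus cost. The sketch below shows that arithmetic only; the function name, the event names, and every number in the example are made up for illustration and are not taken from the paper's tables.

```python
def bus_cycles_per_reference(event_freq, event_cycles):
    """Average bus cycles per memory reference.

    event_freq:   event name -> occurrences per memory reference
    event_cycles: event name -> bus cycles one occurrence costs under a bus model
    """
    return sum(freq * event_cycles[event] for event, freq in event_freq.items())


# Hypothetical inputs, chosen only to show the calculation.
example_freq = {"read_miss": 0.01, "invalidate": 0.004}
example_cycles = {"read_miss": 5, "invalidate": 1}
example_cost = bus_cycles_per_reference(example_freq, example_cycles)
```

Swapping in a different `event_cycles` table is how the same measured frequencies could be re-scored for the non-pipelined and pipelined bus models.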

Evaluation of the Protocols
  Dir 1 NB has a high read miss rate (5.18%)
    –Caused by read sharing among processes
    –Limitation of data only being in one cache
    –Dir 0 B has a low read miss rate (0.62%)
  Dir 0 B and WTI have same rates since they have same state changes on data in cache
  Dragon is dominated by write hits (updates)
  36% of misses are coherency-related misses

Evaluation of the Protocols
  >85% of writes to previously clean blocks cause invalidations in 0 or 1 caches
    –Motivates Dir i NB, Dir n NB, Dir 1 B, Dir i B
  Directory bandwidth similar to memory
    –Can distribute directory and memory to scale
  Estimating average memory access makes protocol bus cycles more equal
  Spin-locks on shared variables hurt Dir 1 NB

Directory Scheme Alternatives
  Schemes introduced to decrease broadcasts
    –Dir n NB performs sequential invalidations
    –Dir 1 B performs a single invalidation (common case) if broadcast bit is clear, otherwise broadcast
    –Dir i NB and Dir i B use limited number of ptrs
    –Limited broadcasts invalidate to cache subsets
        01XX01 encoding indicates subsets
  Schemes like these scale since new directory bits do not necessarily have to be added when scaling
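Two of these ideas can be sketched briefly. The first part models a directory entry with at most i pointers plus a broadcast bit (i = 1 gives the Dir 1 B behaviour described above). The second part shows one plausible reading of the 01XX01 encoding, where X is treated as a digit that matches both 0 and 1, so one code names a subset of caches. All class and function names are hypothetical, and the X-as-wildcard interpretation is an assumption.

```python
class LimitedPointerEntry:
    """Directory entry with at most `i` cache pointers and a broadcast bit."""

    def __init__(self, i):
        if i < 1:
            raise ValueError("this sketch assumes at least one pointer")
        self.i = i
        self.pointers = []      # cache ids known to hold the block
        self.broadcast = False  # set once more than i caches share the block

    def add_sharer(self, cache):
        if self.broadcast or cache in self.pointers:
            return
        if len(self.pointers) < self.i:
            self.pointers.append(cache)
        else:
            self.pointers.clear()  # pointers overflowed: fall back to broadcast
            self.broadcast = True

    def invalidation_targets(self, writer, all_caches):
        """Caches to invalidate when `writer` writes; writer becomes sole holder."""
        holders = all_caches if self.broadcast else self.pointers
        targets = [c for c in holders if c != writer]
        self.pointers, self.broadcast = [writer], False
        return targets


def merge_code(code, cache_id_bits):
    """Fold a cache id (bit string) into a 0/1/X code: differing digits become X."""
    return "".join(a if a == b else "X" for a, b in zip(code, cache_id_bits))


def expand_code(code):
    """List every cache id (bit string) that a 0/1/X code matches."""
    ids = [""]
    for digit in code:
        choices = "01" if digit == "X" else digit
        ids = [prefix + c for prefix in ids for c in choices]
    return ids
```

Under this reading, `expand_code("01XX01")` yields four cache ids, so an invalidation goes to a small subset rather than to every cache, and the code length grows only with the number of bits in a cache id.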

Conclusions
  Bandwidth to directory is similar to bandwidth to memory
    –Distribute the directory and memory
    –Allows to scale with the number of processors
  Eliminates the inefficiency of broadcasts
    –Most blocks shared by 0 or 1 caches
    –Only need a few pointers in each directory entry
  Snoopy and broadcast protocols are competitive
    –Need to keep the number of spin-locks to a minimum

Retrospective
  Paper led to the development of DASH (Dir n NB) prototype
  Concern at paper time was if snoopy and directory-based protocols were competitive
  Real issues
    –Scalability of coherence scheme
    –Implementation complexity
    –Overhead of coherence protocol
    –Performance with many processors
    –Implementing distributed directory coherence

Event Frequencies

Bus Cycle Breakdown