Cache Coherence: “Can we do a better job of supporting cache coherence?” Ross Daly, Chan Kim

Definition of CC
Single-Writer, Multiple-Read (SWMR) Invariant: “For any given memory location, at any given moment in time, there is either a single core that may write it (and that may also read it) or some number of cores that may read it.”
Data-Value Invariant: “the value of a memory location at the start of an epoch is the same as the value of the memory location at the end of its last read-write epoch”
- D. J. Sorin, M. D. Hill, and D. A. Wood. A Primer on Memory Consistency and Cache Coherence, volume 6 of Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, May 2011.
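The two invariants above can be made concrete as checks over a recorded trace. The following Python sketch is illustrative only: the permission labels, the `Epoch` record, and the function names are hypothetical, chosen for this example, and the code checks a trace rather than modelling any real protocol.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical labels for a core's permission on one memory location.
READ_WRITE = "RW"
READ_ONLY = "R"
NO_ACCESS = "-"


def swmr_holds(perms: Dict[int, str]) -> bool:
    """Check the single-writer/multiple-reader invariant at one instant.

    `perms` maps a core id to its permission for a single location.
    """
    writers = [core for core, p in perms.items() if p == READ_WRITE]
    readers = [core for core, p in perms.items() if p == READ_ONLY]
    if not writers:
        return True  # any number of read-only cores is allowed
    # Exactly one writer, and it must be the only core with access.
    return len(writers) == 1 and not readers


@dataclass
class Epoch:
    kind: str         # READ_WRITE (one writer) or READ_ONLY (readers only)
    start_value: int  # value of the location when the epoch begins
    end_value: int    # value of the location when the epoch ends


def data_value_holds(epochs: List[Epoch], initial_value: int) -> bool:
    """Check the data-value invariant over a sequence of epochs.

    Each epoch must start with the value left at the end of the last
    read-write epoch (or the initial value if there was none).
    """
    last_rw_value = initial_value
    for epoch in epochs:
        if epoch.start_value != last_rw_value:
            return False
        if epoch.kind == READ_WRITE:
            last_rw_value = epoch.end_value
        elif epoch.end_value != epoch.start_value:
            return False  # a read-only epoch cannot change the value
    return True


print(swmr_holds({0: READ_WRITE, 1: NO_ACCESS}))  # True
print(swmr_holds({0: READ_WRITE, 1: READ_ONLY}))  # False
```

A trace such as a write epoch taking the value from 0 to 5, followed by a read-only epoch that starts at 5, passes both checks; a reader that still sees 0 after that write violates the data-value invariant.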

Goals: Improve the performance of cache coherence on multi-core/many-core systems. Scale the number of cores to increase performance. Scale the number of cores without increasing cache coherence complexity.

Xpoint Cache: Motivation

Xpoint: Architecture (2D) [Figure: typical bus-based architecture vs. Xpoint architecture]

Xpoint: Architecture (3D)

Xpoint: Results: 29x speedup for a 32-core system; 45x speedup for a 64-core system; 2.1x improvement over a conventional 64-core bus.
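One way to read the Xpoint speedups is as parallel efficiency (speedup divided by core count). The sketch below assumes the speedups are measured against a single core, which the slide does not state; under that assumption, efficiency falls from roughly 91% at 32 cores to roughly 70% at 64 cores.

```python
# Reported speedups from the Xpoint results slide: core count -> speedup.
# Assumption: speedup is relative to a single core (baseline not stated).
reported_speedups = {32: 29.0, 64: 45.0}


def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear speedup achieved (1.0 = perfect scaling)."""
    return speedup / cores


for cores, speedup in sorted(reported_speedups.items()):
    eff = parallel_efficiency(speedup, cores)
    print(f"{cores} cores: speedup {speedup:.0f}x, efficiency {eff:.1%}")
```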

Increasing the Effectiveness of Directory Caches by Deactivating Coherence for Private Memory Blocks: Motivation. Keeping track of every block in the directory entails huge storage requirements. A directory cache requires less storage, but it suffers from directory cache misses. Most of the accessed blocks (about 75% on average) are private.

Increasing the Effectiveness of Directory Caches by Deactivating Coherence for Private Memory Blocks: Private vs. Shared Blocks. A coarse-grain strategy (page granularity): the OS detects when a private page must become shared. Every newly loaded page starts as private; when another processor accesses a block of a private page, the page becomes shared.
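The page-granularity classification described above can be sketched as follows. This is a hypothetical model, not the paper's OS code: `PageClassifier`, `PAGE_SIZE = 4096`, and the rule that the first accessing core becomes the page's owner are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

PAGE_SIZE = 4096  # assumption: 4 KiB pages


@dataclass
class PageInfo:
    private: bool = True
    owner: Optional[int] = None  # the core that first touched the page


class PageClassifier:
    """Sketch of an OS-level private/shared classifier at page granularity."""

    def __init__(self) -> None:
        self.pages: Dict[int, PageInfo] = {}

    def access(self, core: int, addr: int) -> bool:
        """Record an access. Returns True if it turns a private page shared,
        i.e. the point where a coherence recovery would be needed."""
        page = addr // PAGE_SIZE
        info = self.pages.get(page)
        if info is None:
            # Every newly loaded page starts out private to its first user.
            self.pages[page] = PageInfo(private=True, owner=core)
            return False
        if info.private and info.owner != core:
            info.private = False  # a second core touched it: now shared
            return True
        return False

    def needs_directory_tracking(self, addr: int) -> bool:
        """Only blocks of shared pages occupy directory cache entries."""
        info = self.pages.get(addr // PAGE_SIZE)
        return info is not None and not info.private
```

Because private pages never enter the directory cache, its entries are reserved for the minority of blocks that are actually shared.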

Increasing the Effectiveness of Directory Caches by Deactivating Coherence for Private Memory Blocks

Increasing the Effectiveness of Directory Caches by Deactivating Coherence for Private Memory Blocks: Coherence Recovery Mechanism. Flushing-based recovery mechanism - flushing all the blocks within a page may increase the miss rate. Updating-based recovery mechanism.
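The two recovery options can be contrasted in a small Python sketch. Everything here is hypothetical (`PrivateCache`, `Directory`, the 4 KiB page size), and the updating-based variant reflects only our reading of its name: the owner's cached blocks of the page are registered in the directory instead of being evicted. The paper's actual procedure may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

PAGE_SIZE = 4096  # assumption: 4 KiB pages


@dataclass
class PrivateCache:
    # block address -> dirty flag, for blocks this core currently caches
    blocks: Dict[int, bool] = field(default_factory=dict)

    def blocks_in_page(self, page: int) -> List[int]:
        return [b for b in self.blocks if b // PAGE_SIZE == page]


@dataclass
class Directory:
    # block address -> set of cores recorded as holding the block
    entries: Dict[int, Set[int]] = field(default_factory=dict)


def flush_recovery(owner_cache: PrivateCache, page: int, writebacks: List[int]) -> None:
    """Flushing-based: evict every block of the page from the owner's cache,
    writing dirty ones back. The owner's later accesses to the page miss,
    which is why this can raise the miss rate."""
    for block in owner_cache.blocks_in_page(page):
        if owner_cache.blocks.pop(block):  # pop returns the dirty flag
            writebacks.append(block)


def update_recovery(owner_cache: PrivateCache, owner: int, page: int,
                    directory: Directory) -> None:
    """Updating-based (as assumed here): keep the blocks cached and record
    the owner as a holder of each one, so coherence is tracked from now on."""
    for block in owner_cache.blocks_in_page(page):
        directory.entries.setdefault(block, set()).add(owner)
```

The trade-off the slide points at is visible here: flushing needs no directory work but discards useful cached data, while updating keeps the data at the cost of new directory entries.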

Increasing the Effectiveness of Directory Caches by Deactivating Coherence for Private Memory Blocks: Results. Directory caches can avoid tracking about 57% of the blocks. This either shortens the runtime of parallel applications by 15% while keeping the same directory cache size, or maintains system performance while using directory caches 8 times smaller.

Complexity-Effective Multicore Coherence: Similarity - the same motivation, distinguishing private and shared blocks. Difference - the protocol is simplified, and it is directory-less.

Complexity-Effective Multicore Coherence: Simplifying the Protocol. Dynamic write policy - write-back for private data vs. write-through for shared data. VIPS cache coherence protocol - Valid/Invalid, Private/Shared.
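A minimal sketch of the dynamic write policy, assuming each L1 line carries a valid bit and a private/shared bit as the VIPS state names suggest. `Line`, `L1Cache`, and `write_through_log` are hypothetical, and writes to shared lines are sent through immediately here for simplicity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Line:
    valid: bool = True    # V/I bit
    private: bool = True  # P/S bit, assumed to come from page classification
    dirty: bool = False   # only meaningful for private (write-back) lines
    value: int = 0


@dataclass
class L1Cache:
    lines: Dict[int, Line] = field(default_factory=dict)
    # Addresses whose writes were sent through to the shared level.
    write_through_log: List[int] = field(default_factory=list)

    def write(self, addr: int, value: int, private: bool) -> None:
        line = self.lines.setdefault(addr, Line(private=private))
        line.valid = True
        line.value = value
        if line.private:
            line.dirty = True  # write-back: defer the update (eviction not modelled)
        else:
            self.write_through_log.append(addr)  # write-through for shared data
```

Private data never needs to be visible to other cores, so deferring its writes is safe; shared data is pushed toward the shared level, which is what lets the protocol avoid tracking sharers.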

Complexity-Effective Multicore Coherence: Directory-less. Self-invalidation - readers are allowed to make unregistered copies of a memory location, as long as they promise to invalidate them at the next synchronization point. Does this still satisfy the definition of cache coherence? Selective flushing - write-through at word granularity, with a per-word dirty bit.
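The following Python sketch models self-invalidation and selective flushing together. It is a simplified, hypothetical model (the class names, `WORDS_PER_LINE = 4`, and the `acquire`/`release` hooks are assumptions): a release writes through only the dirty words of shared lines, and an acquire drops all shared lines so the next read refetches them. The per-word dirty bits let two cores that write different words of the same line both have their updates land in the shared level.

```python
from dataclasses import dataclass, field
from typing import Dict, List

WORDS_PER_LINE = 4  # assumption for illustration


@dataclass
class CachedLine:
    words: List[int]
    shared: bool
    dirty: List[bool] = field(default_factory=lambda: [False] * WORDS_PER_LINE)


class Core:
    def __init__(self, llc: Dict[int, List[int]]) -> None:
        self.llc = llc  # shared last-level cache: line address -> words
        self.l1: Dict[int, CachedLine] = {}

    def read(self, line_addr: int, word: int, shared: bool = True) -> int:
        if line_addr not in self.l1:
            # Unregistered copy: nothing records that this core holds it.
            self.l1[line_addr] = CachedLine(list(self.llc[line_addr]), shared)
        return self.l1[line_addr].words[word]

    def write(self, line_addr: int, word: int, value: int, shared: bool = True) -> None:
        self.read(line_addr, word, shared)  # allocate the line on write
        line = self.l1[line_addr]
        line.words[word] = value
        line.dirty[word] = True

    def release(self) -> None:
        """Selective flushing: write through only dirty words of shared lines."""
        for addr, line in self.l1.items():
            if line.shared:
                for i, is_dirty in enumerate(line.dirty):
                    if is_dirty:
                        self.llc[addr][i] = line.words[i]
                line.dirty = [False] * WORDS_PER_LINE

    def acquire(self) -> None:
        """Self-invalidation: drop every shared line; private lines survive."""
        self.l1 = {a: line for a, line in self.l1.items() if not line.shared}
```

In this model a core that has not yet reached an acquire keeps reading its stale copy, which is why the slide asks whether the scheme still matches the SWMR definition.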

Complexity-Effective Multicore Coherence: Simplifying the Protocol: Synchronization. Synchronization relies on data races: an atomic instruction spins locally in its L1 until another core changes the condition. In this paper, however, a core does not send invalidations to other cores when it executes a write instruction, so the spinning core may never observe the change. Solution?
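The synchronization problem can be illustrated with a toy spin loop. In this hypothetical sketch, another core releases a lock by writing to the shared level, and no invalidation is sent. Spinning on the private L1 copy never observes the release within the iteration bound, while re-reading the shared level does; the latter is shown only for contrast and is not claimed to be the paper's solution.

```python
def spin(read_from_llc: bool, max_iters: int = 1000) -> bool:
    """Spin until the lock word reads 0 (free); return True if the release is seen.

    Hypothetical model: `llc` is the shared level, `l1` the spinner's private copy.
    """
    llc = {0: 1}      # lock word in the shared level; 1 = held
    l1 = {0: llc[0]}  # private copy loaded while the lock was still held
    for i in range(max_iters):
        if i == 10:
            # Another core releases the lock. Its write reaches the shared
            # level, but no invalidation is sent to the spinner's L1.
            llc[0] = 0
        value = llc[0] if read_from_llc else l1[0]
        if value == 0:
            return True   # release observed
    return False          # release never observed within max_iters


print(spin(read_from_llc=False))  # False: stale L1 copy
print(spin(read_from_llc=True))   # True: shared level sees the release
```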

Complexity-Effective Multicore Coherence: Simplifying the Protocol: Results. Outperformed a MESI directory protocol by 4.8%. Reduced network energy consumption by 14.2%. Simulated with 15 parallel benchmarks on 16 cores.