An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing. Liqun Cheng, John B. Carter and Donglai Dai, cs.utah.edu. By Evangelos Vlachos.

Presentation transcript:


2 Introduction
Shared-memory multiprocessors
– Enterprise servers, Top500 supercomputers
Shared-memory paradigm
– Producer-consumer relationships
– Suffer from remote misses
Solutions
– High-performance interconnects
– Sophisticated latency-hiding techniques
– Effective caching/coherence mechanisms

3 Motivation
Update vs. invalidate protocols
– Too much coherence traffic
Adaptive protocols to optimize for migratory sharing
– Identify dynamic sharing during execution
– Adapt the coherence protocol

4 The Problem
– 3-hop latency: a consumer's miss goes to the home node, is forwarded to the producer, and the producer replies

5 Basic Idea (1/2)
Directory delegation
– Identify shared blocks
– Producer node becomes the home node
– Consumers send requests directly to the producer
Decrease latency
– Each read/write access completes after 2 hops
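The 3-hop versus 2-hop claim can be illustrated with a toy hop count. This is a sketch under stated assumptions, not the authors' simulator: it charges one hop per network message on the critical path of a consumer read miss, and the function name and node labels are hypothetical.

```python
def read_miss_hops(consumer, home, producer, delegated):
    """Count network hops on the critical path of a consumer read miss.

    Toy model (assumption): one hop per message between distinct nodes.
    """
    if delegated:
        # Consumer asks the producer directly; the producer replies with data.
        path = [consumer, producer, consumer]
    else:
        # Consumer asks home; home forwards to the producer; producer replies.
        path = [consumer, home, producer, consumer]
    # A "message" from a node to itself costs nothing in this model.
    return sum(1 for src, dst in zip(path, path[1:]) if src != dst)


baseline = read_miss_hops("C", "H", "P", delegated=False)
with_delegation = read_miss_hops("C", "H", "P", delegated=True)
```

In this model the baseline miss costs 3 hops and the delegated miss costs 2, matching the numbers on the slides.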

6 Basic Idea (2/2)
The producer node can identify sharers
– Sharer nodes are stored in the directory
– Speculate that new data will be requested
– Forward new data to sharers
Similar to…
– Prefetching
– Last-write prediction

7 Architecture

8 RAC == Remote Access Cache
In the past
– Eliminated remote misses caused by small, low-associativity caches
– Not a problem today
In this work
– A location to push data at a remote node
– A location to store delegated blocks
– Victim cache (for remote misses) as before

9 Sharing Pattern Detection
Track access history only for frequently used blocks
– Directory entries reside in the directory cache
Keep saturating counters per directory entry
– last_writer id → 4 bits
– reader_count → 2 bits
– write_repeat → 2 bits
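The per-entry history can be sketched as follows. The field names and widths come from the slide; the update rule and the detection threshold are assumptions made for illustration, since the slide does not spell them out.

```python
from dataclasses import dataclass

SAT_MAX = 3  # a 2-bit saturating counter tops out at 3


@dataclass
class DirEntryHistory:
    last_writer: int = -1   # 4-bit node id on the slide; -1 means "none yet"
    reader_count: int = 0   # 2-bit saturating counter
    write_repeat: int = 0   # 2-bit saturating counter

    def on_read(self, node: int) -> None:
        # Count reads issued by nodes other than the last writer.
        if node != self.last_writer:
            self.reader_count = min(self.reader_count + 1, SAT_MAX)

    def on_write(self, node: int) -> None:
        # Assumed rule: a repeated write by the same node, with at least one
        # remote read in between, is evidence of producer-consumer sharing.
        if node == self.last_writer and self.reader_count > 0:
            self.write_repeat = min(self.write_repeat + 1, SAT_MAX)
        else:
            self.write_repeat = 0
        self.last_writer = node
        self.reader_count = 0

    def is_producer_consumer(self) -> bool:
        # Assumed threshold: flag the block once write_repeat saturates.
        return self.write_repeat == SAT_MAX


history = DirEntryHistory()
for _ in range(4):          # node 1 writes, nodes 2 and 3 read, repeatedly
    history.on_write(1)
    history.on_read(2)
    history.on_read(3)
detected = history.is_producer_consumer()
```

Under these assumed rules, a write by a different node resets `write_repeat`, so a block that starts migrating between writers stops being flagged.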

10 Producer/Consumer Tables
Maintain state for blocks that don't reside at the home node
– Producer table: blocks for which the current node serves as the producer
– Consumer table: blocks the current node is interested in, each with its corresponding producer node
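One way to picture how the two tables steer requests is the sketch below. The class, its methods, and the `home_of` mapping are hypothetical names introduced for illustration; the slides describe only what each table holds.

```python
class Node:
    def __init__(self, node_id, home_of):
        self.node_id = node_id
        self.home_of = home_of        # function: block address -> home node id
        self.producer_table = set()   # blocks delegated to this node
        self.consumer_table = {}      # block -> producer node id

    def route_request(self, block):
        """Return the node id that should receive a request for `block`."""
        if block in self.producer_table:
            return self.node_id                 # this node acts as the home
        if block in self.consumer_table:
            return self.consumer_table[block]   # go straight to the producer
        return self.home_of(block)              # default: the real home node

    def undelegate(self, block):
        # Drop delegation state; later requests fall back to the home node.
        self.producer_table.discard(block)
        self.consumer_table.pop(block, None)


node = Node(node_id=2, home_of=lambda addr: addr % 4)
node.consumer_table[0x40] = 3     # block 0x40 is known to be produced by node 3
target = node.route_request(0x40)
```

A consumer-table hit sends the request to the producer rather than to the home node, which is where the 2-hop access comes from.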

11 Delegate - Undelegate

12 One step further – Speculative updates
Eliminate remote misses
– Maintain sharers list after invalidation
– Forward new data to sharers
– Downgrade local state to SHARED
Need to choose carefully what data to forward
– Don't want to modify the CPU core
– Delayed intervention
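The speculative-update steps above can be sketched as a small state machine on the producer side. This is an illustrative model only: the class, the state strings, and the use of plain dicts for each consumer's RAC are assumptions, and the real delay policy is not modelled.

```python
class DelegatedBlock:
    def __init__(self, addr, sharers):
        self.addr = addr
        self.state = "SHARED"
        self.sharers = set(sharers)   # nodes currently holding a copy
        self.remembered = set()       # sharer list kept across invalidation
        self.data = None

    def producer_write(self, value, racs):
        # Invalidate consumer copies but remember who the sharers were.
        for node in self.sharers:
            racs[node].pop(self.addr, None)
        self.remembered |= self.sharers
        self.sharers.clear()
        self.state = "EXCLUSIVE"
        self.data = value

    def delayed_intervention(self, racs):
        # Assumed to run some time after the producer's last write.
        if self.state != "EXCLUSIVE":
            return
        for node in self.remembered:
            racs[node][self.addr] = self.data   # push into the consumer's RAC
        self.sharers = set(self.remembered)
        self.remembered.clear()
        self.state = "SHARED"                   # downgrade the local copy


racs = {1: {}, 2: {}}                 # node id -> RAC contents (addr -> data)
block = DelegatedBlock(0x80, sharers=[1, 2])
block.producer_write(7, racs)
block.producer_write(8, racs)         # a second write before any update is sent
block.delayed_intervention(racs)
```

Delaying the push until the producer has finished writing means only the final value (8 here) is forwarded, rather than one update per write.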

13 Delayed Intervention

14 Evaluation

15 Benchmarks

16 Results

17 Results

18 Results

19 Conclusions
Adaptive mechanisms to improve producer-consumer relationships
Eliminate remote misses
– Directory delegation & speculative updates
Minor hardware cost
– 32-entry delegate cache & 32KB RAC: exec time ↓13%, remote misses ↓29%, network traffic ↓17%
– 1K-entry delegate cache & 1MB RAC: exec time ↓21%, remote misses ↓40%, network traffic ↓15%