To Include or Not to Include? Natalie Enright and Dana Vantrease


Motivation
- CMP technology affects coherence protocols differently than previously studied MP systems:
  - New shared on-chip resources (e.g., a shared L2)
  - Low latency between on-chip caches
  - Need for scalability in design
- Industry examples: IBM Power4 (inclusion), Piranha (exclusion)
- Our goal: determine the point at which each inclusion policy (strict inclusion, non-inclusion, and exclusion) becomes the best choice for CMP performance

SMP vs. CMP Opportunities (diagram comparing per-node L1/L2 pairs in an SMP against on-chip L1s and a shared L2 in a CMP)

Multilevel Inclusion (protocol provided with the simulator)
- L1 has Modified, Shared, and Invalid states
- L2 has Modified, Owned, Shared, and Invalid states
- When an L2 line is replaced, any copies present on the chip must be invalidated (the sharers are recorded in the directory entry)
- In a single-processor chip, only two caches (instruction and data) connect to a single L2
- A chip multiprocessor adds two more level-1 caches per processor, which can make this forced inclusion harmful
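The back-invalidation that strict inclusion forces can be sketched as follows. This is an illustrative model, not the simulator's code; the names (`L2Entry`, `evict_l2_line`, `l1_caches`) are hypothetical.

```python
class L2Entry:
    def __init__(self, tag, state, sharers):
        self.tag = tag
        self.state = state        # 'M', 'O', 'S', or 'I'
        self.sharers = sharers    # set of L1 ids recorded in the directory entry

def evict_l2_line(entry, l1_caches):
    """Replacing an L2 line under strict inclusion: every on-chip copy
    named by the directory's sharer list must be invalidated first."""
    for l1_id in entry.sharers:
        l1_caches[l1_id].pop(entry.tag, None)   # invalidate the L1 copy
    entry.sharers.clear()
    entry.state = 'I'

# Example: cores 0 and 1 cache tag 0x40; evicting it from the L2
# also destroys both L1 copies, even if they were still hot.
l1s = {0: {0x40: 'S'}, 1: {0x40: 'S'}, 2: {}, 3: {}}
line = L2Entry(0x40, 'S', {0, 1})
evict_l2_line(line, l1s)
assert all(0x40 not in c for c in l1s.values())
```

With more L1s per chip, each L2 eviction can wipe out more live L1 copies, which is why forced inclusion grows more harmful as core count rises.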

Non-Inclusion (protocol courtesy of Mike)
- L1 gains Owned and Exclusive states
- The complexity of the on-chip directory increases significantly:
  - States are added to indicate local level-1 sharers or a local level-1 owner
  - L1 directory state must also be visible to external requests from other chips
- Increases effective on-chip cache storage

Directory Exclusion
- No replication of data between a single L1 and the L2
- The L2 acts as a large victim cache
  - Better utilizes cache space, lowering required off-chip bandwidth
- The L2 is the centralized coherency point (tag lookup)
- L1 states: M, E, I, SC, SM; L2 states: M, E, I
- No ownership: a data request is simply directed to the first sharer found in the tag lookup
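The victim-cache behavior above can be sketched like this: a block lives in at most one of the two levels, an L2 hit moves the line up rather than copying it, and an L1 victim drops down into the L2. All names here are hypothetical, and the replacement choice is simplified.

```python
L1_CAPACITY = 2   # illustrative tiny capacity, in blocks

def l1_miss(tag, l1, l2, fetch_from_memory):
    """Service an L1 miss under exclusion: an L2 hit *moves* the line up
    (no replication), and the L1 victim, if any, falls into the L2."""
    data = l2.pop(tag, None)          # L2 hit: remove from L2, no copy stays
    if data is None:
        data = fetch_from_memory(tag) # off-chip fill goes straight to the L1
    if len(l1) >= L1_CAPACITY:
        victim_tag, victim_data = l1.popitem()  # evict some L1 resident
        l2[victim_tag] = victim_data            # the L2 absorbs the victim
    l1[tag] = data
    return data

l1, l2 = {}, {0x80: 'blk80'}
l1_miss(0x80, l1, l2, lambda t: f'blk{t:x}')
assert 0x80 in l1 and 0x80 not in l2   # moved, not copied
```

Because a block is never duplicated across the L1 and L2, the combined capacity of the two levels is fully usable, which is the bandwidth-saving effect the slide describes.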

Directory Exclusion (diagram: per-core L1s, the shared L2, and a replicated L1 tag structure)

Tag Lookup Cache
- Aids in off-chip coherency and in directing on-chip requests
- Associativity = L1 associativity x number of L1s
- Number of sets = number of sets in a single L1
- Number of data entries = number of L1s
- Each data entry is one bit per L1: whether the corresponding L1 has the data or not (1/0)
- Scalability?
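The geometry rules above can be written out directly. The formulas are the ones from the slide; the parameter values below are hypothetical, chosen only to show the arithmetic, and the first-sharer bit trick is an illustrative way to direct a request.

```python
def tag_lookup_geometry(l1_assoc, l1_sets, num_l1s):
    """Size the tag lookup cache from the L1 organization (slide formulas)."""
    return {
        'associativity': l1_assoc * num_l1s,  # L1 associativity x number of L1s
        'sets': l1_sets,                      # same number of sets as one L1
        'data_entries': num_l1s,              # one presence bit per L1
    }

# e.g. eight 2-way L1s with 128 sets each (illustrative numbers):
geom = tag_lookup_geometry(l1_assoc=2, l1_sets=128, num_l1s=8)
assert geom == {'associativity': 16, 'sets': 128, 'data_entries': 8}

# A data entry is a presence bit vector: bit i == 1 iff L1 i holds the block.
presence = 0b00000101                          # L1s 0 and 2 hold the data
first_sharer = (presence & -presence).bit_length() - 1
assert first_sharer == 0                       # request goes to the first sharer
```

The scalability question on the slide falls out of the first formula: the structure's associativity grows linearly with the number of L1s, so a many-core chip would need a very highly associative (and therefore slow or power-hungry) lookup.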

Methodology
- Vary the L1 cache size to find the design point at which an inclusive protocol hurts performance
- As the number of cores increases, so does the aggregate L1 cache size

Simulation Configuration
- 4 processors per chip, 1 chip
- 2 MB of L2 cache
  - Small, but we wanted to see the effect of changing the ratio of L1 size to L2 size
- 16 processors per chip left as future work
- Only one chip simulated, to isolate the effects of intra-chip coherence from inter-chip coherence
- Future work: see how extending the life of a block on chip through non-inclusion or exclusion affects other chips
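A quick back-of-the-envelope check of the ratio being swept: the 4 cores and 2 MB L2 are from the configuration above, while the per-cache L1 sizes are hypothetical sweep points, and the conclusion's 25% threshold is used only as a reference mark.

```python
def aggregate_l1_ratio(cores, l1_kb, l2_mb=2):
    """Aggregate L1 capacity (separate I and D caches of l1_kb KB per core)
    as a fraction of an l2_mb MB shared L2."""
    return (cores * 2 * l1_kb) / (l2_mb * 1024)

# With 4 cores, 64 KB L1s put the aggregate L1 at 25% of the 2 MB L2,
# the point past which the conclusion says non-inclusion's complexity pays off.
assert aggregate_l1_ratio(4, 64) == 0.25
assert aggregate_l1_ratio(4, 16) == 0.0625   # 16 KB L1s: only ~6% of L2
```

This also shows why a deliberately small L2 was chosen: with realistic L1 sizes, a larger L2 would keep the ratio far below the interesting range.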

Results: Inclusion vs. Non-Inclusion (performance chart not reproduced in the transcript)

Results (cont.): Inclusion vs. Pseudo-Exclusion (performance chart not reproduced in the transcript)

Conclusion / Future Work
- An inclusive protocol is less complex, especially considering inter-chip communication
- Non-inclusion performs consistently better than inclusion
  - The additional complexity is only warranted once the total L1 cache size exceeds 25% of the L2 cache size
- Longer runs and more benchmarks would provide more conclusive evidence

Future Work
- Ongoing: get a working exclusion protocol running in the Ruby tester and Simics
  - Current status: runs 500 memory transactions in the Ruby tester
- Run tests comparable to those run for non-inclusion
- Analyze the benefits of exclusion over inclusion
- Expand to 16 cores and study scalability issues