CMP L2 Cache Management
Presented by: Yang Liu, CPS221, Spring 2008
Based on:
- Optimizing Replication, Communication, and Capacity Allocation in CMPs, Z. Chishti, M. Powell, and T. Vijaykumar
- ASR: Adaptive Selective Replication for CMP Caches, B. Beckmann, M. Marty, and D. Wood
Outline
- Motivation
- Related Work (1) – Non-Uniform Caches
- CMP-NuRAPID
- Related Work (2) – Replication Schemes
- ASR
Motivation
Two options for L2 caches in CMPs:
- Shared: high hit latency because of wire delay
- Private: more misses because of replication
Need hybrid L2 caches.
Keep in mind:
- On-chip communication is fast
- On-chip capacity is limited
NUCA – Non-Uniform Cache Architecture
- Places frequently accessed data closest to the core to allow fast access
- Couples tag and data placement
- Can only place one or two ways of each set close to the processor
NuRAPID – Non-uniform access with Replacement And Placement usIng Distance associativity
- Decouples the set-associative way number from data placement
- Divides the cache data array into d-groups (distance groups)
- Uses forward and reverse pointers:
  - Forward: from tag entry to data frame
  - Reverse: from data frame back to tag entry
  - One-to-one?
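The decoupling above can be made concrete with a small sketch. This is an illustrative model, not the paper's hardware: class and method names (`NuRapidCache`, `place`, `migrate`) are hypothetical, and it shows only how forward/reverse pointers let data move between d-groups while the tag stays in its set-associative position.

```python
# Sketch of NuRAPID-style decoupled tag/data placement: each tag entry
# holds a forward pointer to its data frame, each data frame holds a
# reverse pointer back to its tag, so data can live in any d-group.

class TagEntry:
    def __init__(self, addr_tag, d_group, frame):
        self.addr_tag = addr_tag
        self.forward = (d_group, frame)   # forward pointer: tag -> data

class DataFrame:
    def __init__(self):
        self.value = None
        self.reverse = None               # reverse pointer: data -> tag

class NuRapidCache:
    def __init__(self, n_dgroups, frames_per_group):
        self.data = [[DataFrame() for _ in range(frames_per_group)]
                     for _ in range(n_dgroups)]
        self.tags = {}                    # addr_tag -> TagEntry

    def place(self, addr_tag, value, d_group, frame):
        """Install a block, linking tag and data with matching pointers."""
        self.tags[addr_tag] = TagEntry(addr_tag, d_group, frame)
        slot = self.data[d_group][frame]
        slot.value, slot.reverse = value, addr_tag

    def migrate(self, addr_tag, new_group, new_frame):
        """Move data to another d-group without touching the tag's set/way:
        only the forward/reverse pointers change."""
        entry = self.tags[addr_tag]
        g, f = entry.forward
        old, new = self.data[g][f], self.data[new_group][new_frame]
        new.value, new.reverse = old.value, addr_tag
        old.value, old.reverse = None, None
        entry.forward = (new_group, new_frame)

    def read(self, addr_tag):
        g, f = self.tags[addr_tag].forward
        return self.data[g][f].value
```

A promotion or demotion is then just a `migrate` call; the tag array is never reorganized.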
CMP-NuRAPID – Overview
- Hybrid organization: private tags, shared data
- Controlled Replication (CR)
- In-Situ Communication (ISC)
- Capacity Stealing (CS)
CMP-NuRAPID – Structure
- Needs a carefully chosen d-group preference for each core
CMP-NuRAPID – Data and Tag Arrays
- Tag arrays snoop on the bus to maintain coherence
- The data array is accessed through a crossbar
CMP-NuRAPID – Controlled Replication
For read-only sharing:
- First access: no copy is made, saving capacity
- Second access: replicate, reducing future access latency
- Overall effect: avoids the extra off-chip misses that uncontrolled replication would cause
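The first-access/second-access rule can be sketched as a tiny decision function. This is a hedged illustration: the per-block counter and the class name `ControlledReplication` are my own, not the paper's mechanism.

```python
# Sketch of Controlled Replication: replicate a read-only shared block
# into the reader's close d-group only on its *second* access, so
# blocks touched once never consume extra capacity.

from collections import defaultdict

class ControlledReplication:
    def __init__(self):
        # per-(core, block) count of accesses to a remote copy
        self.remote_accesses = defaultdict(int)

    def on_remote_read(self, core, block):
        """Return True if this read should create a local replica."""
        self.remote_accesses[(core, block)] += 1
        # 1st access: read the existing copy in place (no replica).
        # 2nd access onward: the block shows reuse, so replicate.
        return self.remote_accesses[(core, block)] >= 2
```

Usage: the first `on_remote_read(0, 0x40)` returns `False` (no copy), the second returns `True` (replicate).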
CMP-NuRAPID – Timing Issues
- Race: a read starts before an invalidation and completes after it
  - Fix: mark the tag of a block being read from a farther d-group as busy
- Race: a read starts after an invalidation begins and completes before the invalidation finishes
  - Fix: before sending a read request to a farther d-group, put an entry in a queue that preserves bus-transaction order
CMP-NuRAPID – In-Situ Communication
- For read-write sharing
- Adds a Communication (C) coherence state
- L1 caches use write-through for all C-state blocks
CMP-NuRAPID – Capacity Stealing
- Demote less frequently used data to unused frames in d-groups closer to cores with lower capacity demand
- Placement and promotion:
  - Place all private blocks in the d-group closest to the initiating core
  - On reuse, promote the block directly to that core's closest d-group
CMP-NuRAPID – Capacity Stealing (cont.)
- Demotion and replacement:
  - Demote the victim block to the next-fastest d-group
  - Choose replacement victims in the order: invalid, then private, then shared
- Question: doesn't this kind of demotion pollute another core's fastest d-group?
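The invalid-then-private-then-shared replacement order amounts to a simple priority scan. A minimal sketch, with a hypothetical `(index, state)` frame encoding of my own:

```python
# Sketch of the capacity-stealing replacement priority:
# prefer invalid frames, then private blocks, then shared blocks.

def pick_victim(frames):
    """frames: list of (index, state) pairs, where state is one of
    'invalid', 'private', 'shared'. Returns the frame index to evict."""
    for wanted in ("invalid", "private", "shared"):
        for idx, state in frames:
            if state == wanted:
                return idx
    raise ValueError("no evictable frame")
```

For example, in a set with no invalid frame, a private block is evicted before any shared block, on the reasoning that shared blocks serve multiple cores.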
CMP-NuRAPID – Methodology
- Simics simulation
- 4-core CMP
- 8 MB, 8-way CMP-NuRAPID with four single-ported d-groups
- Both multithreaded and multiprogrammed workloads
CMP-NuRAPID – Multithreaded
CMP-NuRAPID – Multiprogrammed
Replication Schemes
- Cooperative Caching
  - Private L2 caches
  - Restricts replication under certain criteria
- Victim Replication
  - Shared L2 cache
  - Allows replication under certain criteria
- Both use static replication policies. What about a dynamic policy?
ASR – Overview
- Adaptive Selective Replication: dynamic cache-block replication
- Replicate blocks only when the benefit exceeds the cost
  - Benefit: lower L2 hit latency
  - Cost: more L2 misses
ASR – Sharing Types
- Single Requestor: blocks accessed by a single processor
- Shared Read-Only: blocks read, but not written, by multiple processors
- Shared Read-Write: blocks accessed by multiple processors, with at least one write
Focus on replicating shared read-only blocks:
- High locality
- Small capacity footprint
- Large fraction of requests
ASR – SPR
- Selective Probabilistic Replication
- Assumes private L2 caches; selectively limits replication on L1 evictions
- Uses probabilistic filtering to make local replication decisions
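Probabilistic filtering on L1 evictions can be sketched in a few lines. The level-to-probability table below is purely illustrative (the actual discrete levels are tuned in the paper); only the coin-flip structure is the point.

```python
# Sketch of SPR-style probabilistic filtering: on an L1 eviction,
# replicate the block into the local L2 with a probability chosen by
# the current replication level. No global coordination is needed.

import random

REPLICATION_PROB = [0.0, 1/64, 1/16, 1/4, 1.0]  # hypothetical levels 0..4

def should_replicate(level, rng=random):
    """Local, per-eviction replication decision."""
    return rng.random() < REPLICATION_PROB[level]
```

Because each decision is an independent local coin flip, raising or lowering the level smoothly scales the fraction of evicted blocks that become replicas.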
ASR – Balancing Replication
ASR – Replication Control
Replication levels:
- C: current
- H: next higher
- L: next lower
Per-level cycle estimates:
- H: hit cycles-per-instruction
- M: miss cycles-per-instruction
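The slide's symbols suggest a cost/benefit comparison across the three candidate levels. The following is my hedged reconstruction, not the paper's exact hardware: pick the direction whose estimated total of hit plus miss cycles-per-instruction is lowest.

```python
# Illustrative cost/benefit rule: estimate hit CPI and miss CPI at the
# lower (L), current (C), and higher (H) replication levels, and move
# toward the level with the smallest combined cycles-per-instruction.

def predict_direction(hit_cpi, miss_cpi):
    """hit_cpi, miss_cpi: dicts keyed by 'L', 'C', 'H' giving estimated
    hit and miss cycles-per-instruction at each replication level.
    Returns 'up', 'down', or 'stay'."""
    def total(level):
        return hit_cpi[level] + miss_cpi[level]
    if total("H") < total("C") and total("H") <= total("L"):
        return "up"      # hit-latency savings outweigh added misses
    if total("L") < total("C"):
        return "down"    # miss savings outweigh the hit-latency loss
    return "stay"
```

More replication typically lowers hit CPI (more local hits) but raises miss CPI (less effective capacity); the rule moves only when one effect clearly dominates.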
ASR – Replication Control
ASR – Replication Control (cont.)
- Wait until enough events have accumulated to ensure a fair cost/benefit comparison
- Wait until four consecutive evaluation intervals predict the same change before changing the replication level
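The four-consecutive-intervals rule is a hysteresis filter, sketched below under my own naming (`ReplicationController`, `end_interval`); the paper's hardware counters are abstracted into a streak count.

```python
# Sketch of ASR's hysteresis: only change the replication level after
# four consecutive evaluation intervals predict the same direction,
# which filters out noisy, short-lived cost/benefit swings.

class ReplicationController:
    REQUIRED_AGREEMENT = 4

    def __init__(self, level=2, n_levels=5):
        self.level = level
        self.n_levels = n_levels
        self.pending = None   # direction under consideration: 'up'/'down'
        self.streak = 0       # consecutive intervals agreeing on it

    def end_interval(self, prediction):
        """prediction: 'up', 'down', or 'stay' from this interval's
        cost/benefit comparison. Returns the (possibly new) level."""
        if prediction == "stay" or prediction != self.pending:
            self.pending = None if prediction == "stay" else prediction
            self.streak = 0 if self.pending is None else 1
            return self.level
        self.streak += 1
        if self.streak >= self.REQUIRED_AGREEMENT:
            step = 1 if self.pending == "up" else -1
            self.level = max(0, min(self.n_levels - 1, self.level + step))
            self.pending, self.streak = None, 0
        return self.level
```

A single contrary interval resets the streak, so the level only moves on sustained evidence.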
ASR – Designs Supported by SPR
- SPR-VR
  - Adds 1 bit per L2 cache block to identify replicas
  - Disallows replication when the local cache set is filled with owner blocks that have identified sharers
- SPR-NR
  - Stores a 1-bit counter per remote processor for each L2 block
  - Removes the shared-bus overhead (how?)
- SPR-CC
  - Models the centralized tag structure using an idealized distributed tag structure
ASR – Methodology
- Two CMP configurations: Current and Future
- 8 processors
- Writeback, write-allocate caches
- Both commercial and scientific workloads
- Throughput used as the metric
ASR – Memory Cycles
ASR - Speedup
Conclusion
- Hybrid is better than purely private or purely shared
- Dynamic is better than static
- Both require careful tradeoffs
- How does it scale?