Optimizing Replication, Communication, and Capacity Allocation in CMPs
Z. Chishti, M. D. Powell, and T. N. Vijaykumar
Presented by: Siddhesh Mhambrey
Published in Proceedings of the 32nd International Symposium on Computer Architecture, June 2005.

Motivation
- Emerging trend toward CMPs
- New challenges for cache design policies
  - Increased capacity pressure on on-chip memory: multiple cores need a large on-chip capacity
  - Increased cache latencies in large caches: wire delays
Need for a cache design that tackles these challenges

Cache Organization
- Goals
  - Utilize capacity effectively: reduce capacity misses
  - Mitigate increased latencies: keep wire delays small
- Shared: high capacity but increased latency
- Private: low latency but limited capacity
Neither private nor shared caches achieve both goals

Latency-Capacity Tradeoff
- SMPs and DSMs have the same goals in terms of cache design
- Capacity
  - CMPs have limited on-chip memories
  - SMPs have large off-chip memories
- Latency of accesses
  - SMPs have slow off-chip access
  - CMPs have fast on-chip access
CMPs change the latency-capacity tradeoff in two ways

Novel Mechanisms
Three novel mechanisms to exploit the changes in the latency-capacity tradeoff
- Controlled Replication: avoid copies for some read-only shared data
- In-Situ Communication: use fast on-chip communication to avoid coherence misses on read-write shared data
- Capacity Stealing: allow a core to steal another core's unused capacity
- Hybrid cache: private tag array and shared data array
  - CMP-NuRAPID (Non-Uniform access with Replacement and Placement using Distance associativity)
- Performance: CMP-NuRAPID improves performance by 13% over a shared cache and 8% over a private cache for three commercial multithreaded workloads

CMP-NuRAPID
- Non-uniform access and distance associativity
  - Caches divided into d-groups
  - D-group preference
[Figure: 4-core CMP with CMP-NuRAPID]

CMP-NuRAPID Organization
[Figure: CMP-NuRAPID tag and data arrays. Labels: Data Array, Tag Arrays]

CMP-NuRAPID Organization
- Private tag array
- Shared data array
- Leverages forward and reverse pointers
  - Single copy of a block shared by multiple tags
  - Data for one core in different d-groups
Extra level of indirection enables the novel mechanisms
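The pointer-based organization can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the class names, the `(d-group, frame)` encoding, and the use of a set of sharers as the reverse-pointer representation are hypothetical and not taken from the paper; the slides only state that private tags hold forward pointers into a shared data array that holds reverse pointers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical encoding of a data-array location: (d-group index, frame index).
Frame = Tuple[int, int]


@dataclass
class DataFrame:
    block_addr: int
    # Reverse-pointer information. Assumption: modelled here as the set of
    # cores whose tag arrays point at this frame.
    sharers: Set[int] = field(default_factory=set)


class HybridCacheSketch:
    """Private tag array per core, one shared data array (toy model)."""

    def __init__(self, num_cores: int) -> None:
        # tags[core] maps a block address to a forward pointer (a Frame).
        self.tags: List[Dict[int, Frame]] = [dict() for _ in range(num_cores)]
        # The shared data array, indexed by Frame.
        self.data: Dict[Frame, DataFrame] = {}

    def place(self, core: int, addr: int, frame: Frame) -> None:
        """Store a block in a frame and point the core's tag at it."""
        self.data[frame] = DataFrame(addr, {core})
        self.tags[core][addr] = frame

    def share(self, core: int, addr: int, owner: int) -> None:
        """Point a second core's tag at the owner's existing single copy."""
        frame = self.tags[owner][addr]
        self.tags[core][addr] = frame
        self.data[frame].sharers.add(core)

    def lookup(self, core: int, addr: int) -> Optional[Frame]:
        """Follow the forward pointer; None means a tag miss for this core."""
        return self.tags[core].get(addr)
```

In this sketch, two tags can name the same frame, which is the extra level of indirection that the replication and stealing mechanisms rely on.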

Mechanisms
- Controlled Replication
- In-Situ Communication
- Capacity Stealing

Controlled Replication
- On a read miss to a block that is already on chip: update the reader's tag pointer to point to the existing copy instead of making a new one
- On a subsequent read: make a data copy in the reader's closest d-group to avoid slow accesses in the future
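The two-step policy above might be sketched as follows. This is a toy model, not the authors' implementation: the class and method names are hypothetical, each core's closest d-group is assumed to have the same index as the core, and capacity limits and writes are ignored.

```python
class ControlledReplicationSketch:
    """Toy model: first shared read borrows a pointer, second read copies."""

    def __init__(self, num_cores: int) -> None:
        # tags[core] maps a block address to the d-group holding the data
        # that this core's tag points at.
        self.tags = [dict() for _ in range(num_cores)]
        # Addresses for which a core only points at another core's copy.
        self.borrowed = [set() for _ in range(num_cores)]

    def read(self, core: int, addr: int) -> str:
        tag = self.tags[core]
        if addr in tag:
            if addr in self.borrowed[core]:
                # Subsequent read: replicate into the closest d-group
                # (assumed to be the d-group with the core's own index).
                tag[addr] = core
                self.borrowed[core].discard(addr)
                return "replicate"
            return "hit"
        for other, other_tag in enumerate(self.tags):
            if other != core and addr in other_tag:
                # Read miss, block already on chip: pointer only, no copy.
                tag[addr] = other_tag[addr]
                self.borrowed[core].add(addr)
                return "pointer"
        # Block not on chip: fetch it and place it in the closest d-group.
        tag[addr] = core
        return "fill"
```

A block that is read only once by a second core never consumes a second frame, which is the capacity saving the slide refers to.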

In-Situ Communication
- Enforce a single copy of each read-write shared block in L2 and keep the block in the communication (C) state
- Replace the M to S transition with an M to C transition
Fast communication with capacity savings
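The state change can be written as a small transition function. Only the M to C replacement comes from the slide; the handling of the other states follows ordinary MESI behaviour on a remote read and is an assumption of this sketch, as are the names `State` and `on_remote_read`.

```python
from enum import Enum


class State(Enum):
    M = "modified"
    E = "exclusive"
    S = "shared"
    I = "invalid"
    C = "communication"  # the extra state introduced for in-situ communication


def on_remote_read(state: State, in_situ: bool) -> State:
    """Next state of a block when another core reads it."""
    if state is State.M:
        # Baseline MESI downgrades M to S; with in-situ communication the
        # block moves to C instead and a single L2 copy is kept.
        return State.C if in_situ else State.S
    if state is State.E:
        # Assumption: standard MESI downgrade, unchanged by the mechanism.
        return State.S
    # S, C, and I are left as they are in this simplified model.
    return state
```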

Capacity Stealing
- Demotion: demote less frequently used data to unused frames in d-groups closer to cores with lower capacity demands
- Promotion: if a tag hit occurs on a block in a farther d-group, promote it
[Figure: data for one core in different d-groups; use of unused capacity in a neighboring core]
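Promotion and demotion can be illustrated with a toy model. Everything concrete here is an assumption made for the example: the per-core preference lists, the least-recently-used victim choice, and the rule of demoting into the next preferred d-group that has a free frame. Sharing between cores is ignored.

```python
class CapacityStealingSketch:
    """Toy model of promotion and demotion across d-groups."""

    def __init__(self, frames_per_dgroup: int, preference: dict) -> None:
        # preference[core] lists d-group ids, closest first (hypothetical).
        self.capacity = frames_per_dgroup
        self.preference = preference
        any_core = next(iter(preference))
        # Each d-group holds block addresses, least recently used first.
        self.dgroups = {g: [] for g in preference[any_core]}

    def where(self, addr: int):
        """Return the d-group currently holding addr, or None."""
        for g, blocks in self.dgroups.items():
            if addr in blocks:
                return g
        return None

    def _demote(self, core: int, victim: int) -> None:
        # Move the victim to the next preferred d-group with an unused frame.
        for g in self.preference[core][1:]:
            if len(self.dgroups[g]) < self.capacity:
                self.dgroups[g].append(victim)
                return
        # No unused frame anywhere: the victim leaves the cache.

    def _insert_closest(self, core: int, addr: int) -> None:
        blocks = self.dgroups[self.preference[core][0]]
        if len(blocks) >= self.capacity:
            self._demote(core, blocks.pop(0))
        blocks.append(addr)

    def access(self, core: int, addr: int) -> str:
        g = self.where(addr)
        if g == self.preference[core][0]:
            self.dgroups[g].remove(addr)
            self.dgroups[g].append(addr)  # refresh recency
            return "hit-closest"
        if g is not None:
            # Tag hit on a block in a farther d-group: promote it.
            self.dgroups[g].remove(addr)
            self._insert_closest(core, addr)
            return "promote"
        self._insert_closest(core, addr)
        return "fill"
```

With two frames per d-group and core 0 preferring d-group 0, a third fill by core 0 spills its oldest block into d-group 1, and a later hit on that block brings it back while another block takes its place.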

Methodology
- Full-system simulation of a 4-core CMP using Simics
- CMP-NuRAPID: 8 MB, 8-way
  - 4 d-groups, 1 port for each tag array and data d-group
- Compared to
  - Private: 2 MB, 8-way, 1 port per core
  - CMP-SNUCA: shared with non-uniform access, no replication

Results
[Figures: Multi-Threaded Workloads; Multi-programmed Workloads]

Summary

Conclusions
- CMPs change the latency-capacity tradeoff
- Controlled Replication, In-Situ Communication, and Capacity Stealing are novel mechanisms to exploit the change in the latency-capacity tradeoff
- CMP-NuRAPID is a hybrid cache that incorporates the novel mechanisms
- For commercial multi-threaded workloads: 13% better than shared, 8% better than private
- For multi-programmed workloads: 28% better than shared, 8% better than private

Thank you
Questions?