Utility-Based Cache Partitioning
Moinuddin K. Qureshi and Yale N. Patt, Univ. of Texas at Austin, MICRO 2006
Presented December 5, 2007 by PAK, EUNJI

Presentation transcript:

Outline:
- Introduction and Motivation
- Utility-Based Cache Partitioning
- Evaluation
- Scalable Partitioning Algorithm
- Related Work and Summary

- CMPs and shared caches are common
- Applications compete for the shared cache
- Partitioning policies are critical for high performance
- Traditional policies:
  - Equal (half-and-half): performance isolation, but no adaptation
  - LRU: demand-based, but demand ≠ benefit (e.g., a streaming application demands much cache yet benefits little)

Utility of cache ways: U(a, b) = misses with a ways − misses with b ways.
(Figure: misses vs. allocated ways for three application types: low utility, high utility, and saturating utility.)

Goal: improve performance by giving more cache to the application that benefits more from it.

Three components:
- Utility Monitors (UMON), one per core
- Partitioning Algorithm (PA)
- Replacement support to enforce partitions
(Figure: two cores, each with private I$ and D$ and a UMON, sharing an L2 cache backed by main memory; the UMONs feed the PA.)

- For each core, simulate the LRU policy using an auxiliary tag directory (ATD)
- Hit counters in the ATD count hits per recency position
- LRU is a stack algorithm, so the hit counts give utility directly, e.g. hits(2 ways) = H0 + H1 (sketched below)
(Figure: main tag directory (MTD) alongside the ATD, with hit counters H0 (MRU) through H15 (LRU).)
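To make the stack property concrete, here is a minimal Python sketch of a per-core UMON (my own names and structure, not the paper's hardware): an ATD managed with true LRU records the recency position of each hit, and the hits for any allocation of a ways can then be read off as H0 + … + H(a−1).

```python
class ATDSet:
    """One set of the auxiliary tag directory, managed with true LRU."""
    def __init__(self, num_ways):
        self.num_ways = num_ways
        self.stack = []  # index 0 = MRU, last index = LRU

    def access(self, tag):
        """Return the recency position on a hit (0..num_ways-1), None on a miss."""
        if tag in self.stack:
            pos = self.stack.index(tag)
            self.stack.remove(tag)
            self.stack.insert(0, tag)      # promote to MRU
            return pos
        self.stack.insert(0, tag)          # allocate on miss
        if len(self.stack) > self.num_ways:
            self.stack.pop()               # evict the LRU entry
        return None

class UMON:
    """Per-core utility monitor: hit counters H0..H{W-1}, one per recency position."""
    def __init__(self, num_sets, num_ways):
        self.sets = [ATDSet(num_ways) for _ in range(num_sets)]
        self.hit_ctrs = [0] * num_ways     # H0 (MRU) .. H{W-1} (LRU)

    def access(self, set_idx, tag):
        pos = self.sets[set_idx].access(tag)
        if pos is not None:
            self.hit_ctrs[pos] += 1

    def hits_with(self, a):
        """Stack property of LRU: hits with a ways = H0 + ... + H(a-1)."""
        return sum(self.hit_ctrs[:a])
```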

- Extra tags incur hardware and power overhead
- Dynamic Set Sampling (DSS) reduces the overhead [Qureshi, ISCA'06]
- 32 sampled sets are sufficient (analytical bounds)
- Storage < 2 kB per UMON
(Figure: MTD with all sets versus a DSS-based UMON that keeps ATD entries for only a sample of the sets.)

Us = sampled mean (number of ways allocated using DSS)
Ug = global mean (number of ways allocated using all sets)
P = probability that Us is within 1 way of Ug
By Chebyshev's inequality, P ≥ 1 − variance/n, where n is the number of sampled sets.
In general variance ≤ 3, so n = 32 sampled sets gives P ≥ 1 − 3/32 ≈ 0.91.

- Evaluate all possible partitions and select the best
- With a ways to core1 and (16 − a) ways to core2:
  Hits(core1) = H0 + H1 + … + H(a−1)  [from UMON1]
  Hits(core2) = H0 + H1 + … + H(16−a−1)  [from UMON2]
- Select the a that maximizes Hits(core1) + Hits(core2), as in the sketch below
- Partitioning is done once every 5 million cycles
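A minimal sketch of that search for two cores, reusing the UMON class above (best_partition and min_ways are my illustrative names; the slide does not specify a minimum allocation):

```python
def best_partition(umon1, umon2, total_ways=16, min_ways=1):
    """Try every split of the ways and keep the one maximizing total hits."""
    best_a, best_hits = min_ways, -1
    for a in range(min_ways, total_ways - min_ways + 1):
        hits = umon1.hits_with(a) + umon2.hits_with(total_ways - a)
        if hits > best_hits:
            best_a, best_hits = a, hits
    return best_a, total_ways - best_a     # ways for core1, ways for core2
```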

- Way-partitioning support [Suh+ HPCA'02, Iyer ICS'04]
- Each line has core-id bits
- On a miss, count the ways occupied in the set by the miss-causing application:
  - if ways_occupied < ways_given, the victim is the LRU line of another application
  - otherwise, the victim is the LRU line of the miss-causing application
(A sketch of this rule follows.)
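A hedged Python sketch of the victim-selection rule above (Line, choose_victim, and ways_given are my names, not the cited papers' interfaces):

```python
from dataclasses import dataclass

@dataclass
class Line:
    core_id: int   # owner of this cache line
    tag: int

def choose_victim(set_lines, requester, ways_given):
    """set_lines is one cache set ordered MRU -> LRU; returns the line to evict."""
    ways_occupied = sum(1 for l in set_lines if l.core_id == requester)
    if ways_occupied < ways_given[requester]:
        # Under quota: evict the LRU line belonging to any other core.
        candidates = [l for l in set_lines if l.core_id != requester]
    else:
        # At or over quota: evict the requester's own LRU line.
        candidates = [l for l in set_lines if l.core_id == requester]
    return candidates[-1]                  # last element is least recently used
```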

- Configuration:
  - Two cores: 8-wide, 128-entry instruction window
  - Private L1s
  - L2: shared, unified, 1 MB, 16-way, LRU-based
  - Memory: 400 cycles, 32 banks
- Benchmarks:
  - Two-threaded workloads divided into 5 categories
  - 20 workloads used (four from each category)
(Figure: weighted speedup of the baseline for the workloads.)

- Weighted speedup (default metric): perf = IPC1/SingleIPC1 + IPC2/SingleIPC2; correlates with reduction in execution time
- Throughput: perf = IPC1 + IPC2; can be unfair to a low-IPC application
- Hmean-fairness: perf = hmean(IPC1/SingleIPC1, IPC2/SingleIPC2); balances fairness and performance
(The sketch below writes these out as code.)
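Written out as code (function and argument names are mine), the three metrics are:

```python
from statistics import harmonic_mean

def weighted_speedup(ipc, single_ipc):
    """Sum of per-application speedups relative to running alone."""
    return sum(i / s for i, s in zip(ipc, single_ipc))

def throughput(ipc):
    """Raw instruction throughput, regardless of fairness."""
    return sum(ipc)

def hmean_fairness(ipc, single_ipc):
    """Harmonic mean of relative speedups; penalizes imbalance."""
    return harmonic_mean([i / s for i, s in zip(ipc, single_ipc)])
```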

UCP improves average weighted speedup by 11%

UCP improves average throughput by 17%

UCP improves average hmean-fairness by 11%

Dynamic Set Sampling (DSS) reduces the overhead, not the benefit.
(Figure: performance with 8, 16, and 32 sampled sets versus all sets.)

- Time complexity of partitioning is low for two cores (number of possible partitions ≈ number of ways)
- Possible partitions increase exponentially with the number of cores
- For a 32-way cache:
  - 4 cores → 6545 possible partitions
  - 8 cores → 15.4 million possible partitions
- The problem is NP-hard → a scalable partitioning algorithm is needed
(The counts above are checked below.)
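The quoted counts follow from a stars-and-bars argument: the number of ways to split W indistinguishable cache ways among N cores is C(W + N − 1, N − 1). A two-line check (my verification, not from the slide):

```python
from math import comb

for n in (4, 8):
    print(n, "cores:", comb(32 + n - 1, n - 1))
# 4 cores: 6545
# 8 cores: 15380937  (about 15.4 million)
```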

- The greedy algorithm (GA) allocates one block to the application with the maximum utility for that one block, repeating until all blocks are allocated
- Partitioning is optimal when the utility curves are convex
- Pathological behavior for non-convex curves

In each iteration, the utility for one more block is U(A) = 10 misses and U(B) = 0 misses, so GA assigns every block to A, even though B reaches the same miss reduction with fewer blocks.
Problem: GA only considers the benefit of the immediate next block, so it fails to exploit large gains that lie further ahead.
(Figure: misses vs. blocks assigned for applications A and B.)

- Marginal utility (MU) = utility per unit of cache resource: MU(a, b) = U(a, b) / (b − a)
- GA considers the MU of one block; Lookahead (LA) considers the MU of all possible allocations
- Select the application with the maximum MU and allocate it as many blocks as needed to reach that maximum
- Repeat until all blocks are assigned (see the sketch below)
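Below is a compact Python sketch of Lookahead as just described, followed by the worked example from these slides; misses[i][w] (application i's miss count with w ways, read from its UMON) and all other names are my own choices:

```python
def lookahead_partition(misses, total_ways):
    """misses[i][w]: miss count of app i with w ways, for w = 0..total_ways."""
    num_apps = len(misses)
    alloc = [0] * num_apps
    remaining = total_ways
    while remaining > 0:
        best_app, best_mu, best_blocks = 0, -1.0, 1
        for i in range(num_apps):
            # Marginal utility over every feasible extra allocation, not just 1.
            for extra in range(1, remaining + 1):
                a, b = alloc[i], alloc[i] + extra
                mu = (misses[i][a] - misses[i][b]) / extra
                if mu > best_mu:
                    best_app, best_mu, best_blocks = i, mu, extra
        alloc[best_app] += best_blocks
        remaining -= best_blocks
    return alloc

# The worked example: A saves 10 misses per block; B saves 80 misses, but only
# once it has 3 blocks. Lookahead gives B its 3 blocks first (80/3 > 10/1),
# then the rest to A, the optimal split that the greedy algorithm misses.
misses_A = [60, 50, 40, 30, 20, 10, 0, 0, 0]        # 10 misses saved per block
misses_B = [80, 80, 80, 0, 0, 0, 0, 0, 0]           # all savings at block 3
print(lookahead_partition([misses_A, misses_B], 8)) # -> [5, 3]
```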

Time complexity ≈ ways²/2 (512 operations for 32 ways).
Iteration 1: MU(A) = 10/1 block, MU(B) = 80/3 blocks → B gets 3 blocks.
Next five iterations: MU(A) = 10/1 block, MU(B) = 0 → A gets 1 block each time.
Result: A gets 5 blocks and B gets 3 blocks (optimal).
(Figure: misses vs. blocks assigned for the worked example.)

Four cores sharing a 2 MB, 32-way L2; workload mixes:
- Mix1 (gap-applu-apsi-gzp)
- Mix2 (swm-glg-mesa-prl)
- Mix3 (mcf-applu-art-vrtx)
- Mix4 (mcf-art-eqk-wupw)
LA performs similarly to EvalAll, with low time complexity.
(Figure: performance of LRU, UCP(Greedy), UCP(Lookahead), and UCP(EvalAll) on each mix.)

- CMPs and shared caches are common
- Partition shared caches based on utility, not demand
- UMON estimates utility at runtime with low overhead
- UCP improves performance:
  - weighted speedup by 11%
  - throughput by 17%
  - hmean-fairness by 11%
- The Lookahead algorithm scales to many cores sharing a highly associative cache