Bypass and Insertion Algorithms for Exclusive Last-level Caches

Presentation transcript:

Bypass and Insertion Algorithms for Exclusive Last-level Caches
Jayesh Gaur¹, Mainak Chaudhuri², Sreenivas Subramoney¹
¹Intel Architecture Group, Intel Corporation, Bangalore, India
²Department of Computer Science and Engineering, Indian Institute of Technology Kanpur, India
Presented by Samira Khan, Intel Labs, Intel Corporation and University of Texas at San Antonio
International Symposium on Computer Architecture (ISCA), June 6th, 2011

Inclusive vs. Exclusive

Inclusive cache hierarchy:
- The last-level cache (LLC) is a superset of all caches
- A block in L1 is also present in L2 and the LLC

Exclusive cache hierarchy:
- A cache block is present in only one level
- A block in L1 is never present in L2 or the LLC

[Diagram: an inclusive hierarchy and an exclusive hierarchy side by side, each with L1, L2, and LLC levels]

Inclusive vs. Exclusive

- Inclusive last-level caches (LLCs) are the popular choice
- But inclusion wastes cache capacity
- Exclusive caches offer higher effective capacity and better performance

(Some of the material is taken from the original presentation.)

Exclusive Last-Level Cache

- The exclusive LLC (L3) serves as a victim cache for the L2 cache
- On a miss, data is filled into L2 (not into the LLC)
- On an L2 eviction, the data is filled into the LLC
- On an LLC hit, the cache line is invalidated in the LLC and moved to L2

[Diagram: core + 32 KB L1, 512 KB L2, 2 MB LLC, DRAM; loads that miss both L2 and the LLC fill L2 from DRAM, L2 evictions fill the LLC, and LLC hits are invalidated and moved back to L2]

This talk is about replacement and bypass policies for exclusive caches.
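
To make the data flow concrete, here is a minimal sketch of the exclusive fill/hit/evict protocol; it is illustrative only (the class and function names are hypothetical, and a real hierarchy is set-associative with proper victim selection rather than a flat dict with popitem()).

    def read_dram(addr):
        return 0  # stub standing in for a memory access

    class ExclusiveHierarchy:
        def __init__(self, l2_ways, llc_ways):
            self.l2, self.llc = {}, {}       # address -> data
            self.l2_ways, self.llc_ways = l2_ways, llc_ways

        def load(self, addr):
            if addr in self.l2:              # L2 hit: nothing moves
                return self.l2[addr]
            if addr in self.llc:             # LLC hit: invalidate in the LLC
                data = self.llc.pop(addr)    # and move the line up to L2
            else:                            # miss everywhere: DRAM fills L2
                data = read_dram(addr)       # directly, bypassing the LLC
            if len(self.l2) >= self.l2_ways:
                victim, vdata = self.l2.popitem()  # arbitrary L2 victim
                if len(self.llc) >= self.llc_ways:
                    self.llc.popitem()             # arbitrary LLC victim
                self.llc[victim] = vdata           # L2 eviction fills the LLC
            self.l2[addr] = data
            return data

Note that a block never resides in both L2 and the LLC at once, which is exactly the exclusion property.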

Replacement Policy in Exclusive LLC

- LRU is a popular replacement policy: it replaces the least recently used block
- LRU needs recency information to choose a victim: a line enters the set at the MRU position, moves toward LRU as other lines are touched, and is evicted from the LRU position after its last hit
- Exclusive caches have no recency information: an LLC hit invalidates the line and moves it to L2, so the LLC never observes reuse of a resident line

[Diagram: a cache set as a recency stack, showing a line's lifetime from fill through hits and a last hit to eviction]
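
For reference, a minimal sketch of LRU victim selection on one set (illustrative, not from the slides); it makes plain that the policy depends on observing hits to resident lines, which an exclusive LLC never sees.

    class LRUSet:
        def __init__(self, ways):
            self.ways = ways
            self.stack = []                 # recency stack, most recent first

        def access(self, addr):
            if addr in self.stack:          # hit: promote to MRU
                self.stack.remove(addr)
                self.stack.insert(0, addr)
                return True
            if len(self.stack) >= self.ways:
                self.stack.pop()            # victim = LRU, back of the stack
            self.stack.insert(0, addr)      # fill at the MRU position
            return False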

Replacement Policy in Exclusive LLC

- How do we choose a victim in an exclusive LLC? Choose the replacement victim with the help of information from the higher-level caches
- Can we bypass lines in the LLC? Do not place lines in the exclusive LLC that are never re-referenced before eviction

Outline
- Motivation
- Problem Description
- Characterizing Dead and Live Lines
- Basic Algorithm
- Results
- Conclusion

Characterizing Dead and Live Lines

- Dead allocation to the LLC: a cache line is filled into the LLC but evicted before being recalled by L2
- Live allocation to the LLC: a cache line is filled into the LLC and sees a hit in the LLC
- Trip count (TC): the number of trips a cache line makes between the LLC and the L2 cache before eviction

[Diagram: a line filled from DRAM into L2 has TC = 0; after an eviction from L2 into the LLC and a recall back into L2, it has TC = 1]

TC captures the reuse distance between two clustered uses of a cache line.

Oracle Analysis: Trip Count

- Only a 1-bit TC is required for most applications: either TC = 0 or TC >= 1
- Can we use the liveness information from TC to design insertion/bypass policies?

Use Count in L2

- Use count (UC): the number of times a cache line is hit in the L2 cache by demand requests
- For cache lines brought in by demand requests, UC >= 1
- Only 2 bits are needed to learn UC

[Diagram: a line with TC = 0 collects X demand hits in L2 (UC = X); after a trip through the LLC and back into L2 it has TC = 1 and collects Y hits (UC = Y)]

Refer to the paper, which shows that the <TC,UC> pair best approximates Belady's victim selection.
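
A minimal sketch of how the <TC,UC> metadata might be maintained in L2 (illustrative; the field and function names are hypothetical, and treating the demand fill itself as the first use is our reading of the slides):

    class LineMeta:
        def __init__(self):
            self.tc = 0                    # 1-bit trip count, saturates at 1
            self.uc = 0                    # 2-bit use count, saturates at 3

    def on_fill_from_dram(meta):
        meta.tc = 0                        # no LLC -> L2 trips yet
        meta.uc = 1                        # the demand fill counts as a use

    def on_fill_from_llc(meta):
        meta.tc = 1                        # the line made a round trip
        meta.uc = 1

    def on_l2_demand_hit(meta):
        meta.uc = min(meta.uc + 1, 3)      # saturating 2-bit counter

    def on_l2_eviction(meta):
        return (meta.tc, meta.uc)          # <TC,UC> travels with the line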

TCxUC-based Algorithms

- Send the <TC,UC> information down with every L2 eviction
- Bin all L2 evictions into eight <TC,UC> bins
- Learn the dead and live distributions in these bins
- Identify the bins that have more dead blocks than live blocks
- Bypass blocks that belong to a bin with more dead blocks (see the sketch below)
- More details are in the paper
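
A minimal sketch of the binning and the bypass decision (illustrative only; the bin mapping, counter handling, and threshold are assumptions, not the authors' exact design):

    NUM_BINS = 8

    def bin_index(tc, uc):
        # Hypothetical mapping of <TC,UC> into eight bins:
        # TC in {0, 1} crossed with four UC ranges.
        return min(tc, 1) * 4 + min(uc - 1, 3)

    dead = [0] * NUM_BINS    # lines evicted from the LLC without a hit
    live = [0] * NUM_BINS    # lines that saw an LLC hit

    def record_llc_outcome(tc, uc, was_hit):
        """Update the learned distributions when a line leaves the LLC."""
        b = bin_index(tc, uc)
        if was_hit:
            live[b] += 1
        else:
            dead[b] += 1

    def should_bypass(tc, uc):
        """On an L2 eviction, skip LLC allocation if the bin is mostly dead."""
        return dead[bin_index(tc, uc)] > live[bin_index(tc, uc)]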

Experimental Methodology

- SPEC 2006 and SERVER workload categories
- 97 single-threaded (ST) traces
- 35 4-way multi-programmed (MP) workloads
- Cycle-accurate, execution-driven simulation based on the x86 ISA and a Core i7 model
- Three-level cache hierarchy:
  - 32 KB L1 caches
  - 512 KB 8-way L2 cache per core
  - 2 MB LLC for ST and 8 MB LLC for MP (16-way)

Policy Evaluation for ST Workloads

Overall, Bypass + TC_UC_AGE is the best policy.

Multi-programmed (MP) Workloads

Throughput $= \sum_i \mathrm{IPC}_i^{\mathrm{policy}} \big/ \sum_i \mathrm{IPC}_i^{\mathrm{base}}$

Fairness $= \min_i \left( \mathrm{IPC}_i^{\mathrm{policy}} / \mathrm{IPC}_i^{\mathrm{base}} \right)$

The geomean throughput gain for our best proposal is 2.5%.
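
A minimal sketch of how these two metrics are computed from per-core IPC values (illustrative; the variable names and the example numbers are hypothetical):

    def throughput(ipc_policy, ipc_base):
        """Sum of per-core IPCs under the policy, normalized by the baseline."""
        return sum(ipc_policy) / sum(ipc_base)

    def fairness(ipc_policy, ipc_base):
        """Worst per-core ratio of policy IPC to baseline IPC."""
        return min(p / b for p, b in zip(ipc_policy, ipc_base))

    # Example: a 4-way multi-programmed workload.
    base   = [1.20, 0.85, 0.60, 1.05]
    policy = [1.26, 0.88, 0.61, 1.07]
    print(throughput(policy, base))   # > 1.0 means a net throughput gain
    print(fairness(policy, base))     # the most-penalized core's ratio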

Conclusion

- For capacity and performance, an exclusive LLC makes more sense
- LRU and related inclusive-cache replacement schemes do not work for an exclusive LLC
- We presented several insertion/bypass schemes for exclusive caches, based on trip count and use count
- For ST workloads, we gain 4.3% higher average IPC
- For MP workloads, we gain 2.5% average throughput

Why is this paper important?

Thank you. Questions?

BACKUP

TC-based Insertion Age

TC-AGE policy (analogous to SRRIP, ISCA 2010):
- On an L2 fill, 1 bit per cache line records TC: the bit is set if the fill came from an LLC hit, and clear otherwise
- On an LLC fill, 2 bits per cache line record an age: a line with TC = 1 is inserted with age 3, a line with TC = 0 with age 1
- On an LLC eviction, maintain the relative age order and choose the line with the least age as the victim

DIP + TC-AGE policy (analogous to DRRIP, ISCA 2010):
- If TC = 1, fill the LLC with age = 3
- If TC = 0, duel between age = 0 and age = 1

TC lets us mimic inclusive replacement policies on an exclusive cache. However, TC alone is insufficient to enable bypass: all cache lines start at TC = 0.

This slide is kindly provided by the authors.
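
A minimal sketch of TC-AGE fill and victim selection on one LLC set (illustrative; the names are hypothetical, and the maintenance of relative age order is simplified to a static age per line):

    class TCAgeSet:
        def __init__(self, ways):
            self.ways = ways
            self.ages = {}                     # address -> 2-bit age

        def fill(self, addr, tc):
            if len(self.ages) >= self.ways:
                victim = min(self.ages, key=self.ages.get)  # least age
                del self.ages[victim]
            # Lines with TC = 1 demonstrated reuse through the LLC, so they
            # are inserted with the highest age (most protected); first-trip
            # lines (TC = 0) are inserted with age 1 and evicted sooner.
            self.ages[addr] = 3 if tc else 1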