Bank-aware Dynamic Cache Partitioning for Multicore Architectures


Bank-aware Dynamic Cache Partitioning for Multicore Architectures
Dimitris Kaseridis¹, Jeff Stuecheli¹,² and Lizy K. John¹
¹University of Texas – Austin  ²IBM – Austin
Laboratory for Computer Architecture, 9/23/2009

Outline
- Motivation/background
- Cache partitioning/profiling
- Proposed system
- Results
- Conclusion/future work

Motivation
- Shared resources in CMPs: the last-level cache and memory bandwidth
- Opportunity and pitfalls:
  - Constructive: mixing low and high cache requirements in a shared pool
  - Destructive: thrashing workloads (e.g., SPEC CPU 2000 art + mcf)
- Cache partitioning is therefore required
- The primary opportunity requires heterogeneous workload mixes, which are typical in consolidation and virtualization

Monolithic vs. NUCA vs. industry architectures
- Monolithic: one large shared uniform-latency cache bank on a CMP
  - Does not exploit physical locality for private data
  - Slow for all cores
- CMP-NUCA: the typical proposal has a very large number of autonomous cache banks
  - Very flexible (e.g., 256 banks)
  - Non-optimal configuration: inefficient bank size (per-bank overhead)
- Real implementations: fewer banks in industry, NUCA with discrete cache levels
  - The key difference is the wire assumptions made in the original NUCA analysis
[Slide diagram: core/cache bank layouts of IBM POWER7 and Intel Nehalem EX]

Baseline system
- 8 cores, 16 MB total capacity
- 16 x 1 MB banks, 8-way set associative
- Local banks: tight latency to the closest core
- Center banks: shared capacity

Cache Partitioning/Profiling

Cache sharing/partitioning
- The last-level cache of a CMP: a once-isolated resource is now shared, which drove the need for isolation
- Design space:
  - Non-configurable: shared vs. private caches
  - Static partitioning/policy: a long-term policy choice
  - Dynamic: real-time, profiling-directed partitions
    - Trial and error: experiment to find the ideal configuration
    - Predictive profilers: non-invasive state-space exploration (our system)

Bank-aware cache partitions: system components
- Non-invasive profiling using the MSA (Mattson Stack Algorithm)
- Cache allocation using marginal utility
- Bank-aware LLC partitions

MSA-based cache profiling
- The Mattson stack algorithm was originally proposed to simulate many cache sizes concurrently
- The profiling structure is a true LRU cache
- The stack distance from the MRU position is recorded for each reference
- Misses can then be calculated for any fraction of the ways (see the sketch below)
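As a rough software sketch of this profiling idea (illustrative only, not the paper's hardware structure; all names are assumed), a single-set true-LRU stack profiler might look like:

```python
class MSAProfiler:
    """Mattson stack-distance profiler for one cache set (true LRU).
    One pass over a reference stream yields the miss count for every
    candidate associativity at once."""

    def __init__(self, max_ways):
        self.max_ways = max_ways
        self.stack = []              # LRU stack, MRU entry at index 0
        self.hits = [0] * max_ways   # hit counter per stack depth
        self.misses = 0              # cold misses / distance >= max_ways

    def access(self, tag):
        if tag in self.stack:
            depth = self.stack.index(tag)   # stack distance from MRU
            self.hits[depth] += 1
            self.stack.pop(depth)
        else:
            self.misses += 1
            if len(self.stack) == self.max_ways:
                self.stack.pop()            # drop the true-LRU entry
        self.stack.insert(0, tag)           # reference becomes new MRU

    def misses_for(self, ways):
        # With only `ways` ways, every hit deeper than `ways` is a miss.
        return self.misses + sum(self.hits[ways:])
```

The key property is that a hit at stack depth d is also a hit for every associativity greater than d, so the per-depth hit counters directly give the miss curve across all candidate allocations.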

Hardware MSA implementation
- The naïve algorithm is prohibitive: a fully associative, complete cache directory of the maximum cache size for every core on the CMP
- Reductions: set sampling, partial tags, and a maximal-capacity limit (a sampling sketch follows)
- Configuration in the paper: 12-bit partial tags, 1/32 set sampling, up to 9/16 of the banks per core
- Total overhead: 0.4% of the on-chip cache
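A minimal sketch of the two main reductions, reusing the `MSAProfiler` above; `SAMPLE_RATIO` and `PARTIAL_TAG_BITS` mirror the 1/32 and 12-bit figures on the slide, but the code itself is an assumption:

```python
SAMPLE_RATIO = 32       # profile 1 of every 32 sets (set sampling)
PARTIAL_TAG_BITS = 12   # keep only 12 tag bits (partial tags)

def profile_access(profilers, set_index, tag):
    """Feed only sampled sets to the MSA profilers, with truncated tags.
    `profilers` maps sampled set indices to MSAProfiler instances."""
    if set_index % SAMPLE_RATIO != 0:
        return                                   # unsampled set: no cost
    partial_tag = tag & ((1 << PARTIAL_TAG_BITS) - 1)
    profilers[set_index].access(partial_tag)
```

Partial tags admit occasional false matches, trading a small amount of accuracy for the roughly 0.4% area overhead the slide reports.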

Marginal utility
- Miss rate as a function of capacity is non-linear and heavily workload-dependent
- Miss rate drops dramatically once a workload's data structures become cache-contained
- In practice: iteratively assign cache to the core that produces the most additional hits per unit of capacity (a greedy sketch follows)
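A greedy sketch of that loop, assuming each core exposes an MSA-style miss curve via `misses_for` (illustrative; the paper's hardware algorithm may differ in detail):

```python
def allocate_by_marginal_utility(profilers, total_ways):
    """Repeatedly grant one way of capacity to the core whose miss
    count drops the most for that extra way."""
    alloc = [0] * len(profilers)
    for _ in range(total_ways):
        # Marginal utility of one more way = misses(w) - misses(w + 1).
        gains = [p.misses_for(w) - p.misses_for(w + 1)
                 for p, w in zip(profilers, alloc)]
        alloc[gains.index(max(gains))] += 1
    return alloc
```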

Bank-aware LLC partitions
- (a) Ideal MSA model
- (b) Banked true LRU: cascaded banks, power-inefficient
- (c) Realistic banking: allocation policy (hash or random allocation), bank granularity, uniform requirement

Bank-aware allocation heuristics
- General idea: as capacity grows, coarser assignment is good enough
- Only portions of local cache banks are shared, and only between neighbors
- Central banks are assigned to a specific core
- Any core that receives central banks is also assigned its full local capacity

Cache allocation flowchart
- Assign full cache banks first (steps 1-3); all cores that get multiple banks are then complete
- Partition the remaining local banks (steps 4-7); fine-tune the assignment using sharing pairs (a toy sketch follows)
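A toy two-phase version of this flow (a deliberate simplification; the pairing rule below is assumed, not the paper's exact steps):

```python
BANK_WAYS = 8  # ways per 1 MB bank in the baseline system

def bank_aware_partition(way_demand):
    """Round per-core way demands down to whole banks, then pair up
    cores whose leftover demands fit together in one shared local bank."""
    full_banks = [ways // BANK_WAYS for ways in way_demand]
    leftover = [ways % BANK_WAYS for ways in way_demand]
    # Greedily pair the smallest leftover with the largest one that fits.
    order = sorted(range(len(way_demand)), key=lambda c: leftover[c])
    pairs = []
    while len(order) >= 2 and leftover[order[0]] + leftover[order[-1]] <= BANK_WAYS:
        pairs.append((order.pop(0), order.pop(-1)))
    return full_banks, pairs
```

The point of the rounding is the slide's "coarser is good enough" observation: once a core's allocation spans multiple banks, bank-granularity assignment loses little, so only the fractional leftovers need fine-grained sharing.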

Evaluation

Methodology
- Workloads: 8 cores running mixes drawn from the 26 SPEC CPU 2000 benchmarks
- Which benchmark mixes? The typical approach classifies benchmarks with a few limited experiments; we wanted to cover a larger state space
- Monte Carlo: compare the bank-aware miss rate to the ideal assignment, showing the algorithm works for many cases
- Detailed simulation: cycle-accurate, full-system (Simics + GEMS, extended with coarse banking and cache partitioning)

Monte Carlo
- How close is the bank-aware assignment to the ideal monolithic one?
- The graphic shows the miss-rate reduction for 1000 random SPEC CPU 2000 benchmark mixes
- 97% correlation between the two schemes' miss rates (a harness sketch follows)
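In outline, the experiment can be sketched as below (illustrative; the two miss-rate evaluators are caller-supplied placeholders, not functions from the paper):

```python
import random
from statistics import correlation  # Python 3.10+

def monte_carlo_compare(benchmarks, eval_ideal, eval_bank_aware,
                        trials=1000, cores=8):
    """Draw random 8-benchmark mixes and correlate the miss rates
    the two partitioning schemes achieve on each mix."""
    ideal, banked = [], []
    for _ in range(trials):
        mix = random.choices(benchmarks, k=cores)  # sample with repetition
        ideal.append(eval_ideal(mix))
        banked.append(eval_bank_aware(mix))
    return correlation(ideal, banked)
```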

Workload sets for detailed simulation
[Table of benchmark mixes not reproduced in the transcript]

Cycle-accurate simulation
- Overall miss ratio: 70% reduction over a shared cache, 25% over an equal partitioning
- Throughput: 43% increase over shared, 11% over equal

Conclusion/future work
- Significant miss-rate reduction and throughput improvement are possible; partitions are very important
- Marginal utility can work with realistically banked CMP caches
- Heterogeneous benchmark suites are needed: we can't evaluate all combinations, and hand-chosen combinations are hard to compare across proposals

Thank you! Questions?
Laboratory for Computer Architecture, University of Texas Austin & IBM Austin
9/23/2009