Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture. Seongbeom Kim, Dhruba Chandra, and Yan Solihin, Dept. of Electrical and Computer Engineering, North Carolina State University.

Presentation transcript:

Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture
Seongbeom Kim, Dhruba Chandra, and Yan Solihin
Dept. of Electrical and Computer Engineering, North Carolina State University
{skim16, dchandr,
PACT 2004

Cache Sharing in CMP
(Diagram: two processor cores, each with a private L1 cache, share a single L2 cache backed by memory.)
When threads t1 (on core 1) and t2 (on core 2) share the L2, t2's throughput is significantly reduced due to unfair cache sharing.

Shared L2 cache space contention (figure).

Impact of Unfair Cache Sharing
(Diagram: thread time slices under uniprocessor scheduling vs. 2-core CMP scheduling, threads t1-t4 on processors P1 and P2.)
Problems of unfair cache sharing:
– Sub-optimal throughput
– Thread starvation
– Priority inversion
– Thread-mix dependent throughput
Fairness: uniform slowdown for co-scheduled threads.

Contributions
Cache fairness metrics:
– Easy to measure
– Approximate uniform slowdown well
Fair caching algorithms:
– Static/dynamic cache partitioning optimizing fairness
– Simple hardware modifications
Simulation results:
– Fairness: 4x improvement
– Throughput: 15% improvement, comparable to a cache-miss-minimization approach

Related Work
Cache miss minimization in CMP:
– G. Suh, S. Devadas, L. Rudolph, HPCA 2002
Balancing throughput and fairness in SMT:
– K. Luo, J. Gummaraju, M. Franklin, ISPASS 2001
– A. Snavely and D. Tullsen, ASPLOS 2000
– …

Outline
Fairness Metrics
Static Fair Caching Algorithms (see paper)
Dynamic Fair Caching Algorithms
Evaluation Environment
Evaluation
Conclusions

Fairness Metrics
Uniform slowdown: every co-scheduled thread should be slowed down by the same factor.
– T_alone_i: execution time of thread t_i when it runs alone.
– T_shared_i: execution time of thread t_i when it shares the cache with others.
We want to minimize the difference in slowdowns across threads; ideally, for every pair of threads i and j:
T_shared_i / T_alone_i = T_shared_j / T_alone_j
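To make the metric above concrete, here is a minimal sketch (not the authors' code; struct and function names are illustrative). It sums |X_i - X_j| over all thread pairs, where X_i is a per-thread slowdown proxy: the execution-time ratio for the ideal metric M0, and, assuming the definitions suggested by the dynamic-algorithm example later in the talk, the miss-count ratio for M1 and the miss-rate ratio for M3.

```cpp
// Minimal sketch of the pairwise fairness metrics (an illustration, not the paper's code).
// The metric is 0 when every thread is slowed down by the same factor.
#include <cmath>
#include <vector>

struct ThreadProfile {
  double timeAlone, timeShared;          // execution times
  double missesAlone, missesShared;      // L2 miss counts
  double missRateAlone, missRateShared;  // L2 miss rates
};

// Sum of pairwise differences of the chosen per-thread ratio X_i.
double pairwiseSpread(const std::vector<double>& x) {
  double m = 0.0;
  for (size_t i = 0; i < x.size(); ++i)
    for (size_t j = i + 1; j < x.size(); ++j)
      m += std::fabs(x[i] - x[j]);
  return m;
}

double metricM0(const std::vector<ThreadProfile>& t) {   // ideal: needs execution times
  std::vector<double> x;
  for (const auto& p : t) x.push_back(p.timeShared / p.timeAlone);
  return pairwiseSpread(x);
}

double metricM1(const std::vector<ThreadProfile>& t) {   // miss-count proxy (assumed form)
  std::vector<double> x;
  for (const auto& p : t) x.push_back(p.missesShared / p.missesAlone);
  return pairwiseSpread(x);
}

double metricM3(const std::vector<ThreadProfile>& t) {   // miss-rate proxy (assumed form)
  std::vector<double> x;
  for (const auto& p : t) x.push_back(p.missRateShared / p.missRateAlone);
  return pairwiseSpread(x);
}
```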

Outline
Fairness Metrics
Static Fair Caching Algorithms (see paper)
Dynamic Fair Caching Algorithms
Evaluation Environment
Evaluation
Conclusions

Partitionable Cache Hardware
Modified LRU cache replacement policy (G. Suh et al., HPCA 2002).
Example: current partition P1: 448B, P2: 576B; target partition P1: 384B, P2: 640B. On a P2 miss, the victim is chosen from P1, which holds more than its target, so the current partition converges to P1: 384B, P2: 640B, matching the target.
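The example above relies on a replacement policy that steers evictions so that per-thread occupancy drifts toward the target partition. Below is a minimal sketch of such a quota-aware LRU victim selection for one cache set; the data structures and the exact eligibility rules are my assumptions for illustration, not the paper's hardware description.

```cpp
// Sketch: on a miss by thread `req`, prefer to evict the LRU block of a thread that
// currently exceeds its target share; evict req's own LRU block only if req is already
// at or above its target. Occupancy thus converges toward the target partition.
#include <cstdint>
#include <vector>

struct Block {
  bool     valid   = false;
  int      owner   = -1;   // thread id that allocated the block
  uint64_t lruTime = 0;    // smaller = older
};

// `target[t]` is thread t's target number of ways in this set.
int pickVictim(const std::vector<Block>& set, const std::vector<int>& target, int req) {
  std::vector<int> occupancy(target.size(), 0);
  for (const auto& b : set)
    if (b.valid) ++occupancy[b.owner];

  int victim = -1;
  for (int w = 0; w < (int)set.size(); ++w) {
    const Block& b = set[w];
    if (!b.valid) return w;                      // free way: no eviction needed
    bool overQuota = (b.owner != req) && (occupancy[b.owner] > target[b.owner]);
    bool own       = (b.owner == req) && (occupancy[req] >= target[req]);
    if (overQuota || own) {                      // eligible candidates
      if (victim < 0 || b.lruTime < set[victim].lruTime) victim = w;
    }
  }
  if (victim < 0) {                              // fallback: plain LRU over the set
    victim = 0;
    for (int w = 1; w < (int)set.size(); ++w)
      if (set[w].lruTime < set[victim].lruTime) victim = w;
  }
  return victim;
}
```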

Dynamic Fair Caching Algorithm
Example: optimizing the M3 metric. For each thread the algorithm keeps MissRate_alone (from profiling), MissRate_shared (measured online), and a target partition, and repartitions once per repartitioning interval.

1st interval: MissRate_alone P1: 20%, P2: 5%. Measured MissRate_shared P1: 20%, P2: 15%. Target partition: P1: 256KB, P2: 256KB.

Repartition: evaluate M3 (MissRate_shared / MissRate_alone): P1: 20% / 20%, P2: 15% / 5%. P2 suffers the larger relative slowdown, so one 64KB granule (the partition granularity) is moved from P1 to P2: new target partition P1: 192KB, P2: 320KB.

2nd interval: measured MissRate_shared P1: 20%, P2: 10%. Target partition: P1: 192KB, P2: 320KB.

Repartition: evaluate M3: P1: 20% / 20%, P2: 10% / 5%. P2's ratio is still larger, so the target partition becomes P1: 128KB, P2: 384KB.

3rd interval: measured MissRate_shared P1: 25%, P2: 9%. Target partition: P1: 128KB, P2: 384KB.

Repartition with rollback: compute Δ = MR_old - MR_new for P2, the thread that received more cache. If Δ < T_rollback, the repartitioning did not help enough and is rolled back: the target partition returns to P1: 192KB, P2: 320KB.
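The walkthrough above can be summarized as a small control loop. The following is a reconstruction sketch under stated assumptions: the register names, the single-granule move per interval, and the interpretation of T_rollback as an absolute miss-rate difference are mine, not taken from the paper.

```cpp
// Sketch of the dynamic fair-caching loop for a 2-core CMP, optimizing the M3-style
// ratio X_i = MissRate_shared_i / MissRate_alone_i. Once per repartitioning interval,
// one granule is shifted toward the thread with the larger X_i; if the thread that
// received cache did not improve its miss rate by at least the rollback threshold,
// the previous target partition is restored.
#include <cstddef>

struct FairCacheState {
  double missRateAlone[2];                   // profiled with each thread running alone
  double prevMissRateShared[2] = {0, 0};     // measured in the previous interval
  size_t targetKB[2]     = {256, 256};       // current target partition
  size_t prevTargetKB[2] = {256, 256};       // saved so a repartitioning can be undone
  int    lastReceiver    = -1;               // thread given more cache last time, or -1
};

void repartition(FairCacheState& s, const double missRateShared[2],
                 size_t granularityKB = 64, double rollbackT = 0.05) {
  // 1. Rollback check: delta = MR_old - MR_new for the thread that received cache.
  if (s.lastReceiver >= 0) {
    double delta = s.prevMissRateShared[s.lastReceiver] - missRateShared[s.lastReceiver];
    if (delta < rollbackT) {
      s.targetKB[0] = s.prevTargetKB[0];     // the extra cache did not help enough:
      s.targetKB[1] = s.prevTargetKB[1];     // restore the previous target partition
      s.lastReceiver = -1;
      s.prevMissRateShared[0] = missRateShared[0];
      s.prevMissRateShared[1] = missRateShared[1];
      return;
    }
  }

  // 2. Evaluate per-thread ratios and move one granule toward the more penalized thread.
  double x0 = missRateShared[0] / s.missRateAlone[0];
  double x1 = missRateShared[1] / s.missRateAlone[1];
  s.prevTargetKB[0] = s.targetKB[0];
  s.prevTargetKB[1] = s.targetKB[1];
  int recv = -1;
  if (x0 > x1 && s.targetKB[1] > granularityKB)      recv = 0;
  else if (x1 > x0 && s.targetKB[0] > granularityKB) recv = 1;
  if (recv >= 0) {
    s.targetKB[recv]     += granularityKB;
    s.targetKB[1 - recv] -= granularityKB;
  }
  s.lastReceiver = recv;

  // 3. Remember this interval's miss rates for the next rollback check.
  s.prevMissRateShared[0] = missRateShared[0];
  s.prevMissRateShared[1] = missRateShared[1];
}
```

With the example's numbers expressed as fractions (0.20/0.15, then 0.20/0.10, then 0.25/0.09) and rollbackT = 0.05, this sketch follows the same sequence of target partitions as the walkthrough above: 256/256, 192/320, 128/384, then rollback to 192/320.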

Fair Caching Overhead
Partitionable cache hardware.
Profiling:
– Static profiling for M1, M3
– Dynamic profiling for M1, M3, M4
Storage:
– Per-thread registers holding the miss rate/count for the "alone" case and for the "shared" case
Repartitioning algorithm:
– Less than 100 cycles of overhead in a 2-core CMP
– Invoked at every repartitioning interval

Outline
Fairness Metrics
Static Fair Caching Algorithms (see paper)
Dynamic Fair Caching Algorithms
Evaluation Environment
Evaluation
Conclusions

Evaluation Environment
UIUC's SESC simulator (cycle accurate).
CMP cores: 2 cores, each 4-issue dynamic, 3.2GHz.
Memory hierarchy:
– L1 I/D (private): write-back, 32KB, 4-way, 64B line, 3-cycle round trip
– L2 unified (shared): write-back, 512KB, 8-way, 64B line, 14-cycle round trip
– L2 replacement: LRU or pseudo-LRU
– Round-trip memory latency: 407 cycles

Evaluation Environment
Algorithm parameters:
– Repartitioning granularity: 64KB
– Repartitioning interval: 10K, 20K, 40K, 80K L2 accesses
– T_rollback: 0%, 5%, 10%, 15%, 20%, 25%, 30%
Workloads: 18 benchmark pairs.
Static algorithm: FairM1. Dynamic algorithms: FairM1Dyn, FairM3Dyn, FairM4Dyn.

Outline
Fairness Metrics
Static Fair Caching Algorithms (see paper)
Dynamic Fair Caching Algorithms
Evaluation Environment
Evaluation
– Correlation results
– Static fair caching results
– Dynamic fair caching results
– Impact of rollback threshold
– Impact of time interval
Conclusions

Correlation Results
(Figure.) M1 and M3 show the best correlation with M0.

Static Fair Caching Results
(Figure.)
– FairM1 achieves throughput comparable to MinMiss, with better fairness.
– Opt confirms that better fairness can be achieved without throughput loss.

Dynamic Fair Caching Results
(Figure.)
– FairM1Dyn and FairM3Dyn show the best fairness and throughput.
– Improvements in fairness translate into throughput gains.
– Fair caching sometimes degrades throughput (2 out of 18 benchmark pairs).

Impact of Rollback Threshold in FairM1Dyn
(Figure.) A T_rollback of 20% shows the best fairness and throughput.

Impact of Repartitioning Interval in FairM1Dyn
(Figure.) An interval of 10K L2 accesses shows the best fairness and throughput.

Outline
Fairness Metrics
Static Fair Caching Algorithms (see paper)
Dynamic Fair Caching Algorithms
Evaluation Environment
Evaluation
Conclusions

Conclusions
Problems of unfair cache sharing:
– Sub-optimal throughput
– Thread starvation
– Priority inversion
– Thread-mix dependent throughput
Contributions:
– Cache fairness metrics
– Static/dynamic fair caching algorithms
Benefits of fair caching:
– Fairness: 4x improvement
– Throughput: 15% improvement, comparable to a cache-miss-minimization approach
– Fair caching simplifies scheduler design
– Simple hardware support

Partitioning Histogram
(Figure.) The target partition mostly oscillates between two partitioning choices; a T_rollback of 35% can still find a better partition.

Impact of Partition Granularity in FairM1Dyn
(Figure.) A 64KB granularity shows the best fairness and throughput.

Impact of Initial Partition in FairM1Dyn
(Figure.) Differences across various initial partitions are tolerable; starting from an equal partition alleviates the local-optimum problem.

SpeedUp over Batch Scheduling
(Figure.) FairM1Dyn and FairM3Dyn show the best speedup.