Writeback-Aware Bandwidth Partitioning for Multi-core Systems with PCM
Miao Zhou, Yu Du, Bruce Childers, Rami Melhem, Daniel Mossé
University of Pittsburgh

Introduction
- DRAM main memory is not energy efficient
  - Data centers are energy hungry
  - DRAM memory consumes 20-40% of the energy
- Apply PCM as main memory
  - Energy efficient, but slower reads, much slower writes, and a shorter lifetime
- Hybrid memory: add a DRAM cache (LLC)
  - Improves performance (lower LLC miss rate)
  - Extends lifetime (lower LLC writeback rate)
- How to manage the shared resources?
[Figure: four cores (C0-C3) with private L1/L2 caches, a shared DRAM LLC, and PCM main memory]

Shared Resource Management
- CMP systems
  - Shared resources: the last-level cache and the memory bandwidth
  - Unmanaged resources lead to interference and poor performance
  - Partitioning the resources reduces interference and improves performance
- Prior work and this work:

                        Cache Partitioning                 Bandwidth Partitioning
  DRAM main memory      UCP [Qureshi et al., MICRO-39]     RBP [Liu et al., HPCA '10]
  Hybrid main memory    WCP [Zhou et al., HiPEAC '12]      This work

- Utility-based Cache Partitioning (UCP): tracks utility (LLC hits/misses) and minimizes overall LLC misses
- Read-only Bandwidth Partitioning (RBP): partitions the bus bandwidth based on LLC miss information
- Writeback-aware Cache Partitioning (WCP): tracks and minimizes LLC misses and writebacks
- Questions:
  1. Is read-only (LLC miss) information enough?
  2. Is bus bandwidth still the bottleneck?
[Figure: four cores (C0-C3) with private L1/L2 caches, a shared LLC, and main memory]

Bandwidth Partitioning
- An analytic model guides the run-time partitioning:
  - Use queuing theory to model delay
  - Monitor performance to estimate the parameters of the model
  - Find the partition that maximizes the system's performance
  - Enforce the partition at run time
- DRAM vs. hybrid main memory: PCM writes are extremely slow and power hungry
- Issues specific to hybrid main memory:
  - Is the bottleneck the bus bandwidth or the device bandwidth?
  - Can we ignore the bandwidth consumed by LLC writebacks?

Device Bandwidth Utilization
[Figure: device bandwidth utilization breakdown for DRAM-only and hybrid (DRAM + PCM) main memory]
- DRAM memory: low device bandwidth utilization; memory reads (LLC misses) dominate
- Hybrid memory: high device bandwidth utilization; memory writes (LLC writebacks) often dominate

RBP on Hybrid Main Memory
[Figure: RBP vs. SHARE across workloads, ordered by the percentage of device bandwidth consumed by PCM writes (LLC writebacks), from 10% to 90%]
1. RBP outperforms SHARE for workloads dominated by PCM reads (LLC misses)
2. RBP loses to SHARE for workloads dominated by PCM writes (LLC writebacks)
A new bandwidth partitioning scheme is necessary for hybrid memory.

Writeback-Aware Bandwidth Partitioning
- Focuses on the collective bandwidth of the PCM devices
- Considers LLC writeback information
- Token bucket algorithm (sketched below)
  - Device service units = tokens
  - Tokens are allocated among applications every epoch (5 million cycles)
- Analytic model
  - Maximizes weighted speedup
  - Models contention for bandwidth as queuing delay
  - Difficulty: a write is blocking only when the write queue is full
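As a rough illustration of the token-bucket idea above (device service units as tokens, re-allocated every 5-million-cycle epoch), the sketch below converts per-application bandwidth shares into per-epoch token budgets. The epoch length comes from the slide; the total service-unit count, the helper name, and the conversion itself are illustrative assumptions.

```python
# Sketch: turning per-application PCM bandwidth shares into token budgets for
# one epoch. "Tokens" stand for device service units, as on the slide; the
# total units per epoch and the helper below are illustrative assumptions.

EPOCH_CYCLES = 5_000_000           # epoch length from the slide
SERVICE_UNITS_PER_EPOCH = 100_000  # assumed total PCM service units per epoch

def allocate_tokens(shares):
    """shares: dict app_id -> fraction of PCM device bandwidth (sums to <= 1.0)."""
    return {app: int(frac * SERVICE_UNITS_PER_EPOCH) for app, frac in shares.items()}

# Example: three applications with 50% / 30% / 20% of the device bandwidth.
buckets = allocate_tokens({"app0": 0.5, "app1": 0.3, "app2": 0.2})
print(buckets)  # {'app0': 50000, 'app1': 30000, 'app2': 20000}
```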

Analytic Model for Bandwidth Partitioning
- For a single core:
  - Additive CPI formula: CPI = CPI_LLC∞ + LLC miss frequency × LLC miss penalty, where CPI_LLC∞ is the CPI with an infinite LLC and the second term is the CPI added by LLC misses
  - Memory is modeled as a queue: the memory service time is approximated by the queuing delay, with the LLC miss rate λ_m as the request arrival rate and the memory bandwidth α as the service rate
- For a CMP: each core i has its own LLC miss rate λ_m,i and memory bandwidth share α_i, and the shares α_1 … α_N are chosen to maximize weighted speedup (written out below)
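Spelled out, the single-core term and the CMP objective above might read as follows; the M/M/1-style expression for the memory delay T_m is my simplifying assumption, not necessarily the exact service-time model used in the talk.

```latex
% Additive CPI model for core i (read path only), with an assumed
% M/M/1-style queuing delay standing in for the memory service time.
\begin{align*}
  \mathrm{CPI}_i &= \mathrm{CPI}_{\mathrm{LLC}\infty,\,i} + \lambda_{m,i}\, T_m(\alpha_i),
  \qquad T_m(\alpha_i) \approx \frac{1}{\alpha_i - \lambda_{m,i}} \\
  \max_{\alpha_1,\dots,\alpha_N} &\; \sum_{i=1}^{N}
      \frac{\mathrm{CPI}_{\text{alone},i}}{\mathrm{CPI}_i}
  \qquad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i \le \alpha_{\text{total}}
\end{align*}
```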

Analytic Model for WBP
- Take LLC writebacks into account:
  - CPI = CPI_LLC∞ + LLC miss frequency × LLC miss penalty + LLC writeback frequency × LLC writeback penalty
- Each core i has an LLC miss rate λ_m,i served by its read bandwidth share α_i (read queue, RQ) and an LLC writeback rate λ_w,i served by its write bandwidth share β_i (write queue, WQ)
- The writeback term is weighted by P, the probability that writebacks are on the critical path; how to determine P?
- The shares are again chosen to maximize weighted speedup (see the sketch below)
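A minimal sketch of this extended model, assuming M/M/1-style delays for both the read and the write queue, followed by a comparison of two hand-picked candidate partitions; the numbers, the function names, and the delay expressions are illustrative assumptions, not the paper's exact model.

```python
# Sketch: writeback-aware CPI model with an assumed M/M/1-style delay for the
# read queue (alpha) and the write queue (beta). The factor p weights how
# often writebacks are on the critical path.

def cpi(cpi_llc_inf, lam_m, lam_w, alpha, beta, p):
    """CPI = CPI_LLCinf + miss freq * miss penalty + p * writeback freq * writeback penalty."""
    if alpha <= lam_m or beta <= lam_w:
        return float("inf")                          # allocated bandwidth cannot keep up
    return (cpi_llc_inf
            + lam_m * (1.0 / (alpha - lam_m))        # LLC miss term
            + p * lam_w * (1.0 / (beta - lam_w)))    # LLC writeback term

def weighted_speedup(apps, alphas, betas, p):
    """apps: list of (cpi_llc_inf, lam_m, lam_w, cpi_alone) per application."""
    return sum(cpi_alone / cpi(c, m, w, a, b, p)
               for (c, m, w, cpi_alone), a, b in zip(apps, alphas, betas))

# Example: an even split vs. a writeback-biased split for two toy applications.
apps = [(1.0, 0.02, 0.01, 1.5), (1.0, 0.01, 0.03, 1.8)]
print(weighted_speedup(apps, alphas=(0.05, 0.05), betas=(0.05, 0.05), p=0.5))
print(weighted_speedup(apps, alphas=(0.06, 0.04), betas=(0.03, 0.07), p=0.5))
```

In this toy case the second (writeback-biased) partition yields the higher weighted speedup, which is the kind of trade-off WBP searches for each epoch.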

Dynamic Weight Adjustment
- Choose P based on the expected number of executed instructions (EEI)
- WBP computes a candidate partition (α_i, β_i per application) for each candidate weight p_1 … p_m; each candidate yields an expected EEI (EEI_1 … EEI_m)
- The per-application bandwidth utilization ratio (BU = utilized bandwidth : allocated bandwidth) and the actual EEI of the finished epoch are then used to select P for the next epoch (one possible selection step is sketched below)
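The slide does not spell out the selection rule, so the sketch below is one plausible reading: calibrate each candidate's predicted EEI with the measured bandwidth utilization and keep the weight whose prediction tracked the finished epoch best. Both the BU-based calibration and the closest-match criterion are assumptions.

```python
# Sketch of Dynamic Weight Adjustment (DWA). WBP is evaluated for several
# candidate weights p1..pm, each yielding an expected number of executed
# instructions (EEI); the weight whose prediction best matches the epoch's
# actual EEI is used next. The calibration below is an assumed mechanism.

def choose_p(predicted_eei, actual_eei, bu_ratio):
    """predicted_eei: dict mapping candidate p -> EEI predicted by WBP(p);
    actual_eei: instructions executed in the finished epoch;
    bu_ratio: utilized/allocated bandwidth (BU) reported by the monitor."""
    calibrated = {p: eei * min(bu_ratio, 1.0) for p, eei in predicted_eei.items()}
    return min(calibrated, key=lambda p: abs(calibrated[p] - actual_eei))

# Toy example: the p = 0.5 prediction tracks the measured epoch best.
print(choose_p({0.25: 9.0e8, 0.5: 1.1e9, 0.75: 1.3e9},
               actual_eei=1.0e9, bu_ratio=0.9))   # -> 0.5
```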

Architecture Overview
- BUMon tracks per-application information during an epoch
- DWA and WBP compute the bandwidth partition for the next epoch
- The Bandwidth Regulator enforces that configuration (a minimal control loop is sketched below)
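Tying the three components together, a software analogue of the per-epoch flow could look like this; every interface here is hypothetical and only meant to show the data flow between BUMon, DWA/WBP, and the Bandwidth Regulator.

```python
# Sketch of the per-epoch control loop implied by the slide: measure, decide,
# enforce. The callables are stand-ins for BUMon, DWA, WBP, and the regulator.

def run_epoch(collect_stats, choose_p, partition_bandwidth, apply_partition):
    stats = collect_stats()                   # BUMon: miss/writeback rates, BU, EEI
    p = choose_p(stats)                       # DWA: writeback weight for next epoch
    tokens = partition_bandwidth(stats, p)    # WBP: per-app read/write token budgets
    apply_partition(tokens)                   # Bandwidth Regulator
    return tokens

# Toy wiring with stand-in callables, just to show the flow.
tokens = run_epoch(
    collect_stats=lambda: {"bu": 0.85, "eei": 1.0e9},
    choose_p=lambda s: 0.5,
    partition_bandwidth=lambda s, p: {"app0": 60_000, "app1": 40_000},
    apply_partition=lambda t: None)
print(tokens)
```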

Enforcing Bandwidth Partitioning
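The transcript carries no text for this slide beyond the title, but a minimal software analogue of token-bucket enforcement, under the assumption that each PCM service unit drains one token and an empty bucket stalls further requests until the next refill, might look like this. The class and its interface are hypothetical, not the hardware described in the talk.

```python
# Sketch: enforcing per-epoch token budgets. Each PCM service unit consumed by
# an application drains its bucket; requests from an empty bucket are held
# until the next epoch refill. Illustrative mechanism only.

class TokenRegulator:
    def __init__(self, budgets):
        self.budgets = dict(budgets)          # app_id -> tokens left in this epoch

    def can_issue(self, app, cost=1):
        """True if the request may be sent to the PCM devices now."""
        return self.budgets.get(app, 0) >= cost

    def charge(self, app, cost=1):
        self.budgets[app] -= cost

    def refill(self, budgets):
        """Called at every epoch boundary with the new WBP allocation."""
        self.budgets = dict(budgets)

reg = TokenRegulator({"app0": 2, "app1": 1})
for app in ["app0", "app1", "app1"]:
    if reg.can_issue(app):
        reg.charge(app)
        print(app, "issued")
    else:
        print(app, "throttled until next epoch")
```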

Simulation Setup
- Configuration:
  - 8-core CMP, 168-entry instruction window
  - Private 4-way 64KB L1 and private 8-way 2MB L2 caches
  - Partitioned 32MB LLC, 12.5 ns latency
  - 64GB PCM, 4 channels of 2 ranks each, 50 ns read latency, 1000 ns write latency
- Benchmarks:
  - SPEC CPU2006, classified into three types (W, R, RW) based on whether PCM reads or writes dominate bandwidth consumption
  - 15 multiprogrammed workloads (Light, High)
- Sensitivity study on write latency, number of channels, and number of cores

Effective Read Latency
1. Different workloads favor different static policies (partitioning weights)
2. WBP+DWA matches the best static policy
3. WBP+DWA reduces the effective read latency by 31.9% over RBP

Throughput
1. The best weight varies across workloads (more writebacks call for a larger weight)
2. WBP+DWA achieves performance comparable to the best static weight
3. WBP+DWA improves throughput by 24.2% over RBP

Fairness (Harmonic IPC)
- WBP+DWA improves fairness by an average of 16.7% over RBP

Conclusions
- PCM device bandwidth is the bottleneck in hybrid memory
- Writeback information is important: LLC writebacks consume a substantial portion of memory bandwidth
- WBP can better partition the PCM bandwidth
- WBP outperforms RBP by an average of 24.9% in terms of weighted speedup

Thank you. Questions?