A Bandwidth-aware Memory-subsystem Resource Management using Non-invasive Resource Profilers for Large CMP Systems
Dimitris Kaseridis, Jeff Stuecheli, Jian Chen and Lizy K. John
Department of Electrical and Computer Engineering, The University of Texas at Austin, TX, USA; IBM Corp., Austin, TX, USA
Reviewed by: Stanley Ikpe

Overview
- Terminology
- Paper Breakdown
- Paper Summary
- Objective
- Implementation
- Results
- General Comments
- Discussion Topics

Terminology
- Chip Multiprocessor (CMP): multiple processor cores on a single chip
- Throughput: measure of work completed, e.g., successful messages delivered
- Bandwidth (memory): rate at which data can be read or stored
- Quality of Service (QoS): ability to give priority to specific applications
- Fairness: ability to allocate resources equitably among competing cores
- Resources: facilities used to perform work (here, cache capacity and memory bandwidth)
- Last Level Cache (LLC): the largest (and slowest) cache in the hierarchy, on or off chip

Paper Breakdown
- Motivation: CMP integration provides an opportunity for improved throughput; conversely, sharing resources can be hazardous to performance.
- Causes: parallel applications, where each thread (core) places different demands/requests on common (shared) resources.
- Effects: inconsistent performance and resource contention (unfairness).

Paper Breakdown
- So how do we fix this?
- Resource management: control the allocation and use of the available resources.
- What are some of these resources?
  - Cache capacity
  - Available memory bandwidth

Paper Breakdown
- How do we go about resource management?
- Predictive workload monitoring: infer what resources will be needed, using a non-invasive (hardware) method of profiling resource usage (cache capacity and memory bandwidth).
- System-wide resource allocation and job scheduling: identify over-utilized CMPs (in bandwidth terms) and reallocate work.

Baseline Architecture

Set-Associative Design [3]

Objectives
- Create an algorithm to effectively project per-core memory bandwidth and cache capacity requirements.
- Implement it for system-wide optimization of resource allocation and job scheduling.
- Improve potential throughput for CMP systems.

Implementation
- Resource profiling: a prediction scheme to detect cache misses and bandwidth requirements.
- Mattson's stack distance algorithm (MSA): originally a method for reducing the simulation time of trace-driven cache studies (Mattson et al. [2]).
- MSA-based profiler for LLC misses: a K-way set-associative cache implies K+1 counters. An access that hits at stack-distance position i increments counter i; a cache miss increments counter K+1. (A minimal sketch follows the next slide.)

MSA-based profiler for LLC misses
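
The profiler itself appears on this slide as a figure that did not survive the transcript. As a rough software illustration only (not the paper's hardware design; the class and method names are invented for this summary), an MSA stack-distance profiler for one K-way set could look like this in Python:

    class StackDistanceProfiler:
        """Software sketch (not the paper's hardware) of an MSA profiler
        for one K-way cache set. Counter i counts hits at stack distance i;
        counter K counts misses, so the projected miss count for an
        allocation of w <= K ways is sum(counters[w:])."""

        def __init__(self, associativity):
            self.k = associativity
            self.stack = []                        # MRU order: index 0 = most recent
            self.counters = [0] * (associativity + 1)

        def access(self, tag):
            if tag in self.stack:
                depth = self.stack.index(tag)      # stack distance of this hit
                self.counters[depth] += 1
                self.stack.pop(depth)
            else:
                self.counters[self.k] += 1         # the "K+1" (miss) counter
                if len(self.stack) == self.k:
                    self.stack.pop()               # evict the LRU entry
            self.stack.insert(0, tag)              # promote to MRU

        def projected_misses(self, ways):
            # Misses this access stream would suffer with only `ways` ways.
            return sum(self.counters[ways:])

The useful property is that one pass over the access stream yields projected_misses(w) for every candidate allocation w at once, which is what the allocation algorithms later in the deck consume.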

Implementation
- MSA-based profiler for memory bandwidth: estimates the memory bandwidth required to read (due to cache fills) and to write (due to dirty cache lines written back to main memory).
- A hit to a dirty cache line indicates a write-back operation whenever the allocated cache capacity is smaller than the line's stack distance.
- The Dirty Stack Distance tracks the largest stack distance at which a dirty line is accessed: a dirty counter projects the write-back rate, and a dirty bit marks the greatest stack distance the dirty line has reached. (A reconstruction follows the next slide.)

Write-back pseudocode
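
The pseudocode on this slide is likewise an image not captured in the transcript. Based purely on the description on the previous slide (dirty stack distance, dirty counter, dirty bit), a plausible reconstruction extending the sketch above might be the following; this is an assumption, not the paper's exact pseudocode:

    class WritebackProfiler(StackDistanceProfiler):
        """Sketch (reconstructed, not the paper's pseudocode): extends the
        MSA profiler with dirty-line tracking. dirty_counters[d] counts
        write-backs that would occur under any allocation of w <= d ways,
        so projected write-back traffic for w ways is sum(dirty_counters[w:])."""

        def __init__(self, associativity):
            super().__init__(associativity)
            self.dirty_counters = [0] * (associativity + 1)
            self.dirty_depth = {}   # tag -> max stack distance reached while dirty

        def access(self, tag, is_write=False):
            if tag in self.stack:
                depth = self.stack.index(tag)
                if tag in self.dirty_depth:
                    # Track the deepest point this dirty line has reached.
                    self.dirty_depth[tag] = max(self.dirty_depth[tag], depth)
                    if is_write:
                        # Re-dirtying: any allocation smaller than that depth
                        # would already have evicted (and written back) the line.
                        self.dirty_counters[self.dirty_depth[tag]] += 1
                        self.dirty_depth[tag] = 0
            elif len(self.stack) == self.k:
                victim = self.stack[-1]
                if victim in self.dirty_depth:
                    # True eviction of a dirty line: written back at any allocation.
                    self.dirty_counters[self.k] += 1
                    del self.dirty_depth[victim]
            super().access(tag)
            if is_write:
                self.dirty_depth.setdefault(tag, 0)

        def projected_writebacks(self, ways):
            # Write-backs this stream would generate with only `ways` ways.
            return sum(self.dirty_counters[ways:])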

Write-back Profiling Example

SPEC CPU 2006

Implementation
- Resource allocation: compute the Marginal Utility of a given workload across the range of possible cache allocations, so that every possible assignment of unused capacity (n new elements on top of c already-used elements) can be compared.
- Intra-chip partitioning algorithm: Marginal Utility is the figure of merit, measuring the amount of utility gained (cache misses removed) for a given amount of resource (cache capacity). The algorithm considers each core's ideal cache capacity and distributes specific cache ways per core. (A sketch follows the next slide.)

Algorithm
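
This slide shows the algorithm as a figure. As an illustrative sketch only, a greedy allocator driven by the profilers' miss histograms, in the spirit of the Marginal-Utility description above (the paper's exact procedure may differ), could be:

    def marginal_utility(profiler, c, n):
        # Utility per way of growing an allocation from c to c + n ways:
        # the reduction in projected misses, normalized by the n ways spent.
        return (profiler.projected_misses(c) - profiler.projected_misses(c + n)) / n

    def partition_ways(profilers, total_ways, min_ways=1):
        """Greedy sketch: repeatedly grant ways to the claim with the best
        marginal utility. `profilers` holds one profiler per core (e.g. the
        StackDistanceProfiler sketched earlier)."""
        alloc = [min_ways] * len(profilers)
        remaining = total_ways - sum(alloc)
        while remaining > 0:
            best = None                    # (utility, core, ways claimed)
            for core, prof in enumerate(profilers):
                for n in range(1, remaining + 1):
                    mu = marginal_utility(prof, alloc[core], n)
                    if best is None or mu > best[0]:
                        best = (mu, core, n)
            mu, core, n = best
            if mu <= 0:
                break                      # no core benefits from more capacity
            alloc[core] += n
            remaining -= n
        return alloc

Evaluating claims of n ways at a time, not just one, matters because miss curves need not be concave: a workload may see no benefit at all until its entire working set fits.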

Implementation
- Inter-chip partitioning algorithm: find an efficient workload schedule (below a threshold, or bandwidth limit) across all available CMPs in the system. A global implementation is used to mitigate misdistribution of the workload.
- The Marginal-Utility algorithm, alongside bandwidth over-commit detection, enables additional workload migration.
- Cache capacity: estimate the optimal resource assignment (Marginal Utility) and the intra-chip partitioning assignment; the algorithm performs workload swapping so that each core stays below the bandwidth limit.
- Memory bandwidth: the bandwidth over-commit algorithm finds workloads with high and low requirements and shifts work onto under-committed CMPs. (A sketch follows the next slide.)

Algorithm
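
Again the algorithm is shown as a figure. A loose sketch of the bandwidth over-commit rebalancing described above (the pairwise-swap policy and all names are assumptions, not the paper's procedure):

    def rebalance(chips, bw_limit):
        """Sketch: pairwise workload swaps between over- and under-committed
        chips. `chips` is a list of lists, one per CMP, holding each
        workload's projected bandwidth demand for the next epoch."""
        while True:
            over = max(chips, key=sum)         # most over-committed chip
            under = min(chips, key=sum)        # least loaded chip
            if sum(over) <= bw_limit or over is under:
                return chips                   # nothing left over-committed
            hi, lo = max(over), min(under)     # hungriest vs. lightest workload
            # Swap only if it relieves the hot chip without over-committing
            # the cold one; if no such swap exists, stop.
            if hi > lo and sum(under) - lo + hi <= bw_limit:
                over[over.index(hi)] = lo
                under[under.index(lo)] = hi
            else:
                return chips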

Example

Resource Management Scheme
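
The scheme on this slide is a figure. As a summary, the sketches above could be tied together into one per-epoch management loop; the epoch structure, line size, and helper names below are assumptions carried over from the earlier sketches, not taken from the paper:

    def manage_epoch(chip_profilers, ways_per_chip, bw_limit, line_bytes=64):
        """Sketch of one management epoch: profile -> intra-chip way
        partitioning -> inter-chip bandwidth rebalancing."""
        # 1. Intra-chip: partition each chip's LLC ways by marginal utility.
        partitions = [partition_ways(profs, ways_per_chip)
                      for profs in chip_profilers]
        # 2. Project each workload's bandwidth demand (fills + write-backs)
        #    under the allocation it was just granted.
        demands = [[(p.projected_misses(w) + p.projected_writebacks(w)) * line_bytes
                    for p, w in zip(profs, alloc)]
                   for profs, alloc in zip(chip_profilers, partitions)]
        # 3. Inter-chip: shift workloads off over-committed chips.
        return partitions, rebalance(demands, bw_limit)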

Results
- LLC misses: 25.7% average reduction relative to static-even partitioning (with an associated 1.4% storage overhead).
- The bandwidth-aware algorithm shows improvement up to the 8-CMP configuration; beyond that it shows diminishing returns.
- Miss rates are consistent across different cache sizes, with a slight improvement at larger sizes due to the greater number of cache ways and hence more potential workload-swapping candidates.

Results
- Memory bandwidth: reduction of the average worst-case per-chip memory bandwidth in the system (per epoch).
- The figure of merit is the long memory latency associated with over-committed memory bandwidth requirements on specific CMPs.
- The UCP+ algorithm (Marginal Utility / intra-chip) shows an average 19% improvement over static-even partitioning; the gain also grows with the number of CMPs, because random workload selection raises the average worst-case bandwidth.

Results
- Simulated throughput: used to measure the effectiveness of the implementation.
- Case 1: UCP+ only.
- Case 2: UCP+ plus the inter-chip (workload-swapping) bandwidth-aware algorithm.
- Case 1 shows 8.6% IPC and 15.3% MPKI improvements on chips 4 and 7 (swapping high-memory-bandwidth benchmarks for less demanding ones).
- Case 2 shows 8.5% IPC and 11% MPKI improvements, due to workload migration off the over-committed chip 7.

Comments
- No detailed hardware implementation of the "non-invasive" profilers is given.
- "Large" CMP systems are not actually demonstrated, due to simulation complexity.
- Good implementation of resource management overall.
- The design is limited with respect to additional cores.
- Cache designs other than set-associative are not considered.

References
[1] D. Kaseridis, J. Stuecheli, J. Chen and L. K. John, "A Bandwidth-aware Memory-subsystem Resource Management using Non-invasive Resource Profilers for Large CMP Systems," HPCA 2010.
[2] R. L. Mattson et al., "Evaluation techniques for storage hierarchies," IBM Systems Journal, 9(2):78-117, 1970.
[3] "Cache - Overview.pdf" (set-associative cache overview; source URL not preserved in the transcript).

Discussion Topics
- How could an inter-board partitioning algorithm be implemented? Is it necessary?
- What causes the diminishing returns beyond 8 CMP chips? Can this be circumvented?