Providing High and Predictable Performance in Multicore Systems Through Shared Resource Management
Lavanya Subramanian

Shared Resource Interference
(Diagram: multiple cores contend for a shared cache and main memory.)

High and Unpredictable Application Slowdowns
1. High application slowdowns due to shared resource interference
2. An application's performance depends on which applications it is running with

Outline
Goals: 1. High performance 2. Predictable performance
– Blacklisting memory scheduler
– Predictability with memory interference


Background: Main Memory
FR-FCFS memory scheduler [Zuravleff and Robinson, US Patent '97; Rixner et al., ISCA '00]
– Row-buffer hit first
– Older request first
Unaware of inter-application interference
(Diagram: the memory controller drives a channel to banks 0–3; each bank has rows, columns, and a row buffer; a request to the currently open row is a row-buffer hit, any other request is a row-buffer miss.)
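For concreteness, here is a minimal software sketch of FR-FCFS arbitration; the `Request` fields and the `open_rows` bookkeeping are illustrative assumptions, not the hardware design:

```python
from dataclasses import dataclass

@dataclass
class Request:
    arrival_time: int  # lower = older; used by the "older request first" rule
    bank: int
    row: int

def frfcfs_pick(queue, open_rows):
    """FR-FCFS (sketch): among queued requests, serve row-buffer hits
    first, then the oldest request. open_rows maps bank -> open row."""
    hits = [r for r in queue if open_rows.get(r.bank) == r.row]
    candidates = hits or queue  # fall back to all requests if no hits
    return min(candidates, key=lambda r: r.arrival_time)
```

Note how the policy never looks at which application a request belongs to, which is exactly why it is application-unaware.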

Tackling Inter-Application Interference: Memory Request Scheduling
– Monitor application memory access characteristics
– Rank applications based on memory access characteristics
– Prioritize requests at the memory controller, based on ranking

An Example: Thread Cluster Memory Scheduling
(Figure: threads in the system are divided into a memory-non-intensive cluster, prioritized for throughput, and a memory-intensive cluster, managed for fairness. Figure from Kim et al., MICRO 2010.)

Problems with Previous Application-Aware Memory Schedulers
– Hardware complexity: ranking incurs high hardware cost
– Unfair slowdowns of some applications: ranking causes unfairness

High Hardware Complexity
Ranking incurs high hardware cost:
– Rank computation incurs logic/storage cost
– Rank enforcement requires comparison logic
(Chart: the 8x and 1.8x markers compare the cost of ranking-based scheduling against simpler designs; synthesized with a 32nm standard cell library.)
Takeaway: avoid ranking to achieve low hardware cost.

Ranking Causes Unfair Application Slowdowns
Lower-ranked applications experience significant slowdowns:
– Low memory service causes slowdown
– Periodic rank shuffling is not sufficient
(Charts: slowdowns of astar and soplex.)
Takeaway: grouping offers lower unfairness than ranking.

Problems with Previous Application-Aware Memory Schedulers
– Hardware complexity: ranking incurs high hardware cost
– Unfair slowdowns of some applications: ranking causes unfairness
Our Goal: design a memory scheduler with low complexity, high performance, and fairness.

Towards a New Scheduler Design
1. Simple grouping mechanism
– Monitor applications that have a number of consecutive requests served
– Blacklist such applications
2. Enforcing priorities based on grouping
– Prioritize requests of non-blacklisted applications
– Periodically clear blacklists
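A minimal sketch of this mechanism follows; the threshold and clearing interval are illustrative assumptions (the real design is a handful of counters in the memory controller, not software):

```python
class BlacklistingScheduler:
    """Sketch of the blacklisting idea: an application whose requests are
    served many times in a row gets blacklisted; non-blacklisted
    applications are prioritized; blacklists are cleared periodically."""

    def __init__(self, streak_threshold=4, clear_interval=10_000):
        self.streak_threshold = streak_threshold  # consecutive serves before blacklisting (assumed value)
        self.clear_interval = clear_interval      # cycles between blacklist clears (assumed value)
        self.blacklist = set()
        self.last_app = None
        self.streak = 0

    def on_request_served(self, app_id):
        # Count consecutive requests served from the same application.
        self.streak = self.streak + 1 if app_id == self.last_app else 1
        self.last_app = app_id
        if self.streak >= self.streak_threshold:
            self.blacklist.add(app_id)

    def tick(self, cycle):
        # Periodically clear the blacklist so groupings do not persist.
        if cycle > 0 and cycle % self.clear_interval == 0:
            self.blacklist.clear()

    def priority(self, app_id):
        # Non-blacklisted applications win; FR-FCFS rules break ties below.
        return 1 if app_id not in self.blacklist else 0
```

Because the scheduler only groups applications (blacklisted vs. not) instead of fully ranking them, it avoids the comparison logic that makes ranking-based schedulers expensive.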

Methodology
Configuration of our simulated system:
– 24 cores
– 4 channels, 8 banks/channel
– DDR DRAM
– 512 KB private cache/core
Workloads:
– SPEC CPU2006, TPC-C, Matlab
– 80 multiprogrammed workloads

Metrics
– System performance
– Fairness
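The metric equations on this slide were images that did not survive extraction. The metrics standard in this line of work, and consistent with the maximum-slowdown results quoted later, are weighted speedup for system performance and maximum slowdown for unfairness:

```latex
\text{System Performance} = \sum_{i} \frac{IPC_i^{\text{shared}}}{IPC_i^{\text{alone}}}
\qquad
\text{Unfairness} = \max_{i}\,\text{Slowdown}_i = \max_{i} \frac{IPC_i^{\text{alone}}}{IPC_i^{\text{shared}}}
```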

Previous Memory Schedulers
FR-FCFS [Zuravleff and Robinson, US Patent 1997; Rixner et al., ISCA 2000]
– Prioritizes row-buffer hits and older requests
– Application-unaware
PARBS [Mutlu and Moscibroda, ISCA 2008]
– Batches oldest requests from each application; prioritizes batch
– Employs ranking within a batch
ATLAS [Kim et al., HPCA 2010]
– Prioritizes applications with low memory intensity
TCM [Kim et al., MICRO 2010]
– Always prioritizes low-memory-intensity applications
– Shuffles request priorities of high-memory-intensity applications

Performance Results
– 5% higher system performance and 25% lower maximum slowdown than TCM
– Approaches the fairness of PARBS and FRFCFS-Cap while achieving better performance than TCM

Complexity Results
– Blacklisting achieves 70% lower latency than TCM
– Blacklisting achieves 43% lower area than TCM

Outline
Goals: 1. High performance 2. Predictable performance
– Blacklisting memory scheduler
– Predictability with memory interference

Need for Predictable Performance
There is a need for predictable performance:
– When multiple applications share resources
– Especially if some applications require performance guarantees
Example 1: In server systems
– Different users' jobs are consolidated onto the same server
– Need to provide bounded slowdowns to critical jobs
Example 2: In mobile systems
– Interactive applications run with non-interactive applications
– Need to guarantee performance for interactive applications
As a first step: predictable performance in the presence of memory interference.

Outline
Goals: 1. High performance 2. Predictable performance
– Blacklisting memory scheduler
– Predictability with memory interference

Predictability in the Presence of Memory Interference
1. Estimate Slowdown
– Key Observations
– MISE Operation: Putting it All Together
– Evaluating the Model
2. Control Slowdown
– Providing Soft Slowdown Guarantees
– Minimizing Maximum Slowdown


Slowdown: Definition
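The definition on the original slide was an image; throughout this work, an application's slowdown is its alone performance relative to its performance when sharing resources:

```latex
\text{Slowdown} = \frac{\text{Performance}_{\text{alone}}}{\text{Performance}_{\text{shared}}}
```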

Key Observation 1
For a memory-bound application, performance ∝ memory request service rate
– The request service rate is easy to measure; alone performance is difficult to measure while the application shares resources
(Chart: performance vs. normalized request service rate, measured on a 4-core Intel Core i7 with 8.5 GB/s memory bandwidth.)

Key Observation 2
The Request Service Rate Alone (RSR_Alone) of an application can be estimated by giving the application the highest priority in accessing memory
Highest priority → little interference (almost as if the application were run alone)

Key Observation 2 (illustration)
(Diagram: request service order in three cases: 1. the application runs alone; 2. it runs with another application; 3. it runs with another application but at highest priority. With highest priority, its requests complete in nearly the same number of time units as when run alone.)

The Memory Interference-induced Slowdown Estimation (MISE) model for memory-bound applications
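The equation on this slide was an image; since performance tracks the request service rate for memory-bound applications (Key Observation 1), the model reduces to a ratio of service rates:

```latex
\text{Slowdown} = \frac{RSR_{\text{alone}}}{RSR_{\text{shared}}}
```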

Key Observation 3
A memory-bound application alternates between compute and memory phases
(Timeline: no interference vs. with interference; the memory phase stretches under interference, so memory-phase slowdown dominates overall slowdown.)

Key Observation 3 (continued)
A non-memory-bound application spends a significant fraction of its time in the compute phase
(Timeline: with interference, only the memory fraction (α) of execution time slows down.)
This yields the Memory Interference-induced Slowdown Estimation (MISE) model for non-memory-bound applications.
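The model equation was lost in extraction; with α the fraction of execution time spent in the memory phase, the compute fraction is unaffected while the memory fraction scales with the service-rate ratio:

```latex
\text{Slowdown} = (1 - \alpha) + \alpha \cdot \frac{RSR_{\text{alone}}}{RSR_{\text{shared}}}
```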

Predictability in the Presence of Memory Interference
1. Estimate Slowdown
– Key Observations
– MISE Operation: Putting it All Together
– Evaluating the Model
2. Control Slowdown
– Providing Soft Slowdown Guarantees
– Minimizing Maximum Slowdown

MISE Operation: Putting it All Together
(Timeline: execution is divided into intervals; during each interval, measure RSR_Shared and estimate RSR_Alone; at the end of each interval, estimate slowdown.)
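A sketch of one such interval, assuming hypothetical measurement hooks (none of these names are the hardware interface):

```python
def mise_estimate_slowdown(app, controller, alpha):
    """One MISE interval (sketch): measure the application's request
    service rate under sharing, sample its alone rate by briefly giving
    it highest priority, then apply the MISE model. `controller.measure_rsr`
    and `alpha` (the app's memory phase fraction) are assumed inputs."""
    rsr_shared = controller.measure_rsr(app, highest_priority=False)
    rsr_alone = controller.measure_rsr(app, highest_priority=True)
    # For memory-bound applications alpha ~ 1, reducing to rsr_alone / rsr_shared.
    return (1 - alpha) + alpha * (rsr_alone / rsr_shared)
```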

Predictability in the Presence of Memory Interference
1. Estimate Slowdown
– Key Observations
– MISE Operation: Putting it All Together
– Evaluating the Model
2. Control Slowdown
– Providing Soft Slowdown Guarantees
– Minimizing Maximum Slowdown

Previous Work on Slowdown Estimation
– STFM (Stall-Time Fair Memory) Scheduling [Mutlu et al., MICRO '07]
– FST (Fairness via Source Throttling) [Ebrahimi et al., ASPLOS '10]
Basic idea: count the number of cycles an application receives interference (easy) and use that count to estimate its alone stall time (difficult).

Two Major Advantages of MISE Over STFM
Advantage 1:
– STFM estimates alone performance while an application is receiving interference → difficult
– MISE estimates alone performance while giving an application the highest priority → easier
Advantage 2:
– STFM does not take into account the compute phase for non-memory-bound applications
– MISE accounts for the compute phase → better accuracy

Methodology
Configuration of our simulated system:
– 4 cores
– 1 channel, 8 banks/channel
– DDR DRAM
– 512 KB private cache/core
Workloads:
– SPEC CPU2006
– 300 multiprogrammed workloads

Quantitative Comparison
(Chart: estimated vs. actual slowdown over time for the SPEC CPU2006 application leslie3d.)

Comparison to STFM
(Charts: estimated vs. actual slowdowns for cactusADM, GemsFDTD, soplex, wrf, calculix, and povray.)
– Average error of MISE: 8.2%
– Average error of STFM: 29.4% (across 300 workloads)

Predictability in the Presence of Memory Interference
1. Estimate Slowdown
– Key Observations
– MISE Operation: Putting it All Together
– Evaluating the Model
2. Control Slowdown
– Providing Soft Slowdown Guarantees
– Minimizing Maximum Slowdown

MISE-QoS: Providing "Soft" Slowdown Guarantees
Goals:
1. Ensure QoS-critical applications meet a prescribed slowdown bound
2. Maximize system performance for other applications
Basic idea:
– Allocate just enough bandwidth to the QoS-critical application
– Assign remaining bandwidth to other applications
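A sketch of the allocation loop under these goals; `estimate_slowdown(app, share)` is a hypothetical hook onto the MISE model, and the step size is an illustrative assumption:

```python
def allocate_bandwidth(qos_app, other_apps, slowdown_bound, estimate_slowdown):
    """MISE-QoS (sketch): grow the QoS-critical application's bandwidth
    share until its estimated slowdown meets the bound, then split the
    remainder evenly among the other applications."""
    share = 0.05
    while share < 1.0 and estimate_slowdown(qos_app, share) > slowdown_bound:
        share += 0.05  # allocate just enough bandwidth, in small steps
    shares = {qos_app: share}
    for app in other_apps:
        shares[app] = (1.0 - share) / len(other_apps)
    return shares
```

If even the full bandwidth cannot meet the bound, the loop exits at a share of 1.0; a real scheme would then report that the bound is infeasible.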

Outline
Goals: 1. High performance 2. Predictable performance
– Blacklisting memory scheduler
– Predictability with memory interference

A Recap
Problem: shared resource interference causes high and unpredictable application slowdowns
Approach:
– Simple mechanisms to mitigate interference
– Slowdown estimation models
– Slowdown control mechanisms
Future work:
– Extending to shared caches

Shared Cache Interference
(Diagram: cores now contend at the shared cache as well as main memory.)

Impact of Cache Capacity Contention
(Charts: application slowdowns with only main memory shared vs. with main memory and caches shared.)
Cache capacity interference causes high application slowdowns.

Backup Slides

Outline
Goals: 1. High performance 2. Predictable performance
– Blacklisting memory scheduler
– Predictability with memory interference
– Cache slowdown estimation
– Coordinated cache/memory management for performance
– Coordinated cache/memory management for predictability


Request Service vs. Memory Access
(Diagram: the memory access rate at the shared cache vs. the request service rate at main memory.)
Request service and memory access rates are tightly coupled.

Estimating Cache and Memory Slowdowns Through Cache Access Rates
(Diagram: the application's access rate at the shared cache serves as the proxy for its performance.)
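By analogy with MISE's request service rates, slowdown can then be estimated from the cache access rate an application achieves alone versus shared; this ratio is the basis of the Application Slowdown Model pursued in this line of work:

```latex
\text{Slowdown} = \frac{CAR_{\text{alone}}}{CAR_{\text{shared}}}
```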

Cache Access Rate vs. Slowdown
(Charts: slowdown tracks the cache access rate ratio for bzip2 and xalancbmk.)

Challenge
How to estimate the alone cache access rate?
(Diagram: the shared cache is augmented with auxiliary tags, and the application is given priority, to support the estimate.)

Outline
Goals: 1. High performance 2. Predictable performance
– Blacklisting memory scheduler
– Predictability with memory interference
– Cache slowdown estimation
– Coordinated cache/memory management for performance
– Coordinated cache/memory management for predictability

Leveraging Slowdown Estimates for Performance Optimization
How do we leverage slowdown estimates to achieve high performance when allocating:
– Memory bandwidth?
– Cache capacity?
Leverage other metrics along with slowdowns:
– Memory intensity
– Cache miss rates

Coordinated Resource Allocation Schemes
(Diagram: cache capacity-aware bandwidth allocation at main memory, and bandwidth-aware cache capacity allocation at the shared cache.)

Outline
Goals: 1. High performance 2. Predictable performance
– Blacklisting memory scheduler
– Predictability with memory interference
– Cache slowdown estimation
– Coordinated cache/memory management for performance
– Coordinated cache/memory management for predictability

Coordinated Resource Management Schemes for Predictable Performance
Goal: allocate cache capacity and memory bandwidth such that an application meets a slowdown bound
Challenges:
– Large search space of potential cache capacity and memory bandwidth allocations
– Multiple possible combinations of cache/memory allocations for each application

Outline
Goals: 1. High performance 2. Predictable performance
– Blacklisting memory scheduler
– Predictability with memory interference
– Cache slowdown estimation
– Coordinated cache/memory management for performance
– Coordinated cache/memory management for predictability

Timeline

Summary
Problem: shared resource interference causes high and unpredictable application slowdowns
Goals: high and predictable performance
Our approach:
– Simple mechanisms to mitigate interference
– Slowdown estimation models
– Coordinated cache/memory management