Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems
Onur Mutlu and Thomas Moscibroda, Computer Architecture Group, Microsoft Research

Presentation transcript:

Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems
Onur Mutlu and Thomas Moscibroda
Computer Architecture Group, Microsoft Research

Outline
- Background and Goal
- Motivation: Destruction of Intra-thread DRAM Bank Parallelism
- Parallelism-Aware Batch Scheduling
  - Batching
  - Within-batch Scheduling
- System Software Support
- Evaluation
- Summary

The DRAM System
[Figure: the DRAM controller connected via the DRAM bus to Banks 0-3; each bank is an array of rows and columns with a row buffer]
FR-FCFS scheduling policy:
1) Row-hit first
2) Oldest first
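To make the baseline concrete, here is a minimal sketch of FR-FCFS as a selection rule over queued requests; the Request fields and function names are illustrative assumptions, not the talk's notation.

```python
from dataclasses import dataclass

@dataclass
class Request:
    thread_id: int
    bank: int
    row: int
    arrival_time: int     # smaller = older
    marked: bool = False  # used later by PAR-BS batching

def fr_fcfs_pick(queue, open_row):
    """FR-FCFS: prefer row-hit requests; among equals, prefer the oldest.
    open_row maps each bank to the row currently held in its row buffer."""
    def key(req):
        row_hit = (open_row.get(req.bank) == req.row)
        return (not row_hit, req.arrival_time)  # False sorts before True
    return min(queue, key=key)
```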

Multi-Core Systems
[Figure: a multi-core chip with Cores 0-3, each with a private L2 cache, sharing one DRAM memory controller and DRAM Banks 0-7; the threads' requests interfere in the shared DRAM memory system]

Inter-thread Interference in the DRAM System
- Threads delay each other by causing resource contention: bank, bus, and row-buffer conflicts [MICRO 2007]
- Threads can also destroy each other's DRAM bank parallelism: otherwise-parallel requests can become serialized
- Existing DRAM schedulers are unaware of this interference; they simply aim to maximize DRAM throughput
  - Thread-unaware and thread-unfair
  - No intent to service each thread's requests in parallel
  - The FR-FCFS policy (1) row-hit first, 2) oldest first) unfairly prioritizes threads with high row-buffer locality

Consequences of Inter-Thread Interference in DRAM
- Unfair slowdown of different threads [MICRO 2007]
- System performance loss [MICRO 2007]
- Vulnerability to denial of service [USENIX Security 2007]
- Inability to enforce system-level thread priorities [MICRO 2007]
[Figure: normalized memory stall time when a low-priority "memory performance hog" runs alongside a high-priority thread; DRAM is the only shared resource, yet cores make very slow progress]

Our Goal
Control inter-thread interference in DRAM. Design a shared DRAM scheduler that
- provides high system performance: preserves each thread's DRAM bank parallelism
- provides fairness to threads sharing the DRAM system: equalizes the memory slowdowns of equal-priority threads
- is controllable and configurable: enables different service levels for threads with different priorities

Outline
- Background and Goal
- Motivation: Destruction of Intra-thread DRAM Bank Parallelism
- Parallelism-Aware Batch Scheduling
  - Batching
  - Within-batch Scheduling
- System Software Support
- Evaluation
- Summary

The Problem
- Processors try to tolerate the latency of DRAM requests by generating multiple outstanding requests: Memory-Level Parallelism (MLP)
  - Enabled by out-of-order execution, non-blocking caches, runahead execution
- This is effective only if the DRAM controller actually services the multiple requests in parallel in DRAM banks
- Multiple threads share the DRAM controller, and DRAM controllers are not aware of a thread's MLP
  - They can service each thread's outstanding requests serially, not in parallel

Bank Parallelism of a Thread
[Figure: a single thread computes, then issues 2 DRAM requests, to Bank 0 (Row 1) and Bank 1 (Row 1); the bank access latencies of the two requests overlap, so the thread stalls for ~ONE bank access latency]

Bank Parallelism Interference in DRAM
[Figure: Thread A issues 2 DRAM requests, to Bank 0 (Row 1) and Bank 1 (Row 1); Thread B issues 2 DRAM requests, to Bank 1 (Row 99) and Bank 0 (Row 99). Under the baseline scheduler, each thread's bank access latencies are serialized, and each thread stalls for ~TWO bank access latencies]

Parallelism-Aware Scheduler
[Figure: the same two threads and requests. The baseline scheduler interleaves the threads across banks, so both threads stall for two bank access latencies. A parallelism-aware scheduler services Thread A's two requests in parallel first, then Thread B's: Thread A stalls for one latency, Thread B still for two, saving cycles overall. Average stall time: ~1.5 bank access latencies]
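The slide's arithmetic can be reproduced with a toy model in which each request occupies its bank for one access latency L and a thread stalls until its last request completes; the per-bank service orders below are the only assumption.

```python
L = 1.0  # one bank access latency

def stall_times(per_bank_order):
    """per_bank_order: bank -> list of thread names, in service order.
    The request in slot i of a bank completes at time (i + 1) * L;
    a thread stalls until its last request completes."""
    finish = {}
    for order in per_bank_order.values():
        for slot, thread in enumerate(order):
            done = (slot + 1) * L
            finish[thread] = max(finish.get(thread, 0.0), done)
    return finish

# Baseline: Bank 0 services A then B, Bank 1 services B then A.
print(stall_times({0: ["A", "B"], 1: ["B", "A"]}))  # {'A': 2.0, 'B': 2.0} -> avg 2 latencies
# Parallelism-aware: both banks service A first, then B.
print(stall_times({0: ["A", "B"], 1: ["A", "B"]}))  # {'A': 1.0, 'B': 2.0} -> avg 1.5 latencies
```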

Outline
- Background and Goal
- Motivation: Destruction of Intra-thread DRAM Bank Parallelism
- Parallelism-Aware Batch Scheduling (PAR-BS)
  - Request Batching
  - Within-batch Scheduling
- System Software Support
- Evaluation
- Summary

Parallelism-Aware Batch Scheduling (PAR-BS)
- Principle 1: Parallelism-awareness
  - Schedule requests from a thread (to different banks) back to back
  - Preserves each thread's bank parallelism
  - But this can cause starvation...
- Principle 2: Request Batching
  - Group a fixed number of oldest requests from each thread into a "batch"
  - Service the batch before all other requests
  - Form a new batch when the current one is done
  - Eliminates starvation, provides fairness
  - Allows parallelism-awareness within a batch
[Figure: requests from threads T0-T3 queued at Banks 0 and 1; the oldest requests form the current batch]

PAR-BS Components
- Request batching
- Within-batch scheduling: parallelism-aware

Request Batching
- Each memory request has a bit ("marked") associated with it
- Batch formation (sketched below):
  - Mark up to Marking-Cap oldest requests per bank for each thread
  - Marked requests constitute the batch
  - Form a new batch when no marked requests are left
- Marked requests are prioritized over unmarked ones
  - No reordering of requests across batches: no starvation, high fairness
- How to prioritize requests within a batch?
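A minimal sketch of batch formation, reusing the Request dataclass from the FR-FCFS sketch; the Marking-Cap value here is illustrative (the paper analyzes this knob).

```python
from collections import defaultdict

MARKING_CAP = 5  # illustrative value; too small or too large hurts (see below)

def form_batch(queue):
    """Mark up to MARKING_CAP oldest requests per (thread, bank).
    Marked requests constitute the batch; a new batch is formed
    only when no marked requests remain."""
    marked_so_far = defaultdict(int)  # (thread, bank) -> requests marked
    for req in sorted(queue, key=lambda r: r.arrival_time):
        k = (req.thread_id, req.bank)
        if marked_so_far[k] < MARKING_CAP:
            req.marked = True
            marked_so_far[k] += 1
        else:
            req.marked = False
```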

Within-Batch Scheduling
- Can use any existing DRAM scheduling policy
  - FR-FCFS (row-hit first, then oldest-first) exploits row-buffer locality
- But we also want to preserve intra-thread bank parallelism: service each thread's requests back to back
- How? The scheduler computes a ranking of threads when the batch is formed
  - Higher-ranked threads are prioritized over lower-ranked ones
  - Improves the likelihood that a thread's requests are serviced in parallel by different banks
  - Different threads are prioritized in the same order across ALL banks

How to Rank Threads within a Batch
- The ranking scheme affects both system throughput and fairness
- To maximize system throughput: minimize the average stall-time of threads within the batch
- To minimize unfairness (equalize the slowdown of threads): service threads with inherently low stall-time early in the batch
  - Insight: delaying memory non-intensive threads results in high slowdown
- Shortest stall-time first (shortest job first) ranking
  - Provides optimal system throughput [Smith, 1956]*
  - The controller estimates each thread's stall-time within the batch and ranks threads with shorter stall-time higher

* W. E. Smith, "Various optimizers for single stage production," Naval Research Logistics Quarterly, 1956.

Shortest Stall-Time First Ranking
- max-bank-load: the maximum number of marked requests to any bank; rank the thread with the lower max-bank-load higher (~ low stall-time)
- total-load: the total number of marked requests; breaks ties: rank the thread with the lower total-load higher

Example (marked requests from threads T0-T3 across Banks 0-3):

  Thread | max-bank-load | total-load
  T0     | 1             | 3
  T1     | 2             | 4
  T2     | 2             | 6
  T3     | 5             | 9

Ranking: T0 > T1 > T2 > T3
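The ranking rule above as a short sketch over the batch's marked requests; on the example table it yields T0 > T1 > T2 > T3.

```python
from collections import defaultdict

def rank_threads(marked_requests):
    """Shortest stall-time first: sort threads by max-bank-load,
    breaking ties with total-load (lower = higher rank)."""
    loads = defaultdict(lambda: defaultdict(int))  # thread -> bank -> count
    for r in marked_requests:
        loads[r.thread_id][r.bank] += 1
    def key(thread):
        per_bank = loads[thread]
        return (max(per_bank.values()),   # max-bank-load
                sum(per_bank.values()))   # total-load (tie-breaker)
    return sorted(loads, key=key)  # index 0 = highest-ranked thread
```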

Example Within-Batch Scheduling Order
[Figure: the same batch of marked requests scheduled two ways. The baseline scheduling order (arrival order) interleaves threads within each bank: average stall time is 5 bank access latencies. The PAR-BS scheduling order applies the ranking T0 > T1 > T2 > T3 consistently across all banks: average stall time drops to 3.5 bank access latencies]

Putting It Together: The PAR-BS Scheduling Policy
- PAR-BS scheduling policy (batching + parallelism-aware within-batch scheduling):
  1) Marked requests first
  2) Row-hit requests first
  3) Higher-ranked thread first (shortest stall-time first)
  4) Oldest first
- Three properties:
  - Exploits row-buffer locality and intra-thread bank parallelism
  - Work-conserving: services unmarked requests to banks without marked requests
  - Marking-Cap is important
    - Too small a cap destroys row-buffer locality
    - Too large a cap penalizes memory non-intensive threads
- Many more trade-offs are analyzed in the paper
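The four rules compose into one lexicographic sort key; a sketch, where the rank mapping comes from rank_threads above (0 = highest rank):

```python
def par_bs_pick(queue, open_row, rank):
    """PAR-BS prioritization: (1) marked first, (2) row-hit first,
    (3) higher-ranked thread first, (4) oldest first.
    rank maps thread_id -> position in the batch ranking (0 = highest)."""
    def key(req):
        return (
            not req.marked,                       # (1) marked requests first
            open_row.get(req.bank) != req.row,    # (2) row hits first
            rank.get(req.thread_id, len(rank)),   # (3) higher-ranked thread first
            req.arrival_time,                     # (4) oldest first
        )
    return min(queue, key=key)
```

Because rule (1) dominates, unmarked requests are serviced only at banks with no marked requests pending, which is exactly the work-conserving property above.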

Hardware Cost
- <1.5 KB storage cost for an 8-core system with a 128-entry memory request buffer
- No complex operations (e.g., divisions)
- Not on the critical path: the scheduler makes a decision only every DRAM cycle

Outline
- Background and Goal
- Motivation: Destruction of Intra-thread DRAM Bank Parallelism
- Parallelism-Aware Batch Scheduling
  - Batching
  - Within-batch Scheduling
- System Software Support
- Evaluation
- Summary

System Software Support
- The OS conveys each thread's priority level to the controller: levels 1, 2, 3, ... (highest to lowest priority)
- The controller enforces priorities in two ways (sketched below):
  - Requests from a thread with priority X are marked only every Xth batch
  - Within a batch, higher-priority threads' requests are scheduled first
- Purely opportunistic service: a special very low priority level L whose threads' requests are never marked
- Quantitative analysis in the paper
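A sketch of the marking rule under OS-assigned priorities; the priority encoding is an assumption consistent with the slide (1 = highest, and a special opportunistic level whose requests are never marked).

```python
OPPORTUNISTIC = 0  # stand-in for the special very-low-priority level L

def should_mark(priority, batch_number):
    """Requests from a thread with priority X are marked only every Xth
    batch; purely opportunistic threads are never marked, so they are
    serviced only when no marked request wants their bank."""
    if priority == OPPORTUNISTIC:
        return False
    return batch_number % priority == 0
```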

Outline
- Background and Goal
- Motivation: Destruction of Intra-thread DRAM Bank Parallelism
- Parallelism-Aware Batch Scheduling
  - Batching
  - Within-batch Scheduling
- System Software Support
- Evaluation
- Summary

Evaluation Methodology
- 4-, 8-, and 16-core systems
  - x86 processor model based on the Intel Pentium M
  - 4 GHz processors, 128-entry instruction window
  - 512 KB per-core private L2 caches, 32 L2 miss buffers
- Detailed DRAM model based on Micron DDR2-800
  - 128-entry memory request buffer
  - 8 banks, 2 KB row buffer
  - 40 ns (160 cycles) row-hit round-trip latency
  - 80 ns (320 cycles) row-conflict round-trip latency
- Benchmarks
  - Multiprogrammed SPEC CPU2006 and Windows Desktop applications
  - 100, 16, and 12 program combinations for the 4-, 8-, and 16-core experiments

Comparison with Other DRAM Controllers
- Baseline FR-FCFS [Zuravleff and Robinson, US Patent 1997; Rixner et al., ISCA 2000]
  - Prioritizes row-hit requests, then older requests
  - Unfairly penalizes threads with low row-buffer locality and memory non-intensive threads
- FCFS [Intel Pentium 4 chipsets]
  - Oldest-first; low DRAM throughput
  - Unfairly penalizes memory non-intensive threads
- Network Fair Queueing (NFQ) [Nesbit et al., MICRO 2006]
  - Equally partitions DRAM bandwidth among threads
  - Does not consider each thread's inherent (baseline) DRAM performance
  - Unfairly penalizes threads with high bandwidth utilization [MICRO 2007]
  - Unfairly prioritizes threads with bursty access patterns [MICRO 2007]
- Stall-Time Fair Memory Scheduler (STFM) [Mutlu and Moscibroda, MICRO 2007]
  - Estimates and balances thread slowdowns relative to when each thread runs alone
  - Unfairly treats threads whose slowdown estimates are inaccurate
  - Requires multiple (approximate) arithmetic operations

Unfairness on 4-, 8-, and 16-core Systems
Unfairness = MAX Memory Slowdown / MIN Memory Slowdown [MICRO 2007]
[Figure: unfairness of the five schedulers; PAR-BS improves unfairness over the best previous scheduler by 1.11X, 1.08X, and 1.11X on the 4-, 8-, and 16-core systems, respectively]
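Written out, following the STFM definitions from [MICRO 2007], where $T_i$ is thread $i$'s memory stall time when running shared versus alone:

```latex
\[
\text{MemSlowdown}_i \;=\; \frac{T_i^{\text{shared}}}{T_i^{\text{alone}}},
\qquad
\text{Unfairness} \;=\; \frac{\max_i \text{MemSlowdown}_i}{\min_i \text{MemSlowdown}_i}
\]
```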

System Performance (Hmean speedup)
[Figure: PAR-BS improves harmonic-mean speedup over the best previous scheduler by 8.3%, 6.1%, and 5.1% on the 4-, 8-, and 16-core systems, respectively]

Outline
- Background and Goal
- Motivation: Destruction of Intra-thread DRAM Bank Parallelism
- Parallelism-Aware Batch Scheduling
  - Batching
  - Within-batch Scheduling
- System Software Support
- Evaluation
- Summary

Summary
- Inter-thread interference can destroy each thread's DRAM bank parallelism
  - Serializes a thread's requests, reducing system throughput
  - Makes techniques that exploit memory-level parallelism less effective
  - Existing DRAM controllers are unaware of intra-thread bank parallelism
- A new approach to fair and high-performance DRAM scheduling
  - Batching: eliminates starvation, allows fair sharing of the DRAM system
  - Parallelism-aware thread ranking: preserves each thread's bank parallelism
  - Flexible and configurable: supports system-level thread priorities and QoS policies
- PAR-BS provides better fairness and system performance than previous DRAM schedulers

Thank you. Questions?

Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems
Onur Mutlu and Thomas Moscibroda
Computer Architecture Group, Microsoft Research

Multiple Memory Controllers (I)
- Local ranking: each controller uses PAR-BS independently and computes its own ranking based on its local requests
- Global ranking: a meta controller computes a global ranking across all controllers based on global information
  - Only needs to track bookkeeping information about each thread's requests to the banks in each controller (a sketch follows)
- The difference between the rankings computed by the two schemes depends on how evenly requests are distributed across controllers
  - Balanced distribution: local and global rankings are similar
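One way the global ranking could be computed, assuming each controller exports its per-thread, per-bank marked-request counts to the meta controller (the data layout and names are illustrative):

```python
from collections import defaultdict

def global_ranking(per_controller_loads):
    """per_controller_loads: list (one entry per controller) of
    thread -> bank -> marked-request count. Merge the bookkeeping,
    keeping banks from different controllers distinct, then apply the
    same shortest-stall-time-first rule used within one controller."""
    merged = defaultdict(lambda: defaultdict(int))
    for i, ctrl in enumerate(per_controller_loads):
        for thread, banks in ctrl.items():
            for bank, count in banks.items():
                merged[thread][(i, bank)] += count
    def key(thread):
        per_bank = merged[thread]
        return (max(per_bank.values()), sum(per_bank.values()))
    return sorted(merged, key=key)  # index 0 = highest-ranked thread
```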

Multiple Memory Controllers (II)
[Figure: 16-core system with 4 memory controllers. Local and global ranking improve normalized Hmean speedup by 7.4% and 11.5%, and reduce unfairness by 1.18X and 1.33X, respectively]

Example with Row Hits
Stall times (in bank access latencies) under three scheduling orders:

  Thread   | Order 1 | Order 2 | Order 3
  Thread 1 | 4       | 5.5     | 1
  Thread 2 | 4       | 3       | 2
  Thread 3 | 5       | 4.5     | 4
  Thread 4 | 7       | 4.5     | 5.5
  AVG      | 5       | 4.375   | 3.125