Staged Memory Scheduling
Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu
Carnegie Mellon University, *AMD Research
June 12th, 2012

Executive Summary
Observation: Heterogeneous CPU-GPU systems require memory schedulers with large request buffers.
Problem: Existing monolithic application-aware memory scheduler designs are hard to scale to large request buffer sizes.
Solution: Staged Memory Scheduling (SMS) decomposes the memory controller into three simple stages:
1) Batch formation: maintains row buffer locality
2) Batch scheduler: reduces interference between applications
3) DRAM command scheduler: issues requests to DRAM
Compared to state-of-the-art memory schedulers, SMS is significantly simpler and more scalable, and it provides higher performance and fairness.

Outline: Background; Motivation; Our Goal; Observations; Staged Memory Scheduling: 1) Batch Formation, 2) Batch Scheduler, 3) DRAM Command Scheduler; Results; Conclusion

Main Memory is a Bottleneck
All cores contend for limited off-chip bandwidth; inter-application interference degrades system performance. The memory scheduler can help mitigate the problem. How does the memory scheduler deliver good performance and fairness?
(Figure: Cores 1-4 send requests into a memory request buffer; the memory scheduler picks requests from the buffer and issues them to DRAM.)

Three Principles of Memory Scheduling
1) Prioritize row-buffer-hit requests [Rixner+, ISCA'00], to maximize memory bandwidth.
2) Prioritize latency-sensitive applications [Kim+, HPCA'10], to maximize system throughput.
3) Ensure that no application is starved [Mutlu and Moscibroda, MICRO'07], to minimize unfairness.
(Figure: a request buffer holding Req 1-5 to rows A, B, C, A, B, ordered oldest to newest, with row B currently open, plus a table of per-application memory intensity in MPKI.)
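To make these principles concrete, the following is a minimal sketch (not from the talk) of how a monolithic scheduler might rank buffered requests; the Request fields and the starvation threshold are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Request:
    app_id: int          # issuing application
    row: int             # DRAM row this request targets
    arrival_cycle: int   # cycle the request entered the buffer
    app_mpki: float      # memory intensity (MPKI) of the issuing application

def pick_next(requests, open_row, now, starvation_limit=1000):
    """Apply the three principles in priority order:
    1) avoid starvation, 2) favor row-buffer hits,
    3) favor latency-sensitive (low-MPKI) applications, oldest first."""
    def rank(req):
        starving = (now - req.arrival_cycle) > starvation_limit
        row_hit = (req.row == open_row)
        # min() returns the lowest-sorting tuple, so False (0) wins a slot
        return (not starving, not row_hit, req.app_mpki, req.arrival_cycle)
    return min(requests, key=rank)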

Outline: Background; Motivation: CPU-GPU Systems; Our Goal; Observations; Staged Memory Scheduling: 1) Batch Formation, 2) Batch Scheduler, 3) DRAM Command Scheduler; Results; Conclusion

Memory Scheduling for CPU-GPU Systems
Current and future systems integrate a GPU along with multiple CPU cores. The GPU shares main memory with the CPU cores and is much more memory-intensive (4x-20x) than the CPU. How should memory scheduling be done when the GPU is integrated on-chip?

Introducing the GPU into the System
The GPU occupies a significant portion of the request buffers. This limits the memory controller's visibility of the CPU applications' differing memory behavior and can lead to poor scheduling decisions.
(Figure: the GPU's many requests crowd out the CPU cores' requests in the memory scheduler's buffer.)

Naïve Solution: Large Monolithic Buffer
(Figure: a single large request buffer in front of the memory scheduler, holding requests from Cores 1-4 and the GPU.)

Problems with a Large Monolithic Buffer
A large buffer requires more complicated logic to: analyze memory requests (e.g., determine row buffer hits), analyze application characteristics, and assign and enforce priorities. This leads to high complexity, high power, and large die area.
(Figure: the larger buffer implies a more complex memory scheduler.)

Our Goal
Design a new memory scheduler that is: scalable to accommodate a large number of requests, easy to implement, application-aware, and able to provide high performance and fairness, especially in heterogeneous CPU-GPU systems.

Outline: Background; Motivation: CPU-GPU Systems; Our Goal; Observations; Staged Memory Scheduling: 1) Batch Formation, 2) Batch Scheduler, 3) DRAM Command Scheduler; Results; Conclusion

Key Functions of a Memory Controller
The memory controller must consider three different things concurrently when choosing the next request:
1) Maximize row buffer hits, to maximize memory bandwidth.
2) Manage contention between applications, to maximize system throughput and fairness.
3) Satisfy DRAM timing constraints.
Current systems use a centralized memory controller design to accomplish these functions, which is complex, especially with large request buffers.

Key Idea: Decouple Tasks into Stages
Idea: Decouple the functional tasks of the memory controller and partition them across several simpler hardware structures (stages):
1) Maximize row buffer hits → Stage 1: Batch formation. Within each application, groups requests to the same row into batches.
2) Manage contention between applications → Stage 2: Batch scheduler. Schedules batches from different applications.
3) Satisfy DRAM timing constraints → Stage 3: DRAM command scheduler. Issues requests from the already-scheduled order to each bank.
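As a structural sketch only (an illustration, not the paper's hardware design), the decoupling can be pictured as three small components chained inside one controller; the names and the single-channel simplification are assumptions, and the per-stage bodies are sketched on the following slides.

from collections import deque

class StagedMemoryController:
    """Skeleton of the three-stage decomposition described above."""
    def __init__(self, num_sources, num_banks):
        # Stage 1: one FIFO of in-formation requests per source (CPU core or GPU)
        self.forming = [deque() for _ in range(num_sources)]
        # Output of Stage 1 / input of Stage 2: ready batches per source
        self.ready_batches = [deque() for _ in range(num_sources)]
        # Stage 3: one FIFO of already-scheduled requests per DRAM bank
        self.bank_queues = [deque() for _ in range(num_banks)]

    def tick(self, now):
        self.form_batches(now)      # Stage 1: group same-row requests
        self.schedule_batches()     # Stage 2: pick one application's batch
        return self.issue(now)      # Stage 3: issue per bank, obeying timing

    # Stubs; see the per-stage sketches below.
    def form_batches(self, now): pass
    def schedule_batches(self): pass
    def issue(self, now): return []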

Outline: Background; Motivation: CPU-GPU Systems; Our Goal; Observations; Staged Memory Scheduling: 1) Batch Formation, 2) Batch Scheduler, 3) DRAM Command Scheduler; Results; Conclusion

SMS: Staged Memory Scheduling
(Figure: the monolithic scheduler is replaced by three stages: Stage 1, batch formation per source (Cores 1-4 and the GPU); Stage 2, the batch scheduler; and Stage 3, the DRAM command scheduler with per-bank queues for Banks 1-4, feeding DRAM.)

SMS: Staged Memory Scheduling
(The three-stage figure is shown again as a transition to Stage 1: Batch Formation.)

Stage 1: Batch Formation
Goal: Maximize row buffer hits.
At each core, we want to batch requests that access the same row within a limited time window. A batch is ready to be scheduled under two conditions: 1) when the next request accesses a different row, or 2) when the time window for batch formation expires. This stage is kept simple by using per-core FIFOs.
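A minimal sketch of this stage (the timeout value and the request format are assumptions; requests are expected to carry a row field, as in the Request sketch earlier):

from collections import deque

class BatchFormation:
    """Per-source FIFO that groups same-row requests into batches."""
    def __init__(self, timeout=200):
        self.forming = deque()     # requests of the batch currently being formed
        self.ready = deque()       # completed batches, oldest first
        self.batch_start = None    # cycle at which the current batch began
        self.timeout = timeout     # batch-formation time window (in cycles)

    def enqueue(self, req, now):
        # Condition 1: the new request targets a different row -> close the batch
        if self.forming and req.row != self.forming[-1].row:
            self._close()
        if not self.forming:
            self.batch_start = now
        self.forming.append(req)

    def tick(self, now):
        # Condition 2: the formation time window expired -> close the batch
        if self.forming and now - self.batch_start >= self.timeout:
            self._close()

    def _close(self):
        self.ready.append(list(self.forming))
        self.forming.clear()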

Stage 1: Batch Formation Example
(Figure: per-core FIFOs in which same-row requests are grouped; a batch boundary is drawn when the next request goes to a different row or when the time window expires, and completed batches are passed to Stage 2 for batch scheduling.)

SMS: Staged Memory Scheduling
(The three-stage figure is shown again as a transition to Stage 2: Batch Scheduler.)

Stage 2: Batch Scheduler
Goal: Minimize interference between applications.
Stage 1 forms batches within each application; Stage 2 schedules batches from different applications, always scheduling the oldest batch from each application. Question: which application's batch should be scheduled next? The goal is to maximize system performance and fairness; to achieve it, the batch scheduler chooses between two different policies.

Stage 2: Two Batch Scheduling Algorithms
Shortest Job First (SJF): prioritize the applications with the fewest outstanding memory requests, because they make fast forward progress.
  Pro: good system performance and fairness.
  Con: the GPU and memory-intensive applications get deprioritized.
Round-Robin (RR): prioritize the applications in a round-robin manner to ensure that memory-intensive applications can make progress.
  Pro: the GPU and memory-intensive applications are treated fairly.
  Con: the GPU and memory-intensive applications significantly slow down others.
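As an illustration only (the data structures are assumptions, not the hardware implementation), the two policies could be sketched as follows, where ready_batches maps each application to its queue of ready batches and outstanding gives each application's number of outstanding memory requests:

def sjf_pick(ready_batches, outstanding):
    """Shortest Job First: pick the application with the fewest outstanding
    memory requests among those that have a ready batch."""
    candidates = [app for app, q in ready_batches.items() if q]
    if not candidates:
        return None
    return min(candidates, key=lambda app: outstanding[app])

def rr_pick(ready_batches, rr_pointer, num_apps):
    """Round-robin: starting from rr_pointer, pick the next application
    (GPU included) that has a ready batch, so memory-intensive sources
    still get regular turns. Advancing rr_pointer is left to the caller."""
    for i in range(num_apps):
        app = (rr_pointer + i) % num_apps
        if ready_batches.get(app):
            return app
    return None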

Stage 2: Batch Scheduling Policy
The importance of the GPU varies between systems and over time, so the scheduling policy needs to adapt to this.
Solution: a hybrid policy. At every cycle, with probability p use Shortest Job First (benefits the CPU), and with probability 1-p use Round-Robin (benefits the GPU). System software can configure p based on the importance/weight of the GPU: higher GPU importance means a lower p value.
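The per-cycle choice itself is simple; a sketch, passing the two policy functions above as parameters (this interface is an assumption for illustration):

import random

def hybrid_pick(ready_batches, outstanding, rr_pointer, num_apps, p,
                sjf_policy, rr_policy):
    """Per-cycle hybrid choice: SJF with probability p, round-robin otherwise.
    System software sets p: values close to 1 favor the CPU, values close to 0
    favor the GPU."""
    if random.random() < p:
        return sjf_policy(ready_batches, outstanding)
    return rr_policy(ready_batches, rr_pointer, num_apps)

For instance, the CPU-focused and GPU-focused scenarios evaluated later correspond to p = 0.9 and p = 0, respectively.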

SMS: Staged Memory Scheduling
(The three-stage figure is shown again as a transition to Stage 3: DRAM Command Scheduler.)

Stage 3: DRAM Command Scheduler
The high-level policy decisions have already been made: Stage 1 maintains row buffer locality and Stage 2 minimizes inter-application interference. Stage 3 therefore needs no further scheduling; its only goal is to service requests while satisfying DRAM timing constraints. It is implemented as simple per-bank FIFO queues.
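A sketch of the idea; the single-parameter timing model below is a deliberate oversimplification of real DRAM timing constraints:

from collections import deque

class DRAMCommandScheduler:
    """Stage-3 sketch: per-bank FIFOs issued strictly in order, with each
    bank modeled as busy for t_access cycles after an issue."""
    def __init__(self, num_banks, t_access=30):
        self.queues = [deque() for _ in range(num_banks)]
        self.bank_ready_at = [0] * num_banks
        self.t_access = t_access

    def enqueue(self, bank, request):
        # Requests arrive already ordered by Stages 1 and 2
        self.queues[bank].append(request)

    def tick(self, now):
        issued = []
        for bank, queue in enumerate(self.queues):
            # Only the head of each per-bank FIFO is considered (no reordering)
            if queue and now >= self.bank_ready_at[bank]:
                issued.append(queue.popleft())
                self.bank_ready_at[bank] = now + self.t_access
        return issued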

Putting Everything Together
(Figure: requests from Cores 1-4 and the GPU flow through Stage 1 batch formation, the Stage 2 batch scheduler, which alternates between the SJF and RR batch scheduling policies, and the Stage 3 DRAM command scheduler with per-bank queues for Banks 1-4.)

Complexity
Compared to a row-hit-first scheduler, SMS consumes 66% less area and 46% less static power (based on a Verilog model using a 180nm library). The reduction comes from replacing the monolithic scheduler with stages of simpler schedulers: each stage has a simpler scheduler (it considers fewer properties at a time to make the scheduling decision), simpler buffers (FIFO instead of out-of-order), and only a portion of the total buffer size (buffering is distributed across the stages).

Outline: Background; Motivation: CPU-GPU Systems; Our Goal; Observations; Staged Memory Scheduling: 1) Batch Formation, 2) Batch Scheduler, 3) DRAM Command Scheduler; Results; Conclusion

Methodology
Simulation parameters: 16 out-of-order CPU cores and 1 GPU modeling an AMD Radeon™ 5870; DDR DRAM with 4 channels, 1 rank/channel, and 8 banks/channel.
Workloads: CPU workloads from SPEC CPU2006; GPU workloads from recent games and GPU benchmarks; 7 workload categories based on the memory intensity of the CPU applications: low memory intensity (L), medium memory intensity (M), and high memory intensity (H).

Comparison to Previous Scheduling Algorithms
FR-FCFS [Rixner+, ISCA'00]: prioritizes row buffer hits and maximizes DRAM throughput, but has low multi-core performance because it is application-unaware.
ATLAS [Kim+, HPCA'10]: prioritizes latency-sensitive applications and achieves good multi-core performance, but has low fairness because it deprioritizes memory-intensive applications.
TCM [Kim+, MICRO'10]: clusters low- and high-intensity applications and treats each cluster separately, achieving good multi-core performance and fairness, but is not robust because it can misclassify latency-sensitive applications.

Evaluation Metrics
CPU performance metric: weighted speedup. GPU performance metric: frame rate speedup. CPU-GPU system performance: CPU-GPU weighted speedup.
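For reference, the standard definitions of the first two metrics are given below; writing the combined CPU-GPU metric as a weighted sum (with the GPU weight used in the evaluated scenarios) is an assumption about its exact form.

\mathrm{WS}_{\mathrm{CPU}} = \sum_{i=1}^{N} \frac{\mathrm{IPC}_i^{\mathrm{shared}}}{\mathrm{IPC}_i^{\mathrm{alone}}}, \qquad
\mathrm{Speedup}_{\mathrm{GPU}} = \frac{\mathrm{FrameRate}^{\mathrm{shared}}}{\mathrm{FrameRate}^{\mathrm{alone}}}, \qquad
\mathrm{WS}_{\mathrm{CPU+GPU}} = \mathrm{WS}_{\mathrm{CPU}} + \mathrm{GPUweight} \times \mathrm{Speedup}_{\mathrm{GPU}}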

Evaluated System Scenarios: a CPU-focused system and a GPU-focused system.

Evaluated System Scenario: CPU-Focused
The GPU has low weight (weight = 1). SMS is configured with the SJF probability p set to 0.9, so it mostly uses SJF batch scheduling and prioritizes latency-sensitive applications (mainly the CPU).

Performance: CPU-Focused System
The SJF batch scheduling policy allows latency-sensitive applications to get serviced as fast as possible, and SMS is much less complex than previous schedulers.
(Chart: system performance across the workload categories with p = 0.9, reported as % improvement over ATLAS.)

Evaluated System Scenario: GPU-Focused
The GPU has high weight (weight = 1000). SMS is configured with the SJF probability p set to 0, so it always uses round-robin batch scheduling and prioritizes memory-intensive applications (the GPU).

Performance: GPU-Focused System
The round-robin batch scheduling policy schedules GPU requests more frequently, and SMS is much less complex than previous schedulers.
(Chart: system performance across the workload categories with p = 0, reported as % improvement over FR-FCFS.)

Performance at Different GPU Weights
(Chart: performance of the best previous scheduler at each GPU weight, chosen among FR-FCFS, ATLAS, and TCM.)

Performance at Different GPU Weights
At every GPU weight, SMS outperforms the best previous scheduling algorithm for that weight.
(Chart: SMS versus the best previous scheduler across GPU weights.)

Additional Results in the Paper
Fairness evaluation: 47.6% improvement over the best previous algorithms. Individual CPU and GPU performance breakdowns. CPU-only scenarios: competitive performance with previous algorithms. Scalability results: SMS's performance and fairness scale better than previous algorithms as the number of cores and memory channels increases. Analysis of SMS design parameters.

Outline: Background; Motivation: CPU-GPU Systems; Our Goal; Observations; Staged Memory Scheduling: 1) Batch Formation, 2) Batch Scheduler, 3) DRAM Command Scheduler; Results; Conclusion

Conclusion
Observation: Heterogeneous CPU-GPU systems require memory schedulers with large request buffers.
Problem: Existing monolithic application-aware memory scheduler designs are hard to scale to large request buffer sizes.
Solution: Staged Memory Scheduling (SMS) decomposes the memory controller into three simple stages:
1) Batch formation: maintains row buffer locality
2) Batch scheduler: reduces interference between applications
3) DRAM command scheduler: issues requests to DRAM
Compared to state-of-the-art memory schedulers, SMS is significantly simpler and more scalable, and it provides higher performance and fairness.

Staged Memory Scheduling
Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu
Carnegie Mellon University, *AMD Research
June 12th, 2012

Backup Slides

Row Buffer Locality on Batch Formation
Out-of-order (OoO) batch formation improves system performance by ~3% when the batch scheduler uses the SJF policy most of the time, and by ~7% when it uses RR most of the time. However, OoO batch formation is more complex: it requires OoO buffering instead of FIFO queues, and the batch formation time window must be fine-tuned based on application characteristics (only a 3%-5% performance gain without fine-tuning).

Row Buffer Locality on Batch Formation (chart)

Key Differences Between CPU and GPU
(Chart: L2 MPKI of graphics applications versus CPU applications, showing roughly a 4x difference.)

MLP and RBL (memory-level parallelism and row buffer locality): key differences between a CPU application and a GPU application.

CPU-GPU Performance Tradeoff

Dealing with Multi-Threaded Applications
Batch formation: groups each thread's requests in a per-thread FIFO. Batch scheduler: detects critical threads and prioritizes them over non-critical threads; previous works have shown how to detect and schedule critical threads, e.g., 1) Bottleneck Identification and Scheduling in multithreaded applications [Joao+, ASPLOS'12] and 2) Parallel Application Memory Scheduling [Ebrahimi+, MICRO'11]. DRAM command scheduler: stays the same.

Dealing with Prefetch Requests
Previous works have proposed several solutions: Prefetch-Aware Shared-Resource Management for Multi-Core Systems [Ebrahimi+, ISCA'11], Coordinated Control of Multiple Prefetchers in Multi-Core Systems [Ebrahimi+, MICRO'09], and Prefetch-Aware DRAM Controllers [Lee+, MICRO'08].
Handling prefetch requests in SMS: SMS can handle prefetch requests before they enter the memory controller (e.g., source throttling based on prefetch accuracy), or by prioritizing/deprioritizing prefetch batches at the batch scheduler (based on prefetch accuracy).

Fairness Evaluation
(Chart: unfairness, lower is better.)

Performance at Different Buffer Sizes

CPU and GPU Performance Breakdowns
(Charts: CPU weighted speedup and GPU frame rate.)

CPU-Only Results

Scalability to Number of Cores

Scalability to Number of Memory Controllers

Detailed Simulation Methodology
Number of cores: 16
Number of GPUs: 1
CPU reorder buffer (ROB) size: 128 entries
L1 (private) cache size: 32KB, 4 ways
L2 (shared) cache size: 8MB, 16 ways
GPU max throughput: 1600 ops/cycle
GPU Texture/Z/Color units: 80/128/32
DRAM bus: 64 bits/channel
DRAM row buffer size: 2KB
Memory controller request buffer size: 300 entries

Analysis of Different SMS Parameters

Global Bypass
What if the system is lightly loaded? Batching would increase the latency of requests. Global bypass: disable batch formation when the total number of requests is lower than a threshold.
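A sketch of where such a check could plug in, reusing the earlier sketches' interfaces; the threshold value and the request fields (bank, app_id) are assumptions for illustration:

def route_request(req, now, total_outstanding, batch_formers, cmd_scheduler,
                  bypass_threshold=16):
    """Global bypass sketch: under light load, send the request straight to
    the per-bank FIFOs so batching does not add latency; otherwise take the
    normal staged path through batch formation."""
    if total_outstanding < bypass_threshold:
        cmd_scheduler.enqueue(req.bank, req)          # bypass Stage 1
    else:
        batch_formers[req.app_id].enqueue(req, now)   # normal staged path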

Putting Everything Together
(The combined figure from the main talk, repeated: Stage 1 batch formation for Cores 1-4 and the GPU, the Stage 2 batch scheduler alternating between the SJF and RR policies, and the Stage 3 DRAM command scheduler with per-bank queues for Banks 1-4.)