Parallel Application Memory Scheduling
Eiman Ebrahimi*, Rustam Miftakhutdinov*, Chris Fallin‡, Chang Joo Lee*+, Jose Joao*, Onur Mutlu‡, Yale N. Patt*
*HPS Research Group, The University of Texas at Austin
‡Computer Architecture Laboratory, Carnegie Mellon University
+Intel Corporation Austin

Background
[Figure: a multi-core chip with cores 0..N sharing a cache and a memory controller that serves DRAM banks 0..K. The shared cache, memory controller, and DRAM banks are the shared memory resources; the DRAM sits off-chip, beyond the chip boundary.]

Background
- Memory requests from different cores interfere in shared memory resources.
- Multi-programmed workloads → system performance and fairness.
- What about a single multi-threaded application?

Memory System Interference in a Single Multi-Threaded Application
- Inter-dependent threads from the same application slow each other down.
- Most importantly, the critical path of execution can be significantly slowed down.
- The problem and goal are very different from interference between independent applications:
  - Interdependence between threads
  - Goal: reduce the execution time of a single application
  - No notion of fairness among the threads of the same application

Potential in a Single Multi-Threaded Application
If all main-memory-related interference were ideally eliminated, execution time would be reduced by 45% on average.
[Figure: per-benchmark results, normalized to a system using FR-FCFS memory scheduling.]

Outline
- Problem Statement
- Parallel Application Memory Scheduling
- Evaluation
- Conclusion

Parallel Application Memory Scheduler
- Identify the set of threads likely to be on the critical path as limiter threads:
  - Prioritize requests from limiter threads.
- Among limiter threads:
  - Prioritize requests from latency-sensitive threads (those with lower MPKI).
- Among non-limiter threads:
  - Shuffle priorities of non-limiter threads to reduce inter-thread memory interference.
  - Prioritize requests from threads falling behind others in a parallel for-loop.
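
These rules amount to a fixed priority order over memory requests. Below is a minimal sketch of such a comparator; the struct fields and function names are hypothetical (the slides do not specify the hardware at this level), and the parallel-for-loop rule is omitted for brevity.

```cpp
#include <cstdint>

// Hypothetical per-request metadata a PAMS-style controller would need.
struct Request {
    bool     from_limiter;       // issuing thread flagged as likely critical
    bool     latency_sensitive;  // limiter thread with low MPKI
    uint8_t  shuffle_rank;       // per-interval rank of a non-limiter thread
    uint64_t arrival_time;       // for oldest-first tie-breaking
};

// Returns true if request a should be serviced before request b, following
// the ordering on this slide: limiter threads first; among limiters,
// latency-sensitive (low-MPKI) threads first; among non-limiters, the
// current shuffled rank decides.
bool HigherPriority(const Request& a, const Request& b) {
    if (a.from_limiter != b.from_limiter)
        return a.from_limiter;
    if (a.from_limiter && a.latency_sensitive != b.latency_sensitive)
        return a.latency_sensitive;
    if (!a.from_limiter && a.shuffle_rank != b.shuffle_rank)
        return a.shuffle_rank < b.shuffle_rank;  // lower rank = higher priority
    return a.arrival_time < b.arrival_time;      // oldest request first
}
```

In a real scheduler this thread-ranking layer would be combined with DRAM-aware rules such as row-hit-first; the sketch captures only the ranking among threads.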

Runtime System Limiter Identification
- Contended critical sections are often on the critical path of execution.
- Extend the runtime system to identify the thread executing the most contended critical section as the limiter thread:
  - Track the total amount of time all threads wait on each lock in a given interval.
  - Identify the lock with the largest waiting time as the most contended.
  - The thread holding the most contended lock is a limiter, and this information is exposed to the memory controller.
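
A minimal sketch of the bookkeeping this implies, assuming hypothetical runtime hooks (lock_wait_begin/lock_wait_end) invoked around contended lock acquires; the actual runtime interface is not described on the slide.

```cpp
#include <chrono>
#include <mutex>
#include <unordered_map>

using Clock = std::chrono::steady_clock;

// Total time all threads spent waiting on each lock during the current
// interval, keyed by lock address (hypothetical runtime-internal state).
static std::unordered_map<const void*, Clock::duration> g_wait_time;
static std::mutex g_table_lock;

// Hooks the runtime would invoke around a contended lock acquire.
void lock_wait_begin(Clock::time_point& start) { start = Clock::now(); }

void lock_wait_end(const void* lock, Clock::time_point start) {
    std::lock_guard<std::mutex> guard(g_table_lock);
    g_wait_time[lock] += Clock::now() - start;
}

// At each interval boundary: the lock with the largest accumulated wait
// time is the most contended one; the thread holding it is the limiter.
const void* most_contended_lock() {
    std::lock_guard<std::mutex> guard(g_table_lock);
    const void* most_contended = nullptr;
    Clock::duration longest_wait{};
    for (const auto& [lock, waited] : g_wait_time) {
        if (waited > longest_wait) {
            most_contended = lock;
            longest_wait = waited;
        }
    }
    g_wait_time.clear();  // start a fresh measurement interval
    return most_contended;
}
```

At each interval boundary the runtime would call most_contended_lock(), look up the thread currently holding that lock, and expose its id to the memory controller (the communication mechanism is not specified here).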

Prioritizing Requests from Limiter Threads
[Figure: execution timelines of threads A-D between two barriers, showing critical sections 1 and 2, non-critical sections, and time spent waiting for a lock or synchronization. Critical section 1 is the most contended, so limiter thread identification follows whichever thread holds it over time: D, then B, then C, then A. Prioritizing the limiter thread's requests shortens the critical path, saving cycles over the baseline.]

Time-Based Classification of Threads as Latency- vs. Bandwidth-Sensitive
[Figure: timelines of threads A-D with critical sections, barriers, and synchronization waits, split into fixed time intervals 1 and 2. Thread Cluster Memory Scheduling (TCM) [Kim et al., MICRO'10] classifies threads once per time interval.]

Terminology
- A code segment is defined as:
  - a program region between two consecutive synchronization operations,
  - identified with a 2-tuple.
- Code segments matter for classifying threads as latency- vs. bandwidth-sensitive:
  - time-based vs. code-segment-based classification.

Code-Segment-Based Classification of Threads as Latency- vs. BW-Sensitive
[Figure: the same timelines of threads A-D, but classification boundaries fall at code-segment changes (code segments 1 and 2) rather than at fixed time intervals 1 and 2, so a thread's classification is updated exactly when its behavior changes at a synchronization operation.]
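
A sketch of what code-segment tracking and MPKI-based classification could look like; the (begin_ip, sync_id) encoding of the 2-tuple and the MPKI cutoff are assumptions for illustration, not values from the talk.

```cpp
#include <cstdint>

// A code segment: the program region between two consecutive
// synchronization operations. The slide identifies it with a 2-tuple;
// here a (begin_ip, sync_id) pair stands in for that tuple.
struct CodeSegment {
    uint64_t begin_ip;  // instruction address where the segment starts
    uint64_t sync_id;   // the synchronization operation that opened it
};

enum class ThreadClass { LatencySensitive, BandwidthSensitive };

// Classify a thread for its current code segment by its misses per
// kilo-instruction (MPKI): low MPKI means latency-sensitive, high MPKI
// means bandwidth-sensitive. The cutoff value is purely illustrative.
ThreadClass classify(uint64_t cache_misses, uint64_t instructions_retired) {
    const double kMpkiCutoff = 1.0;  // hypothetical threshold
    double mpki =
        1000.0 * static_cast<double>(cache_misses) / instructions_retired;
    return mpki < kMpkiCutoff ? ThreadClass::LatencySensitive
                              : ThreadClass::BandwidthSensitive;
}
```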

Shuffling Priorities of Non-Limiter Threads
- Goal:
  - Reduce inter-thread interference among a set of threads with the same importance in terms of our estimation of the critical path.
  - Prevent any of these threads from becoming new bottlenecks.
- Basic idea:
  - Give each thread a chance to be high priority in the memory system, exploiting intra-thread bank parallelism and row-buffer locality.
  - Every interval, assign a set of random priorities to the threads and shuffle the priorities at the end of the interval.
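
A minimal sketch of interval-based priority shuffling with a hypothetical interface; the shuffle interval length and the concrete policies (see the next slide) are not captured here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Re-randomizes the priority ranking of the non-limiter threads once per
// shuffle interval, so every thread periodically gets to be high priority
// and no thread stays deprioritized long enough to become a bottleneck.
class PriorityShuffler {
public:
    explicit PriorityShuffler(std::size_t num_threads)
        : ranks_(num_threads), rng_(std::random_device{}()) {
        std::iota(ranks_.begin(), ranks_.end(), 0);  // ranks 0..N-1
        shuffle();
    }

    // Called once at the end of every shuffle interval.
    void shuffle() { std::shuffle(ranks_.begin(), ranks_.end(), rng_); }

    // Current rank of a non-limiter thread (0 = highest priority).
    uint8_t rank(std::size_t thread_id) const { return ranks_[thread_id]; }

private:
    std::vector<uint8_t> ranks_;
    std::mt19937 rng_;
};
```

Each non-limiter request would then carry its thread's current rank (the shuffle_rank field in the earlier comparator sketch).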

Shuffling Priorities of Non-Limiter Threads
[Figure: timelines of threads A-D between barriers under a baseline with no shuffling and under two shuffling policies (Policy 1 and Policy 2), shown both for threads with similar memory behavior and for threads with different memory behavior. Relative to the baseline, shuffling saves cycles in some combinations and loses cycles in others, motivating a policy choice based on the threads' memory behavior. Legend distinguishes active execution from waiting.]

Evaluation Methodology
- x86 cycle-accurate simulator
- Baseline processor configuration:
  - Per core: 4-wide issue, out of order, 64-entry ROB
  - Shared (16-core system): MSHRs; 4 MB, 16-way L2 cache
  - Main memory: DDR; 15 ns latency per command (tRP, tRCD, CL); 8-byte-wide core-to-memory bus

PAMS Evaluation
[Figure: normalized execution time of PAMS against prior schedulers. PAMS outperforms the best previous memory scheduler by 13% and a scheduler based on thread criticality predictors (TCP) [Bhattacharjee+, ISCA'09] by 7%.]

Sensitivity to System Parameters

L2 cache size        4 MB      8 MB      16 MB
Δ vs. FR-FCFS        -16.7%    -15.9%    -10.5%

Memory channels      1 Channel   2 Channels   4 Channels
Δ vs. FR-FCFS        -16.7%      -11.6%       -10.4%

(Δ is PAMS's change in execution time relative to FR-FCFS; the benefit is largest under the most memory pressure, i.e., the smaller L2 cache and fewer channels.)

Conclusion
- Inter-thread main-memory interference within a multi-threaded application increases execution time.
- Parallel Application Memory Scheduling (PAMS) improves a single multi-threaded application's performance by:
  - identifying a set of threads likely to be on the critical path and prioritizing requests from them;
  - periodically shuffling the priorities of the remaining, likely non-critical threads to reduce inter-thread interference among them.
- PAMS significantly outperforms:
  - the best previous memory scheduler, designed for multi-programmed workloads;
  - a memory scheduler that uses a state-of-the-art thread criticality predictor (TCP).
