Parallel Application Memory Scheduling

Eiman Ebrahimi*, Rustam Miftakhutdinov*, Chris Fallin‡, Chang Joo Lee*+, Jose Joao*, Onur Mutlu‡, Yale N. Patt*

* HPS Research Group, The University of Texas at Austin
‡ Computer Architecture Laboratory, Carnegie Mellon University
+ Intel Corporation Austin
Background
[Figure: chip multiprocessor with cores 0..N sharing an on-chip cache and memory controller, connected off-chip to DRAM banks 0..K; together these form the shared memory resources that straddle the chip boundary.]
Background
- Memory requests from different cores interfere in shared memory resources (shared cache, memory controller, DRAM banks)
- For multi-programmed workloads, the concern is system performance and fairness
- What about a single multi-threaded application?
Memory System Interference in a Single Multi-Threaded Application
- Inter-dependent threads from the same application slow each other down
- Most importantly, the critical path of execution can be significantly slowed down
- The problem and goal are very different from interference between independent applications:
  - Threads are interdependent
  - Goal: reduce the execution time of a single application
  - No notion of fairness among the threads of the same application
Potential in a Single Multi-Threaded Application
- If all main-memory-related interference is ideally eliminated, execution time is reduced by 45% on average
- [Figure: execution times normalized to a system using FR-FCFS memory scheduling]
Outline
- Problem Statement
- Parallel Application Memory Scheduling
- Evaluation
- Conclusion
Parallel Application Memory Scheduler
- Identify the set of threads likely to be on the critical path as limiter threads
  - Prioritize requests from limiter threads
  - Among limiter threads: prioritize requests from latency-sensitive threads (those with lower MPKI)
- Among non-limiter threads:
  - Shuffle priorities of non-limiter threads to reduce inter-thread memory interference
  - Prioritize requests from threads falling behind others in a parallel for-loop
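To make the priority order concrete, the following is a minimal C++ sketch of a request comparator that follows the rules above. It is not the authors' hardware design: the per-request fields (is_limiter, mpki, shuffle_rank) and the FR-FCFS-style row-hit/age tie-breaking at the end are assumptions made for illustration.

    #include <cstdint>

    struct RequestInfo {
        bool     is_limiter;    // issuing thread identified as a limiter this interval
        double   mpki;          // misses per kilo-instruction of the issuing thread
        int      shuffle_rank;  // shuffled priority among non-limiter threads (lower = higher)
        bool     row_hit;       // request hits the currently open DRAM row
        uint64_t arrival_time;  // older requests win final ties
    };

    // Returns true if request `a` should be serviced before request `b`.
    bool pams_higher_priority(const RequestInfo& a, const RequestInfo& b) {
        // 1. Limiter threads (likely on the critical path) over non-limiters.
        if (a.is_limiter != b.is_limiter) return a.is_limiter;

        if (a.is_limiter) {
            // 2. Among limiters: latency-sensitive threads (lower MPKI) first.
            if (a.mpki != b.mpki) return a.mpki < b.mpki;
        } else {
            // 3. Among non-limiters: follow the shuffled priority ranking.
            if (a.shuffle_rank != b.shuffle_rank) return a.shuffle_rank < b.shuffle_rank;
        }

        // 4. Otherwise fall back to row-buffer hits, then age (FR-FCFS-like assumption).
        if (a.row_hit != b.row_hit) return a.row_hit;
        return a.arrival_time < b.arrival_time;
    }

A cycle-level memory controller model could apply such a comparator when picking the next request to service from its queue.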
Runtime System Limiter Identification
- Contended critical sections are often on the critical path of execution
- Extend the runtime system to identify the thread executing the most contended critical section as the limiter thread:
  - Track the total amount of time all threads wait on each lock in a given interval
  - Identify the lock with the largest waiting time as the most contended
  - The thread holding the most contended lock is a limiter, and this information is exposed to the memory controller
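As a software-level illustration of this bookkeeping, the sketch below accumulates per-lock waiting time over an interval and reports the holder of the most contended lock as the limiter. The function names, the map-based lock table, and the per-interval reset policy are hypothetical; the actual mechanism lives in the runtime system described above.

    #include <cstdint>
    #include <unordered_map>

    // Hypothetical per-lock bookkeeping kept by the runtime during one interval.
    struct LockStats {
        uint64_t total_wait_cycles = 0;  // summed over all threads waiting on this lock
        int      current_holder    = -1; // id of the thread currently holding the lock
    };

    std::unordered_map<uintptr_t, LockStats> lock_table;  // keyed by lock address

    // Called when a thread finishes waiting for a lock and acquires it.
    void on_lock_acquired(uintptr_t lock_addr, int thread_id, uint64_t wait_cycles) {
        LockStats& s = lock_table[lock_addr];
        s.total_wait_cycles += wait_cycles;
        s.current_holder = thread_id;
    }

    // At the end of each interval: the thread holding the lock with the largest
    // accumulated waiting time is reported to the memory controller as a limiter.
    int identify_limiter_thread() {
        uint64_t max_wait = 0;
        int limiter = -1;
        for (const auto& entry : lock_table) {
            const LockStats& s = entry.second;
            if (s.total_wait_cycles > max_wait && s.current_holder >= 0) {
                max_wait = s.total_wait_cycles;
                limiter  = s.current_holder;
            }
        }
        lock_table.clear();  // start fresh bookkeeping for the next interval (assumed policy)
        return limiter;      // -1 if no contended critical section was observed
    }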
Prioritizing Requests from Limiter Threads
[Figure: execution timelines of threads A-D showing critical sections 1 and 2, non-critical sections, waits for synchronization or locks, and barriers. Critical section 1 is identified as the most contended, so the limiter designation moves over time (D, then B, then C, then A) as each thread executes it. Prioritizing the limiter's requests shortens the critical path and saves cycles relative to the baseline timeline.]
Parallel Application Memory Scheduler
- Identify the set of threads likely to be on the critical path as limiter threads
  - Prioritize requests from limiter threads
  - Among limiter threads: prioritize requests from latency-sensitive threads (those with lower MPKI)
- Among non-limiter threads:
  - Shuffle priorities of non-limiter threads to reduce inter-thread memory interference
  - Prioritize requests from threads falling behind others in a parallel for-loop
Time-Based Classification of Threads as Latency- vs. BW-Sensitive
- As done in Thread Cluster Memory Scheduling (TCM) [Kim et al., MICRO'10]
- [Figure: execution timelines of threads A-D with critical sections, barriers, and waits for synchronization, divided into fixed time intervals (Time Interval 1, Time Interval 2).]
Terminology
- A code-segment is defined as a program region between two consecutive synchronization operations
- Identified with a 2-tuple
- Important for classifying threads as latency- vs. bandwidth-sensitive
- Time-based vs. code-segment based classification
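The sketch below illustrates one way code segments could be tracked and used for classification. The slide does not spell out the fields of the 2-tuple, so the (begin_ip, sync_addr) pair, the MPKI threshold, and the map-based history are assumptions for illustration only.

    #include <cstdint>
    #include <map>

    // Hypothetical 2-tuple identifying a code segment; the exact fields are an
    // assumption (the slide only says the segment is "identified with a 2-tuple").
    struct CodeSegmentId {
        uintptr_t begin_ip;   // address of the first instruction after a sync operation
        uintptr_t sync_addr;  // address of the synchronization object involved
        bool operator<(const CodeSegmentId& o) const {
            return begin_ip != o.begin_ip ? begin_ip < o.begin_ip : sync_addr < o.sync_addr;
        }
    };

    // Memory intensity observed the last time each code segment executed.
    struct SegmentStats {
        uint64_t misses = 0;
        uint64_t retired_insts = 0;
        double mpki() const {
            return retired_insts ? 1000.0 * misses / retired_insts : 0.0;
        }
    };

    std::map<CodeSegmentId, SegmentStats> segment_history;

    // Classify a thread as latency-sensitive (vs. bandwidth-sensitive) based on the
    // history of the code segment it is entering, rather than on the previous time
    // interval. The threshold value is an illustrative assumption.
    bool is_latency_sensitive(const CodeSegmentId& seg, double mpki_threshold = 1.0) {
        auto it = segment_history.find(seg);
        return it == segment_history.end() || it->second.mpki() < mpki_threshold;
    }

The point of keying the history on code segments is that a thread's classification changes exactly when its program behavior changes, rather than at arbitrary time-interval boundaries.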
Code-Segment Based Classification of Threads as Latency- vs. BW-Sensitive
[Figure: the same thread timelines, now divided at code-segment changes (Code Segment 1, Code Segment 2) instead of at fixed time intervals 1 and 2; the segment boundaries align with critical sections, barriers, and waits for synchronization.]
Parallel Application Memory Scheduler
- Identify the set of threads likely to be on the critical path as limiter threads
  - Prioritize requests from limiter threads
  - Among limiter threads: prioritize requests from latency-sensitive threads (those with lower MPKI)
- Among non-limiter threads:
  - Shuffle priorities of non-limiter threads to reduce inter-thread memory interference
  - Prioritize requests from threads falling behind others in a parallel for-loop
Shuffling Priorities of Non-Limiter Threads
- Goal:
  - Reduce inter-thread interference among a set of threads that have the same importance in our estimation of the critical path
  - Prevent any of these threads from becoming new bottlenecks
- Basic idea:
  - Give each thread a chance to be high priority in the memory system, so it can exploit intra-thread bank parallelism and row-buffer locality
  - Every interval, assign a set of random priorities to the threads and shuffle the priorities at the end of the interval
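The shuffling step itself is simple; the sketch below shows a hypothetical per-interval re-ranking of non-limiter threads. The interval length and the use of std::shuffle are assumptions, not the authors' exact policy.

    #include <algorithm>
    #include <numeric>
    #include <random>
    #include <vector>

    // ranks[i] holds the priority rank of non-limiter thread i (0 = highest).
    // This would be invoked at the end of every shuffling interval.
    void shuffle_nonlimiter_priorities(std::vector<int>& ranks, std::mt19937& rng) {
        std::iota(ranks.begin(), ranks.end(), 0);       // distinct ranks 0..N-1
        std::shuffle(ranks.begin(), ranks.end(), rng);  // new random assignment
        // Each thread periodically becomes the highest-priority non-limiter, letting it
        // exploit its own bank-level parallelism and row-buffer locality without
        // constant interference from its peers.
    }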
Shuffling Priorities of Non-Limiter Threads
[Figure: execution timelines of threads A-D between barriers, comparing the baseline (no shuffling) with two shuffling policies (Policy 1 and Policy 2), once for threads with similar memory behavior and once for threads with different memory behavior. The bars distinguish active cycles from waiting cycles and annotate cycles saved or lost relative to the baseline.]
Outline
- Problem Statement
- Parallel Application Memory Scheduling
- Evaluation
- Conclusion
Evaluation Methodology
- x86 cycle-accurate simulator
- Baseline processor configuration:
  - Per core: 4-wide issue, out-of-order, 64-entry ROB
  - Shared (16-core system): MSHRs; 4 MB, 16-way L2 cache
  - Main memory: DDR DRAM, 15 ns latency per command (tRP, tRCD, CL), 8B-wide core-to-memory bus
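For reference, the baseline parameters above can be restated as a plain configuration struct; the field names are illustrative, and parameters not recoverable from the slide (e.g., MSHR count, DRAM frequency) are omitted.

    // Baseline system parameters as listed on the slide; names are illustrative only.
    struct BaselineConfig {
        // Per core
        int issue_width = 4;    // 4-wide issue, out-of-order
        int rob_entries = 64;   // reorder buffer size

        // Shared across the 16-core system
        int num_cores  = 16;
        int l2_size_mb = 4;     // 4 MB, 16-way L2 cache
        int l2_ways    = 16;

        // Main memory
        double dram_cmd_latency_ns = 15.0;  // per command: tRP, tRCD, CL
        int core_to_mem_bus_bytes  = 8;     // 8B-wide core-to-memory bus
    };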
PAMS Evaluation
[Figure: normalized execution-time results; annotated improvements of 13% and 7%, the 7% measured against a scheduler based on thread criticality predictors (TCP) [Bhattacharjee+, ISCA'09].]
Sensitivity to System Parameters
Δ execution time relative to FR-FCFS:

  L2 Cache Size:              4 MB: -10.5%    |  8 MB: -15.9%        |  16 MB: -16.7%
  Number of Memory Channels:  1 Channel: -10.4%  |  2 Channels: -11.6%  |  4 Channels: -16.7%
Conclusion
- Inter-thread main memory interference within a multi-threaded application increases execution time
- Parallel Application Memory Scheduling (PAMS) improves a single multi-threaded application's performance by:
  - Identifying the set of threads likely to be on the critical path and prioritizing their requests
  - Periodically shuffling the priorities of the remaining, likely non-critical threads to reduce inter-thread interference among them
- PAMS significantly outperforms:
  - The best previous memory scheduler designed for multi-programmed workloads
  - A memory scheduler that uses a state-of-the-art thread criticality predictor (TCP)