Published by Στυλιανός Παπάγος. Modified over 6 years ago.
1
Managing GPU Concurrency in Heterogeneous Architectures
MICRO 2014
2
Baseline heterogeneous architecture
Throughput-optimized GPU cores and latency-optimized CPU cores on the same chip. Cores are connected to the last-level cache (LLC) and memory controllers (MCs) via an interconnect.
3
Motivation – application interference
Performance loss caused by contention in shared hardware resources. GPU: the dominant consumer of shared resources due to high thread-level parallelism (TLP). CPU applications are affected much more than GPU applications. Figure labels: 20%, 85%.
4
Motivation – latency tolerance
High GPU thread-level parallelism (TLP): memory system congestion and low CPU performance. The GPU can tolerate latency due to multithreading. Better performance at lower TLP: GPU TLP management.
5
Effects of GPU TLP on GPU performance
When GPU TLP is reduced, GPU performance can be: Better: less cache thrashing and less congestion in memory. Worse: reduced parallelism and latency tolerance. Unchanged.
6
Effects of GPU TLP on CPU performance
When GPU TLP is reduced, CPU performance can be: Better: less congestion in the memory subsystem. Unchanged.
7
Proposal I – CM-CPU: CPU-Centric Concurrency Management
Main goal: reduce GPU concurrency to boost CPU performance. Two metrics. Memory congestion: # of stalled requests due to an MC being full. Network congestion: # of stalled requests due to the reply network being full. CPU performance: congestion.
8
Proposal I – CM-CPU
Congestion level: low, medium, or high. At least one metric is high: decrease # of warps. Both metrics are low: increase # of warps. Otherwise: # of warps unchanged. Downside: insufficient GPU latency tolerance due to low TLP.
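The decision rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the congestion thresholds, the step of one warp, and the warp bounds are all assumptions, since the slides do not state them.

```python
from enum import Enum


class Level(Enum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


def classify(stalled_requests, low_threshold, high_threshold):
    """Map a stalled-request count to a congestion level.

    Both thresholds are hypothetical tuning parameters.
    """
    if stalled_requests < low_threshold:
        return Level.LOW
    if stalled_requests >= high_threshold:
        return Level.HIGH
    return Level.MEDIUM


def cm_cpu_step(warps, mem_level, net_level, min_warps=1, max_warps=48):
    """Return the next warp limit following the CM-CPU rule on the slide.

    The step of one warp and the min/max bounds are assumptions.
    """
    # At least one metric is high: decrease the number of warps.
    if Level.HIGH in (mem_level, net_level):
        return max(min_warps, warps - 1)
    # Both metrics are low: increase the number of warps.
    if mem_level is Level.LOW and net_level is Level.LOW:
        return min(max_warps, warps + 1)
    # Otherwise: leave the number of warps unchanged.
    return warps
```

For example, `cm_cpu_step(8, classify(900, 100, 500), Level.LOW)` lowers the limit because the memory metric is classified as high.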
9
Proposal II – CM-BAL: Balanced Concurrency Management
stallGPU: # of cycles in which a GPU core fails to issue a warp. Indicates the latency tolerance of GPU cores. Low latency tolerance; high memory contention.
10
Proposal II – CM-BAL
Part 1: the same as CM-CPU. Part 2: override CM-CPU and raise TLP when stallGPU changes by more than a threshold k. Higher k: more difficult to improve the latency tolerance.
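A minimal sketch of how the Part 2 override might look. The trigger condition here is an assumption, not taken from the slides: it fires when stallGPU has grown by more than k relative to a reference value. The function and parameter names, the step of one warp, and the upper bound are likewise hypothetical.

```python
def cm_bal_step(warps, cm_cpu_choice, stall_gpu, stall_gpu_ref, k, max_warps=48):
    """Return the next warp limit under a CM-BAL-style override.

    warps          current warp limit
    cm_cpu_choice  warp limit that Part 1 (CM-CPU) would pick
    stall_gpu      stall cycles observed in the current interval
    stall_gpu_ref  reference stall cycles (assumed: an earlier interval)
    k              override threshold
    """
    # Assumed trigger: stalls grew by more than k, which suggests the GPU
    # has lost latency tolerance, so raise TLP regardless of Part 1.
    if stall_gpu - stall_gpu_ref > k:
        return min(max_warps, warps + 1)
    # Otherwise fall back to the CM-CPU decision.
    return cm_cpu_choice
```

With the same stall counts, a larger `k` makes the override fire less often, which matches the slide's note that a higher k makes it more difficult to improve latency tolerance.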
11
GPU performance results
GPU/CPU. DYNCTA: +2%. CM-CPU: -11%. CM-BAL1: +7%.
12
CPU performance results
GPU/CPU. DYNCTA: +2%. CM-CPU: +24%. CM-BAL1: +7%.
13
System performance
Overall System Speedup = (1 − α) × WS_CPU + α × SU_GPU, where α is between 0 and 1. Higher α -> higher GPU importance. Schemes plotted: CM-CPU, CM-BAL.
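The weighted metric above is simple to compute. The sketch below is illustrative only: the function name is hypothetical, and the slide's WSCPU and SUGPU are read as a CPU weighted speedup and a GPU speedup.

```python
def overall_system_speedup(ws_cpu, su_gpu, alpha):
    """Compute (1 - alpha) * WS_CPU + alpha * SU_GPU.

    alpha must lie in [0, 1]; a higher alpha gives the GPU term more weight.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return (1.0 - alpha) * ws_cpu + alpha * su_gpu
```

At `alpha = 0` the result is the CPU term alone, and at `alpha = 1` it is the GPU term alone. Intermediate values trade one off against the other.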
14
Conclusions
Sharing the memory hierarchy causes CPU and GPU applications to interfere with each other. Existing GPU TLP management techniques are not well-suited for heterogeneous architectures. The work proposes two GPU TLP management techniques for heterogeneous architectures. CM-CPU reduces GPU TLP to improve CPU performance. CM-BAL is similar to CM-CPU, but increases GPU TLP when it detects low latency tolerance in GPU cores. TLP can be tuned based on the user's preference for higher CPU or GPU performance.