Managing GPU Concurrency in Heterogeneous Architectures

MICRO 2014

Baseline heterogeneous architecture
Throughput-optimized GPU cores and latency-optimized CPU cores sit on the same chip. Cores are connected to the last-level cache (LLC) and the memory controllers (MCs) via an interconnect.

Motivation – application interference
Performance loss comes from contention in shared hardware resources (figure annotations: 20%, 85%). The GPU is the dominant consumer of shared resources due to its high TLP. CPU applications are affected much more than GPU applications.

Motivation – latency tolerance
High GPU thread-level parallelism (TLP) leads to memory system congestion and low CPU performance. The GPU can tolerate latency due to multithreading. Better performance at lower TLP motivates GPU TLP management.

Effects of GPU TLP on GPU performance
When GPU TLP is reduced, GPU performance can be better (less cache thrashing and less congestion in memory), worse (reduced parallelism and latency tolerance), or unchanged.

Effects of GPU TLP on CPU performance
When GPU TLP is reduced, CPU performance can be better (less congestion in the memory subsystem) or unchanged.

Proposal I – CM-CPU CPU-Centric Concurrency Management
Main goal: reduce GPU concurrency to boost CPU performance. Two metrics are used. Memory congestion: the number of requests stalled because an MC is full. Network congestion: the number of requests stalled because the reply network is full. CPU performance is tied to congestion.

Proposal I – CM-CPU
Congestion level: low, medium, or high. If at least one metric is high, decrease the number of warps. If both metrics are low, increase the number of warps. Otherwise, leave the number of warps unchanged. Downside: insufficient GPU latency tolerance due to low TLP.
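The decision rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `Level`, `classify`, and `cm_cpu_step`, the threshold parameters, the warp limits, and the step size of one warp are all hypothetical and do not come from the slides.

```python
from enum import Enum


class Level(Enum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


def classify(stalled_requests, low_threshold, high_threshold):
    """Map a stalled-request count to a congestion level.

    Both thresholds are hypothetical; the slides do not give values.
    """
    if stalled_requests >= high_threshold:
        return Level.HIGH
    if stalled_requests <= low_threshold:
        return Level.LOW
    return Level.MEDIUM


def cm_cpu_step(warps, mem_level, net_level, min_warps=1, max_warps=48, step=1):
    """Return the next warp count given the two congestion levels.

    min_warps, max_warps, and step are assumed values for illustration.
    """
    # At least one metric is high: decrease the number of warps.
    if mem_level is Level.HIGH or net_level is Level.HIGH:
        return max(min_warps, warps - step)
    # Both metrics are low: increase the number of warps.
    if mem_level is Level.LOW and net_level is Level.LOW:
        return min(max_warps, warps + step)
    # Otherwise: leave the number of warps unchanged.
    return warps
```

For example, `cm_cpu_step(8, Level.HIGH, Level.LOW)` lowers the warp count, while a medium/low combination leaves it where it is.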

Proposal II – CM-BAL Balanced Concurrency Management
stallGPU: the number of cycles in which a GPU core fails to issue a warp. It reflects the latency tolerance of GPU cores. Figure labels: low latency tolerance, high memory contention.

Proposal II – CM-BAL
Part 1: the same as CM-CPU. Part 2: override CM-CPU and increase TLP when stallGPU rises by more than k. A higher k makes it more difficult to improve the latency tolerance.
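A minimal sketch of the two-part scheme, assuming that "by more than k" refers to the growth of stallGPU between two consecutive sampling intervals. The function name `cm_bal_step`, its parameters, the warp limit, and the step size are hypothetical; the slides do not specify how the override is implemented.

```python
def cm_bal_step(warps, cm_cpu_warps, stall_gpu, prev_stall_gpu, k,
                max_warps=48, step=1):
    """Return the next warp count under a CM-BAL-style policy.

    warps          -- current number of warps
    cm_cpu_warps   -- warp count that the CM-CPU rule (Part 1) would choose
    stall_gpu      -- stallGPU measured in the current interval
    prev_stall_gpu -- stallGPU measured in the previous interval (assumption)
    k              -- override threshold from the slide
    """
    # Part 2: stallGPU grew by more than k, so latency tolerance is
    # suffering; override CM-CPU and raise TLP instead.
    if stall_gpu - prev_stall_gpu > k:
        return min(max_warps, warps + step)
    # Part 1: otherwise follow the CM-CPU decision unchanged.
    return cm_cpu_warps
```

With a larger `k`, the override condition fires less often, which matches the slide's note that a higher k makes it harder to improve latency tolerance.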

GPU performance results
GPU/CPU. DYNCTA: +2%; CM-CPU: -11%; CM-BAL1: +7%.

CPU performance results
GPU/CPU. DYNCTA: +2%; CM-CPU: +24%; CM-BAL1: +7%.

System performance
Overall System Speedup = (1 − α) × WS_CPU + α × SU_GPU, where α is between 0 and 1. A higher α gives GPU performance higher importance. Figure legend: CM-CPU, CM-BAL.
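The weighted metric can be written as a small helper to see how α shifts the balance. The function name and the input values below are made up purely for illustration; they are not results from the presentation.

```python
def overall_system_speedup(ws_cpu, su_gpu, alpha):
    """Combine the CPU and GPU terms with weight alpha on the GPU term."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return (1.0 - alpha) * ws_cpu + alpha * su_gpu


# Illustrative inputs only (hypothetical, not measured values).
ws_cpu, su_gpu = 1.2, 0.9
for alpha in (0.0, 0.5, 1.0):
    print(alpha, overall_system_speedup(ws_cpu, su_gpu, alpha))
```

At α = 0 the result equals the CPU term alone, and at α = 1 it equals the GPU term alone, so a scheme that trades GPU performance for CPU performance looks better at low α and worse at high α.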

Conclusions
Sharing the memory hierarchy causes CPU and GPU applications to interfere with each other. Existing GPU TLP management techniques are not well suited to heterogeneous architectures. The work proposes two GPU TLP management techniques for heterogeneous architectures. CM-CPU reduces GPU TLP to improve CPU performance. CM-BAL is similar to CM-CPU, but increases GPU TLP when it detects low latency tolerance in GPU cores. TLP can be tuned based on the user's preference for higher CPU or GPU performance.
