Performance in GPU Architectures: Potentials and Distances


1 Performance in GPU Architectures: Potentials and Distances
Ahmad Lashgar, ECE, University of Tehran; Amirali Baniasadi, ECE, University of Victoria. WDDD-9, June 5, 2011.

2 This Work
Goal: investigate GPU performance for general-purpose workloads.
How: study the isolated impact of memory divergence, branch divergence, and context-keeping resources.
Key finding: memory has the biggest impact; a branch-divergence solution needs to take memory into consideration.
A. Lashgar and A. Baniasadi. Performance in GPU architectures: Potentials and Distances.

3 Outline
Background
Performance-Impacting Parameters
Machine Models
Performance Potentials
Performance Distances
Sensitivity Analysis
Conclusion

4 GPU Architecture
The GPU is a scalable array of SMs (grouped into TPCs) and memory controllers communicating through an interconnection network. Each SM contains 32 PEs, L1 data, constant, and texture caches, a register file, a thread pool (per-thread TID, CTAID, and program counter), and shared memory.
For a specific workload, the number of concurrent CTAs per SM is limited by the size of three shared resources: the thread pool, the register file, and the shared memory.
Note that real GPUs do not store the PC and CTAID per thread; storing these per warp is enough.
[Figure: block diagram of the configuration used in this paper: 10 TPCs of 3 SMs each, with 6 memory controllers and DRAM channels on the interconnect.]
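The CTA-limit rule stated above can be sketched as a min over the three shared resources. This is an illustrative sketch only; the per-CTA numbers and the register-file size below are invented, not taken from the paper.

```python
# Hedged sketch: concurrent CTAs per SM are bounded by whichever shared
# resource runs out first. Defaults below are illustrative assumptions.

def max_ctas(threads_per_cta, regs_per_cta, smem_per_cta,
             pool=1024, regfile=16384, smem=16 * 1024):
    """Minimum CTA count allowed by thread pool, register file, shared memory."""
    return min(pool // threads_per_cta,
               regfile // regs_per_cta,
               smem // smem_per_cta)

print(max_ctas(128, 4096, 8192))   # 2: shared memory is the binding limit here
```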

5 Branch Divergence
The SM is a SIMD processor: a group of threads (a warp) executes the same instruction across the lanes. A branch instruction can diverge a warp into two groups: threads with the taken outcome and threads with the not-taken outcome.
A: // pre-divergence
if (CONDITION) {
B:   // not-taken path
} else {
C:   // taken path
}
D: // reconvergence point
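The split described above can be sketched as two complementary lane masks. A minimal illustrative sketch (lane 0 is the least-significant bit; not from the paper's simulator):

```python
# Sketch: a per-lane branch outcome splits a warp into taken / not-taken masks.

def diverge(conds):
    """conds: per-lane branch outcomes. Return (taken_mask, not_taken_mask)."""
    taken = sum(1 << i for i, c in enumerate(conds) if c)
    not_taken = sum(1 << i for i, c in enumerate(conds) if not c)
    return taken, not_taken

t, nt = diverge([False, True, True, False])
print(bin(t), bin(nt))   # 0b110 0b1001
```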

6 Control-Flow Mechanisms
Control-flow mechanisms address branch divergence. Previous solutions:
Postdominator Reconvergence (PDOM): masking and serializing the diverging paths, finally reconverging all paths.
Dynamic Warp Formation (DWF): regrouping the threads in diverging paths into new warps.

7 PDOM
SIMD utilization over time.
[Figure: warps W0 and W1 executing the CFG A -> {B, C} -> D under PDOM. Each warp keeps a reconvergence stack of (reconvergence PC, PC, mask vector) entries; e.g. W0 splits into masks 0110 and 1001 on the two paths, then reconverges at D with mask 1111. Color on the CFG shows an active warp; gray shows an inactive (masked) warp.]
Dynamically regrouping diverged threads on the same path would increase utilization.
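The stack behavior in the figure can be sketched for one 4-lane warp. This is an illustrative sketch, not the paper's simulator: the branch at A pushes a reconvergence entry for D and one masked entry per diverged path.

```python
# Minimal PDOM reconvergence-stack sketch for A -> if -> {B | C} -> D.
# Stack entries are (PC, active mask, reconvergence PC); lane 0 is bit 0.

def pdom_execute(cond_mask, n_lanes=4):
    """Return the sequence of (PC, mask) pairs the warp issues."""
    full = (1 << n_lanes) - 1
    stack = [("A", full, None)]
    trace = []
    while stack:
        pc, mask, _rpc = stack.pop()
        trace.append((pc, mask))
        if pc == "A":  # divergent branch
            taken = cond_mask & full
            not_taken = full & ~cond_mask
            stack.append(("D", full, None))          # reconvergence entry
            if taken:
                stack.append(("C", taken, "D"))      # taken path, masked
            if not_taken:
                stack.append(("B", not_taken, "D"))  # not-taken path, masked
    return trace

print(pdom_execute(0b0110))
# [('A', 15), ('B', 9), ('C', 6), ('D', 15)]
```

The paths execute serially under partial masks, and the warp leaves D with the full mask restored.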

8 DWF
SIMD utilization over time.
[Figure: the same CFG under DWF. Instead of a reconvergence stack, a warp pool holds (warp, PC, mask vector) entries; threads from different warps reaching the same PC can be merged into new warps, e.g. masks 0110 and 0001 combine into a warp with mask 0111. Note the warp pool needs to keep TIDs instead of a mask vector. Warp colors show the potential of different thread placements.]
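The merge step in the figure can be sketched greedily: same-PC lane masks combine whenever no two threads claim the same lane. An illustrative sketch, not the paper's simulator:

```python
# Sketch of DWF's regrouping: pack same-PC lane masks into warps
# without lane conflicts.

def dwf_merge(masks):
    """Greedily merge lane masks (all at the same PC) into warps."""
    warps = []
    for mask in masks:
        for i, w in enumerate(warps):
            if w & mask == 0:      # no lane conflict: merge into this warp
                warps[i] = w | mask
                break
        else:
            warps.append(mask)     # otherwise open a new warp
    return warps

# Masks 0110 and 0001 arriving at the same PC merge into one fuller warp,
# as in the slide's merge possibility:
print(dwf_merge([0b0110, 0b0001]))   # [7], i.e. mask 0b0111
```

When two threads occupy the same lane, no merge is possible and a second warp is opened, which is why thread placement matters.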

9 Performance-Impacting Parameters
Memory divergence: memory pressure increases with un-coalesced memory accesses.
Branch divergence: SIMD efficiency decreases with intra-warp diverging branches.
Workload parallelism: CTA-limiting resources bound the memory-latency-hiding capability. Concurrent CTAs share three CTA-limiting resources: shared memory, the register file, and the thread pool.

10 Machine Models
Each machine model is named X-Y-Z to isolate the impact of each parameter:
X (resources): LR (limited resources) or UR (unlimited resources)
Y (control flow): DC (DWF control flow), PC (PDOM control flow), or IC (ideal control flow, MIMD)
Z (memory): M (real memory) or IM (ideal memory)
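The naming scheme above spans a 2 x 3 x 2 grid of machine models, which can be enumerated directly:

```python
from itertools import product

# Enumerate the 12 machine models implied by the X-Y-Z naming scheme.
models = ["-".join(p) for p in product(("LR", "UR"),
                                       ("DC", "PC", "IC"),
                                       ("M", "IM"))]
print(models)   # 'LR-DC-M', 'LR-DC-IM', ..., 'UR-IC-IM'
```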

11 Machine Models continued…
Limited per-SM resources: LR-DC-M, LR-PC-M, LR-IC-M (real memory); LR-DC-IM, LR-PC-IM, LR-IC-IM (ideal memory)
Unlimited per-SM resources: UR-DC-M, UR-PC-M, UR-IC-M (real memory); UR-DC-IM, UR-PC-IM, UR-IC-IM (ideal memory)

12 Methodology
GPGPU-sim v2.1.1b; 13 benchmarks from the RODINIA benchmark suite and CUDA SDK 2.3.
Configuration:
NoC: 30 SMs; 6 memory controllers; 3 SMs sharing an interconnect
SM: warp size 32 threads; 1024 threads per SM; 32-bit registers per SM; 32 PEs per SM; 16 KB shared memory; 32 KB L1 data cache
Clocking: core clock 325 MHz; interconnect clock 650 MHz; DRAM memory clock 800 MHz
Control-flow mechanisms: DWF issue heuristic: Majority; PDOM warp scheduling: round-robin

13 Performance Potentials
A potential is the speedup that can be reached if the impacting parameter is idealized. Three potentials (per control-flow mechanism):
Memory potential: speedup due to ideal memory
Control potential: speedup due to a free-of-divergence architecture
Resource potential: speedup due to infinite CTA-limiting resources per SM
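One natural reading of the definition above is that a potential is the relative speedup from idealizing a single parameter with everything else fixed. The formula and numbers below are an illustrative sketch, not taken from the paper's methodology text:

```python
# Hedged sketch: potential = perf(idealized machine) / perf(baseline) - 1.

def potential(perf_ideal, perf_base):
    """Speedup (as a fraction) from idealizing one parameter."""
    return perf_ideal / perf_base - 1.0

# e.g. an idealized-memory machine running 1.59x the baseline:
print(f"{potential(1.59, 1.00):.0%}")   # 59%
```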

14 Performance Potentials continued…
In this work, all performance numbers are normalized to LR-DC-M.

15 Memory Potentials
PDOM: 59%; DWF: 61%

16 Resource Potentials
PDOM: 9.4%; DWF: 8.6%

17 Control Potentials
PDOM: -7%; DWF: 2%

18 Performance Distances
A distance is how far an otherwise ideal GPU remains from the fully ideal machine due to one parameter. Three distances:
Memory distance: distance from the ideal GPU due to real memory
Resource distance: distance from the ideal GPU due to limited resources
Control distance: distance from the ideal GPU due to branch divergence
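One natural reading of the definition above is the performance gap between a machine with a single realistic parameter and the fully ideal machine. The formula and numbers are an illustrative sketch, not taken from the paper's methodology text:

```python
# Hedged sketch: distance = 1 - perf(one parameter real) / perf(fully ideal).

def distance(perf_one_real, perf_all_ideal):
    """Fractional performance lost to one non-ideal parameter."""
    return 1.0 - perf_one_real / perf_all_ideal

# e.g. real memory alone holding an otherwise-ideal GPU at 0.60x ideal:
print(f"{distance(0.60, 1.00):.0%}")   # 40%
```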

19 Performance Distances continued…

20 Memory Distance: 40%

21 Resource Distance: 2%

22 Control Distances
PDOM: 8%; DWF: 15%

23 Sensitivity Analysis
Validating the findings under aggressive configurations (limited to the performance potentials):
Aggressive-memory: 2x L1 caches; 2x the number of memory controllers
Aggressive-resource: 2x the CTA-limiting resources

24 Aggressive-Memory: Memory Potentials
PDOM memory potential: 28%; DWF memory potential: 28%

25 Aggressive-Memory continued… Control Potentials
PDOM control potential: -8%; DWF control potential: -0.4%

26 Aggressive-Memory continued… Resource Potentials
PDOM resource potential: 8%; DWF resource potential: ~0%

27 Aggressive-Resource: Memory Potentials
PDOM memory potential: 51%; DWF memory potential: 52%

28 Aggressive-Resource continued… Control Potentials
PDOM control potential: -8%; DWF control potential: 2%

29 Aggressive-Resource continued… Resource Potentials
PDOM resource potential: 4%; DWF resource potential: 3%

30 Conclusion

31 Conclusion
Potentials (improvement from idealizing one parameter):
Memory: 59% and 61% for PDOM and DWF
Control: -7% and 2% for PDOM and DWF
Resource: 9.4% and 8.6% for PDOM and DWF
Distances (distance from the ideal system due to one non-ideal factor):
Memory: 40%
Control: 8% and 15% for PDOM and DWF
Resource: 2%
Findings:
Memory has the biggest impact among the three factors.
Improving the control-flow mechanism has to consider memory pressure.
The same trend holds under aggressive memory and context-keeping resources.

32 Thank you. Questions?

33 Why 32 PEs per SM
GPGPU-sim v2.1.1b coalesces memory accesses over SIMD-width slices of a warp separately, similar to pre-Fermi GPUs. Example: with warp size 32 and 8 PEs per SM, there are 4 independent coalescing domains in a warp (lanes 0-7, 8-15, 16-23, 24-31). We used 32 PEs per SM at 1/4 the clock rate to model coalescing over the whole warp (lanes 0-31), similar to Fermi GPUs.
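The effect of the coalescing-domain width can be sketched by counting memory transactions. This is an illustrative sketch, not GPGPU-sim code; the 128-byte segment size and 4-byte accesses are assumptions for the example.

```python
# Sketch: coalesce addresses within each `domain`-sized group of lanes,
# issuing one transaction per distinct 128-byte segment the group touches.

def transactions(addrs, domain, seg=128):
    """Count transactions for per-lane addresses under a coalescing domain."""
    total = 0
    for i in range(0, len(addrs), domain):
        total += len({a // seg for a in addrs[i:i + domain]})
    return total

# 32 consecutive 4-byte accesses starting at a segment boundary:
addrs = [4 * lane for lane in range(32)]
print(transactions(addrs, 8))    # 4 transactions (one per 8-lane slice)
print(transactions(addrs, 32))   # 1 transaction (whole-warp coalescing)
```

Per-slice coalescing issues one transaction per slice even when the whole warp touches a single segment, which is the behavior the 32-PE configuration avoids.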

