Published by Sabrina Dorsey. Modified over 9 years ago.
1
Ahmad Lashgar, Amirali Baniasadi, Ahmad Khonsari
ECE, University of Tehran; ECE, University of Victoria
2
This Work
Accelerators
  o Control flow is amortized over tens of threads, called a warp
  o Warp size impacts branch/memory divergence and memory access coalescing
  o Small warp: low branch/memory divergence (+), low memory coalescing (-)
  o Large warp: high branch/memory divergence (-), high memory coalescing (+)
Key question: Which processor provides higher energy efficiency?
  o Small-warp, coalescing-enhanced
  o Large-warp, control-flow-enhanced
Key result: The enhanced small-warp processor is better than the enhanced large-warp processor
Towards Green GPUs: Warp Size Impact Analysis
3
Outline
Branch/Memory divergence
Memory Access Coalescing
Warp Size Impact on Divergence and Coalescing
Warp Size: Large or Small?
  o Use machine models to find the answer:
  o Small-Warp Coalescing-Enhanced Machine (SW+)
  o Large-Warp Control-flow-Enhanced Machine (LW+)
Experimental Results
Conclusion
4
Warping
Opportunities
  o Reduce scheduling overhead
  o Improve utilization of execution units (SIMD efficiency)
  o Exploit inter-thread data locality
Challenges
  o Memory divergence
  o Branch divergence
5
Memory Divergence
Threads of a warp may hit or miss in the L1 cache on the same access:
    J = A[S];    // L1 cache access
    L = K * J;
[Figure: timeline of a 4-thread warp (T0 to T3) with per-thread hit/miss outcomes on the load and the resulting stall of the warp]
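The stall on this slide can be illustrated with a minimal sketch: under lock-step execution the dependent instruction `L = K * J` can issue only once the slowest thread's load has returned, so a single miss delays the whole warp. The latencies `HIT_CYCLES` and `MISS_CYCLES` and the function name are illustrative assumptions, not values from the slides.

```python
# Illustrative latencies (assumptions, not taken from the slides).
HIT_CYCLES = 1
MISS_CYCLES = 100

def warp_load_latency(hits):
    """Cycles until every thread of a lock-step warp has its loaded value.

    `hits` holds one boolean per thread: True for an L1 hit, False for a miss.
    The warp waits for its slowest thread.
    """
    return max(HIT_CYCLES if h else MISS_CYCLES for h in hits)

all_hit = warp_load_latency([True, True, True, True])
one_miss = warp_load_latency([True, False, True, True])
print(all_hit, one_miss)  # 1 100
```

With these assumed latencies, one missing thread makes the warp as slow as if every thread had missed.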
6
Branch Divergence
A branch instruction can diverge into two different paths, dividing the warp into two groups:
1. Threads with taken outcome
2. Threads with not-taken outcome
    if (J == K) {
        C[tid] = A[tid] * B[tid];
    } else if (J > K) {
        C[tid] = 0;
    }
[Figure: timeline of a 4-thread warp. All of T0 to T3 are active before and after the branch; inside it the warp executes once with mask T0 X X T3 and once with mask X T1 T2 X]
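As a rough sketch of the serialization described on this slide, the following model computes the active masks a lock-step warp steps through for the `if (J == K) ... else if (J > K) ...` fragment. The per-thread values of `J` and `K` are hypothetical, chosen so that the masks match the ones in the figure; the helper names are assumptions.

```python
def divergent_passes(J, K):
    """Active masks issued for `if (J == K) {...} else if (J > K) {...}`.

    J and K hold one value per thread. Each non-empty path is issued
    separately, with the threads on the other path masked off.
    """
    taken = [j == k for j, k in zip(J, K)]
    elif_taken = [(not t) and j > k for t, j, k in zip(taken, J, K)]
    return [mask for mask in (taken, elif_taken) if any(mask)]

def show(mask):
    """Render a mask the way the slide does: thread name if active, X if not."""
    return " ".join(f"T{i}" if on else "X" for i, on in enumerate(mask))

# Hypothetical per-thread values: T0 and T3 have J == K, T1 and T2 have J > K.
J = [5, 7, 9, 5]
K = [5, 3, 2, 5]
for mask in divergent_passes(J, K):
    print(show(mask))
# T0 X X T3
# X T1 T2 X
```

If every thread takes the same path, only one mask is returned and the warp does not diverge.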
7
Memory Access Coalescing
Common memory accesses of neighboring threads are coalesced into one transaction
[Figure: three 4-thread warps (T0 to T3, T4 to T7, T8 to T11) with hit/miss outcomes, accessing blocks A B A B, C C C C, and D E E D; the accesses coalesce into memory requests A, B, C, D, and E]
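A minimal sketch of per-warp coalescing, using the block labels from the figure (A B A B, C C C C, D E E D): threads of the same warp that touch the same block share one transaction. The function name and the representation of blocks as single letters are assumptions made for illustration.

```python
def coalesced_requests(blocks, warp_size):
    """Count memory transactions when accesses are coalesced per warp.

    `blocks` lists the memory block touched by each thread, in thread order.
    Threads of one warp that touch the same block share one transaction.
    """
    total = 0
    for start in range(0, len(blocks), warp_size):
        total += len(set(blocks[start:start + warp_size]))
    return total

blocks = list("ABABCCCCDEED")  # T0..T11, as labeled on the slide
print(coalesced_requests(blocks, warp_size=4))  # 5
```

The twelve accesses collapse into five requests: two for the first warp, one for the second, and two for the third.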
8
Coalescing Width
The range of threads in a warp that are considered together for memory access coalescing
  o NVIDIA G80 -> Over sub-warp
  o NVIDIA GT200 -> Over half-warp
  o NVIDIA GF100 -> Over entire warp
When coalescing spans the entire warp, the optimal warp size depends on the workload
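The effect of the coalescing width can be sketched by restricting which threads of a warp may share a transaction. The 128-byte transaction size, the access pattern, and the function name are assumptions for illustration; the slides do not give these values.

```python
BLOCK_BYTES = 128  # assumed transaction size, not stated on the slides

def transactions(addresses, coalescing_width):
    """Memory transactions for one warp when only threads within the same
    `coalescing_width`-thread group can share a transaction."""
    total = 0
    for start in range(0, len(addresses), coalescing_width):
        group = addresses[start:start + coalescing_width]
        total += len({addr // BLOCK_BYTES for addr in group})
    return total

# A 32-thread warp whose two halves read the same 16 eight-byte elements.
addresses = [(tid % 16) * 8 for tid in range(32)]
print(transactions(addresses, 16))  # half-warp coalescing: 2
print(transactions(addresses, 32))  # entire-warp coalescing: 1
```

In this pattern both halves touch the same block, so only coalescing over the entire warp can merge them.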
9
Warp Size
Warp size is the number of threads in a warp
Why small warps? (not smaller than the SIMD width)
  o Less branch/memory divergence
  o Less synchronization overhead at every instruction
Why large warps?
  o Greater opportunity for memory access coalescing
We study the impact of warp size on performance
10
Warp Size and Branch Divergence
The smaller the warp, the lower the branch divergence
    if (J > K) {
        C[tid] = A[tid] * B[tid];
    } else {
        C[tid] = 0;
    }
[Figure: eight threads (T1 to T8), six taking one path and two taking the other. 2-thread warps: no branch divergence. 4-thread warps: branch divergence]
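The claim on this slide can be checked with a small sketch that counts how many warps contain threads disagreeing on the branch outcome. The outcome vector is hypothetical, chosen to match the figure's split of six threads on one path and two on the other.

```python
def divergent_warps(outcomes, warp_size):
    """Count warps whose threads disagree on the branch outcome."""
    count = 0
    for start in range(0, len(outcomes), warp_size):
        warp = outcomes[start:start + warp_size]
        if len(set(warp)) > 1:
            count += 1
    return count

# Hypothetical outcomes for T1..T8: six threads take the branch, two do not.
outcomes = [True] * 6 + [False] * 2
print(divergent_warps(outcomes, 2))  # 0
print(divergent_warps(outcomes, 4))  # 1
```

With 2-thread warps every warp is uniform; with 4-thread warps the second warp mixes both outcomes and has to execute both paths.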
11
Warp Size and Branch Divergence (continued)
[Figure: execution timelines of threads T0 to T11 through a divergent branch, as three 4-thread warps (small warps) and as one 12-thread warp (large warps). First path: T0 T1 X X, T4 T5 T6 T7, X T9 T10 T11. Second path: X X T2 T3, X X X X, T8 X X X. The large warp issues the fully masked group X X X X; the small warp T4 to T7 does not diverge and skips it. Caption: Saving some idle cycles]
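A toy model of the idle cycles saved in this figure, under stated assumptions: a 4-wide SIMD unit, each warp issuing every non-empty path, and each issued path occupying one cycle per SIMD-wide group of the warp, including fully masked groups. The branch outcomes are read off the figure's masks; the model itself is a simplification, not the simulator used in the study.

```python
SIMD_WIDTH = 4  # lanes in this toy model (assumption)

def issue_cycles(took_if, warp_size):
    """Issue cycles for one if/else region under lock-step execution.

    Each warp issues every non-empty path, and each issued path occupies
    warp_size // SIMD_WIDTH cycles, including fully masked lane groups.
    """
    cycles = 0
    for start in range(0, len(took_if), warp_size):
        warp = took_if[start:start + warp_size]
        paths = int(any(warp)) + int(not all(warp))
        cycles += paths * (warp_size // SIMD_WIDTH)
    return cycles

# Branch outcomes for T0..T11, read off the slide's masks (X = inactive).
took_if = [1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1]
print(issue_cycles(took_if, 4))   # 5
print(issue_cycles(took_if, 12))  # 6
```

The small-warp configuration saves one cycle because the warp holding T4 to T7 never executes the second path.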
12
Warp Size and Memory Divergence
[Figure: timelines of threads T0 to T11 through a memory access with hit/miss outcomes, executed as three 4-thread warps (small warps) and as one 12-thread warp (large warps). The large warp stalls as a whole on the miss. Caption: Improving latency hiding]
13
Warp Size and Memory Access Coalescing
[Figure: threads T0 to T11 miss on accesses to blocks A and B, executed as three 4-thread warps (small warps) and as one 12-thread warp (large warps). Small warps: 5 memory requests. Large warps: 2 memory requests]
Reducing the number of memory accesses using wider coalescing
14
Warp Size Impact on Coalescing
The larger the warp, the higher the coalescing rate
15
Warp Size Impact on Idle Cycles
The larger the warp, the higher the divergence and the more idle cycles
  o but larger warps may also reduce idle cycles through the coalescing gain
16
Warp Size Impact on Energy
Larger warps reduce energy if the coalescing gain dominates the exacerbated divergence
17
Warp Size Impact on Performance
Larger warps improve performance if the coalescing gain dominates the exacerbated divergence
18
Warp Size Impact on Energy-efficiency
Larger warps improve energy efficiency if the coalescing gain dominates the exacerbated divergence
19
Approach
Baseline machine
Small Warp Enhanced (SW+):
  - Ideal MSHR to compensate for the coalescing loss
Large Warp Enhanced (LW+):
  - MIMD lanes to compensate for branch divergence
20
SW+
Warps as wide as the SIMD width
  o Minimize branch/memory divergence
  o Improve latency hiding
Compensating the deficiency -> Ideal MSHR
  o Compensates for the small-warp deficiency (lost memory access coalescing)
  o To merge inter-warp memory transactions, the Ideal MSHR tags the per-warp outstanding MSHRs
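The idea of merging inter-warp transactions can be sketched as follows. This is a simplified model, not the authors' implementation: it assumes all requests are outstanding at the same time, and the access pattern is hypothetical. Requests are first coalesced within each warp, then an idealized MSHR keyed by block merges requests from different warps.

```python
def per_warp_requests(blocks, warp_size):
    """Requests left after intra-warp coalescing, as (warp_id, block) pairs."""
    requests = []
    for warp_id, start in enumerate(range(0, len(blocks), warp_size)):
        for block in sorted(set(blocks[start:start + warp_size])):
            requests.append((warp_id, block))
    return requests

def ideal_mshr(requests):
    """Merge requests to the same block regardless of the issuing warp.

    Returns a map from block to the warps waiting on it; each key stands for
    one transaction sent to memory. Assumes every request is still
    outstanding when the later ones arrive.
    """
    outstanding = {}
    for warp_id, block in requests:
        outstanding.setdefault(block, []).append(warp_id)
    return outstanding

# Hypothetical pattern: three 4-thread warps touching only blocks A and B.
blocks = list("AAAAAABBABAB")
requests = per_warp_requests(blocks, warp_size=4)
merged = ideal_mshr(requests)
print(len(requests), len(merged))  # 5 2
```

Under these assumptions the small-warp machine sends as few transactions as a single 12-thread warp would, which is the coalescing loss the enhancement is meant to recover.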
21
LW+
Warps 8x larger than the SIMD width
  o Improve memory access coalescing
Compensating the deficiency -> Lock-step MIMD execution
  o Compensates for the large-warp deficiency (branch/memory divergence)
  o Parallel fetch/decode unit per lane
22
Methodology
Performance simulation through GPGPU-sim and power simulation through McPAT
  o Six memory controllers (76 GB/s)
  o 16 8-wide SMs (332.8 GFLOPS)
  o 1024 threads per core
  o Warp size: 8, 16, 32, and 64
Workloads
  o RODINIA
  o CUDA SDK
  o GPGPU-sim
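The 332.8 GFLOPS figure is consistent with the following back-of-the-envelope calculation, assuming each SIMD lane retires one fused multiply-add (2 FLOPs) per cycle and using the 1300 MHz core clock listed in the backup configuration slide. The FMA assumption and the helper names are not stated in the slides. The second helper shows how many warps fit in an SM and how many cycles one warp instruction occupies an 8-wide SIMD unit for each studied warp size.

```python
NUM_SMS = 16
SIMD_WIDTH = 8
THREADS_PER_SM = 1024
CORE_CLOCK_GHZ = 1.3           # 1300 MHz core clock
FLOPS_PER_LANE_PER_CYCLE = 2   # assumption: one fused multiply-add per cycle

def peak_gflops():
    """Peak single-precision throughput under the assumptions above."""
    return NUM_SMS * SIMD_WIDTH * FLOPS_PER_LANE_PER_CYCLE * CORE_CLOCK_GHZ

def warp_shape(warp_size):
    """(warps resident per SM, cycles one warp instruction occupies the SIMD unit)."""
    return THREADS_PER_SM // warp_size, warp_size // SIMD_WIDTH

print(round(peak_gflops(), 1))  # 332.8
for warp_size in (8, 16, 32, 64):
    print(warp_size, warp_shape(warp_size))
```

A warp of 8 threads issues in a single cycle, while a warp of 64 threads occupies the unit for 8 cycles, matching the "8x larger than the SIMD width" description of LW+.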
23
Coalescing Rate
SW+: 103%, 67%, and 40% higher coalescing vs. 16, 32, and 64 threads per warp
LW+: 47%, 21%, and 1% higher coalescing vs. 16, 32, and 64 threads per warp
24
Idle Cycles
SW+: 12%, 8%, and 10% fewer idle cycles vs. 8, 16, and 32 threads per warp
LW+: 4%, 1%, and 3% fewer idle cycles vs. 8, 16, and 32 threads per warp
25
Energy
SW+: Outperforms 8 threads per warp (26%).
LW+: Outperforms SW+ (19%) and 8 (51%) and 16 (3%) threads per warp.
26
Performance
SW+: Outperforms LW+ (7%) and 8 (18%), 16 (15%), and 32 (25%) threads per warp.
LW+: Outperforms 8 (11%), 16 (8%), 32 (17%), and 64 (30%) threads per warp.
27
Energy-efficiency
SW+: Outperforms LW+ (62%) and 8 (136%), 16 (13%), and 32 (4%) threads per warp.
LW+: Outperforms 8 (46%) and 64 (8%) threads per warp.
28
Conclusion & Future Work
Warp size impacts coalescing rate, idle cycles, performance, and energy
Investing in enhancing the small-warp machine returns a higher gain than investing in enhancing the large-warp machine
We use machine models to explore the answer
Evaluating wider machine models (including LWM-enhanced large-warp machine)
29
Thank you! Questions?
30
Backup Slides
31
Warping
Thousands of threads are scheduled with zero overhead
  o The context of all threads is kept on-core
Tens of threads are grouped into a warp
  o Execute the same instruction in lock-step
32
Key Question
Which warp size should be chosen as the baseline?
  o Then, invest in augmenting the processor to remove the associated deficiency
Machine models to find the answer
33
GPGPU-sim Config
NoC
  #SMs / #memory controllers: 16 / 6
  Number of SMs sharing a network interface: 2
SM
  #threads per SM / SIMD width: 1024 / 32
  Maximum allowed CTAs per SM: 8
  Shared memory / register file size: 16KB / 64KB
  Warp size: 8 / 16 / 32 / 64
  L1 data / texture / constant cache: 64KB : 16KB : 16KB
Clocking
  Core / Interconnect / DRAM: 1300 / 650 / 800 MHz
Memory
  Banks per memory ctrl : DRAM scheduling policy: 8 : FCFS
34
Workloads
Name | Grid Size | Block Size | #Insn
BFS: BFS Graph [3] | 16x(8,1,1) | 16x(512,1) | 1.4M
BKP: Back Propagation [3] | 2x(1,64,1) | 2x(16,16) | 2.9M
CP: Distance-Cutoff Coulomb Potential [1] | (8,32,1) | (16,8,1) | 113M
GAS: Gaussian Elimination [3] | 48x(3,3,1) | 48x(16,16) | 8.8M
HSPT: Hotspot [3] | (43,43,1) | (16,16,1) | 76.2M
LPS: Laplace equation on regular 3D grid [1] | (4,25) | (32,4) | 81.7M
MP: MUMmer-GPU++ [6] | (1,1,1) | (256,1,1) | 0.3M
MU: MUMmer-GPU [1] | (1,1,1) | (100,1,1) | 0.15M
NN: Neural Network [1] | (6,28) (50,28) (100,28) (10,28) | (13,13) (5,5) 2x(1,1) | 68.1M
NNC: Nearest Neighbor [3] | 4x(938,1,1) | 4x(16,1,1) | 5.9M
NQU: N-Queen [1] | (256,1,1) | (96,1,1) | 1.2M
RAY: Ray-tracing [1] | (16,32) | (16,8) | 64.9M
SC: Scan [18] | (64,1,1) | (256,1,1) | 3.6M
SR1: SRAD [3] (large dataset) | 3x(8,8,1) | 3x(16,16) | 9.1M
SR2: SRAD [3] (small dataset) | 4x(4,4,1) | 4x(16,16) | 2.4M