
2 Outline
GPU Computing
GPGPU-Sim / Manycore Accelerators
(Micro)Architecture Challenges:
–Branch Divergence (DWF, TBC)
–On-Chip Interconnect

3 What do these have in common?

4 [Image] Source: AMD Hot Chips 19

5 GPU Computing
Technology trends favor "simpler" cores (less power). GPUs represent an extreme in terms of computation per unit area. Current GPUs tend to work well for applications with regular parallelism (e.g., dense matrix multiply).
Research questions: Can we make GPUs better for a wider class of parallel applications? Can we make them even more efficient?

6 Split the problem between CPU and GPU: CPU (sequential code "accelerator"), GPU (most of the computation happens here).

7 Heterogeneous Computing
[Timeline diagram: the CPU spawns work on the GPU, the GPU executes it, control returns to the CPU when done, and the CPU spawns the next GPU kernel]

8 CUDA Thread Hierarchy
Kernel = grid of blocks of warps of scalar threads

9 CUDA Example [Luebke]: Standard C Code

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
main()
{
    …
    saxpy_serial(n, 2.0, x, y);
}

10 CUDA Example [Luebke]: CUDA Code

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

main()
{
    // omitted: allocate and initialize memory
    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
    // omitted: transfer results from GPU to CPU
}

11 GPU Microarchitecture Overview (10,000')
[Block diagram: several shader cores connected through an interconnection network to multiple memory controllers, each attached to off-chip GDDR DRAM]

12 Single Instruction, Multiple Thread (SIMT)
All threads in a kernel grid run the same "code". A given block in the kernel grid runs on a single "shader core". A warp in a block is a set of scalar threads grouped to execute in SIMD lockstep. Using stack hardware and/or predication, different branch outcomes per thread in a warp can be supported.
[Diagram: scalar threads W, X, Y, Z grouped into one thread warp sharing a common PC and feeding a SIMD pipeline, with thread warps 3, 7, and 8 waiting to issue]

13 "Shader Core" Microarchitecture
Heavily multithreaded: 32 "warps", each representing 32 scalar threads. Designed to tolerate long-latency operations rather than avoid them.

14 "GPGPU-Sim" (ISPASS 2009)
GPGPU simulator developed by my group at UBC. Goal: a platform for architecture research on manycore accelerators running massively parallel applications. Supports CUDA's "virtual instruction set" (PTX). Provides a timing model with "good enough" accuracy for architecture research.

15 GPGPU-Sim Usage
Input: unmodified CUDA or OpenCL application.
Output: clock cycles required to execute, plus statistics that can be used to determine where cycles were lost due to "microarchitecture level" inefficiency.

16 Accuracy vs. Hardware (GPGPU-Sim 2.1.1b)
Correlation ~0.90. (Architecture simulators give up accuracy to enable flexibility: they can explore more of the design space.)

17 GPGPU-Sim Visualizer (ISPASS 2010)

18 GPGPU-Sim with SASS (decuda) + uArch Tuning (under development)
Correlation ~0.95; ~0.976 correlation on the subset of the CUDA SDK that currently runs. Currently adding support for the Fermi uArch (don't ask when it will be available).

19 First Problem: Control Flow
Scalar threads are grouped into warps. Branch divergence occurs when threads inside a warp want to follow different execution paths.
[Diagram: a warp reaching a branch splits between Path A and Path B]
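As a concrete (hypothetical) example of such a branch, a minimal CUDA kernel in which lanes of the same warp take different paths; the kernel and array names are illustrative, not from the talk:

// Lanes with even tid take "Path A", odd lanes take "Path B"; on current
// GPUs the two paths execute one after the other under an active mask.
__global__ void divergent_kernel(int n, float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        if (tid % 2 == 0)
            data[tid] *= 2.0f;   // Path A
        else
            data[tid] += 1.0f;   // Path B
        // Paths reconverge here (immediate post-dominator of the branch)
    }
}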

20 Current GPUs: Stack-Based Reconvergence
(Building upon Levinthal & Porter, SIGGRAPH '84)
[Figure: control-flow graph A → B → {C, D} → E → G for a 4-thread warp, with the per-warp reconvergence stack of (Reconv. PC, Next PC, Active Mask) entries shown as the warp serializes path C (mask 1001) and path D (mask 0110), reconverges at E, and finishes at G]
Our version: immediate post-dominator reconvergence.
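A host-side C++ sketch of how such a per-warp reconvergence stack can be modeled; the (reconvergence PC, next PC, active mask) fields follow the slide, while the names and the vector-based stack are illustrative rather than the actual hardware or GPGPU-Sim code:

#include <cstdint>
#include <vector>

// One entry of the per-warp reconvergence stack: where to reconverge,
// which PC the active lanes execute next, and which lanes are active.
struct StackEntry {
    uint32_t reconv_pc;   // immediate post-dominator of the branch
    uint32_t next_pc;     // PC the active lanes fetch next
    uint32_t active_mask; // one bit per lane in the warp
};

// On a divergent branch: point the TOS at the reconvergence PC, then push
// one entry per taken path (each with its subset of the active mask).
void on_divergent_branch(std::vector<StackEntry> &stack,
                         uint32_t reconv_pc,
                         uint32_t taken_pc, uint32_t taken_mask,
                         uint32_t fallthru_pc, uint32_t fallthru_mask)
{
    stack.back().next_pc = reconv_pc;
    if (fallthru_mask) stack.push_back({reconv_pc, fallthru_pc, fallthru_mask});
    if (taken_mask)    stack.push_back({reconv_pc, taken_pc,    taken_mask});
}

// Each cycle the warp executes stack.back().next_pc under
// stack.back().active_mask; when that PC reaches the entry's reconvergence
// point, the entry is popped and the path below resumes.
void on_reach_pc(std::vector<StackEntry> &stack, uint32_t pc)
{
    if (stack.size() > 1 && pc == stack.back().reconv_pc)  // keep the base entry
        stack.pop_back();
}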

21 Dynamic Warp Formation (MICRO'07 / TACO'09)
Consider multiple warps.
[Diagram: two warps reaching the same branch, each splitting between Path A and Path B; opportunity to merge the threads taking the same path?]

22 Dynamic Warp Formation
Idea: form new warps at divergence, when enough threads branch to each path to create full new warps.

23 Dynamic Warp Formation: Example
[Figure: warps x and y both execute basic blocks A through G; in the baseline each diverged warp issues partially full (e.g. x/1000 at C, x/0110 at D), while with DWF a new warp is created from the scalar threads of both warp x and warp y executing at basic block D, shortening the overall schedule]
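A simplified host-side sketch of the regrouping step (not the MICRO'07 hardware): scalar threads from different warps that are headed to the same PC are packed into new warps of 32. The real design also respects register-file lane placement, which this sketch ignores.

#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

struct ScalarThread { int tid; uint32_t next_pc; };

// Group threads by their next PC, then pack each group into warps of 32.
std::vector<std::vector<int>> form_dynamic_warps(
        const std::vector<ScalarThread> &threads, size_t warp_size = 32)
{
    std::map<uint32_t, std::vector<int>> by_pc;
    for (const auto &t : threads)
        by_pc[t.next_pc].push_back(t.tid);

    std::vector<std::vector<int>> warps;
    for (auto &kv : by_pc) {
        auto &tids = kv.second;
        for (size_t i = 0; i < tids.size(); i += warp_size)
            warps.emplace_back(tids.begin() + i,
                               tids.begin() + std::min(i + warp_size, tids.size()));
    }
    return warps;
}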

24 Dynamic Warp Formation: Implementation
[Diagram: a modified register file plus the new warp-formation logic]

25 Thread Block Compaction (HPCA 2011)

26 DWF Pathologies: Starvation
Majority scheduling (prioritize the largest group of threads with the same PC) was the best-performing policy in prior work, but it leads to starvation and poor reconvergence, i.e. lower SIMD efficiency. The key obstacle is variable memory latency (1000s of cycles).
B: if (K > 10)
C:     K = 10;
   else
D:     K = 0;
E: B = C[tid.x] + K;
[Figure: warp issue timeline showing some dynamic warps racing ahead to E while the threads left at D wait on memory]

27 DWF Pathologies: Extra Uncoalesced Accesses
Coalesced memory access = memory SIMD, a first-order CUDA programmer optimization. It is not preserved by DWF.
E: B = C[tid.x] + K;
[Figure: without DWF the three warps touch three memory lines at 0x100, 0x140, 0x180 (#Acc = 3); with DWF the regrouped warps scatter across the same lines (#Acc = 9). The L1 cache absorbs some of the redundant memory traffic but suffers L1$ port conflicts.]
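A small sketch that makes the access-count comparison concrete: count the memory transactions one warp needs for C[tid.x], assuming 4-byte elements, 128-byte transactions, and a 128-byte-aligned C (parameters chosen for illustration, not taken from the slide):

#include <cstdint>
#include <set>
#include <vector>

// Number of 128-byte transactions needed for the C[tid] accesses of one warp.
int coalesced_transactions(const std::vector<int> &warp_tids)
{
    std::set<uint64_t> lines;                     // distinct 128-byte lines touched
    for (int tid : warp_tids)
        lines.insert((uint64_t(tid) * 4) / 128);
    return (int)lines.size();
}

// A static warp {0..31} touches 1 line -> 1 transaction.
// A dynamic warp mixing tids from several static warps touches several lines,
// which is exactly the extra uncoalesced traffic the slide describes.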

28 DWF Pathologies: Implicit Warp Sync
Some CUDA applications depend on the lockstep execution of "static warps", e.g. a task queue in ray tracing (threads 0...31 form warp 0, 32...63 warp 1, 64...95 warp 2):

int wid = tid.x / 32;
if (tid.x % 32 == 0) {
    sharedTaskID[wid] = atomicAdd(g_TaskID, 32);
}
my_TaskID = sharedTaskID[wid] + tid.x % 32;
ProcessTask(my_TaskID);

Only lane 0 of each static warp grabs a block of 32 task IDs; the other lanes read sharedTaskID[wid] immediately afterwards with no explicit synchronization, implicitly assuming they execute in lockstep with lane 0.

29 Observation
Compute kernels usually contain divergent and non-divergent (coherent) code segments. Coalesced memory accesses usually occur in the coherent code segments, so DWF gives no benefit there.
[Diagram: static warps in a coherent segment, dynamic warps after divergence, and warps reset to the static arrangement at the reconvergence point where the coalesced loads/stores occur]

30 Thread Block Compaction
Block-wide reconvergence stack:
–Regroup threads within a block
Better reconvergence stack: likely convergence
–Converge before the immediate post-dominator
Robust:
–Avg. 22% speedup on divergent CUDA apps
–No penalty on others
[Figure: the per-warp (PC, RPC, active mask) stacks of warps 0-2 merged into a single block-wide stack; the compacted C and D entries issue as dynamic warps X/Y and U/T, and the E entry issues as the original warps 0-2]

31 Thread Block Compaction
Run a thread block like a warp:
–Whole block moves between coherent/divergent code
–Block-wide stack to track execution paths and reconvergence
Barrier at branch/reconvergence points:
–All available threads arrive at the branch
–Insensitive to warp scheduling
Warp compaction:
–Regrouping with all available threads
–If there is no divergence, gives the static warp arrangement
Together these address the DWF pathologies: starvation, implicit warp sync, and extra uncoalesced memory accesses.

32 Thread Block Compaction (Example)
A: K = A[tid.x];
B: if (K > 10)
C:     K = 10;
   else
D:     K = 0;
E: B = C[tid.x] + K;
[Figure: the block-wide stack for threads 1-12 as the block executes B, diverges into compacted C warps (threads 1, 2, 5, 7, 8, 11, 12) and D warps (threads 3, 4, 6, 9, 10), and reconverges into full warps at E]

33 Thread Block Compaction
A barrier at every basic block?! (Idle pipeline.) Switch to warps from other thread blocks:
–Multiple thread blocks run on a core
–Already done in most CUDA applications
[Figure: execution timelines of blocks 0-2 overlapping around their branch/warp-compaction points]

34 Microarchitecture Modifications
–Per-warp stack → block-wide stack
–I-buffer + TIDs → warp buffer (stores the dynamic warps)
–New unit: thread compactor (translates the active mask into compact dynamic warps)
[Pipeline diagram: fetch / I-cache / decode feeding a warp buffer with scoreboard and issue logic, register file, ALU and MEM units, plus the block-wide stack (branch target PC, active mask, valid bits, predicate, done/WID signals) driving the thread compactor that fills the warp buffer]
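A host-side sketch of what the thread compactor does logically, assuming only that it packs the active threads of a block into dense warps; the actual design keeps each thread in its home SIMD lane so its registers stay accessible, a constraint this sketch ignores.

#include <vector>

// Given a block-wide active mask (one flag per thread in the block),
// pack the active thread IDs into dense warps of 32.
std::vector<std::vector<int>> compact_block(const std::vector<bool> &active,
                                            int warp_size = 32)
{
    std::vector<std::vector<int>> warps;
    std::vector<int> current;
    for (int tid = 0; tid < (int)active.size(); ++tid) {
        if (!active[tid]) continue;          // skip threads not on this path
        current.push_back(tid);
        if ((int)current.size() == warp_size) {
            warps.push_back(current);        // emit a full dynamic warp
            current.clear();
        }
    }
    if (!current.empty()) warps.push_back(current);  // last, partial warp
    return warps;
}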

35 Likely-Convergence
The immediate post-dominator is conservative: all paths from a divergent branch must merge there. Convergence can happen earlier, when any two of the paths merge. The reconvergence stack is extended to exploit this; with it, TBC gets a 30% speedup on ray tracing.

while (i < K) {
    X = data[i];
A:  if (X == 0)
B:      result[i] = Y;
C:  else if (X == 1)
D:      break;
E:  i++;
}
F: return result[i];

(F is the immediate post-dominator of A; because the break at D is rarely taken, most threads actually reconverge earlier, at E.)

36 Experimental Results
Two benchmark groups: COHE = non-divergent CUDA applications, DIVG = divergent CUDA applications.
DWF suffers serious slowdowns from its pathologies; relative to the per-warp stack baseline, TBC shows no penalty on COHE and a 22% speedup on DIVG.

37 Next: How should the on-chip interconnect be designed? (MICRO 2010)

38 Throughput-Effective Design
Two approaches: reduce area, or increase performance. Look at the properties of bulk-synchronous parallel (aka "CUDA") workloads.

39 Throughput vs. Inverse of Area [plot]

40 Many-to-Few-to-Many Traffic Pattern
[Diagram: many cores C0...Cn inject requests through the request network into a few memory controllers MC0...MCm (limited MC input bandwidth); replies flow back through the reply network from the few MCs (MC output bandwidth) to the many cores (core injection bandwidth)]

41 Exploit the Traffic Pattern Somehow?
Keep bisection bandwidth the same, reduce router area…
Half-router:
–Limited connectivity: no turns allowed
–Might save ~50% of router crossbar area
[Diagram: half-router connectivity]

42 Checkerboard Routing: Example
Routing from a half-router to a half-router that is
–an even number of columns away
–not in the same row
Solution: needs two turns
–(1) route to an intermediate full-router using YX
–(2) then route to the destination using XY
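A rough sketch of just the routing decision above. It assumes a 2D mesh in which routers with an even coordinate sum are full routers and the rest are half-routers; the actual checkerboard placement in the MICRO 2010 design may differ, and choosing the intermediate full-router is left out.

struct Node { int x, y; };

// Assumed placement for illustration: full router where (x + y) is even.
static bool is_full_router(Node n) { return ((n.x + n.y) % 2) == 0; }

// Per the slide: a packet between two half-routers that are an even number
// of columns apart and not in the same row cannot be delivered with plain
// XY routing; it must be routed YX to an intermediate full-router first,
// then XY to the destination (two turns total).
bool needs_two_phase_route(Node src, Node dst)
{
    if (is_full_router(src) || is_full_router(dst)) return false;
    if (src.y == dst.y) return false;           // same row: XY suffices
    return (dst.x - src.x) % 2 == 0;            // even number of columns apart
}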

43 Multi-Port Routers at MCs
Increase the injection ports of the memory controller routers:
–Only increases the terminal bandwidth of the few nodes
–No change in bisection bandwidth
–Minimal area overhead (~1% of NoC area)
–Speedups of up to 25%
This reduces the bottleneck at the few nodes.

44 Results
Harmonic mean speedup of 13% across 24 benchmarks; total router area reduction of 14.2%.

45 Next: GPU Off-Chip Memory Bandwidth Problem (MICRO'09)

46 Background: DRAM
Row access: activate a row of a DRAM bank and load it into the row buffer (slow).
Column access: read and write data in the row buffer (fast).
Precharge: write the row buffer data back into the row (slow).
[Diagram: memory controller attached to DRAM banks, each with a row decoder, memory array, row buffer, and column decoder]

47 Background: DRAM Row Access Locality
tRC = row cycle time, tRP = row precharge time, tRCD = row activate time.
Definition: row access locality = the number of accesses to a row between row switches.
Higher row access locality → higher achievable DRAM bandwidth → higher performance. (GDDR uses multiple banks to hide latency.)
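To make that chain concrete, a back-of-envelope sketch under deliberately simple assumptions: a single bank, no overlap between row switching and data transfer, and illustrative timing values (real GDDR hides much of the switch cost by interleaving banks, as the slide notes).

// Rough single-bank model: every L column accesses we pay one row switch
// (precharge + activate); only the burst time moves data.
double dram_efficiency(double row_locality,   // L: column accesses per row switch
                       double t_rp,           // row precharge time
                       double t_rcd,          // row activate time
                       double t_burst)        // data transfer time per column access
{
    double busy  = row_locality * t_burst;    // time spent moving data
    double total = t_rp + t_rcd + busy;       // plus one row switch
    return busy / total;                      // fraction of peak bandwidth
}
// Example (illustrative numbers): L = 4, t_rp = t_rcd = 15 ns, t_burst = 2 ns
// gives 8 / 38 ≈ 21% of peak bandwidth.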

48 Interconnect Arbitration Policy: Round-Robin
[Diagram: a mesh router (N/E/S/W ports) with round-robin arbitration; the request stream for rows A, B, C and the stream for rows X, Y arrive interleaved at memory controllers 0 and 1]

49 The Trend: DRAM Access Locality in Many-Core
Inside the interconnect, interleaving of memory request streams reduces the DRAM access locality seen by the memory controller.
[Chart: pre-interconnect access locality (good) vs. post-interconnect access locality (bad)]

50 Today's Solution: Out-of-Order Scheduling
[Diagram: a request queue holding row A and row B requests from oldest to youngest; the scheduler drains the requests that hit the opened row (A), then switches the opened row to B]
The queue size needs to increase as the number of cores increases, and the search requires fully-associative logic, which raises circuit issues: cycle time, area, power.
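A minimal sketch of this row-hit-first (FR-FCFS-style) selection, assuming an age-ordered queue; the scan over the whole queue is exactly the fully-associative search whose cycle-time, area, and power cost the slide points out. Names are illustrative.

#include <cstdint>
#include <deque>

struct DramRequest { uint32_t row; /* plus address, read/write, ... */ };

// Pick the index of the next request to issue from an age-ordered queue
// (front = oldest). Returns -1 if the queue is empty.
int frfcfs_pick(const std::deque<DramRequest> &queue, uint32_t open_row)
{
    // First-ready: the oldest request that hits the currently open row.
    for (size_t i = 0; i < queue.size(); ++i)
        if (queue[i].row == open_row)
            return (int)i;
    // Otherwise FCFS: the oldest request (forces a row switch).
    return queue.empty() ? -1 : 0;
}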

51 Interconnect Arbitration Policy: HG
[Diagram: the same router and request streams as on slide 48, but under the HG policy the stream for rows A, B, C and the stream for rows X, Y reach memory controllers 0 and 1 without being interleaved]
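Assuming HG refers to a hold-grant style arbitration policy (the slide does not expand the acronym), a minimal sketch of the idea: a router output keeps granting the input it granted last cycle for as long as that input has packets, so one core's request stream stays grouped and its DRAM row locality survives the interconnect. The fallback policy and all names here are illustrative.

#include <vector>

// Hold-grant style arbiter sketch for one router output port.
// last_grant is the input granted in the previous cycle (-1 if none);
// has_request[i] is true if input i has a packet for this output.
int hold_grant_arbitrate(const std::vector<bool> &has_request, int last_grant)
{
    // Keep the previous winner while it still has something to send.
    if (last_grant >= 0 && has_request[last_grant])
        return last_grant;
    // Otherwise fall back to simple round-robin from the previous winner.
    int n = (int)has_request.size();
    for (int k = 1; k <= n; ++k) {
        int i = (last_grant + k + n) % n;
        if (has_request[i]) return i;
    }
    return -1;  // no requests this cycle
}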

52 Results: IPC Normalized to FR-FCFS
Crossbar network, 28 shader cores, 8 DRAM controllers, 8-entry DRAM queues:
–BFIFO: 14% speedup over a regular FIFO
–BFIFO+HG: 18% speedup over BFIFO, within 91% of FR-FCFS
[Bar chart over fwt, lib, mum, neu, nn, ray, red, sp, wp and their harmonic mean, comparing FIFO, BFIFO, BFIFO+HG, BFIFO+HMHG4, and FR-FCFS]


54 Thank you. Questions? aamodt@ece.ubc.ca http://www.gpgpu-sim.org

