1 CS 179: Lecture 4 Lab Review 2

2 Groups of Threads (Hierarchy) (largest to smallest)
 “Grid”:
 All of the threads
 Size: (number of threads per block) * (number of blocks)
 “Block”:
 Size: user-specified
 Should be at least a multiple of 32 (often, higher is better)
 Upper limit given by hardware (512 on Tesla, 1024 on Fermi)
 Features: shared memory, synchronization

3 Groups of Threads
 “Warp”:
 Group of 32 threads
 Execute in lockstep (same instructions)
 Susceptible to divergence!

4 Divergence “Two roads diverged in a wood… …and I took both”

5 Divergence
 What happens:
 Executes normally until the if-statement
 Branches to calculate Branch A (blue threads)
 Goes back (!) and branches to calculate Branch B (red threads)

8 “Divergent tree” (figure: reduction tree pairing thread indices level by level; assume 512 threads in block)

9 “Divergent tree”
 // Let our shared memory block be partial_outputs[]
 // (assumes block size is a power of 2)
 __syncthreads();  // synchronize threads before starting
 for (int offset = 1; offset * 2 <= blockDim.x; offset *= 2) {
     if (threadIdx.x % (offset * 2) == 0)
         partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
     __syncthreads();
 }
 // Get thread 0 to atomicAdd() partial_outputs[0] to output
 if (threadIdx.x == 0)
     atomicAdd(output, partial_outputs[0]);

10 “Non-divergent tree” (figure; example purposes only! Real blocks are way bigger!)

11 “Non-divergent tree”
 // Let our shared memory block be partial_outputs[]
 // (assumes block size is a power of 2, so blockDim.x / 2 is the
 // highest power of 2 that's less than the block dimension)
 for (int offset = blockDim.x / 2; offset >= 1; offset /= 2) {
     if (threadIdx.x < offset)
         partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
     __syncthreads();
 }
 // Get thread 0 to atomicAdd() partial_outputs[0] to output
 if (threadIdx.x == 0)
     atomicAdd(output, partial_outputs[0]);

12 “Divergent tree” Where is the divergence?
 Two branches:
 Accumulate
 Do nothing
 If the second branch does nothing, then where is the performance loss?

14 “Divergent tree” – Analysis
 First iteration (reduce 512 -> 256):
 Warp of threads 0-31 (after calculating polynomial):
 Thread 0: Accumulate
 Thread 1: Do nothing
 Thread 2: Accumulate
 Thread 3: Do nothing
 …
 Warp of threads 32-63: (same thing!)
 …
 (up to) Warp of threads 480-511
 Number of executing warps: 512 / 32 = 16

15 “Divergent tree” – Analysis
 Second iteration (reduce 256 -> 128):
 Warp of threads 0-31 (after calculating polynomial):
 Thread 0: Accumulate
 Threads 1-3: Do nothing
 Thread 4: Accumulate
 Threads 5-7: Do nothing
 …
 Warp of threads 32-63: (same thing!)
 …
 (up to) Warp of threads 480-511
 Number of executing warps: 16 (again!)

16 “Divergent tree” – Analysis
 (Process continues, until offset is large enough to separate warps)

17 “Non-divergent tree” – Analysis
 First iteration (reduce 512 -> 256), part 1:
 Warp of threads 0-31: Accumulate
 Warp of threads 32-63: Accumulate
 …
 (up to) Warp of threads 224-255
 Then what?

18 “Non-divergent tree” – Analysis
 First iteration (reduce 512 -> 256), part 2:
 Warp of threads 256-287: Do nothing!
 …
 (up to) Warp of threads 480-511
 Number of executing warps: 256 / 32 = 8 (was 16 previously!)

19 “Non-divergent tree” – Analysis
 Second iteration (reduce 256 -> 128):
 Warps of threads 0-31, …, 96-127: Accumulate
 Warps of threads 128-159, …, 480-511: Do nothing!
 Number of executing warps: 128 / 32 = 4 (was 16 previously!)

20 What happened?
 “Implicit divergence”

21 Why did we do this?
 Performance improvements
 Reveals GPU internals!

22 Final Puzzle
 What happens when the polynomial order increases?
 All these threads that we think are competing… are they?

23 The Real World

24 In medicine…
 More sensitive devices -> more data!
 More intensive algorithms
 Real-time imaging and analysis
 Most are parallelizable problems! http://www.varian.com

25 MRI
 “k-space” – inverse FFT
 Real-time and high-resolution imaging http://oregonstate.edu

26 CT, PET
 Low-dose techniques
 Safety!
 4D CT imaging
 X-ray CT vs. PET CT
 Texture memory! http://www.upmccancercenter.com/

27 Radiation Therapy
 Goal: give sufficient dose to cancerous cells, minimize dose to healthy cells
 More accurate algorithms possible!
 Accuracy = safety!
 40 minutes -> 10 seconds http://en.wikipedia.org

28 Notes
 Office hours:
 Kevin: Monday 8-10 PM
 Ben: Tuesday 7-9 PM
 Connor: Tuesday 8-10 PM
 Lab 2: Due Wednesday (4/16), 5 PM

