1 CS 179: Lecture 4 Lab Review 2

2 Groups of Threads (Hierarchy) (largest to smallest)
 “Grid”:
 All of the threads
 Size: (number of threads per block) * (number of blocks)
 “Block”:
 Size: user-specified
 Should be at least a multiple of 32 (often, higher is better)
 Upper limit given by hardware (512 on Tesla, 1024 on Fermi)
 Features: shared memory, synchronization

3 Groups of Threads
 “Warp”:
 Group of 32 threads
 Execute in lockstep (same instructions)
 Susceptible to divergence!

4 Divergence “Two roads diverged in a wood… …and I took both”

5 Divergence
 What happens:
 Executes normally until the if-statement
 Branches to calculate Branch A (blue threads)
 Goes back (!) and branches to calculate Branch B (red threads)

8 “Divergent tree” (figure: reduction tree pairing thread indices level by level; assume 512 threads in block)

9 “Divergent tree”
 // Let our shared memory block be partial_outputs[]
 // (assumes block size is a power of 2)
 __syncthreads();  // synchronize threads before starting
 for (int offset = 1; offset * 2 <= blockDim.x; offset *= 2) {
     if (threadIdx.x % (offset * 2) == 0)
         partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
     __syncthreads();
 }
 // Get thread 0 to atomicAdd() partial_outputs[0] to output
 if (threadIdx.x == 0)
     atomicAdd(output, partial_outputs[0]);

10 “Non-divergent tree” (figure; example purposes only! Real blocks are way bigger!)

11 “Non-divergent tree”
 // Let our shared memory block be partial_outputs[]
 // (assumes block size is a power of 2, so blockDim.x / 2 is the
 // highest power of 2 that's less than the block dimension)
 for (int offset = blockDim.x / 2; offset >= 1; offset /= 2) {
     if (threadIdx.x < offset)
         partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
     __syncthreads();
 }
 // Get thread 0 to atomicAdd() partial_outputs[0] to output
 if (threadIdx.x == 0)
     atomicAdd(output, partial_outputs[0]);

12 “Divergent tree” Where is the divergence?
 Two branches:
 Accumulate
 Do nothing
 If the second branch does nothing, then where is the performance loss?

14 “Divergent tree” – Analysis
 First iteration (reduce 512 -> 256):
 Warp of threads 0-31 (after calculating polynomial):
 Thread 0: Accumulate
 Thread 1: Do nothing
 Thread 2: Accumulate
 Thread 3: Do nothing
 …
 Warp of threads 32-63: (same thing!)
 …
 (up to) Warp of threads 480-511
 Number of executing warps: 512 / 32 = 16

15 “Divergent tree” – Analysis
 Second iteration (reduce 256 -> 128):
 Warp of threads 0-31 (after calculating polynomial):
 Thread 0: Accumulate
 Threads 1-3: Do nothing
 Thread 4: Accumulate
 Threads 5-7: Do nothing
 …
 Warp of threads 32-63: (same thing!)
 …
 (up to) Warp of threads 480-511
 Number of executing warps: 16 (again!)

16 “Divergent tree” – Analysis
 (Process continues, until offset is large enough to separate warps)

17 “Non-divergent tree” – Analysis
 First iteration (reduce 512 -> 256), part 1:
 Warp of threads 0-31: Accumulate
 Warp of threads 32-63: Accumulate
 …
 (up to) Warp of threads 224-255
 Then what?

18 “Non-divergent tree” – Analysis
 First iteration (reduce 512 -> 256), part 2:
 Warp of threads 256-287: Do nothing!
 …
 (up to) Warp of threads 480-511
 Number of executing warps: 256 / 32 = 8 (was 16 previously!)

19 “Non-divergent tree” – Analysis
 Second iteration (reduce 256 -> 128):
 Warps of threads 0-31, …, 96-127: Accumulate
 Warps of threads 128-159, …, 480-511: Do nothing!
 Number of executing warps: 128 / 32 = 4 (was 16 previously!)

20 What happened?
 “Implicit divergence”

21 Why did we do this?
 Performance improvements
 Reveals GPU internals!

22 Final Puzzle
 What happens when the polynomial order increases?
 All these threads that we think are competing… are they?

23 The Real World

24 In medicine…
 More sensitive devices -> more data!
 More intensive algorithms
 Real-time imaging and analysis
 Most are parallelizable problems! http://www.varian.com

25 MRI
 “k-space” – inverse FFT
 Real-time and high-resolution imaging http://oregonstate.edu

26 CT, PET
 Low-dose techniques
 Safety!
 4D CT imaging
 X-ray CT vs. PET CT
 Texture memory! http://www.upmccancercenter.com/

27 Radiation Therapy
 Goal: give sufficient dose to cancerous cells, minimize dose to healthy cells
 More accurate algorithms possible!
 Accuracy = safety!
 40 minutes -> 10 seconds http://en.wikipedia.org

28 Notes
 Office hours:
 Kevin: Monday 8-10 PM
 Ben: Tuesday 7-9 PM
 Connor: Tuesday 8-10 PM
 Lab 2: Due Wednesday (4/16), 5 PM

