Weekly Report: Reduction
Ph.D. Student: Leo Lee
Date: Oct. 30, 2009
Outline
–Introduction
–7 implementations
–Work plan
Parallel Reduction
Common and important data-parallel primitive
A good example for learning optimization:
–Easy to implement, but hard to achieve high efficiency
–NVIDIA supplies 7 versions for computing the sum of an array
–I am learning them one by one
Parallel Reduction
To handle large arrays, the algorithm needs to use multiple thread blocks
–Each block reduces a portion of the array
How can partial results be communicated between thread blocks?
–No global synchronization: it would be expensive and can deadlock
–Instead, decompose the reduction into multiple kernel launches, as sketched below
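A minimal host-side sketch of this decomposition; the kernel name reduceKernel and the block size are illustrative assumptions, not NVIDIA's actual code:

void reduceArray(float *d_in, float *d_out, unsigned int n)
{
    const unsigned int threads = 256;                      // block size chosen for illustration
    while (n > 1) {
        unsigned int blocks = (n + threads - 1) / threads;
        // each launch reduces n elements to 'blocks' partial sums
        reduceKernel<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);
        float *tmp = d_in; d_in = d_out; d_out = tmp;      // ping-pong the buffers
        n = blocks;                                        // partial sums feed the next pass
    }
}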
Optimization Goal
Reach GPU peak performance
–GFLOP/s: for compute-bound kernels
–Bandwidth: for memory-bound kernels
Reductions have low arithmetic intensity
–1 flop per element loaded
–So try to achieve peak bandwidth
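For reference, the effective bandwidth reported in the tables below follows from the kernel time; assuming 4M means 4×2^20 float (4-byte) elements, each kernel reads about 16.8 MB, so bandwidth (GB/s) ≈ 16.8 / time (ms).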
Reduction 1: Interleaved Addressing
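A minimal sketch of such an interleaved-addressing kernel, written from the description above; the real code is in NVIDIA's reduction sample, and the names here are illustrative:

__global__ void reduce0(float *g_in, float *g_out, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? g_in[i] : 0.0f;       // each thread loads one element into shared memory
    __syncthreads();

    // interleaved addressing: the stride doubles each step;
    // the modulo test makes the branch highly divergent within a warp
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_out[blockIdx.x] = sdata[0];  // write this block's partial sum
}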
Hardware
My computer:
–NVIDIA GeForce 8500 GT
Performance for 4M element reduction
[Table: Kernel, Time (ms), Bandwidth (GB/s); NVIDIA's results vs. my computer running the same code]
Reduction 2: Interleaved Addressing
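If this matches the second kernel in NVIDIA's sequence, the only change is replacing the divergent modulo test with a strided index; a sketch of just the inner loop:

// still interleaved addressing, but the active threads are now the first
// blockDim.x / (2*s) threads, so the branch is no longer divergent;
// the strided shared-memory access introduces bank conflicts instead
for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    unsigned int index = 2 * s * tid;
    if (index < blockDim.x)
        sdata[index] += sdata[index + s];
    __syncthreads();
}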
Performance for 4M element reduction
[Table: Kernel, Time (ms), Bandwidth (GB/s), Step Speedup, Cumulative Speedup; NVIDIA's results vs. my computer]
Reduction 3: Sequential Addressing
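A sketch of the sequential-addressing loop: the stride now halves each step and threads access consecutive shared-memory words, which removes the bank conflicts:

for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s)
        sdata[tid] += sdata[tid + s];    // conflict-free: consecutive addresses
    __syncthreads();
}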
Performance for 4M element reduction
[Table: Kernel, Time (ms), Bandwidth (GB/s), Step Speedup, Cumulative Speedup; NVIDIA's results vs. my computer]
Reduction 4: First Add During Load
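A sketch of the modified load, assuming each block now covers 2 × blockDim.x input elements (so the grid is halved):

// perform the first addition while loading from global memory
unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
sdata[tid] = (i < n) ? g_in[i] : 0.0f;
if (i + blockDim.x < n)
    sdata[tid] += g_in[i + blockDim.x];
__syncthreads();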
Performance for 4M element reduction
[Table: Kernel, Time (ms), Bandwidth (GB/s), Step Speedup, Cumulative Speedup; NVIDIA's results vs. my computer]
Instruction Bottleneck
Address arithmetic and loop overhead
–At 17 GB/s we are far from bandwidth bound
–The overhead is ancillary instructions that are not loads, stores, or arithmetic for the core computation
Strategy: unroll loops
–When s <= 32, only one warp is left
–Instructions are SIMD-synchronous within a warp
–So the test if (tid < s) is unnecessary
–Unroll the last 6 iterations
Reduction 5: Unroll the Last Warp
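A sketch of the unrolled tail; sdata is declared volatile so the compiler keeps the intermediate stores, and the warp-synchronous style (no __syncthreads, no tid < s test) relies on the implicit SIMD synchrony of G8x-era hardware such as the 8500 GT; block size assumed to be at least 64:

__device__ void warpReduce(volatile float *sdata, unsigned int tid)
{
    // last 6 iterations: only one warp is active, so no sync or test is needed
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}

// the main loop stops at s = 32, then the last warp is unrolled:
for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
    if (tid < s) sdata[tid] += sdata[tid + s];
    __syncthreads();
}
if (tid < 32) warpReduce(sdata, tid);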
Performance for 4M element reduction
[Table: Kernel, Time (ms), Bandwidth (GB/s), Step Speedup, Cumulative Speedup; NVIDIA's results vs. my computer]
Further Optimization
–Complete unrolling
–Multiple adds per thread
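A combined sketch of both ideas, roughly following the structure of the SDK sample (the kernel name is illustrative): the block size becomes a template parameter so the unrolled tree's dead branches are removed at compile time, and a grid-stride loop lets each thread accumulate many elements in a register before the shared-memory tree; block size assumed >= 64, warpReduce as above:

template <unsigned int blockSize>
__global__ void reduceFinal(float *g_in, float *g_out, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x;

    // multiple adds per thread: grid-stride accumulation in a register
    float sum = 0.0f;
    while (i < n) {
        sum += g_in[i];
        if (i + blockSize < n) sum += g_in[i + blockSize];
        i += gridSize;
    }
    sdata[tid] = sum;
    __syncthreads();

    // complete unrolling: these blockSize tests are resolved at compile time
    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }
    if (tid < 32) warpReduce(sdata, tid);

    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}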
Other Work
–Read two papers about matrix multiplication
–Began reading books on parallel computing
Work Plan
–Learn the last two reduction algorithms
–Re-read the programming user guide
Thanks