
1 Weekly Report - Reduction
Ph.D. student: Leo Lee
Date: Oct. 30, 2009

2 Outline
Introduction
7 implementations
Work plan

3 Parallel Reduction
A common and important data-parallel primitive; the best example for learning optimization:
– Easy to implement, but hard to make efficient.
– NVIDIA supplies 7 versions for computing the sum of an array.
– I am learning them one by one.

4 Parallel Reduction
To deal with large arrays, the algorithm needs to use multiple thread blocks:
– Each block reduces a portion of the array.
How do we communicate partial results between thread blocks?
– There is no global synchronization: it would be expensive and prone to deadlock;
– instead, decompose the computation into multiple kernel launches, as sketched below.
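A minimal host-side sketch of that decomposition, assuming the element count is a power of two as in the 4M-element experiments below; `reduceArray` and `reduceKernel` are illustrative names, with `reduceKernel` standing for any of the block-level kernels on the following slides:

```cuda
#include <cuda_runtime.h>

// Block-level reduction kernel: writes one partial sum per block.
// Stands for any of the reduce0..reduce6 variants sketched later.
__global__ void reduceKernel(int *g_idata, int *g_odata);

// Repeatedly launch the kernel: each pass shrinks n elements to
// one partial sum per block, until a single value remains.
// Assumes n is a power of two, as in the experiments below.
void reduceArray(int *d_in, int *d_out, int n, int threads) {
    while (n > 1) {
        int t = (n < threads) ? n : threads;   // shrink block for small n
        int blocks = (n + t - 1) / t;
        reduceKernel<<<blocks, t, t * sizeof(int)>>>(d_in, d_out);
        // the partial sums become the next pass's input: swap buffers
        int *tmp = d_in; d_in = d_out; d_out = tmp;
        n = blocks;
    }
    cudaDeviceSynchronize();  // final sum is in the buffer last written
}
```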

5 Optimization Goal
Reach GPU peak performance:
– GFLOP/s for compute-bound kernels;
– bandwidth for memory-bound kernels.
Reductions have very low arithmetic intensity:
– 1 flop per element loaded;
– so try to achieve peak bandwidth!

6 Reduction 1: Interleaved Addressing
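The kernel on this slide is NVIDIA's version 1, reconstructed here as a sketch; the modulo test makes threads within a warp take different branches:

```cuda
// Reduction 1: interleaved addressing with modulo arithmetic.
// Each step doubles the stride s; thread tid participates only
// when tid % (2*s) == 0, which leaves warps highly divergent.
__global__ void reduce0(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];            // each thread loads one element
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)         // divergent branch within warps
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];  // block's partial sum
}
```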


8 Hardware
My computer:
– NVIDIA GeForce 8500 GT

9 Performance for 4M element reduction
Compared with NVIDIA's results, on my computer the same code gives:

Kernel   Time (ms)   Bandwidth (GB/s)
1        53          0.31
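As a sanity check on the bandwidth column: 4M ints are about 16.8 MB, and 16.8 MB / 53 ms ≈ 0.31 GB/s; the same formula (bytes loaded / time) reproduces the later rows.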

10 Reduction 2: Interleaved Addressing
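Version 2 keeps the interleaved access pattern but replaces the divergent modulo test with a strided index computation; a sketch reconstructed from NVIDIA's deck:

```cuda
// Reduction 2: interleaved addressing without the divergent branch.
// Active threads are packed at the low end of the block, so whole
// warps retire early instead of diverging.
__global__ void reduce1(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;                 // contiguous active threads
        if (index < blockDim.x)
            sdata[index] += sdata[index + s];    // strided access: bank conflicts
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
```

The new problem is the strided shared-memory access, which causes bank conflicts; version 3 removes them.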

11 Performance for 4M element reduction
Compared with NVIDIA's results, on my computer:

Kernel   Time (ms)   Bandwidth (GB/s)   Step speedup   Cumulative speedup
1        53          0.31               —              —
2        25          0.66               2.12           2.12

12 Reduction 3: Sequential Addressing
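Version 3 reverses the loop: the stride starts at half the block size and halves each step, so each thread reads two elements a fixed distance apart, which is conflict-free. A sketch:

```cuda
// Reduction 3: sequential addressing. Threads tid < s each add
// the element s positions away; accesses are contiguous, so there
// are no shared-memory bank conflicts.
__global__ void reduce2(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
```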

13 Performance for 4M element reduction
Compared with NVIDIA's results, on my computer:

Kernel   Time (ms)   Bandwidth (GB/s)   Step speedup   Cumulative speedup
1        53          0.31               —              —
2        25          0.66               2.12           2.12
3        12          1.37               2.01           4.42

14 Reduction 3: Sequential Addressing

15 Reduction 4: First Add During Load
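Version 4 halves the number of blocks: each thread performs its first add while loading into shared memory, so half the threads are no longer idle on the first iteration. A sketch (the host must launch half as many blocks as before):

```cuda
// Reduction 4: first add during load. Each block now covers
// 2*blockDim.x input elements; every thread sums two elements
// on the way into shared memory.
__global__ void reduce3(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];  // first add here
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
```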

16 Performance for 4M element reduction
Compared with NVIDIA's results, on my computer:

Kernel   Time (ms)   Bandwidth (GB/s)   Step speedup   Cumulative speedup
1        53          0.31               —              —
2        25          0.66               2.12           2.12
3        12          1.37               2.01           4.42
4        6.85        2.44               1.75           7.74

17 Instruction Bottleneck
Address arithmetic and loop overhead:
– At 17 GB/s (NVIDIA's result), the kernel is still far from bandwidth bound;
– the overhead is ancillary instructions that are not loads, stores, or arithmetic for the core computation.
Strategy: unroll loops.
– When s <= 32, only one warp is left;
– instructions are SIMD-synchronous within a warp;
– so the test if (tid < s) is unnecessary.
Unroll the last 6 iterations.

18 Reduction 5: Unroll the last Warp
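A sketch of version 5. The lockstep-warp assumption held on the G80-era hardware discussed here; on current GPUs the unrolled tail would need __syncwarp() or warp shuffles:

```cuda
// Unrolled final warp: once s <= 32 a single warp remains, its
// threads execute in lockstep (on this era's hardware), so the
// tid < s test and __syncthreads() are dropped. 'volatile' stops
// the compiler from caching sdata values in registers.
__device__ void warpReduce(volatile int *sdata, unsigned int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

// Reduction 5: as version 4, but the last 6 iterations are unrolled.
__global__ void reduce4(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();

    // stop the strided loop once a single warp remains
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid < 32) warpReduce(sdata, tid);   // last 6 iterations, unrolled
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
```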

19 Performance for 4M element reduction
Compared with NVIDIA's results, on my computer:

Kernel   Time (ms)   Bandwidth (GB/s)   Step speedup   Cumulative speedup
1        53          0.31               —              —
2        25          0.66               2.12           2.12
3        12          1.37               2.01           4.42
4        6.85        2.44               1.75           7.74
5        4.1         4.09               1.67           13

20 Further optimization
– Complete unrolling;
– Multiple adds per thread.
A combined sketch of these two ideas follows below.
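A rough sketch along the lines of NVIDIA's versions 6 and 7: the block size becomes a compile-time template parameter so the whole reduction tree unrolls, and a grid-stride loop lets each thread sum many elements before the tree starts. The launch shape, e.g. reduce6<256><<<blocks, 256, 256 * sizeof(int)>>>(d_in, d_out, n), is illustrative:

```cuda
// warpReduce as defined in the version-5 sketch (needs blockSize >= 64).
__device__ void warpReduce(volatile int *sdata, unsigned int tid);

// Versions 6/7 combined (sketch): every 'if (blockSize >= ...)' is
// resolved at compile time, so the tree is fully unrolled; the while
// loop performs multiple adds per thread. Assumes n is a multiple of
// 2*blockSize, as in the slides' power-of-two examples.
template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x;

    int sum = 0;
    while (i < n) {                       // multiple adds per thread
        sum += g_idata[i] + g_idata[i + blockSize];
        i += gridSize;
    }
    sdata[tid] = sum;
    __syncthreads();

    // fully unrolled tree: dead branches compile away
    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }
    if (tid < 32) warpReduce(sdata, tid);

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
```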

21 Other work
Read two papers on matrix multiplication;
Began reading books on parallel computing.

22 Work plan
Learn the last two reduction algorithms;
Re-read the CUDA programming guide.

23 Thanks

