
1 Back-Projection on GPU: Improving the Performance Wenlay “Esther” Wei Advisor: Jeff Fessler Mentor: Yong Long April 29, 2010

2 Overview
CPU vs. GPU
Original CUDA Program
Strategy 1: Parallelization Along Z-Axis
Strategy 2: Projection View Data in Shared Memory
Strategy 3: Reconstructing Each Voxel in Parallel
Strategy 4: Shared Memory Integration Between Two Kernels
Strategies Not Used
Conclusion

3 CPUs vs. GPUs
CPUs are optimized for sequential performance
– Sophisticated control logic
– Large cache memory
GPUs are optimized for parallel performance
– Large number of execution threads
– Minimal control logic required
Most applications use both the CPU and the GPU
– CUDA provides the programming model for doing so

4 Original CUDA Program
Back-projection step of the FDK cone-beam image reconstruction algorithm on the GPU
One kernel launched over an nx-by-ny grid of threads
Each thread reconstructs one "bar" of voxels sharing the same (x, y) coordinates
The kernel is executed once for each projection view
– Each view's back-projection result is added onto the image
2.2x speed-up for a 128x124x120-voxel image
My goal is to accelerate this algorithm
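
A minimal CUDA sketch of this structure follows. The kernel and helper names are illustrative, and fdk_sample is a placeholder for the actual FDK weighting and interpolation math, which is not shown in the slides.

#include <cuda_runtime.h>

// Placeholder: the real code applies the FDK distance weighting and
// interpolates the projection view at the detector position hit by the
// ray through voxel (ix, iy, iz).
__device__ float fdk_sample(const float *proj, int ix, int iy, int iz)
{
    return proj[0];
}

__global__ void backproject_view(float *image, const float *proj,
                                 int nx, int ny, int nz)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= nx || iy >= ny) return;

    // One thread sweeps its whole z-"bar" at (ix, iy) sequentially.
    for (int iz = 0; iz < nz; ++iz)
        image[(iz * ny + iy) * nx + ix] += fdk_sample(proj, ix, iy, iz);
}

// The kernel is launched once per projection view; each launch adds
// that view's contribution onto the image.
void backproject_all(float *d_image, float *const *d_views, int nviews,
                     int nx, int ny, int nz)
{
    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    for (int v = 0; v < nviews; ++v)
        backproject_view<<<grid, block>>>(d_image, d_views[v], nx, ny, nz);
    cudaDeviceSynchronize();
}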

5 Strategy 1: Parallelization Along Z-Axis
Eliminates the sequential loop over z
Avoids repeating computations
– An additional kernel is needed
– Parameters that are shared between the two kernels are stored in global memory
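
A sketch of the two-kernel split described above. The formulas for u and w are stand-ins for the real per-column geometry terms, which the slides do not spell out; a flat 1D thread index is used so the sketch also fits pre-Fermi GPUs like the GTX 260.

// Kernel 1: one thread per (ix, iy) column computes the terms that the
// whole z-bar shares and writes them to global memory for kernel 2.
__global__ void k1_column_params(float2 *params, int nx, int ny)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nx * ny) return;
    int ix = t % nx, iy = t / nx;
    float u = 0.1f * ix - 0.05f * iy;       // stand-in detector coordinate
    float w = 1.0f / (1.0f + 0.01f * iy);   // stand-in FDK weight
    params[t] = make_float2(u, w);
}

// Kernel 2: one thread per voxel -- the sequential z loop is gone.
__global__ void k2_backproject(float *image, const float *proj,
                               const float2 *params,
                               int nx, int ny, int nz, int ns, int nt)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nx * ny * nz) return;
    int ix = t % nx, iy = (t / nx) % ny, iz = t / (nx * ny);
    float2 p = params[iy * nx + ix];        // global-memory read per voxel
    int s = min(max((int)p.x, 0), ns - 1);  // detector column
    int r = min(max(iz, 0), nt - 1);        // detector row (toy mapping)
    image[t] += p.y * proj[r * ns + s];
}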

6 Strategy 1 Analysis
2.5x speed-up for a 128x124x120-voxel image
Global memory accesses prevent an even greater speed-up

7 Strategy 2: Projection View Data in Shared Memory
A modified version of the previous strategy
Threads that share the same projection view data are grouped into the same block
Every thread is responsible for copying a portion of the data to shared memory
Each thread must copy four pixels from global memory; otherwise the interpolated results would only be approximate
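
A sketch of the cooperative staging step, assuming a 16x16 thread block, a one-pixel apron so every bilinear lookup finds all four neighbouring pixels, and hypothetical patch-origin parameters s0 and t0; the interpolation and accumulation themselves are omitted.

#define TS 18   // 16x16 samples plus a 1-pixel apron for interpolation

__global__ void backproject_tile(float *image, const float *proj,
                                 int ns, int nt, int s0, int t0)
{
    __shared__ float tile[TS][TS];

    // Cooperative copy: the threads stride over the patch together,
    // each fetching a few pixels of the projection view.
    for (int i = threadIdx.y; i < TS; i += blockDim.y)
        for (int j = threadIdx.x; j < TS; j += blockDim.x) {
            int s = min(s0 + j, ns - 1);
            int t = min(t0 + i, nt - 1);
            tile[i][j] = proj[t * ns + s];
        }
    __syncthreads();   // the patch must be complete before anyone reads it

    // ... each thread now interpolates its sample from tile[][] instead
    // of from global memory (geometry and accumulation into image omitted).
}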

8 Strategy 3: Reconstructing Each Voxel in Parallel
Global memory loads and stores are costly operations
– But they are necessary in Strategy 1 to pass parameters between kernels
Trade global memory accesses for repeated computation
Perform the reconstruction of each voxel in parallel
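
The single-kernel variant, sketched with the same illustrative stand-in formulas as before: every thread recomputes its column's terms (nz times per column in total) instead of reading them from a global-memory parameter array.

__global__ void backproject_per_voxel(float *image, const float *proj,
                                      int nx, int ny, int nz, int ns, int nt)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nx * ny * nz) return;
    int ix = t % nx, iy = (t / nx) % ny, iz = t / (nx * ny);

    // Recomputed per voxel instead of loaded from global memory:
    float u = 0.1f * ix - 0.05f * iy;       // stand-in detector coordinate
    float w = 1.0f / (1.0f + 0.01f * iy);   // stand-in FDK weight

    int s = min(max((int)u, 0), ns - 1);
    int r = min(max(iz, 0), nt - 1);
    image[t] += w * proj[r * ns + s];
}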

9 Strategy 3 Analysis
The time saved on global memory accesses does compensate for the repeated computation
But it does not improve the performance overall
– Still a 2.5x speed-up for a 128x124x120-voxel image

10 Strategy 4: Shared Memory Integration Between Two Kernels
Modifies Strategy 1 to reduce the time spent on global memory accesses
Threads sharing the same parameters from kernel 1 reside in the same block in kernel 2
Only the first thread in a block has to load the data from global memory into shared memory
Threads within a block are synchronized after the memory load
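
A sketch of kernel 2 under this strategy, assuming one block per (ix, iy) column with threads along z and the same illustrative float2 parameter layout as the Strategy 1 sketch: thread 0 performs the single global-memory load, and the barrier makes the value visible to the rest of the block.

__global__ void k2_shared_params(float *image, const float *proj,
                                 const float2 *params,
                                 int nx, int ny, int nz, int ns, int nt)
{
    __shared__ float2 p;                    // parameters shared by the block
    int ix = blockIdx.x, iy = blockIdx.y, iz = threadIdx.x;

    if (iz == 0) p = params[iy * nx + ix];  // only thread 0 reads global memory
    __syncthreads();                        // every thread reaches this barrier
    if (iz >= nz) return;

    int s = min(max((int)p.x, 0), ns - 1);
    int r = min(max(iz, 0), nt - 1);
    image[(iz * ny + iy) * nx + ix] += p.y * proj[r * ns + s];
}

// Launch sketch: dim3 grid(nx, ny); blockDim.x rounded up to cover nz.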

11 Strategy 4 Analysis
7x speed-up for a 128x124x120-voxel image
8.5x speed-up for a 256x248x240-voxel image

12 Strategies Not Used #1: Resolving Thread Divergence
GPUs execute in single-instruction, multiple-thread (SIMT) style, in 32-thread warps
– Diverging threads within a warp execute each set of instructions sequentially
We expected thread divergence to be a problem and looked for solutions
– It turned out to occupy less than 1% of GPU processing time
– One likely reason is that most threads follow the same path when branching
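
A toy illustration of the divergence mechanism (not from the project code): when a condition splits a 32-thread warp, the two paths run one after the other; when all 32 threads agree, there is no penalty.

__global__ void divergence_demo(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (threadIdx.x % 2 == 0)      // splits every warp: the paths serialize
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;

    // By contrast, branching on a warp-uniform value such as
    // (blockIdx.x % 2) keeps all 32 threads of a warp on one path.
}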

13 Strategies Not Used #2: Constant Memory
Read-only memory, readable by all threads in a grid
Faster access than global memory
Considered copying all the projection view data into constant memory
But the GeForce GTX 260 has only 64 kilobytes of constant memory
– A single 128x128 projection view of 4-byte floats already uses all of it (128 x 128 x 4 bytes = 64 KB)
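
For reference, a sketch of what the rejected approach would look like; the names are hypothetical, and the access pattern is a toy one.

#include <cuda_runtime.h>

__constant__ float c_view[128 * 128];   // exactly 64 KB of 4-byte floats

__global__ void backproject_const(float *image, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) image[i] += c_view[i % (128 * 128)];  // toy access pattern
}

// Host side: one projection view would be copied into constant memory
// before each launch.
void upload_view(const float *h_view)
{
    cudaMemcpyToSymbol(c_view, h_view, 128 * 128 * sizeof(float));
}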

14 Conclusion
Must eliminate as many sequential processes as possible
Must avoid repeating computations
Must keep the number of global memory accesses to the minimum necessary
– One solution is to use shared memory
– Strategize the usage of shared memory in order to actually improve performance
Must consider whether a strategy suits the specific problem at hand
– Gather information on the performance

15 References
Kirk, David, and Wen-mei Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Burlington, MA: Morgan Kaufmann, 2010. Print.
Fessler, J. "Analytical Tomographic Image Reconstruction Methods." Print.
Special thanks to Professor Fessler, Yong Long, and Matt Lauer

16 Thank You For Listening Does anyone have questions?

