L13: Application Case Studies CS6235

Outline: Application Case Studies – Advanced MRI Reconstruction. Reading: Kirk and Hwu, Chapter 8, or from the sample book (chapter7-MRI-Case-Study.pdf).

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign.

Reconstructing MR Images
[Figure: reconstruction pipeline – Cartesian scan data feeds an FFT directly; spiral scan data feeds either gridding + FFT or a least-squares (LS) reconstruction.]
Cartesian scan data + FFT: slow scan, fast reconstruction, but images may be poor. The Cartesian scan has a straightforward implementation on scanners. The spiral scan offers reduced sensitivity to patient motion and a better ability to provide self-calibration.

Reconstructing MR Images (cont.)
[Figure: same pipeline – spiral scan data routed to the least-squares (LS) reconstruction.]
Spiral scan data + LS: superior images, at the expense of significantly more computation.

Least-Squares Reconstruction
Pipeline: Acquire data → Compute Q (for F^H F) → Compute F^H d → Find ρ.
- d: data samples from the scanner
- F: matrix that models the physics of the imaging process
- W: matrix that incorporates anatomical differences
- ρ: vector of reconstructed voxel values
- Q: a data structure that allows us to compute F^H F; Q depends only on the scanner configuration, so it is computed once per trajectory
- F^H d: depends on the scan data, so it is computed for every image acquisition (goal: reduce its time to minutes)
Aim: find ρ using a linear solver; ρ is computed with the conjugate gradient (CG) algorithm.
Plan: accelerate Q and F^H d on the GPU. On the CPU, Q takes 1–2 days, F^H d takes 6–7 hours, and solving for ρ takes about 1.5 minutes.
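For context, a sketch of the underlying math (the usual formulation of this case study; the regularization weight λ is my notation, not on the slide): the reconstruction solves a regularized least-squares problem whose normal equations are the linear system handed to the CG solver, which is why only F^H F (via Q) and F^H d ever need to be formed.

\[
\hat{\rho} \;=\; \arg\min_{\rho}\; \lVert F\rho - d\rVert_2^2 \;+\; \lambda\,\lVert W\rho\rVert_2^2
\qquad\Longrightarrow\qquad
\left(F^{H}F + \lambda\,W^{H}W\right)\rho \;=\; F^{H}d
\]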

(a) Q computation:

    for (m = 0; m < M; m++) {
      phiMag[m] = rPhi[m]*rPhi[m] + iPhi[m]*iPhi[m];
      for (n = 0; n < N; n++) {
        expQ = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
        rQ[n] += phiMag[m]*cos(expQ);
        iQ[n] += phiMag[m]*sin(expQ);
      }
    }

(b) F^H d computation:

    for (m = 0; m < M; m++) {
      rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
      iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
      for (n = 0; n < N; n++) {
        expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
        cArg = cos(expFhD);
        sArg = sin(expFhD);
        rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
        iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
      }
    }

Q vs. F^H d: the two computations have the same loop structure. Q is like a matrix-matrix computation; F^H d is like a matrix-vector computation.

Algorithms to Accelerate (the F^H d loop nest above)
- Scan data: M = number of scan points; kx, ky, kz = 3D scan data.
- Pixel data: N = number of pixels; x, y, z = input 3D pixel coordinates; rFhD, iFhD = output pixel data.
- Complexity is O(MN).
- Inner loop: 13 FP MUL or ADD ops, 2 FP trig ops, 12 loads, 2 stores.
Notes: both the inner and the outer loop are candidates for parallelization, but each inner-loop iteration accumulates onto the result of earlier iterations. The ratio of FP operations to memory accesses is about 1:1 at worst and roughly 3:1 at best.
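As a rough worked count (my tally, treating each trig call as one FP operation):

\[
\text{FP ops} \approx (13 + 2)\,MN = 15\,MN,
\qquad
\text{memory accesses} \approx (12 + 2)\,MN = 14\,MN
\]

so the naive code's compute-to-memory ratio is about 15/14 ≈ 1:1, matching the "worst case 1:1" note above; the optimizations that follow exist to push this ratio up.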

Step 1. Consider Parallelism to Evaluate Partitioning Options
(The F^H d loop nest, as above.) What about M total threads, one per outer-loop iteration, with each thread executing the whole inner loop over n? Note: M is O(millions). (Step 2) What happens to data accesses with this strategy?

One Possibility:

    __global__ void cmpFHd(float *rPhi, float *iPhi, float *rD, float *iD,
                           float *kx, float *ky, float *kz,
                           float *x, float *y, float *z,
                           float *rMu, float *iMu,
                           float *rFhD, float *iFhD, int N) {
      int m = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
      rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
      iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
      for (int n = 0; n < N; n++) {
        float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
        float cArg = cos(expFhD);
        float sArg = sin(expFhD);
        rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
        iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
      }
    }

One Possibility (cont.) — Major flaw: this code does not work correctly! Every thread (one per m) accumulates into the same rFhD[n] and iFhD[n] elements, so the += updates from different threads race with one another. The accumulation would need atomic operations.
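For illustration only (not the route the slides take next), a minimal sketch of what repairing this kernel with atomics would look like, using CUDA's atomicAdd for float (available on compute capability 2.0 and later); PI and FHD_THREADS_PER_BLOCK are assumed defined as elsewhere in the deck. The heavy contention on every output element is exactly why this path is unattractive:

    __global__ void cmpFHd_atomic(float *kx, float *ky, float *kz,
                                  float *x, float *y, float *z,
                                  float *rMu, float *iMu,
                                  float *rFhD, float *iFhD, int N) {
      int m = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
      for (int n = 0; n < N; n++) {
        float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
        float cArg = cos(expFhD);
        float sArg = sin(expFhD);
        /* Every thread updates every pixel; atomics make the conflicting
           updates correct but serialize them, destroying performance. */
        atomicAdd(&rFhD[n], rMu[m]*cArg - iMu[m]*sArg);
        atomicAdd(&iFhD[n], iMu[m]*cArg + rMu[m]*sArg);
      }
    }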

Back to the Drawing Board – maybe map the n loop to threads instead? (The sequential F^H d loop nest is unchanged; the question is now how to parallelize over n.)

Loop Fission

(a) F^H d computation:

    for (m = 0; m < M; m++) {
      rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
      iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
      for (n = 0; n < N; n++) {
        expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
        cArg = cos(expFhD);
        sArg = sin(expFhD);
        rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
        iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
      }
    }

(b) after loop fission:

    for (m = 0; m < M; m++) {
      rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
      iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
    }
    for (m = 0; m < M; m++) {
      for (n = 0; n < N; n++) {
        expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
        cArg = cos(expFhD);
        sArg = sin(expFhD);
        rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
        iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
      }
    }

A Separate cmpMu Kernel
After loop fission, the computation proceeds in two steps: two parallel kernels executed one after the other. The first step is implemented below:

    __global__ void cmpMu(float *rPhi, float *iPhi, float *rD, float *iD,
                          float *rMu, float *iMu) {
      int m = blockIdx.x * MU_THREADS_PER_BLOCK + threadIdx.x;
      rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
      iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
    }
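A hypothetical host-side launcher for cmpMu (my sketch; the slides do not show one). It assumes M is a multiple of MU_THREADS_PER_BLOCK and that the *_d arguments are device allocations already holding the scan data:

    #define MU_THREADS_PER_BLOCK 256   /* assumed block size */

    void launch_cmpMu(float *rPhi_d, float *iPhi_d, float *rD_d, float *iD_d,
                      float *rMu_d, float *iMu_d, int M) {
      int nBlocks = M / MU_THREADS_PER_BLOCK;   /* one thread per scan point m */
      cmpMu<<<nBlocks, MU_THREADS_PER_BLOCK>>>(rPhi_d, iPhi_d, rD_d, iD_d,
                                               rMu_d, iMu_d);
      /* No explicit sync needed before the FHd kernel: launches on the
         same (default) stream execute in order. */
    }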

Loop Interchange
Parallelizing the remaining loop nest. Options:
1. One thread per inner-loop iteration: N * M threads.
2. Loop interchange, so that each outer-loop iteration owns one rFhD[n] and iFhD[n].

(a) before loop interchange:

    for (m = 0; m < M; m++) {
      for (n = 0; n < N; n++) {
        expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
        cArg = cos(expFhD);
        sArg = sin(expFhD);
        rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
        iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
      }
    }

(b) after loop interchange (Figure 7.9: loop interchange of the F^H d computation):

    for (n = 0; n < N; n++) {
      for (m = 0; m < M; m++) {
        expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
        cArg = cos(expFhD);
        sArg = sin(expFhD);
        rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
        iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
      }
    }

The interchange is legal because all iterations at both loop levels are independent: the only cross-iteration interaction is the accumulation into rFhD[n] and iFhD[n], which can be reordered.

Step 2. New F^H d kernel (one thread per pixel n):

    __global__ void cmpFHd(float *kx, float *ky, float *kz,
                           float *x, float *y, float *z,
                           float *rMu, float *iMu,
                           float *rFhD, float *iFhD, int M) {
      int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
      for (int m = 0; m < M; m++) {
        float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
        float cArg = cos(expFhD);
        float sArg = sin(expFhD);
        rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
        iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
      }
    }

Step 3. Using Registers to Reduce Global Memory Traffic

    __global__ void cmpFHd(float *kx, float *ky, float *kz,
                           float *x, float *y, float *z,
                           float *rMu, float *iMu,
                           float *rFhD, float *iFhD, int M) {
      int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
      /* Cache this thread's per-pixel values in registers. */
      float xn_r = x[n]; float yn_r = y[n]; float zn_r = z[n];
      float rFhDn_r = rFhD[n]; float iFhDn_r = iFhD[n];
      for (int m = 0; m < M; m++) {
        float expFhD = 2*PI*(kx[m]*xn_r + ky[m]*yn_r + kz[m]*zn_r);
        float cArg = cos(expFhD);
        float sArg = sin(expFhD);
        rFhDn_r += rMu[m]*cArg - iMu[m]*sArg;
        iFhDn_r += iMu[m]*cArg + rMu[m]*sArg;
      }
      rFhD[n] = rFhDn_r; iFhD[n] = iFhDn_r;
    }

Still too much stress on memory! Only values indexed by n can be moved into this thread's registers; kx, ky, and kz are indexed by m, but note that they are read-only. Notes: the goal is a compute-to-memory access ratio of about 10:1. Because m is independent of n, all threads access the same kx[m], ky[m], kz[m] values in a given iteration; when such a value is served from a cache it is effectively broadcast to every thread in the warp, with no conflicts. Constant memory is only 64KB, however, so the scan data must be tiled.

Tiling of Scan Data
LS reconstruction uses multiple grids:
- Each grid operates on all pixels.
- Each grid operates on a distinct subset of the scan data.
- Each thread in the same grid operates on a distinct pixel.

Thread n operates on pixel n, over one chunk of the scan data (here 1/32 of it):

    for (m = 0; m < M/32; m++) {
      exQ = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
      rQ[n] += phi[m]*cos(exQ);
      iQ[n] += phi[m]*sin(exQ);
    }

Tiling k-space data to fit into constant memory: each launch processes one chunk of the scan data that fits in the 64KB of constant memory (4 bytes per float, hence 4*CHUNK_SIZE bytes per copy).

    __constant__ float kx_c[CHUNK_SIZE], ky_c[CHUNK_SIZE], kz_c[CHUNK_SIZE];
    ...
    int main() {
      int i;
      for (i = 0; i < M/CHUNK_SIZE; i++) {
        cudaMemcpyToSymbol(kx_c, &kx[i*CHUNK_SIZE], 4*CHUNK_SIZE);
        cudaMemcpyToSymbol(ky_c, &ky[i*CHUNK_SIZE], 4*CHUNK_SIZE);
        cudaMemcpyToSymbol(kz_c, &kz[i*CHUNK_SIZE], 4*CHUNK_SIZE);
        ...
        cmpFHd<<<N/FHD_THREADS_PER_BLOCK, FHD_THREADS_PER_BLOCK>>>
              (rPhi, iPhi, phiMag, x, y, z, rMu, iMu, CHUNK_SIZE);
      }
      /* Need to call the kernel one more time if M is not a
         perfect multiple of CHUNK_SIZE. */
    }

Revised Kernel for Constant Memory:

    __global__ void cmpFHd(float *x, float *y, float *z,
                           float *rMu, float *iMu,
                           float *rFhD, float *iFhD, int M) {
      int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
      float xn_r = x[n]; float yn_r = y[n]; float zn_r = z[n];
      float rFhDn_r = rFhD[n]; float iFhDn_r = iFhD[n];
      for (int m = 0; m < M; m++) {
        /* k-space data now comes from constant memory (broadcast to the warp). */
        float expFhD = 2*PI*(kx_c[m]*xn_r + ky_c[m]*yn_r + kz_c[m]*zn_r);
        float cArg = cos(expFhD);
        float sArg = sin(expFhD);
        rFhDn_r += rMu[m]*cArg - iMu[m]*sArg;
        iFhDn_r += iMu[m]*cArg + rMu[m]*sArg;
      }
      rFhD[n] = rFhDn_r; iFhD[n] = iFhDn_r;
    }

Sidebar: Cache-Conscious Data Layout
The kx, ky, kz, and phi components of the same scan point have spatial and temporal locality, which benefits:
- Prefetching
- Caching
The old layout (separate arrays) does not fully leverage that locality; the new layout (array of structs, next slide) does.
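To make the layout change concrete, a small sketch (my illustration; the next slide shows the deck's actual struct):

    /* Old layout (structure of arrays): kx[i], ky[i], kz[i] live in three
       separate regions, so one scan point touches three cache lines. */
    float kx[CHUNK_SIZE], ky[CHUNK_SIZE], kz[CHUNK_SIZE];

    /* New layout (array of structs): the three components of scan point i
       are adjacent, so one cache line serves all three accesses. */
    struct kdata { float x, y, z; };
    struct kdata k[CHUNK_SIZE];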

Adjusting K-space Data Layout:

    struct kdata {
      float x, y, z;
    };

    __constant__ struct kdata k_c[CHUNK_SIZE];
    ...
    int main() {
      int i;
      for (i = 0; i < M/CHUNK_SIZE; i++) {
        /* 12 bytes per scan point: three floats per struct. */
        cudaMemcpyToSymbol(k_c, &k[i*CHUNK_SIZE], 12*CHUNK_SIZE);
        cmpFHd<<<N/FHD_THREADS_PER_BLOCK, FHD_THREADS_PER_BLOCK>>>(...);
      }
    }

Figure 7.16: Adjusting the k-space data memory layout in the F^H d kernel:

    __global__ void cmpFHd(float *x, float *y, float *z,
                           float *rMu, float *iMu,
                           float *rFhD, float *iFhD, int M) {
      int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
      float xn_r = x[n]; float yn_r = y[n]; float zn_r = z[n];
      float rFhDn_r = rFhD[n]; float iFhDn_r = iFhD[n];
      for (int m = 0; m < M; m++) {
        float expFhD = 2*PI*(k_c[m].x*xn_r + k_c[m].y*yn_r + k_c[m].z*zn_r);
        float cArg = cos(expFhD);
        float sArg = sin(expFhD);
        rFhDn_r += rMu[m]*cArg - iMu[m]*sArg;
        iFhDn_r += iMu[m]*cArg + rMu[m]*sArg;
      }
      rFhD[n] = rFhDn_r; iFhD[n] = iFhDn_r;
    }

Overcoming Memory BW Bottlenecks
- Old bottleneck: off-chip bandwidth. Solution: constant memory. FP arithmetic to off-chip loads: 421 to 1.
- Performance: 22.8 GFLOPS (F^H d).
- New bottleneck: trig operations.

Using Special Function Units (SFUs)
- Old bottleneck: trig operations. Solution: the SFUs' fast transcendental hardware.
- Performance: 92.2 GFLOPS (F^H d).
- New bottleneck: overhead of branches and address calculations.
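A sketch of what this change looks like in the kernel's inner loop (my version; the slides do not show it). CUDA's __cosf() and __sinf() intrinsics execute on the SFUs at reduced precision, and compiling with nvcc --use_fast_math applies the same substitution to all cos/sin calls:

    for (int m = 0; m < M; m++) {
      float expFhD = 2*PI*(k_c[m].x*xn_r + k_c[m].y*yn_r + k_c[m].z*zn_r);
      /* SFU hardware intrinsics: much faster than cos()/sin(), at the cost
         of reduced (here acceptable) accuracy. */
      float cArg = __cosf(expFhD);
      float sArg = __sinf(expFhD);
      rFhDn_r += rMu[m]*cArg - iMu[m]*sArg;
      iFhDn_r += iMu[m]*cArg + rMu[m]*sArg;
    }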

Sidebar: Effects of Approximations
- Avoid the temptation to measure only the absolute error (I0 – I); it can be deceptively large or small.
- Metrics: PSNR (peak signal-to-noise ratio) and SNR (signal-to-noise ratio).
- Avoid the temptation to consider only the error in the computed value: some applications are resistant to approximations; others are very sensitive.
A.N. Netravali and B.G. Haskell, Digital Pictures: Representation, Compression, and Standards (2nd Ed.), Plenum Press, New York, NY (1995).
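For reference, the standard definitions behind those metrics (textbook formulas, not spelled out on the slide), with I0 the reference image, I the approximate reconstruction, and N the number of pixels:

\[
\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}\bigl(I_0[n]-I[n]\bigr)^2,
\qquad
\mathrm{PSNR} = 20\log_{10}\frac{\max_n\lvert I_0[n]\rvert}{\sqrt{\mathrm{MSE}}},
\qquad
\mathrm{SNR} = 20\log_{10}\frac{\mathrm{rms}(I_0)}{\sqrt{\mathrm{MSE}}}
\]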

Step 4: Overcoming Bottlenecks (Overheads)
- Old bottleneck: overhead of branches and address calculations. Solution: loop unrolling and experimental tuning.
- Performance: 145 GFLOPS (F^H d).
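A minimal sketch of the kind of unrolling meant here (my illustration; the actual unroll factor was chosen by experimental tuning). In CUDA this can be as simple as an unroll pragma on the m loop, which eliminates most of the per-iteration branch and induction-variable overhead:

    #pragma unroll 4   /* unroll factor 4 is an assumed example */
    for (int m = 0; m < M; m++) {
      float expFhD = 2*PI*(k_c[m].x*xn_r + k_c[m].y*yn_r + k_c[m].z*zn_r);
      float cArg = __cosf(expFhD);
      float sArg = __sinf(expFhD);
      rFhDn_r += rMu[m]*cArg - iMu[m]*sArg;
      iFhDn_r += iMu[m]*cArg + rMu[m]*sArg;
    }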

Summary of Results
[Table: for each reconstruction version the slide lists Q run time (minutes) and GFLOPS, F^H d run time (minutes) and GFLOPS, linear-solver time (minutes), and total reconstruction time (minutes). Only the values below are recoverable from the transcript.]

    Reconstruction             | F^H d GFLOPS | Total recon. time (m)
    ---------------------------+--------------+----------------------
    Gridding + FFT (CPU, DP)   |     N/A      |  0.39
    LS (CPU, DP)               |      -       |   -
    LS (CPU, SP)               |      -       |   -
    LS (GPU, Naive)            |      -       |   -
    LS (GPU, CMem)             |     22.8     |   -
    LS (GPU, CMem, SFU)        |     92.2     |   -
    LS (GPU, CMem, SFU, Exp)   |     145      |   -

Bottom line: the slide reports speedups of 228X and 357X for the fully optimized GPU reconstruction over the CPU baselines.