CUDA Image Registration 29 Oct 2008 Richard Ansorge Medical Image Registration A Quick Win Richard Ansorge.

1 Medical Image Registration: A Quick Win. Richard Ansorge

2 The problem
CT, MRI, PET and ultrasound produce 3D volume images, typically 256 x 256 x 256 = 16,777,216 image voxels.
Combining modalities (inter-modality) gives extra information.
Repeated imaging over time with the same modality, e.g. MRI (intra-modality), is equally important.
The images have to be spatially registered.

3 Example – brain lesion: CT, MRI, PET

4 PET-MR Fusion
The PET image shows metabolic activity. This complements the MR structural information.

5 Registration Algorithm
[Flowchart: transform Im B to match Im A, giving Im B′; compute the cost function between Im A and Im B′; good fit? If yes, done; if no, update the transform parameters and repeat.]
NB: Cost function calculation dominates for 3D images and is inherently parallel.
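
The loop on this slide can be sketched on the CPU as follows. This is only an illustration: the name `register_params` and the simple coordinate-search update are assumptions made for the example, not the optimiser used in the talk (the example run later reports an "Amoeba" optimiser). The `cost` callback stands in for the cost function calculation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: minimise cost(params) by trying +step and -step on
// each parameter in turn, halving the step whenever no move improves the fit.
std::vector<double> register_params(
    const std::function<double(const std::vector<double>&)>& cost,
    std::vector<double> params, double step, double min_step)
{
    assert(step > 0.0 && min_step > 0.0);
    double best = cost(params);                 // compute cost function
    while (step >= min_step) {                  // stop once steps are tiny
        bool improved = false;
        for (std::size_t i = 0; i < params.size(); ++i) {
            for (double delta : {step, -step}) {
                std::vector<double> trial = params;  // update transform parameters
                trial[i] += delta;
                const double c = cost(trial);
                if (c < best) {                 // keep the better fit
                    best = c;
                    params = trial;
                    improved = true;
                }
            }
        }
        if (!improved) step *= 0.5;             // refine the search
    }
    return params;
}
```

Every pass through the inner loop calls `cost` once, which is why the cost function dominates the run time and is the part worth moving to the GPU.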

6 Transformations
General affine transform has 12 parameters.
Polynomial transformations can be useful for e.g. pincushion-type distortions.
Local, non-linear transformations, e.g. using cubic B-splines, are increasingly popular but very computationally demanding.
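
The slide's equations did not survive in this transcript. For reference, the standard homogeneous form of a 3D affine transform is given below. The symbol names $a_{ij}$ and $t_x, t_y, t_z$ are chosen here for illustration; the nine $a_{ij}$ plus three translations give the 12 parameters.

```latex
\begin{pmatrix} x' \\ y' \\ z' \\ 1 \end{pmatrix}
=
\begin{pmatrix}
a_{11} & a_{12} & a_{13} & t_x \\
a_{21} & a_{22} & a_{23} & t_y \\
a_{31} & a_{32} & a_{33} & t_z \\
0      & 0      & 0      & 1
\end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}
```

The fixed bottom row (0, 0, 0, 1) carries no free parameters, so only the top three rows are needed to transform a voxel coordinate.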

7 We tried this before

8 Now - Desktop PC - Windows XP. Needs 400 W power supply

9 Free Software: CUDA & Visual C++ Express

10 Visual C++ SDK in action

11 Visual C++ SDK in action

12 Architecture

13 9600 GT Device Query. Current GTX 280 has 240 cores!

14 Matrix Multiply from SDK. NB: using 4-byte floats

15 Matrix Multiply (from SDK)

16 Matrix Multiply (from SDK)

17 Matrix Multiply (from SDK)

18 Image Registration CUDA Code

19
#include
texture tex1;                  // Target Image in texture
__constant__ float c_aff[16];  // 4x4 Affine transform

// Function arguments are image dimensions and pointers to output buffer b
// and Source Image s. These buffers are in device memory
__global__ void d_costfun(int nx,int ny,int nz,float *b,float *s)
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;  // Thread ID matches
    int iy = blockIdx.y*blockDim.y + threadIdx.y;  // Source Image x-y
    float x = (float)ix;
    float y = (float)iy;
    float z = 0.0f;  // start with slice zero
    float4 v = make_float4(x,y,z,1.0f);
    float4 r0 = make_float4(c_aff[ 0],c_aff[ 1],c_aff[ 2],c_aff[ 3]);
    float4 r1 = make_float4(c_aff[ 4],c_aff[ 5],c_aff[ 6],c_aff[ 7]);
    float4 r2 = make_float4(c_aff[ 8],c_aff[ 9],c_aff[10],c_aff[11]);
    float4 r3 = make_float4(c_aff[12],c_aff[13],c_aff[14],c_aff[15]);  // 0,0,0,1?
    float tx = dot(r0,v);  // Matrix Multiply using dot products
    float ty = dot(r1,v);
    float tz = dot(r2,v);
    float source = 0.0f;
    float target = 0.0f;
    float cost = 0.0f;
    uint is = iy*nx+ix;
    uint istep = nx*ny;
    for(int iz=0;iz
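
The kernel listing is cut off inside its z loop in this transcript. As a point of comparison, here is a minimal CPU reference sketch of an absolute-difference cost (the example run later names cost function 1 as "abs-difference"). It is not the author's kernel: the function name `costfun_reference`, nearest-voxel sampling, and skipping voxels that map outside the target are all assumptions made for this example. It keeps the kernel's conventions of a row-major 4x4 matrix, one partial sum per (ix, iy) column, and source index `iy*nx+ix` plus `nx*ny` per slice.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference sketch: for every source voxel, apply the row-major 4x4
// affine `aff`, sample the target at the nearest voxel, and accumulate
// |source - target| into one partial sum per (ix, iy) column.
std::vector<float> costfun_reference(int nx, int ny, int nz,
                                     const float aff[16],
                                     const std::vector<float>& source,
                                     const std::vector<float>& target)
{
    const std::size_t nvox = static_cast<std::size_t>(nx) * ny * nz;
    assert(source.size() == nvox && target.size() == nvox);
    std::vector<float> partial(static_cast<std::size_t>(nx) * ny, 0.0f);
    for (int iy = 0; iy < ny; ++iy) {
        for (int ix = 0; ix < nx; ++ix) {
            float cost = 0.0f;
            for (int iz = 0; iz < nz; ++iz) {
                const float x = static_cast<float>(ix);
                const float y = static_cast<float>(iy);
                const float z = static_cast<float>(iz);
                // Rows 0..2 of the matrix times (x, y, z, 1).
                const float tx = aff[0]*x + aff[1]*y + aff[2]*z  + aff[3];
                const float ty = aff[4]*x + aff[5]*y + aff[6]*z  + aff[7];
                const float tz = aff[8]*x + aff[9]*y + aff[10]*z + aff[11];
                const long jx = std::lround(tx);
                const long jy = std::lround(ty);
                const long jz = std::lround(tz);
                // Assumption: voxels mapped outside the target add nothing.
                if (jx < 0 || jx >= nx || jy < 0 || jy >= ny ||
                    jz < 0 || jz >= nz) continue;
                const std::size_t is =
                    (static_cast<std::size_t>(iz) * ny + iy) * nx + ix;
                const std::size_t it =
                    (static_cast<std::size_t>(jz) * ny + jy) * nx + jx;
                cost += std::fabs(source[is] - target[it]);
            }
            partial[static_cast<std::size_t>(iy) * nx + ix] = cost;
        }
    }
    return partial;
}
```

Each (ix, iy) column is independent of every other, which is the property that lets the GPU version assign one thread per column.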

25 Host Code Initialization Fragment
...
blockSize.x = blockSize.y = 16;  // multiples of 16 a VERY good idea
gridSize.x = (w2+15) / blockSize.x;
gridSize.y = (h2+15) / blockSize.y;
// allocate working buffers, image is W2 x H2 x D2
cudaMalloc((void**)&dbuff,w2*h2*sizeof(float));  // passed as "b" to kernel
bufflen = w2*h2;
Array1D shbuff = Array1D (bufflen);
shbuff.Zero();
hbuff = shbuff.v;
cudaMalloc((void**)&dnewbuff,w2*h2*d2*sizeof(float));  // passed as "s" to kernel
cudaMemcpy(dnewbuff,vol2,w2*h2*d2*sizeof(float),cudaMemcpyHostToDevice);
e = make_float3((float)w2/2.0f,(float)h2/2.0f,(float)d2/2.0f);  // fixed rotation origin
o = make_float3(0.0f);  // translations
r = make_float3(0.0f);  // rotations
s = make_float3(1.0f,1.0f,1.0f);  // scale factors
t = make_float3(0.0f);  // tans of shears
...
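
The `(w2+15) / blockSize.x` expressions in this fragment are a ceiling division by the block size of 16. A small host-side sketch of the same arithmetic, with the hypothetical helper name `blocks_for` chosen for this example:

```cpp
#include <cassert>

// Number of blocks needed so that blocks * block_size >= n
// (integer ceiling division, as in (w2+15)/16 for block_size 16).
inline int blocks_for(int n, int block_size)
{
    assert(n >= 0 && block_size > 0);
    return (n + block_size - 1) / block_size;
}
```

For the 240 x 256 slices of the example run this gives 15 x 16 blocks with no spare threads. For a width that is not a multiple of 16, say 250, it gives 16 blocks covering 256 threads, so a kernel launched this way would normally need a bounds check on `ix` and `iy` before touching memory.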

26 Calling the Kernel
double nr_costfun(Array1D &a)
{
    static Array2D affine = Array2D (4,4);  // a holds current transformation
    double sum = 0.0;
    make_affine_from_a(nr_fit,affine,a);  // convert to 4x4 matrix of floats
    cudaMemcpyToSymbol(c_aff,affine.v[0],4*4*sizeof(float));  // load constant mem
    d_costfun >>(w2,h2,d2,dbuff,dnewbuff);  // run kernel
    CUT_CHECK_ERROR("kernel failed");  // OK?
    cudaThreadSynchronize();  // make sure all done
    // copy partial sums from device to host
    cudaMemcpy(hbuff,dbuff,bufflen*sizeof(float),cudaMemcpyDeviceToHost);
    for(int iy=0;iy1){
        printf("call %d costfun %12.0f, a:",calls,sum);
        for(int i=0;i
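
The host loop that adds up the copied partial sums is garbled in this transcript. A minimal sketch of that final reduction is given below, assuming it simply accumulates the `bufflen` floats in `hbuff` into the `double sum` declared in the listing. The function name `sum_partials` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Add up per-column partial sums (floats copied back from the device)
// in a double accumulator on the host.
double sum_partials(const float* hbuff, int bufflen)
{
    assert(bufflen >= 0 && (hbuff != nullptr || bufflen == 0));
    double sum = 0.0;
    for (int i = 0; i < bufflen; ++i) {
        sum += static_cast<double>(hbuff[i]);
    }
    return sum;
}
```

Accumulating in double is a sensible choice here: a float has a 24-bit significand, so it cannot represent every integer above 2^24 = 16,777,216, which is the voxel count of a 256 x 256 x 256 volume.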

27 Example Run (240x256x176 images)
C: >airwc
airwc v2.5 Usage: AirWc opts(12rtdgsf)
C:>airwc sb1 sb2 junk 1f
NIFTI Header on File sb1.nii converting short to float
NIFTI Header on File sb2.nii converting short to float
Using device 0: GeForce 9600 GT
Initial correlation using cost function 1 (abs-difference)
Amoeba time: 4297, calls 802, cost:
Cuda Total time 4297, Total calls 802
File dofmat.mat written
Nifti file junk.nii written, bswop=0
Full Time 6187
timer ms timer 1 0 ms timer ms timer ms timer 4 0 ms
Total secs
Final Transformation:
Final rots and shifts
scales and shears

28 Desktop 3D Registration
Registration with CUDA: 6 Seconds
Registration with FLIRT: Minutes

29 Comments
This is actually already very useful. Almost interactive (add visualisation).
Further speedups possible:
– Faster card
– Smarter optimiser
– Overlap IO and kernel execution
– Tweak CUDA code
Extend to non-linear local registration.

30 Intel Larrabee?
Figure 1: Schematic of the Larrabee many-core architecture: The number of CPU cores and the number and type of co-processors and I/O blocks are implementation-dependent, as are the positions of the CPU and non-CPU blocks on the chip.
Porting from CUDA to Larrabee should be easy.

31 Thank you

