# Jacobi Iterative Technique on Multi GPU Platform, by Ishtiaq Hossain and Venkata Krishna Nimmagadda


Application of Jacobi Iteration. Cardiac tissue is considered as a grid of cells. Each GPU thread takes care of the voltage calculation at one cell. This calculation requires the voltage values of the neighboring cells. Two different models are shown in the bottom right corner of the slide. To avoid synchronization issues, the voltage of cell 0 in the current time step is calculated from the values of the surrounding cells in the previous time step: $V_{\text{cell}_0}^{k} = f\left(V_{\text{cell}_1}^{k-1} + V_{\text{cell}_2}^{k-1} + V_{\text{cell}_3}^{k-1} + \dots + V_{\text{cell}_N}^{k-1}\right)$, where $N$ can be 6 or 18.
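The update rule above can be sketched as a serial reference sweep for the $N = 6$ case. This is only an illustration: the slides do not define $f$, so the sketch assumes a plain average of the six face neighbors, and the names `idx3` and `jacobi_sweep6` are hypothetical. The point it shows is the double buffering: every new value is read from the previous step's buffer, which is what removes the need for synchronization between cells.

```c
#include <assert.h>
#include <stddef.h>

/* Flatten (x, y, z) into an index for an n*n*n grid. */
static size_t idx3(size_t x, size_t y, size_t z, size_t n) {
    return (z * n + y) * n + x;
}

/* One Jacobi sweep over an n*n*n grid.
 * v_old holds the previous step, v_new receives the current step.
 * ASSUMPTION: f is the average of the 6 face neighbors (not from the slides).
 * Boundary cells are copied unchanged. */
void jacobi_sweep6(const double *v_old, double *v_new, size_t n) {
    assert(v_old != v_new); /* separate buffers are what make this Jacobi */
    for (size_t z = 0; z < n; ++z) {
        for (size_t y = 0; y < n; ++y) {
            for (size_t x = 0; x < n; ++x) {
                size_t c = idx3(x, y, z, n);
                if (x == 0 || y == 0 || z == 0 ||
                    x == n - 1 || y == n - 1 || z == n - 1) {
                    v_new[c] = v_old[c];
                    continue;
                }
                double sum = v_old[idx3(x - 1, y, z, n)] + v_old[idx3(x + 1, y, z, n)]
                           + v_old[idx3(x, y - 1, z, n)] + v_old[idx3(x, y + 1, z, n)]
                           + v_old[idx3(x, y, z - 1, n)] + v_old[idx3(x, y, z + 1, n)];
                v_new[c] = sum / 6.0;
            }
        }
    }
}
```

On a GPU, the body of the innermost loop would be the work of one thread, with the two buffers swapped between sweeps.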

Application of Jacobi Iteration. Initial values are provided to start the computation. In a single time step, the ODE and PDE parts are evaluated sequentially and added. By solving the finite difference equations, a thread calculates the voltage value of its cell in each time step. Figure 1 shows a healthy cell's voltage curve over time.

The Time Step. Solve the ODE part and add it to the current cell's voltage to obtain the voltage Vtemp1 for each cell. Using Vtemp1 as the initial value, perform Jacobi iteration over the surrounding values to generate Vtemp2. Vtemp2 is generated in every iteration for all the cells in the grid. Calculating Vtemp2 requires the Vtemp2 values of the previous time step. Once the iterations are completed, the final Vtemp2 is added to Vtemp1 to generate the voltage values for that time step.
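The control flow of one time step, read literally from the description above, might look like the following sketch. Several things here are assumptions rather than the authors' code: the grid is reduced to a 1D strip for brevity, the neighbor values are taken from the previous Jacobi iterate, the ODE step is forward Euler, and `example_rhs`, `example_f`, and `time_step` are hypothetical names standing in for the cell model and the stencil function, neither of which the slides specify.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef double (*ode_rhs_fn)(double v);                  /* hypothetical cell model */
typedef double (*stencil_fn)(double left, double right); /* hypothetical f */

/* Placeholder cell model dV/dt = -V (NOT the model used in the slides). */
double example_rhs(double v) { return -v; }

/* Placeholder stencil: average of the two neighbors on a 1D strip. */
double example_f(double left, double right) { return 0.5 * (left + right); }

/* One time step on a strip of n cells. scratch must hold 3*n doubles.
 * The two end cells are held fixed during the Jacobi iterations. */
void time_step(double *v, size_t n, double dt, int iters,
               ode_rhs_fn rhs, stencil_fn f, double *scratch) {
    assert(n >= 2 && iters >= 0);
    double *vtemp1 = scratch;
    double *cur = scratch + n;      /* Vtemp2 from the previous iteration */
    double *next = scratch + 2 * n; /* Vtemp2 being generated */

    /* ODE part: forward Euler step gives Vtemp1. */
    for (size_t i = 0; i < n; ++i) {
        vtemp1[i] = v[i] + dt * rhs(v[i]);
    }

    /* PDE part: Jacobi iterations starting from Vtemp1. */
    memcpy(cur, vtemp1, n * sizeof *cur);
    for (int k = 0; k < iters; ++k) {
        next[0] = cur[0];
        next[n - 1] = cur[n - 1];
        for (size_t i = 1; i + 1 < n; ++i) {
            next[i] = f(cur[i - 1], cur[i + 1]);
        }
        double *tmp = cur; cur = next; next = tmp;
    }

    /* Combine: final Vtemp2 is added to Vtemp1. */
    for (size_t i = 0; i < n; ++i) {
        v[i] = vtemp1[i] + cur[i];
    }
}
```

A call such as `time_step(v, n, dt, iters, example_rhs, example_f, scratch)` advances `v` in place by one step; swapping in a real cell model only changes the two callbacks.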

Correctness of Our Implementation

Memory Coalescing. Design of the data structure: `typedef struct __align__(N) { int a[N]; int b[N]; } NODE;` and `NODE nodes[N*N];`. N*N blocks of N threads each are launched so that all N threads of a block access values in consecutive memory locations. (Chart axis: time in milliseconds.)
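Why this layout helps can be seen by measuring the distance between the addresses that two neighboring threads would touch. The sketch below is plain host C, not the authors' kernel: it drops the CUDA-only `__align__` qualifier, picks `N = 32` arbitrarily, and adds a hypothetical per-cell struct `CELL` purely for contrast. Whether a given access pattern is actually coalesced also depends on the GPU generation.

```c
#include <assert.h>
#include <stddef.h>

#define N 32 /* arbitrary value for illustration; the slides leave N symbolic */

/* Layout from the slides (without __align__): arrays inside the structure. */
typedef struct { int a[N]; int b[N]; } NODE;

/* Hypothetical contrasting layout: one small structure per cell. */
typedef struct { int a; int b; } CELL;

/* Byte distance between field a as seen by thread t and thread t+1
 * when each thread indexes into the arrays of one NODE. */
size_t stride_arrays_in_struct(void) {
    NODE node;
    return (size_t)((char *)&node.a[1] - (char *)&node.a[0]);
}

/* Same distance when each thread owns one CELL in an array of CELLs. */
size_t stride_struct_per_cell(void) {
    CELL cells[2];
    return (size_t)((char *)&cells[1].a - (char *)&cells[0].a);
}
```

With arrays inside the structure, thread `t` and thread `t+1` read adjacent `int` values, so the N threads of a block cover one contiguous run of memory. With one structure per cell, the same field is separated by the size of the whole structure.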

Serial vs. Single GPU. Hey serial, what took you so long? (Chart axis: time in seconds.) 128×128×128 gives us 309 secs. Enormous speedup.

Step 1 Lessons Learnt. Choose a data structure that maximizes memory coalescing. The mechanics of serial code and parallel code are very different. Develop algorithms that address the areas where the serial code takes a long time.

Multi GPU Approach. The stages are: create multiple host threads, establish multiple host-GPU contexts, solve the cell model ODE, solve the communication model PDE, and visualize the data. OpenMP is used to launch the host threads. Data partitioning and kernel invocation follow for the GPU computation. The ODE is solved using the forward Euler method, and the PDE is solved using Jacobi iteration.
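A minimal host-side sketch of the thread-per-GPU pattern described above follows. It assumes the grid is cut into contiguous slabs of z-planes, which the slides do not state, and the names `slab_range`, `run_partitions`, and the `work` callback are hypothetical. In a real CUDA program the callback would select its device (for example with `cudaSetDevice`), copy its slab, and launch kernels; here it is left abstract so the sketch compiles as plain C, with or without OpenMP enabled.

```c
#include <assert.h>
#include <stddef.h>

/* Split n_planes z-planes into `parts` contiguous slabs, one per GPU.
 * The first (n_planes % parts) slabs get one extra plane. */
void slab_range(size_t n_planes, int parts, int part,
                size_t *start, size_t *count) {
    assert(parts > 0 && part >= 0 && part < parts);
    size_t base = n_planes / (size_t)parts;
    size_t extra = n_planes % (size_t)parts;
    size_t p = (size_t)part;
    *count = base + (p < extra ? 1 : 0);
    *start = p * base + (p < extra ? p : extra);
}

/* Launch one host thread per slab. work(part, start, count) stands in
 * for device selection, host-to-device copy, and kernel invocation. */
void run_partitions(size_t n_planes, int parts,
                    void (*work)(int part, size_t start, size_t count)) {
#ifdef _OPENMP
#pragma omp parallel for num_threads(parts)
#endif
    for (int part = 0; part < parts; ++part) {
        size_t start, count;
        slab_range(n_planes, parts, part, &start, &count);
        work(part, start, count);
    }
}
```

Keeping the partition arithmetic in its own function makes it easy to check that the slabs cover the grid exactly once before any GPU code is involved.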

Inter GPU Data Partitioning. Let both cubes have dimensions s × s × s. The interface region of the left one is $2s^2$; the interface region of the right one is $3s^2$. After division, the data is copied into the device (global) memory of each GPU. Input data: a 2D array of structures, where the structures contain arrays. The data resides in host memory. (The figure labels the interface region.)

Solving PDEs Using Multiple GPUs. During each Jacobi iteration, threads use global memory to share data among themselves. Threads in the interface region need data from other GPUs, and this inter-GPU sharing is done through host memory. A separate kernel is launched that handles the interface region computation and copies the result back to device memory, so the GPUs are synchronized. Once the PDE calculation is completed for one time step, all values are written back to host memory.

Solving PDEs Using Multiple GPUs (timeline figure). Stages shown along the time axis: host-to-device copy, GPU computation, device-to-host copy, interface region computation.

The Circus of Inter GPU Sync. Ghost cell computing! Pad with dummy cells at the inter-GPU interfaces to reduce communication. Let's make the other CPU cores work: 4 of the 8 CPU cores hold GPU contexts, so use the 4 free cores to do the interface computation. Simple is best: launch new kernels with different dimensions to handle the cells at the interface.
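The ghost-cell idea, combined with sharing through host memory, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each GPU's partition is modeled as a slab of z-planes with one ghost plane on each side, the "device" buffers are ordinary host arrays, and each `memcpy` stands in for a device-to-host or host-to-device copy. The names `Slab`, `plane_ptr`, and `exchange_interface` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A slab owns `planes` z-planes and is padded with one ghost plane on
 * each side: plane 0 and plane planes+1 are ghosts. */
typedef struct {
    double *data;       /* (planes + 2) * plane_cells values */
    size_t planes;      /* number of owned planes */
    size_t plane_cells; /* cells per plane */
} Slab;

static double *plane_ptr(const Slab *s, size_t plane) {
    return s->data + plane * s->plane_cells;
}

/* Exchange interface planes between two neighboring slabs through a
 * host staging buffer that holds one plane. */
void exchange_interface(Slab *lower, Slab *upper, double *host_staging) {
    assert(lower->plane_cells == upper->plane_cells);
    size_t bytes = lower->plane_cells * sizeof(double);

    /* lower's last owned plane -> host -> upper's bottom ghost plane */
    memcpy(host_staging, plane_ptr(lower, lower->planes), bytes);
    memcpy(plane_ptr(upper, 0), host_staging, bytes);

    /* upper's first owned plane -> host -> lower's top ghost plane */
    memcpy(host_staging, plane_ptr(upper, 1), bytes);
    memcpy(plane_ptr(lower, lower->planes + 1), host_staging, bytes);
}
```

After the exchange, each slab can run a Jacobi sweep over its owned planes without reaching into the other slab's memory, because the neighbor values it needs sit in its own ghost planes.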

Various Stages. Interestingly, solving the PDE using Jacobi iteration takes most of the time.

Scalability. A = 32×32×32 cells executed by each GPU; B = 32×32×32 cells executed by each GPU; C = 32×32×32 cells executed by each GPU; D = 32×32×32 cells executed by each GPU.

Step 2 Lessons Learnt. The Jacobi iterative technique appears to scale well. Interface selection is very important. Making a multi-GPU program generic takes a lot of effort from the programmer.

Let's Watch a Video

Q & A
