JACOBI ITERATIVE TECHNIQUE ON MULTI GPU PLATFORM By Ishtiaq Hossain and Venkata Krishna Nimmagadda

APPLICATION OF JACOBI ITERATION Cardiac tissue is modeled as a grid of cells. Each GPU thread takes care of the voltage calculation at one cell. This calculation requires the voltage values of the neighboring cells; two different neighborhood models are shown in the bottom right corner. V_cell0 in the current time step is calculated using values of the surrounding cells from the previous time step, to avoid synchronization issues: V_cell0^(k) = f(V_cell1^(k-1) + V_cell2^(k-1) + V_cell3^(k-1) + … + V_cellN^(k-1)), where N can be 6 or 18.
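A minimal CUDA sketch of this update for the N = 6 case, assuming the grid is stored as flat float arrays Vold/Vnew of size s x s x s; the simple averaging stands in for the actual cell-model function f, and all names here are hypothetical:

// One thread updates one cell from its 6 neighbours, reading only
// previous-time-step values (Vold) so no synchronization is needed.
__global__ void jacobiCellUpdate(const float* Vold, float* Vnew, int s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < 1 || y < 1 || z < 1 || x > s - 2 || y > s - 2 || z > s - 2) return;
    int i = (z * s + y) * s + x;                 // flat index of the current cell
    float sum = Vold[i - 1]     + Vold[i + 1]    // x neighbours
              + Vold[i - s]     + Vold[i + s]    // y neighbours
              + Vold[i - s * s] + Vold[i + s * s]; // z neighbours
    Vnew[i] = sum / 6.0f;                        // stand-in for f(...) of the cell model
}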

APPLICATION OF JACOBI ITERATION Initial values are provided to start the computation. In a single time step, the ODE and PDE parts are sequentially evaluated and added. By solving the finite-difference equations, the voltage value of every cell in a time step is calculated by a thread. Figure 1 shows a healthy cell's voltage curve over time. (Figure 1)

THE TIME STEP Solve the ODE part and add it to the current cell's voltage to obtain the voltage Vtemp1 for each cell. Using Vtemp1 as the initial value, perform Jacobi iteration over the surrounding values to generate Vtemp2. Vtemp2 is regenerated in every iteration for all the cells in the grid, and its calculation requires the Vtemp2 values of the previous iteration. Once the iterations are completed, the final Vtemp2 is added to Vtemp1 to produce the voltage values for that time step.
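Read as host-side code, one time step might look like the sketch below; the kernel names solveODE, jacobiStep and addArrays, the buffer names, and the iteration count are assumptions for illustration, not the authors' exact code:

// Sketch of one time step (hypothetical names).
solveODE<<<grid, block>>>(d_V, d_Vtemp1, dt);                      // forward Euler -> Vtemp1
cudaMemcpy(d_Vtemp2, d_Vtemp1, bytes, cudaMemcpyDeviceToDevice);   // Jacobi starts from Vtemp1
for (int it = 0; it < numJacobiIters; ++it) {
    jacobiStep<<<grid, block>>>(d_Vtemp2, d_Vtemp2New, s);         // uses previous-iteration values
    float* tmp = d_Vtemp2; d_Vtemp2 = d_Vtemp2New; d_Vtemp2New = tmp; // swap buffers
}
addArrays<<<grid, block>>>(d_Vtemp1, d_Vtemp2, d_V);               // V = Vtemp1 + Vtemp2 for this step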

CORRECTNESS OF OUR IMPLEMENTATION

MEMORY COALESCING Design of the data structure: typedef struct __align__(N) { int a[N]; int b[N]; } NODE; NODE nodes[N*N]; N*N blocks of N threads are launched so that the N threads of a block access values in consecutive locations. (Chart: execution time in milliseconds.)
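A minimal sketch of how this layout coalesces, assuming N = 32 and the NODE structure above; the kernel and its body are hypothetical, but the launch shape matches the slide: each block handles one NODE, and thread t touches element t, so the N threads of a block read consecutive words of global memory.

#define N 32

typedef struct __align__(N) {
    int a[N];
    int b[N];
} NODE;

__global__ void touchNode(NODE* nodes)
{
    NODE* node = &nodes[blockIdx.x];      // one block per NODE
    int t = threadIdx.x;                  // N threads per block
    node->b[t] = node->a[t] + 1;          // consecutive addresses -> coalesced access
}

// Launch as on the slide: touchNode<<<N * N, N>>>(d_nodes);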

SERIAL VS SINGLE GPU Hey serial, what takes you so long? (Chart: time in seconds.) The 128x128x128 grid takes the serial code 309 seconds, so the single-GPU version gives an enormous speed-up.

STEP 1 LESSONS LEARNT Choose a data structure that maximizes memory coalescing. The mechanics of serial code and parallel code are very different. Develop algorithms that address the areas where the serial code spends the most time.

MULTI GPU APPROACH Pipeline: create multiple host threads, establish multiple host-GPU contexts, solve the cell-model ODE, solve the communication-model PDE, and visualize the data. OpenMP is used for launching the host threads; data partitioning and kernel invocation handle the GPU computation. The ODE is solved using the forward Euler method and the PDE using Jacobi iteration.
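A sketch of the host-thread and context setup, assuming one OpenMP thread per GPU; the structure follows the slide, but the code and comments are an illustration rather than the authors' implementation:

#include <omp.h>
#include <cuda_runtime.h>

int main(void)
{
    int numGPUs = 0;
    cudaGetDeviceCount(&numGPUs);

    #pragma omp parallel num_threads(numGPUs)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);   // establish this host thread's context on GPU 'dev'
        // Copy this GPU's partition of the grid, then for every time step:
        //   solve the cell-model ODE (forward Euler),
        //   solve the communication-model PDE (Jacobi iteration),
        //   exchange interface cells through host memory.
    }
    return 0;
}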

INTER GPU DATA PARTITIONING Let both cubes have dimensions s x s x s. The interface region of the left one is 2s^2 cells and that of the right one is 3s^2 cells. The input data is a 2D array of structures (the structures contain arrays) residing in host memory; after division, the data is copied into the device (global) memory of each GPU. (Figure: interface region.)

SOLVING PDES USING MULTIPLE GPUS During each Jacobi iteration, threads use global memory to share data among themselves. Threads in the interface region need data from other GPUs, and this inter-GPU sharing is done through host memory. A separate kernel is launched that handles the interface-region computation and copies the result back to device memory, so the GPUs are synchronized. Once the PDE calculation is completed for one time step, all values are written back to host memory.
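The exchange through host memory could look roughly like this sketch; the buffer names, the OpenMP barrier, and the neighbourOf helper are assumptions added for illustration, not the authors' code:

// Per time step, on the host thread driving GPU 'dev' (hypothetical names):
cudaMemcpy(h_iface[dev], d_ifaceOut, ifaceBytes, cudaMemcpyDeviceToHost);   // publish interface cells
#pragma omp barrier                                                         // all GPUs have written theirs
cudaMemcpy(d_ifaceIn, h_iface[neighbourOf(dev)], ifaceBytes, cudaMemcpyHostToDevice);
interfaceKernel<<<ifaceGrid, ifaceBlock>>>(d_ifaceIn, d_V, s);              // separate kernel for interface cells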

SOLVING PDES USING MULTIPLE GPUS (Timeline figure: host-to-device copy, GPU computation, device-to-host copy, and interface-region computation over time.)

THE CIRCUS OF INTER GPU SYNC Ghost cell computing! Pad with dummy cells at the inter-GPU interfaces to reduce communication. Let's make the other CPU cores work: 4 out of the 8 CPU cores hold GPU contexts, so use the 4 free cores to do the interface computation. Simple is the best: launch new kernels with different dimensions to handle the cells at the interface.
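For the ghost-cell option, a minimal sketch of the padded allocation, assuming the domain is cut along z and the voltages are plain float arrays; sizes and names are hypothetical:

// Local block is s x s x s interior cells plus one ghost layer on each cut face.
size_t paddedCells = (size_t)s * s * (s + 2);
float* d_Vlocal;
cudaMalloc((void**)&d_Vlocal, paddedCells * sizeof(float));
// Interior cells occupy z = 1 .. s; the ghost layers z = 0 and z = s + 1 hold the
// neighbouring GPU's interface values and are refreshed only when the GPUs exchange
// data, which is what reduces the communication.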

VARIOUS STAGES Interestingly, solving the PDE using Jacobi iteration consumes most of the time.

SCALABILITY (Chart legend: A = 32x32x32 cells executed by each GPU, B = 32x32x32 cells executed by each GPU, C = 32x32x32 cells executed by each GPU, D = 32x32x32 cells executed by each GPU.)

STEP 2 LESSONS LEARNT The Jacobi iterative technique scales well. Interface selection is very important. Making a multi-GPU program generic takes a lot of effort on the programmer's side.

LET'S WATCH A VIDEO

Q & A