Slide 1: Parallelising Pipelined Wavefront Computations on the GPU

S.J. Pennycook, G.R. Mudalige, S.D. Hammond and S.A. Jarvis
High Performance Systems Group, Department of Computer Science, University of Warwick, U.K.
1st UK CUDA Developers Conference, 7th Dec 2009 – Oxford, U.K.

Slide 2: Overview

- Wavefront Computations
- A GPGPU Solution?
- Wavefronts within Wavefronts
- Performance Modelling
- Beating the CPU – Optimisations to Win
- Results, Validations and Model Projections
- Current and Future Work
- Conclusions

Slide 3: Wavefront Computations

- Wavefront computations are at the core of a number of large scientific computing workloads.
- Centres including the Los Alamos National Laboratory (LANL) in the United States and the Atomic Weapons Establishment (AWE) in the UK use these codes heavily.
- Lamport's hyperplane algorithm, which underpins these codes, has existed for more than thirty-five years.
- Defining characteristics:
  - The computation operates on a grid of cells, with each cell requiring some computation to be performed.
  - Each cell has a data dependency on the solutions of up to three neighbouring cells.

Slide 4: Cell Dependencies

Slide 5: Motivation

- Our previous work analysed and optimised applications that use the wavefront algorithm with MPI.

[Figure: an Nx × Ny × Nz data cube decomposed over a grid of processors, Processor (1,1) to Processor (n,m); the computation proceeds as wavefronts through the 3D data cube.]

Slide 6: Motivation (cont'd)

- The algorithm operates over a three-dimensional structure of size Nx × Ny × Nz.
- The grid is mapped onto a 2D m × n grid of processors; each processor is assigned a stack of (Nx/n) × (Ny/m) × Nz cells.
- The data dependency results in a sequence of wavefronts (or a sweep) that starts from one corner and makes its way through the other cells.
- We have modelled codes (e.g. Chimaera, LU and Sweep3D) that employ wavefront computations with MPI.

Slide 7: Motivation (cont'd)

- Our focus is now on using GPUs to improve the per-processor solution.
- A canonical (nested-loop) algorithm is normally employed by the CPU to solve the computation assigned to each processor.

Listing: Canonical Algorithm

For k = 1; k <= kend do
  For j = 1; j <= jend do
    For i = 1; i <= iend do
      A(i,j,k) = A(i−1,j,k) + A(i,j−1,k) + A(i,j,k−1)   // Compute cell
    End for
  End for
End for

Slide 8: Hyperplane (Wavefront) Algorithm

- Let f = i + j + k, g = k and h = j.
- The plane defined by i + j + k = CONST is called a hyperplane.

Listing: Hyperplane Algorithm

DO CONCURRENTLY ON EACH PROCESSOR
  For f = 3, iend + jend + kend do
    A(f−g−h, h, g) = A(f−g−h−1, h, g) + A(f−g−h, h−1, g) + A(f−g−h, h, g−1)
  End For

- The critical dependencies are preserved, even though the solution is carried out across the grid in wavefronts.
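
The listing leaves the concurrency implicit: the f loop visits the hyperplanes in dependency order, and every cell on a given hyperplane can be computed independently of the others. A minimal serial C sketch of the same traversal (an illustration, not taken from the slides; it assumes an N³ grid stored with a one-cell halo of initialised boundary values) is:

Listing (sketch): Hyperplane traversal in C

  // Serial sketch of the hyperplane ordering (illustrative only).
  // A is an (N+1)^3 array; indices 1..N hold cells, index 0 is the halo.
  // All cells with the same f = i + j + k are mutually independent, so the
  // two inner loops could run concurrently.
  void hyperplane_sweep(double *A, int N)
  {
      #define IDX(i, j, k) ((((i) * (N + 1)) + (j)) * (N + 1) + (k))
      for (int f = 3; f <= 3 * N; f++) {          // hyperplanes, in order
          for (int g = 1; g <= N; g++) {          // g = k
              for (int h = 1; h <= N; h++) {      // h = j
                  int i = f - g - h;              // recover i from f, g and h
                  if (i < 1 || i > N) continue;   // (g, h) not on this hyperplane
                  A[IDX(i, h, g)] = A[IDX(i - 1, h, g)]
                                  + A[IDX(i, h - 1, g)]
                                  + A[IDX(i, h, g - 1)];
              }
          }
      }
      #undef IDX
  }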

Slide 9: A GPGPU Solution?

- Can we utilise the many cores on a GPU to speed up this algorithm?
- Theoretically simple...

Slide 10: A GPGPU Solution? (cont'd)

- For a 3D cube of cells:

Slide 11: GPU Limitations

- What is the practical situation?
- Experimental system – Daresbury Laboratory, U.K.:
  - 8 x NVIDIA Tesla S1070 servers, each with four Tesla C1060 cards.
  - Compute nodes consist of quad-core Nehalem processors (2.53 GHz, 24 GB RAM).
  - Each CPU core sees one Tesla card.
  - Voltaire HCA410-4EX InfiniBand adapter.
- NVIDIA Tesla C1060 GPU specifications:
  - Each GPU card has 30 Streaming Multiprocessors (SMs), with 8 cores per SM.
  - Each card therefore has 240 cores (streaming processor cores).
  - Each core operates at up to 1.44 GHz.
  - 4 GB of memory per card.

Slide 12: GPU Limitations (cont'd)

- CUDA device architecture:

[Figure: block diagram of the CUDA device architecture – SM 1 to SM 30, each with 8 processor cores, registers and shared memory, plus constant and texture caches; device DRAM holding local, global, constant and texture memory; connection to the host.]

Slide 13: GPU Limitations (cont'd)

- Each SM is allocated a number of threads, arranged as blocks:
  - No synchronisation between threads in different blocks.
  - Limit of 512 threads per block.
- Memory hierarchy:
  - Global memory access is slow and should be avoided.
  - Limit of 16 KB of shared memory per SM.
- Other considerations:
  - Limit of 16,384 registers per block.
  - Aligning half-warps for performance.
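
These limits can be confirmed at run time with the standard CUDA runtime API; a minimal sketch (not from the slides):

Listing (sketch): Querying device limits

  #include <cstdio>
  #include <cuda_runtime.h>

  int main(void)
  {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);   // properties of device 0
      printf("%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
      printf("Multiprocessors (SMs):   %d\n", prop.multiProcessorCount);
      printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
      printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
      printf("Registers per block:     %d\n", prop.regsPerBlock);
      printf("Warp size:               %d\n", prop.warpSize);
      return 0;
  }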

Slide 14: A Solution? Wavefronts within Wavefronts

- Need to be scalable: run more than 512 threads by utilising parallelism across all the multiprocessors.
- The cells on each diagonal are decomposed into coarse sub-tasks and assigned to the SMs as thread blocks.

Slide 15: Wavefronts within Wavefronts

- Each diagonal is computed by a kernel launch:

  for (wave = 0; wave < (3 * (N / dimBlock.x)) - 2; wave++) {
      // Run the kernel for this diagonal of blocks.
      hyperplane_3d<<<dimGrid, dimBlock>>>(d_gpu, wave);
  }
  cudaThreadSynchronize();  // Not strictly necessary.

- The time to compute one diagonal is ≈ ceiling(number of blocks on the diagonal / number of SMs).
- Each block utilises the resources available to an SM to solve its cells – more on this later.
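
The body of hyperplane_3d is not shown in the transcript; a simplified sketch of one possible form, consistent with the launch loop above (assuming an N³ grid with a one-cell halo, cubic thread blocks of edge B with B³ ≤ 512 threads, a 2D grid of (N/B)² blocks, and no shared-memory optimisations yet), is:

Listing (sketch): A possible hyperplane_3d kernel

  #define B 8   // block edge: 8 x 8 x 8 = 512 threads, the per-block limit

  // One launch per block diagonal 'wave'; blocks not on the diagonal exit early.
  // a is an (n+1)^3 array with cells at indices 1..n and a halo at index 0.
  __global__ void hyperplane_3d(double *a, int n, int wave)
  {
      int bx = blockIdx.x;
      int by = blockIdx.y;
      int bz = wave - bx - by;                  // third tile coordinate on this diagonal
      if (bz < 0 || bz >= n / B) return;        // this block is idle for this wave

      // Global coordinates of the cell owned by this thread.
      int i = bx * B + threadIdx.x + 1;
      int j = by * B + threadIdx.y + 1;
      int k = bz * B + threadIdx.z + 1;
      int stride = n + 1;
      int idx = (i * stride + j) * stride + k;

      // Sweep the tile's internal diagonals; __syncthreads() enforces the
      // dependency between successive diagonals within the block. Cells in
      // neighbouring tiles were written by the previous kernel launch.
      for (int d = 0; d <= 3 * (B - 1); d++) {
          if (threadIdx.x + threadIdx.y + threadIdx.z == d) {
              a[idx] = a[idx - stride * stride]  // (i-1, j, k)
                     + a[idx - stride]           // (i, j-1, k)
                     + a[idx - 1];               // (i, j, k-1)
          }
          __syncthreads();
      }
  }

With cubic blocks of edge B = dimBlock.x, the host loop's bound 3 * (N / dimBlock.x) − 2 is exactly the number of block diagonals in the (N/B)³ grid of tiles.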

Slide 16: A Performance Model

- What does this solution mean in terms of a performance model?
- Modelling block-level performance:
  - Assume a 3D cube of data cells with dimension N.
  - P_GPU – number of SMs on the GPU.
  - W_g,GPU – time to solve a block of cells.
  - W_GPU – time to solve the 3D cube of cells using the GPU.
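
The model equation itself does not survive in the transcript. One form consistent with the per-diagonal cost quoted on Slide 15 (an assumed form, not necessarily the authors' exact equation) is:

  W_GPU ≈ Σ_d ⌈ n_d / P_GPU ⌉ × W_g,GPU,   for block diagonals d = 1, ..., 3⌈N/B⌉ − 2

where B is the edge length of a thread block's tile and n_d is the number of thread blocks on block diagonal d.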

Slide 17: Initial Results

- Each cell is randomly initialised, and at each step calculates the average of itself and its top, north and west neighbours.
- How the 3D data is decomposed has a significant effect on execution time.
- Strange behaviour where the number of cells is a multiple of 32 (especially at powers of 2).
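
In code, the per-cell update described above is simply (a sketch; the array name a is illustrative):

  // Average of the cell and its three upwind (top, north, west) neighbours.
  a[i][j][k] = 0.25 * (a[i][j][k] + a[i-1][j][k] + a[i][j-1][k] + a[i][j][k-1]);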

Slide 18: Initial Results (cont'd)

Slide 19: Initial Results (cont'd)

Slide 20: Initial Results (cont'd)

Slide 21: Beating the CPU

- Optimisations within the blocks:
  - Thread re-use.
  - Caching values in shared memory (see the sketch following this list).
  - Coalesced memory accesses.
  - Avoiding shared-memory bank conflicts.
- Optimisations over the blocks:
  - Explicit vs. implicit CPU synchronisation.
  - Inter-block synchronisation using mutexes.
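
A sketch of the shared-memory caching optimisation, reusing the tile and indexing conventions of the hyperplane_3d sketch after Slide 15 (again an illustration, not the authors' kernel): the B³ tile plus a one-cell halo is staged in shared memory, the intra-block sweep runs there, and each cell is written back to global memory once. Thread re-use, coalescing and bank-conflict avoidance are not shown.

Listing (sketch): Caching the tile in shared memory

  __global__ void hyperplane_3d_smem(double *a, int n, int wave)
  {
      __shared__ double t[B + 1][B + 1][B + 1];     // tile plus one-cell halo

      int bx = blockIdx.x, by = blockIdx.y;
      int bz = wave - bx - by;
      if (bz < 0 || bz >= n / B) return;            // block idle for this wave

      int tx = threadIdx.x, ty = threadIdx.y, tz = threadIdx.z;
      int i = bx * B + tx + 1, j = by * B + ty + 1, k = bz * B + tz + 1;
      int stride = n + 1;
      int idx = (i * stride + j) * stride + k;

      // Stage this thread's cell, plus the halo faces owned by boundary threads.
      t[tx + 1][ty + 1][tz + 1] = a[idx];
      if (tx == 0) t[0][ty + 1][tz + 1] = a[idx - stride * stride];
      if (ty == 0) t[tx + 1][0][tz + 1] = a[idx - stride];
      if (tz == 0) t[tx + 1][ty + 1][0] = a[idx - 1];
      __syncthreads();

      // Intra-block sweep entirely in shared memory.
      for (int d = 0; d <= 3 * (B - 1); d++) {
          if (tx + ty + tz == d)
              t[tx + 1][ty + 1][tz + 1] = t[tx][ty + 1][tz + 1]
                                        + t[tx + 1][ty][tz + 1]
                                        + t[tx + 1][ty + 1][tz];
          __syncthreads();
      }

      a[idx] = t[tx + 1][ty + 1][tz + 1];           // single write back per cell
  }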

Slide 22: Thread Reuse in a Block

[Figure: a 4 × 4 tile of cells labelled Thread 0 – Thread 15, showing which thread computes each cell and how threads are re-used as successive diagonals of the block are processed.]

Slide 23: Coalesced Memory Access

- Requires padding on devices below compute capability 1.3.
- How does this apply to 3D?
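
One standard way to obtain padded, alignment-friendly rows for a 3D array is cudaMalloc3D, which rounds each row up to a pitch chosen by the driver. A minimal sketch (not from the slides; the helper name is illustrative):

Listing (sketch): Padded 3D allocation with cudaMalloc3D

  #include <cuda_runtime.h>

  // Allocate an n x n x n array of doubles with each row padded to an aligned
  // pitch, so that rows start on boundaries that allow coalesced accesses.
  cudaPitchedPtr alloc_padded_cube(int n)
  {
      cudaExtent extent = make_cudaExtent(n * sizeof(double),  // row width in bytes
                                          n,                    // height
                                          n);                   // depth
      cudaPitchedPtr p;
      cudaMalloc3D(&p, extent);   // p.pitch holds the padded row width in bytes
      return p;
  }

  // Element (i, j, k), with i varying fastest along a row, is then addressed as:
  //   char   *slice = (char *)p.ptr + k * p.pitch * n;   // one n-row slice per k
  //   double *row   = (double *)(slice + j * p.pitch);
  //   double  val   = row[i];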

Slide 24: Beating the CPU (Results)

Slide 25: Beating the CPU (Results)

- The code was restructured for the GPU to avoid unnecessary branching; similar restructuring was applied to the CPU version in kind.
- Re-use of threads and shared memory offers a 2x speedup over the naive GPU implementation.
- Spikes remain, likely to be an issue at the warp level.
- Kernel information:
  - 17 registers.
  - 2948 bytes of shared memory per block.
  - 42% occupancy.

Slide 26: The Bigger Picture

- Current work:
  - Porting LU, Sweep3D and Chimaera to the GPU (CUDA and OpenCL).
- Additional barriers arising from larger programs:
  - Double precision.
  - Multiple computations per cell.
- Looking towards the future:
  - How well does our algorithm perform on a consumer card (e.g. GTX 295)?
  - How well will our algorithm perform on Fermi?
  - Benchmarking and analysis should facilitate predictions.

Slide 27: Conclusions

- Wavefront computations can utilise emerging GPU architectures, despite their dependencies.
- To see a speedup:
  - Memcpy() needs to be faster.
  - More work is required per Memcpy().
- Codes cannot be ported naively; hardware limitations may be a problem (particularly for larger codes).
- Performance modelling will offer insights into which applications can be ported successfully.