Hybrid MPI/CUDA: Scaling Accelerator Code
Blue Waters Undergraduate Petascale Education Program (BWUPEP2011), UIUC, May 29 – June
Why Hybrid CUDA?
- CUDA is fast! (for some problems)
- CUDA on a single card is like OpenMP: it doesn't scale beyond one machine
- MPI alone can only scale so far:
  - excessive power consumption
  - communication overhead
  - a large amount of work remains for each node
- What if you could harness the power of multiple accelerators across multiple MPI processes?
Hybrid Architectures
- Tesla S1050: 1 GPU, connected directly to a node
  - Earlham (as11 & as12)
- Tesla S1070: a server node with 4 GPUs, typically connected via PCI-E to 2 nodes
  - OU has some of these
  - NCSA (192 nodes)
  - NCSA Accelerator Cluster (32 nodes)
[Diagram: a node with its RAM and an attached GPU]
MPI/CUDA Approach
CUDA will be:
- doing the computational heavy lifting
- dictating your algorithm & parallel layout (data parallel)
Therefore:
- design the CUDA portions first
- use MPI to move work to each node
Implementation
- Do as much work as possible on the GPU before bringing data back to the CPU and communicating it
- Sometimes you won't have a choice…
- Debugging tips:
  - Develop/test/debug the one-node version first
  - Then test it with multiple nodes to verify communication

  move data to each node
  while not done:
      copy data to GPU
      do work <<< >>>
      get new state out of GPU
      communicate with others
  aggregate results from all nodes
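The loop above might look like this in code — a rough skeleton only, where `do_work`, `done()`, and `exchange_with_neighbors()` are placeholder names, not real API calls:

```cuda
// Skeleton of the hybrid pattern above; the kernel, the termination test,
// and the neighbor exchange are all placeholders for illustration.
void hybrid_loop(float *data, float *result, int n)
{
    MPI_Bcast(data, n, MPI_FLOAT, 0, MPI_COMM_WORLD);    // move data to each node

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    while (!done()) {
        cudaMemcpy(d_data, data, n * sizeof(float),
                   cudaMemcpyHostToDevice);              // copy data to GPU
        do_work<<<(n + 255) / 256, 256>>>(d_data, n);    // do work
        cudaMemcpy(data, d_data, n * sizeof(float),
                   cudaMemcpyDeviceToHost);              // get new state out of GPU
        exchange_with_neighbors(data, n);                // communicate with others
    }

    cudaFree(d_data);
    MPI_Reduce(data, result, n, MPI_FLOAT, MPI_SUM,
               0, MPI_COMM_WORLD);                       // aggregate results
}
```

Note how all GPU work stays inside the loop and the expensive host-side communication happens only once per iteration, matching the "do as much as possible on the GPU" advice.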
Multi-GPU Programming
- A CPU thread can only have a single active context to communicate with a GPU
- cudaGetDeviceCount(int * count)
- cudaSetDevice(int device)
- Be careful using MPI rank alone: the device count only counts the cards visible from each node
- Use MPI_Get_processor_name() to determine which processes are running where
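One way to put those calls together — a hedged sketch, assuming one MPI process per GPU; the helper name and the local-rank scheme are illustrative, not from the slides:

```cuda
// Sketch: give each MPI process its own GPU by computing a per-node
// "local rank" from hostnames, instead of using the global rank directly.
#include <mpi.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

void select_gpu(void)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Gather every process's hostname so each can see who shares its node
    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(name, &len);

    char *all = (char *)malloc(size * MPI_MAX_PROCESSOR_NAME);
    MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                  all,  MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);

    // Local rank = number of lower-ranked processes on the same node
    int local_rank = 0;
    for (int i = 0; i < rank; i++)
        if (strcmp(all + i * MPI_MAX_PROCESSOR_NAME, name) == 0)
            local_rank++;
    free(all);

    int count;
    cudaGetDeviceCount(&count);            // counts only this node's cards
    cudaSetDevice(local_rank % count);     // stay within visible devices
}
```

Calling this once after MPI_Init() avoids the trap of two processes on the same node binding to the same card.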
Compiling
- CUDA needs nvcc; MPI needs mpicc
- Dirty trick: wrap mpicc with nvcc
  - nvcc processes the .cu files and sends the rest to its wrapped compiler
- The kernel, kernel invocation, cudaMalloc, etc. are all best off in a .cu file somewhere
- MPI calls should be in .c files
- There are workarounds, but this is the simplest approach

  nvcc --compiler-bindir mpicc main.c kernel.cu
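A minimal sketch of what the .cu side of that split can look like — the kernel and all names here are illustrative assumptions, not code from the slides:

```cuda
// kernel.cu -- CUDA code lives here; MPI code stays in main.c.
// main.c declares:  void launch_scale(float *data, int n, float factor);
// and calls it like any C function. extern "C" prevents C++ name
// mangling so the C-compiled main.c can link against this wrapper.
#include <cuda_runtime.h>

__global__ void scale(float *d, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= factor;
}

extern "C" void launch_scale(float *host_data, int n, float factor)
{
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, host_data, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n, factor);
    cudaMemcpy(host_data, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```

With this layout, the single nvcc --compiler-bindir mpicc command above compiles both files and links them in one step.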
Executing
- Typically one MPI process per available GPU
- On AC, each node has 4 GPUs, and ppn should correspond to the number of GPUs requested; this requests a total of 8 GPUs on 2 nodes:

  #BSUB -l nodes=2:tesla:cuda3.2:ppn=4

- On Sooner (OU), each node has 2 GPUs available, so ppn should be 2:

  #BSUB -R "select[cuda > 0]"
  #BSUB -R "rusage[cuda=2]"
  #BSUB -l nodes=1:ppn=2
Hybrid CUDA Lab
- We already have Area Under a Curve code for MPI and CUDA independently
- You can write a hybrid code that has each GPU calculate a portion of the area, then use MPI to combine the subtotals into the complete area
- Otherwise, feel free to take any code we've used so far and experiment!
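A hedged sketch of how that hybrid lab could be structured — the integrand f(x) = x², the interval, and all names are assumptions for illustration, not the course's actual Area Under a Curve code:

```cuda
// Sketch of the hybrid lab: each rank integrates its own slice of [a, b]
// on its GPU with the midpoint rectangle rule, then MPI_Reduce combines
// the per-rank subtotals on rank 0.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__device__ float f(float x) { return x * x; }   // assumed integrand

__global__ void areas(float *out, float x0, float dx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = f(x0 + (i + 0.5f) * dx) * dx;   // one rectangle each
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const float a = 0.0f, b = 1.0f;
    const int n = 1 << 20;                  // rectangles per rank
    float dx = (b - a) / ((float)n * size);
    float x0 = a + rank * n * dx;           // this rank's slice starts here

    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));
    areas<<<(n + 255) / 256, 256>>>(d_out, x0, dx, n);

    // Simple host-side sum of this rank's rectangle areas
    float *h = (float *)malloc(n * sizeof(float));
    cudaMemcpy(h, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    float sub = 0.0f;
    for (int i = 0; i < n; i++) sub += h[i];

    float total;
    MPI_Reduce(&sub, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("area = %f\n", total);   // should approach 1/3

    free(h); cudaFree(d_out);
    MPI_Finalize();
    return 0;
}
```

The host-side sum keeps the sketch short; a natural extension for the lab is to do the per-rank reduction on the GPU as well before the single MPI_Reduce.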