Hybrid Redux: CUDA / MPI

CUDA / MPI Hybrid – Why?
 Harness more hardware: 16 CUDA GPUs > 1!
 You have a legacy MPI code that you’d like to accelerate.

CUDA / MPI – Hardware?
 In a cluster environment, a typical configuration is one of:
  Tesla S1050 cards attached to (some) nodes – 1 GPU for each node.
  Tesla S1070 server node (4 GPUs) connected to two different host nodes via PCI-E – 2 GPUs per node.
 Sooner’s CUDA nodes are like the latter.
  Those nodes are used when you submit a job to the queue “cuda”.
 You can also attach multiple cards to a workstation.

CUDA / MPI – Approach
 CUDA will likely be:
  1. Doing most of the computational heavy lifting
  2. Dictating your algorithmic pattern
  3. Dictating your parallel layout (i.e., which nodes have how many cards)
 Therefore:
  1. We want to design the CUDA portions first, and
  2. We want to do most of the compute in CUDA, using MPI to move work and results around only when needed (see the sketch below).
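
A minimal sketch of that layout (illustrative only: scale_kernel, the problem size, and the data are made up, and error checking is omitted). Rank 0 scatters the input with MPI, every rank does its share of the compute on its own GPU, and the results are gathered back:

    /* hybrid.cu -- MPI moves work and results around, CUDA does the compute.
       Assumes one GPU per MPI rank and N divisible by the number of ranks. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    #define N (1 << 20)

    __global__ void scale_kernel(float *x, int n)   /* hypothetical kernel */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = N / size;
        float *full = NULL;
        float *part = (float *)malloc(chunk * sizeof(float));
        if (rank == 0) {
            full = (float *)malloc(N * sizeof(float));
            for (int i = 0; i < N; i++) full[i] = (float)i;
        }

        /* MPI: get each rank its piece of the data */
        MPI_Scatter(full, chunk, MPI_FLOAT, part, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

        /* CUDA: heavy lifting on this rank's GPU */
        float *d_part;
        cudaMalloc(&d_part, chunk * sizeof(float));
        cudaMemcpy(d_part, part, chunk * sizeof(float), cudaMemcpyHostToDevice);
        scale_kernel<<<(chunk + 255) / 256, 256>>>(d_part, chunk);
        cudaMemcpy(part, d_part, chunk * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_part);

        /* MPI: collect the results */
        MPI_Gather(part, chunk, MPI_FLOAT, full, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

        free(part);
        if (rank == 0) free(full);
        MPI_Finalize();
        return 0;
    }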

Writing and Testing
 Use MPI for what CUDA can’t do or does poorly: communication!
 Basic organization:
  Get data to the node / CUDA card
  Compute
  Get the result out of the node / CUDA card
 Focus testing on these three spots to pinpoint bugs.
 Be able to “turn off” either MPI or CUDA for testing.
  Keep a serial version of the CUDA code around just for this (one way to structure that is sketched below).
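
One way to keep that switch handy (a sketch; process_chunk and scale_kernel are hypothetical names): guard the CUDA path with a compile-time flag so the same MPI driver can fall back to the serial CPU version, which also gives you a reference answer to check the GPU result against.

    /* process_chunk.cu -- compute entry point called by the MPI driver.
       Build with -DUSE_CUDA for the GPU path; without it, the serial
       version runs so the MPI logic can be tested on its own. */

    void process_chunk_serial(float *x, int n)   /* serial reference version */
    {
        for (int i = 0; i < n; i++) x[i] *= 2.0f;
    }

    #ifdef USE_CUDA
    #include <cuda_runtime.h>

    __global__ void scale_kernel(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    void process_chunk(float *x, int n)
    {
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));                            /* spot 1: data in  */
        cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
        scale_kernel<<<(n + 255) / 256, 256>>>(d_x, n);                 /* spot 2: compute  */
        cudaMemcpy(x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);  /* spot 3: data out */
        cudaFree(d_x);
    }
    #else
    void process_chunk(float *x, int n)
    {
        process_chunk_serial(x, n);   /* CUDA "turned off": use the serial version */
    }
    #endif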

Compiling
 Basically, we need to use both mpicc and nvcc.
  These are both just wrappers for other compilers.
 Dirty trick – wrap mpicc with nvcc:
   nvcc -arch sm_13 --compiler-bindir mpicc driver.c kernel.cu
 Add -g and -G to include debug info for gdb.
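
If wrapping mpicc inside nvcc gets awkward, a common alternative (sketched below; the CUDA library path varies by installation) is to compile the two halves separately and link against the CUDA runtime:

    nvcc -arch sm_13 -c kernel.cu -o kernel.o
    mpicc -c driver.c -o driver.o
    mpicc driver.o kernel.o -L/usr/local/cuda/lib64 -lcudart -o hybrid   # adjust -L for your install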

Running CUDA / MPI
 Run no more than one MPI process per GPU.
  On Sooner, this means each cuda node should run no more than 2 MPI processes.
 On Sooner, these cards can be reserved by putting:
   #BSUB -R "select[cuda > 0]"
   #BSUB -R "rusage[cuda=2]"
  in your .bsub file.
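
Each MPI process also has to pick which GPU it uses. A common pattern (a sketch, assuming consecutive ranks are packed onto the same node) is to select the device from the rank before making any other CUDA call:

    /* Pick a GPU per MPI process -- call right after MPI_Init, before any other CUDA call. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void select_gpu_for_rank(void)
    {
        int rank, ngpus;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaGetDeviceCount(&ngpus);        /* e.g. 2 on Sooner's cuda nodes */
        cudaSetDevice(rank % ngpus);       /* rank 0 -> GPU 0, rank 1 -> GPU 1, ... */
    }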