Prepared 6/23/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.


Your own PCs running G80 emulators
  Better debugging environment
  Sufficient for the first couple of weeks
Your own PCs with a CUDA-enabled GPU
NVIDIA boards in the department
  GeForce family of processors for high-performance gaming
  Tesla C2070 for high-performance computing: no graphics output (?) and more memory
CUDA at the University of Akron – Slide 2

Description                   Card Models     Where Available
Low Power                     Ion             Netbooks in CAS 241.
Consumer Graphics Processors  GeForce 8500GT  Add-in cards in Dell Optiplex 745s
                              GeForce 9500GT  in department.
                              GeForce 9600GT
2nd Generation GPUs           GeForce GTX275  In Dell Precision T3500s in department.
Fermi GPUs                    GeForce GTX480  In select Dell Precision T3500s in department.
                              Tesla C2070     In Dell Precision T7500 Linux server
                                              (tesla.cs.uakron.edu).
CUDA at the University of Akron – Slide 3

Basic building block is a “streaming multiprocessor” (SM); different chips have different numbers of these SMs:

Product         SMs  Compute Capability
GeForce 8500GT  2    v. 1.1
GeForce 9500GT  4    v. 1.1
GeForce 9600GT  8    v. 1.1
CUDA at the University of Akron – Slide 4

Basic building block is a “streaming multiprocessor” with
  8 cores, each with 2048 registers
  up to 128 threads per core
  16KB of shared memory
  8KB cache for constants held in device memory
Different chips have different numbers of these SMs:

Product  SMs  Bandwidth  Memory  Compute Capability
GTX 275  30   127 GB/s   1–2 GB  v. 1.3
CUDA at the University of Akron – Slide 5

Fermi: each streaming multiprocessor has
  32 cores, each with 1024 registers
  up to 48 threads per core
  64KB of shared memory / L1 cache
  8KB cache for constants held in device memory
There is also a unified 768KB L2 cache. Different chips again have different numbers of SMs:

Product      SMs  Bandwidth  Memory    Compute Capability
GTX 480      15   177 GB/s   1.5 GB    v. 2.0
Tesla C2070  14   144 GB/s   6 GB ECC  v. 2.0
CUDA at the University of Akron – Slide 6

Feature                                                               v. 1.1  v. 1.3, 2.x
Integer atomic functions operating on 64-bit words in global memory   no      yes
Integer atomic functions operating on 32-bit words in shared memory   no      yes
Warp vote functions                                                   no      yes
Double-precision floating-point operations                            no      yes
CUDA at the University of Akron – Slide 7
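For illustration, a minimal sketch of the shared-memory integer atomics listed above (kernel and variable names are ours, not from the slides; per the table this needs a v. 1.3 or 2.x device, so compile with e.g. -arch=sm_13):

__global__ void histogram16(const int *data, int *bins, int n)
{
    __shared__ int localBins[16];                   // per-block counters
    if (threadIdx.x < 16) localBins[threadIdx.x] = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                                      // assumes data values are non-negative
        atomicAdd(&localBins[data[i] % 16], 1);     // 32-bit atomic in shared memory
    __syncthreads();

    if (threadIdx.x < 16)
        atomicAdd(&bins[threadIdx.x], localBins[threadIdx.x]);  // flush to global memory
}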

Feature                                                                                v. 1.1, 1.3  v. 2.x
3D grid of thread blocks                                                               no           yes
Floating-point atomic addition operating on 32-bit words in global and shared memory   no           yes
__ballot()                                                                             no           yes
__threadfence_system()                                                                 no           yes
__syncthreads_count(), __syncthreads_and(), __syncthreads_or()                         no           yes
Surface functions                                                                      no           yes
CUDA at the University of Akron – Slide 8
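As a sketch of two of the v. 2.x features above (illustrative names; compile for compute capability 2.x, e.g. -arch=sm_20):

__global__ void countPositive(const float *x, int *blockCounts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (x[i] > 0.0f);

    // Warp vote: bitmask of the threads in this warp whose predicate is true.
    unsigned int warpMask = __ballot(pred);
    (void)warpMask;   // unused here; shown only to illustrate __ballot()

    // Barrier that also returns how many threads in the block had pred true.
    int blockTotal = __syncthreads_count(pred);

    if (threadIdx.x == 0)
        blockCounts[blockIdx.x] = blockTotal;
}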

Specs common to all of the above compute capabilities:

Spec                                                              Value
Maximum x- or y-dimension of a grid of thread blocks              65535
Maximum dimensionality of a thread block                          3
Maximum z-dimension of a block                                    64
Warp size                                                         32
Maximum number of resident blocks per multiprocessor              8
Constant memory size                                              64 K
Cache working set per multiprocessor for constant memory          8 K
Maximum width for a 1D texture reference bound to linear memory   2^27
Maximum width, height and depth for a 3D texture reference
  bound to linear memory or a CUDA array                          2048 x 2048 x 2048
Maximum number of textures that can be bound to a kernel          128
Maximum number of instructions per kernel                         2 million
CUDA at the University of Akron – Slide 9

Spec                                                    v. 1.1  v. 1.3  v. 2.x
Maximum number of resident warps per multiprocessor     24      32      48
Maximum number of resident threads per multiprocessor   768     1024    1536
Number of 32-bit registers per multiprocessor           8 K     16 K    32 K
CUDA at the University of Akron – Slide 10

Spec                                                             v. 1.1, 1.3  v. 2.x
Maximum dimensionality of a grid of thread blocks                2            3
Maximum x- or y-dimension of a block                             512          1024
Maximum number of threads per block                              512          1024
Maximum amount of shared memory per multiprocessor               16 K         48 K
Number of shared memory banks                                    16           32
Amount of local memory per thread                                16 K         512 K
Maximum width for a 1D texture reference bound to a CUDA array   8192         32768
CUDA at the University of Akron – Slide 11

Spec                                                        v. 1.1, 1.3    v. 2.x
Maximum width and number of layers for a 1D layered
  texture reference                                         8192 x 512     16384 x 2048
Maximum width and height for a 2D texture reference
  bound to linear memory or a CUDA array                    65536 x 32768  65536 x 65535
Maximum width, height, and number of layers for a 2D        8192 x 8192    16384 x 16384
  layered texture reference                                   x 512          x 2048
Maximum width for a 1D surface reference bound to a
  CUDA array                                                Not supported  8192
Maximum width and height for a 2D surface reference
  bound to a CUDA array                                     Not supported  8192 x 8192
Maximum number of surfaces that can be bound to a kernel    Not supported  8
CUDA at the University of Akron – Slide 12

CUDA (Compute Unified Device Architecture) is NVIDIA’s program development environment:
  based on C with some extensions
  C++ support increasing steadily
  FORTRAN support provided by the PGI compiler
  lots of example code and good documentation
  a 2–4 week learning curve for those with experience of OpenMP and MPI programming
  large user community on NVIDIA forums
CUDA at the University of Akron – Slide 13
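To make “C with some extensions” concrete, here is a minimal vector-addition sketch (our illustrative example, not from the slides): __global__ marks a function that runs on the device, and the <<<blocks, threads>>> syntax launches it as a grid of thread blocks.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // built-in index variables
    if (i < n) c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);   // kernel launch

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[42] = %f\n", h_c[42]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}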

When installing CUDA on a system, there are three components:
  driver
    low-level software that controls the graphics card
    usually installed by the sys-admin
  toolkit
    nvcc CUDA compiler
    some profiling and debugging tools
    various libraries
    usually installed by the sys-admin in /usr/local/cuda
CUDA at the University of Akron – Slide 14

  SDK
    lots of demonstration examples
    a convenient Makefile for building applications
    some error-checking utilities
    not supported by NVIDIA
    almost no documentation
    often installed by the user in their own directory
CUDA at the University of Akron – Slide 15

Remotely access the front end:
  ssh tesla.cs.uakron.edu
ssh sends your commands over an encrypted stream, so your passwords, etc., can’t be sniffed over the network.
CUDA at the University of Akron – Slide 16

The first time you do this: after login, run
  /root/gpucomputingsdk_3.2.16_linux.run
and just take the default answers to get your own personal copy of the SDK. Then
  cd ~/NVIDIA_GPU_Computing_SDK/C
  make -j12 -k
will build all that can be built.
CUDA at the University of Akron – Slide 17

Binaries end up in
  ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release
In particular, the header files are in
  ~/NVIDIA_GPU_Computing_SDK/C/common/inc
You can then get a summary of technical specs and compute capabilities by executing
  ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery
CUDA at the University of Akron – Slide 18
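The same information can also be queried programmatically; a minimal sketch using the CUDA runtime API (our example, not from the slides):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);                  // number of CUDA devices present
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);       // fill in the device's spec structure
        printf("Device %d: %s, compute capability %d.%d, %d multiprocessors\n",
               d, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }
    return 0;
}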

Two choices:
  use nvcc within a standard Makefile
  use the special Makefile template provided in the SDK
The SDK Makefile provides some useful options:
  make emu=1 uses an emulation library for debugging on a CPU
  make dbg=1 activates run-time error checking
In general just use a standard Makefile, such as the one on the next slide.
CUDA at the University of Akron – Slide 19

GENCODE_ARCH := -gencode=arch=compute_10,code=\"sm_10,compute_10\" \
                -gencode=arch=compute_13,code=\"sm_13,compute_13\" \
                -gencode=arch=compute_20,code=\"sm_20,compute_20\"

INCLOCS := -I$(HOME)/NVIDIA_GPU_Computing_SDK/shared/inc \
           -I$(HOME)/NVIDIA_GPU_Computing_SDK/C/common/inc

LIBLOCS := -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib \
           -L$(HOME)/NVIDIA_GPU_Computing_SDK/C/lib

LIBS = -lcutil_x86_64

# The exact target/prerequisite names on the original slide were garbled in
# transcription; a generic pattern rule building an executable from a
# matching .cu/.cuh pair:
%: %.cu %.cuh
	nvcc $(GENCODE_ARCH) $(INCLOCS) $< $(LIBLOCS) $(LIBS) -o $@
CUDA at the University of Akron – Slide 20
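With a pattern rule like the one above, make prog builds prog from prog.cu and prog.cuh (prog is an illustrative name); note that the recipe line under the rule must be indented with a tab.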

Parallel Thread Execution (PTX)
  Virtual machine and ISA
  Programming model
  Execution resources and state
CUDA Tools and Threads – Slide 2

Any source file containing CUDA extensions must be compiled with nvcc.
nvcc is a compiler driver: it works by invoking all the necessary tools and compilers like cudacc, g++, cl, …
nvcc outputs
  C code (host CPU code), which must then be compiled with the rest of the application using another tool
  PTX: object code directly, or PTX source interpreted at runtime
CUDA Tools and Threads – Slide 22
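For example, a two-step build with illustrative file names, compiling the device code with nvcc and linking with the host compiler:
  nvcc -c kernel.cu -o kernel.o
  g++ main.cpp kernel.o -L/usr/local/cuda/lib64 -lcudart -o app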

Any executable with CUDA code requires two dynamic libraries:
  the CUDA runtime library ( cudart )
  the CUDA core library ( cuda )
CUDA Tools and Threads – Slide 23

An executable compiled in device emulation mode ( nvcc -deviceemu ) runs completely on the host using the CUDA runtime:
  no need for any device or CUDA driver
  each device thread is emulated with a host thread
CUDA Tools and Threads – Slide 24
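For example, with an illustrative file name:
  nvcc -deviceemu prog.cu -o prog
after which ./prog runs entirely on the CPU.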

Running in device emulation mode, one can
  use host native debug support (breakpoints, inspection, etc.)
  access any device-specific data from host code and vice versa
  call any host function from device code (e.g. printf ) and vice versa
  detect deadlock situations caused by improper usage of __syncthreads
CUDA Tools and Threads – Slide 25
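A minimal sketch of the printf case (our example; on the pre-Fermi parts discussed here, calling a host function like printf from device code only works under emulation, so build it with nvcc -deviceemu):

#include <stdio.h>

// Under device emulation a kernel may call host functions such as printf.
__global__ void trace(void)
{
    printf("hello from emulated device thread %d\n", threadIdx.x);
}

int main(void)
{
    trace<<<1, 4>>>();          // launch 4 emulated device threads
    cudaThreadSynchronize();    // wait for the (emulated) kernel to finish
    return 0;
}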

Emulated device threads execute sequentially, so simultaneous accesses to the same memory location by multiple threads can produce different results on real hardware than under emulation.
Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode but will generate an error in device execution mode.
CUDA Tools and Threads – Slide 26
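A sketch of that race hazard (illustrative names): every thread writes the same word, which looks deterministic under sequential emulation but is undefined on real hardware.

#include <stdio.h>

__global__ void race(int *out)
{
    *out = threadIdx.x;   // unsynchronized write by every thread in the block
}

int main(void)
{
    int *d_out, h_out = -1;
    cudaMalloc((void **)&d_out, sizeof(int));
    race<<<1, 256>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("winning thread: %d\n", h_out);  // emulation: always the same; hardware: no guarantee
    cudaFree(d_out);
    return 0;
}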

Results of floating-point computations will differ slightly because of
  different compiler outputs and instruction sets
  use of extended precision for intermediate results
There are various options to force strict single precision on the host.
CUDA Tools and Threads – Slide 27

A new Visual Studio based GPU integrated development environment is available in beta (as of October 2009).
CUDA Tools and Threads – Slide 28

Based on original material accessed 6/22/2011 from
  The University of Akron: Charles Van Tilburg
  The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
  Oxford University: Mike Giles
  Stanford University: Jared Hoberock, David Tarjan
Revision history: last updated 6/23/2011.
CUDA at the University of Akron – Slide 29