Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008

Previously
CUDA Runtime Component
–Common Component
  Built-in vector types
  Math functions
  Timing
  Textures
    –Texture fetch
    –Texture reference
    –Texture read modes
    –Normalized texture coordinates
    –Linear texture filtering

Today
CUDA Runtime Component
–Common Component
–Device Component
–Host Component

CUDA Runtime Component
–Common Component
–Device Component
–Host Component

Device Runtime Component
Can only be used in device code
Math functions
–Faster, less accurate versions of functions from the common component
–e.g., logf and __logf
–Appendix B of the Programming Guide
–Use fast math by default: compiler option -use_fast_math
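
As a minimal sketch (the kernel and buffer names are hypothetical): __logf is the fast device intrinsic, while logf is the slower, more accurate common-component version.

    // Hypothetical kernel: fast device log on each element
    __global__ void fastLog(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = __logf(in[i]);  // fast, less accurate; logf(in[i]) is the accurate version
    }

With -use_fast_math, calls to logf are compiled to __logf automatically.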

Device Runtime Component
Synch function: __syncthreads()
–Synchronizes the threads in a block
–Avoids read-after-write, write-after-read and write-after-write hazards on commonly accessed shared memory
–Dangerous to use in conditionals: if not all threads reach the barrier, the code hangs or produces unwanted effects
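
For illustration, a minimal shared-memory sketch (kernel name and fixed block size are assumptions): every thread writes one element, and the barrier guarantees all writes are visible before any thread reads another thread's element.

    // Reverse up to 256 floats in place using shared memory
    __global__ void reverseBlock(float *d, int n)
    {
        __shared__ float s[256];
        int t = threadIdx.x;
        if (t < n) s[t] = d[t];
        __syncthreads();               // all writes to s[] complete before any read below
        if (t < n) d[t] = s[n - 1 - t];
    }

Note that the barrier itself sits outside the conditionals, so every thread of the block reaches it.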

Device Runtime Component
Atomic functions
–Guaranteed to perform without interference: the memory address is locked
–Supported by CUDA cards of compute capability > 1.0
–Mostly operate on integers only
–Appendix C of the Programming Guide
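
A histogram is the classic use case; in this sketch (names assumed), many threads may hit the same bin, and atomicAdd serializes those updates.

    // Each thread adds one sample to a global histogram
    __global__ void histogram256(const unsigned char *data, int n, unsigned int *bins)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[data[i]], 1u);  // locked read-modify-write on the bin
    }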

Device Runtime Component
Warp vote functions
–Supported by CUDA cards of compute capability >= 1.2
–Check a condition on all threads in a warp:
  int __all(int predicate): true (non-zero) if predicate is true for all warp threads
  int __any(int predicate): true (non-zero) if predicate is true for any warp thread
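
A small sketch (buffer names assumed; the launch is taken to cover a multiple of 32 elements): each warp decides collectively whether all of its values are positive, and lane 0 records the verdict.

    __global__ void warpAllPositive(const float *v, int *warpFlag)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int vote = __all(v[i] > 0.0f);  // same result in every thread of the warp
        if ((threadIdx.x & 31) == 0)    // lane 0 writes the warp's result
            warpFlag[i / 32] = vote;
    }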

Device Runtime Component
Texture functions: fetching textures, or texturing
–Texture data may be stored in linear memory or CUDA arrays
–Texturing from linear memory:
  template<class Type>
  Type tex1Dfetch(texture<Type, 1, cudaReadModeElementType> texRef, int x);
  float tex1Dfetch(texture<unsigned char, 1, cudaReadModeNormalizedFloat> texRef, int x);

Device Runtime Component
Texture functions: fetching textures, or texturing
–Texturing from linear memory
–Type can be any of the supported 1-, 2- or 4-component vector types, e.g.:
  template<class Type>
  Type tex1Dfetch(texture<Type, 1, cudaReadModeElementType> texRef, int x);
  float4 tex1Dfetch(texture<uchar4, 1, cudaReadModeNormalizedFloat> texRef, int x);

Device Runtime Component
Texture functions: fetching textures, or texturing
–Texturing from linear memory
–No addressing modes supported
–No texture filtering supported
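
Putting the linear-memory pieces together, a minimal sketch (reference, kernel and buffer names are assumptions): the texture reference is declared at file scope, bound to linear device memory on the host, and fetched with an integer index in the kernel.

    // File-scope texture reference for linear memory
    texture<float, 1, cudaReadModeElementType> texRef;

    __global__ void copyViaTexture(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(texRef, i);  // integer index, no filtering
    }

    // Host side (error checking omitted):
    //   float *dIn, *dOut;
    //   cudaMalloc((void **)&dIn, n * sizeof(float));
    //   cudaMalloc((void **)&dOut, n * sizeof(float));
    //   cudaBindTexture(0, texRef, dIn, n * sizeof(float));
    //   copyViaTexture<<<(n + 255) / 256, 256>>>(dOut, n);
    //   cudaUnbindTexture(texRef);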

Device Runtime Component
Texture functions: fetching textures, or texturing
–Texturing from CUDA arrays:
  template<class Type, enum cudaTextureReadMode readMode>
  Type tex1D(texture<Type, 1, readMode> texRef, float x);
  template<class Type, enum cudaTextureReadMode readMode>
  Type tex2D(texture<Type, 2, readMode> texRef, float x, float y);
  template<class Type, enum cudaTextureReadMode readMode>
  Type tex3D(texture<Type, 3, readMode> texRef, float x, float y, float z);

Device Runtime Component
Texture functions: fetching textures, or texturing
–Texturing from CUDA arrays
–Run-time attributes determine:
  Coordinate normalization
  Addressing mode (clamp/wrap)
  Filtering
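
A sketch of the CUDA-array path (names assumed): the array is created and filled on the host, the run-time attributes are set on the texture reference, and the kernel samples with tex2D.

    // File-scope 2D texture reference
    texture<float, 2, cudaReadModeElementType> tex2Ref;

    __global__ void sampleTexture(float *out, int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < w && y < h)
            out[y * w + x] = tex2D(tex2Ref, x + 0.5f, y + 0.5f);
    }

    // Host side (error checking omitted):
    //   cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    //   cudaArray *arr;
    //   cudaMallocArray(&arr, &desc, w, h);
    //   cudaMemcpyToArray(arr, 0, 0, hostSrc, w * h * sizeof(float),
    //                     cudaMemcpyHostToDevice);
    //   tex2Ref.addressMode[0] = cudaAddressModeClamp;  // run-time attributes
    //   tex2Ref.addressMode[1] = cudaAddressModeClamp;
    //   tex2Ref.filterMode     = cudaFilterModeLinear;
    //   tex2Ref.normalized     = false;
    //   cudaBindTextureToArray(tex2Ref, arr, desc);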

CUDA Runtime Component
–Common Component
–Device Component
–Host Component

Host Runtime Component
Can only be used by host functions
Composed of 2 APIs:
–the high-level CUDA runtime API, which runs on top of
–the low-level CUDA driver API
No mixing: an application should use either one or the other.

Host Runtime Component
Each API provides functions for
–Device management
–Context management
–Memory management
–Code module management
–Execution control
–Texture reference management
–OpenGL/Direct3D interoperability

Host Runtime Component
The CUDA runtime API implicitly provides
–Initialization
–Context management
–Module management
The CUDA driver API does not, and is harder to program.

Host Runtime Component
Recall: nvcc parses an input source file
–Separates device and host code
–Device code is compiled to a cubin object
–Generated host code is in C and compiled by an external tool

Host Runtime Component
Generated host code
–is in C format
–includes the cubin object
Applications may
–ignore the host code and run the cubin object directly using the low-level CUDA driver API
–link to the generated host code and launch it using the high-level CUDA runtime API

Host Runtime Component
The CUDA driver API
–is harder to program
–offers greater control
–does not depend on C
–does not offer device emulation

Host Runtime Component
–CUDA runtime functions and other entry points are prefixed by cuda
–CUDA driver functions and other entry points are prefixed by cu
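
For illustration only (the two fragments belong in separate applications, since the APIs should not be mixed), the same allocation spelled with both prefixes:

    // Runtime API: cuda prefix
    float *d;
    cudaMalloc((void **)&d, 1024 * sizeof(float));
    cudaFree(d);

    // Driver API: cu prefix
    CUdeviceptr dp;
    cuMemAlloc(&dp, 1024 * sizeof(float));
    cuMemFree(dp);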

Host Runtime Component - detour
Device memory is always allocated as either of
–Linear memory
–CUDA arrays

Host Runtime Component - detour
Linear memory on the device
–A contiguous segment of memory
–32-bit addresses
–Can be referenced using pointers

Host Runtime Component - detour
CUDA arrays
–“Opaque” memory layout
–1D/2D/3D arrays of 1-, 2- or 4-component vectors of 8/16/32-bit integers or 16/32-bit floats (16-bit floats from the driver API only)
–Optimized for texture fetching
–Accessible from kernels through texture fetches only

Host Runtime Component
Both the CUDA runtime and CUDA driver APIs
–can access device information
–enable the host to read/write linear memory and CUDA arrays, with support for pinned memory
–provide OpenGL/Direct3D interoperability
–provide management for asynchronous execution

Host Runtime Component
Asynchronous functions
–Kernel launches, and some others:
  Async memory copies
  Device-to-device memory copies
  Memory setting
Concurrent execution of functions is managed through streams
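
A minimal sketch (the kernel name is hypothetical; error checking omitted): asynchronous host-device copies require page-locked (pinned) host memory, and the call returns before the copy completes.

    float *h, *d;
    size_t bytes = 1 << 20;
    cudaMallocHost((void **)&h, bytes);  // pinned host buffer
    cudaMalloc((void **)&d, bytes);
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, 0);  // returns immediately
    someKernel<<<256, 256>>>(d);         // queued behind the copy in stream 0
    cudaThreadSynchronize();             // block until all queued work has finished
    cudaFreeHost(h);
    cudaFree(d);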

Host Runtime Component
Streams
–A queue of operations
–An application may have multiple stream objects simultaneously
–A kernel can be scheduled to execute on a stream:
  kernel<<<gridDim, blockDim, sharedMem, stream>>>(...)
–Some memory copy functions can also be queued on a stream

Host Runtime Component
Streams
–If no stream is specified, stream 0 is used by default
–Operations in a stream are executed synchronously: previous stream operations have to end before a new one begins

Host Runtime Component
The CUDA runtime and driver APIs provide execution control through stream management
–StreamQuery(): is the stream free?
–StreamSynchronize(): wait for stream operations to end

Host Runtime Component
–cudaThreadSynchronize() / cuCtxSynchronize(): wait for all streams to be free
–StreamDestroy(): wait for the stream to get free, then destroy it

Host Runtime Component
Accurate timing using events
–CUevent / cudaEvent_t start, stop;
  EventCreate(&start);  EventCreate(&stop);
–Events have to be recorded:
  EventRecord(start, 0);  // asynchronous
  // stuff to time
  EventRecord(stop, 0);  // asynchronous
–Stream 0: records all operations from all streams
–Stream N: records operations in stream N

Host Runtime Component
Accurate timing using events
–EventRecord(start, 0);  // asynchronous
  // stuff to time
  EventRecord(stop, 0);  // asynchronous
  EventSynchronize(stop);
  float time;
  EventElapsedTime(&time, start, stop);
–As the call to record is asynchronous, the event has to be synchronized before timing
–EventDestroy(start);  EventDestroy(stop);
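
Spelled out with the runtime API prefixes (the kernel being timed is hypothetical), the complete sequence looks like this; cudaEventElapsedTime reports milliseconds.

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);               // queued in stream 0
    someKernel<<<256, 256>>>(dData);         // stuff to time
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);              // wait until stop has actually been recorded
    float ms;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);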

Host Runtime Component
Asynchronous execution can get confusing
–It can be switched off, which is useful for debugging
–Set the environment variable CUDA_LAUNCH_BLOCKING to 1

Host Runtime Component
Device initialization
–CUDA runtime API: automatic with the first function call
–CUDA driver API: cuInit() MUST be called before any other API function

Host Runtime Component
Device management
–cudaDeviceProp / CUdevice device;
–int devCount;
  cudaGetDeviceCount(&devCount) / cuDeviceGetCount(&devCount);
–for dev = 0 to devCount - 1 do
  cudaGetDeviceProperties(&device, dev) / cuDeviceGet(&device, dev);

Host Runtime Component
Device management
–cudaSetDevice()
  Sets the device to be used
  MUST be set before calling any __global__ function
  Device 0 is used by default
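
Combining the two device-management slides into one runtime-API sketch (error checking omitted):

    #include <stdio.h>

    int main(void)
    {
        int devCount;
        cudaGetDeviceCount(&devCount);
        for (int dev = 0; dev < devCount; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("Device %d: %s (compute %d.%d)\n",
                   dev, prop.name, prop.major, prop.minor);
        }
        cudaSetDevice(0);  // select the device before any kernel launch
        return 0;
    }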

Host Runtime Component
Stream management
–CUstream / cudaStream_t st;
–cudaStreamCreate(&st); / cuStreamCreate(&st, 0);
–cudaStreamDestroy(st);
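
A sketch of the full lifecycle (kernel and buffer names assumed; h[i] must be pinned for the async copies): two streams let copies and kernels from different chunks overlap.

    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&s[i]);
    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]);
        someKernel<<<256, 256, 0, s[i]>>>(d[i]);
        cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]);
    }
    cudaThreadSynchronize();  // wait for both streams to drain
    for (int i = 0; i < 2; ++i)
        cudaStreamDestroy(s[i]);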

Host Runtime Component
Event management
–CUevent / cudaEvent_t start, stop;
–EventCreate(&start);  EventCreate(&stop);
–EventRecord(start, 0);  // asynchronous
  // stuff to time
  EventRecord(stop, 0);  // asynchronous
–EventSynchronize(stop);
–float time;
  EventElapsedTime(&time, start, stop);
–EventDestroy(start);  EventDestroy(stop);

All for today
Next time
–More on the host runtime APIs:
  Memory, stream, event, texture management
  Debug mode for the runtime API
  Context, module, execution control for the driver API
–Performance & Optimization

See you next week!