GPU computing and CUDA Marko Mišić (marko.misic@etf.rs) Milo Tomašević (mvt@etf.rs) YUINFO 2012 Kopaonik, 29.02.2012.

Introduction to GPU computing (1)
Graphics Processing Units (GPUs) have been used for non-graphics computation for several years. This trend is called General-Purpose computation on GPUs (GPGPU). GPGPU applications can be found in:
- Computational physics/chemistry/biology
- Signal processing
- Computational geometry
- Database management
- Computational finance
- Computer vision

Introduction to GPU computing (2)
The GPU is a highly parallel processor, good at data-parallel processing with many calculations per memory access.
- The same computation is executed on many data elements in parallel, with high arithmetic intensity.
- Same computation means a lower requirement for sophisticated flow control.
- High arithmetic intensity and many data elements mean that memory access latency can be hidden with calculations instead of big data caches.

CPU vs. GPU trends (1)
The CPU is optimized to execute tasks:
- Big caches hide memory latencies
- Sophisticated flow control
The GPU is specialized for compute-intensive, highly parallel computation (exactly what graphics rendering is about):
- More transistors can be devoted to data processing rather than data caching and flow control
[Diagram: a CPU die dominated by control logic and cache vs. a GPU die dominated by ALUs, each attached to its own DRAM.]

CPU vs. GPU trends (2)
The GPU has evolved into a very flexible and powerful processor:
- Programmable using high-level languages
- Computational power: 1 TFLOPS vs. 100 GFLOPS
- Memory bandwidth: ~10x bigger
- A GPU is found in almost every workstation

CPU vs. GPU trends (3)
The CUDA advantage: early case studies reported speedups from 10x to 197x over CPU implementations.
- Rigid body physics solver
- Matrix numerics: BLAS1: 60+ GB/s, BLAS3: 100+ GFLOPS
- Wave equation: FDTD: 1.2 Gcells/s
- FFT: 52 GFLOPS (GFLOPS as defined by benchFFT)
- Biological sequence match: SSEARCH: 5.2 Gcells/s
- Finance: Black-Scholes: 4.7 GOptions/s

History of GPU programming
- The fast-growing video game industry exerts strong pressure that forces constant innovation
- GPUs evolved from fixed-function pipeline processors to more programmable, general-purpose processors
- Programmable shaders (2000)
  - Programmed through the OpenGL and DirectX APIs
  - Lots of limitations: memory access, ISA, floating-point support, etc.
- NVIDIA CUDA (2007)
- AMD/ATI (Brook+, FireStream, Close-To-Metal)
- Microsoft DirectCompute (DirectX 10/DirectX 11)
- Open Computing Language, OpenCL (2009)

CUDA overview (1)
- Compute Unified Device Architecture (CUDA)
- A hardware and software architecture for issuing and managing computations on the GPU
- Started with the NVIDIA 8000 (G80) series GPUs
- General-purpose programming model (SIMD/SPMD)
  - The user launches batches of threads on the GPU
  - The GPU can be seen as a dedicated, super-threaded, massively data-parallel coprocessor
- Explicit and unrestricted memory management

CUDA overview (2)
- The GPU is viewed as a compute device that is a coprocessor to the CPU (host)
  - Executes the compute-intensive part of the application
  - Runs many threads in parallel
  - Has its own DRAM (device memory)
- Data-parallel portions of an application are expressed as device kernels which run on many threads
- GPU threads are extremely lightweight
  - Very little creation overhead
  - The GPU needs 1000s of threads for full efficiency, while a multicore CPU needs only a few

CUDA overview (3)
- Dedicated software stack
  - Runtime and driver
  - C-language extension for easier programming
  - Targeted API for advanced users
- Complete tool chain: compiler, debugger, profiler
- Libraries and 3rd-party support
  - GPU Computing SDK, cuFFT, cuBLAS...
  - FORTRAN, C++, Python, MATLAB, Thrust, GMAC...
[Diagram: software stack — application → CUDA libraries (FFT, BLAS) → CUDA runtime → CUDA driver → GPU, alongside the CPU.]

Programming model (1)
- A CUDA application consists of two parts:
  - Sequential parts are executed on the CPU (host)
  - Compute-intensive parts are executed on the GPU (device)
- The CPU is responsible for data management, memory transfers, and the GPU execution configuration
- Execution alternates between serial host code and parallel device kernels:
  serial code (host) ... KernelA<<< nBlk, nTid >>>(args); ... serial code (host) ... KernelB<<< nBlk, nTid >>>(args);
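
A minimal sketch of this split, assuming a hypothetical vecAdd kernel and illustrative device array names (d_a, d_b, d_c):

    // Device kernel: the data-parallel part, run by many lightweight threads.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)
            c[i] = a[i] + b[i];                          // one element per thread
    }

    // Host (serial) code: chooses the execution configuration and launches the kernel.
    int nTid = 256;                              // threads per block
    int nBlk = (n + nTid - 1) / nTid;            // enough blocks to cover n elements
    vecAdd<<<nBlk, nTid>>>(d_a, d_b, d_c, n);    // parallel kernel on the device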

Programming model (2)
- A kernel is executed as a grid of thread blocks
- A thread block is a batch of threads that can cooperate with each other by:
  - Efficiently sharing data through shared memory
  - Synchronizing their execution
- Two threads from two different blocks cannot cooperate
[Diagram: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of thread blocks, and each block is an array of threads.]

Programming model (3)
- Threads and blocks have IDs, so each thread can decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- This simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes
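
As an illustration of 2D indexing (the kernel name, image layout, and tile size are assumptions, not part of the original slides):

    // Each thread computes its (x, y) position in the image from its
    // block and thread IDs and processes one pixel.
    __global__ void scalePixels(float *img, int width, int height, float s)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
        if (x < width && y < height)
            img[y * width + x] *= s;                     // row-major addressing
    }

    // Host side: a 2D grid of 2D blocks covering the whole image.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scalePixels<<<grid, block>>>(d_img, width, height, 2.0f);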

Memory model (1)
Each thread can:
- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block shared memory
- Read/write per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
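
A small sketch of how variable declarations map onto these memory spaces (names and sizes are illustrative; it assumes a launch of 4 blocks of 256 threads covering 1024 elements):

    __constant__ float coeff[16];        // per-grid constant memory (read-only in kernels)
    __device__   float globalBuf[1024];  // per-grid global memory

    __global__ void scaleByCoeff(float *out)
    {
        __shared__ float tile[256];      // per-block shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float tmp = globalBuf[i] * coeff[i % 16];   // tmp lives in a per-thread register
        tile[threadIdx.x] = tmp;
        __syncthreads();
        out[i] = tile[threadIdx.x];      // write back to global memory
    }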

Memory model (2)
- The host can read/write global, constant, and texture memory (all stored in device DRAM)
- Global memory accesses are slow, around ~200 cycles
- The memory architecture is optimized for high bandwidth
  - Memory banks
  - Transactions
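
A host-side sketch of managing device DRAM (array sizes and names are illustrative):

    float *h_data = (float *)malloc(n * sizeof(float));   // host (CPU) memory
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));      // device global memory (DRAM)

    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
    // ... launch kernels that read and write d_data ...
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);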

Memory model (3)
- Shared memory is a fast on-chip memory
  - Allows threads in a block to share intermediate data
  - Access time ~3-4 cycles
  - Can be seen as a user-managed cache (scratchpad)
- Threads are responsible for bringing data into shared memory and moving it back out
- Small in size (up to 48 KB)
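
A minimal sketch of this scratchpad pattern, assuming a hypothetical kernel launched with 256 threads per block on an array whose length is a multiple of 256:

    __global__ void reverseWithinBlock(float *d)
    {
        __shared__ float s[256];                    // fast on-chip, per-block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        s[threadIdx.x] = d[i];                      // threads bring data into shared memory
        __syncthreads();                            // wait until the whole block has loaded
        d[i] = s[blockDim.x - 1 - threadIdx.x];     // work out of shared memory, write back
    }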

A common programming strategy
- Local and global memory reside in device memory (DRAM), with much slower access than shared memory
- A common way of performing computation on the device is to block it up (tile it) to take advantage of fast shared memory:
  - Partition the data set into subsets that fit into shared memory
  - Handle each data subset with one thread block by:
    - Loading the subset from global memory to shared memory
    - Performing the computation on the subset from shared memory, where each thread can efficiently make multiple passes over any data element
    - Copying results from shared memory back to global memory

Matrix Multiplication Example (1)
- P = M * N of size WIDTH x WIDTH
- Without blocking:
  - One thread handles one element of P
  - M and N are loaded WIDTH times from global memory
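
A sketch of this non-blocked version (kernel and parameter names are illustrative); each thread reads one full row of M and one full column of N from global memory:

    __global__ void matMul(const float *M, const float *N, float *P, int width)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < width && col < width) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k)
                sum += M[row * width + k] * N[k * width + col];   // WIDTH global loads per input matrix
            P[row * width + col] = sum;
        }
    }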

Matrix Multiplication Example (2)
- P = M * N of size WIDTH x WIDTH
- With blocking:
  - One thread block handles one BLOCK_SIZE x BLOCK_SIZE sub-matrix Psub of P
  - M and N are loaded only WIDTH / BLOCK_SIZE times from global memory
  - Great saving of memory bandwidth!
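
A sketch of the tiled version, assuming WIDTH is a multiple of BLOCK_SIZE (the usual simplification in this example):

    #define BLOCK_SIZE 16

    __global__ void matMulTiled(const float *M, const float *N, float *P, int width)
    {
        __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];   // tile of M staged in shared memory
        __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];   // tile of N staged in shared memory

        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < width / BLOCK_SIZE; ++t) {
            // each thread loads one element of the current M and N tiles
            Ms[threadIdx.y][threadIdx.x] = M[row * width + t * BLOCK_SIZE + threadIdx.x];
            Ns[threadIdx.y][threadIdx.x] = N[(t * BLOCK_SIZE + threadIdx.y) * width + col];
            __syncthreads();                           // tile fully loaded

            for (int k = 0; k < BLOCK_SIZE; ++k)       // compute from shared memory
                sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
            __syncthreads();                           // done with this tile
        }
        P[row * width + col] = sum;                    // each input element loaded only WIDTH/BLOCK_SIZE times
    }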

CUDA API (1)
The CUDA API is an extension to the C programming language consisting of:
- Language extensions, to target portions of the code for execution on the device
- A runtime library split into:
  - A common component providing built-in vector types and a subset of the C runtime library in both host and device code
  - A host component to control and access one or more devices from the host
  - A device component providing device-specific functions

CUDA API (2)
- Function declaration qualifiers: __global__, __host__, __device__
- Variable qualifiers: __device__, __shared__, __constant__, etc.
- Built-in variables: gridDim, blockDim, blockIdx, threadIdx
- Mathematical functions
- Kernel calling convention (execution configuration):
  myKernel<<< DimGrid, DimBlock >>>(arg1, ... );
  - The programmer explicitly specifies the block and grid organization (1D, 2D, or 3D)
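
A sketch of an execution configuration for a 2D problem (the sizes are illustrative):

    dim3 dimBlock(16, 16);                              // 256 threads per 2D block
    dim3 dimGrid((width  + dimBlock.x - 1) / dimBlock.x,
                 (height + dimBlock.y - 1) / dimBlock.y);
    myKernel<<<dimGrid, dimBlock>>>(arg1 /* , ... */);  // grid and block shape chosen by the programmer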

Hardware implementation (1)
- The device is a set of multiprocessors
- Each multiprocessor is a set of 32-bit processors with a SIMD architecture
  - At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp
  - Including branches
- Allows scalable execution of kernels: adding more multiprocessors improves performance
[Diagram: a device with N multiprocessors, each containing M processors sharing a single instruction unit.]

Hardware implementation (2)
[Diagram: the host feeds work through the input assembler and thread execution manager to the array of multiprocessors; each multiprocessor has a parallel data cache and texture unit, and load/store units connect to global memory.]

Hardware implementation (3)
- Each thread block of a grid is split into warps that get executed by one multiprocessor
  - A warp consists of threads with consecutive thread IDs
- Each thread block is executed by only one multiprocessor
  - Its shared memory space resides in the on-chip shared memory
  - Registers are allocated among its threads; a kernel that requires too many registers will fail to launch
- A multiprocessor can execute several blocks concurrently
  - Shared memory and registers are allocated among the threads of all concurrent blocks
  - Decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently

Memory architecture (1)
- In a parallel machine, many threads access memory, so memory is divided into banks
  - Essential to achieve high bandwidth
- Each bank can service one address per cycle
  - A memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to the same bank result in a bank conflict
  - Conflicting accesses are serialized
- Shared memory is organized in a similar fashion (e.g. banks 0 through 15)
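
An illustrative sketch (the kernel is hypothetical): padding a shared-memory tile by one column makes column-wise accesses by a warp fall into different banks, avoiding the serialization caused by bank conflicts. It assumes a square matrix whose width is a multiple of 16:

    __global__ void transposeTile(const float *in, float *out, int width)
    {
        __shared__ float tile[16][16 + 1];   // +1 column of padding avoids bank conflicts

        int x = blockIdx.x * 16 + threadIdx.x;
        int y = blockIdx.y * 16 + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // row-wise write

        __syncthreads();

        int tx = blockIdx.y * 16 + threadIdx.x;
        int ty = blockIdx.x * 16 + threadIdx.y;
        out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // column-wise read, conflict-free
    }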

Memory architecture (2)
- When accessing global memory, accesses are combined into transactions
- Peak bandwidth is achieved when all threads in a half-warp access consecutive memory locations ("memory coalescing")
  - In that case, there are no bank conflicts
- The programmer is responsible for optimizing algorithms to access data in an appropriate fashion
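
A sketch contrasting coalesced and strided access patterns (kernel names are illustrative):

    __global__ void coalescedCopy(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];     // neighbouring threads touch neighbouring addresses:
                                // accesses combine into a few wide transactions
    }

    __global__ void stridedCopy(const float *in, float *out, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n)
            out[i] = in[i];     // threads touch addresses far apart:
                                // many separate transactions, much lower effective bandwidth
    }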

Performance considerations
- CUDA has a low learning curve: it is easy to write a correct program
- However, performance can vary greatly depending on the resource constraints of the particular device architecture
- Performance-conscious programmers still need to be aware of these constraints to make good use of contemporary hardware
- It is essential to understand the hardware and memory architecture:
  - Thread scheduling and execution
  - Suitable memory access patterns
  - Shared memory utilization
  - Resource limitations

Conclusion
- The highly multithreaded architecture of modern GPUs is very suitable for solving data-parallel problems
  - Vastly improves performance in certain domains
- GPU architectures are expected to evolve and further broaden their application domains
- We are at the dawn of heterogeneous computing
- Software support is developing rapidly
  - Mature tool chain
  - Libraries
  - Available applications
  - OpenCL

References
- David Kirk, Wen-mei Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2010.
- Course ECE498AL, University of Illinois, Urbana-Champaign, http://courses.engr.illinois.edu/ece498/al/
- Dann Connors, OpenCL and CUDA Programming for Multicore and GPU Architectures, ACACES 2011, Fiuggi, Italy, 2011.
- David Kirk, Wen-mei Hwu, Programming and Tuning Massively Parallel Systems, PUMPS 2011, Barcelona, Spain, 2011.
- NVIDIA CUDA C Programming Guide 4.0, 2011.
- Mišić, Đurđević, Tomašević, "Evolution and Trends in GPU Computing", MIPRO 2012, Abbazia, Croatia, 2012. (to be published)
- NVIDIA Developer Zone, http://developer.nvidia.com/category/zone/cuda-zone
- http://en.wikipedia.org/wiki/GPGPU
- http://en.wikipedia.org/wiki/CUDA
- GPU training wiki, https://hpcforge.org/plugins/mediawiki/wiki/gpu-training/index.php/Main_Page

GPU computing and CUDA Questions? Marko Mišić (marko.misic@etf.rs) Milo Tomašević (mvt@etf.rs) YUINFO 2012 Kopaonik, 29.02.2012.