Massively Parallel Computing with CUDA
Dr. Antonino Tumeo, Politecnico di Milano

“GPUs have evolved to the point where many real-world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs.” — Jack Dongarra, Professor, University of Tennessee; author of “Linpack”

Why Use the GPU?
The GPU has evolved into a very flexible and powerful processor:
- It is programmable using high-level languages
- It supports 32-bit and 64-bit IEEE-754 floating-point precision
- It offers lots of GFLOPS
- There is a GPU in every PC and workstation

What is behind such an Evolution?
- The GPU is specialized for compute-intensive, highly parallel computation (exactly what graphics rendering is about)
- So, more transistors can be devoted to data processing rather than data caching and flow control
- The fast-growing video game industry exerts strong economic pressure that forces constant innovation
[Diagram: CPU vs. GPU die area — the CPU devotes much of its area to control logic and cache, while the GPU devotes it to ALUs; each is attached to its own DRAM]

GPUs
Each NVIDIA GPU has 240 parallel cores. Within each core:
- Floating-point unit
- Logic unit (add, sub, mul, madd)
- Move, compare unit
- Branch unit
The cores are managed by a thread manager that can spawn and manage 12,000+ threads, with zero-overhead thread switching.
[Figure: NVIDIA GPU — 1.4 billion transistors, 1 teraflop of processing power]

Heterogeneous Computing Domains
Application domains: Oil & Gas, Finance, Medical, Biophysics, Numerics, Audio, Video, Imaging
[Diagram: a spectrum from the CPU (sequential computing, instruction-level parallelism, data fits in cache) to the GPU (parallel computing and graphics, massive data parallelism, larger data sets)]

CUDA Parallel Programming Architecture and Model
Programming the GPU in High-Level Languages

CUDA is C for Parallel Processors
CUDA is industry-standard C with minimal extensions:
- Write a program for one thread
- Instantiate it on many parallel threads
- Familiar programming model and language
CUDA is a scalable parallel programming model:
- A program runs on any number of processors without recompiling
CUDA parallelism applies to both CPUs and GPUs:
- Compile the same program source to run on different platforms with widely different parallelism
- Map CUDA threads to GPU threads or to CPU vectors

CUDA Parallel Computing Architecture
- Parallel computing architecture and programming model
- Includes a C compiler plus support for OpenCL and DX11 Compute
- Architected to natively support all computational interfaces (standard languages and APIs)
The NVIDIA GPU architecture accelerates CUDA:
- Hardware and software designed together for computing
- Exposes the computational horsepower of NVIDIA GPUs
- Enables general-purpose GPU computing

CUDA Software Stack
[Diagram: application software (written in C) sits on top of CUDA libraries (cuFFT, cuBLAS, cuDPP), the CUDA compiler (C, Fortran), and CUDA tools (debugger, profiler); the stack runs on a 4-core CPU connected to GPU hardware through a PCI-E switch in a 1U server]

Pervasive CUDA Parallel Computing
CUDA brings data-parallel computing to the masses:
- Over 100M CUDA-capable GPUs deployed since Nov 2006
- Wide developer acceptance (www.nvidia.com/CUDA)
- Over 150K CUDA developer downloads (CUDA is free!)
- Over 25K CUDA developers... and growing rapidly
- A GPU “developer kit” costs ~$200 for 500 GFLOPS
- Now available on any new MacBook
Data-parallel supercomputers are everywhere, and CUDA makes this power readily accessible, enabling rapid innovations in data-parallel computing. Massively parallel computing has become a commodity technology!

CUDA Computing with Tesla
- 240 SP processors at 1.5 GHz: 1 TFLOPS peak
- 128 threads per processor: 30,720 threads total
- Tesla PCI-E board: C1060 (1 GPU)
- 1U server: S1070 (4 GPUs)

CUDA Uses Extensive Multithreading
CUDA threads express fine-grained data parallelism:
- Map threads to GPU threads
- Virtualize the processors
- You must rethink your algorithms to be aggressively parallel
CUDA thread blocks express coarse-grained parallelism:
- Blocks hold arrays of GPU threads and define shared-memory boundaries
- They allow scaling between smaller and larger GPUs
GPUs execute thousands of lightweight threads (in graphics, each thread computes one pixel). One CUDA thread computes one result (or several results). Hardware multithreading provides zero-overhead scheduling.

CUDA Computing Sweet Spots
Parallel applications with:
- High bandwidth: sequencing (virus scanning, genomics), sorting, databases, ...
- Visual computing: graphics, image processing, tomography, machine vision, ...
- High arithmetic intensity: dense linear algebra, PDEs, n-body, finite difference, ...

A Highly Multithreaded Coprocessor
The GPU is a highly parallel compute coprocessor:
- It serves as a coprocessor for the host CPU
- It has its own device memory with a high-bandwidth interconnect (~102 GB/s memory bandwidth)
The application runs its parallel parts on the GPU via kernels; many threads execute the same kernel (SIMT: Single Instruction, Multiple Threads). GPU threads are extremely lightweight, with very little creation overhead and instant switching, and the GPU uses thousands of threads for efficiency.

Heterogeneous Programming
A CUDA application is a serial program executing parallel kernels, all in C:
- Serial C code is executed by a CPU thread
- Parallel kernel C code is executed by the GPU, in threads grouped into blocks
Execution alternates between the two: serial code ... KernelA<<<nBlk, nTid>>>(args); ... serial code ... KernelB<<<nBlk, nTid>>>(args); ...
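
To make that structure concrete, here is a minimal sketch of the serial/parallel alternation. The kernel bodies, data size, and launch configuration are assumptions for illustration, not taken from the slides:

    __global__ void KernelA(float *data) { /* parallel work, one thread per element */ }
    __global__ void KernelB(float *data) { /* more parallel work */ }

    int main(void)
    {
        float *d_data;
        cudaMalloc(&d_data, 1024 * sizeof(float));   // device (global) memory

        /* ... serial C code on the CPU ... */
        KernelA<<<4, 256>>>(d_data);                 // 4 blocks x 256 threads
        /* ... serial C code ... */
        KernelB<<<4, 256>>>(d_data);
        cudaDeviceSynchronize();                     // wait for the GPU to finish

        cudaFree(d_data);
        return 0;
    }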

Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads:
- All threads run the same program: SIMT (Single Instruction, Multiple Threads)
- Each thread uses its ID to compute addresses and make control decisions:

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
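
Expanded into a complete kernel, the fragment might look like the sketch below; the kernel name, the stand-in body of func, and the bounds check are assumptions, not from the slide:

    __device__ float func(float x) { return 2.0f * x; }  // stand-in for func

    __global__ void apply_func(const float *input, float *output, int n)
    {
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;  // unique per thread
        if (threadID < n) {                                    // guard the tail
            float x = input[threadID];
            float y = func(x);
            output[threadID] = y;
        }
    }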

CUDA Programming Model
A kernel is executed by a grid, which contains blocks; these blocks contain the threads.
A thread block is a batch of threads that can cooperate by:
- Sharing data through shared memory
- Synchronizing their execution
Threads from different blocks operate independently.
[Diagram: the host launches Kernel 1 on Grid 1 (a 2D arrangement of blocks, Block (0,0) through Block (2,0) plus Block (1,1)) and Kernel 2 on Grid 2; each block, e.g. Block (1,1), is itself a 2D array of threads, Thread (0,0) through Thread (4,2)]
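
Grids and blocks can be multidimensional, which is what the diagram depicts. A minimal sketch of launching such a 2D configuration follows; the kernel name, sizes, and arguments are assumptions chosen to match the diagram:

    __global__ void kernel1(float *out, int width)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index
        out[y * width + x] = 0.0f;                      // one element per thread
    }

    void launch_from_host(float *d_out)  // d_out: device pointer (assumed)
    {
        dim3 grid(3, 2);    // 3 x 2 blocks, matching Grid 1 in the diagram
        dim3 block(5, 3);   // 5 x 3 threads per block, matching the diagram
        kernel1<<<grid, block>>>(d_out, 15);  // width 15 = 3 blocks * 5 threads
    }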

Thread Blocks: Scalable Cooperation
Divide the monolithic thread array into multiple blocks:
- Threads within a block cooperate via shared memory
- Threads in different blocks cannot cooperate
This enables programs to transparently scale to any number of processors!
[Diagram: Thread Block 0, Thread Block 1, ..., Thread Block N-1, each executing the same per-thread body shown on the previous slide, indexed by threadID]

Thread Cooperation
Thread cooperation is a powerful feature of CUDA: threads can cooperate via on-chip shared memory and synchronization. The on-chip shared memory within one block allows threads to:
- Share memory accesses, for a drastic reduction in memory bandwidth
- Share intermediate results, and thus save computation
This makes porting algorithms to GPUs a lot easier (vs. classic GPGPU and its strict stream-processor model).
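
A block-level sum reduction is a standard illustration of this cooperation; everything in the sketch below (names, block size) is an assumption rather than code from the deck:

    #define BLOCK 256   // assumed block size; launch with BLOCK threads per block

    __global__ void block_sum(const float *in, float *partial)
    {
        __shared__ float s[BLOCK];                // on-chip shared memory
        int tid = threadIdx.x;
        s[tid] = in[blockIdx.x * BLOCK + tid];    // one global read per thread
        __syncthreads();                          // barrier: all loads done

        for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                s[tid] += s[tid + stride];        // share intermediate results
            __syncthreads();                      // all threads reach the barrier
        }
        if (tid == 0)
            partial[blockIdx.x] = s[0];           // one result per block
    }

Without shared memory, each of the log2(BLOCK) reduction steps would round-trip through off-chip global memory; here the intermediate sums never leave the chip.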

Reason for Blocks: GPU Scalability
- G80: 128 cores
- G84: 32 cores
- Tesla: 240 SP cores

Transparent Scalability
Hardware is free to schedule thread blocks on any processor, so kernels scale to any number of parallel multiprocessors.
[Diagram: the same kernel grid of Blocks 0-7 is scheduled across devices with different numbers of multiprocessors — two blocks at a time on one device, four at a time on another]

Parallel Core GPU – Block Diagram
Tesla T10 chip (Tesla C1060 / one GPU of a Tesla S1070):
- 240 units execute kernel threads, grouped into 30 multiprocessors (organized as 10 clusters of 3)
- Up to 30,720 parallel threads can be active in the multiprocessors
- Threads are grouped in blocks, which provide shared memory: scalability!

Memory Model Seen from a CUDA Kernel
Registers (per thread).
Shared memory:
- Shared among threads in a single block
- On-chip, small
- As fast as registers
Global memory:
- Kernel inputs and outputs reside here
- Off-chip, large
- Uncached (use coalescing)
Note: the host can read and write global memory, but not shared memory.
[Diagram: a grid with Block (0,0) and Block (1,0), each with its own shared memory and per-thread registers for Thread (0,0) and Thread (1,0); all blocks and the host access global memory]
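
As a minimal sketch of where data lives inside a kernel (the names are assumptions, and the code assumes blockDim.x == 256):

    __global__ void memory_spaces(const float *in, float *out)  // in/out: global
    {
        __shared__ float tile[256];          // shared: on-chip, per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];           // global -> shared
        __syncthreads();                     // make the tile visible block-wide
        float r = tile[threadIdx.x] * 2.0f;  // r lives in a register
        out[i] = r;                          // register -> global
    }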

Simple “C” Extensions to Express Parallelism

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

CUDA C code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
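
For completeness, a host-side driver for saxpy_parallel might look like the following sketch; the device-pointer names are assumptions, and error handling is omitted:

    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);    // kernel updates d_y

    cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);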

Compilation
Any source file containing CUDA language extensions must be compiled with nvcc (e.g., nvcc saxpy.cu -o saxpy). NVCC is a compiler driver: it works by invoking all the necessary tools and compilers, such as cudacc, g++, and cl. NVCC can output:
- Either C code (CPU code), which must then be compiled with the rest of the application using another tool
- Or PTX object code directly
Any executable with CUDA code requires two dynamic libraries:
- The CUDA runtime library (cudart)
- The CUDA core library (cuda)

Compiling C for CUDA Applications
[Diagram: the C/CUDA source is split — NVCC compiles the key CUDA kernels into CUDA object files, the rest of the C application is compiled into CPU object files, and the linker combines both into a single CPU-GPU executable]

Keys to GPU Computing Performance
Hardware thread management:
- Thousands of lightweight concurrent threads
- No switching overhead
- Hides instruction and memory latency
On-chip shared memory:
- User-managed data cache
- Thread communication/cooperation within blocks
Random access to global memory:
- Any thread can read/write any location(s)
- Direct host access

NVIDIA C for CUDA and OpenCL
[Diagram: both C for CUDA (the entry point for developers who prefer high-level C) and OpenCL (the entry point for developers who want a low-level API) compile through shared back-end compiler and optimization technology down to PTX, which runs on the GPU]

Different Programming Styles
C for CUDA:
- C with parallel keywords
- C runtime that abstracts the driver API
- Memory managed by the C runtime
- Generates PTX
OpenCL:
- Hardware API, similar to OpenGL and the CUDA driver API
- The programmer has complete access to the hardware device
- Memory managed by the programmer
- Generates PTX

Heterogeneous Computing
[Diagram: CPU + GPU heterogeneous systems serving Oil & Gas, Finance, Medical, Biophysics, Numerics, Audio, Video, and Imaging — 100M CUDA GPUs, 30K CUDA developers]

Resources
NVIDIA CUDA Zone (www.nvidia.com/CUDA):
- SDK
- Manuals
- Papers
- Forum
- Courses