© NVIDIA Corporation 2009 Mark Harris NVIDIA Corporation Tesla GPU Computing A Revolution in High Performance Computing

© NVIDIA Corporation 2009 Agenda CUDA Review Architecture Programming Model Memory Model CUDA C CUDA General Optimizations Fermi Next Generation Architecture Getting Started Resources

© NVIDIA Corporation 2009 CUDA ARCHITECTURE

© NVIDIA Corporation 2009 CUDA Parallel Computing Architecture Parallel computing architecture and programming model Includes a CUDA C compiler, support for OpenCL and DirectCompute Architected to natively support multiple computational interfaces (standard languages and APIs)

© NVIDIA Corporation 2009 CUDA Parallel Computing Architecture CUDA defines: Programming model Memory model Execution model CUDA uses the GPU, but is for general-purpose computing Facilitate heterogeneous computing: CPU + GPU CUDA is scalable Scale to run on 100s of cores/1000s of parallel threads

© NVIDIA Corporation 2009 PROGRAMMING MODEL CUDA

© NVIDIA Corporation 2009 CUDA Kernels Parallel portion of application: executes as a kernel Entire GPU executes kernel, many threads CUDA threads: Lightweight Fast switching 1000s execute simultaneously CPU (Host) executes functions; GPU (Device) executes kernels

© NVIDIA Corporation 2009 CUDA Kernels: Parallel Threads A kernel is a function executed on the GPU Array of threads, in parallel All threads execute the same code, can take different paths Each thread has an ID Select input/output data Control decisions float x = input[threadID]; float y = func(x); output[threadID] = y;
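A minimal sketch of how the snippet above might look as a complete kernel; the names applyFunc, func, input, output and N are illustrative, not from the slide:

__global__ void applyFunc(const float* input, float* output, int N)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;  // unique ID per thread
    if (threadID < N)                                       // guard the array bounds
    {
        float x = input[threadID];
        float y = func(x);          // func: any per-element __device__ function
        output[threadID] = y;
    }
}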

© NVIDIA Corporation 2009 CUDA Kernels: Subdivide into Blocks

© NVIDIA Corporation 2009 CUDA Kernels: Subdivide into Blocks Threads are grouped into blocks

© NVIDIA Corporation 2009 CUDA Kernels: Subdivide into Blocks Threads are grouped into blocks Blocks are grouped into a grid

© NVIDIA Corporation 2009 CUDA Kernels: Subdivide into Blocks Threads are grouped into blocks Blocks are grouped into a grid A kernel is executed as a grid of blocks of threads

© NVIDIA Corporation 2009 CUDA Kernels: Subdivide into Blocks Threads are grouped into blocks Blocks are grouped into a grid A kernel is executed as a grid of blocks of threads GPU

© NVIDIA Corporation 2009 Communication Within a Block Threads may need to cooperate Memory accesses Share results Cooperate using shared memory Accessible by all threads within a block Restriction to “within a block” permits scalability Fast communication between N threads is not feasible when N large

© NVIDIA Corporation 2009 Transparent Scalability (figure-only slides: the same grid of blocks is scheduled onto GPUs with different numbers of multiprocessors; unoccupied multiprocessors are shown as Idle)

© NVIDIA Corporation 2009 CUDA Programming Model - Summary A kernel executes as a grid of thread blocks A block is a batch of threads Communicate through shared memory Each block has a block ID Each thread has a thread ID (figure: host launches Kernel 1 and Kernel 2 as grids of blocks on the device, with block indices such as (0,0)…(1,3); grids and blocks can be 1D or 2D)
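As a hedged illustration of the block-ID/thread-ID bullets and of a 2D grid, a kernel might compute its global coordinates like this; the kernel name, data layout and dimensions are assumptions:

__global__ void kernel2D(float* data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // global column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // global row index
    if (x < width && y < height)
        data[y * width + x] *= 2.0f;                 // example per-element work
}

// Host-side launch over a 2D grid of 2D blocks (d_data: device pointer allocated elsewhere)
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
kernel2D<<< grid, block >>>(d_data, width, height);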

© NVIDIA Corporation 2009 MEMORY MODEL CUDA

© NVIDIA Corporation 2009 Memory hierarchy Thread: Registers

© NVIDIA Corporation 2009 Memory hierarchy Thread: Registers Thread: Local memory

© NVIDIA Corporation 2009 Memory hierarchy Thread: Registers Thread: Local memory Block of threads: Shared memory

© NVIDIA Corporation 2009 Memory hierarchy Thread: Registers Thread: Local memory Block of threads: Shared memory All blocks: Global memory

© NVIDIA Corporation 2009 PROGRAMMING ENVIRONMENT CUDA

© NVIDIA Corporation 2009 CUDA APIs API allows the host to manage the devices Allocate memory & transfer data Launch kernels CUDA C “Runtime” API High level of abstraction - start here! CUDA C “Driver” API More control, more verbose OpenCL Similar to CUDA C Driver API

© NVIDIA Corporation 2009 CUDA C and OpenCL Shared back-end compiler and optimization technology Entry point for developers who want low-level API Entry point for developers who prefer high-level C

© NVIDIA Corporation 2009 Visual Studio Separate file types: .c/.cpp for host code, .cu for device/mixed code Compilation rules: cuda.rules Syntax highlighting Intellisense Integrated debugger and profiler: Nexus

© NVIDIA Corporation 2009 NVIDIA Nexus IDE The industry’s first IDE for massively parallel applications Accelerates co-processing (CPU + GPU) application development Complete Visual Studio-integrated development environment

© NVIDIA Corporation 2009 NVIDIA Nexus IDE - Debugging

© NVIDIA Corporation 2009 NVIDIA Nexus IDE - Profiling

© NVIDIA Corporation 2009 Linux Separate file types: .c/.cpp for host code, .cu for device/mixed code Typically makefile driven cuda-gdb for debugging CUDA Visual Profiler

© NVIDIA Corporation 2009 CUDA C CUDA

© NVIDIA Corporation 2009 Example: Vector Addition Kernel // Compute vector sum C = A+B // Each thread performs one pair-wise addition __global__ void vecAdd(float* A, float* B, float* C) { int i = threadIdx.x + blockDim.x * blockIdx.x; C[i] = A[i] + B[i]; } int main() { // Run N/256 blocks of 256 threads each vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C); } Device Code

© NVIDIA Corporation 2009 Example: Vector Addition Kernel // Compute vector sum C = A+B // Each thread performs one pair-wise addition __global__ void vecAdd(float* A, float* B, float* C) { int i = threadIdx.x + blockDim.x * blockIdx.x; C[i] = A[i] + B[i]; } int main() { // Run N/256 blocks of 256 threads each vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C); } Host Code

© NVIDIA Corporation 2009 CUDA: Memory Management Explicit memory allocation returns pointers to GPU memory cudaMalloc() cudaFree() Explicit memory copy for host ↔ device, device ↔ device cudaMemcpy()
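A minimal sketch of these calls in sequence (the buffer names and size are illustrative):

size_t bytes = N * sizeof(float);
float* d_buf = NULL;

cudaMalloc((void**)&d_buf, bytes);                        // allocate GPU memory
cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);  // host -> device copy
// ... launch kernels that read and write d_buf ...
cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);  // device -> host copy
cudaFree(d_buf);                                          // release GPU memory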

© NVIDIA Corporation 2009 Example: Vector Addition Kernel // Compute vector sum C = A+B // Each thread performs one pair-wise addition __global__ void vecAdd(float* A, float* B, float* C) { int i = threadIdx.x + blockDim.x * blockIdx.x; C[i] = A[i] + B[i]; } int main() { // Run N/256 blocks of 256 threads each vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C); }

© NVIDIA Corporation 2009 Example: Host code for vecAdd // allocate and initialize host (CPU) memory float *h_A = …, *h_B = …; // allocate device (GPU) memory float *d_A, *d_B, *d_C; cudaMalloc( (void**) &d_A, N * sizeof(float)); cudaMalloc( (void**) &d_B, N * sizeof(float)); cudaMalloc( (void**) &d_C, N * sizeof(float)); // copy host memory to device cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice ); cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice ); // execute the kernel on N/256 blocks of 256 threads each vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
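The slide stops at the kernel launch; a likely continuation (an assumption, not shown on the slide) copies the result back and frees the device buffers, assuming a host array h_C has been allocated:

// copy result back to host, then release device memory
cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);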

© NVIDIA Corporation 2009 CUDA: Minimal extensions to C/C++ Declaration specifiers to indicate where things live __global__ void KernelFunc(...); // kernel callable from host __device__ void DeviceFunc(...); // function callable on device __device__ int GlobalVar; // variable in device memory __shared__ int SharedVar; // in per-block shared memory Extend function invocation syntax for parallel kernel launch KernelFunc<<< 500, 128 >>>(...); // 500 blocks, 128 threads each Special variables for thread identification in kernels dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim; Intrinsics that expose specific operations in kernel code __syncthreads(); // barrier synchronization

© NVIDIA Corporation 2009 Synchronization of blocks Threads within block may synchronize with barriers … Step 1 … __syncthreads(); … Step 2 … Blocks coordinate via atomic memory operations e.g., increment shared queue pointer with atomicInc() Implicit barrier between dependent kernels vec_minus<<< … >>>(a, b, c); vec_dot<<< … >>>(c, c);
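As a sketch of the atomic-coordination bullet (using atomicAdd rather than the wrapping atomicInc; the queue layout and names are assumptions), blocks can append results to a shared output queue:

__device__ unsigned int queueTail = 0;   // next free slot in a global output queue

__global__ void produce(float* queue, const float* input, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N && input[i] > 0.0f)                     // keep only "interesting" items
    {
        unsigned int slot = atomicAdd(&queueTail, 1); // atomically claim a slot
        queue[slot] = input[i];
    }
}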

© NVIDIA Corporation 2009 Using per-block shared memory Variables shared across block __shared__ int *begin, *end; Scratchpad memory __shared__ int scratch[blocksize]; scratch[threadIdx.x] = begin[threadIdx.x]; // … compute on scratch values … begin[threadIdx.x] = scratch[threadIdx.x]; Communicating values between threads scratch[threadIdx.x] = begin[threadIdx.x]; __syncthreads(); int left = scratch[threadIdx.x - 1];
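A self-contained version of the last snippet, with the guard for thread 0 that the slide omits (single-block indexing, matching the slide; BLOCKSIZE and the kernel name are assumed):

#define BLOCKSIZE 256

__global__ void leftNeighbour(const int* begin, int* out)
{
    __shared__ int scratch[BLOCKSIZE];

    scratch[threadIdx.x] = begin[threadIdx.x];   // stage values in shared memory
    __syncthreads();                             // make every write visible

    int left = 0;
    if (threadIdx.x > 0)                         // thread 0 has no left neighbour
        left = scratch[threadIdx.x - 1];
    out[threadIdx.x] = left;
}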

© NVIDIA Corporation 2009 GPU ARCHITECTURE CUDA

© NVIDIA Corporation 2009 10-Series Architecture 240 Scalar Processor (SP) cores execute kernel threads 30 Streaming Multiprocessors (SMs), each containing 8 scalar processors, 2 Special Function Units (SFUs) and 1 double precision unit Shared memory enables thread cooperation

© NVIDIA Corporation 2009 Execution Model (software → hardware) Thread → Scalar Processor: threads are executed by scalar processors Thread Block → Multiprocessor: thread blocks are executed on multiprocessors Thread blocks do not migrate Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file) Grid → Device: a kernel is launched as a grid of thread blocks Only one kernel can execute on a device at one time

© NVIDIA Corporation 2009 Warps and Half Warps A thread block consists of 32-thread warps A warp is executed physically in parallel (SIMD) on a multiprocessor A half-warp of 16 threads can coordinate global memory accesses into a single transaction (figure: a thread block divided into 32-thread warps and 16-thread half warps; device memory = DRAM holding global and local memory)

© NVIDIA Corporation 2009 Memory Architecture (figure) Host: CPU, chipset, DRAM Device: DRAM holding local, global, constant and texture memory; GPU with multiple multiprocessors, each with registers and shared memory, plus constant and texture caches

© NVIDIA Corporation 2009 OPTIMIZATION GUIDELINES CUDA

© NVIDIA Corporation 2009 Optimize Algorithms for the GPU Maximize independent parallelism Maximize arithmetic intensity (math/bandwidth) Sometimes it’s better to recompute than to cache GPU spends its transistors on ALUs, not memory Do more computation on the GPU to avoid costly data transfers Even low parallelism computations can sometimes be faster than transferring back and forth to host

© NVIDIA Corporation 2009 Optimize Memory Access Coalesced vs. Non-coalesced = order of magnitude Global/Local device memory Optimize for spatial locality in cached texture memory In shared memory, avoid high-degree bank conflicts
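For illustration, whether consecutive threads touch consecutive addresses is what decides coalescing; a simplified sketch (exact rules depend on compute capability, and the kernel names are assumptions):

// Coalesced: thread k of a (half-)warp reads element k -> few memory transactions
__global__ void copyCoalesced(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Non-coalesced: threads read addresses `stride` elements apart -> many transactions
__global__ void copyStrided(const float* in, float* out, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}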

© NVIDIA Corporation 2009 Take Advantage of Shared Memory Hundreds of times faster than global memory Threads can cooperate via shared memory Use one / a few threads to load / compute data shared by all threads Use it to avoid non-coalesced access Stage loads and stores in shared memory to re-order non-coalesceable addressing

© NVIDIA Corporation 2009 Use Parallelism Efficiently Partition computation to keep the GPU multiprocessors equally busy Many threads, many thread blocks Keep resource usage low enough to support multiple active thread blocks per multiprocessor Registers, shared memory
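One common way to keep every multiprocessor busy regardless of problem size is a grid-stride loop; this is a general CUDA idiom rather than something from the slide, sketched here with illustrative names:

__global__ void scale(float* data, float alpha, int N)
{
    int stride = gridDim.x * blockDim.x;                 // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread handles several
         i < N; i += stride)                             // elements, one grid apart
        data[i] *= alpha;
}

// Launch enough blocks to cover all multiprocessors, independent of N
// (d_data: device pointer allocated elsewhere)
scale<<< 128, 256 >>>(d_data, 2.0f, N);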

© NVIDIA Corporation 2009 NEXT GENERATION ARCHITECTURE Fermi

© NVIDIA Corporation 2009 Many-Core High Performance Computing Each core has: Floating point & Integer unit Logic unit Move, compare unit Branch unit Cores managed by thread manager Spawn and manage over 30,000 threads Zero-overhead thread switching 10-series GPU has 240 cores 1.4 billion transistors 1 Teraflop of processing power

© NVIDIA Corporation 2009 Introducing the Fermi Architecture 3 billion transistors 512 cores DP performance 50% of SP ECC L1 and L2 Caches GDDR5 Memory Up to 1 Terabyte of GPU Memory Concurrent Kernels, C++

© NVIDIA Corporation 2009 Fermi SM Architecture 32 CUDA cores per SM (512 total) Double precision 50% of single precision 8x over GT200 Dual Thread Scheduler 64 KB of RAM for shared memory and L1 cache (configurable)
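With Fermi-era toolkits the 48 KB / 16 KB split can be requested per kernel through the runtime API; a sketch, where myKernel is a placeholder name:

// Prefer 48 KB shared memory / 16 KB L1 cache for this kernel
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

// Or prefer 16 KB shared memory / 48 KB L1 cache
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);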

© NVIDIA Corporation 2009 CUDA Core Architecture New IEEE floating-point standard, surpassing even the most advanced CPUs Fused multiply-add (FMA) instruction for both single and double precision Newly designed integer ALU optimized for 64-bit and extended precision operations

© NVIDIA Corporation 2009 Cached Memory Hierarchy First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory L1 Cache per SM (per 32 cores) Improves bandwidth and reduces latency Unified L2 Cache (768 KB) Fast, coherent data sharing across all cores in the GPU Parallel DataCache™ Memory Hierarchy

© NVIDIA Corporation 2009 Larger, Faster Memory Interface GDDR5 memory interface 2x speed of GDDR3 Up to 1 Terabyte of memory attached to GPU Operate on large data sets

© NVIDIA Corporation 2009 ECC ECC protection for DRAM ECC supported for GDDR5 memory All major internal memories Register file, shared memory, L1 cache, L2 cache Detect 2-bit errors, correct 1-bit errors (per word)

© NVIDIA Corporation 2009 GigaThread™ Hardware Thread Scheduler Hierarchically manages thousands of simultaneously active threads 10x faster application context switching Concurrent kernel execution
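Concurrent kernel execution is exposed through CUDA streams; a hedged sketch (kernel names and launch configurations are placeholders):

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Independent kernels launched into different streams may run concurrently on Fermi
kernelA<<< gridA, blockA, 0, s1 >>>(d_a);
kernelB<<< gridB, blockB, 0, s2 >>>(d_b);

cudaStreamDestroy(s1);
cudaStreamDestroy(s2);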

© NVIDIA Corporation 2009 GigaThread™ Hardware Thread Scheduler Concurrent Kernel Execution + Faster Context Switch

© NVIDIA Corporation 2009 GigaThread Streaming Data Transfer Engine Dual DMA engines Simultaneous CPU → GPU and GPU → CPU data transfer Fully overlapped with CPU and GPU processing time (figure: activity snapshot)
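Overlapped transfer is driven from the host with pinned memory, asynchronous copies and streams; a sketch with illustrative names (with two such streams, the dual DMA engines can overlap an upload, a download and kernel execution):

size_t bytes = N * sizeof(float);
float *h_in, *d_in;
cudaMallocHost((void**)&h_in, bytes);   // pinned host memory, needed for async copies
cudaMalloc((void**)&d_in, bytes);

cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);  // DMA upload
process<<< grid, block, 0, stream >>>(d_in);                         // kernel in same stream
cudaMemcpyAsync(h_in, d_in, bytes, cudaMemcpyDeviceToHost, stream);  // DMA download
cudaStreamSynchronize(stream);          // wait for the whole pipeline to finish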

© NVIDIA Corporation 2009 Enhanced Software Support Full C++ Support Virtual functions Try/Catch hardware support System call support Support for pipes, semaphores, printf, etc. Unified 64-bit memory addressing
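For example, printf can be called directly from device code on Fermi-class (compute capability 2.0) GPUs; a minimal sketch:

#include <stdio.h>

__global__ void hello()
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    hello<<< 2, 4 >>>();
    cudaDeviceSynchronize();   // also flushes device-side printf output
    return 0;
}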

© NVIDIA Corporation 2009 Tesla GPU Computing Products
SuperMicro 1U GPU SuperServer: 2 Tesla GPUs, 1.87 Teraflops single precision, 156 Gigaflops double precision, 8 GB memory (4 GB / GPU)
Tesla S1070 1U System: 4 Tesla GPUs, 4.14 Teraflops single precision, 346 Gigaflops double precision, 16 GB memory (4 GB / GPU)
Tesla C1060 Computing Board: 1 Tesla GPU, 933 Gigaflops single precision, 78 Gigaflops double precision, 4 GB memory
Tesla Personal Supercomputer: 4 Tesla GPUs, 3.7 Teraflops single precision, 312 Gigaflops double precision, 16 GB memory (4 GB / GPU)

© NVIDIA Corporation 2009 Tesla GPU Computing Products: Fermi
Tesla S2050 1U System: 4 Tesla GPUs, 2.1 – 2.5 Teraflops double precision, 12 GB memory (3 GB / GPU)
Tesla S2070 1U System: 4 Tesla GPUs, 2.1 – 2.5 Teraflops double precision, 24 GB memory (6 GB / GPU)
Tesla C2050 Computing Board: 1 Tesla GPU, 520 – 630 Gigaflops double precision, 3 GB memory
Tesla C2070 Computing Board: 1 Tesla GPU, 520 – 630 Gigaflops double precision, 6 GB memory

© NVIDIA Corporation 2009 Tesla GPU Computing Questions?

© NVIDIA Corporation 2009 RESOURCES Getting Started

© NVIDIA Corporation 2009 Getting Started CUDA Zone: introductory tutorials/webinars, forums, documentation (Programming Guide, Best Practices Guide), examples (CUDA SDK)

© NVIDIA Corporation 2009 Libraries NVIDIA: CUBLAS (dense linear algebra, subset of the full BLAS suite), CUFFT (1D/2D/3D real and complex FFTs) Third party: NAG (numeric libraries, e.g. RNGs), CULAPACK/MAGMA Open Source: Thrust (STL/Boost-style template library), CUDPP (data-parallel primitives, e.g. scan, sort and reduction), CUSP (sparse linear algebra and graph computation) Many more...