Ferienakademie 2007 Alexander Heinecke (TUM) 1 A short introduction to nVidia‘s CUDA Alexander Heinecke Technical University of Munich

Ferienakademie 2007 Alexander Heinecke (TUM) 2 Overview
1. Differences CPU – GPU (slide 3)
   1. General CPU/GPU properties
   2. Compare specifications
2. CUDA Programming Model (slide 10)
   1. Application stack
   2. Thread implementation
   3. Memory Model
3. CUDA API (slide 13)
   1. Extension of the C/C++ Programming Language
   2. Example structure of a CUDA application
4. Examples (slide 15)
   1. Matrix Addition
   2. Matrix Multiplication
   3. Jacobi & Gauß-Seidel
5. Benchmark Results (slide 21)

Ferienakademie 2007 Alexander Heinecke (TUM) 3 Differences between CPU and GPU
GPU: nearly all transistors are ALUs
CPU: most of the transistors are cache
(taken from [NV1])

Ferienakademie 2007 Alexander Heinecke (TUM) 4 AMD Opteron Dieshot

Ferienakademie 2007 Alexander Heinecke (TUM) 5 Intel Itanium2 Dual-Core Dieshot

Ferienakademie 2007 Alexander Heinecke (TUM) 6 Intel Core Architecture Pipeline / Simple Example (taken from [IN1])

Pipeline cycle | Step 1 (IFETCH) | Step 2 (IDEC) | Step 3 (OFETCH) | Step 4 (EXEC) | Step 5 (RET)
1              | IFETCH #1       |               |                 |               |
2              | IFETCH #2       | IDEC #1       |                 |               |
3              | IFETCH #3       | IDEC #2       | OFETCH #1       |               |
4              | IFETCH #4       | IDEC #3       | OFETCH #2       | EXEC #1       |
5              | IFETCH #5       | IDEC #4       | OFETCH #3       | EXEC #2       | RET #1
6              | IFETCH #6       | IDEC #5       | OFETCH #4       | EXEC #3       | RET #2
7              | IFETCH #7       | IDEC #6       | OFETCH #5       | EXEC #4       | RET #3

Ferienakademie 2007 Alexander Heinecke (TUM) 7 nVidia G80 Pipeline

Ferienakademie 2007 Alexander Heinecke (TUM) 8 Properties of CPU and GPU

                        Intel Xeon X5355            nVidia G80 (8800 GTX)
Clock speed             2.66 GHz                    575 MHz
#Cores / SPEs           4                           128
Floats in register
Max. GFlop/s (float)    84 (prac.) / 85 (theo.)     460 (prac.) / 500 (theo.)
Max. instructions       RAM limited                 2 million G80 ASM instructions
Typ. instr. duration    1-2 cycles (SSE)            min. 4 cycles
Price (€)               800                         500

Ferienakademie 2007 Alexander Heinecke (TUM) 9 History: Power of GPUs in the last four years (taken from [NV1])

Ferienakademie 2007 Alexander Heinecke (TUM) 10 Application stack of CUDA (taken from [NV1])

Ferienakademie 2007 Alexander Heinecke (TUM) 11 Thread organization in CUDA (taken from [NV1])

Ferienakademie 2007 Alexander Heinecke (TUM) 12 Memory organization in CUDA (taken from [NV1])

Ferienakademie 2007 Alexander Heinecke (TUM) 13 Extensions to C (functions and variables)
CUDA code is saved in special files (*.cu)
These are precompiled by nvcc (the NVIDIA compiler)
There are some function type qualifiers, which decide where the code is executed:
– __host__ (CPU only, called by the CPU)
– __global__ (GPU only, called by the CPU)
– __device__ (GPU only, called by the GPU)
For variables: __device__, __constant__, __shared__
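
The slide above lists the qualifiers only by name; the following minimal sketch (not part of the original slides) shows where each of them may appear. The function and variable names (helper_square, scale_kernel, run_on_cpu, c_coeff, g_counter) are made up for illustration.

// Illustration of the function and variable type qualifiers.
__device__ float helper_square(float x)      // GPU only, callable from GPU code
{
    return x * x;
}

__global__ void scale_kernel(float* data, float factor, int n)   // GPU code, launched by the CPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = helper_square(data[i]) * factor;
}

__host__ void run_on_cpu()                   // CPU only (the default for unqualified functions)
{
    // ... allocate, copy, launch scale_kernel<<<grid, threads>>>(...), copy back
}

// Variable qualifiers:
__constant__ float c_coeff[16];              // constant memory, read-only for kernels
__device__   int   g_counter;                // global device memory
// __shared__ variables are declared inside a kernel (one copy per thread block)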

Ferienakademie 2007 Alexander Heinecke (TUM) 14 Example structure of a CUDA application
At least two functions, to isolate the CUDA code from your application
First function (host side):
– Init CUDA
– Copy data to the device
– Call the kernel with execution settings
– Copy data back to the host and shut down (automatic)
Second function (kernel):
– Contains the problem for ONE thread
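
As a bridge to the matrix-addition code on the next slides, here is a minimal sketch of that two-function structure for a simple vector addition. It is illustrative only: the names (vec_add_kernel, run_vec_add) are invented, and plain CUDA runtime calls are used instead of the CUT_DEVICE_INIT/CUDA_SAFE_CALL macros of the old SDK shown later.

// Second function (kernel): the problem for ONE thread, c = a + b element-wise.
__global__ void vec_add_kernel(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

// First function (host side): init, copy, launch, copy back.
void run_vec_add(const float* h_a, const float* h_b, float* h_c, int n)
{
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;

    // 1) allocate device memory
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);

    // 2) copy input data to the device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 3) call the kernel with execution settings (grid and block size)
    dim3 threads(256);
    dim3 grid((n + threads.x - 1) / threads.x);
    vec_add_kernel<<<grid, threads>>>(d_a, d_b, d_c, n);

    // 4) copy the result back to the host and free device memory
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}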

Ferienakademie 2007 Alexander Heinecke (TUM) 15 Tested Algorithms (2D Arrays)
All tested algorithms operate on 2D arrays:
Matrix Addition
Matrix Multiplication
Jacobi & Gauß-Seidel (iterative solvers)

Ferienakademie 2007 Alexander Heinecke (TUM) 16 Example Matrix Addition (Init function)
CUT_DEVICE_INIT();
// allocate device memory
float* d_A;
CUDA_SAFE_CALL(cudaMalloc((void**) &d_A, mem_size));
…
// copy host memory to device
CUDA_SAFE_CALL(cudaMemcpy(d_A, ma_a, mem_size, cudaMemcpyHostToDevice));
…
// texture binding
cudaBindTexture(0, texRef_MaA, d_A, mem_size);
…
dim3 threads(BLOCK_SIZE_GPU, BLOCK_SIZE_GPU);
dim3 grid(n_dim / threads.x, n_dim / threads.y);
// execute the kernel
cuMatrixAdd_kernel<<<grid, threads>>>(d_C, n_dim);
// texture unbinding
cudaUnbindTexture(texRef_MaA);
…
// copy result from device to host
CUDA_SAFE_CALL(cudaMemcpy(ma_c, d_C, mem_size, cudaMemcpyDeviceToHost));
…
CUDA_SAFE_CALL(cudaFree(d_A));

Ferienakademie 2007 Alexander Heinecke (TUM) 17 Example Matrix Addition (kernel)
// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
// start offset of this block's tile in the linearized matrix
int start = (n_dim * by * BLOCK_SIZE_GPU) + bx * BLOCK_SIZE_GPU;
// each thread adds exactly one element, reading both inputs through textures
C[start + (n_dim * ty) + tx] = tex1Dfetch(texRef_MaA, start + (n_dim * ty) + tx)
                             + tex1Dfetch(texRef_MaB, start + (n_dim * ty) + tx);

Ferienakademie 2007 Alexander Heinecke (TUM) 18 Example Matrix Multiplication (kernel)
int tx2 = tx + BLOCK_SIZE_GPU;
int ty2 = n_dim * ty;
float Csub1 = 0.0;
float Csub2 = 0.0;
int b = bBegin;
for (int a = aBegin; a <= aEnd; a += aStep) {
    // load one tile of A and a double-width tile of B into shared memory
    __shared__ float As[BLOCK_SIZE_GPU][BLOCK_SIZE_GPU];
    AS(ty, tx) = A[a + ty2 + tx];
    __shared__ float B1s[BLOCK_SIZE_GPU][BLOCK_SIZE_GPU*2];
    B1S(ty, tx) = B[b + ty2 + tx];
    B1S(ty, tx2) = B[b + ty2 + tx2];
    __syncthreads();
    Csub1 += AS(ty, 0) * B1S(0, tx);
    // more calcs
    b += bStep;
}
__syncthreads();
// Write result back
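
Since the slide's kernel relies on macros (AS, B1S) and setup code that are not shown, here is a self-contained sketch of the same shared-memory tiling idea in its plain textbook form. It is not the kernel that was benchmarked; TILE and the assumption that n_dim is a multiple of TILE are illustrative choices.

// Generic tiled matrix multiplication sketch, C = A * B, for square
// n_dim x n_dim matrices with n_dim a multiple of TILE.
#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int n_dim)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n_dim / TILE; ++t) {
        // each thread loads one element of the A tile and one of the B tile
        As[threadIdx.y][threadIdx.x] = A[row * n_dim + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n_dim + col];
        __syncthreads();                 // wait until the tile is fully loaded

        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // finish reading before the next load
    }
    C[row * n_dim + col] = sum;
}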

Ferienakademie 2007 Alexander Heinecke (TUM) 19 Example Jacobi (kernel), no internal loops
// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x + 1;
int ty = threadIdx.y + 1;
int ustart = ((by * BLOCK_SIZE_GPU) * n_dim) + (bx * BLOCK_SIZE_GPU);
// 5-point stencil: right-hand side plus the four neighbours, all read via textures
float res = tex1Dfetch(texRef_MaF, ustart + (ty * n_dim) + tx) * qh;
res += tex1Dfetch(texRef_MaU, ustart + (ty * n_dim) + tx - 1)
     + tex1Dfetch(texRef_MaU, ustart + (ty * n_dim) + tx + 1);
res += tex1Dfetch(texRef_MaU, ustart + ((ty+1) * n_dim) + tx)
     + tex1Dfetch(texRef_MaU, ustart + ((ty-1) * n_dim) + tx);
res = 0.25f * res;
ma_u[ustart + (ty * n_dim) + tx] = res;

Ferienakademie 2007 Alexander Heinecke (TUM) 20 Example Jacobi (kernel), internal loops
int tx = threadIdx.x + 1;
int ty = threadIdx.y + 1;
// *some more inits*
// load to calc u_ij: block tile plus a one-element halo in shared memory
__shared__ float Us[BLOCK_SIZE_GPU+2][BLOCK_SIZE_GPU+2];
US(ty, tx) = tex1Dfetch(texRef_MaU, ustart + (ty * n_dim) + tx);
// *init edge u*
…
for (unsigned int i = 0; i < n_intern_loops; i++) {
    res = funk;
    res += US(ty, tx - 1) + US(ty, tx + 1);
    res += US(ty - 1, tx) + US(ty + 1, tx);
    res = 0.25f * res;
    __syncthreads();   // not used in parallel Jacobi
    US(ty, tx) = res;
}
ma_u[ustart + (ty * n_dim) + tx] = res;

Ferienakademie 2007 Alexander Heinecke (TUM) 21 Performance Results (1)

Ferienakademie 2007 Alexander Heinecke (TUM) 22 Performance Results (2)

Ferienakademie 2007 Alexander Heinecke (TUM) 23 Performance Results (3)

Ferienakademie 2007 Alexander Heinecke (TUM) 24 Performance Results (4)

Ferienakademie 2007 Alexander Heinecke (TUM) 25 Conclusion (Points to take care of)
Things to take care of / you should:
– use the minimum number of memory accesses
– use unrolling instead of for loops
– use blocking algorithms
– implement with CUDA only algorithms that are not extremely memory bound (NOT matrix addition)
– try not to use if statements or other program-control statements (slow)
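
To make the unrolling and branching advice concrete, a small sketch (not from the slides): a fixed-length accumulation with an explicit unroll hint, and an arithmetic alternative to a data-dependent if. The names accumulate_row and select_value are hypothetical.

// Unrolling: nvcc is asked to fully unroll the fixed-length loop.
__device__ float accumulate_row(const float* row)
{
    float sum = 0.0f;
    #pragma unroll
    for (int k = 0; k < 16; ++k)
        sum += row[k];
    return sum;
}

// Branch avoidance: instead of a data-dependent branch such as
//     if (use_b) x = b; else x = a;
// an arithmetic select keeps all threads of a warp on the same path:
__device__ float select_value(float a, float b, int use_b)   // use_b is 0 or 1
{
    return a + use_b * (b - a);
}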

Ferienakademie 2007 Alexander Heinecke (TUM) 26 Appendix - References
[NV1] NVIDIA CUDA Compute Unified Device Architecture, Programming Guide; NVIDIA Corporation, Version 1.0
[IN1/2/3] Intel Architecture Handbook, Version November 2006
[NR] Numerical Recipes (online generated PDF)