
GPGPU Programming Dominik Göddeke

2 Overview
Choices in GPGPU programming
Illustrated CPU vs. GPU step-by-step example
GPU kernels in detail

3 Choices in GPU Programming
Application: e.g. in C/C++, Java, Fortran, Perl
Shader programs: e.g. in HLSL, GLSL, Cg
Graphics hardware: e.g. Radeon (ATI), GeForce (NV)
Operating system: e.g. Windows, Unix, Linux, MacOS
Graphics API: e.g. OpenGL, DirectX
Window manager: e.g. GLUT, Qt, Win32, Motif
Metaprogramming language: e.g. BrookGPU, Sh
OR a self-written libGPU that hides the graphics details

4 Bottom lines
This is not as difficult as it seems:
– Similar choices must be made in all software projects
– Some options are mutually exclusive
– Some can be used without in-depth knowledge
– No direct access to the hardware; the driver does all the tedious thread management anyway
Advantages and disadvantages:
– Steeper learning curve vs. higher flexibility
– Focus on the algorithm, not on (unnecessary) graphics
– Portable code vs. platform- and hardware-specific code

5 Shading languages
Kernels are programmed in a shading language:
– Cg (NVIDIA)
– HLSL (Microsoft, Direct3D only)
– GLSL (OpenGL)
Feature sets:
– Array access
– Conditionals, loops
– Math
– No bitwise ops (yet)
Typically very easy to learn:
– All three languages are very similar

6 Libraries and Abstractions
Some coding is required:
– No library is available that you can just link against
– It is tremendously hard to massively parallelize existing complex code automatically
Good news:
– Much functionality can be added to applications in a minimally invasive way, with no rewrite from scratch
First libraries are under development:
– Accelerator (Microsoft): linear algebra, BLAS-like
– Glift (Lefohn et al.): abstract data structures, e.g. trees

7 Overview
Choices in GPGPU programming
Illustrated CPU vs. GPU step-by-step example
GPU kernels in detail

8 Native Data Layout
CPU: 1D array. GPU: 2D array.
Indices are floats, addressing array element centers (GL) or top-left corners (D3D). This will be important later.
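The 1D-vs-2D layout can be sketched on the CPU as a pair of index conversions. This is a minimal sketch using the GL convention of addressing texel centers (the +0.5 offset); the array width W is a hypothetical parameter, not taken from the slides.

```python
# Map a CPU-style flat 1D index to GL-style 2D float coordinates.
# GL addresses array element centers, hence the +0.5 offset.
def index_1d_to_2d(i, W):
    """Flat index i -> (x, y) float coords at the texel center, width W."""
    return (i % W + 0.5, i // W + 0.5)

def index_2d_to_1d(x, y, W):
    """Inverse mapping: recover the flat index from center coordinates."""
    return int(y - 0.5) * W + int(x - 0.5)
```

For example, with W = 4, flat index 13 lands at coordinates (1.5, 3.5), and converting back recovers 13.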

9 Example Problem: saxpy (from BLAS)
– Given two vectors x and y of size N and a scalar a
– Compute the scaled vector-vector addition y = y + a*x
CPU implementation:
– Store each vector in one array, loop over all elements:
    for (i=0; i<N; i++)
        y[i] = y[i] + a*x[i];
Identify the computation inside the loop as the kernel:
– No logic in this basic kernel, pure computation
– Logic and computation are fully separated
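The separation of logic and computation described above can be sketched like this: the kernel is pure per-element computation, and the surrounding foreach loop is exactly the part that drawing will later replace on the GPU. Function names are illustrative, not from the slides.

```python
# The kernel: pure computation on one element, no control flow.
def saxpy_kernel(yi, xi, a):
    return yi + a * xi

# The logic: an explicit foreach loop over all elements.
# On the GPU this loop is replaced by drawing over the output region.
def saxpy_cpu(y, x, a):
    return [saxpy_kernel(y[i], x[i], a) for i in range(len(y))]
```

For example, saxpy_cpu([1.0, 2.0], [3.0, 4.0], 2.0) yields [7.0, 10.0].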

10 Understanding GPU Limitations
No simultaneous reads from and writes to the same memory:
– No read-modify-write buffer means no logic is required to handle read-before-write hazards
– Not a missing feature, but an essential hardware design decision for good performance and throughput
– saxpy: introduce an additional array: y_new = y_old + a*x
Coherent memory access:
– For a given output element, read from the same index in the two input arrays
– Trivially achieved in this basic example

11 Performing Computations
Load a kernel program:
– Detailed examples later on
Specify the output and input arrays:
– Pseudocode: setInputArrays(y_old, x); setOutputArray(y_new);
Trigger the computation:
– The GPU is, after all, a graphics processor
– So just draw something appropriate

12 Computing = Drawing
Specify input and output regions:
– Set up a 1:1 mapping from the graphics viewport to the output array elements, set up input regions
– saxpy: input and output regions coincide
Generate data streams:
– Literally draw some geometry that covers all elements in the output array
– In this example, a 4x4 filled quad from four vertices
– The GPU interpolates output array indices from the vertices across the output region
– And generates a data stream flowing through the parallel PEs

13 Example Kernel: y + 0.5*x

14 Performing Computations
High-level view:
– The kernel is executed simultaneously on all elements in the output region
– The kernel knows its output index (and eventually additional input indices, more on that later)
– Drawing replaces CPU loops: foreach-execution
– The output array is write-only
Feedback loop (ping-pong technique):
– The output array can be used read-only as input for the next operation
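The ping-pong technique above can be emulated on the CPU with two buffers that swap roles each pass, so the kernel never reads and writes the same array. This is a hedged sketch; the kernel signature (array plus output index) mirrors the foreach-execution model described on the slide.

```python
# Ping-pong emulation: src is read-only, dst is write-only within a
# pass; the buffers swap roles between passes, as on the GPU.
def run_pingpong(kernel, src, steps):
    dst = [0.0] * len(src)
    for _ in range(steps):
        for i in range(len(src)):       # 'drawing' = foreach over outputs
            dst[i] = kernel(src, i)     # reads src only, writes dst only
        src, dst = dst, src             # swap roles for the next pass
    return src
```

For example, three passes of a halving kernel turn [8.0, 4.0] into [1.0, 0.5].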

15 Overview
Choices in GPGPU programming
Illustrated CPU vs. GPU step-by-step example
GPU kernels in detail

16 GPU Kernels: saxpy
Kernel on the CPU:
    y[i] = y[i] + a*x[i]
Written in Cg for the GPU:

    float saxpy(float2 coords : WPOS,          // array index
                uniform samplerRECT arrayX,    // input arrays
                uniform samplerRECT arrayY,
                uniform float a) : COLOR
    {
        float y = texRECT(arrayY, coords);     // gather
        float x = texRECT(arrayX, coords);
        return y + a*x;                        // compute
    }

17 GPU Kernels: Jacobi Iteration
Good news:
– A simple linear system solver can be built with exactly these basic techniques!
Example: finite differences
– x: vector of unknowns, sampled with a 5-point stencil (offsets)
– b: right-hand side
– Regular, equidistant grid
– `Solved´ with Jacobi iteration

18 GPU Kernels: Jacobi Iteration

    float jacobi(float2 center : WPOS,
                 uniform samplerRECT x,
                 uniform samplerRECT b,
                 uniform float one_over_h) : COLOR
    {
        // calculate offsets
        float2 left   = center - float2(1,0);
        float2 right  = center + float2(1,0);
        float2 bottom = center - float2(0,1);
        float2 top    = center + float2(0,1);
        // gather values
        float x_center = texRECT(x, center);
        float x_left   = texRECT(x, left);
        float x_right  = texRECT(x, right);
        float x_bottom = texRECT(x, bottom);
        float x_top    = texRECT(x, top);
        float rhs = texRECT(b, center);
        // matrix-vector product
        float Ax = one_over_h *
            (4.0 * x_center - x_left - x_right - x_bottom - x_top);
        // Jacobi step
        float inv_diag = one_over_h / 4.0;
        return x_center + inv_diag * (rhs - Ax);
    }
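A CPU sketch of this Jacobi sweep, for the special case of a unit grid (one_over_h = 1, where the slide's diagonal scaling is exact). Holding the boundary fixed at zero and storing the grid row-major are assumptions made here for brevity, not taken from the slide.

```python
# One Jacobi sweep over the interior of an n-by-n grid (row-major),
# mirroring the GPU kernel: 5-point stencil, zero Dirichlet boundary,
# unit grid spacing (one_over_h = 1).
def jacobi_step(x, b, n):
    x_new = x[:]                      # separate output array (no RMW)
    for j in range(1, n - 1):
        for i in range(1, n - 1):
            c = j * n + i
            # matrix-vector product at node c
            Ax = 4.0 * x[c] - x[c-1] - x[c+1] - x[c-n] - x[c+n]
            # Jacobi update with inverse diagonal 1/4
            x_new[c] = x[c] + 0.25 * (b[c] - Ax)
    return x_new
```

Repeating the sweep (ping-ponging between x and x_new) converges toward the solution of the discrete system.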

19 Maximum of an Array
An entirely different operation:
– Output is a single scalar, input is an array of length N
Naive approach:
– Use a 1x1 array as output, gather all N values in one step
– Doomed: will only use one PE, no parallelism at all
– Runs into all sorts of other troubles
Solution: parallel reduction
– Idea based on global communication in parallel computing
– Smart interplay of output and input regions
– The same technique applies to dot products, norms, etc.

20 Maximum of an Array
Each reduction pass maps the input array to an output of half the size in each dimension: the indices are adjusted so that every output element gathers the maximum of a 2x2 input region. Repeating on the intermediate results yields the global maximum.

    float maximum(float2 coords : WPOS,
                  uniform samplerRECT array) : COLOR
    {
        // adjust indices to gather a 2x2 region for each output
        float2 topleft = ((coords - 0.5) * 2.0) + 0.5;
        float val1 = texRECT(array, topleft);
        float val2 = texRECT(array, topleft + float2(1,0));
        float val3 = texRECT(array, topleft + float2(1,1));
        float val4 = texRECT(array, topleft + float2(0,1));
        // maximum of the 2x2 region
        return max(val1, max(val2, max(val3, val4)));
    }
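A CPU sketch of the reduction: each pass halves both dimensions, each output cell taking the maximum of a 2x2 input block, and the passes repeat until a single value remains. For simplicity this sketch assumes a square grid with power-of-two side length, as the slide's N/2 halving implies.

```python
# One reduction pass: N x N -> N/2 x N/2, max over each 2x2 block.
def reduce_max_pass(grid):
    n = len(grid) // 2
    return [[max(grid[2*j][2*i], grid[2*j][2*i+1],
                 grid[2*j+1][2*i], grid[2*j+1][2*i+1])
             for i in range(n)] for j in range(n)]

# Repeat (ping-ponging between arrays) until a 1x1 result remains.
def array_max(grid):
    while len(grid) > 1:
        grid = reduce_max_pass(grid)
    return grid[0][0]
```

A 4x4 input thus needs two passes; the same structure carries over to dot products and norms by replacing max with a sum.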

21 Multigrid Transfers
Restriction:
– Interpolate values from the fine into the coarse array
– Local weighted gather over the neighborhood, on both CPU and GPU
– The coarse array is the output region; coarse node i adjusts its index to read the fine neighbors 2i-1, 2i, 2i+1
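A 1D restriction sketch: coarse node i gathers fine nodes 2i-1, 2i and 2i+1. The 1/4, 1/2, 1/4 stencil weights are a common choice assumed here; the slide does not state them.

```python
# Restriction as a local weighted gather: the coarse array is the
# output region, each coarse node i reads fine nodes 2i-1, 2i, 2i+1.
def restrict_1d(fine):
    nc = (len(fine) - 1) // 2 + 1
    coarse = [0.0] * nc
    for i in range(nc):
        c = 2 * i
        left  = fine[c - 1] if c - 1 >= 0 else 0.0
        right = fine[c + 1] if c + 1 < len(fine) else 0.0
        coarse[i] = 0.25 * left + 0.5 * fine[c] + 0.25 * right
    return coarse
```

For example, restricting the five fine values [0, 1, 2, 3, 4] gives the three coarse values [0.25, 2.0, 2.75].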

22 Multigrid Transfers
Prolongation:
– Scatter values from the coarse to the fine array with a weighting stencil
– Typical CPU implementation: loop over the coarse array with stride-2 daxpys

23 Multigrid Transfers
Three cases:
1) The fine node lies in the center of an element (4 interpolants)
2) The fine node lies on the edge of an element (2 interpolants)
3) The fine node lies on top of a coarse node (copy)
Reformulate the scatter as a gather for the GPU:
– Set the fine array as the output region
– Sample with an index offset of 0.25: the sample snaps back to the center (case 3) or to the neighbors (cases 1 and 2)
– The same code handles all three cases, with no conditionals or red-black map
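The scatter-to-gather reformulation can be sketched in 1D, where only two of the three cases occur: every fine node computes its own value from the coarse array instead of being scattered to. Linear-interpolation weights of 1/2 for the edge case are assumed here; the 2D kernel on the slide additionally handles the element-center case.

```python
# Prolongation as a gather: the fine array is the output region.
# Even fine nodes copy a coarse node (case 3); odd fine nodes
# average their two coarse neighbors (the 2-interpolant case).
def prolong_1d(coarse, n_fine):
    fine = [0.0] * n_fine
    for i in range(n_fine):
        if i % 2 == 0:
            fine[i] = coarse[i // 2]                          # copy
        else:
            fine[i] = 0.5 * (coarse[i // 2] + coarse[i // 2 + 1])
    return fine
```

Note that there are no conditionals on node *values*, only on position parity, and on the GPU even that parity test is absorbed into the 0.25 index offset and the texture filtering.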

24 Conclusions
This is not as complicated as it might seem.
– Course notes online:
– GPGPU community site:
Developer information, lots of useful references, a paper archive, and help from real people in the GPGPU forums.