GPGPU Lessons Learned Mark Harris

General-Purpose Computation on GPUs
Highly parallel applications:
- Physically-based simulation
- Image processing
- Scientific computing
- Computer vision
- Computational finance
- Medical imaging
- Bioinformatics

[Chart: NVIDIA GPU pixel shader GFLOPS over time, comparing observed GPU GFLOPS against CPU theoretical peak GFLOPS]

Physics on GPUs
GPU: very high data parallelism
- G70: 24 pixel pipelines, 48 shading processors
- 1000s of simultaneous threads
- Very high memory bandwidth
- SLI enables 1-4 GPUs per system
Physics: very high data parallelism
- 1000s of colliding objects
- 1000s of collisions to resolve every frame
Physics is an ideal match for GPUs.

Physically-based Simulation on GPUs (Jens Krüger, TU Munich; Doug L. James, CMU)
- Particle systems
- Fluid simulation
- Cloth simulation
- Soft-body simulation

What about Game Physics?
- Fluids, particles, and cloth map naturally to GPUs: highly parallel, independent data
- Game physics = rigid body physics: collision detection and response, solving constraints
- Rigid body physics is more complicated: arbitrary shapes, arbitrary interactions and dependencies
- Parallelism is harder to extract

Havok FX
A framework for Game Physics on the GPU; a joint NVIDIA / Havok R&D project launched in 2005.
For details, come to the talks:
- Havok FX™: GPU-accelerated Physics for PC Games, 4:00-5:00 PM Thursday [Need location]
- Physics Simulation on NVIDIA GPUs, 5:30-6:30 PM Thursday [Need location]

Havok FX Demo
NVIDIA DinoBones demo

Lessons learned from Havok FX
- Arithmetic intensity is key
- CPUs and GPUs can get along
- Readback ain't wrong
- Vertex scatter vs. pixel gather
- Printf debugging for pixel shaders

Arithmetic Intensity is Key
Arithmetic intensity = arithmetic / bandwidth, and GPUs like it high:
- Very little on-chip cache, so going to memory and back costs a lot
- Favor long programs with much more math than texture fetches
Game physics has very high arithmetic intensity:
- >1500 pixel shader cycles per collision
- ~100 texture fetches per collision
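As a back-of-the-envelope check, the slide's figures can be plugged straight into the ratio. The helper below is purely illustrative (not from the talk); the numbers are the ones quoted above.

```cpp
#include <cassert>

// Arithmetic intensity: ratio of math operations to memory operations.
// Higher is better on GPUs, where memory round-trips dominate cost.
double arithmeticIntensity(double mathOps, double memOps)
{
    return mathOps / memOps;
}
```

With the slide's numbers, 1500 shader cycles against ~100 texture fetches gives an intensity of about 15 math operations per memory operation, which is why the collision shader runs well on the GPU.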

[Chart: Havok GPU threading experiment, plotting pixel shader performance in ms against the number of 32-pixel rows shaded]

Leverage Processor Strengths
- GPUs are good at data-parallel computation
- CPUs are good at sequential computation
- Most real problems have a bit of both
- Luckily, most real computers have both processors, especially game platforms
Rigid body collision processing is a great example.

Rigid Body Dynamics Overview
Every simulation clock tick has three phases:
1. Integrate positions and velocities
2. Detect collisions
3. Resolve collisions
Integration is embarrassingly parallel: no dependencies between objects, so use the GPU.
Collision detection is basically scene traversal: the CPU is good at this, so use it.
Resolving collisions is the tricky one: is it parallel enough for the GPU?
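The integration phase can be sketched as a plain loop over bodies: every iteration is independent, which is exactly what makes it a good GPU candidate. This is a CPU sketch with hypothetical types, not Havok's API.

```cpp
#include <cassert>
#include <vector>

// A minimal rigid body state for illustration.
struct Body {
    float pos[3];
    float vel[3];
    float force[3];
    float invMass;
};

// Semi-implicit Euler integration. Note that no body reads another
// body's state: each loop iteration is fully independent, so this maps
// directly to one GPU thread (or fragment) per body.
void integrate(std::vector<Body>& bodies, float dt)
{
    for (Body& b : bodies) {
        for (int i = 0; i < 3; ++i) {
            b.vel[i] += b.force[i] * b.invMass * dt; // v += a * dt
            b.pos[i] += b.vel[i] * dt;               // x += v * dt
        }
    }
}
```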

Is physics a data parallel task?
[Animated diagram: bodies 1-8 connected by contacts feed a "Solve Collisions" stage that takes contacts and velocities in and produces new velocities. The contact links are then split into independent batches (Batch 1, Batch 2, Batch 3); within each batch, links 1 through N share no bodies, so each batch can be solved in parallel.]

Readback is Not Evil
A hybrid CPU-GPU solution implies communication: readback and download across the PCI-e bus. It's not that bad if you use it wisely:
- Minimize transfer size and frequency
- Use PBOs to optimize transfers
Physics data is much smaller than the computation: read back and download a few bytes per object each frame, at most a few MB per frame (< 200 MB/sec), while PCI-e provides 4 GB/sec.
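The bandwidth argument above is simple arithmetic, sketched below with hypothetical numbers (32 bytes per object, 10,000 objects, 60 frames per second); the 200 MB/sec bound is the figure from the slide.

```cpp
#include <cassert>

// Back-of-the-envelope readback cost: bytes per object, object count,
// and frame rate give the sustained PCI-e bandwidth the readback needs.
double readbackMBPerSec(double bytesPerObject, double numObjects, double fps)
{
    return bytesPerObject * numObjects * fps / (1024.0 * 1024.0);
}
```

Even 10,000 objects at 32 bytes each, 60 times a second, is under 20 MB/sec: a tiny fraction of the 4 GB/sec PCI-e provides, which is the point of the slide.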

Vertex Scatter vs. Pixel Gather
Problem: sparse array update. We have computed a set of addresses that need to be updated, and want to compute updates for only those addresses in an array.

Method 1: Pixel Gather
- (Pre)compute an array of compute flags
- Process all pixels in the destination array
- Branch out of computation where the flag is zero

Method 2: Vertex Scatter
- (Pre)compute the addresses of the elements to update (e.g. (0,4), (2,3), (4,5), (0,1), (0,2), (4,2))
- Draw 1-pixel points at those addresses
- Run the update shader on the points

Vertex Scatter vs. Pixel Gather
It is not obvious that vertex scatter can be a win: drawing single-pixel points is inefficient, because shader pipes process 2x2 "quads" of pixels. But you can use a simple heuristic: use vertex scatter if the number of updates is significantly smaller than the array size; otherwise use pixel gather. And always experiment in your own application!
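The heuristic reduces to one comparison; the sketch below makes the decision explicit. The 10% threshold is a tunable assumption, not a number from the talk, which is exactly why the slide says to experiment.

```cpp
#include <cassert>
#include <cstddef>

enum class UpdateMethod { VertexScatter, PixelGather };

// Choose vertex scatter only when the update set is much smaller than
// the destination array; otherwise the quad-granularity cost of
// single-pixel points outweighs the savings of skipping pixels.
UpdateMethod chooseMethod(std::size_t numUpdates, std::size_t arraySize,
                          double threshold = 0.1)
{
    return (numUpdates < threshold * arraySize)
               ? UpdateMethod::VertexScatter
               : UpdateMethod::PixelGather;
}
```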

Printf for Pixels
Debugging pixel shaders is hard, especially GPGPU shaders, whose output is not an image. It is even harder if you have used all of your outputs; Havok FX easily uses up 4 float4 MRT outputs. A simple hack to dump data from your shaders:
- A macro to dump arbitrary shader variables
- A wrapper function to run the program once for each "printf"
- Then run it once more with the correct outputs

Printf for Pixels
First, define a handy macro to put in your shaders:

    #ifdef DEBUG_SHADER
    #define CG_PRINTF(index, variable) \
        if (debugSelector == index)    \
            DEBUG_OUT = variable;
    #else
    #define CG_PRINTF(index, variable)
    #endif

Printf for Pixels
Use the macro to instrument your shader:

    float4 foo( float2 coords,
                // other params
    #ifdef DEBUG_SHADER
                uniform float debugSelector
    #endif
              ) : COLOR0
    {
    #ifdef DEBUG_SHADER
        float4 DEBUG_OUT;
    #endif
        float4 temp1 = complexCalc1();
        float4 temp2 = complexCalc2(temp1);
        float4 ret   = complexCalc3(temp2);

        CG_PRINTF(1, temp1);
        CG_PRINTF(2, temp2);

    #ifdef DEBUG_SHADER
        CG_PRINTF(0, ret);
        return DEBUG_OUT;
    #else
        return ret;
    #endif
    }

Printf for Pixels: C++ code

    void debugProgram(CGprogram prog, int x, int y, int w, int h,
                      float** debugData)
    {
        CGparameter psel = cgGetNamedParameter(prog, "debugSelector");

        for (int selector = 1; selector < 100; ++selector) {
            if (debugData[selector] == 0)
                break;
            // run the program with this debug selector
            cgSetParameter1f(psel, (float)selector);
            runProgram(prog, x, y, w, h);
            // read back the results
            glReadPixels(x, y, w, h, GL_RGBA, GL_FLOAT, debugData[selector]);
        }

        cgSetParameter1f(psel, 0.0f);
        runProgram(prog, x, y, w, h);   // run the program as normal
    }

Questions?
Havok FX presentations at GDC 2006:
- Havok FX™: GPU-accelerated Physics for PC Games, 4:00-5:00 PM Thursday [Need location]
- Physics Simulation on NVIDIA GPUs, 5:30-6:30 PM Thursday [Need location]