General-Purpose Computation on Graphics Hardware Adapted from: David Luebke (University of Virginia) and NVIDIA.



Your Presentations Everyone must attend
–10% loss of presentation points for each day missed
–Presentation grading criteria posted online
–Everyone must present
–Grades are given to the group, not to individuals

Outline
–Overview / Motivation
–GPU Architecture Fundamentals
–GPGPU Programming and Usage
–New NVIDIA Architectures (Fermi)
–More Information

Motivation: Computational Power GPUs are fast…
–3 GHz Pentium 4 theoretical: 6 GFLOPS, 5.96 GB/sec peak
–GeForce FX 5900 observed: 20 GFLOPS, 25.3 GB/sec peak
–GeForce 6800 Ultra observed: 53 GFLOPS, 35.2 GB/sec peak
–GeForce 8800 GTX: estimated at 520 GFLOPS, 86.4 GB/sec peak
That’s almost 100 times faster than a 3 GHz Pentium 4! GPUs are getting faster, faster
–CPUs: annual growth ≈ 1.5× → decade growth ≈ 60×
–GPUs: annual growth > 2.0× → decade growth > 1000×
Courtesy Kurt Akeley, Ian Buck & Tim Purcell, GPU Gems (see course notes)

Motivation: Computational Power
[Chart: GPU vs. CPU performance growth over time. Courtesy Naga Govindaraju]

Motivation: Computational Power
[Chart: programmable GFLOPS (multiplies per second), July 2001 – January 2004, for NVIDIA NV30/35/40, ATI R300/360/420, and the Pentium 4. Courtesy Ian Buck]

An Aside: Computational Power Why are GPUs getting faster so fast?
–Arithmetic intensity: the specialized nature of GPUs makes it easier to use additional transistors for computation rather than cache
–Economics: the multi-billion-dollar video game market is a pressure cooker that drives innovation

Motivation: Flexible and precise Modern GPUs are deeply programmable –Programmable pixel, vertex, video engines –Solidifying high-level language support Modern GPUs support high precision –32 bit floating point throughout the pipeline –High enough for many (not all) applications

Motivation: The Potential of GPGPU The power and flexibility of GPUs make them an attractive platform for general-purpose computation Example applications range from in-game physics simulation to conventional computational science Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor

The Problem: Difficult To Use GPUs designed for and driven by video games –Programming model is unusual & tied to computer graphics –Programming environment is tightly constrained Underlying architectures are: –Inherently parallel –Rapidly evolving (even in basic feature set!) –Largely secret Can’t simply “port” code written for the CPU!

GPU Fundamentals: The Graphics Pipeline A simplified graphics pipeline
–Note that pipe widths vary
–Many caches, FIFOs, and so on not shown
[Diagram: CPU (Application, Graphics State) → GPU (Transform → Rasterizer → Shade) → Video Memory (Textures); data flows as Vertices (3D) → Xformed, Lit Vertices (2D) → Fragments (pre-pixels) → Final pixels (Color, Depth); Render-to-texture feeds results back into Video Memory]

GPU Fundamentals: The Modern Graphics Pipeline Programmable vertex processor! Programmable fragment (pixel) processor!
[Diagram: the same pipeline with Transform replaced by a Vertex Processor and Shade replaced by a Fragment Processor]

GPU Pipeline: Transform Vertex Processor (multiple operate in parallel) –Transform from “world space” to “image space” –Compute per-vertex lighting
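The transform stage above can be sketched on the CPU. This is a minimal, hypothetical Python illustration (the function and matrix names are mine, not part of any GPU API): each vertex is multiplied by a 4×4 matrix and then divided by w to reach image space.

```python
# Sketch of the vertex processor's transform (hypothetical CPU-side Python):
# multiply a homogeneous vertex by a 4x4 row-major matrix, then do the
# perspective divide to land in 2D image space.

def transform_vertex(m, v):
    """Apply 4x4 matrix m to homogeneous vertex v = (x, y, z, w)."""
    out = [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]
    # Perspective divide: (x, y, z, w) -> (x/w, y/w, z/w)
    return [out[i] / out[3] for i in range(3)]

# The identity matrix leaves a vertex unchanged (w = 1).
identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
print(transform_vertex(identity, [2.0, 3.0, 4.0, 1.0]))  # -> [2.0, 3.0, 4.0]
```

On real hardware this runs once per vertex, in parallel, with the matrix supplied as constant memory.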

GPU Pipeline: Rasterizer Rasterizer –Convert geometric rep. (vertex) to image rep. (fragment) Fragment = image fragment –Pixel + associated data: color, depth, stencil, etc. –Interpolate per-vertex quantities across pixels
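The interpolation the rasterizer performs can be illustrated with a one-dimensional sketch. A hypothetical Python example (not hardware code): a per-vertex quantity such as a color channel is linearly interpolated across the fragments between two vertices.

```python
# Sketch of per-vertex interpolation (hypothetical Python): produce the
# values a rasterizer would assign to the fragments between two vertices.

def interpolate(v0, v1, steps):
    """Return `steps` values linearly interpolated from v0 to v1, inclusive."""
    if steps == 1:
        return [v0]
    return [v0 + (v1 - v0) * i / (steps - 1) for i in range(steps)]

# A color channel going from 0 at one vertex to 1 at the other, 5 fragments:
print(interpolate(0.0, 1.0, 5))  # -> [0.0, 0.25, 0.5, 0.75, 1.0]
```

Real rasterizers do this in 2D (and perspective-correctly) for every interpolated quantity at once, but the idea is the same.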

GPU Pipeline: Shade Fragment Processors (multiple in parallel) –Compute a color for each pixel –Optionally read colors from textures (images)

Importance of Data Parallelism GPU: Each vertex / fragment is independent –Temporary registers are zeroed –No static data –No read-modify-write buffers Data parallel processing –Best for ALU-heavy architectures: GPUs Multiple vertex & pixel pipelines –Hide memory latency (with more computation) Courtesy of Ian Buck

Arithmetic Intensity Lots of ops per word transferred GPGPU demands high arithmetic intensity for peak performance –Ex: solving systems of linear equations –Physically-based simulation on lattices –All-pairs shortest paths Courtesy of Pat Hanrahan
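A back-of-the-envelope calculation makes "lots of ops per word transferred" concrete. This is an illustrative sketch (my own helper, not a measurement): a naive N×N matrix multiply performs 2N³ floating-point operations while moving only 3N² words, so its arithmetic intensity grows linearly with N.

```python
# Arithmetic intensity (ops per word transferred) of a naive NxN matrix
# multiply -- an illustrative calculation, not a benchmark.

def arithmetic_intensity(n):
    ops = 2 * n ** 3    # n^3 multiply-adds = 2n^3 floating-point ops
    words = 3 * n ** 2  # read two NxN matrices, write one
    return ops / words  # simplifies to 2n/3: grows linearly with n

print(arithmetic_intensity(1024))  # hundreds of ops per word at this size
```

This is exactly why the example workloads above (linear systems, lattice simulation, all-pairs shortest paths) suit GPUs: their op count grows faster than their memory traffic.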

Data Streams & Kernels Streams –Collection of records requiring similar computation Vertex positions, Voxels, FEM cells, etc. –Provide data parallelism Kernels –Functions applied to each element in stream Transforms, PDE, … –No dependencies between stream elements Encourage high arithmetic intensity Courtesy of Ian Buck
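The stream/kernel model above can be sketched in a few lines. A hypothetical Python illustration (the names are mine): a kernel is a function applied independently to every record of a stream, with no dependencies between elements, which is what makes the model data parallel.

```python
# Minimal stream/kernel sketch (hypothetical Python): apply a kernel to
# every element of a stream; elements never see each other.

def run_kernel(kernel, stream):
    """Data-parallel map: kernel(elem) for each stream element."""
    return [kernel(elem) for elem in stream]

# Example kernel: transform a 2D point (uniform scale by 2).
scale2 = lambda p: (2 * p[0], 2 * p[1])
print(run_kernel(scale2, [(1, 1), (3, 4)]))  # -> [(2, 2), (6, 8)]
```

On a GPU the stream would be vertex positions or texels and the kernel a vertex or fragment program, with the elements processed in parallel rather than in a Python loop.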

Example: Simulation Grid Common GPGPU computation style –Textures represent computational grids = streams Many computations map to grids –Matrix algebra –Image & Volume processing –Physical simulation –Global Illumination ray tracing, photon mapping, radiosity Non-grid streams can be mapped to grids

Stream Computation Grid Simulation algorithm –Made up of steps –Each step updates entire grid –Must complete before next step can begin Grid is a stream, steps are kernels –Kernel applied to each stream element
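The grid-simulation style above can be sketched as follows. This is a hypothetical Python illustration (a crude 1D diffusion, names my own): each step is a kernel applied to every cell, and the whole step reads only the previous grid, completing before the next step begins.

```python
# Grid-as-stream sketch (hypothetical Python): one simulation step applies
# a kernel to every cell; the step reads only the old grid, never its own
# partial output.

def step(grid):
    """Kernel per cell: average with the two (wrap-around) neighbors."""
    n = len(grid)
    return [(grid[(i - 1) % n] + grid[i] + grid[(i + 1) % n]) / 3.0
            for i in range(n)]

grid = [0.0, 0.0, 3.0, 0.0, 0.0]
for _ in range(2):          # two full steps; each completes before the next
    grid = step(grid)
print(grid)
```

On a GPU the grid would live in a texture, and `step` would be a fragment program run once per cell by drawing a full-screen quad.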

Scatter vs. Gather Grid communication
–Grid cells share information
–Gather: each cell reads values from other cells (indirect read)
–Scatter: each cell writes values to other cells (indirect write)
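The scatter/gather distinction that the next slides rely on can be sketched in Python (hypothetical helper functions, my own names): gather reads from computed addresses, scatter writes to them.

```python
# Gather vs. scatter sketch (hypothetical Python).

def gather(src, indices):
    """out[i] = src[indices[i]] -- read from a computed address."""
    return [src[j] for j in indices]

def scatter(values, indices, size):
    """out[indices[i]] = values[i] -- write to a computed address."""
    out = [0] * size
    for v, j in zip(values, indices):
        out[j] = v
    return out

print(gather([10, 20, 30], [2, 0, 1]))      # -> [30, 10, 20]
print(scatter([10, 20, 30], [2, 0, 1], 3))  # -> [20, 30, 10]
```

A texture fetch is a gather; writing to an address the program computes is a scatter, which is what fragment processors of this era could not do.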

Computational Resources Inventory Programmable parallel processors –Vertex & Fragment pipelines Rasterizer –Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants Texture unit –Read-only memory interface Render to texture –Write-only memory interface

Vertex Processor Fully programmable (SIMD / MIMD) Processes 4-vectors (RGBA / XYZW) Capable of scatter but not gather –Can change the location of current vertex –Cannot read info from other vertices –Can only read a small constant memory Future hardware enables gather! –Vertex textures

Fragment Processor Fully programmable (SIMD) Processes 4-vectors (RGBA / XYZW) Random access memory read (textures) Capable of gather but not scatter –No random access memory writes –Output address fixed to a specific pixel Typically more useful than vertex processor –More fragment pipelines than vertex pipelines –RAM read –Direct output

CPU-GPU Analogies CPU programming is familiar –GPU programming is graphics-centric Analogies can aid understanding

CPU-GPU Analogies CPU = GPU
–Stream / data array = Texture
–Memory read = Texture sample

CPU-GPU Analogies
–Loop body / kernel / algorithm step = Fragment program
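The analogies above can be made concrete with a short sketch. This is hypothetical Python (the names `texture_sample` and `fragment_program` are mine): the data array plays the texture, an indexed read plays the texture sample, and the loop body plays the fragment program run once per output element.

```python
# CPU-GPU analogy sketch (hypothetical Python).

def texture_sample(texture, coord):
    """Memory read == texture sample."""
    return texture[coord]

def fragment_program(texture, coord):
    """Loop body / kernel == fragment program (here: double the sample)."""
    return 2 * texture_sample(texture, coord)

texture = [1, 2, 3, 4]  # data array == texture
output = [fragment_program(texture, i) for i in range(len(texture))]
print(output)  # -> [2, 4, 6, 8]
```

The list comprehension stands in for the rasterizer, which on a GPU would invoke the fragment program once per output pixel in parallel.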

Feedback Each algorithm step depends on the results of previous steps Each time step depends on the results of the previous time step
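On GPUs this feedback is commonly implemented by "ping-ponging" two render targets: read from one texture, write to the other, then swap. A hypothetical Python sketch of that pattern (helper names my own):

```python
# Ping-pong feedback sketch (hypothetical Python): alternate two buffers so
# each step reads only the previous step's completed result.

def ping_pong(kernel, initial, steps):
    read_buf, write_buf = list(initial), [0] * len(initial)
    for _ in range(steps):
        for i in range(len(read_buf)):
            write_buf[i] = kernel(read_buf, i)     # write target only
        read_buf, write_buf = write_buf, read_buf  # swap roles
    return read_buf

inc = lambda buf, i: buf[i] + 1      # trivial kernel: add 1 each step
print(ping_pong(inc, [0, 0, 0], 3))  # -> [3, 3, 3]
```

Swapping the buffers is what render-to-texture enables: the texture written in one pass becomes the texture read in the next.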

ARB (Architecture Review Board) GPU Assembly Language
ABS - absolute value
ADD - add
ARL - address register load
DP3 - 3-component dot product
DP4 - 4-component dot product
DPH - homogeneous dot product
DST - distance vector
EX2 - exponential base 2
EXP - exponential base 2 (approximate)
FLR - floor
FRC - fraction
LG2 - logarithm base 2
LIT - compute light coefficients
LOG - logarithm base 2 (approximate)
MAD - multiply and add
MAX - maximum
MIN - minimum
MOV - move
MUL - multiply
POW - exponentiate
RCP - reciprocal
RSQ - reciprocal square root
SGE - set on greater than or equal
SLT - set on less than
SUB - subtract
SWZ - extended swizzle
XPD - cross product

Nvidia Graphics Card Architecture GeForce-8 Series
–12,288 concurrent threads, hardware managed
–128 Thread Processor cores at 1.35 GHz == 518 GFLOPS peak
[Diagram: host CPU and work distribution unit feeding eight clusters of streaming processors (SP) with shared memory, paired with texture units (TEX) and L1 caches, connected through L2 caches to the memory partitions]

NVIDIA FERMI

FERMI: Streaming Multiprocessor (SM) Each SM contains
–32 cores
–16 load/store units
–32,768 registers

FERMI: Core Architecture Newer FP representation
–IEEE 754-2008
Two units
–Floating point
–Integer
Simultaneous execution possible

FERMI: Comparison

FERMI: Results

What it looks like!

Applications Lots of sample applications included
–Ray tracer
–FFT
–Image segmentation
–Linear algebra

Brook Performance 2-3× faster than CPU implementation
–Compared against a 3 GHz Pentium 4: Intel Math Library, FFTW, custom cache-blocked segment C code
GPUs still lose against SSE cache-friendly code and super-optimized libraries (ATLAS, FFTW)
–GPUs tested: ATI Radeon 9800 XT, NVIDIA GeForce 6800

GPGPU Examples Fluids Reaction/Diffusion Multigrid Tone mapping