Data Parallel Computing on Graphics Hardware Ian Buck Stanford University

Why graphics hardware?
Raw performance:
– Pentium 4 SSE, theoretical*: 3 GHz * 4 wide * 0.5 inst/cycle = 6 GFLOPS
– GeForce FX 5900 (NV35) fragment shader, obtained: MULR R0, R0, R0: 20 GFLOPS
– Equivalent to a 10 GHz P4, and getting faster: 3x improvement over NV30 (6 months)
2002 R&D costs:
– Intel: $4 billion
– NVIDIA: $150 million
*from Intel P4 Optimization Manual

GPU: Data Parallel
– Each fragment is shaded independently; no dependencies between fragments:
  temporary registers are zeroed, no static variables, no read-modify-write textures
– Multiple “pixel pipes” exploit this data parallelism for better ALU usage and to hide memory latency
[Torborg and Kajiya 96, Anderson et al. 97, Igehy et al. 98]

Arithmetic Intensity
Lots of ops per word transferred. In the graphics pipeline:
– Vertex: BW: 1 triangle = 32 bytes; OP: f32-ops / triangle
– Rasterization: creates fragments per triangle
– Fragment: BW: 1 fragment = 10 bytes; OP: i8-ops / fragment
Courtesy of Pat Hanrahan

Arithmetic Intensity
Arithmetic intensity is the compute-to-bandwidth ratio. High arithmetic intensity is desirable:
– The app is limited by ALU performance, not off-chip bandwidth
– More chip real estate goes to ALUs, not caches
Courtesy of Bill Dally
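To make the ratio concrete with assumed numbers (an illustration, not from the slides): a kernel that transfers two 4-byte floats per element (one read, one write) while executing 40 ALU ops has an arithmetic intensity of 40 ops / 2 words = 20 and will be ALU-limited; the same kernel doing only 2 ops per element (intensity 1) is bandwidth-limited.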

Brook
A general-purpose streaming language. Stream programming model:
– Enforce data-parallel computing
– Encourage arithmetic intensity
– Provide fundamental ops for stream computing

Brook
A general-purpose streaming language. Demonstrate the GPU as a streaming coprocessor:
– Virtualize the GPU: hide texture/pbuffer data management, hide graphics-based constructs in Cg/HLSL, hide rendering passes
– Highlight GPU areas for improvement: features required for general-purpose stream computing

Streams & Kernels
Streams (data):
– Collections of records requiring similar computation: vertex positions, voxels, FEM cells, …
– Provide data parallelism
Kernels (functions):
– Functions applied to each element in a stream: transforms, PDEs, …
– No dependencies between stream elements
– Encourage high arithmetic intensity

Brook: C with streams
– Simple language additions for streams/kernels
– API for managing streams
Streams:

  float s<100>;
  streamread (s, ptr);
  streamwrite (s, ptr);

Brook: kernel functions
– Map a function to a set
Example: position update in a velocity field:

  kernel void updatepos (float3 pos<>,
                         float3 vel[100,100,100],
                         float timestep,
                         out float3 newpos<>)
  {
      newpos = pos + vel[pos]*timestep;
  }

  updatepos (pos, vel, timestep, pos);
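Putting the last two slides together, a minimal end-to-end Brook program looks like the following sketch (the kernel, stream shapes, and host buffer are illustrative, not from the slides):

  kernel void scale (float a<>, float k, out float b<>) {
      b = a * k;                   /* executed once per stream element */
  }

  float src<100>;                  /* 1D streams of 100 floats */
  float dst<100>;
  float data[100];                 /* ordinary C array on the host */

  streamread (src, data);          /* host -> stream */
  scale (src, 2.0f, dst);          /* element-wise kernel call */
  streamwrite (dst, data);         /* stream -> host */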

Fundamental Ops
Associative reductions: MyReduceFunc (s, result)
– Produce a single value from a stream
– Examples: compute Max or Sum
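The slide gives only the call shape; for concreteness, here is how a sum reduction is written in the later released Brook language (syntax from the Brook for GPUs spec, a hedged stand-in for what MyReduceFunc abbreviates):

  reduce void sum (float a<>, reduce float r<>) {
      r = r + a;               /* must be associative: passes may regroup elements */
  }

  float s<100>;
  float result;
  sum (s, result);             /* reduces all 100 elements to one value */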

Fundamental Ops
Gather: p = a[i]
– Indirect read
– Permitted inside kernels
Scatter: a[i] = p
– Indirect write: ScatterOp (s_index, s_data, s_dst, SCATTEROP_ASSIGN)
– Last-write-wins rule

GatherOp & ScatterOp
Indirect read/write with an atomic operation:
– GatherOp: p = a[i]++
  GatherOp (s_index, s_data, s_src, GATHEROP_INC)
– ScatterOp: a[i] += p
  ScatterOp (s_index, s_data, s_dst, SCATTEROP_ADD)
Important for building and updating data structures for data-parallel computing.
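To see why these matter for data structures: ScatterOp with ADD semantics turns colliding writes into accumulation, which is exactly a histogram. A hedged sketch (the stream names s_bin, s_one, s_hist are illustrative, not from the slides):

  float s_bin<1000>;           /* bin index for each input element */
  float s_one<1000>;           /* a stream of 1.0f values */
  float s_hist<256>;           /* histogram bins, initialized to zero */

  /* for each i: s_hist[s_bin[i]] += s_one[i], atomically */
  ScatterOp (s_bin, s_one, s_hist, SCATTEROP_ADD);

With plain Scatter's last-write-wins rule, colliding elements would overwrite each other instead of accumulating.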

Implementation
Streams:
– Stored in 2D floating-point textures / pbuffers
– Allocated at compile time
– Managed by the runtime
Kernels:
– Compiled to fragment programs
– Executed by rendering a quad

Implementation
Compiler: brcc: foo.br → foo.cg, foo.fp, foo.c
A source-to-source compiler:
– Generates Cg/HLSL code: converts array lookups to texture fetches, performs stream/texture lookups and texture address calculation
– Generates a C stub file: fragment program loader, render code
– Targets NV fp30, ARB fp, ps2.0
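As a rough illustration of the mapping (hand-written Cg in the style brcc targets, not actual compiler output), the scale kernel sketched earlier might become a fragment program like:

  float4 scale_fp (float2 uv : TEXCOORD0,
                   uniform samplerRECT a_tex,   /* stream a lives in a texture */
                   uniform float k) : COLOR
  {
      float a = texRECT (a_tex, uv).x;          /* stream read = texture fetch */
      return float4 (a * k, 0, 0, 0);           /* kernel body: b = a * k */
  }

The runtime binds the output stream's pbuffer and renders a stream-sized quad, so the program runs once per element.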

Implementation
– Reduction: O(lg n) passes
– Gather: dependent texture read
– Scatter: vertex shader (slow)
– GatherOp / ScatterOp: vertex shader with CPU sort (slow)
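The O(lg n) reduction works by ping-ponging between two buffers, halving the active area each pass. A hedged C-style sketch of the idea (src, dst, and the helper names are hypothetical, not the Brook runtime API):

  int n = 1024;                    /* stream length; power of two assumed */
  while (n > 1) {
      n = n / 2;
      bind_texture (src);          /* source of this pass */
      render_quad (dst, n);        /* fragment: dst[i] = op(src[2*i], src[2*i+1]) */
      swap (&src, &dst);           /* ping-pong the two pbuffers */
  }
  /* src now holds the single reduced value */

For 1024 elements this takes lg(1024) = 10 rendering passes.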

Gromacs
Molecular dynamics simulator:
– Force function (~90% of compute time)
– Energy function
– Acceleration structure
Erik Lindahl, Erik Darve, Yanan Zhao

October 19th, Ray Tracing Tim Purcell, Bill Mark, Pat Hanrahan

Finite Volume Methods
W_i = ∂W/∂I_i
Joseph Teran, Victor Ng-Thow-Hing, Ronald Fedkiw

Applications
– Gromacs, ray tracing: Reduce, Gather, GatherOp, ScatterOp
– FVM, FEM: Reduce, Gather
– Sparse matrix multiply, Batcher bitonic sort: Gather

GPU Gotchas
NVIDIA NV3x: register usage vs. GFLOPS
[chart: execution time as a function of registers used]

GPU Gotchas
ATI Radeon 9800 Pro:
– Limited dependent texture lookups
– 96 instructions
– 24-bit floating point
[diagram: alternating texture-lookup and math-ops phases]

GPU Issues
– Missing integer & bit ops
– Texture memory addressing: need large flat texture addressing
– Readback still slow
– Shader compiler performance: hand-code performance-critical code
– No native reduction support

GPU Issues
– No native scatter support: cannot do p[i] = a (indirect write)
  Needs dependent texture write: set x,y inside the fragment program
– No programmable blend (for GatherOp / ScatterOp)
– Limited output
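One simple way to picture the scatter emulation (a hedged sketch, not Brook's actual runtime code; x_of/y_of are hypothetical index-to-texel helpers, and index[] and data[] are assumed to have been read back to the CPU): submit one point per element so each value lands at the texel its index selects:

  /* draw one point per element into the destination pbuffer;
     the fragment program writes out the carried value */
  glBegin (GL_POINTS);
  for (int i = 0; i < n; i++) {
      glTexCoord1f (data[i]);                      /* value carried along */
      glVertex2f (x_of (index[i]), y_of (index[i]));
  }
  glEnd ();

Brook's actual path moves this work into a vertex shader (with a CPU sort for ScatterOp's atomic semantics), which is why the earlier implementation slide calls it slow.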

Status
– In progress: NV30 prototype working
– Scheduled public release: December 15, 2003, on SourceForge
– Stanford Streaming Supercomputer: cluster version

Summary
GPUs are faster than CPUs, and getting faster. Why?
– Data parallelism
– Arithmetic intensity
What is the right programming model?
– Stream computing
– Brook for GPUs

Summary
“All processors aspire to be general-purpose”
– Tim van Hook, keynote, Graphics Hardware 2001