Graphics Hardware CMSC 435/634. Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline.

Slides:



Advertisements
Similar presentations
DirectX11 Performance Reloaded
Advertisements

Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,
COMPUTER GRAPHICS CS 482 – FALL 2014 NOVEMBER 10, 2014 GRAPHICS HARDWARE GRAPHICS PROCESSING UNITS PARALLELISM.
Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.
Optimization on Kepler Zehuan Wang
Understanding the graphics pipeline Lecture 2 Original Slides by: Suresh Venkatasubramanian Updates by Joseph Kider.
Status – Week 257 Victor Moya. Summary GPU interface. GPU interface. GPU state. GPU state. API/Driver State. API/Driver State. Driver/CPU Proxy. Driver/CPU.
RealityEngine Graphics Kurt Akeley Silicon Graphics Computer Systems.
Prepared 5/24/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Graphics Hardware CMSC 435/634. Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline.
The Programmable Graphics Hardware Pipeline Doug James Asst. Professor CS & Robotics.
Control Flow Virtualization for General-Purpose Computation on Graphics Hardware Ghulam Lashari Ondrej Lhotak University of Waterloo.
Computer Graphics Hardware Acceleration for Embedded Level Systems Brian Murray
1 Shader Performance Analysis on a Modern GPU Architecture Victor Moya, Carlos González, Jordi Roca, Agustín Fernández Jordi Roca, Agustín Fernández Department.
A Crash Course on Programmable Graphics Hardware Li-Yi Wei 2005 at Tsinghua University, Beijing.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Chapter.
Status – Week 243 Victor Moya. Summary Current status. Current status. Tests. Tests. XBox documentation. XBox documentation. Post Vertex Shader geometry.
3D Graphics Processor Architecture Victor Moya. PhD Project Research on architecture improvements for future Graphic Processor Units (GPUs). Research.
IN4151 Introduction 3D graphics 1 Introduction to 3D computer graphics part 2 Viewing pipeline Multi-processor implementation GPU architecture GPU algorithms.
Z-Buffer Optimizations Patrick Cozzi Analytical Graphics, Inc.
GPU Simulator Victor Moya. Summary Rendering pipeline for 3D graphics. Rendering pipeline for 3D graphics. Graphic Processors. Graphic Processors. GPU.
1 A Single (Unified) Shader GPU Microarchitecture for Embedded Systems Victor Moya, Carlos González, Jordi Roca, Agustín Fernández Department of Computer.
ATI GPUs and Graphics APIs Mark Segal. ATI Hardware X1K series 8 SIMD vertex engines, 16 SIMD fragment (pixel) engines 3-component vector + scalar ALUs.
Z-Buffer Optimizations Patrick Cozzi Analytical Graphics, Inc.
Graphic Architecture introduction and analysis
Status – Week 260 Victor Moya. Summary shSim. shSim. GPU design. GPU design. Future Work. Future Work. Rumors and News. Rumors and News. Imagine. Imagine.
Graphics Processors CMSC 411. GPU graphics processing model Texture / Buffer Texture / Buffer Vertex Geometry Fragment CPU Displayed Pixels Displayed.
GPU Graphics Processing Unit. Graphics Pipeline Scene Transformations Lighting & Shading ViewingTransformations Rasterization GPUs evolved as hardware.
University of Texas at Austin CS 378 – Game Technology Don Fussell CS 378: Computer Game Technology Beyond Meshes Spring 2012.
CSL 859: Advanced Computer Graphics Dept of Computer Sc. & Engg. IIT Delhi.
© Copyright Khronos Group, Page 1 Harnessing the Horsepower of OpenGL ES Hardware Acceleration Rob Simpson, Bitboys Oy.
REAL-TIME VOLUME GRAPHICS Christof Rezk Salama Computer Graphics and Multimedia Group, University of Siegen, Germany Eurographics 2006 Real-Time Volume.
Mapping Computational Concepts to GPUs Mark Harris NVIDIA Developer Technology.
Extracted directly from:
Chris Kerkhoff Matthew Sullivan 10/16/2009.  Shaders are simple programs that describe the traits of either a vertex or a pixel.  Shaders replace a.
Cg Programming Mapping Computational Concepts to GPUs.
Week 2 - Friday.  What did we talk about last time?  Graphics rendering pipeline  Geometry Stage.
Fast Computation of Database Operations using Graphics Processors Naga K. Govindaraju Univ. of North Carolina Modified By, Mahendra Chavan forCS632.
Stream Processing Main References: “Comparing Reyes and OpenGL on a Stream Architecture”, 2002 “Polygon Rendering on a Stream Architecture”, 2000 Department.
GPU Computation Strategies & Tricks Ian Buck NVIDIA.
Xbox MB system memory IBM 3-way symmetric core processor ATI GPU with embedded EDRAM 12x DVD Optional Hard disk.
ATI Stream Computing ATI Radeon™ HD 2900 Series GPU Hardware Overview Micah Villmow May 30, 2008.
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
Computer Graphics 3 Lecture 6: Other Hardware-Based Extensions Benjamin Mora 1 University of Wales Swansea Dr. Benjamin Mora.
Fateme Hajikarami Spring  What is GPGPU ? ◦ General-Purpose computing on a Graphics Processing Unit ◦ Using graphic hardware for non-graphic computations.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Lecture.
Ray Tracing using Programmable Graphics Hardware
Mapping Computational Concepts to GPUs Mark Harris NVIDIA.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 GPU.
GPGPU introduction. Why is GPU in the picture Seeking exa-scale computing platform Minimize power per operation. – Power is directly correlated to the.
My Coordinates Office EM G.27 contact time:
GLSL Review Monday, Nov OpenGL pipeline Command Stream Vertex Processing Geometry processing Rasterization Fragment processing Fragment Ops/Blending.
GPU Architecture and Its Application
COMPUTER GRAPHICS CHAPTER 38 CS 482 – Fall 2017 GRAPHICS HARDWARE
CMSC 611: Advanced Computer Architecture
Week 2 - Friday CS361.
A Crash Course on Programmable Graphics Hardware
Graphics on GPU © David Kirk/NVIDIA and Wen-mei W. Hwu,
CS427 Multicore Architecture and Parallel Computing
Graphics Processing Unit
Chapter 6 GPU, Shaders, and Shading Languages
From Turing Machine to Global Illumination
The Graphics Rendering Pipeline
Graphics Hardware CMSC 491/691.
Joshua Barczak CMSC435 UMBC
Graphics Processing Unit
UMBC Graphics for Games
RADEON™ 9700 Architecture and 3D Performance
Balancing the Graphics Pipeline for Optimal Performance
CIS 6930: Chip Multiprocessor: GPU Architecture and Programming
Presentation transcript:

Graphics Hardware CMSC 435/634

Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline

67 GFLOPS 1.1 TFLOPS 75 GB/s 13 GB/s 335 GB/s Texture 45 GB/s Fragment Vertex Triangle Fragment Computation and Bandwidth Based on: 100 Mtri/sec 256 Bytes vertex data 128 Bytes interpolated 68 Bytes fragment output 5x depth complexity 16 4-Byte textures 223 ops/vert 1664 ops/frag No caching No compression

Task Distribute Merge Data Parallel

Vertex Distribute objects by screen tile Triangle Fragment Some pixels Some objects Vertex Triangle Fragment Vertex Triangle Fragment Objects Screen Sort First

Vertex Distribute objects or vertices Merge & Redistribute by screen location Vertex Triangle Fragment Triangle Fragment Triangle Fragment Triangle Fragment Some pixels Some objects Objects Screen Sort Middle

TiledInterleaved Screen Subdivision

Vertex Triangle Fragment Distribute by object Z-merge Vertex Triangle Fragment Vertex Triangle Fragment Full Screen Some objects Objects Screen Sort Last

Graphics Processing Unit (GPU) Sort Middle(ish) Fixed-Function HW for clip/cull, raster, texturing, Ztest Programmable stages Commands in, pixels out

Vertex Pixel Triangle Pipeline … Parallel More Parallel More Pipeline More Parallel GPU Computation

Architecture: Latency CPU: Make one thread go very fast Avoid the stalls –Branch prediction –Out-of-order execution –Memory prefetch –Big caches GPU: Make 1000 threads go very fast Hide the stalls –HW thread scheduler –Swap threads to hide stalls

Architecture (MIMD vs SIMD) CTRL ALU CTRL ALU CTRL ALU MIMD(CPU-Like)SIMD (GPU-Like) CTRL Flexibility Horsepower Ease of Use

SIMD Branching if( x ) // mask threads // issue instructions else // invert mask // issue instructions // unmask Threads agree, issue if Threads disagree, issue if AND else Threads agree, issue else

SIMD Looping while(x) // update mask // do stuff They all run ‘till the last one’s done…. Useful Useless

Z-Buffer Rasterize GPU graphics processing model Vertex Geometry Fragment CPU Displayed Pixels Displayed Pixels Texture/ Buffer

[Kilgaraff and Fernando, GPU Gems 2] NVIDIA GeForce 6 Vertex Rasterize Fragment Z-Buffer Displayed Pixels Displayed Pixels

Z-Buffer Rasterize GPU graphics processing model Vertex Geometry Fragment CPU Displayed Pixels Displayed Pixels Texture/ Buffer

GPU graphics processing model CPU Displayed Pixels Displayed Pixels Vertex Geometry Fragment Rasterize Z-Buffer Texture/ Buffer

AMD/ATI R600 Dispatch

SIMD Units 2x2 Quads (4 per SIMD) 20 ALU/Quad (5 per thread) “Wavefront” of 64 Threads, executed over 8 clocks 2 Waves interleaved Interleaving + multi-cycling hides ALU latency. Wavefront switching hides memory latency. GPR Usage determines wavefront count. General Purpose Registers 4x32bit (THOUSANDS of them)

[Tom’s Hardware] AMD/ATI R600

Texture

NVIDIA Kepler [NVIDIA, NVIDIA’s Next Generation CUDA Compute Architecture: Kepler GK110, 2012]

Kepler SMX [NVIDIA, NVIDIA’s Next Generation CUDA Compute Architecture: Kepler GK110, 2012]

GPU Performance Tips

Graphics System Architecture Your Code APIDriver Current Frame (Buffering Commands) Previous Frame(s) (Submitted, Pending Execution) GPU GPU(s) Display

GPU Performance Tips API and Driver –Reading Results Derails the train….. Occlusion Queries  Death –When used poorly…. Framebuffer reads  DEATH!!! –Almost always… CPU-GPU communication should be one way If you must read, do it a few frames later…

GPU Performance Tips API and Driver –Minimize shader/texture/constant changes –Minimize Draw calls –Minimize CPU  GPU traffic glBegin()/glEnd() are EVIL! Use static vertex/index buffers if you can Use dynamic buffers if you must –With discarding locks…

GPU Performance Tips Shaders, Generally –NO unnecessary work! Precompute constant expressions Div by constant  Mul by reciprocal –Minimize fetches Prefer compute (generally) If ALU/TEX < 4, ALU is under-utilized If combining static textures, bake it all down…

GPU Performance Tips Shaders, Generally –Careful with flow control Avoid divergence Flatten small branches Prefer simple control structure –Double-check the compiler –Look over artists’ shoulders Material editors give them lots of rope….

GPU Performance Tips Vertex Processing –Use the right data format Cache-optimized index buffer Small, 16-byte aligned vertices –Cull invisible geometry Coarse-grained (few thousand triangles) is enough “Heavy” Geometry load is ~2MTris and rising

GPU Performance Tips Pixel Processing –Small triangles hurt performance 2x2 pixel quads  Waste at edge pixels –Respect the texture cache Adjacent pixels should touch adjacent texels Use the smallest possible texture format –Avoid dependent texture reads –Do work per vertex There’s usually less of those (see small triangles)

GPU Performance Tips Pixel Processing –HW is very good at Z culling Early Z, Hierarchical Z –If possible, submit geometry front to back –“Z Priming” is commonplace

GPU Performance Tips Blending/Backend –Turn off what you don’t need Alpha blending Color/Z writes –Minimize redundant passes Multiple lights/textures in one pass –Use the smallest possible pixel format –Consider clip() in transparent regions