Presentation on theme: "Joshua Barczak CMSC435 UMBC"— Presentation transcript:

1 Graphics Hardware
Joshua Barczak CMSC435 UMBC

2 Object-Order Rendering
Object-Space Primitives -> (Modeling XForm) -> World-Space Primitives -> (Viewing XForm) -> Camera-Space Primitives -> (Projection XForm) -> Clip-Space Primitives -> (Clip) -> Clip-Space Primitives -> (Viewport/Window XForm) -> Raster-Space Primitives -> (Rasterize) -> Pixels -> Displayed Pixels

3 The Code
DrawTriangles( Vertex* vb, int* ib, int n_primitives )
{
    for( each primitive i ) {
        // fetch the indices
        int indices[3] = { ib[3*i], ib[3*i+1], ib[3*i+2] };
        // fetch the vertices
        Vertex v[3] = { vb[indices[0]], vb[indices[1]], vb[indices[2]] };
        // transform the vertices
        TransformedVertex tv[3] = { Process(v[0]), Process(v[1]), Process(v[2]) };
        // clip/cull (may create more triangles)
        for( each clipped triangle ) {
            // backface cull
            if( !Backfacing() ) {
                // rasterization setup
                for( each rasterized pixel ) {
                    // interpolate vertex attributes
                    // z test
                    // calculate color
                    // blend result into frame buffer
                }
            }
        }
    }
}

4 The Machine
Host: a CPU and Memory running the Application and Device Driver, connected over the PCIe bus to the GPU and its own Memory. The GPU is the DrawTriangles function, in ASIC form.

5 The Cheap Machine
CPU and GPU on a shared die, with shared Memory; the Application and Device Driver run on the CPU.

6 Our View of The Machine
API calls to:
- Manage memory
- Configure the pipeline: shader code, resource bindings, fixed-function states (Z/stencil, alpha)
- Draw sets of triangles
Pipeline configuration: VS code + resources, rasterizer state, PS code + resources, output merger state.

7 The Rules
Rule #1: Don't be silly
If you have 1000 triangles, do not make 1000 API calls. glBegin/glEnd are evil.

8 The Rules
Rule #1: Don't be silly
Compute at the correct rate: per-vertex work is cheaper than per-pixel work, in general.
Simplify uniform expressions:
x = CONST * CONST -> x = const
x < SQRT(CONST) -> x*x < const
x = y/CONST -> x = y*(1/const) = y*const
x = pow(CONST, y) = exp(log(CONST)*y) -> x = exp(const*y)
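The rewrites above can be sketched and sanity-checked directly. This is an illustrative snippet, not shader code: `kA` is a hypothetical uniform constant, and each `fast_` variant is the algebraic rewrite of the corresponding `slow_` variant.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical uniform constant; in a shader this would be known at
// compile time, so the compiler can fold the constant subexpressions.
const float kA = 3.0f;

float slow_div(float y) { return y / kA; }           // divide every invocation
float fast_div(float y) { return y * (1.0f / kA); }  // reciprocal computed once

bool slow_cmp(float x) { return x < std::sqrt(kA); }      // sqrt every invocation
bool fast_cmp(float x) { return x < 0.0f || x * x < kA; } // squared compare

float slow_pow(float y) { return std::pow(kA, y); }           // pow every invocation
float fast_pow(float y) { return std::exp(std::log(kA) * y); } // log(kA) is constant
```

Each pair is mathematically equivalent; the point is that the cheap form moves the expensive operation out of the per-pixel path.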

9 The unstoppable march of time
Our thread issues draw calls for our scene; a driver thread (on another core) builds command buffers; a hardware queue feeds GPU operation. Frames N, N+1, and N+2 are in flight at once.

10 The Worst Thing Imaginable
CPU/GPU dependencies.
GPU: Draw() Draw() Draw() Draw() ReadPixels() … Draw() Draw() Draw() ReadPixels() … (does nothing while it waits for more draws)
CPU: process draw calls, do nothing while we wait for the pixels, do whatever it is we're doing with those pixels, issue more draw calls, wait again.

11 The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Don't read the frame buffer. If you use occlusion queries, wait a few frames before reading them.

12 OpenGL Drawing
Convenient, but wrong: the driver must copy the data on every draw.

void DrawIndexedMesh( float* pPositions, float* pNormals,
                      GLuint* pIndices, GLuint nTriangles )
{
    glEnableClientState( GL_VERTEX_ARRAY );
    glEnableClientState( GL_NORMAL_ARRAY );
    glVertexPointer( 3, GL_FLOAT, 3*sizeof(GLfloat), pPositions );
    glNormalPointer( GL_FLOAT, 3*sizeof(GLfloat), pNormals );
    glDrawElements( GL_TRIANGLES, 3*nTriangles, GL_UNSIGNED_INT, pIndices );
}

13 OpenGL Drawing the Right Way
At startup, create buffer objects and upload the data once:

void CreateBufferObjects( Vertex* pVB, int nVertices, int* pIB, int nTris )
{
    glGenBuffers( 1, &vbo );
    glGenBuffers( 1, &ibo );
    glBindBuffer( GL_ARRAY_BUFFER, vbo );
    glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, ibo );
    glBufferData( GL_ARRAY_BUFFER, nVertices*sizeof(Vertex), pVB, GL_STATIC_DRAW );
    glBufferData( GL_ELEMENT_ARRAY_BUFFER, 3*nTris*sizeof(int), pIB, GL_STATIC_DRAW );
}

Then draw from the buffer objects. The last arguments are no longer pointers; they are byte offsets into the bound buffers:

void DrawIndexedMesh( GLuint vbo, GLuint ibo, GLuint nTriangles, GLuint index_offset )
{
    glBindBuffer( GL_ARRAY_BUFFER, vbo );
    glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, ibo );
    glEnableClientState( GL_VERTEX_ARRAY );
    glEnableClientState( GL_NORMAL_ARRAY );
    glVertexPointer( 3, GL_FLOAT, 3*sizeof(GLfloat), 0 );
    glNormalPointer( GL_FLOAT, 3*sizeof(GLfloat), (GLvoid*)(3*sizeof(GLfloat)) );
    glDrawElements( GL_TRIANGLES, 3*nTriangles, GL_UNSIGNED_INT,
                    (GLvoid*)(uintptr_t)index_offset );
}

14 The Rules Rule #1: Don’t be silly Rule #2: One way traffic
Rule #3: Do not move data

15 Dynamic Buffers
But… I NEED to move data. Do you really?
Sometimes you do: particles, CPU animation.
You need to double-buffer to avoid stalls.

16 Dynamic Buffers
The CPU writes Buffer0 while the GPU reads Buffer1, then they swap; across frames 0-3 the two buffers alternate.
If you give them the right flags, drivers will manage this for you (read the docs).
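The swap pattern is just frame parity. A minimal sketch (illustrative only; with the right buffer flags the driver does this bookkeeping internally):

```cpp
#include <cassert>

// Double-buffering sketch: two buffers alternate by frame number, so the
// buffer the CPU fills for frame N is never the one the GPU is still
// reading from frame N-1.
int buffer_for_frame(int frame) { return frame % 2; }
```

The invariant worth checking is that consecutive frames always land in different buffers, so CPU writes never stall on in-flight GPU reads.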

17 The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: When breaking rule #3, do so correctly
Use dynamic, write-only buffers with discard.

18 Pipelining
Instructions take several cycles: Fetch, Decode, Execute, Write Regs.
Nonpipelined: N instructions in 4N clocks (4 CPI).
Pipelined: 9 instructions in 12 clocks (about 1.33 CPI; 1 CPI in the limit).
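The cycle counts above follow from simple arithmetic for a 4-stage pipeline; a quick sketch:

```cpp
#include <cassert>

// 4-stage pipeline (Fetch, Decode, Execute, Write Regs).
// Nonpipelined: each instruction occupies all 4 stages before the next starts.
int cycles_nonpipelined(int n) { return 4 * n; }

// Pipelined: the first instruction takes 4 cycles to drain through the
// stages; after that, one instruction completes every cycle.
int cycles_pipelined(int n) { return n > 0 ? 4 + (n - 1) : 0; }
```

This reproduces the slide's figures: 9 instructions take 36 clocks unpipelined but only 12 clocks pipelined, approaching 1 CPI as N grows.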

19 Graphics Pipeline Hardware Stages
Each piece of the DrawTriangles loop from slide 3 maps onto a hardware stage:
Index Fetch -> Vertex Fetch -> Vertex Shade -> Clip/Cull -> Raster -> Z -> Shade -> Blend

20 Graphics Pipelining
Index Fetch, Vertex Fetch, Vertex Shade, Clip/Cull, Raster, Z, Shade, and Blend all overlap across clock cycles, like CPU pipeline stages.

21 The Physical Machine
Control registers hold stuff like:
- VB/IB address
- Primitive type
- Cull mode
- Viewport size
- Z buffer pointer/format
- Color buffer pointer/format
- Texture pointers/formats
- Blend mode
- Z/stencil modes
These feed the fixed hardware path: Vertex Functional Units -> Triangle Rasterizer -> Pixel Blend.

22 State Change
A state change drains the pipeline: Index Fetch must wait for the Vertex Shade stage, and the bubble ripples through Vertex Fetch, Clip, Cull, Raster, Z, Shade, and Blend. In the timing diagram, white space indicates wasted electricity (clock cycles, tick tock).

23 State Change (Software)
Your code: SetState(), draw triangles, SetState(), draw triangles, SetState(), draw triangles…
Driver workload, every time:
- Figure out what registers to change
- Turn texture/VBO handles into addresses/formats
- Convert GL state to register bits
- Change shader code addresses
- Put register writes into the command buffer
- Put draw commands into the command buffer
- Repeat…
This part is much more severe than the hardware bubbles.

24 The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Sort objects by state. Pack meshes together.

25 Computation & Bandwidth
Based on: 100 Mtri/sec, 256 B vertex data, 128 B interpolated, 68 B fragment output, 5x depth complexity, 16 4-byte textures, 223 ops/vert, 1664 ops/frag, no caching, no compression.
That works out to stage bandwidths of 75, 13, 335, and 45 (texture) GB/s, with 67 GFLOPS of vertex work and 1.1 TFLOPS of fragment work.
It is physically impossible to run a serial datapath at these rates. (Slide: Olano)

26 Data Parallel
Distribute -> parallel Tasks -> Merge. (Slide: Olano)

27 Parallel Graphics Vertex Geometry Pixel

28-33 Barycentric Rasterization
SIMD Parallelism (NxN Stamp): the same coverage test is evaluated for an NxN stamp of pixels at once; slides 28-33 step the stamp across the triangle.

34 Ordering
Pixels are independent, but within a pixel, strict primitive order is required, or transparency doesn't work.

35 Parallel Raster Architectures
The "sort" is where pixel ordering gets resolved; architectures are classified by where that sort happens: first, middle, or last.
Good read: "A Sorting Classification of Parallel Rendering," Molnar et al.

36 Sort First
Distribute objects by screen tile: each parallel pipeline (Vertex -> Triangle -> Fragment) handles some objects and owns some pixels of the screen. (Slide: Olano)

37 Sort Middle
Distribute objects or vertices: parallel Vertex stages each take some objects; results are merged and redistributed by screen location to parallel Triangle and Fragment stages, each owning some pixels. (Slide: Olano)

38 Screen Subdivision
Tiled, Scan-Line Interleaved, or Column Interleaved. (Slide: Olano)

39 Sort Last
Distribute by object: each parallel pipeline (Vertex -> Triangle -> Fragment) takes some objects and rasterizes against the full screen; a fragment-merge stage composites the partitioned results into the final screen. (Slide: Olano)

40 Architecture
An execution core (not to scale): Control Logic, ALU, Regs.

41 Architecture
Multiple parallel cores: multiple-instruction, multiple-data (MIMD). 16 instruction streams, 16 data streams.

42 Architecture
SIMD machine: Single Instruction, Multiple Data, with shared control logic. 1 instruction stream, 60 data streams.
Pro: more throughput. Con: requires coherent execution.

43 SIMD Branching
if( x )      // mask threads
{
    // issue instructions
}
else         // invert mask
{
    // issue instructions
}
// unmask
Threads agree, take if: useful. Threads agree, take else: useful. Threads disagree, take if AND else: useless work for the masked lanes.
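The masking behavior can be modeled in scalar code. A minimal sketch, with a hypothetical 4-lane `simd_select` that mimics what the hardware does: every lane computes, and the per-lane condition mask picks which result is kept.

```cpp
#include <array>
#include <cassert>

constexpr int kLanes = 4;  // hypothetical SIMD width, for illustration

// All lanes run in lockstep; cond[] is the branch mask. When lanes
// disagree, both ifVal and elseVal have already been computed, and the
// mask just selects which one each lane commits.
std::array<int, kLanes> simd_select(const std::array<bool, kLanes>& cond,
                                    const std::array<int, kLanes>& ifVal,
                                    const std::array<int, kLanes>& elseVal) {
    std::array<int, kLanes> out{};
    for (int lane = 0; lane < kLanes; ++lane)
        out[lane] = cond[lane] ? ifVal[lane] : elseVal[lane];
    return out;
}
```

When every lane agrees, real hardware can skip issuing the untaken side; the model above shows the worst case, where both sides cost instructions.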

44 SIMD Looping
while( x )   // update mask
{
    // do stuff
}
They all run 'till the last one's done….
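"Run till the last one's done" means loop cost is the maximum trip count across the wavefront, not the average. A small sketch of that cost model (lane count and trip counts are illustrative):

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Cost model for a SIMD loop: the wavefront keeps issuing iterations
// until its slowest lane finishes, so the whole group pays for the
// maximum trip count.
int simd_loop_iterations(const std::array<int, 4>& tripCounts) {
    return *std::max_element(tripCounts.begin(), tripCounts.end());
}
```

One lane looping 10 times makes all four lanes pay for 10 iterations, which is why divergent loop bounds hurt SIMD throughput.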

45 The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Rule #6: Think about coherence
Keep control flow simple. Flatten branches. Avoid 'else' branches.

46 DX9-Era GPU
Index stream -> dedicated VS units (Vertex Cache, Post-TnL Cache) -> clip-space primitives -> Primitive Assembly -> Rasterizer -> Early-Z -> pixel blocks -> dedicated PS units (Texture Cache) -> Blend -> pixels. All caches sit over Memory.

47 Memory Bandwidth
Lots of things need data all at once: the index stream, vertex demand, texture demand, and alpha/Z operations, each through its own cache (vertex $, tex $, color $, depth $) over Memory.

48 Texture Tiling
Images are tiled in memory: 2D blocks of texels are stored contiguously.

49 Texture Tiling
The texture cache exists for reuse across pixel blocks. It provides bandwidth savings, not latency reduction like in a CPU.

50 Block Compression
DXT1 (BC1): a 4x4 pixel block packed into 8 bytes, 8:1 over standard 32-bit color. Two endpoint colors plus a 2-bit-per-pixel color index give four possible colors: the 2 endpoints and 2 interior points.
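The arithmetic behind the 8:1 ratio and the 4-entry palette can be sketched directly. This is a single-channel illustration of the palette rule (the real format works on 5:6:5 RGB endpoints), with interior points at 1/3 and 2/3 between the endpoints:

```cpp
#include <cassert>

// One channel of a DXT1-style palette: entries 0 and 1 are the endpoints,
// entries 2 and 3 are interpolated interior points.
struct Palette { float c[4]; };

Palette dxt1_palette(float c0, float c1) {
    return { { c0, c1, (2.0f * c0 + c1) / 3.0f, (c0 + 2.0f * c1) / 3.0f } };
}

// 4x4 pixels of 4-byte RGBA (64 bytes) packed into an 8-byte block.
int dxt1_ratio() { return (4 * 4 * 4) / 8; }
```

Each pixel then stores only a 2-bit index into this palette, which is where the 8-byte block budget comes from.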

51 Block Compression
DXT5 (BC3): DXT1 plus alpha, 4:1. Endpoint alphas (2 bytes) plus a 3-bit-per-pixel alpha index give 8 possible alphas.
BC4 (ATI1N): just the alpha block; 2:1 for greyscale.
BC5 (ATI2N): 2 alpha blocks slapped together; 2:1 for 2 channels, 4:1 for a tangent-space normal map.

52 Block Compression
BC6/BC7: 16-byte blocks, 7 different formats.

53 Z-Ordered Rasterization
Take 2D integer coordinates and interleave their bits to get a 1D index. Consecutive 1D indices are spatially coherent, so deinterleaving a counter walks through space coherently. (Figure: Wikipedia)
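The bit interleaving is straightforward to write out. A minimal sketch for 16-bit coordinates (a production rasterizer would use bit tricks rather than a loop):

```cpp
#include <cassert>
#include <cstdint>

// Morton (Z-order) index: interleave the bits of (x, y) so that
// consecutive 1D indices stay spatially close in 2D.
uint32_t interleave_bits(uint32_t x, uint32_t y) {
    uint32_t z = 0;
    for (int i = 0; i < 16; ++i) {
        z |= ((x >> i) & 1u) << (2 * i);      // x bits in even positions
        z |= ((y >> i) & 1u) << (2 * i + 1);  // y bits in odd positions
    }
    return z;
}
```

Indices 0..3 cover the 2x2 block at the origin, indices 0..15 the 4x4 block, and so on, which is exactly the coherence the rasterizer wants.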

54 The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Rule #6: Think about coherence
Rule #7: Save bandwidth
Use small vertex formats (16-bit float, 8-bit fixed) and small texture formats, and compress. Don't use 4 8-bit channels for a greyscale image! See Rules 1 and 6.

55 DX9-Era GPU
The DX9-era diagram again: dedicated VS and PS units with their own caches, feeding Primitive Assembly, Rasterizer, Early-Z, and Blend over shared Memory. Note the fixed split between vertex and pixel hardware.

56 Unified Shaders
Nvidia "Technical Brief" (read: marketing).

57 Unified Shaders
Nvidia "Technical Brief" (read: marketing), continued.

58 DX10-Era GPU
Unified shader (US) units run both vertex and GS threads, each unit with its own L1$. Index data and the Post-TL$ feed vertex work; clip-space primitives go through Primitive Assembly, Rasterizer, and Early-Z to pixels, then Blend. A shared L2$ and caches sit over Memory.

59 The Geometry Shader
One primitive in (point/line/triangle), up to N primitives out. Unpredictable data amplification, and output order MUST be preserved.

60 The Geometry Shader
One geometry shader spits out primitives one by one to the rasterizer. Parallel geometry shaders require lots of buffering, because results must be consumed in order….

61 The Geometry Shader
Nvidia: buffers in on-chip memory; parallelism limited by buffer space; faster for small amplification.
AMD: buffers in DRAM; lots of latency; faster for large amplification.

62 The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Rule #6: Think about coherence
Rule #7: Save bandwidth
Rule #8: Geometry shaders are slooooow

63 Memory Latency
Do a little math… miss the cache… wait a few hundred cycles for memory… keep going.

64 Memory Latency
CPU strategy: bend over backwards to avoid stalls. Sequencer, ALU, registers (and more registers), a gigantic cache, a branch predictor, out-of-order execution, and memory prefetch.

65 Memory Latency
GPU strategy: run lots of threads and "hide" the stalls with useful work. ALU, registers (THOUSANDS of them), a tiny cache, a sequencer, and a scheduler.

66 Latency
Do a little math… miss the cache… hardware swaps in other threads, so the memory access is overlapped by useful work… keep going.

67 Terminology
Thread: one instance of a shader program; one pixel/vertex.
Warp/wavefront: a SIMD-sized collection of threads running in lockstep; what hardware people call a thread. Many warps are kept in flight for latency hiding.

68 Occupancy
Register file: "registers" are SIMD-sized (one slot per SIMD lane) and are divided evenly among the resident warps.

69 Occupancy
4 registers per thread: 8 warps.

70 Occupancy
16 registers per thread: 2 warps.
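The occupancy numbers on these slides follow from one division. A sketch, assuming a hypothetical register file of 32 registers per SIMD lane, which is consistent with both the 4-register and 16-register examples above:

```cpp
#include <cassert>

// Occupancy sketch: the register file is split evenly among warps, so
// registers-per-thread caps how many warps can be resident at once.
// regsPerLane = 32 is an assumed file size matching the slides' examples.
int warps_in_flight(int regsPerLane, int regsPerThread) {
    return regsPerLane / regsPerThread;
}
```

Fewer registers per thread means more resident warps, and more warps means more opportunities to hide memory latency with useful work.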

71 The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Rule #6: Think about coherence
Rule #7: Save bandwidth
Rule #8: Geometry shaders are slooooow
Rule #9: Keep the machine full

72 Modern GPU
Unified shader units run vertex threads and GS/HS/DS threads; a tessellation stage and Primitive Assembly feed multiple parallel Setup/Rast/Z units; Blend writes pixels. Index data, the Post-TL$, per-unit L1$s, and a shared L2$ sit over Memory.

73 Tessellation
(Image: Unigine.com)

74 DX11 Tessellation Pipeline
Patches (control points) -> Hull Shader (selects tess factors / detail levels) -> Tessellation Hardware (emits U,V coordinates) -> Domain Shader (evaluation, emits vertices) -> Geometry Shader. [Moreton 2001]

75 Tessellation
Tessellation pitfalls: backface cull happens post-tessellation, so there is LOTS of wasted DS work, and small triangles create a 2x2 quad utilization problem.

76 Derivatives for MipMapping
Derivatives come from 2x2 quads plus differencing. Missing pixels are extrapolated… each 2x2 quad is self-contained.

77 Big Triangle

78 Rasterized Quads

79 Wasted Pixels
27 of 76 (35%) wasted. This drops off very fast for big triangles, and at this scale, this triangle is "small".

80 In the limit…
In this scenario, we shade 4 times as many pixels as we need. This is essentially what happens when we over-tessellate.
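The quad overhead is simple counting. A sketch, where the 19-quad / 49-covered-pixel figures are derived from slide 79's "27 of 76 wasted" numbers:

```cpp
#include <cassert>

// Pixel shaders always run on full 2x2 quads, so a triangle covering
// `covered` pixels still shades every pixel of each quad it touches.
int shaded_pixels(int quadsTouched) { return 4 * quadsTouched; }

int wasted_pixels(int covered, int quadsTouched) {
    return 4 * quadsTouched - covered;
}
```

In the limit of one-pixel triangles, every covered pixel drags along 3 helper pixels, giving the 4x shading overhead the slide describes.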

81 The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Rule #6: Think about coherence
Rule #7: Save bandwidth
Rule #8: Geometry shaders are slooooow
Rule #9: Keep the machine full
Rule #10: Small triangles are slow

82 The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Rule #6: Think about coherence
Rule #7: Save bandwidth
Rule #8: Geometry shaders are slooooow
Rule #9: Keep the machine full
Rule #10: Small triangles are slow
Rule #11: The rules are subject to change at any time and without notice…

83 NVIDIA GeForce 6 [Kilgariff and Fernando, GPU Gems 2]

84 Vertex Processing
4-wide FP vector plus special functions; MIMD. Vertex texture fetch is unfiltered and very slow.

85 Fragment Processing
Pixel pipe: two 4-wide vector pipes with dual issue and vector co-issue (3x1 or 2x2); FP16 arithmetic. Poor flow-control granularity: all in-flight threads take the same path.

86 AMD/ATI R600 [Tom’s Hardware]

87 SIMD Units
VLIW: 5 ALUs, 1 of them with transcendentals. 16-wide SIMD, issued in groups of 4 as a 64-thread "wavefront"; 2 waves issue over 8 clocks. 4 texture engines; texture ops run at 1/4 the ALU rate.

88 Dispatch

89 Demo

90 NVIDIA G80 [NVIDIA 8800 Architectural Overview, NVIDIA TB _v01, November 2006]

91 Streaming Processors
Scalar architecture. A 32-wide "warp" is issued over 4 clocks; special functions take 16 clocks (2 SFUs). Instruction issue interleaves warp instructions.

92 NVIDIA Fermi [Beyond3D NVIDIA Fermi GPU and Architecture Analysis, 2010]

93 Fermi Rasterization
Round-robin vertex processing; 4 rasterizers (1 per GPC), with the screen partitioned among them. [Purcell 2010]

94 NVIDIA Fermi SM
2 concurrent warps, 32 ALUs, 16 load/store units, 4 SFUs.
[NVIDIA, NVIDIA's Next Generation CUDA Compute Architecture: Fermi, 2009]

95 AMD GCN
Vector pipes: 4 SIMDs per CU, issued round-robin, running 64-wide waves.
Scalar processor: integer ops and branching, with a separate register set.
Different instruction types can co-issue, from different waves.

