Presentation on theme: "Joshua Barczak CMSC435 UMBC"— Presentation transcript:

1 Graphics Hardware
Joshua Barczak CMSC435 UMBC

2 Object-Order Rendering
Object-Space Primitives -> (Modeling XForm) -> World-Space Primitives -> (Viewing XForm) -> Camera-Space Primitives -> (Projection XForm) -> Clip-Space Primitives -> (Clip) -> Clip-Space Primitives -> (Viewport/Window XForm) -> Raster-Space Primitives -> (Rasterize) -> Pixels -> Displayed Pixels

3 The Code
DrawTriangles( Vertex* vb, int* ib, int n_primitives )
{
    for( each primitive i ) {
        // fetch the indices
        int indices[3] = { ib[3*i], ib[3*i+1], ib[3*i+2] };
        // fetch the vertices
        Vertex v[3] = { vb[indices[0]], vb[indices[1]], vb[indices[2]] };
        // transform the vertices
        TransformedVertex tv[3] = { Process(v[0]), Process(v[1]), Process(v[2]) };
        // clip/cull (may create more triangles)
        for( each clipped triangle ) {
            // backface cull
            if( !Backfacing() ) {
                // rasterization setup
                for( each rasterized pixel ) {
                    // interpolate vertex attributes
                    // z test
                    // calculate color
                    // blend result into frame buffer
                }
            }
        }
    }
}

4 The Machine
Host: a CPU and Memory running the Application and Device Driver, connected over the PCIe bus to the GPU and its own Memory. The GPU is the DrawTriangles function, in ASIC form.

5 The Cheap Machine
CPU and GPU on a shared die, with shared Memory; the Application and Device Driver run on the CPU.

6 Our View of The Machine
API calls to:
- Manage memory
- Configure the pipeline: shader code, resource bindings, fixed-function states (Z/stencil, alpha)
- Draw sets of triangles
Pipeline configuration: VS code + resources, rasterizer state, PS code + resources, output merger state.

7 The Rules
Rule #1: Don't be silly
If you have 1000 triangles, do not make 1000 API calls. glBegin/glEnd are evil.

8 The Rules
Rule #1: Don't be silly
Compute at the correct rate: per-vertex work is cheaper than per-pixel work, in general.
Simplify uniform expressions:
x = CONST * CONST -> x = const
x < SQRT(CONST) -> x*x < const
x = y/CONST -> x = y*(1/const) = y*const
x = pow(CONST, y) = exp(log(CONST)*y) -> x = exp(const*y)
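The rewrites above can be sketched and sanity-checked directly. This is an illustrative snippet, not shader code: `kA` is a hypothetical uniform constant, and each `fast_` variant is the algebraic rewrite of the corresponding `slow_` variant.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical uniform constant; in a shader this would be known at
// compile time, so the compiler can fold the constant subexpressions.
const float kA = 3.0f;

float slow_div(float y) { return y / kA; }           // divide every invocation
float fast_div(float y) { return y * (1.0f / kA); }  // reciprocal computed once

bool slow_cmp(float x) { return x < std::sqrt(kA); }      // sqrt every invocation
bool fast_cmp(float x) { return x < 0.0f || x * x < kA; } // squared compare

float slow_pow(float y) { return std::pow(kA, y); }           // pow every invocation
float fast_pow(float y) { return std::exp(std::log(kA) * y); } // log(kA) is constant
```

Each pair is mathematically equivalent; the point is that the cheap form moves the expensive operation out of the per-pixel path.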

9 The unstoppable march of time
Our thread issues draw calls for our scene; a driver thread (on another core) builds command buffers; a hardware queue feeds GPU operation. Frames N, N+1, and N+2 are in flight at once.

10 The Worst Thing Imaginable
CPU/GPU dependencies.
GPU: Draw() Draw() Draw() Draw() ReadPixels() … Draw() Draw() Draw() ReadPixels() … (does nothing while it waits for more draws)
CPU: process draw calls, do nothing while we wait for the pixels, do whatever it is we're doing with those pixels, issue more draw calls, wait again.

11 The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Don't read the frame buffer. If you use occlusion queries, wait a few frames before reading them.

12 OpenGL Drawing
Convenient, but wrong: the driver must copy the data on every draw.

void DrawIndexedMesh( float* pPositions, float* pNormals,
                      GLuint* pIndices, GLuint nTriangles )
{
    glEnableClientState( GL_VERTEX_ARRAY );
    glEnableClientState( GL_NORMAL_ARRAY );
    glVertexPointer( 3, GL_FLOAT, 3*sizeof(GLfloat), pPositions );
    glNormalPointer( GL_FLOAT, 3*sizeof(GLfloat), pNormals );
    glDrawElements( GL_TRIANGLES, 3*nTriangles, GL_UNSIGNED_INT, pIndices );
}

13 OpenGL Drawing the Right Way
At startup, create buffer objects and upload the data once:

void CreateBufferObjects( Vertex* pVB, int nVertices, int* pIB, int nTris )
{
    glGenBuffers( 1, &vbo );
    glGenBuffers( 1, &ibo );
    glBindBuffer( GL_ARRAY_BUFFER, vbo );
    glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, ibo );
    glBufferData( GL_ARRAY_BUFFER, nVertices*sizeof(Vertex), pVB, GL_STATIC_DRAW );
    glBufferData( GL_ELEMENT_ARRAY_BUFFER, 3*nTris*sizeof(int), pIB, GL_STATIC_DRAW );
}

Then draw from the buffer objects. The last arguments are no longer pointers; they are byte offsets into the bound buffers:

void DrawIndexedMesh( GLuint vbo, GLuint ibo, GLuint nTriangles, GLuint index_offset )
{
    glBindBuffer( GL_ARRAY_BUFFER, vbo );
    glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, ibo );
    glEnableClientState( GL_VERTEX_ARRAY );
    glEnableClientState( GL_NORMAL_ARRAY );
    glVertexPointer( 3, GL_FLOAT, 3*sizeof(GLfloat), 0 );
    glNormalPointer( GL_FLOAT, 3*sizeof(GLfloat), (GLvoid*)(3*sizeof(GLfloat)) );
    glDrawElements( GL_TRIANGLES, 3*nTriangles, GL_UNSIGNED_INT,
                    (GLvoid*)(uintptr_t)index_offset );
}

14 The Rules Rule #1: Don’t be silly Rule #2: One way traffic
Rule #3: Do not move data

15 Dynamic Buffers
But… I NEED to move data. Do you really?
Sometimes you do: particles, CPU animation.
You need to double-buffer to avoid stalls.

16 Dynamic Buffers
The CPU writes Buffer0 while the GPU reads Buffer1, then they swap; across frames 0-3 the two buffers alternate.
If you give them the right flags, drivers will manage this for you (read the docs).
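The swap pattern is just frame parity. A minimal sketch (illustrative only; with the right buffer flags the driver does this bookkeeping internally):

```cpp
#include <cassert>

// Double-buffering sketch: two buffers alternate by frame number, so the
// buffer the CPU fills for frame N is never the one the GPU is still
// reading from frame N-1.
int buffer_for_frame(int frame) { return frame % 2; }
```

The invariant worth checking is that consecutive frames always land in different buffers, so CPU writes never stall on in-flight GPU reads.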

17 The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: When breaking rule #3, do so correctly
Use dynamic, write-only buffers with discard.

18 Pipelining
Instructions take several cycles: Fetch, Decode, Execute, Write Regs.
Nonpipelined: N instructions in 4N clocks (4 CPI).
Pipelined: 9 instructions in 12 clocks (about 1.33 CPI; 1 CPI in the limit).
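The cycle counts above follow from simple arithmetic for a 4-stage pipeline; a quick sketch:

```cpp
#include <cassert>

// 4-stage pipeline (Fetch, Decode, Execute, Write Regs).
// Nonpipelined: each instruction occupies all 4 stages before the next starts.
int cycles_nonpipelined(int n) { return 4 * n; }

// Pipelined: the first instruction takes 4 cycles to drain through the
// stages; after that, one instruction completes every cycle.
int cycles_pipelined(int n) { return n > 0 ? 4 + (n - 1) : 0; }
```

This reproduces the slide's figures: 9 instructions take 36 clocks unpipelined but only 12 clocks pipelined, approaching 1 CPI as N grows.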

19 Graphics Pipeline Hardware Stages
Each piece of the DrawTriangles loop from slide 3 maps onto a hardware stage:
Index Fetch -> Vertex Fetch -> Vertex Shade -> Clip/Cull -> Raster -> Z -> Shade -> Blend

20 Graphics Pipelining
Index Fetch, Vertex Fetch, Vertex Shade, Clip/Cull, Raster, Z, Shade, and Blend all overlap across clock cycles, like CPU pipeline stages.

21 The Physical Machine
Control registers hold stuff like:
- VB/IB address
- Primitive type
- Cull mode
- Viewport size
- Z buffer pointer/format
- Color buffer pointer/format
- Texture pointers/formats
- Blend mode
- Z/stencil modes
These feed the fixed hardware path: Vertex Functional Units -> Triangle Rasterizer -> Pixel Blend.

22 State Change
A state change drains the pipeline: Index Fetch must wait for the Vertex Shade stage, and the bubble ripples through Vertex Fetch, Clip, Cull, Raster, Z, Shade, and Blend. In the timing diagram, white space indicates wasted electricity (clock cycles, tick tock).

23 State Change (Software)
Your code: SetState(), draw triangles, SetState(), draw triangles, SetState(), draw triangles…
Driver workload, every time:
- Figure out what registers to change
- Turn texture/VBO handles into addresses/formats
- Convert GL state to register bits
- Change shader code addresses
- Put register writes into the command buffer
- Put draw commands into the command buffer
- Repeat…
This part is much more severe than the hardware bubbles.

24 The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Sort objects by state. Pack meshes together.

25 Computation & Bandwidth
Based on: 100 Mtri/sec, 256 B vertex data, 128 B interpolated, 68 B fragment output, 5x depth complexity, 16 4-byte textures, 223 ops/vert, 1664 ops/frag, no caching, no compression.
That works out to stage bandwidths of 75, 13, 335, and 45 (texture) GB/s, with 67 GFLOPS of vertex work and 1.1 TFLOPS of fragment work.
It is physically impossible to run a serial datapath at these rates. (Slide: Olano)

26 Data Parallel
Distribute -> parallel Tasks -> Merge. (Slide: Olano)

27 Parallel Graphics Vertex Geometry Pixel

28-33 Barycentric Rasterization
SIMD Parallelism (NxN Stamp): the same coverage test is evaluated for an NxN stamp of pixels at once; slides 28-33 step the stamp across the triangle.

34 Ordering
Pixels are independent, but within a pixel, strict primitive order is required, or transparency doesn't work.

35 Parallel Raster Architectures
The "sort" is where pixel ordering gets resolved; architectures are classified by where that sort happens: first, middle, or last.
Good read: "A Sorting Classification of Parallel Rendering," Molnar et al.

36 Sort First
Distribute objects by screen tile: each parallel pipeline (Vertex -> Triangle -> Fragment) handles some objects and owns some pixels of the screen. (Slide: Olano)

37 Sort Middle
Distribute objects or vertices: parallel Vertex stages each take some objects; results are merged and redistributed by screen location to parallel Triangle and Fragment stages, each owning some pixels. (Slide: Olano)

38 Screen Subdivision
Tiled, Scan-Line Interleaved, or Column Interleaved. (Slide: Olano)

39 Sort Last
Distribute by object: each parallel pipeline (Vertex -> Triangle -> Fragment) takes some objects and rasterizes against the full screen; a fragment-merge stage composites the partitioned results into the final screen. (Slide: Olano)

40 Architecture
An execution core (not to scale): Control Logic, ALU, Regs.

41 Architecture
Multiple parallel cores: multiple-instruction, multiple-data (MIMD). 16 instruction streams, 16 data streams.

42 Architecture
SIMD machine: Single Instruction, Multiple Data, with shared control logic. 1 instruction stream, 60 data streams.
Pro: more throughput. Con: requires coherent execution.

43 SIMD Branching
if( x )      // mask threads
{
    // issue instructions
}
else         // invert mask
{
    // issue instructions
}
// unmask
Threads agree, take if: useful. Threads agree, take else: useful. Threads disagree, take if AND else: useless work for the masked lanes.
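The masking behavior can be modeled in scalar code. A minimal sketch, with a hypothetical 4-lane `simd_select` that mimics what the hardware does: every lane computes, and the per-lane condition mask picks which result is kept.

```cpp
#include <array>
#include <cassert>

constexpr int kLanes = 4;  // hypothetical SIMD width, for illustration

// All lanes run in lockstep; cond[] is the branch mask. When lanes
// disagree, both ifVal and elseVal have already been computed, and the
// mask just selects which one each lane commits.
std::array<int, kLanes> simd_select(const std::array<bool, kLanes>& cond,
                                    const std::array<int, kLanes>& ifVal,
                                    const std::array<int, kLanes>& elseVal) {
    std::array<int, kLanes> out{};
    for (int lane = 0; lane < kLanes; ++lane)
        out[lane] = cond[lane] ? ifVal[lane] : elseVal[lane];
    return out;
}
```

When every lane agrees, real hardware can skip issuing the untaken side; the model above shows the worst case, where both sides cost instructions.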

44 SIMD Looping
while( x )   // update mask
{
    // do stuff
}
They all run 'till the last one's done….
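"Run till the last one's done" means loop cost is the maximum trip count across the wavefront, not the average. A small sketch of that cost model (lane count and trip counts are illustrative):

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Cost model for a SIMD loop: the wavefront keeps issuing iterations
// until its slowest lane finishes, so the whole group pays for the
// maximum trip count.
int simd_loop_iterations(const std::array<int, 4>& tripCounts) {
    return *std::max_element(tripCounts.begin(), tripCounts.end());
}
```

One lane looping 10 times makes all four lanes pay for 10 iterations, which is why divergent loop bounds hurt SIMD throughput.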

45 The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Rule #6: Think about coherence
Keep control flow simple. Flatten branches. Avoid 'else' branches.

46 DX9-Era GPU
Index stream -> dedicated VS units (Vertex Cache, Post-TnL Cache) -> clip-space primitives -> Primitive Assembly -> Rasterizer -> Early-Z -> pixel blocks -> dedicated PS units (Texture Cache) -> Blend -> pixels. All caches sit over Memory.

47 Memory Bandwidth
Lots of things need data all at once: the index stream, vertex demand, texture demand, and alpha/Z operations, each through its own cache (vertex $, tex $, color $, depth $) over Memory.

48 Texture Tiling
Images are tiled in memory: 2D blocks of texels are stored contiguously.

49 Texture Tiling
The texture cache exists for reuse across pixel blocks. It provides bandwidth savings, not latency reduction like in a CPU.

50 Block Compression
DXT1 (BC1): a 4x4 pixel block packed into 8 bytes, 8:1 over standard 32-bit color. Two endpoint colors plus a 2-bit-per-pixel color index give four possible colors: the 2 endpoints and 2 interior points.
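The arithmetic behind the 8:1 ratio and the 4-entry palette can be sketched directly. This is a single-channel illustration of the palette rule (the real format works on 5:6:5 RGB endpoints), with interior points at 1/3 and 2/3 between the endpoints:

```cpp
#include <cassert>

// One channel of a DXT1-style palette: entries 0 and 1 are the endpoints,
// entries 2 and 3 are interpolated interior points.
struct Palette { float c[4]; };

Palette dxt1_palette(float c0, float c1) {
    return { { c0, c1, (2.0f * c0 + c1) / 3.0f, (c0 + 2.0f * c1) / 3.0f } };
}

// 4x4 pixels of 4-byte RGBA (64 bytes) packed into an 8-byte block.
int dxt1_ratio() { return (4 * 4 * 4) / 8; }
```

Each pixel then stores only a 2-bit index into this palette, which is where the 8-byte block budget comes from.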

51 Block Compression
DXT5 (BC3): DXT1 plus alpha, 4:1. Endpoint alphas (2 bytes) plus a 3-bit-per-pixel alpha index give 8 possible alphas.
BC4 (ATI1N): just the alpha block; 2:1 for greyscale.
BC5 (ATI2N): 2 alpha blocks slapped together; 2:1 for 2 channels, 4:1 for a tangent-space normal map.

52 Block Compression
BC6/BC7: 16-byte blocks, 7 different formats.

53 Z-Ordered Rasterization
Take 2D integer coordinates and interleave their bits to get a 1D index. Consecutive 1D indices are spatially coherent, so deinterleaving a counter walks through space coherently. (Figure: Wikipedia)
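The bit interleaving is straightforward to write out. A minimal sketch for 16-bit coordinates (a production rasterizer would use bit tricks rather than a loop):

```cpp
#include <cassert>
#include <cstdint>

// Morton (Z-order) index: interleave the bits of (x, y) so that
// consecutive 1D indices stay spatially close in 2D.
uint32_t interleave_bits(uint32_t x, uint32_t y) {
    uint32_t z = 0;
    for (int i = 0; i < 16; ++i) {
        z |= ((x >> i) & 1u) << (2 * i);      // x bits in even positions
        z |= ((y >> i) & 1u) << (2 * i + 1);  // y bits in odd positions
    }
    return z;
}
```

Indices 0..3 cover the 2x2 block at the origin, indices 0..15 the 4x4 block, and so on, which is exactly the coherence the rasterizer wants.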

54 The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Rule #6: Think about coherence
Rule #7: Save bandwidth
Use small vertex formats (16-bit float, 8-bit fixed) and small texture formats, and compress. Don't use 4 8-bit channels for a greyscale image! See Rules 1 and 6.

55 DX9-Era GPU
The DX9-era diagram again: dedicated VS and PS units with their own caches, feeding Primitive Assembly, Rasterizer, Early-Z, and Blend over shared Memory. Note the fixed split between vertex and pixel hardware.

56 Unified Shaders
Nvidia "Technical Brief" (read: marketing).

57 Unified Shaders
Nvidia "Technical Brief" (read: marketing), continued.

58 DX10-Era GPU
Unified shader (US) units run both vertex and GS threads, each unit with its own L1$. Index data and the Post-TL$ feed vertex work; clip-space primitives go through Primitive Assembly, Rasterizer, and Early-Z to pixels, then Blend. A shared L2$ and caches sit over Memory.

59 The Geometry Shader
One primitive in (point/line/triangle), up to N primitives out. Unpredictable data amplification, and output order MUST be preserved.

60 The Geometry Shader
One geometry shader spits out primitives one by one to the rasterizer. Parallel geometry shaders require lots of buffering, because results must be consumed in order….

61 The Geometry Shader
Nvidia: buffers in on-chip memory; parallelism limited by buffer space; faster for small amplification.
AMD: buffers in DRAM; lots of latency; faster for large amplification.

62 The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Rule #6: Think about coherence
Rule #7: Save bandwidth
Rule #8: Geometry shaders are slooooow

63 Memory Latency
Do a little math… miss the cache… wait a few hundred cycles for memory… keep going.

64 Memory Latency
CPU strategy: bend over backwards to avoid stalls. Sequencer, ALU, registers (and more registers), a gigantic cache, a branch predictor, out-of-order execution, and memory prefetch.

65 Memory Latency
GPU strategy: run lots of threads and "hide" the stalls with useful work. ALU, registers (THOUSANDS of them), a tiny cache, a sequencer, and a scheduler.

66 Latency
Do a little math… miss the cache… hardware swaps in other threads, so the memory access is overlapped by useful work… keep going.

67 Terminology
Thread: one instance of a shader program; one pixel/vertex.
Warp/wavefront: a SIMD-sized collection of threads running in lockstep; what hardware people call a thread. Many warps are kept in flight for latency hiding.

68 Occupancy
Register file: "registers" are SIMD-sized (one slot per SIMD lane) and are divided evenly among the resident warps.

69 Occupancy
4 registers per thread: 8 warps.

70 Occupancy
16 registers per thread: 2 warps.
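The occupancy numbers on these slides follow from one division. A sketch, assuming a hypothetical register file of 32 registers per SIMD lane, which is consistent with both the 4-register and 16-register examples above:

```cpp
#include <cassert>

// Occupancy sketch: the register file is split evenly among warps, so
// registers-per-thread caps how many warps can be resident at once.
// regsPerLane = 32 is an assumed file size matching the slides' examples.
int warps_in_flight(int regsPerLane, int regsPerThread) {
    return regsPerLane / regsPerThread;
}
```

Fewer registers per thread means more resident warps, and more warps means more opportunities to hide memory latency with useful work.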

71 The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Rule #6: Think about coherence
Rule #7: Save bandwidth
Rule #8: Geometry shaders are slooooow
Rule #9: Keep the machine full

72 Modern GPU
Unified shader units run vertex threads and GS/HS/DS threads; a tessellation stage and Primitive Assembly feed multiple parallel Setup/Rast/Z units; Blend writes pixels. Index data, the Post-TL$, per-unit L1$s, and a shared L2$ sit over Memory.

73 Tessellation
(Image: Unigine.com)

74 DX11 Tessellation Pipeline
Patches (control points) -> Hull Shader (selects tess factors / detail levels) -> Tessellation Hardware (emits U,V coordinates) -> Domain Shader (evaluation, emits vertices) -> Geometry Shader. [Moreton 2001]

75 Tessellation
Tessellation pitfalls: backface cull happens post-tessellation, so there is LOTS of wasted DS work, and small triangles create a 2x2 quad utilization problem.

76 Derivatives for MipMapping
Derivatives come from 2x2 quads plus differencing. Missing pixels are extrapolated… each 2x2 quad is self-contained.

77 Big Triangle

78 Rasterized Quads

79 Wasted Pixels
27 of 76 (35%) wasted. This drops off very fast for big triangles, and at this scale, this triangle is "small".

80 In the limit…
In this scenario, we shade 4 times as many pixels as we need. This is essentially what happens when we over-tessellate.
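The quad overhead is simple counting. A sketch, where the 19-quad / 49-covered-pixel figures are derived from slide 79's "27 of 76 wasted" numbers:

```cpp
#include <cassert>

// Pixel shaders always run on full 2x2 quads, so a triangle covering
// `covered` pixels still shades every pixel of each quad it touches.
int shaded_pixels(int quadsTouched) { return 4 * quadsTouched; }

int wasted_pixels(int covered, int quadsTouched) {
    return 4 * quadsTouched - covered;
}
```

In the limit of one-pixel triangles, every covered pixel drags along 3 helper pixels, giving the 4x shading overhead the slide describes.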

81 The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Rule #6: Think about coherence
Rule #7: Save bandwidth
Rule #8: Geometry shaders are slooooow
Rule #9: Keep the machine full
Rule #10: Small triangles are slow

82 The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Rule #6: Think about coherence
Rule #7: Save bandwidth
Rule #8: Geometry shaders are slooooow
Rule #9: Keep the machine full
Rule #10: Small triangles are slow
Rule #11: The rules are subject to change at any time and without notice…

83 NVIDIA GeForce 6 [Kilgariff and Fernando, GPU Gems 2]

84 Vertex Processing
4-wide FP vector plus special functions; MIMD. Vertex texture fetch is unfiltered and very slow.

85 Fragment Processing
Pixel pipe: two 4-wide vector pipes with dual issue and vector co-issue (3x1 or 2x2); FP16 arithmetic. Poor flow-control granularity: all in-flight threads take the same path.

86 AMD/ATI R600 [Tom’s Hardware]

87 SIMD Units
VLIW: 5 ALUs, 1 of them with transcendentals. 16-wide SIMD, issued in groups of 4 as a 64-thread "wavefront"; 2 waves issue over 8 clocks. 4 texture engines; texture ops run at 1/4 the ALU rate.

88 Dispatch

89 Demo

90 NVIDIA G80 [NVIDIA 8800 Architectural Overview, NVIDIA TB _v01, November 2006]

91 Streaming Processors
Scalar architecture. A 32-wide "warp" is issued over 4 clocks; special functions take 16 clocks (2 SFUs). Instruction issue interleaves warp instructions.

92 NVIDIA Fermi [Beyond3D NVIDIA Fermi GPU and Architecture Analysis, 2010]

93 Fermi Rasterization
Round-robin vertex processing; 4 rasterizers (1 per GPC), with the screen partitioned among them. [Purcell 2010]

94 NVIDIA Fermi SM
2 concurrent warps, 32 ALUs, 16 load/store units, 4 SFUs.
[NVIDIA, NVIDIA's Next Generation CUDA Compute Architecture: Fermi, 2009]

95 AMD GCN
Vector pipes: 4 SIMDs per CU, issued round-robin, running 64-wide waves.
Scalar processor: integer ops and branching, with a separate register set.
Different instruction types can co-issue, from different waves.

