Status – Week 276 Victor Moya

Hardware Pipeline
- Command Processor.
- Vertex Shader.
- Rasterization.
- Pixel Shader.
- Fragment Operations and Tests.

Command Processor
- Receives commands from the CPU (driver, OpenGL/Direct3D).
- Fetches data from memory: vertex data (DMA).
- Updates and stores OpenGL/Direct3D render state.

Vertex Shader
- Transforms and lights vertex streams.
- Vertex shader program (from GPU memory?).
- Vertex shader constants (from GPU memory?).
- Inputs: vertex data 16x4D.
- Outputs: vertex data 14x4D.

Rasterization
- Includes:
  - Clipping
  - Divide by w
  - Affine transform
  - Primitive assembly
  - Culling
  - Setup
  - Fragment generation.
- Receives vertices and produces fragments.
- Uses OpenGL/Direct3D render state.
- Input: vertex (15x4D).
- Output: fragments (10x4D).
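The "divide by w" and "affine transform" steps listed above can be sketched as follows. This is a minimal illustration, assuming OpenGL-style conventions (normalized device coordinates in [-1, 1], depth mapped to [0, 1]); the function name and the viewport tuple layout are hypothetical, not taken from the slides.

```python
def clip_to_window(clip, viewport):
    """Perspective divide followed by the affine viewport transform."""
    x, y, z, w = clip
    # Divide by w: clip space -> normalized device coordinates (assumed [-1, 1]).
    ndc = (x / w, y / w, z / w)
    vx, vy, width, height = viewport
    # Affine transform: NDC -> window coordinates, depth mapped to [0, 1].
    wx = vx + (ndc[0] + 1.0) * 0.5 * width
    wy = vy + (ndc[1] + 1.0) * 0.5 * height
    wz = (ndc[2] + 1.0) * 0.5
    return (wx, wy, wz)


# A vertex on the view axis lands in the middle of a 640x480 viewport.
center = clip_to_window((0.0, 0.0, 0.0, 2.0), (0, 0, 640, 480))
corner = clip_to_window((2.0, 2.0, 2.0, 2.0), (0, 0, 640, 480))
```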

Pixel Shader
- Shades fragments: calculates the texture address, reads the texture, performs color operations.
- Pixel shader program and constants (from GPU memory?).
- Texture read: TMU (texture sample, filter unit, texture cache, GPU memory).
- Optional:
  - Modify depth coordinate (1 Z output).
  - Render to texture (up to 4 color outputs).
- Input: fragment (12x4D).
- Output: color (2x4D).

Fragment Operations and Tests
- Includes (OpenGL):
  - Fog.
  - Color Sum.
  - Ownership Test.
  - Scissor Test.
  - Alpha Test.
  - Stencil Test.
  - Depth Test.
  - Blend.
  - Logic Operation.
- Accesses framebuffer (GPU memory). Updates framebuffer.
- Framebuffer: color, Z and stencil.
- OpenGL/Direct3D render state defines operations.
- Input: color.
- Output: FB updated.
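A reduced sketch of this stage may help show how the tests gate the framebuffer update. It covers only the scissor, alpha and depth tests plus an alpha blend, in the order the list gives them; fog, color sum, ownership, stencil and logic operations are left out. The dictionary layout, the GREATER alpha function, the LESS depth function and the blend equation are all assumptions chosen for illustration, where real hardware would take them from the render state.

```python
def process_fragment(frag, fb, state):
    """Apply scissor, alpha and depth tests, then blend into the framebuffer."""
    x, y = frag["x"], frag["y"]
    sx, sy, sw, sh = state["scissor"]
    # Scissor test: discard fragments outside the scissor rectangle.
    if not (sx <= x < sx + sw and sy <= y < sy + sh):
        return False
    # Alpha test (function GREATER assumed).
    if not frag["color"][3] > state["alpha_ref"]:
        return False
    # Depth test (function LESS assumed), followed by the Z update.
    if not frag["z"] < fb["depth"][(x, y)]:
        return False
    fb["depth"][(x, y)] = frag["z"]
    # Blend (assumed equation): src * alpha + dst * (1 - alpha).
    src, dst = frag["color"], fb["color"][(x, y)]
    a = src[3]
    fb["color"][(x, y)] = tuple(a * s + (1.0 - a) * d for s, d in zip(src, dst))
    return True


fb = {"depth": {(0, 0): 1.0}, "color": {(0, 0): (0.0, 0.0, 0.0, 1.0)}}
state = {"scissor": (0, 0, 4, 4), "alpha_ref": 0.5}
near = {"x": 0, "y": 0, "z": 0.25, "color": (1.0, 0.5, 0.0, 1.0)}
far = {"x": 0, "y": 0, "z": 0.75, "color": (0.0, 1.0, 0.0, 1.0)}
drawn_near = process_fragment(near, fb, state)
drawn_far = process_fragment(far, fb, state)  # fails the depth test
```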

Vertex Shader
- The command processor sends a vertex stream to the vertex shaders.
- A vertex buffer stores data read from DMA.
- A vertex cache (~10 vertices) can be used to avoid executing the vertex shader twice for the same vertex.
- The vertex stream is grouped into primitives and sent to the rasterizer.
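The vertex cache idea can be sketched as a small table keyed by vertex index that holds already-shaded results. The slides do not say how entries are replaced, so FIFO replacement is an assumption here, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class VertexCache:
    """Small cache of shaded vertices keyed by vertex index (FIFO assumed)."""

    def __init__(self, shader, capacity=10):
        self.shader = shader
        self.capacity = capacity
        self.entries = OrderedDict()
        self.misses = 0

    def fetch(self, index, vertex):
        if index in self.entries:
            return self.entries[index]  # hit: the shader is not run again
        self.misses += 1
        result = self.shader(vertex)
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry
        self.entries[index] = result
        return result


# Two triangles sharing an edge: 6 indices, but only 4 distinct vertices.
vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
cache = VertexCache(shader=lambda v: (v[0] * 2.0, v[1] * 2.0))
shaded = [cache.fetch(i, vertices[i]) for i in (0, 1, 2, 2, 1, 3)]
```

In this run the shader executes four times for six indexed vertices, which is the saving the slide refers to.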

Hardware Pipeline

Vertex Shader Architecture
- SIMD architecture. Registers are 128 bits wide: four 32-bit fields.
- Instruction set: typical arithmetic instructions (vector mul, add), some special instructions (ARL, DST), some complex mathematical instructions (EXP, COS), and support for branching, loops and procedures.
- 3 different sources of data:
  - Input stream (~16 registers).
  - Constants (~256 registers).
  - Temporaries (~16 registers).
- 2 different destinations:
  - Output stream (~15 registers).
  - Temporaries (~16 registers).
- Conditional registers (NV30) and boolean constants (R300, DX9) for conditional ‘execution’.

Vertex Shader Inputs and Outputs

Vertex Shader Architecture

Vertex Shader: NV20
- Exposes programmability of a small part of the geometry pipeline.
- Vertex load & store, format conversion, primitive assembly, clipping and triangle setup occur completely in parallel, in pipeline fashion.
- 4-wide fine-grained SIMD FP provides the necessary performance; multiple execution threads maintain efficiency and provide a very simple programming model.

NV20: Introduction
- Independent vertices.
- IEEE single precision FP.
- 4-component vectors (x, y, z, w).
- Input registers can have their components arbitrarily rearranged/replicated (swizzled).
- Any operation generating a scalar must generate that scalar replicated across all components, and output writes have a component write mask.
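Swizzling, scalar replication and write masks are easier to see in a few lines of code. The sketch below models registers as 4-tuples; the helper names are hypothetical and only mirror the behaviour described above.

```python
COMPONENT = {"x": 0, "y": 1, "z": 2, "w": 3}


def swizzle(reg, pattern):
    """Rearrange or replicate components, e.g. 'wzyx' or 'xxxx'."""
    return tuple(reg[COMPONENT[c]] for c in pattern)


def masked_write(dest, value, mask):
    """Write only the components named in the mask; keep the others."""
    return tuple(v if c in mask else d for c, d, v in zip("xyzw", dest, value))


def dp3(a, b):
    """3-component dot product; the scalar is replicated to all components."""
    s = a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
    return (s, s, s, s)


reversed_reg = swizzle((1, 2, 3, 4), "wzyx")
broadcast = swizzle((1, 2, 3, 4), "xxxx")
# Equivalent in spirit to "DP3 R0.xz, a, b": only x and z receive the result.
r0 = masked_write((0, 0, 0, 1), dp3((1, 2, 3, 0), (4, 5, 6, 0)), "xz")
```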

NV20: Program Model

NV20: Input Attributes
- Input attributes:
  - 16 quad-float vertex source attribute registers.
  - Position, normal, two colors, up to 8 texture coordinate sets, skin weights, fog and point size.
  - Default 0.0 for second and third components, 1.0 for the fourth.
  - Attributes are persistent.
  - Only one vertex attribute may be read per program instruction.
- Constant memory:
  - 96 quad floats.
  - Can only be loaded before vertices are processed.
  - Only one constant may be read per program instruction.
  - The program may not write to constants.

NV20: Input Attributes
- Integer address register:
  - Loaded using ARL.
  - Indexed constant reads, with out-of-range reads returning (0,0,0,0).
- Read/write register file:
  - 12 quad floats.
  - Three reads and one write per instruction.
  - Initialized to (0,0,0,0) per vertex.
- Any vector read may be sourced as multiple operands and individually swizzled/negated each time.
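Indexed constant reads through the address register can be sketched as below. The out-of-range behaviour follows the slide; treating ARL as a floor operation is an assumption about the instruction, and the function names and the `offset` parameter are hypothetical.

```python
import math

NUM_CONSTANTS = 96  # size of the NV20 constant memory, per the slide
ZERO = (0.0, 0.0, 0.0, 0.0)


def arl(value):
    """Load the address register from a float (floor assumed)."""
    return math.floor(value)


def read_constant(constants, a0, offset=0):
    """Read c[A0 + offset]; out-of-range reads return (0,0,0,0)."""
    index = a0 + offset
    if 0 <= index < len(constants):
        return constants[index]
    return ZERO


constants = [(float(i),) * 4 for i in range(NUM_CONSTANTS)]
a0 = arl(2.7)
in_range = read_constant(constants, a0, offset=1)
out_of_range = read_constant(constants, arl(200.0))
```

This pattern is what makes matrix-palette skinning possible: a per-vertex index selects which block of constants to read.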

NV20: Output attributes
- Standard mapping for the fixed function pipeline at the homogeneous clip space point.
- Position for clipping.
- Vertex color output clamped to the range 0.0 to 1.0.
- Fog distance, point size.
- 8 texture coordinates.
- All instruction writes have an optional 4-component write mask.
- Initialized to (0.0, 0.0, 0.0, 1.0).

NV20: Instruction Set
- No branching.
- Constant latency: issue any instruction per clock and execute all instructions with the same latency. All operands are immediately available, limiting the size of registers and memory banks.

NV20: Hardware Implementation
- Two blocks: vertex attribute buffer (VAB) and the floating point core.

NV20: VAB
- The VAB is responsible for vertex attribute persistence.
- 16 input attributes.
- When a write to an address is received, the entry is set to the defaults (0.0, 0.0, 0.0, 1.0) and the valid data overwrites the corresponding components.
- The VAB drains into a number of input buffers (IB) that are used to feed the FP core in a round robin fashion.
- Dirty bits are maintained in the VAB so only changed attributes are updated when the same buffer is again the drain target.
- The transfer of a vertex is triggered by a write to address 0 (vertex position).
- To prevent bubbles during simultaneous loading and draining of the VAB, incoming writes may push out the contents of the target address, superseding a default drain sequence.
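The VAB behaviour described above (defaults, persistence, dirty bits, launch on a write to address 0) can be modelled roughly as follows. This is a simplified sketch with a single input buffer as the drain target, so one set of dirty bits is enough; the real design drains into several input buffers in round robin. Class and method names are hypothetical.

```python
DEFAULT = (0.0, 0.0, 0.0, 1.0)


class VAB:
    """Toy vertex attribute buffer with one drain target (a simplification)."""

    def __init__(self, num_attrs=16):
        self.attrs = [DEFAULT] * num_attrs
        self.dirty = set()

    def write(self, addr, *components):
        # Start from the defaults, then overwrite with the valid components.
        self.attrs[addr] = tuple(components) + DEFAULT[len(components):]
        self.dirty.add(addr)
        if addr == 0:  # writing the position triggers the vertex transfer
            return self.drain()
        return None

    def drain(self):
        # Only attributes changed since the last drain are transferred.
        changed = {a: self.attrs[a] for a in sorted(self.dirty)}
        self.dirty.clear()
        return changed


vab = VAB()
pending = vab.write(3, 0.5, 0.25)        # a 2-component attribute, no launch
first = vab.write(0, 1.0, 2.0, 3.0)      # launches the first vertex
second = vab.write(0, 4.0, 5.0, 6.0)     # attribute 3 persists, not re-sent
```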

NV20: VAB

NV20: Floating Point Core
- Processes the instruction set.
- Multithreaded vector processor operating on quad-float data.
- Vertex data read from input buffers and transformed into output buffers (OB).
- Same latency for vector and special function units.
- Multiple vertex threads are used to hide this latency.
- SIMD VU: MOV, MUL, ADD, MAD, DP3, DP4, DST, MIN, MAX, SLT, SGE.
- Special FU: RCP, RSQ, LOG, EXP, LIT.
- VU is approximately IEEE (no denormalized numbers or exceptions, rounding always toward negative infinity).
- 1 instruction per clock and all input/output options have no performance penalty.
- All input vectors are available with no latency.
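To make the vector unit's instruction list more concrete, here is a sketch of the arithmetic behind a few of the opcodes, written from their commonly documented semantics. It models values as Python floats and ignores the hardware's rounding mode and lack of denormals; the `transform` helper is a hypothetical example of the usual four-DP4 position transform.

```python
def mad(a, b, c):
    """MAD: component-wise a * b + c."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))


def dp4(a, b):
    """DP4: 4-component dot product, replicated to all components."""
    s = sum(x * y for x, y in zip(a, b))
    return (s, s, s, s)


def slt(a, b):
    """SLT: 1.0 where a < b, else 0.0."""
    return tuple(1.0 if x < y else 0.0 for x, y in zip(a, b))


def sge(a, b):
    """SGE: 1.0 where a >= b, else 0.0."""
    return tuple(1.0 if x >= y else 0.0 for x, y in zip(a, b))


def dst(a, b):
    """DST: distance vector (1, a.y * b.y, a.z, b.w)."""
    return (1.0, a[1] * b[1], a[2], b[3])


def transform(matrix_rows, v):
    """Position transform as four DP4s, one per matrix row."""
    return tuple(dp4(row, v)[0] for row in matrix_rows)


translate_x5 = [
    (1.0, 0.0, 0.0, 5.0),
    (0.0, 1.0, 0.0, 0.0),
    (0.0, 0.0, 1.0, 0.0),
    (0.0, 0.0, 0.0, 1.0),
]
moved = transform(translate_x5, (1.0, 2.0, 3.0, 1.0))
```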

NV20: Floating Point Core

Vertex Shader: R300
- 4 vertex shader units.
- 1 scalar unit, 1 vector unit.
- Registers:
  - ALU registers:
    - Constants: 256 read-only vectors.
    - Temporary: 12 read/write vectors.
    - Input: 16 read-only vectors.
    - Output: 15 write-only vectors.
  - Flow control registers:
    - Integer constant: 16 read-only vectors.
    - Address: 1 read/write vector.
    - Loop counter: 1 scalar.
    - Boolean constant: 16 read-only bits.

R300: Instructions
- Shaders up to 256 instructions long.
- Up to 64K executed instructions per vertex.
- ALU instructions: ADD, DP3, DP4, EXP, EXPP, EXPE, FRAC, LOG, LOGP, MAD, MADDX2, MAX, MIN, MOV, MUL, POW, RCP, RSQ, SGE, SLT.
- Control flow instructions: CALL, LOOP, ENDLOOP, JUMP, JNZ, LABEL, REPEAT, ENDREPEAT, RETURN.
- Address instructions: ARL, ARR.
- Graphics instructions: DST, LIT.
- Instructions based on DX9 VS2.0.

NV30: Overview
- Supports all VS1 instructions and features.
- Beyond VS2?
- Condition codes.
- Branches and subroutines.
- Modifiers: absolute.
- User clip support (new output registers CLP0-CLP5).
- New instructions.
- More registers.

NV30: Overview
- Up to 256 instructions per program.
- Up to 64K executed instructions per vertex.
- 16 temporary registers.
- 2 vector address registers.
- 256 program parameters (constants).

NV30: Condition Codes
- 4-component register:
  - LT: less than zero.
  - EQ: equal to zero.
  - GT: greater than zero.
  - UN: unordered, for comparisons involving NaN.
- Instructions optionally update condition code state:
  - “C” suffix: DP4C, MOVC.
  - “CC” pseudo register for updating condition codes.
- Condition code used in:
  - Branches and procedure call/return.
  - Result masking.
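The condition code register and its use for result masking can be sketched as below. Each component of a result is classified as LT, EQ, GT or UN, and a later write passes only where the chosen test matches. The test table (LE as "LT or EQ", and so on) is an assumption based on the usual meaning of these mnemonics, and the function names are hypothetical.

```python
import math


def condition_code(value):
    """Classify one result component against zero."""
    if math.isnan(value):
        return "UN"
    if value < 0:
        return "LT"
    if value > 0:
        return "GT"
    return "EQ"


def update_cc(result):
    """What a 'C'-suffixed instruction (e.g. MOVC) does to the CC register."""
    return tuple(condition_code(v) for v in result)


# Which stored codes satisfy each test (assumed mapping).
TESTS = {
    "LT": {"LT"},
    "EQ": {"EQ"},
    "GT": {"GT"},
    "LE": {"LT", "EQ"},
    "GE": {"GT", "EQ"},
}


def cc_masked_write(dest, value, cc, test):
    """Conditional masking: write only components whose code passes the test."""
    passing = TESTS[test]
    return tuple(v if c in passing else d for d, v, c in zip(dest, value, cc))


cc = update_cc((-1.0, 0.0, 2.0, float("nan")))
masked = cc_masked_write((0, 0, 0, 0), (9, 9, 9, 9), cc, "LE")
```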

NV30: Modifiers
- Source:
  - Swizzle
  - Negate
  - Absolute
- Target:
  - Masking
  - Conditional masking

NV30: Branching and subroutines
- BRA
  - Unconditional.
  - Conditional: BRA label (LE.xyww)
  - Computed (indirect): BRA [A1.z] (GT.x)
- Call & return for subroutines.
  - CAL & RET.
  - Same options as with branches.
  - Four levels of subroutine execution.
  - No parameter stack.

NV30: Clipping
- New output registers: o[CLP0]..o[CLP5].
- GL_CLIP_PLANEn enabled.
- Clip coordinate n interpolated across the primitive.
- Only the portion of the primitive where the clip coordinate is greater than zero is rasterized.
- Hardware performs fast trivial reject if all clip coordinates of a primitive are negative.
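As an illustration of user clipping, a vertex program would typically write the signed distance to a plane into a clip output, and the hardware can then reject primitives whose clip coordinates are all negative. The sketch below assumes the clip coordinate is computed as a 4-component dot product of a plane equation and the vertex position; that choice and the function names are illustrative, since the program is free to compute the value however it likes.

```python
def clip_coordinate(plane, position):
    """Value a program might write to o[CLPn].x: dot(plane, position)."""
    return sum(p * c for p, c in zip(plane, position))


def trivially_rejected(clip_coords):
    """Fast reject: every vertex of the primitive has a negative coordinate."""
    return all(d < 0.0 for d in clip_coords)


plane = (0.0, 1.0, 0.0, 0.0)  # keeps the half-space y > 0
below = [(0.0, -1.0, 0.0, 1.0), (1.0, -2.0, 0.0, 1.0), (2.0, -3.0, 0.0, 1.0)]
straddling = [(0.0, -1.0, 0.0, 1.0), (1.0, 1.0, 0.0, 1.0), (2.0, 2.0, 0.0, 1.0)]

rejected = trivially_rejected([clip_coordinate(plane, v) for v in below])
kept = not trivially_rejected([clip_coordinate(plane, v) for v in straddling])
```

The straddling triangle is not rejected; only its portion with a positive interpolated clip coordinate would be rasterized.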

NV30: New Instructions
- ARL: now supports loading the 4-component A0 and A1 integer registers.
- ARR: like ARL except rounds rather than truncates before storing integer result in an address register.
- BRA, CAL, RET: branching instructions.
- COS, SIN: high-precision trigonometric functions.
- FLR, FRC: floor and fraction of floating point values.
- EX2, LG2: high-precision exponentiation and logarithm functions.
- ARA: adds pairs of components of an address register, useful for looping and other operations.
- SEQ, SFL, SGT, SLE, SNE, STR: add six “set on” instructions similar to SLT and SGE.
- SSG: “set sign” operation generates a vector holding –1.0 for negative operand components, 0 for zero components, and +1.0 for positive components.
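A few of these instructions are simple enough to sketch directly. The code below is an approximation under stated assumptions: ARL is modelled as a floor, ARR uses Python's `round` (which rounds halves to even, and the hardware's tie rule may differ), and the "set on" instructions return 1.0 or 0.0 per component in the same way as SLT and SGE.

```python
import math


def arl(value):
    """ARL: drop the fraction before storing (floor assumed)."""
    return math.floor(value)


def arr(value):
    """ARR: round to nearest before storing (tie rule is an assumption)."""
    return round(value)


def ssg(vec):
    """SSG: -1.0, 0.0 or +1.0 per component, by sign."""
    return tuple(-1.0 if c < 0 else (1.0 if c > 0 else 0.0) for c in vec)


def seq(a, b):
    """SEQ: 1.0 where a == b, else 0.0."""
    return tuple(1.0 if x == y else 0.0 for x, y in zip(a, b))


def sgt(a, b):
    """SGT: 1.0 where a > b, else 0.0."""
    return tuple(1.0 if x > y else 0.0 for x, y in zip(a, b))


def sfl(a, b):
    """SFL: set on false, always 0.0."""
    return (0.0, 0.0, 0.0, 0.0)


def str_(a, b):
    """STR: set on true, always 1.0 (trailing underscore avoids Python's str)."""
    return (1.0, 1.0, 1.0, 1.0)


truncated = arl(1.7)
rounded = arr(1.7)
signs = ssg((-3.0, 0.0, 2.5, -0.0))
```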

NV30: Instruction List
- Add & multiply instructions: ADD, DP3, DP4, DPH, MAD, MOV, SUB.
- Math functions: ABS, COS, EX2, FLR, FRC, LG2, LOG, RCP, RSQ, SIN.
- Set on instructions: SEQ, SFL, SGE, SGT, SLE, SLT, SNE, STR.
- Branching instructions: BRA, CAL, RET.
- Address register instructions: ARL, ARA.
- Graphics-oriented instructions: DST, LIT, RCC, SSG.
- Minimum/maximum instructions: MAX, MIN.

Others
- Antialiasing
  - Anisotropic Filtering (textures).
  - Line Antialiasing.
  - Edge Antialiasing.
  - Full Screen Antialiasing (FSAA):
    - Supersampling.
    - MultiSampling.
- TBDR: Tile Based Deferred Rendering (STMicro PowerVR).
- HOS (High Order Surfaces): N-Patches, Bezier, Displacement Mapping, TruForm, Tessellation.
