Direct3D 10 and Beyond
Peter-Pike Sloan, Microsoft Corporation

The Direct3D 10 System
  Latest step in GPU evolution
  Coming to millions of PCs near you
  Large, complex system
  General overview and a few highlights
  Motivations
  Discuss current post-Direct3D 10 thoughts

Prior Work
  Era      Hardware                                   APIs                      Programming model
  < 2001   Fixed-function hardware                    Direct3D 7, OpenGL 1.4    Ad hoc multipass
  2001     Programmable vertex processing             Direct3D 8, OpenGL 1.5    Assembly programming
  2002-3   Programmable fragment processing           Direct3D 9, OpenGL 2.0    High-level shading languages
  2004+    Primitive processing, unified programming  Direct3D 10               More CPU-like features
  Trend: increasing programmability

Design Process
  Collaboration with
    Application developers (ISVs)
    Hardware developers (IHVs)
  Iterative process
    Start: spring 2003
    Spec: fall 2004
    HW implementations …
  [Diagram: ISV1, ISV2, …, ISVn and IHV1, IHV2, …, IHVm connected through the DirectX team]

Constraints & Problems
  Preserve
    Data parallelism
    Memory system efficiency
    Coherence
    Determinism
  Improve
    State change agility
    Implementation consistency
    Program expressiveness
    Resource limitations
    CPU offload
  [Diagram labels: Performance/$$, Visual Complexity]

Guiding Decisions
  Narrow gap between abstraction and implementation
    Improve overall system efficiency
    Avoid undefined behavior
    Avoid de facto defined behavior problems
  Avoid promising generality that can’t be delivered
    If you specify CPU generality, you will get CPU performance
  No new API support for older hardware
    Allows fixed feature set, tighter behavior compliance
  Cull unnecessary fixed functions
    Performance-per-watt and -per-$$ informs what to retain

System Architecture
  Logical pipeline
  Programmer’s view
  [Pipeline diagram: Input Assembler, Vertex Shader, Geometry Shader, Stream Out, Setup/Rasterization, Pixel Shader, Output Merger; memory resources: Buffer, Texture, Index, Depth, Color]

System Architecture
  Input assembler
    Fixed-function
    Canonicalize vertex data
    Generate IDs: primitive, vertex, instance

System Architecture
  Vertex shader
    Programmable
    Vertex transformations
    1 vertex in, 1 out
    Read from memory

System Architecture
  Geometry Shader
    New, programmable
    Per-primitive processing
    1 prim in, k prims out
    Read from memory

System Architecture
  Stream Out
    New, fixed-function
    Divert primitive data to 1D buffers
    1 in, 1 out
    Write to memory

System Architecture
  Setup/Rasterization
    Fixed-function
    Clipping, divide by w
    Convert primitives to fragments
    1 prim in, m frags out

System Architecture
  Pixel Shader
    Programmable
    Shade fragments
    1 frag in, 0 or 1 out
    Read from memory

System Architecture
  Output Merger
    Fixed-function
    Depth/stencil tests
    Color buffer blending
    Read/modify/write to memory

System Architecture
  Common programmable “Core”
    Same ISA
  Flexible memory objects
    Reuse at different stages
    Array forms of memory objects
    Indexes generated in shaders
  [Pipeline diagram: Input Assembler, Vertex Shader, Geometry Shader, Stream Out, Setup/Rasterization, Pixel Shader, Output Merger, with their memory objects: vertex buffer, index buffer, textures, buffers, depth, color]

Geometry Shader
  Entire primitive as input
    Adjacency optional
  Outputs zero or more primitives
    1024 scalars out max

Geometry Shader
  Programmable setup
    Generate barycentric coordinates, interpolate arbitrary amount of data downstream
  Quadratic interpolation over triangles
    Data stored/computed at edge midpoints
    Basis functions are simple polynomials of barycentric coordinates
    Analytic gradients
  [Figure: reference triangle with vertices (0,0), (1,0), (0,1)]
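The quadratic interpolation described above can be sketched as follows. The slide does not name the basis, so this assumes the standard six-node quadratic Lagrange basis (three corners, three edge midpoints) on the reference triangle (0,0), (1,0), (0,1); the function names are hypothetical.

```python
def quadratic_basis(b0, b1, b2):
    """Six quadratic weights: three corners, then midpoints of edges 0-1, 1-2, 2-0."""
    return (
        b0 * (2.0 * b0 - 1.0),
        b1 * (2.0 * b1 - 1.0),
        b2 * (2.0 * b2 - 1.0),
        4.0 * b0 * b1,
        4.0 * b1 * b2,
        4.0 * b2 * b0,
    )


def interpolate(corners, midpoints, b0, b1, b2):
    """Evaluate the quadratic interpolant at barycentric coordinates (b0, b1, b2)."""
    values = tuple(corners) + tuple(midpoints)
    return sum(w * v for w, v in zip(quadratic_basis(b0, b1, b2), values))


def gradient_xy(corners, midpoints, b0, b1, b2):
    """Analytic (d/dx, d/dy) on the reference triangle, where b1 = x, b2 = y, b0 = 1 - x - y."""
    c0, c1, c2 = corners
    m01, m12, m20 = midpoints
    # Partial derivatives with respect to each barycentric coordinate.
    d0 = c0 * (4.0 * b0 - 1.0) + 4.0 * b1 * m01 + 4.0 * b2 * m20
    d1 = c1 * (4.0 * b1 - 1.0) + 4.0 * b0 * m01 + 4.0 * b2 * m12
    d2 = c2 * (4.0 * b2 - 1.0) + 4.0 * b1 * m12 + 4.0 * b0 * m20
    # Chain rule: db0/dx = db0/dy = -1, db1/dx = 1, db2/dy = 1.
    return (d1 - d0, d2 - d0)
```

Because the basis spans all quadratics, sampling any quadratic function at the six nodes and interpolating reproduces it exactly, which makes a convenient sanity check.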

Geometry Shader
  Amplify geometry
    Expand point sprites
    Extrude silhouettes
    Extrude prisms/tets [Hirche04]

Geometry Shader
  Generate array index for render target array
    E.g., render to cube map
    Treat cube map as 6-element array
    Emit primitive multiple times
    Per-cube-face transform + array index
  [Diagram: GS writing to the slices of a render target array]
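A minimal sketch of the cube-map idea above, written in Python rather than shader code: one input triangle is emitted six times, each copy carrying its own face transform and a render-target array index. The face order and the (right, up, forward) axis conventions below are assumptions chosen for illustration, not the Direct3D cube-face layout.

```python
# Hypothetical per-face transforms: each maps a world-space point (x, y, z)
# to (right, up, forward) coordinates for one cube face. The conventions are
# illustrative only.
FACE_TRANSFORMS = (
    lambda x, y, z: (-z, y, x),    # slice 0: +X
    lambda x, y, z: (z, y, -x),    # slice 1: -X
    lambda x, y, z: (x, -z, y),    # slice 2: +Y
    lambda x, y, z: (x, z, -y),    # slice 3: -Y
    lambda x, y, z: (x, y, z),     # slice 4: +Z
    lambda x, y, z: (-x, y, -z),   # slice 5: -Z
)


def emit_to_cube_faces(triangle):
    """Emit one input triangle six times, tagging each copy with its array index."""
    outputs = []
    for array_index, transform in enumerate(FACE_TRANSFORMS):
        transformed = tuple(transform(*vertex) for vertex in triangle)
        outputs.append((array_index, transformed))
    return outputs
```

Each emitted pair plays the role of a primitive routed to one slice of the render target array, so a single pass over the scene fills all six faces.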

Determinism & Parallelism
  Allow parallel processing but preserve serial order
  Buffer GS outputs (on chip)
  Limit output to 1K 32-bit values
    Application can specify less
    May allow greater parallelism
  [Diagram: inputs 1, 2, …, n processed by parallel GS units, each buffering its output; example: expansion to 2 triangles]
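The ordering rule above can be sketched on the CPU: run the per-primitive work in parallel, buffer each invocation's output, and concatenate the buffers in input order. This is an illustration under stated assumptions, not how hardware schedules geometry shaders; `expand_point_to_quad` and `run_geometry_stage` are hypothetical names, and the 1024-scalar cap mirrors the limit quoted on the slide.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_OUTPUT_SCALARS = 1024  # the "1K 32-bit values" limit from the slide


def expand_point_to_quad(point, half=0.5):
    """Example shader: expand one 2D point into two triangles forming a quad."""
    x, y = point
    a, b = (x - half, y - half), (x + half, y - half)
    c, d = (x + half, y + half), (x - half, y + half)
    return [(a, b, c), (a, c, d)]


def run_geometry_stage(primitives, shader, declared_max=MAX_OUTPUT_SCALARS, workers=4):
    """Run shader invocations in parallel, then emit outputs in serial input order."""
    if declared_max > MAX_OUTPUT_SCALARS:
        raise ValueError("declared output exceeds the hard limit")

    def invoke(primitive):
        out = shader(primitive)
        scalars = sum(len(vertex) for tri in out for vertex in tri)
        if scalars > declared_max:
            raise ValueError("shader emitted more scalars than declared")
        return out

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order regardless of completion order.
        per_input = list(pool.map(invoke, primitives))
    return [tri for out in per_input for tri in out]
```

Declaring a smaller `declared_max` corresponds to the slide's point that an application can specify less than the limit; smaller per-invocation buffers leave room for more invocations in flight.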

Stream Out
  Data from VS/GS can optionally be streamed out to a buffer
    32 bits per component (int or float)
    Either a single buffer of up to 16 elements (64 scalars max) with flexible stride,
    or up to 4 buffers that each hold a single element with unit element stride
  Always sent to rasterizer if rasterizer is enabled

Stream Out
  Generated geometry easily redrawn using DrawAuto() command with no CPU intervention

Multi-Stream Output
  Array-of-structures vs. structure-of-arrays
  [Diagram: Position, Color, Normal, and Texture attributes laid out interleaved per vertex (AoS) and as separate per-attribute streams (SoA)]
  Input Assembler supports both types as vertex buffers
  Both styles are useful
    Access pattern vs. memory coherency
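To make the two layouts concrete, here is a small sketch that packs the same vertices both ways with Python's `struct` module. The vertex data, the two-attribute format, and the helper names are made up for illustration; real vertex buffers would be described through the API's input layout.

```python
import struct

# Hypothetical vertices with two attributes each.
VERTICES = [
    {"position": (0.0, 0.0, 0.0), "color": (1.0, 0.0, 0.0, 1.0)},
    {"position": (1.0, 0.0, 0.0), "color": (0.0, 1.0, 0.0, 1.0)},
    {"position": (0.0, 1.0, 0.0), "color": (0.0, 0.0, 1.0, 1.0)},
]
AOS_FORMAT = "<3f4f"  # one interleaved vertex: 3 + 4 floats = 28 bytes


def pack_aos(vertices):
    """Array of structures: one buffer, attributes interleaved per vertex."""
    return b"".join(
        struct.pack(AOS_FORMAT, *v["position"], *v["color"]) for v in vertices
    )


def pack_soa(vertices):
    """Structure of arrays: one tightly packed buffer per attribute."""
    return {
        "position": b"".join(struct.pack("<3f", *v["position"]) for v in vertices),
        "color": b"".join(struct.pack("<4f", *v["color"]) for v in vertices),
    }


def read_position_aos(buffer, index):
    stride = struct.calcsize(AOS_FORMAT)  # skip whole vertices
    return struct.unpack_from("<3f", buffer, index * stride)


def read_position_soa(streams, index):
    stride = struct.calcsize("<3f")  # skip only positions
    return struct.unpack_from("<3f", streams["position"], index * stride)
```

Reading only positions touches every 28-byte vertex in the interleaved layout but a contiguous 12-byte-per-vertex stream in the split layout, which is the access-pattern versus coherency trade-off the slide refers to.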

Multi-Stream Output
  Add multiple stream capability
  Compromise: support either
    1 multi-element stream with up to 16 elements (AoS), or
    Up to 4 single-element (SoA) streams
  Future expansion

Programmability
  Virtual machine model
    Machine-independent intermediate language (IL)
    Just-in-time translation (JIT) in hardware driver
      When shader program object is created
  [Flow: HLSL Program → HLSL Compiler → IL → JIT in Driver → Program Object]

The Virtual Machine
  New features
    Integer instruction set
    Load instruction (no store!)
    IEEE-754 format & ~accuracy
    Separate samplers & textures
    Writable private memory

                          Direct3D 9    Direct3D 10
  Instructions            64K/512       unlimited
  Textures                16            128
  Temporary registers     32
  Constants               256           4K x 16
  Interstage registers
  2D texture              4K x 4K       8K x 8K
  Render targets          4             8

Shading A Triangle
  Shader inputs: static light positions, dynamic light positions, camera positions, view/projection matrices, bone matrices, LOD, material parameters, normals, positions, texcoords
  Update frequencies: per-level, per-frame, per-instance, per-primitive, per-vertex data
  [Diagram: the data above and Textures A, B, and C feeding the vertex shader and pixel shader]

Constants in Direct3D 9
  [Diagram: per-level, per-frame, per-instance, per-primitive, and per-vertex data set through SetConstant() into the VS and PS constants; Textures A, B, and C feed the vertex shader and pixel shader]

Constant Buffers
  Split parameters into buffers
    Organize by update frequency
    Bulk update any buffer
    Bind up to 16 buffers/shader
  Sounds like 1D textures
    But access pattern is different
    Uniform vs. non-uniform index
    Frequent vs. infrequent access
  [Diagram: separate buffers for per-level, per-frame, per-instance, per-primitive, and per-vertex data]
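The benefit of organizing constants by update frequency can be sketched with a toy model that counts bulk uploads. `ConstantBuffer`, `bind`, and `render` are hypothetical names for illustration and are not the Direct3D API; only the 16-buffer bind limit comes from the slide.

```python
MAX_BOUND_BUFFERS = 16  # "bind up to 16 buffers/shader"


class ConstantBuffer:
    """Toy constant buffer that is always updated in bulk and counts uploads."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.uploads = 0

    def update(self, **values):
        self.data = dict(values)  # replace the whole buffer at once
        self.uploads += 1


def bind(buffers):
    if len(buffers) > MAX_BOUND_BUFFERS:
        raise ValueError("too many constant buffers bound to one shader")
    return tuple(buffers)


def render(frames, instances_per_frame):
    """Update each buffer only as often as its data actually changes."""
    per_level = ConstantBuffer("per_level")
    per_frame = ConstantBuffer("per_frame")
    per_instance = ConstantBuffer("per_instance")
    bound = bind([per_level, per_frame, per_instance])

    per_level.update(static_light=(0.0, 10.0, 0.0))  # once per level
    draws = 0
    for frame in range(frames):
        per_frame.update(camera=frame)  # once per frame
        for instance in range(instances_per_frame):
            per_instance.update(bone=instance)  # once per instance
            draws += 1
    return {b.name: b.uploads for b in bound}, draws
```

With three frames of two instances each, the per-level buffer is uploaded once, the per-frame buffer three times, and the per-instance buffer six times, instead of every constant being re-set for every draw.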

API/Runtime
  Plumbing for
    Creating/managing objects
    Binding state to pipeline stages
  Restructure for efficiency & flexibility
    Aggregate bits of state into large objects
    More “real” work done per API call
    Group related state together (blend, raster, stencil, depth)
    Guide hardware implementation

Configuring the Pipeline
  Input Assembler: IASetVertexBuffers, IASetIndexBuffer, IASetPrimitiveTopology
  Vertex/Geometry/Pixel Shader: {VS|GS|PS}SetShader, {VS|GS|PS}SetShaderResources, {VS|GS|PS}SetConstantBuffers, {VS|GS|PS}SetSamplers
  Stream Out: SOSetTargets
  Setup/Rasterization: RSSetState, RSSetViewports, RSSetScissorRects
  Output Merger: OMSetRenderTargets, OMSetBlendState, OMSetDepthStencilState

Shading Language
  HLSL is the real API?
    Shader programs considered part of art assets!
  Support new instructions (integer, load, …)
  Parameter grouping into constant buffers
  Geometry Shader
    Multiple input vertices, multiple output support (emit & reset)
    Intrinsics for stream output
  Avoid features with large run-time (CPU) cost
    E.g., requiring recompilation if state changes

Particle System Example
  No CPU intervention
  Particle state in 1D buffer
  Read buffer and rewrite 2nd buffer each pass
  Use GS to add or destroy particles
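The buffer ping-pong described above can be sketched on the CPU as follows. On the GPU the loop body would be a geometry shader writing through stream out; here a plain Python function returns zero, one, or two particles per input. The particle fields and the emitter/spark behavior are invented for illustration.

```python
def update_particle(particle, dt):
    """GS-like step: return [] to destroy, one particle to keep, two to spawn."""
    age = particle["age"] + dt
    if age >= particle["lifetime"]:
        return []  # particle dies: emit nothing
    moved = dict(particle, pos=particle["pos"] + particle["vel"] * dt, age=age)
    if particle["kind"] == "emitter":
        spark = {"kind": "spark", "pos": moved["pos"], "vel": 1.0,
                 "age": 0.0, "lifetime": 2.0 * dt}
        return [moved, spark]  # emitter survives and adds a particle
    return [moved]


def simulate(initial, steps, dt):
    """Each pass reads one buffer and rewrites the other, then the roles swap."""
    source, target = list(initial), []
    for _ in range(steps):
        target.clear()
        for particle in source:
            target.extend(update_particle(particle, dt))
        source, target = target, source
    return source
```

Starting from a single long-lived emitter and running three steps with `dt = 1.0` leaves the emitter plus the two most recent sparks; the first spark has already expired.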

Displacement Map Example
  GS extrudes prism at each face [Hirche04]
  PS ray casts against height field
  Shade or discard pixel depending on ray test

Instancing Example
  GS can determine shader, instance and primitive IDs used to index texture array

Sparse Morph Targets
  “Render to VB” updates vertices
  GS uses stretch of triangle to drive wrinkles

Other Ideas Considered
  Programmable Input Assembler
    Unwarranted complexity
  Tessellation
    Complexity too high for this design (deferred)
  Access to color/depth buffer from pixel shader
    Prohibitive performance implications
  Simultaneous read/write access to memory
    Unpredictable results → non-determinism
  Scatter, reduction operations
    Performance vs. determinism issues (deferred)

Results
  State change agility
    State objects, constant buffers, instancing, array resources
  Greater expressiveness & flexibility
    Integer, load, etc. instructions; stream out, flexible memory objects
  Fewer resource constraints
    Huge increase in resources (hardware cost)
  Feature consistency
    Very tight behavioral specification (feature set, arithmetic tolerances)
    2 optional features (multisampling, 32-bit float texture filtering)
  CPU offload
    Memory model, geometry shader, stream out, predicated rendering, …

Acknowledgements
  Numerous software and hardware companies contributed to the design
  ATI, Epic, NCsoft, Autodesk, nVidia, id, SOE, RAD, Intel, Valve, Ubisoft, XSI, S3, Blizzard, Naughty Dog, Discreet, 3Dlabs, Ritual, Lucas Arts, Alias, XGI, Crytek, Emogence, DirectX team, PowerVR, Bungie, Lionhead, GameFu, Matrox, Monolith, EA, …

Post Direct3D 10

Direct3D 10.1
  Small improvements for important problems
    Limited to small hardware changes
  More VS→GS inter-stage registers, VS input
  Cube map arrays
  Multi-sample control (patterns, alpha to coverage)
  Better multi-sample color & depth access
  Per-render-target blending modes
  API/runtime enhancements for multi-core
  Precision improvements

Future: Addressing GPU Evolution
  [Diagram: open questions beyond Direct3D 10+: Multi-GPU? REYES? Physics? GPGPU? Raytracing?]

Complexity & Balance
  Increase realism/fidelity in weaker areas
  Complexity inflection points require new techniques
  [Chart: complexity (quality normalized by “importance”) per visual “attribute”: geometric, material, lighting, transport, animation, dynamics]

Problems to Solve
  Content generation
    Create more artwork faster
    20+ GB of content to be created
    Preserve content investment
  Better visuals
    Silhouette edges, transparency, antialiasing, texture filtering
  Non-rendering computation
    Physics, animation, morphing…
  Programmability
    Fixed functions vs. programmability

Content Generation
  Tackle two areas – inflection points
  Texture maps
    Currently hand painted, 2K×2K → 4K×4K
    Transition to procedural methods (long term)
    Improve texture management
  Character modeling with detail and deformation
    Currently skinned polygonal models with normal maps
    Transition to deformable subdiv patches with displacement & normal maps

Tessellation
  Primary motivator is amplification of animation/morph targets/deformation models
  Everything stays on GPU if possible
  Displacement-mapped surfaces become first-class primitives

Three-Domain Pipeline
  Patches
    Control points
    “Low frequency” phenomena (animation, vector irradiance?, indirect vector irradiance?)
  Triangles
    3 vertices
    “Mid frequency” phenomena
  Pixel fragments
    n samples per pixel
    “High frequency” phenomena (gloss, material roughness)

(Logical) Pipeline Evolution
  (Hypothetical!)
  Spill patch data to memory?
  [Pipeline diagram with legend fixed / programmable / memory. Stages shown: Input Assembler, Control Point Shader, Tessellator, Vertex Shader, Geometry Shader, Stream Out, Setup, Rasterizer, Pixel Shader, Output Merger. Resources shown: samplers, constants, textures, vertex buffer, index buffer, stream buffer, depth/stencil, render target]

Tessellation with Displacement
  Integration into art pipeline
  Surface formats (SubD, bi-cubic patches)? Approximation?
  How much tessellation? Adaptive?
  How does it fit into the logical pipeline?
    New stages? How many?
    Try and keep everything on chip?
  Updating control cage, multi-pass makes sense
  Conversion to other basis – multi-pass?

Displacement Mapping
  Vertex based
    How much tessellation?
    Interaction with fractional tessellation?
    Are more sophisticated tessellation schemes required?
  Local ray tracing
    How inexpensive can you make shaders?
    Interaction with MSAA? Shadows?
    Interaction with hierarchical/early Z?

Improving Visual Quality
  Many areas to improve
    Some solved with programmability + performance
    But not all
  Texture filter quality
  Texture compression; e.g., HDR images
  Derivatives
  Order-independent transparency
  Antialiasing quality
  Global illumination
    Static/parameterized GI
    Dynamic GI
    Ray casting/tracing

Transparency and Antialiasing
  Current state of the art
    4-8 sample multisample antialiasing
    n + 1 levels of transparency
  Transparency
    Feathered edges (foliage)
    Windshields
    Particles
  Sort transparent objects
  Alpha to coverage for alpha textures (avoid blend)

Transparency and Antialiasing
  Need to do better
    Sorting too expensive
    Must work with multipass algorithms
      E.g., apply shadow maps
  Move sort to hardware
    Track individual pixel fragments → A-buffer (cf. R-buffer, F-buffer, T-buffer)

A-Buffer
  Save all fragments and sort
  Memory intensive (≈64 fragments/pixel)
    Tiling to reduce memory constraint?
    Discard overflow fragments?
  Operations on pre-resolved fragments
    Shadow computations, multipass layers

A-Buffer
  Fixed-function or programmable?
    Sorting/resolve operation
    Overflow handling
  What happens to MSAA, MRT…
    Fragment = attributes + coverage + depth
    “Defer” explosion to samples until resolve time
    Opportunity to do better antialiasing
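As a rough illustration of the save-then-sort-then-resolve idea, here is a CPU sketch of an A-buffer for single-channel colors. It is a simplification under stated assumptions: fragments carry only depth, color, and alpha (no coverage mask), larger depth means farther away, overflow fragments are simply discarded, and the `ABuffer` class is hypothetical.

```python
from collections import defaultdict

MAX_FRAGMENTS_PER_PIXEL = 64  # order of magnitude quoted in the slides


class ABuffer:
    """Per-pixel fragment lists that are sorted and blended at resolve time."""

    def __init__(self, max_fragments=MAX_FRAGMENTS_PER_PIXEL):
        self.max_fragments = max_fragments
        self.pixels = defaultdict(list)
        self.discarded = 0

    def add_fragment(self, pixel, depth, color, alpha):
        fragments = self.pixels[pixel]
        if len(fragments) >= self.max_fragments:
            self.discarded += 1  # one possible overflow policy: drop it
            return False
        fragments.append((depth, color, alpha))
        return True

    def resolve(self, pixel, background=0.0):
        """Sort far-to-near and composite with the 'over' operator."""
        result = background
        ordered = sorted(self.pixels.get(pixel, []), key=lambda f: f[0], reverse=True)
        for _depth, color, alpha in ordered:
            result = color * alpha + result * (1.0 - alpha)
        return result
```

Because sorting happens at resolve time, the submission order of transparent fragments no longer affects the result, which is the property the application-side sort was providing.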

A-Buffer Implications
  Separate opaque and transparent object processing
    Draw opaque first to cull invisible transparent fragments
  Switch to tiling (chunking) to save memory
    cf. predicated tiling on Xbox 360
  How much memory for fragments (100 MB?)
    Exacerbated by larger displays
    Render at reduced resolution and up-sample?
  Opportunity for better resolve filtering
    Filter support larger than a pixel

Non-rendering Computations
  Direct3D 10 enables new computations using
    Additional programmability
      Integer & load instructions
    More general data flow
      Render to vertex buffer
  Animation + skinning
    Solved problem?
  Particle systems, morphing…

GPU vs. Multicore CPU
  GPU – large flops, memory bandwidth
    Data parallelism, streaming caches
  Multi-core CPU
    Task parallelism, cache locality
  Boundary between the two is fuzzy
    Matrix multiply, sparse matrix x sparse vector
  Convergence?

Programmability
  Direct3D 10 computationally complete?
  Make entire pipeline programmable?
  Some processing more efficient as fixed function
    Setup, rasterization
    Hierarchical Z
    Filtering (does it need 32-bit float?)
    Clipping (do you really want to write that code?)
  Orthogonality
    Keep data types/formats independent from algorithms

Programmability
  Every function we remove…
    You may need to add back in shader code
  E.g., suppose we enable alpha-to-coverage in shaders
    Compute coverage mask and output in pixel shader
    Do we keep the fixed-function version?
    If removed, then all (pixel) shaders need to implement alpha-to-coverage
  Developer implements “virtual pipeline” in shaders
    HLSL/FX provides support for implementing “virtual pipelines”
    Can we do more?
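As an illustration of what "compute coverage mask and output in pixel shader" might involve, here is a minimal sketch that converts an alpha value into a multisample coverage mask. It assumes the simplest possible mapping (round alpha to a sample count and set the low bits); real implementations typically also dither the pattern across pixels, and the function name is hypothetical.

```python
def alpha_to_coverage(alpha, num_samples=4):
    """Map alpha in [0, 1] to a bitmask with round(alpha * n) samples covered."""
    alpha = min(max(alpha, 0.0), 1.0)
    covered = int(alpha * num_samples + 0.5)  # round to nearest sample count
    return (1 << covered) - 1
```

With n samples this yields exactly n + 1 distinct masks, matching the "n + 1 levels of transparency" noted on the earlier transparency slide.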

Dynamic Subroutines
  Do dynamic subroutines simplify/solve the problem?
    i.e., shaders with function pointers
  Call overhead must be tiny
    Otherwise, end up inlining and recompiling
  Can I dynamically stack (append) subroutines?
    A→B→C→
  Or do subroutines need to have static call sites (bind points)?
    A0→B
    A1→C
    A2→D
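The two binding models contrasted above can be sketched in Python, where functions already behave like function pointers. Everything here is illustrative: `ShaderProgram`, its bind points, and the small lighting functions are hypothetical and only show the difference between fixed call sites and freely stacked subroutines.

```python
def lambert(n_dot_l):
    return max(n_dot_l, 0.0)


def half_lambert(n_dot_l):
    return (0.5 * n_dot_l + 0.5) ** 2


def identity(value):
    return value


def invert(value):
    return 1.0 - value


class ShaderProgram:
    """Static call sites: the shader body is fixed, only the slots are rebound."""

    BIND_POINTS = ("lighting", "post")

    def __init__(self):
        self.bound = {"lighting": lambert, "post": identity}

    def bind(self, point, subroutine):
        if point not in self.BIND_POINTS:
            raise KeyError(point)
        self.bound[point] = subroutine  # no recompilation, just a pointer swap

    def run(self, n_dot_l):
        return self.bound["post"](self.bound["lighting"](n_dot_l))


def stack(*subroutines):
    """Dynamic stacking: append any number of subroutines into one chain."""
    def chained(value):
        for subroutine in subroutines:
            value = subroutine(value)
        return value
    return chained
```

Static bind points keep the set of call sites known in advance, whereas stacking lets the chain grow at run time; the slide leaves open which model is practical given the requirement that call overhead be tiny.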

Programmability
  Next steps
  Efficient dynamic subroutine mechanism
    Eliminate combinatorial explosions
    Allow shader composition through libraries
    Need efficient dynamic binding
    cf. version 1.0 – Fragment Linker in Direct3D 9
  Generalized data parallel computation
    Neighbor communication?
    Scatter?
    Read-modify-write operations to memory

Summary
  Lots to figure out!
    Better texturing
    Surface tessellation
    Transparency/anti-aliasing
    General computation

Acknowledgements
  DirectX group
    David Blythe, Michael Bunnell, Shanon Drone, Sam Glassenberg, Michael Oneppo
  IHVs/ISVs [see earlier slide…]

Questions?
  No dates/promises for anything post Direct3D 10
