Direct3D 10 and Beyond Peter-Pike Sloan Microsoft Corporation.


1 Direct3D 10 and Beyond Peter-Pike Sloan Microsoft Corporation

2 The Direct3D 10 System Latest step in GPU evolution; coming to millions of PCs near you. Large, complex system: general overview and a few highlights; motivations. Discuss current post-Direct3D 10 thoughts.

3 Prior Work [Timeline figure: increasing programmability] Fixed-function hardware (< 2001; Direct3D 7, OpenGL 1.4; ad hoc multipass) → programmable vertex processing (Direct3D 8, OpenGL 1.5; assembly programming) → programmable fragment processing (Direct3D 9, OpenGL 2.0; high-level shading languages) → primitive processing, unified programming (Direct3D 10; more CPU-like features)

4 Design Process Collaboration with application developers (ISVs) and hardware developers (IHVs). Iterative process: start spring 2003; spec fall 2004; HW implementations. [Diagram: DirectX team linked to ISV 1, ISV 2, …, ISV n and IHV 1, IHV 2, …, IHV m]

5 Constraints & Problems Preserve: data parallelism, memory system efficiency, coherence, determinism. Performance/$$. Improve: state change agility, implementation consistency, program expressiveness, resource limitations, CPU offload. Visual complexity.

6 Guiding Decisions Narrow gap between abstraction and implementation; improve overall system efficiency. Avoid undefined behavior; avoid de facto defined behavior problems. Avoid promising generality that can’t be delivered: if you specify CPU generality, you will get CPU performance. No new API support for older hardware: allows fixed feature set, tighter behavior compliance. Cull unnecessary fixed functions: performance-per-watt and -per-$$ informs what to retain.

7 System Architecture Logical pipeline; programmer’s view. [Pipeline diagram. Stages: Input Assembler → Vertex Shader → Geometry Shader → Stream Out and Setup/Rasterization → Pixel Shader → Output Merger. Memory: Vertex Buffer, Index Buffer, Texture, Depth, Color.]

8 System Architecture: Input Assembler Fixed-function. Canonicalize vertex data. Generate IDs: primitive, vertex, instance.

9 System Architecture: Vertex Shader Programmable. Vertex transformations. 1 vertex in, 1 out. Read from memory.

10 System Architecture: Geometry Shader New, programmable. Per-primitive processing. 1 prim in, k prims out. Read from memory.

11 System Architecture: Stream Out New, fixed-function. Divert primitive data to 1D buffers. 1 in, 1 out. Write to memory.

12 System Architecture: Setup/Rasterization Fixed-function. Clipping, divide by w. Convert primitives to fragments. 1 prim in, m frags out.

13 System Architecture: Pixel Shader Programmable. Shade fragments. 1 frag in, 0 or 1 out. Read from memory.

14 System Architecture: Output Merger Fixed-function. Depth/stencil tests. Color buffer blending. Read/modify/write to memory.

15 System Architecture Common programmable “core”: same ISA. Flexible memory objects: reuse at different stages. Array forms of memory objects: indexes generated in shaders.

16 Geometry Shader Entire primitive as input; adjacency optional. Outputs zero or more primitives; 1024 scalars out max.
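The one-primitive-in, zero-or-more-out contract on this slide can be mimicked on the CPU. The following is only a sketch: `run_geometry_shader` and `cull_degenerate` are hypothetical names, vertices are plain tuples of scalars, and raising an error on overflow is an illustrative choice rather than documented hardware behavior.

```python
from typing import Callable, Iterable, List, Tuple

Vertex = Tuple[float, ...]   # one vertex = a tuple of scalar attributes
Primitive = List[Vertex]     # e.g. 3 vertices for a triangle

MAX_OUTPUT_SCALARS = 1024    # per-invocation cap quoted on the slide


def run_geometry_shader(
    shader: Callable[[Primitive], Iterable[Primitive]],
    primitive: Primitive,
) -> List[Primitive]:
    """Run `shader` on one whole input primitive and enforce the output cap."""
    outputs: List[Primitive] = []
    scalars = 0
    for out_prim in shader(primitive):
        scalars += sum(len(vertex) for vertex in out_prim)
        if scalars > MAX_OUTPUT_SCALARS:
            # Assumption: this model simply rejects over-long output.
            raise ValueError("geometry shader exceeded 1024 output scalars")
        outputs.append(out_prim)
    return outputs


def cull_degenerate(prim: Primitive) -> Iterable[Primitive]:
    """Hypothetical shader: drop zero-area 2D triangles, pass others through."""
    (ax, ay), (bx, by), (cx, cy) = prim
    twice_area = (bx - ax) * (cy - ay) - (cx - ax) * (by - ay)
    if twice_area != 0:
        yield prim
```

A shader written as a generator makes the "zero or more" part natural: yielding nothing culls the primitive, yielding several amplifies it.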

17 Geometry Shader Programmable setup: generate barycentric coordinates, interpolate arbitrary amount of data downstream. Quadratic interpolation over triangles: data stored/computed at edge midpoints; basis functions are simple polynomials of barycentric coordinates; analytic gradients. [Figure: reference triangle with vertices (0,0), (1,0), (0,1)]
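As an illustration of quadratic interpolation from barycentric coordinates, the sketch below uses the standard six-node quadratic basis (three vertices plus three edge midpoints). The slide does not say which basis the author used, so this choice and the function name are assumptions.

```python
def quadratic_interpolate(bary, vertex_vals, midpoint_vals):
    """Evaluate a quadratic function over a triangle.

    bary          -- barycentric coordinates (l0, l1, l2), summing to 1
    vertex_vals   -- values at vertices 0, 1, 2
    midpoint_vals -- values at the midpoints of edges 0-1, 1-2, 2-0
    """
    l0, l1, l2 = bary
    v0, v1, v2 = vertex_vals
    m01, m12, m20 = midpoint_vals
    # Vertex basis functions: l_i * (2 * l_i - 1); midpoint basis: 4 * l_i * l_j.
    return (
        v0 * l0 * (2 * l0 - 1)
        + v1 * l1 * (2 * l1 - 1)
        + v2 * l2 * (2 * l2 - 1)
        + 4 * (m01 * l0 * l1 + m12 * l1 * l2 + m20 * l2 * l0)
    )
```

Each basis function is a simple polynomial in the barycentric coordinates, so its gradient can also be written in closed form, which matches the "analytic gradients" note on the slide.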

18 Geometry Shader Amplify geometry Expand Point Sprites Extrude silhouettes Extrude prisms/tets [Hirche04]

19 Geometry Shader Generate array index for render target array, e.g., render to cube map: treat cube map as 6-element array; emit primitive multiple times, with per-cube-face transform + array index. [Diagram: GS → render target array]
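The cube-map idea can be sketched as a loop that emits the same triangle six times, each copy tagged with a render-target array index. The helper names and the use of 3×3 row-major matrices are assumptions for illustration; the matrices a real application would pass are the six per-face view/projection transforms, which are not reproduced here.

```python
def transform(matrix, vertex):
    """Multiply a 3x3 row-major matrix by a 3-component vertex."""
    return tuple(sum(m * c for m, c in zip(row, vertex)) for row in matrix)


def emit_to_cube_faces(triangle, face_matrices):
    """Emit one copy of `triangle` per cube face.

    Returns a list of (array_index, transformed_triangle) pairs, where
    array_index selects the slice of the 6-element render target array.
    """
    if len(face_matrices) != 6:
        raise ValueError("a cube map needs exactly 6 face transforms")
    emitted = []
    for array_index, matrix in enumerate(face_matrices):
        emitted.append((array_index, [transform(matrix, v) for v in triangle]))
    return emitted
```

The point of the design is that the scene geometry is submitted once and the amplification to six faces happens inside the pipeline.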

20 Determinism & Parallelism Allow parallel processing but preserve serial order: buffer GS outputs (on chip). Limit output to 1K 32-bit values; application can specify less, which may allow greater parallelism. [Diagram: primitives 1, 2, …, n, each expanded to 2 triangles by the GS]
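A CPU analogy for "process in parallel, preserve serial order": each worker buffers the output of its own primitive, and the buffers are then drained in input order. This is only a sketch with hypothetical function names; a thread pool stands in for parallel GS units.

```python
from concurrent.futures import ThreadPoolExecutor


def expand_to_two_triangles(primitive_id):
    """Hypothetical geometry shader: expand one primitive into 2 triangles."""
    return [(primitive_id, 0), (primitive_id, 1)]


def run_in_parallel(primitive_ids, workers=4):
    """Process primitives concurrently, then concatenate in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of its inputs, no matter
        # which worker finishes first, so each result acts as a per-primitive
        # output buffer that is drained serially.
        per_primitive = list(pool.map(expand_to_two_triangles, primitive_ids))
    return [tri for outputs in per_primitive for tri in outputs]
```

The smaller each per-primitive buffer is, the more of them fit at once, which is one way to read the slide's remark that declaring less output may allow greater parallelism.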

21 Stream Out Data from VS/GS can optionally be streamed out to a buffer. 32 bits per component (int or float). Either a single buffer of up to 16 elements (64 scalars max) with flexible stride, or up to 4 buffers that each hold a single element with unit element stride. Always sent to rasterizer if rasterizer is enabled.
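The two stream-out layouts on this slide can be written down as a small validity check. This is a sketch, not an API: the function name is hypothetical, and the 1 to 4 components-per-element bound is an assumption inferred from "16 elements (64 scalars max)".

```python
MAX_ELEMENTS_SINGLE_BUFFER = 16
MAX_SCALARS_SINGLE_BUFFER = 64
MAX_BUFFERS = 4


def is_valid_stream_out(buffers):
    """Check a stream-out layout.

    buffers -- list of buffers; each buffer is a list giving the number of
               32-bit components in each element written to it.
    """
    if not buffers or len(buffers) > MAX_BUFFERS:
        return False
    # Assumption: every element has between 1 and 4 components.
    if any(not 1 <= n <= 4 for buf in buffers for n in buf):
        return False
    if len(buffers) == 1:
        elements = buffers[0]
        return (
            1 <= len(elements) <= MAX_ELEMENTS_SINGLE_BUFFER
            and sum(elements) <= MAX_SCALARS_SINGLE_BUFFER
        )
    # Multi-buffer mode: exactly one element per buffer.
    return all(len(buf) == 1 for buf in buffers)
```

For example, `is_valid_stream_out([[4, 4, 2]])` describes one interleaved buffer with position, color and texcoord, while `is_valid_stream_out([[3], [3], [2]])` describes three single-element buffers.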

22 Stream Out Generated geometry easily redrawn using DrawAuto() command with no CPU intervention

23 Multi-Stream Output Array-of-structures (AoS) vs. structure-of-arrays (SoA) layouts of Position, Color, Normal, Texture. Input Assembler supports both types as vertex buffers. Both styles are useful: access pattern vs. memory coherency.
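The two layouts differ only in how the same vertex data is grouped. A minimal sketch with the four attributes named on the slide (the function names and dictionary representation are illustrative assumptions):

```python
FIELDS = ("position", "color", "normal", "texture")


def aos_to_soa(vertices):
    """Array of structures -> structure of arrays (one stream per attribute)."""
    return {name: [vertex[name] for vertex in vertices] for name in FIELDS}


def soa_to_aos(streams):
    """Structure of arrays -> array of structures (one record per vertex)."""
    columns = [streams[name] for name in FIELDS]
    return [dict(zip(FIELDS, values)) for values in zip(*columns)]
```

AoS keeps everything one vertex needs next to each other; SoA lets a pass that touches only positions, say, stream through one tightly packed array.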

24 Multi-Stream Output Add multiple stream capability. Compromise: support either 1 multi-element stream with up to 16 elements (AoS) or up to 4 single-element (SoA) streams. Future expansion.

25 Programmability Virtual machine model: machine-independent intermediate language (IL). Just-in-time translation (JIT) in hardware driver, when shader program object is created. [Diagram: HLSL program → HLSL compiler → IL → JIT in driver → program object]

26 The Virtual Machine New features: integer instruction set; load instruction (no store!); IEEE-754 format & ~accuracy; separate samplers & textures; writable private memory.

Resource               Direct3D 9   Direct3D 10
Instructions           64K/512      unlimited
Textures               16           128
Temporary registers    16           32
Constants              256          4Kx16
Interstage registers
2D texture             4Kx4K        8Kx8K
Render targets         4            8

27 Shading a Triangle Inputs: static light positions, dynamic light positions, camera positions, view/projection matrices, bone matrices, LOD, material parameters, normals, positions, texcoords. Grouped as per-level, per-frame, per-instance, per-primitive, and per-vertex data. [Diagram: data groups and Textures A, B, C feeding the Vertex Shader and Pixel Shader]

28 Constants in Direct3D 9 [Diagram: per-level, per-frame, per-instance, per-primitive, and per-vertex data; Textures A, B, C; Vertex Shader and Pixel Shader; VS and PS constants set through repeated SetConstant() calls]

29 Constants in Direct3D 9 [Diagram: per-level, per-frame, per-instance, per-primitive, and per-vertex data; Textures A, B, C; Vertex Shader and Pixel Shader; VS and PS constants set through repeated SetConstant() calls]

30 Constant Buffers Split parameters into buffers, organized by update frequency (per-level, per-frame, per-instance, per-primitive, per-vertex data). Bulk update any buffer. Bind up to 16 buffers/shader. Sounds like 1D textures, but access pattern is different: uniform vs. non-uniform index, frequent vs. infrequent access.
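To make "organize by update frequency" concrete, the sketch below counts how often each buffer is re-uploaded over a level. All class and function names are hypothetical stand-ins, not the Direct3D 10 API; only the limit of 16 bound buffers per shader comes from the slide.

```python
MAX_BUFFERS_PER_SHADER = 16  # bind limit quoted on the slide


class ConstantBuffer:
    """A block of parameters that is always replaced in bulk."""

    def __init__(self, name):
        self.name = name
        self.values = {}
        self.uploads = 0

    def update(self, **values):
        self.values = dict(values)
        self.uploads += 1


class ShaderStage:
    def __init__(self):
        self.slots = [None] * MAX_BUFFERS_PER_SHADER

    def bind(self, slot, buffer):
        if not 0 <= slot < MAX_BUFFERS_PER_SHADER:
            raise IndexError("a shader stage has only 16 constant buffer slots")
        self.slots[slot] = buffer


def render_level(frames, instances_per_frame):
    """Return how many bulk uploads each buffer needed."""
    per_level = ConstantBuffer("per_level")
    per_frame = ConstantBuffer("per_frame")
    per_instance = ConstantBuffer("per_instance")
    stage = ShaderStage()
    for slot, buf in enumerate((per_level, per_frame, per_instance)):
        stage.bind(slot, buf)

    per_level.update(static_lights=[(0.0, 10.0, 0.0)])      # once per level
    for frame in range(frames):
        per_frame.update(camera=(0.0, 0.0, float(frame)))   # once per frame
        for instance in range(instances_per_frame):
            per_instance.update(lod=instance % 3)            # once per instance
    return {b.name: b.uploads for b in (per_level, per_frame, per_instance)}
```

With one flat constant array, changing any per-instance value would sit in the same update path as the rarely changing data; splitting by frequency keeps the hot buffer small.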

31 API/Runtime Plumbing for creating/managing objects and binding state to pipeline stages. Restructure for efficiency & flexibility: aggregate bits of state into large objects; more “real” work done per API call; group related state together (blend, raster, stencil, depth). Guide hardware implementation.

32 Configuring the Pipeline IASetVertexBuffers/SetIndexBuffer, IASetPrimitiveTopology, {VS|GS|PS}SetShader, {VS|GS|PS}SetShaderResources, {VS|GS|PS}SetConstantBuffers, {VS|GS|PS}SetSamplers, SOSetTargets, RSSetState, RSSetViewports/ScissorRects, OMSetRenderTargets, OMSetBlendState, OMSetDepthStencilState

33 Shading Language HLSL is the real API? Shader programs considered part of art assets! Support new instructions (integer, load, …). Parameter grouping into constant buffers. Geometry Shader: multiple input vertices, multiple output support (emit & reset). Intrinsics for stream output. Avoid features with large run-time (CPU) cost, e.g., requiring re-compilation if state changes.

34 Particle System Example No CPU intervention. Particle state in 1D buffer. Read buffer and rewrite 2nd buffer each pass. Use GS to add or destroy particles.
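The read-one-buffer, write-the-other loop on this slide can be imitated on the CPU. The sketch below is an assumption-laden toy: particles are `(kind, position, velocity, age)` tuples in one dimension, emitters never die and spawn one spark per pass, and sparks are destroyed once they reach `max_age`. None of these rules come from the slide.

```python
def update_pass(source, dt, max_age):
    """One pass: read `source`, return the contents of the second buffer."""
    destination = []
    for kind, position, velocity, age in source:
        if kind == "emitter":
            destination.append((kind, position, velocity, age))
            # Add a particle: the GS may emit more primitives than it reads.
            destination.append(("spark", position, 1.0, 0.0))
        else:
            age += dt
            if age < max_age:
                destination.append((kind, position + velocity * dt, velocity, age))
            # Otherwise emit nothing, which destroys the particle.
    return destination


def simulate(initial, passes, dt=1.0, max_age=2.0):
    """Ping-pong between two buffers, swapping roles after every pass."""
    buffers = [list(initial), []]
    src = 0
    for _ in range(passes):
        buffers[1 - src] = update_pass(buffers[src], dt, max_age)
        src = 1 - src
    return buffers[src]
```

On the GPU the second buffer would be filled through stream out and redrawn with DrawAuto(), so the particle count never has to be read back by the CPU.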

35 Displacement Map Example GS extrudes prism at each face [Hirche04] PS ray casts against height field Shade or discard pixel depending on ray test

36 Instancing Example GS can determine shader, instance and primitive IDs used to index texture array

37 Sparse Morph Targets “Render to VB” updates vertices GS uses stretch of triangle to drive wrinkles

38 Other Ideas Considered Programmable Input Assembler: unwarranted complexity. Tessellation: complexity too high for this design (deferred). Access to color/depth buffer from pixel shader: prohibitive performance implications. Simultaneous read/write access to memory: unpredictable results → non-determinism. Scatter, reduction operations: performance vs. determinism issues (deferred).

39 Results State change agility: state objects, constant buffers, instancing, array resources. Greater expressiveness & flexibility: integer, load, etc. instructions; stream out, flexible memory objects. Fewer resource constraints: huge increase in resources (hardware cost). Feature consistency: very tight behavioral specification (feature set, arithmetic tolerances); 2 optional features (multisampling, 32-bit float texture filtering). CPU offload: memory model, geometry shader, stream out, predicated rendering, …

40 Acknowledgements Numerous software and hardware companies contributed to the design: ATI, Epic, NCsoft, Autodesk, nVidia, id, SOE, RAD, Intel, Valve, Ubisoft, XSI, S3, Blizzard, Naughty Dog, Discreet, 3Dlabs, Ritual, Lucas Arts, Alias, XGI, Crytek, Emogence, DirectX team, PowerVR, Bungie, Lionhead, GameFu, Matrox, Monolith, EA, …

41 Post Direct3D 10

42 Direct3D 10.1 Small improvements for important problems; limited to small hardware changes. More VS → GS inter-stage registers, VS input. Cube map arrays. Multi-sample control (patterns, alpha to coverage). Better multi-sample color & depth access. Per-render-target blending modes. API/runtime enhancements for multi-core. Precision improvements.

43 Future: Addressing GPU Evolution [Diagram: Direct3D 10+ and open directions: raytracing? REYES? GPGPU? physics? multi-GPU?]

44 Complexity & Balance Increase realism/fidelity in weaker areas. Complexity inflection points require new techniques. [Chart: complexity/quality of each visual “attribute” normalized by “importance”: geometric, material, lighting, transport, animation, dynamics]

45 Problems to Solve Content generation: create more artwork faster; 20+ GB of content to be created; preserve content investment. Better visuals: silhouette edges, transparency, antialiasing, texture filtering. Non-rendering computation: physics, animation, morphing… Programmability: fixed functions vs. programmability.

46 Content Generation Tackle two areas (inflection points). Texture maps: currently hand painted, 2K×2K → 4K×4K; transition to procedural methods (long term); improve texture management. Character modeling with detail and deformation: currently skinned polygonal models with normal maps; transition to deformable subdiv patches with displacement & normal maps.

47 Tessellation Primary motivator is amplification of animation/morph targets/deformation models Everything stays on GPU if possible Displacement mapped surfaces become first class primitives

48 Displaced Subdivision Images © Fantasy Lab and Wizards of the Coast

49 Three-Domain Pipeline Patches (control points): “low frequency” phenomena (animation, vector irradiance?, indirect vector irradiance?). Triangles (3 vertices): “mid frequency” phenomena. Pixel fragments (n samples per pixel): “high frequency” phenomena (gloss, material roughness).

50 (Logical) Pipeline Evolution (Hypothetical!) [Pipeline diagram with stages marked programmable or fixed. Stages: Input Assembler, Vertex Shader, Control Point Shader, Tessellator, Geometry Shader, Stream Out, Setup, Rasterizer, Pixel Shader, Output Merger. Memory: Vertex Buffer, Index Buffer, Texture, Sampler, Constant, Stream Buffer, Render Target, Depth/Stencil.] Spill patch data to memory?

51 Tessellation with Displacement Integration into art pipeline Surface formats (SubD, bi-cubic patches)? Approximation? How much tessellation? Adaptive? How does it fit into the logical pipeline? New stages? How many? Try and keep everything on chip? Updating control cage, multi-pass makes sense Conversion to other basis – multi-pass?

52 Displacement Mapping Vertex Based How much tessellation? Interaction with fractional tessellation? Are more sophisticated tessellation schemes required? Local Ray Tracing How inexpensive can you make shaders? Interaction with MSAA? Shadows? Interaction with hierarchical/early Z?

53 Improving Visual Quality Many areas to improve. Some solved with programmability + performance, but not all: texture filter quality; texture compression, e.g., HDR images; derivatives; order-independent transparency; antialiasing quality; global illumination (static/parameterized GI, dynamic GI, ray casting/tracing).

54 Transparency and Antialiasing Current state of the art: n-sample multisample antialiasing gives n + 1 levels of transparency. Transparency cases: feathered edges (foliage), windshields, particles. Sort transparent objects. Alpha to coverage for alpha textures (avoid blend).
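Why n samples give n + 1 transparency levels: alpha to coverage turns an alpha value into "how many of the n samples are written", and that count can only be 0, 1, …, n. The sketch below uses simple rounding and fills samples from bit 0 upward; real hardware typically uses dithered patterns, so both the function names and the mask layout are assumptions.

```python
def alpha_to_coverage(alpha, num_samples):
    """Convert alpha in [0, 1] to a coverage bitmask over `num_samples` samples."""
    alpha = min(max(alpha, 0.0), 1.0)
    covered = int(alpha * num_samples + 0.5)   # round to nearest sample count
    return (1 << covered) - 1                  # lowest `covered` bits set


def opacity_levels(num_samples, steps=1000):
    """Count the distinct masks produced across the whole alpha range."""
    return len({alpha_to_coverage(i / steps, num_samples) for i in range(steps + 1)})
```

With 4 samples, `alpha_to_coverage(0.5, 4)` sets two of the four sample bits, and no alpha value can produce an opacity between two and three samples.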

55 Transparency and Antialiasing Need to do better Sorting too expensive Must work with multipass algorithms E.g., apply shadow maps Move sort to hardware Track individual pixel fragments → A-buffer (cf. R-buffer, F-buffer, T-buffer)

56 A-Buffer Save all fragments and sort. Memory intensive (≈64 fragments/pixel): tiling to reduce memory constraint? Discard overflow fragments? Operations on pre-resolved fragments: shadow computations, multipass layers.
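A per-pixel sketch of "save all fragments and sort": fragments arrive in any order, are sorted by depth, and are composited back to front with the usual "over" blend. The overflow policy here (keep the nearest fragments) is just one possible answer to the question the slide leaves open, and the function name and fragment layout are assumptions.

```python
def resolve_pixel(fragments, background, max_fragments=64):
    """Resolve one pixel of an A-buffer.

    fragments  -- list of (depth, (r, g, b), alpha); smaller depth is nearer
    background -- (r, g, b) color behind all fragments
    """
    # Sort near to far; on overflow keep only the nearest fragments (assumption).
    kept = sorted(fragments, key=lambda frag: frag[0])[:max_fragments]
    color = background
    for _depth, rgb, alpha in reversed(kept):   # composite back to front
        color = tuple(alpha * c + (1.0 - alpha) * b for c, b in zip(rgb, color))
    return color
```

Because the sort happens at resolve time, the application can submit transparent geometry in any order and still get the same image.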

57 A-Buffer Fixed-function or programmable? Sorting/resolve operation Overflow handling What happens to MSAA, MRT… Fragment = attributes + coverage + depth “Defer” explosion to samples until resolve-time Opportunity to do better antialiasing

58 A-Buffer Implications Separate opaque and transparent object processing: draw opaque first to cull invisible transparent fragments. Switch to tiling (chunking) to save memory, cf. predicated tiling on Xbox 360. How much memory for fragments (100MB?) Exacerbated by larger displays. Render at reduced resolution and up-sample? Opportunity for better resolve filtering: filter support larger than a pixel.
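A back-of-the-envelope version of the memory question. Every number passed in below is a hypothetical assumption (display size, average fragment count, 12 bytes per fragment); only the ≈64 fragments/pixel worst case and the rough 100MB figure appear in the slides.

```python
def abuffer_bytes(width, height, fragments_per_pixel, bytes_per_fragment):
    """Storage needed to hold every fragment for a width x height region."""
    return width * height * fragments_per_pixel * bytes_per_fragment


MIB = 1024 * 1024

# Assumed: 1920x1200 display, 12 bytes per fragment (color + depth + coverage).
average_case = abuffer_bytes(1920, 1200, 4, 12)     # 4 fragments/pixel on average
worst_case = abuffer_bytes(1920, 1200, 64, 12)      # 64 fragments/pixel everywhere
one_tile = abuffer_bytes(256, 256, 64, 12)          # worst case for a single tile

print(f"average case: {average_case / MIB:.1f} MiB")
print(f"worst case:   {worst_case / MIB:.1f} MiB")
print(f"256x256 tile: {one_tile / MIB:.1f} MiB")
```

Under these assumptions the average case lands near the slide's 100MB estimate, the full-frame worst case is more than an order of magnitude larger, and tiling bounds the worst case by the tile size instead of the display size.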

59 Non-rendering Computations Direct3D 10 enables new computations using Additional programmability Integer & load instructions More general data flow Render to vertex buffer Animation + skinning Solved problem? Particle systems, Morphing…

60 GPU vs. Multicore CPU GPU – large flops, memory bandwidth Data parallelism, streaming caches Multi-core CPU Task parallelism, cache locality Boundary between the two is fuzzy Matrix multiply, sparse matrix x sparse vector Convergence?

61 Programmability Direct3D 10 computationally complete? Make entire pipeline programmable? Some processing more efficient as fixed function: set-up, rasterization; hierarchical Z; filtering (does it need 32-bit float?); clipping (do you really want to write that code?). Orthogonality: keep data types/formats independent from algorithms.

62 Programmability Every function we remove… you may need to add back in shader code. E.g., suppose we enable alpha-to-coverage in shaders: compute coverage mask and output in pixel shader. Do we keep the fixed-function version? If removed, then all (pixel) shaders need to implement alpha-to-coverage. Developer implements “virtual pipeline” in shaders; HLSL/FX provides support for implementing “virtual pipelines”. Can we do more?

63 Dynamic Subroutines Do dynamic subroutines simplify/solve the problem? i.e., shaders with function pointers Call overhead must be tiny Otherwise, end up inlining and recompiling Can I dynamically stack (append) subroutines? A → B → C → Or do subroutines need to have static call sites (bind points)? A0 → B A1 → C A2 → D
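The two binding styles the slide contrasts can be sketched with ordinary function values. Everything here is a hypothetical model: `compose` stands for dynamically stacked subroutines (A → B → C), `ShaderWithBindPoints` for static call sites whose targets can be rebound, and `variant_counts` shows the kind of combinatorial growth that rebinding is meant to avoid.

```python
def compose(*subroutines):
    """Dynamic stacking: run A, then B, then C on the running value."""
    def chained(value):
        for subroutine in subroutines:
            value = subroutine(value)
        return value
    return chained


class ShaderWithBindPoints:
    """Static call sites: the order is fixed, the bound target is not."""

    def __init__(self, bind_points):
        self.order = list(bind_points)
        self.bound = {}

    def bind(self, point, subroutine):
        if point not in self.order:
            raise KeyError(point)
        self.bound[point] = subroutine

    def run(self, value):
        for point in self.order:
            value = self.bound[point](value)
        return value


def variant_counts(options_per_point):
    """Return (shaders to precompile if fully inlined, subroutines to compile)."""
    precompiled = 1
    for count in options_per_point:
        precompiled *= count
    return precompiled, sum(options_per_point)
```

For instance, `variant_counts([4, 3, 5])` compares 60 fully inlined shader variants with 12 separately compiled subroutines, which is only a win if, as the slide says, the call overhead is tiny.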

64 Programmability Next steps Efficient dynamic subroutine mechanism Eliminate combinatorial explosions Allow shader composition through libraries Need efficient dynamic binding cf. version 1.0 – Fragment Linker in Direct3D 9 Generalized data parallel computation Neighbor communication? Scatter? Read-modify-write operations to memory

65 Summary Lots to figure out! Better texturing Surface Tessellation Transparency/Anti-aliasing General computation

66 Acknowledgements DirectX group David Blythe, Michael Bunnell, Shanon Drone, Sam Glassenberg, Michael Oneppo IHV’s/ISV’s [see earlier slide…]

67 Questions? no dates/promises for anything post Direct3D 10

