Parallel Tessellation using compute shaders


1 Parallel Tessellation using compute shaders
Team Members: David Sierra, Erwin Holzhauser, Matt Faller
Project Sponsors: Mangesh Nijasure, Todd Martin, Saad Arrabi

2 The need for tessellation:
The industry demands higher graphical fidelity
Driven by competition, i.e. "my game looks better than yours"
Consumers tend to choose the best-looking games / software
The solution: More Polygons... or Even More Polygons!!
Credit: Epic Games Inc.

3 The need for tessellation:
Meshes have gone from this... ...to this (Credit: Utah Teapot by Martin Newell; Crytek)
Problems: More Polygons!
High-poly scenes = high calculation cost
Better to spend the scene's polygon "budget" only where needed
This is called level of detail (LOD)
Tessellation is a vital LOD technique; other techniques are not appealing to artists

4 Tessellation Overview:
Tessellation subdivides geometry to increase detail
A tessellation control shader (TCS) decides which vertices need additional detail
The Tessellator subdivides a selected primitive
A tessellation evaluation shader (TES) applies the subdivided primitive across the selected vertices
This group of vertices is called a patch

5 Tessellation Overview:
DirectX 11 Tessellation Pipeline: Hull Shader (TCS) → Tessellator → Domain Shader (TES)
For each patch, a Hull Shader outputs: tessellation factors and control points
The Tessellator outputs a subdivided primitive based on these factors; the output is a series of "domain points" that make up the subdivided primitive
The Domain Shader produces the final result: using the control points, it applies math to each domain point according to some high-order surface (e.g. the B-spline algorithm pictured)
Credit: Wikipedia

6 Tessellation Problems:
Look at where each stage of the pipeline runs:
Hull Shader (TCS) – runs on general purpose Compute Units (CUs): massive throughput
Tessellator – runs on fixed-function hardware: limited throughput, only a few patches at one time
Domain Shader (TES) – runs on general purpose Compute Units (CUs): massive throughput
The general purpose hardware provides incredible throughput: 1 Tflops or more
Including multiple tessellators on a GPU might mitigate the problem...
A scalable compute implementation could be superior

7 GPU architecture Wavefront – 16 ALUs Compute Unit – 4 Wavefronts
A GPU can have up to 44 compute units! 2816 ALUs for those keeping track

8 DirectCompute
Three steps:
Configure – tell the GPU how to store information
Dispatch – tell the GPU to start doing work
Retrieve – copy the results of the calculations back to main RAM

9 DirectCompute pros & cons
Pros:
Uses familiar C-like syntax
Utilizes intrinsic hardware features
float2, float3, float4, ... – N-wide primitive data types; operations are applied simultaneously on all members
mad(a, b, c) computes a*b + c in one cycle!
Cons:
Branching is inefficient – both paths are almost always taken
Copying the results of calculations back to CPU memory is a costly operation in processor time

10 Configuring buffers
Before a buffer is loaded onto the GPU, it must be configured based on the desired access patterns. DirectX 11 uses "descriptions" to represent these access patterns; a buffer may then be created based on these descriptions.

D3D11_BUFFER_DESC bDesc = {}; // zero-initialize so no field is left undefined
bDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
bDesc.ByteWidth = sizeof(TStruct) * numStructures;
bDesc.StructureByteStride = sizeof(TStruct);
bDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
bDesc.Usage = D3D11_USAGE_DEFAULT;
bDesc.CPUAccessFlags = 0; // the CPU will not map this buffer directly

D3D11_UNORDERED_ACCESS_VIEW_DESC uav_desc = {};
uav_desc.Format = DXGI_FORMAT_UNKNOWN; // structured buffers use no typed format
uav_desc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
uav_desc.Buffer.FirstElement = 0;
uav_desc.Buffer.NumElements = numStructures;

HRESULT hr = device->CreateBuffer(&bDesc, pInitData, outBuff);

11 DirectCompute Essentials
Groups – split up work at a high level; one group for each texture or model
Threads – split up the work within each group; in each group, a thread does work on a small group of data elements

12 Tessellation Basics
Point Generation – where does each point go?
Connectivity Generation – how are the points connected?
Tessellation Factor – how many points per line
Calculations can be done in parallel

13 Project Goals Parallel Compute Shader Implementation
Output matches Ref. Tessellator Faster than CPU implementation Better than fixed-function hardware

14 Isoline tessellation Relatively easy to parallelize
Initial implementation: one thread per point
The compiler ended up assigning one whole compute unit per point
Terrible performance – only using 1/64 of the threads per compute unit

15 Isoline tessellation 2nd Generation Implementation
Each thread computes an n×n grid of points
The compiler now splits threads evenly across compute units
Tested 8x8, 4x4, and 2x2 grids; 2x2 was by far the fastest

16 Isoline tessellation 3rd Generation Implementation
Each thread computes 1 point
Threads are launched 64 at a time (as 8x8 or 64x1 groups) to minimize the resources used
Very fast

17 TRI Tessellation
Diagram: 4 tess. factors (1st Outer = 2, 2nd Outer = 5, 3rd Outer = 1, Inner = 4) and a partitioning mode feed a Point Generation Shader and a Point Connectivity Shader, which output a vertex buffer and an index buffer.

18 Quad Tessellation: Compute Implementation
Diagram: 6 tess. factors (1st Outer = 3, 2nd Outer = 1, 3rd Outer = 1, 4th Outer = 1, plus two inner factors, Inner = 3 and Inner = 4) and a partitioning mode feed the compute implementation, which outputs a vertex buffer and an index buffer.

19 TRI Tessellation High level Design
Diagram: on the CPU, the tessellation factors are processed and a buffer containing the input patch and the processed tess factor struct is loaded on the GPU; N thread groups are dispatched. On the GPU, the Point Generation Shader and the Point Connectivity Shader each build a tess factor context in thread group shared memory, perform a thread group sync, then generate points, writing the vertex buffer and the index buffer.

20 Quad & Tri Tessellation: High level Design
Two designs to implement.
Design #1 diagram: for each patch, the CPU performs context processing on the patch input data and loads the input on the GPU (as an RW structured buffer plus a TF_Context), then dispatches a point generation group and a point connectivity group, which output the vertex buffer and the index buffer.

21 Quad & Tri Tessellation: High level Design
Design #2 diagram: the CPU loads N input patches on the GPU (Input Data[N]) and dispatches N thread groups. On the GPU, the Point Generation Shader and the Point Connectivity Shader each compute the TF_Context in groupshared memory, perform a group sync, then compute point locations and connectivity, writing the vertex buffer and the index buffer.

22 Tri Tessellation: low level Details
Diagram: points labeled P: 0 through P: 13 arranged in a spiral.
The shader must output each point in an exact order
The order follows a spiral pattern
This regular pattern allows connectivity and generation to be done in parallel
The point generation shader computes a point per thread, based on the thread's global thread ID
There is overhead for each thread to figure out contextual information (its edge, and its offset within the edge)

23 Quad Tessellation: Low Level Details
Diagram: points labeled P: 0 through P: 25 arranged in concentric spiral rings.
The points are generated in a spiral pattern
This regular pattern allows connectivity to be done in parallel
The Microsoft reference implements this with nested for-loops: for-each-ring, for-each-edge, for-each-point (on edge)
This for-loop structure makes indexing threads tricky

24 Quad Tessellation: Low Level Details
The connectivity follows the same spiral pattern, assumes each point has correct value Triangles are created by connecting three points 1 6 7 8 2 Triangle: 1 Triangle: 2 Triangle: 3 Triangle: 4

25 Quad & Tri Tessellation: Low Level Details
Diagram: an output buffer split into ranges (0–64, 64–128, ...), with group thread IDs 1 through 64.
Ideal: have meaningful work for each thread in a group
The number of threads per thread group is a multiple of 64
Each thread in a group must access the appropriate data and not write to any other thread's location
To calculate the correct results we must find, for any given buffer location: the current ring number, the current edge, and the correct offset based on the edge
One problem is that calculating this information introduces divergent flow control for each thread
To counter this, each group of threads is instead made responsible for placing the points and connections of one edge (although this sacrifices cache performance)

26 Work DISTRIBUTION Quad Tessellation – Matthew
Triangle Tessellation – Erwin
Isoline Tessellation – David
Additional Tools:
DXQuery (David) – simplifies the creation of DX11 queries, which are used to collect accurate performance data
Testing environment using the Google Test API – David
Library to simplify writing to and reading from buffers – Matt

27 BUDGET Radeon R9 290X Graphics Card – The cheapest runs from $360 on Newegg, bought with AMD donation Practical Rendering and Computation with Direct3D 11 – Runs from ~$50 on Amazon Bitbucket – Free for teams under 5 users Private code repository Visual Studio Professional - Free through UCF’s DreamSpark membership Credit: Newegg.com

28 Progress HLSL Compute Shader Implementation (Sequential):
Isoline: 100% Triangles: 100% Quads: 100%
HLSL Compute Shader Implementation (Parallel):
Additional Tools:
DXWrapper: 100% Testing Environment: 100% Buffer Loading Library: 100%
Abandoned PerfStudio for performance analysis; using DXQuery instead

29 Testing
Diagram: tess factors and a partitioning mode are fed to both the reference implementation and the shader implementation; each produces a vertex buffer and an index buffer, which Google Test compares for a pass/fail result.
The output of the compute shader must match the reference exactly
We test for accuracy using Google's Test API
Loop through every possible tessellation factor and mode and check the output bit-for-bit
The Test API allows for a small margin of error (IEEE floats vs. fixed-point decimals)
In most cases our output differs because of a higher degree of accuracy

30 Experimental Results (ISOLINE I)

31 Experimental Results (ISOLINE II)

32 Experimental Results

33 Experimental Results

34 Experimental Results (TRI)

35 Questions?

