Presentation is loading. Please wait.

Presentation is loading. Please wait.

Stefan PopovHigh Performance GPU Ray Tracing Real-time Ray Tracing on GPU with BVH-based Packet Traversal Stefan Popov, Johannes Günther, Hans- Peter Seidel,

Similar presentations


Presentation on theme: "Stefan PopovHigh Performance GPU Ray Tracing Real-time Ray Tracing on GPU with BVH-based Packet Traversal Stefan Popov, Johannes Günther, Hans- Peter Seidel,"— Presentation transcript:

1 Stefan PopovHigh Performance GPU Ray Tracing Real-time Ray Tracing on GPU with BVH-based Packet Traversal Stefan Popov, Johannes Günther, Hans- Peter Seidel, Philipp Slusallek

2 Stefan PopovHigh Performance GPU Ray Tracing Background  GPUs attractive for ray tracing  High computational power  Shading oriented architecture  GPU ray tracers  Carr – the ray engine  Purcell – Full ray tracing on the GPU, based on grids  Ernst – KD trees with parallel stack  Carr, Thrane & Simonsen – BVH  Foley, Horn, Popov – KD trees - stackless traversal

3 Stefan PopovHigh Performance GPU Ray Tracing Motivation  So far  Interactive RT on GPU, but  Limited model size  No dynamic scene support  The G80 – new approach to the GPU  High performance general purpose processor with graphics extensions  PRAM architecture  BVH allow for  Dynamic/deformable scenes  Small memory footprint  Goal: Recursive ordered traversal of BVH on the G80

4 Stefan PopovHigh Performance GPU Ray Tracing GPU Architecture (G80)  Multi-threaded scalar architecture  12K HW threads  Threads cover latencies  Off-chip memory ops  Instruction dependencies  4 or 16 cycles to issue instr.  16 (multi-)cores  8-wide SIMD  128 scalar cores in total  Cores process threads in 32 wide SIMD chunks … Multi-Core 1 … Thread 1 Thread 32 IP Chunk Pool … Thread 1 Thread 32 … Thread 1 Thread 32 … Thread 1 Thread` 32 … Thread 1 Thread 32 Multi-Core 16 … Thread 1 Thread 32 IP Chunk Pool … Thread 1 Thread 32 … Thread 1 Thread 32 … Thread 1 Thread` 32 … Thread 1 Thread 32

5 Stefan PopovHigh Performance GPU Ray Tracing GPU Architecture (G80)  Scalar register file (8K)  Partitioned among running threads  Shared memory (16KB)  On-chip, 0 cycle latency  On-board memory (768MB)  Large latency (~ 200 cycles)  R/W from within thread  Un-cached  Read-only L2 cache (128KB)  On chip, shared among all threads On-board memory Multi-Core 1 … Thread 1 Thread 32 Shared Memory Registers … Multi-Core 16 L2 Cache (128KB)

6 Stefan PopovHigh Performance GPU Ray Tracing Programming the G80  CUDA  C based language with parallel extensions  GPU utilization at 100% only if  Enough threads are present (>> 12K)  Every thread uses less than 10 registers and 5 words (32 bit) of shared memory  Enough computations per transferred word of data  Bandwidth << computational power  Adequate memory access pattern to allow read combining

7 Stefan PopovHigh Performance GPU Ray Tracing Performance Bottlenecks  Efficient per-thread stack implementation  Shared memory too small – will limit parallelism  On-board memory – uncached  Need enough computations between stack ops  Efficient memory access pattern  Use texture caches  However, only few words of cache / thread  Read successive memory locations in successive threads of a chunk  Single roundtrip to memory (read combining)  Cover latency with enough computations

8 Stefan PopovHigh Performance GPU Ray Tracing Ray Tracing on the G80  Map each ray to one thread  Enough threads to keep the GPU busy  Recursive ray tracing  Use per-thread stack stored on on-board memory  Efficient, since enough computations are present  But how to do the traversal ?  Skip pointers (Thrane) – no ordered traversal  Geometric images (Carr) – single mesh only  Shared stack traversal

9 Stefan PopovHigh Performance GPU Ray Tracing SIMD Packet Traversal of BVH  Traverse a node with the whole packet  At an internal node:  Intersect all rays with both children and determine traversal order  Push far child (if any) on a stack and descend to the near one with the packet  At a leaf:  Intersect all rays with contained geometry  Pop next node to visit from the stack

10 Stefan PopovHigh Performance GPU Ray Tracing PRAM Basics  The PRAM model  Implicitly synchronized processors (threads)  Shared memory between all processors  Basic PRAM operations  Parallel OR in O(1)  Parallel reduction in O(log N) false true false true false true 11 9 9 12 32 44 20 ++ + 64 20 11 9 9 9 9

11 Stefan PopovHigh Performance GPU Ray Tracing PRAM Packet Traversal of BVH  The G80 – PRAM machine on chunk level  Map packet  chunk, ray  thread  Threads behave as in the single ray traversal  At leaf: Intersect with geometry. Pop next node from stack  At node: Decide which children to visit and in what order. Push far child  Difference:  How rays choose which node to visit first  Might not be the one they want to

12 Stefan PopovHigh Performance GPU Ray Tracing PRAM Packet Traversal of BVH  Choose child traversal order  PRAM OR to determine if all rays agree on visiting the same node first  The result is stored in shared memory  In case of divergence: choose child with more ray candidates  Use PRAM SUM on +/- 1 for each thread, -1  left node  Look at result’s sign  Guarantees synchronous traversal of BVH

13 Stefan PopovHigh Performance GPU Ray Tracing PRAM Packet Traversal of BVH  Stack:  Near & far child – the same for all threads => store once  Keep stack in shared memory. Only few bits per thread!  Only Thread 0 does all stack ops.  Reading data:  All threads work with the same node / triangle  Sequential threads bring in sequential words  Single load operation. Single round trip to memory  Implementable in CUDA

14 Stefan PopovHigh Performance GPU Ray Tracing Results

15 Stefan PopovHigh Performance GPU Ray Tracing Analysis  Coherent branch decisions / memory access  Small footprint of the data structure  Can trace up to 12 million triangle models  Program becomes compute bound  Determined by over/under-clocking the core/memory  No frustums required  Good for secondary rays, bad for primary  Can use rasterization for primary rays  Implicit SIMD – easy shader programming  Running on a GPU – shading “for free”

16 Stefan PopovHigh Performance GPU Ray Tracing Dynamic Scenes  Update parts / whole BVH and geometry on GPU  Use GPU for RT and CPU for BVH construction / refitting  Construct BVH using binning  Similar to Wald RT07 / Popov RT06  Bin all 3 dimensions using SIMD  Results in > 10% better trees  Measured as SAH quality, not FPS  Speed loss is almost negligible

17 Stefan PopovHigh Performance GPU Ray Tracing Results

18 Stefan PopovHigh Performance GPU Ray Tracing Conclusions  New recursive PRAM BVH traversal algorithm  Very well suited for the new generation of GPUs  No additional pre-computed data required  First GPU ray tracer to handle large models  Previous implementations were limited to < 300K  Can handle dynamic scenes  By using the CPU to update the geometry / BVH

19 Stefan PopovHigh Performance GPU Ray Tracing Future Work  More features  Shaders, adaptive anti-aliasing, …  Global illumination  Code optimizations  Current implementation uses too many registers

20 Stefan PopovHigh Performance GPU Ray Tracing Thank you!

21 Stefan PopovHigh Performance GPU Ray Tracing CUDA Hello World __global__ void addArrays(int *arr1, int *arr2) { unsigned t = threadIdx.x + blockIdx.x * blockDim.x; arr1[t] += arr2[t]; } int main() { int *inArr1 = malloc(4194304), *inArr2 = malloc(4194304); int *ta1, *ta2; cudaMalloc((void**)&ta1, 4194304); cudaMalloc((void**)&ta2, 4194304); for(int i = 0; i < 4194304; i++) { inArr1[i] = rand(); inArr2[i] = rand(); } cudaMemcpy(ta1, inArr1, 4194304, cudaMemcpyHostToDevice); cudaMemcpy(ta2, inArr2, 4194304, cudaMemcpyHostToDevice); addArrays >>(ta1, ta2); cudaMemcpy(inArr1, ta1, 4194304, cudaMemcpyDeviceToHost); for(int i = 0; i < 4194304; i++) printf("%d ", inArr1[i]); return 0; }


Download ppt "Stefan PopovHigh Performance GPU Ray Tracing Real-time Ray Tracing on GPU with BVH-based Packet Traversal Stefan Popov, Johannes Günther, Hans- Peter Seidel,"

Similar presentations


Ads by Google