
1 Latency considerations of depth-first GPU ray tracing. Michael Guthe, University of Bayreuth, Visual Computing

2 Depth-first GPU ray tracing  Based on a bounding box or spatial hierarchy  Recursive traversal, usually using a stack  Threads inside a warp may access different data  They may also diverge
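The stack-based depth-first traversal described above can be sketched on the CPU as follows; this is a minimal illustration, not the talk's kernel, and the `Node`/`Ray` layouts and the slab test are assumptions for the sketch:

```cpp
#include <vector>

// Hypothetical flattened hierarchy node; field names are illustrative.
struct Node {
    float bmin[3], bmax[3];   // axis-aligned bounding box
    int   left, right;        // child indices, or -1 if absent
    int   prim;               // primitive index for leaves, or -1 for inner nodes
};

struct Ray { float o[3], invd[3]; };  // origin and reciprocal direction

// Standard slab test of the ray against the node's box.
static bool hitBox(const Node& n, const Ray& r) {
    float tmin = 0.0f, tmax = 1e30f;
    for (int a = 0; a < 3; ++a) {
        float t0 = (n.bmin[a] - r.o[a]) * r.invd[a];
        float t1 = (n.bmax[a] - r.o[a]) * r.invd[a];
        if (t0 > t1) { float tmp = t0; t0 = t1; t1 = tmp; }
        if (t0 > tmin) tmin = t0;
        if (t1 < tmax) tmax = t1;
    }
    return tmin <= tmax;
}

// Depth-first traversal with an explicit stack, as GPU trace kernels do
// instead of recursion.
int countHitLeaves(const std::vector<Node>& nodes, const Ray& r) {
    int stack[64], sp = 0, hits = 0;
    stack[sp++] = 0;                            // push root
    while (sp > 0) {
        const Node& n = nodes[stack[--sp]];     // pop
        if (!hitBox(n, r)) continue;
        if (n.prim >= 0) { ++hits; continue; }  // leaf reached
        if (n.left  >= 0) stack[sp++] = n.left;
        if (n.right >= 0) stack[sp++] = n.right;
    }
    return hits;
}
```

Because each thread of a warp traces its own ray, neighboring threads may pop different nodes here, which is exactly the divergent data access the slide points out.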

3 Performance Analysis  What limits the performance of the trace kernel?  Device memory bandwidth? Obviously not!

4 Performance Analysis  What limits the performance of the trace kernel?  Maximum (warp) instructions per clock? Not really!

5 Performance Analysis  Why doesn't the kernel fully utilize the cores?  Three possible reasons:  Instruction fetch, e.g. due to branches  Memory latency, a.k.a. data request, mainly due to random access  Read-after-write latency, a.k.a. execution dependency: it takes 22 clock cycles (Kepler) until the result is written to a register
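The 22-cycle figure implies a back-of-envelope bound on how much independent work a scheduler needs to hide that latency. The sketch below assumes, as a simplification, one instruction issued per warp per cycle; it is an interpretation of the slide, not a formula from the talk:

```cpp
// How many eligible warps a scheduler needs so that a read-after-write
// latency is fully hidden, given some instruction-level parallelism (ILP)
// per warp: with ilp independent instructions, each warp can occupy the
// issue slot for ilp consecutive cycles, so ceil(latency / ilp) warps
// suffice. Simplified model, assuming one issue per warp per cycle.
int warpsToHideLatency(int latencyCycles, int ilpPerWarp) {
    return (latencyCycles + ilpPerWarp - 1) / ilpPerWarp;  // ceiling division
}
```

With no ILP this asks for 22 eligible warps per scheduler to cover a 22-cycle dependency; with 4 independent instruction paths it drops to 6, which is why the later slides trade registers for ILP instead of raw occupancy.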

6 Performance Analysis  Why doesn't the kernel fully utilize the cores?  Profiling shows: memory & RAW latency limit performance!

7 Reducing Latency  Standard solution for latency: increase occupancy  Not an option here due to register pressure  Relocate memory accesses  Automatically performed by the compiler  But not across iterations of a while loop  Loop unrolling for the triangle test
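The unrolling idea can be sketched as below: inside one straight-line loop body the four loads carry no dependencies on each other, so they can be issued back to back, whereas across `while`-loop iterations the compiler will not hoist them. The triangle test is stood in for by a simple nearest-distance scan; the real test is of course more involved:

```cpp
#include <cstddef>

// Placeholder for the triangle test loop: find the nearest "hit distance"
// in an array. Unrolling by 4 exposes four independent loads per iteration,
// which the compiler/hardware can overlap to hide memory latency.
float nearestHit(const float* tri, std::size_t n) {
    float best = 1e30f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // four independent loads, no cross-dependencies
        float t0 = tri[i], t1 = tri[i + 1], t2 = tri[i + 2], t3 = tri[i + 3];
        if (t0 < best) best = t0;
        if (t1 < best) best = t1;
        if (t2 < best) best = t2;
        if (t3 < best) best = t3;
    }
    for (; i < n; ++i)              // remainder iterations
        if (tri[i] < best) best = tri[i];
    return best;
}
```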

8 Reducing Latency  Instruction-level parallelism  Not directly supported by the GPU  Increases the number of eligible warps  Same effect as higher occupancy  We might even spend a few more registers  Wider trees  A 4-ary tree means 4 independent instruction paths  Almost doubles the number of eligible warps during node tests  Higher widths increase the number of node tests; 4 is the optimum
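Why a 4-ary tree yields independent instruction paths: the four child-box tests of one wide node share no data dependencies, so their loads and arithmetic can be interleaved. A minimal sketch, with an assumed node layout that is not the talk's actual format:

```cpp
// Hypothetical 4-wide node: four child boxes stored side by side.
struct WideNode { float bmin[4][3], bmax[4][3]; int child[4]; };
struct Ray4 { float o[3], invd[3]; };

// Returns a 4-bit mask of intersected children. The loop iterations are
// fully independent, so the compiler can interleave them for ILP.
int testChildren(const WideNode& n, const Ray4& r) {
    int mask = 0;
    for (int c = 0; c < 4; ++c) {
        float tmin = 0.0f, tmax = 1e30f;
        for (int a = 0; a < 3; ++a) {          // slab test per axis
            float t0 = (n.bmin[c][a] - r.o[a]) * r.invd[a];
            float t1 = (n.bmax[c][a] - r.o[a]) * r.invd[a];
            if (t0 > t1) { float tmp = t0; t0 = t1; t1 = tmp; }
            if (t0 > tmin) tmin = t0;
            if (t1 < tmax) tmax = t1;
        }
        if (tmin <= tmax) mask |= 1 << c;
    }
    return mask;
}
```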

9 Reducing Latency  Tree construction  Start from the root  Recursively pull the largest child up  Special rules for leaves to reduce memory consumption  Goal: 4 child nodes whenever possible
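The "pull the largest child up" step can be sketched as widening one binary node into a 4-ary one: repeatedly replace the child with the largest subtree by its own two children until four children are collected or only leaves remain. This is an illustration under assumed data structures; the talk's special leaf rules are not reproduced:

```cpp
#include <vector>
#include <algorithm>

// Binary node of the input tree; "size" stands in for whatever measure the
// construction ranks children by (e.g. primitive count).
struct BNode { int left = -1, right = -1, size = 1; };

// Collect up to 4 children for a widened node rooted at "root".
std::vector<int> collectChildren(const std::vector<BNode>& t, int root) {
    std::vector<int> kids;
    if (t[root].left  >= 0) kids.push_back(t[root].left);
    if (t[root].right >= 0) kids.push_back(t[root].right);
    while (kids.size() < 4) {
        // find the largest child that is not a leaf
        int best = -1;
        for (std::size_t i = 0; i < kids.size(); ++i)
            if (t[kids[i]].left >= 0 &&
                (best < 0 || t[kids[i]].size > t[kids[best]].size))
                best = (int)i;
        if (best < 0) break;                 // only leaves left: stop early
        int n = kids[best];
        kids.erase(kids.begin() + best);     // pull the child up:
        kids.push_back(t[n].left);           // replace it by its two children
        kids.push_back(t[n].right);
    }
    return kids;
}
```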

10 Reducing Latency  Overhead: sorting the intersected nodes  Can use two independent paths with a parallel merge sort  We don't need sorting for occlusion rays
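A sketch of what "two independent paths with a parallel merge sort" can look like for at most four intersected children: the first two compare-exchanges touch disjoint pairs, so they are independent and can issue back to back, and the remaining steps merge the two sorted pairs. The register-based layout is an assumption for the sketch:

```cpp
#include <utility>

// Sort up to four (distance, node index) pairs with a small sorting network.
// Steps 1 and 2 operate on disjoint data (two independent paths); steps 3-5
// merge the two sorted pairs.
void sort4(float d[4], int idx[4]) {
    auto cswap = [&](int i, int j) {         // compare-exchange
        if (d[i] > d[j]) { std::swap(d[i], d[j]); std::swap(idx[i], idx[j]); }
    };
    cswap(0, 1);   // path A: independent of path B
    cswap(2, 3);   // path B
    cswap(0, 2);   // merge phase
    cswap(1, 3);
    cswap(1, 2);
}
```

For occlusion (shadow) rays any hit terminates the ray, so the hit order is irrelevant and this whole step can be skipped, as the slide notes.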

11 Results  Improved instructions per clock  Doesn't directly translate to speedup

12 Results  Up to 20.1% speedup over Aila et al.: "Understanding the Efficiency of Ray Traversal on GPUs", Sibenik, 80k tris.

13 Results  Up to 20.1% speedup over Aila et al.: "Understanding the Efficiency of Ray Traversal on GPUs", Fairy forest, 174k tris.

14 Results  Up to 20.1% speedup over Aila et al.: "Understanding the Efficiency of Ray Traversal on GPUs", Conference, 283k tris.

15 Results  Up to 20.1% speedup over Aila et al.: "Understanding the Efficiency of Ray Traversal on GPUs", San Miguel, 11M tris.

16 Results  Latency is still the performance limiter  Mostly improved memory latency
