Real-time Ray Tracing on GPU with BVH-based Packet Traversal
Stefan Popov, Johannes Günther, Hans-Peter Seidel, Philipp Slusallek

Background
- GPUs attractive for ray tracing
  - High computational power
  - Shading-oriented architecture
- GPU ray tracers
  - Carr – the ray engine
  - Purcell – full ray tracing on the GPU, based on grids
  - Ernst – kd-trees with a parallel stack
  - Carr, Thrane & Simonsen – BVHs
  - Foley, Horn, Popov – kd-trees with stackless traversal

Motivation
- So far: interactive RT on the GPU, but
  - Limited model size
  - No dynamic scene support
- The G80 – a new approach to the GPU
  - High-performance general-purpose processor with graphics extensions
  - PRAM architecture
- BVHs allow for
  - Dynamic/deformable scenes
  - A small memory footprint
- Goal: recursive ordered traversal of BVHs on the G80

GPU Architecture (G80)
- Multi-threaded scalar architecture
  - 12K HW threads
  - Threads cover latencies
    - Off-chip memory ops
    - Instruction dependencies
  - 4 or 16 cycles to issue an instruction
- 16 (multi-)cores
  - 8-wide SIMD
  - 128 scalar cores in total
- Cores process threads in 32-wide SIMD chunks
[Diagram: 16 multi-cores, each holding a pool of 32-thread SIMD chunks]
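These figures can be read back at runtime. A minimal CUDA sketch (mine, not from the talk) that queries the properties above; on a G80-class board it reports 16 multiprocessors and 32-thread chunks (warps):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                // properties of GPU 0
    printf("multiprocessors:   %d\n", prop.multiProcessorCount); // 16 on G80
    printf("warp (chunk) size: %d\n", prop.warpSize);            // 32
    printf("shared mem / core: %zu B\n", prop.sharedMemPerBlock); // 16 KB
    printf("registers / core:  %d\n", prop.regsPerBlock);         // 8K
    return 0;
}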

GPU Architecture (G80)
- Scalar register file (8K registers)
  - Partitioned among the running threads
- Shared memory (16 KB)
  - On-chip, 0-cycle latency
- On-board memory (768 MB)
  - Large latency (~200 cycles)
  - R/W from within a thread
  - Uncached
- Read-only L2 cache (128 KB)
  - On-chip, shared among all threads
[Diagram: multi-cores with registers and shared memory, backed by the shared L2 cache and on-board memory]
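In CUDA these memory spaces map directly onto language constructs. A minimal sketch (my illustration, not the authors' code) of where each one appears in a kernel:

__global__ void memorySpaces(const float *gmem, float *out) {
    unsigned t = threadIdx.x + blockIdx.x * blockDim.x;

    // Shared memory: on-chip, 0-cycle latency, visible to the whole block.
    __shared__ float smem[256];

    // On-board (global) memory: ~200-cycle latency, uncached on the G80.
    smem[threadIdx.x] = gmem[t];
    __syncthreads();

    // Registers: ordinary local variables live in the register file.
    float r = 2.0f * smem[threadIdx.x];
    out[t] = r;
}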

Programming the G80
- CUDA
  - C-based language with parallel extensions
- GPU utilization reaches 100% only if
  - Enough threads are present (>> 12K)
  - Every thread uses fewer than 10 registers and 5 words (32-bit) of shared memory
  - There is enough computation per transferred word of data
    - Bandwidth << computational power
  - The memory access pattern allows read combining

Performance Bottlenecks
- Efficient per-thread stack implementation
  - Shared memory is too small – would limit parallelism
  - On-board memory is uncached
    - Need enough computation between stack ops
- Efficient memory access pattern
  - Use the texture caches
    - However, only a few words of cache per thread
  - Read successive memory locations in successive threads of a chunk
    - Single round trip to memory (read combining)
  - Cover latency with enough computation
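A minimal sketch (my illustration) of the access pattern that read combining rewards: successive threads of a chunk read successive words, so the chunk's 32 loads merge into a single memory transaction.

__global__ void combinedRead(const float *data, float *out) {
    unsigned t = threadIdx.x + blockIdx.x * blockDim.x;

    // Combined: consecutive threads read consecutive words -> one round trip.
    out[t] = data[t];

    // Not combined (avoid): a strided pattern like data[t * 17] costs
    // one memory transaction per thread on G80-class hardware.
}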

Ray Tracing on the G80
- Map each ray to one thread
  - Enough threads to keep the GPU busy
- Recursive ray tracing
  - Use a per-thread stack stored in on-board memory
  - Efficient, since enough computation is present
- But how to do the traversal?
  - Skip pointers (Thrane) – no ordered traversal
  - Geometry images (Carr) – single mesh only
  - Shared-stack traversal
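A skeleton of single-ray traversal with a per-thread stack in on-board memory (a sketch under assumed node encodings; the helpers and names are mine, not the paper's):

#define STACK_DEPTH 64   // assumed maximum BVH depth

__device__ bool isLeaf(int node)        { return node < 0; }  // assumed encoding
__device__ void intersectLeaf(int node) { /* ray-triangle tests here */ }
__device__ void orderChildren(int node, int *nearC, int *farC) {
    *nearC = 2 * node + 1;  *farC = 2 * node + 2;  // placeholder indexing
}

__global__ void trace(int root, int *stackPool) {
    unsigned t = threadIdx.x + blockIdx.x * blockDim.x;
    int *stack = stackPool + t * STACK_DEPTH;  // this ray's private stack
    int top = 0, node = root;
    while (true) {
        if (isLeaf(node)) {
            intersectLeaf(node);
            if (top == 0) break;               // stack empty: done
            node = stack[--top];               // pop the next subtree
        } else {
            int nearC, farC;
            orderChildren(node, &nearC, &farC);
            stack[top++] = farC;               // push the far child
            node = nearC;                      // descend to the near child
        }
    }
}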

SIMD Packet Traversal of BVH
- Traverse a node with the whole packet
- At an internal node:
  - Intersect all rays with both children and determine the traversal order
  - Push the far child (if any) onto a stack and descend to the near one with the packet
- At a leaf:
  - Intersect all rays with the contained geometry
  - Pop the next node to visit from the stack

PRAM Basics
- The PRAM model
  - Implicitly synchronized processors (threads)
  - Memory shared between all processors
- Basic PRAM operations
  - Parallel OR in O(1)
  - Parallel reduction in O(log N)
[Diagram: parallel OR and tree reduction over boolean values]
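Both primitives are easy to express on one 32-thread chunk whose threads run in lockstep. A minimal sketch, assuming a 32-thread chunk and a small shared-memory scratch buffer (the names are mine):

__device__ bool parallelOr(volatile int *flag, bool myVote) {
    if (threadIdx.x == 0) *flag = 0;
    // The 32 threads of a chunk execute in lockstep, so no barrier is needed.
    if (myVote) *flag = 1;             // concurrent writes of 1 are harmless
    return *flag != 0;                 // O(1) parallel OR
}

__device__ int parallelSum(volatile int *buf, int myValue) {
    buf[threadIdx.x] = myValue;
    for (int s = 16; s > 0; s >>= 1)   // O(log N) tree reduction
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
    return buf[0];
}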

PRAM Packet Traversal of BVH
- The G80 – a PRAM machine at the chunk level
  - Map packet → chunk, ray → thread
- Threads behave as in single-ray traversal
  - At a leaf: intersect with the geometry; pop the next node from the stack
  - At a node: decide which children to visit and in what order; push the far child
- Difference:
  - How rays choose which node to visit first
    - It might not be the one they want to visit

PRAM Packet Traversal of BVH
- Choosing the child traversal order
  - PRAM OR to determine if all rays agree on visiting the same node first
    - The result is stored in shared memory
  - In case of divergence: choose the child with more ray candidates
    - Use PRAM SUM over ±1 per thread (−1 → left node) and look at the result's sign
- Guarantees synchronous traversal of the BVH
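Put together, the collective decision might look as follows (a sketch under the same lockstep assumption; sh points to a shared int buffer of at least 34 words, all names are mine, and the authors' actual vote encoding may differ):

__device__ int chooseFirstChild(volatile int *sh, bool wantLeft) {
    // Broadcast thread 0's preference, then OR up any disagreement.
    if (threadIdx.x == 0) { sh[32] = wantLeft; sh[33] = 0; }
    if ((int)wantLeft != sh[32]) sh[33] = 1;     // PRAM OR in O(1)
    if (!sh[33]) return wantLeft ? 0 : 1;        // unanimous: take that child

    // Divergent: PRAM SUM over +/-1 (-1 = left); the sign picks the majority.
    sh[threadIdx.x] = wantLeft ? -1 : +1;
    for (int s = 16; s > 0; s >>= 1)
        if (threadIdx.x < s) sh[threadIdx.x] += sh[threadIdx.x + s];
    return (sh[0] < 0) ? 0 : 1;                  // 0 = left first, 1 = right first
}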

PRAM Packet Traversal of BVH
- Stack:
  - Near and far child are the same for all threads, so they are stored only once
  - Keep the stack in shared memory – only a few bits per thread!
  - Only thread 0 performs the stack ops
- Reading data:
  - All threads work with the same node/triangle
  - Sequential threads bring in sequential words
    - Single load operation; single round trip to memory
- Implementable in CUDA
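The per-packet stack and the combined node load then sketch out like this (illustrative assumptions throughout, e.g. the node size; not the authors' code):

#define STACK_DEPTH 64     // assumed max BVH depth
#define NODE_WORDS  16     // assumed node size in 32-bit words

__global__ void packetTrace(const float *bvh, int root) {
    __shared__ int   stack[STACK_DEPTH];  // one stack per packet, not per ray
    __shared__ int   top;
    __shared__ float node[NODE_WORDS];    // staging area for the current node

    if (threadIdx.x == 0) top = 0;        // only thread 0 touches the stack

    int cur = root;
    // Combined load: consecutive threads fetch consecutive words of the node,
    // so the whole node arrives in a single round trip to memory.
    if (threadIdx.x < NODE_WORDS)
        node[threadIdx.x] = bvh[cur * NODE_WORDS + threadIdx.x];

    // ...traversal decisions as on the previous slides; on push/pop only
    // thread 0 moves the shared stack pointer:
    //   if (threadIdx.x == 0) stack[top++] = farChild;  // push far child once
    //   if (threadIdx.x == 0) cur = stack[--top];       // pop the next node
}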

Results

Analysis
- Coherent branch decisions / memory accesses
- Small footprint of the data structure
  - Can trace models of up to 12 million triangles
- The program becomes compute bound
  - Determined by over-/under-clocking the core/memory
- No frustums required
  - Good for secondary rays, bad for primary rays
  - Can use rasterization for primary rays
- Implicit SIMD – easy shader programming
- Running on a GPU – shading "for free"

Dynamic Scenes
- Update parts of the BVH and geometry, or all of them, on the GPU
- Use the GPU for RT and the CPU for BVH construction/refitting
- Construct the BVH using binning (sketched below)
  - Similar to Wald RT07 / Popov RT06
  - Bin all 3 dimensions using SIMD
    - Results in >10% better trees
      - Measured as SAH quality, not FPS
    - The speed loss is almost negligible
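For reference, a compact 1D sketch of binned SAH split selection (my reconstruction of the general technique from Wald RT07 / Popov RT06, not the paper's code; for brevity the "area" term is just the interval length along the binned axis, where a real builder uses full AABB surface areas):

#include <cfloat>

const int NUM_BINS = 16;

// Returns the best split plane index (between bins b-1 and b) along one axis.
int bestSplit(const float *centroids, int n, float lo, float hi) {
    int count[NUM_BINS] = {0};
    float scale = NUM_BINS / (hi - lo);
    for (int i = 0; i < n; i++) {              // bin the primitive centroids
        int b = (int)((centroids[i] - lo) * scale);
        if (b >= NUM_BINS) b = NUM_BINS - 1;   // clamp the upper border
        count[b]++;
    }
    float binWidth = (hi - lo) / NUM_BINS;
    float bestCost = FLT_MAX;
    int   bestB = 1, nl = 0;
    for (int b = 1; b < NUM_BINS; b++) {       // sweep the candidate planes
        nl += count[b - 1];
        int nr = n - nl;
        // SAH: extent(left) * N(left) + extent(right) * N(right)
        float cost = b * binWidth * nl + (NUM_BINS - b) * binWidth * nr;
        if (cost < bestCost) { bestCost = cost; bestB = b; }
    }
    return bestB;
}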

Results

Conclusions
- New recursive PRAM BVH traversal algorithm
  - Very well suited to the new generation of GPUs
  - No additional pre-computed data required
- First GPU ray tracer to handle large models
  - Previous implementations were limited to fewer than 300K triangles
- Can handle dynamic scenes
  - By using the CPU to update the geometry/BVH

Future Work
- More features
  - Shaders, adaptive anti-aliasing, …
  - Global illumination
- Code optimizations
  - The current implementation uses too many registers

Thank you!

CUDA Hello World

// The array size and launch configuration were elided in the transcript;
// N and the <<<...>>> parameters below are assumed values.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

const int N = 1 << 20;

__global__ void addArrays(int *arr1, int *arr2) {
    unsigned t = threadIdx.x + blockIdx.x * blockDim.x;
    arr1[t] += arr2[t];
}

int main() {
    int *inArr1 = (int*)malloc(N * sizeof(int));
    int *inArr2 = (int*)malloc(N * sizeof(int));
    int *ta1, *ta2;
    cudaMalloc((void**)&ta1, N * sizeof(int));
    cudaMalloc((void**)&ta2, N * sizeof(int));
    for (int i = 0; i < N; i++) {
        inArr1[i] = rand();
        inArr2[i] = rand();
    }
    cudaMemcpy(ta1, inArr1, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(ta2, inArr2, N * sizeof(int), cudaMemcpyHostToDevice);
    addArrays<<<N / 256, 256>>>(ta1, ta2);
    cudaMemcpy(inArr1, ta1, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++)
        printf("%d ", inArr1[i]);
    return 0;
}