Interactive k-D Tree GPU Raytracing Daniel Reiter Horn, Jeremy Sugerman, Mike Houston and Pat Hanrahan.

Presentation transcript:


Architectural trends
Processors are becoming more parallel
–SMP
–Stream Processors (Cell)
–Threaded Processors (Niagara)
–GPUs
To raytrace quickly in the future
–We must understand how architectural tradeoffs affect raytracing performance

A Modern GPU: ATI X1900XT
360 GFLOPS peak
40 GB/s cache bandwidth
28 GB/s streaming bandwidth

ATI X1900XT architecture
1000’s of threads
–Each does not communicate with any other
–Each has 512 bytes of scratch space
  Exposed as byte registers
–Groups of ~48 threads in lockstep
  Same program counter

ATI X1900XT architecture
Whenever a memory fetch occurs
–active thread group put on queue
–inactive thread group resumes for more math
Execute one thread until stall, then switch to next thread.
[Figure: threads T1 to T4 taking turns, each running until it stalls on a memory access]
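The switching scheme on this slide can be illustrated with a rough back-of-the-envelope model. The sketch below is not from the talk: it assumes every thread group alternates a fixed number of math cycles with a fixed memory latency, and the names `alu_utilization`, `math_cycles` and `fetch_latency`, as well as the cycle counts, are hypothetical.

```python
def alu_utilization(groups: int, math_cycles: int, fetch_latency: int) -> float:
    """Fraction of cycles the ALUs stay busy when thread groups take turns.

    Assumed model: each group runs `math_cycles` of arithmetic, then waits
    `fetch_latency` cycles for memory. While one group waits, others run.
    """
    if groups < 1 or math_cycles < 1 or fetch_latency < 0:
        raise ValueError("need at least one group and one math cycle")
    period = math_cycles + fetch_latency      # one math+fetch round of a group
    return min(1.0, groups * math_cycles / period)


# Hypothetical numbers: 10 cycles of math per 90-cycle memory fetch.
utilization_by_groups = {
    groups: alu_utilization(groups, math_cycles=10, fetch_latency=90)
    for groups in (1, 5, 10, 20)
}
for groups, utilization in utilization_by_groups.items():
    print(f"{groups:2d} groups in flight: {utilization:.0%} ALU utilization")
```

Under these assumptions the latency is fully hidden once enough groups are in flight, which is why the hardware keeps thousands of threads resident.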

Evolving a GPU to raytrace
Get all GPU features
–Rasterizer
–Fast Texturing
–Shading
Plus a raytracer

Current state of GPU raytracing
Foley et al. slower than CPU
–Performance only 30% of a CPU
–Limited by memory bandwidth
  More math units won’t improve raytracer
–Hard to store a stack in 512 bytes
  Invented KD-Restart to compensate

GPU Improvements
Allows us to apply modern CPU raytracing techniques to GPU raytracers
Looping
–Entire intersection as a single pass
Longer supported programs
–Ray packets of size 4 (matching SIMD width)
Access to hardware assembly language
–Hand-tune inner loop

Contribution
Port to ATI X1900
Exploiting new architectural features
Short stack
Result: 4.75x faster than CPU on untextured scene

KD-Tree
[Figure: scene split by planes X, Y and Z into leaves A, B, C and D, with the matching tree and a ray spanning tmin to tmax]

KD-Tree Traversal
[Figure: ray traversing leaves A, B, C and D of the tree with split planes X, Y and Z, shown alongside the traversal stack]
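For readers without the figure, the stack-based traversal can be sketched as follows. This is a minimal illustration, not the authors' GPU code: the 2D scene, the split positions and the names `Node`, `order_children` and `traverse_with_stack` are assumptions, and leaf intersection is reduced to a callback.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    axis: int = -1                    # -1 marks a leaf
    split: float = 0.0
    below: Optional["Node"] = None    # child on the low side of the split
    above: Optional["Node"] = None    # child on the high side of the split


def build_example_tree() -> Node:
    # Hypothetical layout echoing the slide: planes X, Y, Z and leaves A-D.
    y = Node("Y", axis=1, split=1.0, below=Node("A"), above=Node("B"))
    z = Node("Z", axis=1, split=1.75, below=Node("C"), above=Node("D"))
    return Node("X", axis=0, split=2.0, below=y, above=z)


def order_children(node, origin, direction):
    """Return (near child, far child, ray parameter t of the split plane)."""
    o, d = origin[node.axis], direction[node.axis]
    below_first = o < node.split or (o == node.split and d <= 0)
    near, far = (node.below, node.above) if below_first else (node.above, node.below)
    t = (node.split - o) / d if d != 0 else float("inf")
    return near, far, t


def traverse_with_stack(root, origin, direction, tmin, tmax, hit_in_leaf):
    stack = []                        # deferred far children with their ranges
    node = root
    while True:
        while node.axis >= 0:         # descend to a leaf
            near, far, t = order_children(node, origin, direction)
            if t <= 0 or t >= tmax:   # plane outside the range: near side only
                node = near
            elif t <= tmin:           # range lies past the plane: far side only
                node = far
            else:                     # range straddles the plane
                stack.append((far, t, tmax))
                node, tmax = near, t
        if hit_in_leaf(node, tmin, tmax):
            return node.name
        if not stack:
            return None
        node, tmin, tmax = stack.pop()


visited = []

def record_miss(leaf, tmin, tmax):
    visited.append(leaf.name)
    return False                      # pretend the ray hits nothing

result = traverse_with_stack(build_example_tree(), (0.0, 0.5), (1.0, 0.5),
                             0.0, 3.0, record_miss)
print(visited, result)
```

The stack is what lets the traversal resume at the far child without revisiting the root, and it is exactly the state that is hard to fit in per-thread GPU storage.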

KD-Restart
Standard traversal
–Omit stack operations
–Proceed to 1st leaf
If no intersection
–Advance (tmin,tmax)
–Restart from root
  Proceed to next leaf
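The restart loop described here can be sketched as below. This is an illustrative reading of the slide, not the original shader: the tiny 2D tree, the ray, and the names `Node`, `order_children` and `traverse_kd_restart` are hypothetical, and the count of inner-node visits is included only to show the extra work that restarting causes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    axis: int = -1                    # -1 marks a leaf
    split: float = 0.0
    below: Optional["Node"] = None
    above: Optional["Node"] = None


def build_example_tree() -> Node:
    # Hypothetical layout: planes X, Y, Z and leaves A-D.
    y = Node("Y", axis=1, split=1.0, below=Node("A"), above=Node("B"))
    z = Node("Z", axis=1, split=1.75, below=Node("C"), above=Node("D"))
    return Node("X", axis=0, split=2.0, below=y, above=z)


def order_children(node, origin, direction):
    o, d = origin[node.axis], direction[node.axis]
    below_first = o < node.split or (o == node.split and d <= 0)
    near, far = (node.below, node.above) if below_first else (node.above, node.below)
    t = (node.split - o) / d if d != 0 else float("inf")
    return near, far, t


def traverse_kd_restart(root, origin, direction, tmin, tmax, hit_in_leaf):
    """Stackless traversal: returns (hit leaf name or None, inner nodes visited)."""
    scene_max = tmax
    inner_visits = 0
    while tmin < scene_max:
        node, tmax = root, scene_max  # restart from the root
        while node.axis >= 0:
            inner_visits += 1
            near, far, t = order_children(node, origin, direction)
            if t <= 0 or t >= tmax:
                node = near
            elif t <= tmin:
                node = far
            else:
                node, tmax = near, t  # no push: the far child is rediscovered later
        if hit_in_leaf(node, tmin, tmax):
            return node.name, inner_visits
        tmin = tmax                   # advance the range past this leaf
    return None, inner_visits


visited = []

def record_miss(leaf, tmin, tmax):
    visited.append(leaf.name)
    return False

result, inner_visits = traverse_kd_restart(build_example_tree(), (0.0, 0.5),
                                           (1.0, 0.5), 0.0, 3.0, record_miss)
print(visited, result, inner_visits)
```

On this toy tree the leaves are reached in the same order as with a stack, but every leaf costs a fresh descent from the root.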

Eliminating Cost of KD-Restart
Only 512 bytes of storage space, no room for stack
Save last 3 elements pushed
–Call this a short stack
When pushing a full short stack
–Discard oldest element
When popping an empty short stack
–Fall back to restart
–Rare
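The push and pop rules above combine with the restart fallback roughly as in the following sketch. It is an assumption-laden illustration rather than the authors' implementation: the bounded stack is modeled with Python's `collections.deque(maxlen=...)`, which drops the oldest entry when a push overflows, and the scene and the names `Node`, `order_children` and `traverse_short_stack` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    axis: int = -1                    # -1 marks a leaf
    split: float = 0.0
    below: Optional["Node"] = None
    above: Optional["Node"] = None


def build_example_tree() -> Node:
    # Hypothetical layout: planes X, Y, Z and leaves A-D.
    y = Node("Y", axis=1, split=1.0, below=Node("A"), above=Node("B"))
    z = Node("Z", axis=1, split=1.75, below=Node("C"), above=Node("D"))
    return Node("X", axis=0, split=2.0, below=y, above=z)


def order_children(node, origin, direction):
    o, d = origin[node.axis], direction[node.axis]
    below_first = o < node.split or (o == node.split and d <= 0)
    near, far = (node.below, node.above) if below_first else (node.above, node.below)
    t = (node.split - o) / d if d != 0 else float("inf")
    return near, far, t


def traverse_short_stack(root, origin, direction, tmin, tmax, hit_in_leaf, capacity=3):
    """Returns (hit leaf name or None, inner nodes visited, restarts)."""
    scene_max = tmax
    stack = deque(maxlen=capacity)    # a full deque discards its oldest entry
    node, inner_visits, restarts = root, 0, 0
    while True:
        while node.axis >= 0:
            inner_visits += 1
            near, far, t = order_children(node, origin, direction)
            if t <= 0 or t >= tmax:
                node = near
            elif t <= tmin:
                node = far
            else:
                stack.append((far, t, tmax))
                node, tmax = near, t
        if hit_in_leaf(node, tmin, tmax):
            return node.name, inner_visits, restarts
        if stack:                     # normal pop
            node, tmin, tmax = stack.pop()
        elif tmax < scene_max:        # entries were discarded: restart from root
            restarts += 1
            node, tmin, tmax = root, tmax, scene_max
        else:                         # ray has left the scene
            return None, inner_visits, restarts


def miss(leaf, tmin, tmax):
    return False

ray = ((0.0, 0.5), (1.0, 0.5), 0.0, 3.0)
stats = {cap: traverse_short_stack(build_example_tree(), *ray, miss, capacity=cap)[1:]
         for cap in (0, 1, 3)}
print(stats)   # capacity -> (inner nodes visited, restarts)
```

A capacity of 0 degenerates to plain KD-Restart, while a capacity large enough for the tree depth never restarts, so the short stack sits between the two in both storage and node visits.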

KD-Restart with short stack (size 1)
[Figure: the same scene and tree traversed with a one-entry short stack]

Scenes
Cornell Box: 32 triangles
BART Robots: 71,708 triangles
BART Kitchen: 110,561 triangles
Conference Room: 282,801 triangles

How tall a short stack do we need?
Vanilla KD-Restart visits 166% more nodes than standard k-D tree traversal on Robots scene
Short stack size 1 visits only 25% extra nodes
–Storage needed is 36 bytes for packets, 12 bytes for single ray
Short stack size 3 visits only 3% extra nodes
–Storage needed is 108 bytes for packets, 36 bytes for single ray
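The storage figures on this slide are consistent with one plausible entry layout, sketched below. The layout is an assumption, not something the slides state: each stack entry is taken to hold one 4-byte node reference shared by the packet plus a 4-byte tmin and a 4-byte tmax per ray. The names `entry_bytes` and `stack_bytes` are hypothetical.

```python
NODE_REF_BYTES = 4   # assumed: one node index shared by all rays in a packet
FLOAT_BYTES = 4      # assumed: 32-bit tmin and tmax per ray
SCRATCH_BYTES = 512  # per-thread storage quoted earlier in the talk


def entry_bytes(rays_per_packet: int) -> int:
    """Size of one short-stack entry under the assumed layout."""
    return NODE_REF_BYTES + 2 * FLOAT_BYTES * rays_per_packet


def stack_bytes(depth: int, rays_per_packet: int) -> int:
    return depth * entry_bytes(rays_per_packet)


for depth in (1, 3):
    for rays in (4, 1):
        size = stack_bytes(depth, rays)
        print(f"depth {depth}, {rays} ray(s): {size} bytes "
              f"({size / SCRATCH_BYTES:.0%} of scratch space)")
```

Under this layout a three-entry stack for a four-ray packet takes 108 bytes, about a fifth of the 512 bytes available per thread.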

Demonstration

Performance of Intersection
[Chart: millions of rays per second on Cornell Box, Kitchen and Robots for KD-Restart, Packets and Short Stack]

End-to-end performance
[Table: frames per second for AMD 2.4 GHz, ATI X1900 and Cell]
And texturing is cheap! (diffuse texture doesn’t alter framerate)
[1 Source: Ray Tracing on the Cell processor, Benthin et al., 2006]
- We rasterize first hits
11 frames per second

Analysis
Dual GPU can outperform a Cell processor
–But both have comparable FLOPS
  Each GPU should be on par
–We run at 40-60% of GPU’s peak instruction issue rate
  Why?

Why do we run at 40-60% peak?
Memory bandwidth or latency?
–No: Turned memory clock to 2/3: minimal effect
KD-Restarts?
–No: 3-tall short-stack is enough
Execution incoherence?
–Yes: 48 threads must be at the same program counter
–Tested with a dummy kernel that fetched no data and did no math, but followed the same execution path as our raytracer: same timing
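The cost of execution incoherence can be illustrated with a small model. This is a simplified sketch, not a measurement from the talk: it assumes a lockstep group must run for as many loop iterations as its slowest thread, with finished threads idling, and the function name `lockstep_utilization` and the iteration counts are hypothetical.

```python
def lockstep_utilization(iterations_per_thread):
    """Useful work divided by issued work for one lockstep thread group.

    Assumed model: all threads share a program counter, so the group runs
    for max(iterations) steps and threads that finish early sit idle.
    """
    if not iterations_per_thread or max(iterations_per_thread) <= 0:
        raise ValueError("need at least one thread doing some work")
    issued = len(iterations_per_thread) * max(iterations_per_thread)
    return sum(iterations_per_thread) / issued


GROUP_SIZE = 48  # thread group width quoted in the talk

coherent = [20] * GROUP_SIZE                 # every ray takes the same path
divergent = [10] * 24 + [30] * 24            # half the rays need 3x the steps

print(f"coherent group:  {lockstep_utilization(coherent):.0%}")
print(f"divergent group: {lockstep_utilization(divergent):.0%}")
```

In this model the divergent group does the same total useful work as the coherent one but occupies the ALUs for longer, which is the kind of loss that extra SIMD lanes cannot recover.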

Raytracing rate vs # bounces
[Chart: Kitchen Scene; series: single, packets]

Conclusion
KD-Tree traversal with short stack
–Allows efficient GPU kd-tree
  Small, bounded state per ray
  Only visits 3% more nodes than a full stack
Raytracer is compute bound
–No longer memory bound
Also SIMD bound
–Running at 40-60% peak
–Can only use more ALUs if they are not SIMD

Acknowledgements
Tim Foley
Ian Buck, Mark Segal, Derek Gerstmann
Department of Energy
Rambus Graduate Fellowship
ATI Fellowship Program
Intel Fellowship Program

Questions? Feel free to ask questions! Source Available at

Relative Speedup
[Chart: relative speedup over previous GPU raytracer]