Sven Woop Jörg Schmittler Philipp Slusallek Computer Graphics Lab Saarland University, Germany RPU: A Programmable Ray Processing Unit for Realtime Ray.

Sven Woop Jörg Schmittler Philipp Slusallek Computer Graphics Lab Saarland University, Germany RPU: A Programmable Ray Processing Unit for Realtime Ray Tracing

Motivation
„Some argue that in the very long term, rendering may best be solved by some variant of ray tracing, in which huge numbers of rays sample the environment for the eye’s view of each frame. And there will also be colonies on Mars, underwater cities, and personal jet packs.“
„Real-Time Rendering“, 1999, 1st edition, page 391

Rasterization
Fundamental Operation: Project isolated triangles
– No global access to the scene
All interesting visual effects need 2+ triangles
– Shadows, reflection, global illumination, …
– Requires multiple passes & approximations, has many issues

Ray Tracing
Fundamental Operation: Trace a Ray
– Global scene access
– Individual rays in O(log N)
– Flexibility in space and time
– Automatic combination of visual effects
– Demand driven
– Physical light simulation
– Embarrassingly parallel

OpenRT Project: Realtime Ray Tracing in SW
Results:
– Exploit inherent coherence
– Realtime performance (> 30x)
– Scalability (> 80 CPUs)
– Realtime indirect lighting & caustic computation
– Large model visualization
→ Compact Hardware?

Previous Work
CPUs [Parker’99, Wald’01]
– Flexible, but needs a massive number of CPUs
GPUs [Purcell’02, Carr’02]
– Stream programming model is too limited
Custom HW [Green’91, Hall’01, Kobayashi’02, Schmittler’04]
– Mostly focused on parts of the RT pipeline

RPU Approach
Shading processor
– Design similar to fragment processors on GPUs
– Highly parallel, highly efficient
Improved programming model
– Add highly efficient recursion, conditional branching
– Add flexible memory access (beyond textures)
Custom traversal hardware
– High-performance kd-tree traversal

RPU Design
Shader Processing Units (SPU)
– Intersection, Shading, Lighting, …
– Significant vector processing
– High instruction level parallelism
SPU approach: instruction set similar to GPUs
– 4-vector SIMD
– Dual issue & pairing
– Arithmetic splitting (3+1 and 2+2 vector)
– Arithmetic + load
– Arithmetic + conditional jump, call, return
– High code density
+ Instruction Level Parallelism

Instruction Set of SPU
Short vector instruction set
– mov, add, mul, mad, frac
– dph2, dp3, dph3, dp4
Input modifiers
– Swizzling, negation, masking
– Multiply with power of 2
Special operations (modifiers)
– rcp, rsq, sat
Fast 2D texture addressing
– texload, texload4x
Random bulk memory access
– load, load4x, store
Conditional instructions (paired)
– if jmp label
– if call
– if return
Efficient recursion
– HW-managed register stack
– Single cycle function call
Ray traversal function call
– trace()
+ Instruction Level Parallelism
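
The short-vector operations and input modifiers listed above can be sketched in Python. The slides do not define the semantics, so everything here is an assumption modeled on common GPU shader conventions: `dph3` is taken to be a 3-component dot product plus the w component of the first operand, and the helper names `swizzle` and `write_masked` are hypothetical.

```python
# Illustrative 4-vector helpers; assumed semantics, not the RPU's actual definitions.
_INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}

def swizzle(v, pattern):
    """Reorder or replicate components, e.g. swizzle(v, "wzyx")."""
    return tuple(v[_INDEX[c]] for c in pattern)

def dp3(a, b):
    """3-component dot product (w is ignored)."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def dph3(a, b):
    """Assumed homogeneous dot product: dp3 plus the w of the first operand."""
    return dp3(a, b) + a[3]

def dp4(a, b):
    """Full 4-component dot product."""
    return dp3(a, b) + a[3] * b[3]

def mad(a, b, c):
    """Component-wise multiply-add: a * b + c."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))

def write_masked(dst, src, mask):
    """Write only the components named in mask; keep the others from dst."""
    out = list(dst)
    for c in mask:
        out[_INDEX[c]] = src[_INDEX[c]]
    return tuple(out)
```

With these definitions, `dph3` applies one row of an affine transform to a point, while `dp3` applies the same row to a direction.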

RPU Design
Shader Processing Units (SPU)
Custom Ray Traversal Unit (TPU)
– Uses kd-trees: Flexible and adaptive
– Handles general scenes well [Havran2000]
– Simple traversal algorithm
– But many instruction dependencies in inner loop
– About 10 instructions
– CPU: >100 cycles (un-optimized), ~15 cycles (optimized OpenRT)
– TPU approach: Optimized pipeline
– 1 cycle throughput
+ Optimized Pipelining
+ Instruction Level Parallelism

RPU Design
Shader Processing Units (SPU)
Custom Ray Traversal Unit (TPU)
Multi-Threading
– High computational & memory latency
– Take advantage of thread level parallelism
– High utilization of hardware units
– No overhead for switching threads in HW
+ Thread Level Parallelism
+ Optimized Pipelining
+ Instruction Level Parallelism

RPU Design
Shader Processing Units (SPU)
Custom Ray Traversal Unit (TPU)
Multi-Threading
Chunking
– Takes advantage of ray coherence
– SIMD execution of threads
– Reduces hardware complexity
– Shared processor infrastructure
– Reduces external bandwidth
– Combining memory requests
– Automatic handling of incoherence
– Splitting and masked execution
+ Thread Level Parallelism
+ Optimized Pipelining
+ Instruction Level Parallelism
+ Control Flow Coherence
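
As a rough illustration of masked execution within a chunk, the sketch below runs one conditional over a group of threads in lock-step. It is a software analogy under assumptions, not the RPU's mechanism: the function name `run_chunk` is hypothetical, and only masking is modeled (chunk splitting is left out).

```python
def run_chunk(values, cond, then_op, else_op):
    """Execute an if/else over all threads of a chunk in lock-step.

    Each branch is executed at most once for the whole chunk; an
    activity mask selects which threads keep the result. Returns the
    per-thread results and the number of branch passes executed.
    """
    mask = [cond(v) for v in values]
    out = list(values)
    passes = 0
    if any(mask):
        # At least one thread takes the 'then' side.
        passes += 1
        out = [then_op(v) if m else o for v, m, o in zip(values, mask, out)]
    if not all(mask):
        # At least one thread takes the 'else' side.
        passes += 1
        out = [else_op(v) if not m else o for v, m, o in zip(values, mask, out)]
    return out, passes
```

A coherent chunk (all threads agree on the condition) needs one pass; an incoherent chunk needs two, which is the cost that ray coherence helps avoid.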

RPU Design
Shader Processing Units (SPU)
Custom Ray Traversal Unit (TPU)
Multi-Threading
Chunking
Mailbox Processing Unit (MPU)
– Objects often span many kd-tree cells
– Redundant intersections
– Mailboxing with a small per-chunk cache (e.g. 4 entries)
– Up to 10x performance for some scenes
+ Thread Level Parallelism
+ Optimized Pipelining
+ Instruction Level Parallelism
+ Control Flow Coherence
+ Avoiding Redundancy
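
Mailboxing can be sketched as a tiny cache of recently tested object IDs. The 4-entry size comes from the slide; the FIFO replacement policy and the names `Mailbox`, `should_intersect`, and `count_intersections` are assumptions made for this example.

```python
from collections import deque

class Mailbox:
    """Small cache of recently intersected object IDs (FIFO eviction assumed)."""

    def __init__(self, entries=4):
        self.slots = deque(maxlen=entries)

    def should_intersect(self, object_id):
        """Return False if the object was tested recently, else record it."""
        if object_id in self.slots:
            return False
        self.slots.append(object_id)  # evicts the oldest entry when full
        return True

def count_intersections(cells, mailbox=None):
    """Count intersection tests for a ray visiting kd-tree cells in order.

    Each cell is a list of object IDs; an object spanning several cells
    appears in each of them.
    """
    count = 0
    for cell in cells:
        for obj in cell:
            if mailbox is None or mailbox.should_intersect(obj):
                count += 1
    return count
```

For `cells = [[1, 2], [2, 3], [1, 3]]`, the ray performs 6 tests without a mailbox and 3 with one, since objects 1, 2, and 3 are each tested only once.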

RPU Design
Shader Processing Units (SPU)
Custom Ray Traversal Unit (TPU)
Multi-Threading
Chunking
Mailbox Processing Unit (MPU)
+ Thread Level Parallelism
+ Optimized Pipelining
+ Instruction Level Parallelism
+ Control Flow Coherence
+ Avoiding Redundancy

RPU Architecture
[Block diagram. Internal (RPU): Thread Generator, SPU, TPU, MPU, Vector Cache, Node Cache, List Cache. External: DDR Memory.]

Ray Tracing on an RPU
Thread generation: initialize SPU registers with pixel coordinates
Primary shader generates camera ray and calls trace()
Ray traversal performed on TPU with mailboxing on MPU
Data dependent call to object/intersection shader on SPU
– Programmable geometry (triangles, spheres, quadrics, bicubic splines, …)
Nested ray traversal for object-based dynamic scenes
[Diagram: Thread Generator, Shader, TPU/MPU (top-level traversal), Top-Level Object Intersector, TPU/MPU (bottom-level traversal), Geometry Intersectors]

FPGA prototype
Xilinx Virtex II 6000
– Usage: 99% logic, 70% on-chip memory
– 128 MB DDR-RAM with 350 MB/s
– 24 bit floating point
Configuration of prototype RPU
– 32 threads per SPU: 60% usage
– Chunk size of 4: 95% efficiency
– 12 kB caches in total: 90% hit rate

Performance
Single FPGA at only 66 MHz
– 4 million rays/s x384
– Same performance as a CPU at 40x the clock rate (2.66 GHz), using our highly optimized software (OpenRT with SSE)
Linear scalability with HW resources
– Tested: 4x FPGA → 4x performance
– Independent of scenes

Prototype Performance
Technology: FPGA versus ASIC (GPU)
– Large headroom for scaling performance
                    RPU prototype (FPGA)    GPU (ASIC)    Ratio
Compute             4 Gflops                200 Gflops    50x
Memory bandwidth    0.3 GB/s                30 GB/s       100x
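
The ratios quoted on the performance slides can be checked with a few lines of arithmetic. All input figures are taken from the slides; the variable names and the derived cycles-per-ray figure are this example's own.

```python
# Figures quoted on the slides.
fpga_clock_hz = 66e6       # prototype clock
cpu_clock_hz = 2.66e9      # CPU used for the OpenRT comparison
rays_per_second = 4e6      # prototype throughput

# The CPU runs at roughly 40x the prototype's clock rate.
clock_ratio = cpu_clock_hz / fpga_clock_hz

# Average clock cycles the prototype spends per ray.
cycles_per_ray = fpga_clock_hz / rays_per_second

# Headroom of a GPU-class ASIC over the FPGA prototype.
flops_headroom = 200 / 4          # Gflops: 50x
bandwidth_headroom = 30 / 0.3     # GB/s: about 100x

print(round(clock_ratio, 1), cycles_per_ray, flops_headroom, round(bandwidth_headroom))
```

The cycles-per-ray number is only an average over the whole pipeline and says nothing about the latency of an individual ray.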

Demo

Conclusions
Programmable Architecture for Ray Tracing
– GPU-like shading processor
– More flexible programming model: flexible control flow and memory access, efficient recursion with HW-managed stack
– Efficient traversal through custom hardware
– Realtime performance on FPGA prototype

Outlook
ASIC design
– Evaluate scalability with technology
RPU as a general purpose processor
– Explore non-rendering use of RPU
New fundamental operation: Tracing a ray
– Basis for next generation interactive 3D graphics

Siggraph 2005: More Realtime Ray Tracing
Introduction to Realtime Ray Tracing
– Full day course: Wednesday, Petree Hall D
Booth 1155: Mercury Computer Systems
– Realtime ray tracing product on PC clusters
– Realtime ray tracing on the Cell Processor
– Realtime previewing in Cinema-4D
Booth 1511: SGI
– Ray tracing massive model: Boeing 777

Ray Triangle Intersection
    ; load triangle transformation
    load4x A.y,0
    ; prepare intersection comp.
    dp3_rcp R7.z,I2,R3
    dp3 R7.y,I1,R3
    dp3 R7.x,I0,R3
    dph3 R6.x,I0,R2
    dph3 R6.y,I1,R2
    dph3 R6.z,I2,R2
    ; compute hit distance
    mul R8.z,-R6.z,S.z + if z <0 return
    ; barycentric coordinates
    mad R8.xy,R8.z,R7,R6 + if or xy (<0 or >=1) return
    ; hit if u + v < 1
    add R8.w,R8.x,R8.y + if w >=1 return
    ; hit distance closer than last one?
    add R8.w,R8.z,-R4.z + if w >=0 return
    ; save hit information
    mov SID,I3.x + mov MAX,R8.z
    mov R4.xyz,R8 + return
Annotations: Input; Arithmetic (dot products); Multi-issue (arith. & cond.)
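
The same test can be written out in Python to make the arithmetic easier to follow. This is one reading of the listing, under assumptions: `rows` stands for I0 to I2 (three rows of an affine transform that maps the triangle onto the unit triangle in the z = 0 plane), `origin` for R2, `direction` for R3, and `max_dist` for the previous hit distance. The function name and the explicit parallel-ray check are additions for this example.

```python
def intersect_unit_triangle(rows, origin, direction, max_dist):
    """Return (t, u, v) for a hit closer than max_dist, else None.

    rows: three 4-component rows (a, b, c, d) of an affine transform
    into unit-triangle space; d is the translation part.
    """
    def dp3(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    # Directions ignore translation (dp3); points include it (dph3).
    d = [dp3(r, direction) for r in rows]
    o = [dp3(r, origin) + r[3] for r in rows]

    if d[2] == 0.0:
        return None  # ray parallel to the triangle plane (explicit check added here)

    t = -o[2] / d[2]              # hit distance along the ray
    if t < 0:
        return None
    u = t * d[0] + o[0]           # barycentric coordinates
    v = t * d[1] + o[1]
    if u < 0 or u >= 1 or v < 0 or v >= 1:
        return None
    if u + v >= 1:                # hit only if u + v < 1
        return None
    if t - max_dist >= 0:         # not closer than the last hit
        return None
    return (t, u, v)
```

With an identity transform the triangle has vertices (0,0,0), (1,0,0), and (0,1,0); a ray from (0.25, 0.25, 1) in direction (0, 0, -1) hits it at distance 1 with u = v = 0.25.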

TPU (Traversal Processing Unit)
[Pipeline diagram labels: Read Node (Node Cache); Read Ray; Project Ray: (org_k – split)/dir_k; Decision: d < t_min, near < hit; Update: Node Addr, Near, Far, Stack; to MPU]
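
One kd-tree traversal step, as commonly written in software, can be sketched as follows. This is a generic textbook formulation and not a description of the TPU pipeline: it uses the conventional form d = (split - org_k) / dir_k, which has the opposite sign to the expression in the diagram labels, and the node layout `(axis, split, left, right)` and the function name are assumptions. It also assumes `dirn[k]` is non-zero.

```python
def traverse_step(node, org, dirn, near, far, stack):
    """Advance a ray one inner node down a kd-tree.

    node: (axis k, split position, left child, right child)
    Returns (next_node, near, far). When both children overlap the
    ray interval [near, far], the farther child is pushed on stack.
    """
    k, split, left, right = node
    d = (split - org[k]) / dirn[k]   # ray parameter at the split plane

    # Child entered first depends on the direction sign along axis k.
    first, second = (left, right) if dirn[k] > 0 else (right, left)

    if d >= far:
        return first, near, far      # interval lies entirely before the plane
    if d <= near:
        return second, near, far     # interval lies entirely beyond the plane
    stack.append((second, d, far))   # visit the far side later
    return first, near, d
```

For a node split at x = 5 and a ray from the origin along +x over [0, 10], the step continues into the left child with interval [0, 5] and pushes the right child with interval [5, 10].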

Shader Processing Unit Pipelining
[Pipeline diagram labels: Read Instruction; Read 3 Source Registers; Swizzling; Masking; Clamp; Writeback; Memory Access; Writeback I0 – I3; RCP, RSQ; Thread Control; Branching; Stack Control. Example instructions shown: mov R0,R1; mov R2,R3; mov R0,R2.]