Introduction to Realtime Ray Tracing (Course 41)
Philipp Slusallek, Peter Shirley, Bill Mark, Gordon Stoll, Ingo Wald


Hardware for Realtime Ray Tracing
- Custom Hardware for Realtime Ray Tracing
  - Characteristics and requirements
  - RPU Design and Implementation
    - GPU + Recursion + Custom Traversal HW
  - Programming Model
  - FPGA Prototype
  - Performance and Scalability

Ray Tracing on CPUs
- Characteristics
  - Commodity, well understood HW
  - High FP performance, yet still too slow
  - Limited parallelism, bulky clusters
  - Poor silicon usage (e.g. cache)
- Outlook
  - Multi-core designs are coming
  - Will still take too long

Ray Tracing on GPUs
- Characteristics
  - Very high raw FP performance
  - High degree of parallelism
  - Fast development cycle
- Stream programming model
  - Still too limited for efficient ray tracing
    - No support for recursion
    - Limited memory access

Ray Tracing Characteristics: kd-Tree Traversal
- One-dimensional computation along ray
  - Compute location of d relative to t_min / t_max
  - Iterate or recurse with updated t_min / t_max
- Cases for the split-plane distance d
  - Near: t_min < t_max < d
  - Both: t_min < d < t_max
  - Far: d < t_min < t_max

Ray Tracing Characteristics: kd-Tree Traversal
- Inner traversal loop
    tmp  = node.split - ray.origin
    d    = tmp * 1/ray.direction
    near = d > t_min
    far  = d < t_max
    if (near & far) push(node.far, d, t_max)
    if (near) iterate(node.near, t_min, d)
    else      iterate(node.far, d, t_max)
- Advantages of using kd-trees
  - Simple and fast traversal & building algorithm
  - Robust & very good handling of large scenes
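The inner loop above can be sketched for a whole ray in Python. This is a minimal illustration, not the RPU implementation: the `Node` layout, the choice of the near child by the sign of the ray direction, the handling of rays parallel to the split plane, and keeping the interval unchanged when only one child is visited are all assumptions that the slide does not spell out. Early termination on a hit is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    # Hypothetical node layout for illustration only.
    axis: int = -1                      # split axis 0/1/2; -1 marks a leaf
    split: float = 0.0                  # split-plane position on that axis
    left: Optional["Node"] = None       # child below the split plane
    right: Optional["Node"] = None      # child above the split plane
    items: List[str] = field(default_factory=list)  # leaf payload


def traverse(root, origin, direction, t_min, t_max):
    """Return leaf payloads in front-to-back order along the ray."""
    visited = []
    stack = []                          # deferred (far child, t_min, t_max)
    node = root
    while True:
        while node.axis >= 0:           # descend until a leaf is reached
            o, v = origin[node.axis], direction[node.axis]
            if v == 0.0:                # parallel ray stays on one side
                node = node.left if o < node.split else node.right
                continue
            near_child, far_child = (
                (node.left, node.right) if v > 0 else (node.right, node.left)
            )
            d = (node.split - o) / v    # distance to the split plane
            near = d > t_min
            far = d < t_max
            if near and far:            # both children: defer the far one
                stack.append((far_child, d, t_max))
                node, t_max = near_child, d
            elif near:
                node = near_child
            else:
                node = far_child
        visited.extend(node.items)
        if not stack:
            return visited
        node, t_min, t_max = stack.pop()
```

For a root split at x = 1 with leaves "A" (left) and "B" (right), a ray along +x from the origin visits A and then B, while a ray whose interval ends before the plane visits only A.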

Ray Tracing Characteristics: kd-Tree Traversal
- Traversal Processing
  - k-D traversal steps: 10 instructions/step -> many instructions -> many clock cycles
  - Serial dependency -> low pipeline efficiency, stalls, latency
  - Limited but flexible control flow and memory access
-> Custom HW unit
  - One clock tick per traversal step (fully pipelined)
  - Up to 100:1 improvement

Ray Tracing Characteristics: Intersection
- Intersection computation
  - Triggered by traversal at every leaf node
    - Called with: ray and address of geometry
  - Option 1: Custom hardware [SaarCOR’05]
  - Option 2: Software on programmable processor
    - Can be implemented efficiently
    - Enables arbitrary programmable primitives
-> Do not use costly dedicated hardware

Ray Tracing Characteristics: Shading
- Shading computation
  - Triggered by finished ray traversal
    - Called with: ray, hit point, shader-id, address of parameters
  - Characteristics:
    - General-purpose computation, many 3-/4-vectors
    - Needs support for efficient texture and memory access
    - Needs support for arbitrary recursive tracing of rays
      - E.g. support dependent ray tracing
-> Main feature of ray tracing: Do not put limits on it

Ray Tracing Characteristics: Coherence
- Ray coherence
  - Neighboring primary rays
    - Traverse highly similar kd-nodes in the same order
    - Often hit the same geometric primitives
    - Often execute the same shader, access the same textures, …
  - Similar for shadow rays to one light source
  - Often (but not always) applies to secondary rays
-> HW should take advantage of this coherence

Previous Work
- SaarCOR I
  - Fixed function ray tracing chip [GH’05]

RPU Approach
- Take GPUs as basis and core component
  - Highly parallel, highly efficient
- Improve programming model
  - Add efficient recursion, conditionals
  - Add memory access options
- Add custom traversal unit
  - Slave to RPU
  - Performs indirect, data-dependent function calls

RPU Design
- Shader Processing Units (SPU)
  - General purpose computation
  - For shading, geometry, lighting computations
  - Operates on 4-component vectors
  - Integer and float
  - Dual issue, split vector
  - GPU-like instruction set
  - Arbitrary read/write
  - Texture addressing mode
  - No texture filtering -> SW

RPU Design
- Shader Processing Units (SPU)
- Custom Ray Traversal Unit (TPU)
  - Efficient traversal of k-D trees
  - Communicates with SPU over dedicated registers

RPU Design
- Shader Processing Units (SPU)
- Custom Ray Traversal Unit (TPU)
- Multi-Threading
  - Increases usage of HW resources
  - Hides latency due to
    - Memory access
    - Instruction dependencies
    - Long traversal operations
  - Separate thread pool for SPU & TPU
  - Software scheduling (compiler)
  - No overhead for switching threads
  - Increases resources (mainly register file)

RPU Design
- Shader Processing Units (SPU)
- Custom Ray Traversal Unit (TPU)
- Multi-Threading
- Chunking
  - SIMD execution (SPUs & TPUs)
  - Takes advantage of coherence
  - Reduces hardware complexity
  - Can combine memory requests
  - Reduces external bandwidth
  - Must allow for incoherence
    - Chunks may split at conditionals
    - Inactive sub-chunk put on stack
    - Masked execution
    - Worst case: serial computation
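The chunk-splitting behavior listed above can be illustrated with a small Python model of one conditional. It is a sketch under stated assumptions, not the RPU control logic: the function name `execute_conditional`, the boolean-list masks, and the pass counter are hypothetical and exist only to show how an incoherent chunk costs extra passes.

```python
def execute_conditional(chunk, cond, then_op, else_op):
    """Model `if cond(v): v = then_op(v) else: v = else_op(v)` for one chunk.

    Returns the per-lane results and the number of SIMD passes needed.
    """
    results = list(chunk)
    then_mask = [bool(cond(v)) for v in chunk]
    else_mask = [not taken for taken in then_mask]

    stack = []                                   # parked (operation, mask) pairs
    if any(then_mask) and any(else_mask):
        # Incoherent chunk: it splits, and the inactive sub-chunk is parked.
        stack.append((else_op, else_mask))
        active = (then_op, then_mask)
    elif any(then_mask):
        active = (then_op, then_mask)            # coherent: all lanes take "then"
    else:
        active = (else_op, else_mask)            # coherent: all lanes take "else"

    passes = 0
    while active is not None:
        op, mask = active
        # Masked execution: every lane steps together, masked lanes are untouched.
        for lane, enabled in enumerate(mask):
            if enabled:
                results[lane] = op(results[lane])
        passes += 1
        active = stack.pop() if stack else None
    return results, passes
```

A coherent chunk finishes in one pass; a chunk whose lanes disagree needs one pass per sub-chunk, which is the serialization cost the slide refers to.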

RPU Design
- Shader Processing Units (SPU)
- Custom Ray Traversal Unit (TPU)
- Multi-Threading
- Chunking
- Mailbox Processing (MPU)
  - Per thread caching mechanism
  - Avoids multiple processing of same kd-tree entry (e.g. triangle)
  - 10x performance for some scenes
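The mailbox idea can be sketched as a small per-ray cache of recently processed entry IDs. The cache size of 4, the FIFO replacement, and the names `Mailbox` and `count_intersection_tests` are assumptions for illustration; the slides do not describe how the MPU is organized.

```python
from collections import deque


class Mailbox:
    """Per-ray cache of recently processed kd-tree entries (hypothetical layout)."""

    def __init__(self, size=4):
        self.recent = deque(maxlen=size)   # oldest ID is evicted first

    def should_test(self, prim_id):
        """Return False if this entry was processed recently for this ray."""
        if prim_id in self.recent:
            return False
        self.recent.append(prim_id)
        return True


def count_intersection_tests(leaves, mailbox=None):
    """Count the intersection tests one ray runs over the leaves it visits."""
    tests = 0
    for leaf in leaves:
        for prim_id in leaf:
            if mailbox is None or mailbox.should_test(prim_id):
                tests += 1
    return tests
```

With leaves `[[1, 2], [2, 3], [3, 1]]`, a ray runs 6 tests without a mailbox and 3 with a 4-entry mailbox, because primitives that overlap several leaves are tested only once.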

RPU Architecture

SPU Vector Registers
- All registers have 4 components (float or integer)
- R0 to R15: General registers
  - Index into a HW managed register stack
  - Allows for single-cycle function call
- P0 to P15: shader parameters
- I0 to I3: data read from memory
- A = (A0,A1,A2,A3)
  - Memory addressing
- ORG, DIR, ...
  - TPU communication registers

Instruction Set of SPU
- Short vector instruction set
  - mov, add, mul, mad, frac
  - dph2, dp3, dph3, dp4
- Input modifiers
  - Swizzling, negation, masking
  - Multiply with power of 2
- Special operations (modifiers)
  - rcp, rsq, sat
- Fast 2D texture lookups
  - texload, texload4x
- Read from and write to memory
  - load, load4x, store
- Ray traversal operation
  - trace
- Conditional instructions (paired)
  - if jmp label
  - if call
  - if return
- Dual issue (pairing)
  - 3/1 and 2/2 arithmetic splitting
  - Arithmetic + load
  - Arithmetic + conditional jump, call, return


Ray Triangle Intersection
Unit-Triangle Test

    ; load triangle transformation
    load4x A.y,0
    ; transform ray
    dp3_rcp R7.z,I2,R3
    dp3 R7.y,I1,R3
    dp3 R7.x,I0,R3
    dph3 R6.x,I0,R2
    dph3 R6.y,I1,R2
    dph3 R6.z,I2,R2
    ; compute hit distance
    mul R8.z,-R6.z,S.z + if z <0 return
    ; barycentric coordinates
    mad R8.xy,R8.z,R7,R6 + if or xy (<0 or >=1) return
    ; hit if u + v < 1
    add R8.w,R8.x,R8.y + if w >=1 return
    ; hit distance closer than last one?
    add R8.w,R8.z,-R4.z + if w >=0 return
    ; save hit information
    mov SID,I3.x + mov MAX,R8.z
    mov R4.xyz,R8 + return

Highlighted in the listing: input; arithmetic (dot products); multi-issue (arithmetic & conditional).
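The listing above can be mirrored step by step in Python. This is a reading of the assembly, not a verified port: it assumes that I0 to I2 hold the rows of an affine map into the space where the triangle is the unit triangle in the z = 0 plane, that R2 and R3 hold the ray origin and direction, and that R4.z holds the closest hit distance so far. The helper names and the explicit check for a ray parallel to the triangle plane are additions for illustration.

```python
def dp3(a, b):
    """3-component dot product."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dph3(row, point):
    """Homogeneous dot product: `point` is treated as having w = 1."""
    return dp3(row, point) + row[3]


def unit_triangle_hit(rows, origin, direction, t_closest):
    """Return (t, u, v) for a hit closer than t_closest, else None.

    rows: three 4-vectors (assumed to be I0..I2) mapping world space into
    unit-triangle space.
    """
    d = [dp3(r, direction) for r in rows]      # transformed direction (R7)
    o = [dph3(r, origin) for r in rows]        # transformed origin (R6)
    if d[2] == 0.0:
        return None                            # parallel to the triangle plane
    t = -o[2] / d[2]                           # hit distance (R8.z)
    if t < 0.0:
        return None
    u = t * d[0] + o[0]                        # barycentric coordinates (R8.xy)
    v = t * d[1] + o[1]
    if u < 0.0 or u >= 1.0 or v < 0.0 or v >= 1.0:
        return None
    if u + v >= 1.0:                           # outside the unit triangle
        return None
    if t - t_closest >= 0.0:                   # not closer than the last hit
        return None
    return t, u, v
```

With the identity transform, a ray from (0.25, 0.25, 1) along -z hits at t = 1 with u = v = 0.25, while the same ray from (0.75, 0.75, 1) misses because u + v >= 1.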

Shader Processing Unit Pipelining
[Figure: SPU pipeline diagram. Labeled stages: Read Instruction; Read 3 Source Registers; Swizzling; Masking; Memory Access; Writeback I0 - I3; Clamp; RCP, RSQ; Thread Control; Branching; Stack Control; Writeback; Masking. Example instructions shown: mov R0,R1; mov R2,R3; mov R0,R2.]

RPU Programming Model
[Figure: shader call graph. Nodes: Primary Ray Shader, Top-Level Object Intersector, Geometry Intersector, Surface/BRDF Shader, Lighting Shader, Light Source Shader, connected through TPU/MPU units by primary rays, secondary rays, and shadow rays. Legend: ↨ direct function calls; ↔ indirect function calls via TPU; SPU processing; TPU/MPU processing.]

RPU Programming Model
- Threads are started for each pixel
- Registers initialized from an input stream
  - 2D Hilbert curve generator sampling the screen
  - Memory stream for multi-pass
- Shader computes ray
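A 2D Hilbert curve generator of the kind mentioned above can be sketched with the standard index-to-coordinate conversion. This is a software illustration only: the function names are hypothetical, the grid side `n` is assumed to be a power of two, and the slides do not say how the hardware generator is built. The point of this ordering is that consecutive samples are always neighboring pixels, which keeps successive rays coherent.

```python
def hilbert_d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two and 0 <= d < n * n.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                     # rotate the quadrant if needed
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_pixels(n):
    """Yield every pixel of an n x n tile in Hilbert-curve order."""
    for d in range(n * n):
        yield hilbert_d2xy(n, d)
```

Iterating `hilbert_pixels(8)` visits all 64 pixels exactly once, and each pixel is one step away from the previous one.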

RPU Programming Model
- Shooting Primary Rays
  - Ray traversal performed on the TPU
  - Started in top-level kd-tree
  - Intersector transforms ray into local coordinate system

RPU Programming Model
- Shooting Primary Rays (II)
  - Transformed ray traversed through object kd-tree on TPU
  - Geometry intersection performed on programmable SPU
  - Programmable geometry: triangles, spheres, bicubic splines, quadrics, …

RPU Programming Model
- Surface shading performed on programmable SPU
  - Surface shader is called directly from primary shader
  - Arguments passed on HW stack
  - May trace secondary rays at any time: reflection, refraction, …
  - Writing shaders is easy due to global access to the scene and physically-based computation

RPU Programming Model
- Light properties and illumination can be abstracted using function calls
- Illumination shader iterates over all light sources
- For each light source a light source shader is called
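The function-call abstraction described above can be sketched as follows. Everything here is illustrative: the names `point_light`, `directional_light`, and `illuminate`, the simple diffuse (cosine) weighting, and the `occluded` callback standing in for a shadow ray are assumptions, not the shaders used on the RPU.

```python
def point_light(position, intensity):
    """Light source shader for a point light with inverse-square falloff."""
    def shade(hit_point):
        to_light = [p - h for p, h in zip(position, hit_point)]
        dist = sum(c * c for c in to_light) ** 0.5
        direction = [c / dist for c in to_light]
        return direction, dist, intensity / (dist * dist)
    return shade


def directional_light(direction, intensity):
    """Light source shader for a light at infinity (direction points to it)."""
    def shade(hit_point):
        return list(direction), float("inf"), intensity
    return shade


def illuminate(hit_point, normal, lights, occluded):
    """Illumination shader: iterate over all lights and sum their contribution.

    occluded(point, direction, distance) stands in for tracing a shadow ray.
    """
    total = 0.0
    for light in lights:                          # one call per light source
        direction, dist, radiance = light(hit_point)
        cos_theta = max(0.0, sum(n * d for n, d in zip(normal, direction)))
        if cos_theta > 0.0 and not occluded(hit_point, direction, dist):
            total += radiance * cos_theta
    return total
```

Because each light is just a callable, new light types can be added without touching the illumination loop, which is the benefit the slide attributes to function calls.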

Prototype Implementation

Prototype Performance
- FPGA prototype
  - Xilinx Virtex II 6000
  - 128 MB DDR-RAM at 350 MB/s
  - PCI bus for up-/download (no VGA)
- Single RPU at only 66 MHz
  - Up to 4 million rays per second
  - Up to x384
  - Same ray tracing performance as Intel 2.66 GHz

Scalability
- Larger Chunk Size
  - Less ray coherence
  - More data is accessed
  - Increased cache bandwidth
  - Larger caches

Scalability
- Larger Chunk Size
- Multiple RPUs on a Chip
  - Limited by VLSI technology
    - Memory bandwidth
  - FPGA prototype versus current GPUs
    - Floating point units: 50x
    - Memory bandwidth: 100x
    - Clock rate: 7x

Scalability
- Larger Chunk Size
- Multiple RPUs on a Chip
- Multiple chips on a board
  - Fast interconnect for data exchange
  - Cache sizes accumulate
  - Managed through virtual memory [Schmittler’2003]
  - Limited through external bandwidth due to scene changes

Scalability
- Larger Chunk Size
- Multiple RPUs on a Chip
- Multiple chips on a board
- Multiple boards in a PC
  - Similar to today’s PC clusters in a much smaller form factor

Video

Future Work
- Support for fully dynamic scenes
  - Vertex shader + building kd-trees
- Efficient photon mapping
  - kd-tree construction + kNN filtering
- OpenRT-API [Dietrich’03]
- ASIC prototype

Questions?