Introduction to Realtime Ray Tracing Course 41 Philipp Slusallek Peter Shirley Bill Mark Gordon Stoll Ingo Wald.


1 Introduction to Realtime Ray Tracing Course 41 Philipp Slusallek Peter Shirley Bill Mark Gordon Stoll Ingo Wald

2 Hardware for Realtime Ray Tracing Custom Hardware for Realtime Ray Tracing – Characteristics and requirements – RPU Design and Implementation GPU + Recursion + Custom Traversal HW – Programming Model – FPGA Prototype – Performance and Scalability

3 Ray Tracing on CPUs Characteristics – Commodity, well understood HW – High FP performance, yet still too slow – Limited parallelism, bulky clusters – Poor silicon usage (e.g. cache) Outlook – Multi-core designs are coming – Will still take too long

4 Ray Tracing on GPUs Characteristics – Very high raw FP performance – High degree of parallelism – Fast development cycle Stream programming model – Still too limited for efficient ray tracing No support for recursion Limited memory access

5 Ray Tracing Characteristics: kd-Tree Traversal One-dimensional computation along ray – Compute location of split distance d relative to t_min / t_max – Iterate or recurse with updated t_min / t_max
[Figure: ray interval t_min..t_max and split distance d, three cases]
Near: t_min < t_max < d
Both: t_min < d < t_max
Far: d < t_min < t_max

6 Ray Tracing Characteristics: kd-Tree Traversal
Inner traversal loop
    tmp = node.split - ray.origin
    d = tmp * 1/ray.direction
    near = d > t_min
    far = d < t_max
    if (near & far) push(node.far, d, t_max)
    if (near) iterate(node.near, t_min, d)
    else iterate(node.far, d, t_max)
Advantages of using kd-trees – Simple and fast traversal & building algorithm – Robust & very good handling of large scenes
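The inner loop on this slide can be tried out in software. The following is a minimal Python sketch, not the RPU's traversal unit: the `Node` class, the `traverse` function, and the choice of near/far child by the sign of the ray direction are assumptions made for illustration. Unlike the slide's pseudocode, the sketch leaves the interval unchanged when only one child is visited, and it assumes every direction component is non-zero.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical kd-tree node: inner nodes have axis/split/children, leaves have leaf_id."""
    axis: int = 0
    split: float = 0.0
    left: Optional["Node"] = None   # child below the split plane
    right: Optional["Node"] = None  # child above the split plane
    leaf_id: Optional[int] = None   # set only on leaves


def traverse(root, origin, direction, t_min, t_max):
    """Return the ids of the leaves pierced by the ray, in front-to-back order."""
    visited = []
    stack = []  # deferred far children with their ray intervals
    node = root
    while True:
        while node.leaf_id is None:
            a = node.axis
            # Ray parameter at which the ray crosses the split plane.
            d = (node.split - origin[a]) / direction[a]
            # Assumption: the child on the ray's incoming side is "near".
            if direction[a] > 0:
                near_child, far_child = node.left, node.right
            else:
                near_child, far_child = node.right, node.left
            near = d > t_min
            far = d < t_max
            if near and far:
                stack.append((far_child, d, t_max))  # visit far side later
                node, t_max = near_child, d
            elif near:
                node = near_child
            else:
                node = far_child
        visited.append(node.leaf_id)
        if not stack:
            return visited
        node, t_min, t_max = stack.pop()
```

For example, with a root split at x = 0.5 whose right child is split again at y = 0.5, a diagonal ray from the origin visits the left leaf first and then the upper-right leaf. A real ray tracer would intersect the primitives in each leaf and stop at the first hit inside the leaf's interval.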

7 Ray Tracing Characteristics: kd-Tree Traversal Traversal Processing – 50-80 kd-tree steps per ray @ 10 instructions/step → many instructions → many clock cycles – Serial dependency → low pipeline efficiency, stalls, latency – Limited but flexible control flow and memory access → Custom HW unit – One clock tick per traversal step (fully pipelined) – Up to 100:1 improvement

8 Ray Tracing Characteristics: Intersection Intersection computation – Triggered by traversal at every leaf node Called with: ray and address of geometry – Option 1: Custom hardware [SaarCOR’05] – Option 2: Software on programmable processor Can be implemented efficiently Enables arbitrary programmable primitives → Do not use costly dedicated hardware

9 Ray Tracing Characteristics: Shading Shading computation – Triggered by finished ray traversal Called with: ray, hit point, shader-id, address of parameters – Characteristics: General-purpose computation, many 3-/4-vectors Needs support for efficient texture and memory access Needs support for arbitrarily recursive tracing of rays – E.g. support dependent ray tracing → Main feature of ray tracing: Do not put limits on it

10 Ray Tracing Characteristics: Coherence Ray coherence – Neighboring primary rays Traverse very similar kd-tree nodes in the same order Often hit the same geometric primitives Often execute the same shader, access the same textures, … – Similar for shadow rays to one light source – Often (but not always) applies to secondary rays → HW should take advantage of this coherence

11 Previous Work SaarCOR I – Fixed function ray tracing chip [GH’05]

12 RPU Approach Take GPUs as basis and core component – Highly parallel, highly efficient Improve programming model – Add efficient recursion, conditionals – Add memory access options Add custom traversal unit – Slave to RPU – Performs indirect, data-dependent function calls

13 RPU Design  Shader Processing Units (SPU) -General purpose computation -For shading, geometry, lighting computations -Operates on 4-component vectors -Integer and float -Dual issue, split vector -GPU-like instruction set -Arbitrary read/write -Texture addressing mode -No texture filtering (left to software)

14 RPU Design  Shader Processing Units (SPU)  Custom Ray Traversal Unit (TPU) -Efficient traversal of k-D trees -Communicates with SPU over dedicated registers

15 RPU Design  Shader Processing Units (SPU)  Custom Ray Traversal Unit (TPU)  Multi-Threading -Increases usage of HW resources -Hides latency due to -Memory access -Instruction dependencies -Long traversal operations -Separate thread pool for SPU & TPU -Software scheduling (compiler) -No overhead for switching threads -Increases resources (mainly register file)

16 RPU Design  Shader Processing Units (SPU)  Custom Ray Traversal Unit (TPU)  Multi-Threading  Chunking -SIMD execution (SPUs & TPUs) -Takes advantage of coherence -Reduces hardware complexity -Can combine memory requests -Reduces external bandwidth -Must allow for incoherence -Chunks may split at conditionals -Inactive sub-chunk put on stack -Masked execution -Worst case: serial computation
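Chunk splitting at a conditional can be illustrated in software. The sketch below is an assumption-laden Python model, not the RPU hardware: `run_conditional` and its arguments are hypothetical names, a chunk is a plain list of lane values with a boolean activity mask, and the pass counter only shows how incoherence costs extra passes.

```python
def run_conditional(values, mask, cond, then_fn, else_fn):
    """Masked SIMD-style execution of `if cond: then_fn else: else_fn` on one chunk.

    Returns the updated lane values and the number of passes executed.
    """
    taken = [m and cond(v) for v, m in zip(values, mask)]
    not_taken = [m and not t for m, t in zip(mask, taken)]

    # The sub-chunk that is not executed first waits on a stack.
    stack = []
    if any(not_taken):
        stack.append((not_taken, else_fn))
    if any(taken):
        stack.append((taken, then_fn))

    out = list(values)
    passes = 0
    while stack:
        sub_mask, fn = stack.pop()
        passes += 1
        # Only lanes enabled in the sub-chunk mask are written.
        out = [fn(v) if m else v for v, m in zip(out, sub_mask)]
    return out, passes
```

A coherent chunk (all lanes take the same branch) finishes in one pass; a chunk that diverges needs two. If every conditional splits the chunk further, execution degrades toward one lane at a time, which is the serial worst case named on the slide.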

17 RPU Design  Shader Processing Units (SPU)  Custom Ray Traversal Unit (TPU)  Multi-Threading  Chunking  Mailbox Processing (MPU)  Per thread caching mechanism  Avoids multiple processing of same kd-tree entry (e.g. triangle)  10x performance for some scenes
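The idea behind mailboxing can be sketched as a small per-ray cache of primitive ids that have already been tested. The Python below is only an illustration: the `Mailbox` class, its direct-mapped layout, and the slot count are assumptions, since the slide says no more than that the mechanism is a per-thread cache.

```python
class Mailbox:
    """Hypothetical direct-mapped cache of already-tested primitive ids (one per ray)."""

    def __init__(self, slots=4):
        self.slots = [None] * slots

    def already_tested(self, prim_id):
        """Return True if prim_id is cached; otherwise record it and return False."""
        i = prim_id % len(self.slots)
        if self.slots[i] == prim_id:
            return True
        self.slots[i] = prim_id  # may evict another id, which then gets retested
        return False


def count_tests(leaves, mailbox=None):
    """Count intersection tests for one ray visiting `leaves` (lists of primitive ids)."""
    tests = 0
    for leaf in leaves:
        for prim_id in leaf:
            if mailbox is not None and mailbox.already_tested(prim_id):
                continue  # primitive spans several leaves, skip the repeat test
            tests += 1
    return tests
```

With leaves `[[1, 2], [2, 3], [3, 1]]`, six tests drop to three with a four-slot mailbox. A one-slot mailbox still saves two tests, because evictions only cause redundant work, never missed intersections.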

18 RPU Architecture

19 SPU Vector Registers All registers have 4 components (float or integer) R0 to R15: General registers – Index into a HW managed register stack – Allows for single-cycle function call P0 to P15: shader parameters I0 to I3: data read from memory A = (A0,A1,A2,A3) – Memory addressing ORG, DIR,... – TPU communication registers

20 Instruction Set of SPU
Ray traversal operation: trace
Conditional instructions (paired): if jmp label, if call, if return
Dual issue (pairing): 3/1 and 2/2 arithmetic splitting; Arithmetic + load; Arithmetic + conditional jump, call, return
Short vector instruction set: mov, add, mul, mad, frac, dph2, dp3, dph3, dp4
Input modifiers: Swizzling, negation, masking; Multiply with power of 2
Special operations (modifiers): rcp, rsq, sat
Fast 2D texture lookups: texload, texload4x
Read from and write to memory: load, load4x, store

26 Ray Triangle Intersection: Unit-Triangle Test
    ; load triangle transformation
    load4x A.y,0
    ; transform ray
    dp3_rcp R7.z,I2,R3
    dp3 R7.y,I1,R3
    dp3 R7.x,I0,R3
    dph3 R6.x,I0,R2
    dph3 R6.y,I1,R2
    dph3 R6.z,I2,R2
    ; compute hit distance
    mul R8.z,-R6.z,S.z
    + if z <0 return
    ; barycentric coordinates
    mad R8.xy,R8.z,R7,R6
    + if or xy (<0 or >=1) return
    ; hit if u + v < 1
    add R8.w,R8.x,R8.y
    + if w >=1 return
    ; hit distance closer than last one?
    add R8.w,R8.z,-R4.z
    + if w >=0 return
    ; save hit information
    mov SID,I3.x
    + mov MAX,R8.z
    mov R4.xyz,R8
    + return
Annotations (figure): Input; Arithmetic (dot products); Multi-issue (arith. & cond.)
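The unit-triangle test may be easier to follow in a high-level language. The Python sketch below mirrors the same steps (transform the ray, compute the hit distance, check the barycentric coordinates, compare with the closest hit so far), but it is an illustration under assumptions: the helper names are made up, the transformation is computed on the fly rather than loaded from memory, and the triangle is assumed to be non-degenerate.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def scale(a, s):
    return tuple(x * s for x in a)


def unit_triangle_transform(a, b, c):
    """Affine map taking triangle (a, b, c) to (0,0,0), (1,0,0), (0,1,0).

    Returned as three (row, offset) pairs, i.e. one affine dot product per axis.
    """
    e1, e2 = sub(b, a), sub(c, a)
    n = cross(e1, e2)
    det = dot(e1, cross(e2, n))  # zero only for a degenerate triangle
    rows = (scale(cross(e2, n), 1.0 / det),
            scale(cross(n, e1), 1.0 / det),
            scale(cross(e1, e2), 1.0 / det))
    return [(r, -dot(r, a)) for r in rows]


def intersect(xf, origin, direction, closest_so_far):
    """Return (t, u, v) for a hit closer than closest_so_far, else None."""
    o = [dot(r, origin) + off for r, off in xf]  # transformed origin (point)
    d = [dot(r, direction) for r, _ in xf]       # transformed direction (vector)
    if d[2] == 0:
        return None                              # ray parallel to the triangle plane
    t = -o[2] / d[2]                             # hit distance with the plane z = 0
    if t < 0:
        return None
    u = o[0] + t * d[0]                          # barycentric coordinates
    v = o[1] + t * d[1]
    if u < 0 or v < 0 or u + v >= 1:
        return None
    if t >= closest_so_far:
        return None
    return t, u, v
```

Because the transformation is affine, the ray parameter `t` found in unit-triangle space is also the parameter along the original ray, so no back-transformation is needed for the hit distance.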

27 Shader Processing Unit Pipelining [Figure: SPU pipeline diagram. Recoverable labels: Read Instruction; Read 3 Source Registers; Swizzling; multiply (*) and add (+) stages; Clamp; RCP, RSQ; Masking; Writeback; Memory Access with Writeback to I0 – I3; Thread Control; Branching; Stack Control. Example instructions shown: mov R0,R1; mov R2,R3; mov R0,R2]

28 RPU Programming Model ↨: Direct function calls ↔: Indirect function calls via TPU [Figure: shader call graph. SPU processing labels: Primary Ray Shader, Top-Level Object Intersector, Geometry Intersector, Surface/BRDF Shader, Lighting Shader, Light Source Shaders. TPU/MPU processing blocks are labeled with primary ray, secondary rays, and shadow rays.]

30 RPU Programming Model Threads are started for each pixel Registers initialized from an input stream – 2D Hilbert curve generator sampling the screen – Memory stream for multi-pass Shader computes ray
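A Hilbert curve visits pixels so that consecutive samples are always neighbors on screen, which keeps consecutive rays coherent. The Python sketch below uses the standard index-to-coordinate conversion for a square grid whose side is a power of two; it is not the RPU's generator, and `hilbert_d2xy` is an illustrative name. Covering a 512x384 screen with it would need extra handling, for example walking a 512x512 curve and skipping rows outside the image, which is an assumption rather than something the slides describe.

```python
def hilbert_d2xy(n, d):
    """Map index d (0 <= d < n*n) to (x, y) on an n x n Hilbert curve, n a power of two."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the quadrant before rotating it.
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x  # rotate by swapping the axes
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


# Pixel visiting order for a small 8 x 8 tile.
order = [hilbert_d2xy(8, d) for d in range(64)]
```

Every pixel of the tile appears exactly once in `order`, and each step moves to a horizontally or vertically adjacent pixel.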

32 RPU Programming Model Shooting Primary Rays – Ray traversal performed on the TPU – Started in top-level kd-tree – Intersector transforms ray into local coordinate system

35 RPU Programming Model Shooting Primary Rays (II) – Transformed ray traversed through object kd-tree on TPU – Geometry intersection performed on programmable SPU – Programmable geometry: triangles, spheres, bicubic splines, quadrics, …

37 RPU Programming Model Surface shading performed on programmable SPU – Surface shader is called directly from primary shader – Arguments passed on HW stack – May trace secondary rays at any time: reflection, refraction, … – Writing shaders is easy due to global access to the scene and physically-based computation

38 RPU Programming Model Light properties and illumination can be abstracted using function calls – Illumination shader iterates over all light sources – For each light source a light source shader is called
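The separation between an illumination shader and per-light shaders can be sketched with plain function calls. The Python below is only an illustration of that structure: `point_light`, `illuminate`, the diffuse cosine term, and the `occluded` callback standing in for a shadow ray are all assumptions, not the shaders used on the RPU.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scale(a, s):
    return tuple(x * s for x in a)


def point_light(position, intensity):
    """Build a light source shader: hit point -> (direction to light, radiance, distance)."""
    def shader(hit_point):
        to_light = sub(position, hit_point)
        dist2 = dot(to_light, to_light)
        dist = math.sqrt(dist2)
        return scale(to_light, 1.0 / dist), intensity / dist2, dist
    return shader


def illuminate(hit_point, normal, lights, occluded):
    """Illumination shader: iterate over all lights and call each light source shader."""
    total = 0.0
    for light in lights:
        l_dir, radiance, dist = light(hit_point)
        cos_theta = max(0.0, dot(normal, l_dir))
        # `occluded` plays the role of a shadow ray traced toward the light.
        if cos_theta > 0.0 and not occluded(hit_point, l_dir, dist):
            total += radiance * cos_theta
    return total
```

The surface shader only sees the common return values, so new light types can be added by writing another function with the same interface.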

39 Prototype Implementation

40 Prototype Performance FPGA prototype – Xilinx Virtex II 6000 – 128 MB DDR-RAM at 350 MB/s – PCI bus for up-/download (no VGA) Single RPU at only 66 MHz – Up to 4 million rays per second – Up to 20 fps @ 512x384 – Same ray tracing performance as Intel P4 @ 2.66 GHz
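The prototype figures can be cross-checked with a few lines of arithmetic. The Python below uses only the numbers on the slide; the variable names are illustrative, and the reading that the frame rate counts one primary ray per pixel is an assumption.

```python
CLOCK_HZ = 66_000_000          # single RPU at 66 MHz
RAYS_PER_SECOND = 4_000_000    # "up to 4 million rays per second"
WIDTH, HEIGHT, FPS = 512, 384, 20

# Average clock cycles available per ray at the peak ray rate.
cycles_per_ray = CLOCK_HZ / RAYS_PER_SECOND

# Primary rays per second at the quoted resolution and frame rate,
# assuming one primary ray per pixel.
primary_rays_per_second = WIDTH * HEIGHT * FPS

print(cycles_per_ray)            # 16.5
print(primary_rays_per_second)   # 3932160
```

About 3.9 million primary rays per second at 20 fps is in line with the quoted peak of 4 million rays per second, and it corresponds to roughly 16.5 clock cycles per ray.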

41 Scalability Larger Chunk Size – Less ray coherence – More data is accessed – Increased cache bandwidth – Larger caches

42 Scalability Larger Chunk Size Multiple RPUs on a Chip – Limited by VLSI technology Memory bandwidth – FPGA prototype versus current GPUs
    Floating point units: 50x
    Memory bandwidth: 100x
    Clock rate: 7x

43 Scalability Larger Chunk Size Multiple RPUs on a Chip Multiple chips on a board – Fast interconnect for data exchange – Cache sizes accumulate – Managed through virtual memory [Schmittler’2003] – Limited through external bandwidth due to scene changes

44 Scalability Larger Chunk Size Multiple RPUs on a Chip Multiple chips on a board Multiple boards in a PC – Similar to today’s PC clusters in a much smaller form factor

45 Video

46 Future Work Support for fully dynamic scenes – Vertex shader + building kd-trees Efficient photon mapping – kd-tree construction + kNN filtering OpenRT-API [Dietrich’03] ASIC prototype

47 Questions? http://graphics.cs.uni-sb.de http://www.OpenRT.de http://www.SaarCOR.de
