1
DRPU: A Programmable Hardware Architecture for Real-Time Ray Tracing of Coherent Dynamic Scenes
Sven Woop, Computer Graphics Lab, Saarland University

2
Overview
– Motivation: Why Ray Tracing?
– Previous Work
– DRPU Architecture
– FPGA Prototype
– ASIC Performance Estimates
– Conclusion & Future Work

3
Why not Rasterization...
Primitive Operation: Rasterize Isolated Triangles
– Perfect for dynamic scenes
– Very simple operation (good for HW)
– Parallel processing of triangles and fragments (good for HW)
No Global Access to the Scene
– All interesting visual effects need 2+ triangles (shadows, reflection, global illumination, …)
– Approximations via multiple-pass approaches have many issues
Difficult to Use Algorithm
Very Fast Hardware Implementations

4
... but Ray Tracing?
Primitive Operation: Trace a Ray
– O(log n) traversal operation
– Demand driven
Global Access to Scene
– Automatic combination of effects (orthogonal shaders)
– Recursive evaluation
Physical Light Simulation
– Embarrassingly parallel (good for HW)
– Accurate and realistic images
Easy to Use Algorithm
Low Performance

11
Previous Work
Ray Tracers for Static Scenes
– CPU based: [OpenRT], [MLRT SIGGRAPH05]
– GPU based: Purcell (Grids) [SIGGRAPH02], Foley et al. (KD Trees) [GH05], Stefan Popov (Stackless KD Tree traversal) [EG07]
– Custom Hardware: ART-VPS (AR350 Chip for offline rendering), Schmittler (SaarCOR) [GH04], Woop (RPU) [SIGGRAPH05]
Ray Tracers for Dynamic Scenes
– CPU based: Wald (Grids) [SIGGRAPH06], Wald (AABVHs) [TOG / Tech. Rep. 2006], Wächter and Keller (BIH) [EG06], Johannes Günther (Motion Decomposition) [EG06]
– Custom Hardware: Woop (B-KD Trees) [GH06], Woop (DRPU-ASIC) [RT06]

12
Why isn’t everybody using Ray Tracing …
Low Performance
– High computational complexity
– 1 million pixels (minimal)
– 30 frames per second (minimal)
– 10 rays per pixel (minimal)
– At least 300 million rays per second
– 24 billion traversal steps per second (80 traversal steps per ray)
– 240 billion instructions per second (10 instructions per traversal step)
– 0.5 trillion (5E11) cycles per second (instruction dependencies)
Limited Support for Dynamic Scenes
– Due to the need for spatial index structures (costly O(n log n) rebuild)
– But most graphics applications are highly dynamic (e.g. computer games)
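
The budget on this slide is a chain of multiplications. A minimal sketch that reproduces it is shown below. The variable names are illustrative, and the roughly two cycles per instruction is only inferred from the slide's 5E11 figure; the source does not state it.

```python
# Back-of-the-envelope ray tracing budget using the slide's minimal figures.
PIXELS = 1_000_000        # pixels per frame
FPS = 30                  # frames per second
RAYS_PER_PIXEL = 10
STEPS_PER_RAY = 80        # traversal steps per ray
INSTR_PER_STEP = 10       # instructions per traversal step

rays_per_s = PIXELS * FPS * RAYS_PER_PIXEL      # 300 million
steps_per_s = rays_per_s * STEPS_PER_RAY        # 24 billion
instr_per_s = steps_per_s * INSTR_PER_STEP      # 240 billion

# Assumption: the slide's 5E11 cycles/s implies about two cycles per
# instruction because of instruction dependencies.
cycles_per_instr = 5e11 / instr_per_s

print(rays_per_s, steps_per_s, instr_per_s, round(cycles_per_instr, 2))
```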

13
… and what can be done?
Hardware Implementation (DRPU)
– High performance through dedicated hardware units
– A high-end ASIC implementation would provide enough performance for computer games using RT (about 200 million rays/s)
Algorithmic Changes
– B-KD Trees as spatial index structure
– Supports most kinds of dynamic scenes

14
DRPU Architecture
– Task Parallelism
– Optimized Hardware Units
[Figure label: vertices from memory]

15
DRPU Architecture
Rendering Units
Synchronous execution of packets of 4 rays
– Memory bandwidth reduction (combining)
– Sharing of HW (e.g. caches)
Highly multi-threaded
– Higher hardware usage
First level caches
– Memory bandwidth reduction
– Memory latency reduction

16
DRPU Hardware Architecture
[Figure label: vertices from memory]

17
DRPU Architecture
Programmable Shading Processor
– Fully programmable
– In-order execution
– 4-component SIMD operations
– Instruction set similar to GPUs, but with efficient recursion and flexible memory access
Programming Model
– Material shading
– Ray generation tasks
– Calls Ray Casting Units to cast rays

18
DRPU Architecture
Programmable Shading Unit
Ray Casting Units
– Find the closest intersection of a ray with the scene
– High-performance traversal and intersection
– Implement the atomic “trace” instruction of the Shading Processor
– The SP can continue scheduling instructions that do not depend on the intersection result

19
DRPU Architecture
Programmable Shading Unit
Ray Casting Units
Traversal Processor
– B-KD Tree approach

20
Definition of B-KD Trees
B-KD Tree (Bounded KD-Tree)
– Binary tree
– 1D bounding intervals (or slabs) for each child
– Leaf nodes point to a single primitive
– Bounding Volume Hierarchy (subdivides geometry)

21
B-KD Tree Semantics
B-KD Tree (Bounded KD-Tree)
Each node T can be assigned a box B(T)
Half-spaces defined by the bounding planes (shown here for a split along x):
– $H_{\mathrm{right}}(min_1) = \{ (x,y,z) \mid x \ge min_1 \}$
– $H_{\mathrm{left}}(max_1) = \{ (x,y,z) \mid x \le max_1 \}$
Boxes of the root and of the two children $T_0$, $T_1$ of a node T:
– $B(T_{\mathrm{root}}) = \mathbb{R}^3$
– $B(T_0) = B(T) \cap H_{\mathrm{right}}(min_0) \cap H_{\mathrm{left}}(max_0)$
– $B(T_1) = B(T) \cap H_{\mathrm{right}}(min_1) \cap H_{\mathrm{left}}(max_1)$
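
The box semantics can be illustrated with a short sketch. Everything named here (`Box`, `child_box`, the example bounds) is hypothetical and chosen for illustration only. A child's box is the parent's box with one axis tightened to the child's 1D interval, which is the same as intersecting it with the two half-spaces.

```python
import math
from dataclasses import dataclass

@dataclass
class Box:
    lo: tuple  # (x, y, z) minimum corner
    hi: tuple  # (x, y, z) maximum corner

# B(T_root) = R^3: the root box is unbounded in every direction.
ROOT_BOX = Box((-math.inf,) * 3, (math.inf,) * 3)

def child_box(parent: Box, axis: int, lo: float, hi: float) -> Box:
    """Intersect the parent box with the slab lo <= coord[axis] <= hi."""
    new_lo = list(parent.lo)
    new_hi = list(parent.hi)
    new_lo[axis] = max(parent.lo[axis], lo)  # half-space coord >= lo
    new_hi[axis] = min(parent.hi[axis], hi)  # half-space coord <= hi
    return Box(tuple(new_lo), tuple(new_hi))

# Two levels of the tree: first bound x, then bound y inside that box.
b0 = child_box(ROOT_BOX, axis=0, lo=-1.0, hi=2.0)
b00 = child_box(b0, axis=1, lo=0.0, hi=0.5)
print(b00)
```

Because each node only stores 1D intervals, a node's full box becomes bounded in all three dimensions only after the path from the root has split along every axis.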

26
B-KD Tree Example

31
B-KD Tree Example
Boxes may overlap
– More traversal steps than for a KD tree
– Support for dynamic scenes

32
Traversal of B-KD Trees
Interval Algorithm
– Early ray termination
– Compute distances to the bounding planes
– Clip the near/far interval against both bounding intervals (simple min/max operations)
– Take the closer child, push the farther child to the stack
– Traversal order does not affect correctness
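
The interval algorithm can be sketched in software as follows. This is a single-ray illustration under assumed data structures (`Node`, `Leaf`, and the `intersect` callback are hypothetical), not the hardware's packet-based implementation. The toy scene uses planes as stand-in primitives.

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    prim: int          # index of the single primitive in this leaf

@dataclass
class Node:
    axis: int          # split axis: 0 = x, 1 = y, 2 = z
    bounds: tuple      # ((min_0, max_0), (min_1, max_1)) along `axis`
    children: tuple    # (child_0, child_1)

def clip(o, d, lo, hi, near, far):
    """Clip the ray interval [near, far] against the slab lo <= coord <= hi."""
    if d == 0.0:       # ray parallel to the slab planes
        return (near, far) if lo <= o <= hi else (1.0, 0.0)  # empty interval
    t0, t1 = (lo - o) / d, (hi - o) / d
    if t0 > t1:
        t0, t1 = t1, t0
    return max(near, t0), min(far, t1)   # simple min/max operations

def traverse(root, org, direction, intersect):
    """Return (t, prim) for the closest hit, or (inf, None) on a miss."""
    best_t, best_prim = float("inf"), None
    stack = [(root, 0.0, float("inf"))]
    while stack:
        node, near, far = stack.pop()
        far = min(far, best_t)           # early ray termination
        if near > far:
            continue
        if isinstance(node, Leaf):
            t = intersect(node.prim, org, direction)
            if t is not None and t < best_t:
                best_t, best_prim = t, node.prim
            continue
        o, d = org[node.axis], direction[node.axis]
        hits = []
        for child, (lo, hi) in zip(node.children, node.bounds):
            c_near, c_far = clip(o, d, lo, hi, near, far)
            if c_near <= c_far:
                hits.append((c_near, c_far, child))
        # Push the farther child first so the closer child is popped next.
        hits.sort(key=lambda h: h[0], reverse=True)
        stack.extend((child, n, f) for n, f, child in hits)
    return best_t, best_prim

# Toy scene: primitive i is the plane x = PLANES[i] (hypothetical stand-in).
PLANES = [1.5, 4.5]

def hit_plane(prim, org, direction):
    if direction[0] == 0.0:
        return None
    t = (PLANES[prim] - org[0]) / direction[0]
    return t if t >= 0.0 else None

tree = Node(axis=0, bounds=((1.0, 2.0), (4.0, 5.0)), children=(Leaf(0), Leaf(1)))
hit_forward = traverse(tree, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), hit_plane)
hit_backward = traverse(tree, (10.0, 0.0, 0.0), (-1.0, 0.0, 0.0), hit_plane)
miss = traverse(tree, (0.0, 0.0, 0.0), (0.0, 1.0, 0.0), hit_plane)
```

Visiting the closer child first only helps early termination. Since child boxes may overlap, correctness comes from comparing every candidate hit against the best distance found so far.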

38
Traversal Processor
Stack control computes the next address
Next node is fetched from the cache
4 traversal slices compute 4x4 distances to the bounding planes
4 Decision Units compute the per-ray traversal decisions
Packet Decision Unit computes the packet traversal decision
– Packet goes left if there exists a ray that goes left
– Packet goes right if there exists a ray that goes right
– Packet goes from left to right if there exists a ray that goes into both children from left to right
Incoherent packets possible
36 FPUs
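
The packet rule can be sketched as a reduction over the per-ray decisions. The `Decision` enum and `packet_decision` function are hypothetical names. The fallback order when no ray asks for left-to-right is an assumption, since the slide states only the three "exists" rules.

```python
from enum import Enum

class Decision(Enum):
    NONE = 0
    LEFT = 1
    RIGHT = 2
    LEFT_THEN_RIGHT = 3
    RIGHT_THEN_LEFT = 4

BOTH = (Decision.LEFT_THEN_RIGHT, Decision.RIGHT_THEN_LEFT)

def packet_decision(rays):
    """Combine the per-ray decisions of one packet into a single decision."""
    go_left = any(d is Decision.LEFT or d in BOTH for d in rays)
    go_right = any(d is Decision.RIGHT or d in BOTH for d in rays)
    left_first = any(d is Decision.LEFT_THEN_RIGHT for d in rays)
    if go_left and go_right:
        # Assumption: right-to-left is used when no ray asks for left-to-right.
        return Decision.LEFT_THEN_RIGHT if left_first else Decision.RIGHT_THEN_LEFT
    if go_left:
        return Decision.LEFT
    if go_right:
        return Decision.RIGHT
    return Decision.NONE

coherent = packet_decision([Decision.LEFT, Decision.LEFT, Decision.NONE, Decision.LEFT])
incoherent = packet_decision([Decision.LEFT_THEN_RIGHT, Decision.RIGHT_THEN_LEFT,
                              Decision.LEFT, Decision.RIGHT])
```

An incoherent packet may visit a child in an order that is not optimal for some of its rays. This costs extra traversal steps but not correctness, because traversal order does not affect the result.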

44
DRPU Architecture
Programmable Shading Unit
Ray Casting Units
Traversal Processor
Geometry Processor
– Ray transformations
– Vertex-based ray/triangle intersection [Möller Trumbore]
  – Solve linear system of equations with 3 unknowns
  – Shared vertices save memory
– 6x 1 ray/triangle intersection every 2 cycles
– 38 floating point units
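
For reference, a plain software version of the Möller-Trumbore test is sketched below. It is the textbook formulation, not a description of the Geometry Processor's datapath, and the helper names are illustrative. The three unknowns of the linear system are the hit distance t and the barycentric coordinates u and v.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_triangle(org, d, v0, v1, v2, eps=1e-9):
    """Solve org + t*d = v0 + u*(v1 - v0) + v*(v2 - v0) for (t, u, v).

    Returns (t, u, v) on a hit in front of the origin, otherwise None.
    """
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(d, e2)
    det = dot(e1, p)
    if abs(det) < eps:            # ray parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    s = sub(org, v0)
    u = dot(s, p) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(d, q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv_det
    return (t, u, v) if t > eps else None

tri = ((0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0))
hit = ray_triangle((0.25, 0.25, 0.0), (0.0, 0.0, 1.0), *tri)
no_hit = ray_triangle((2.0, 2.0, 0.0), (0.0, 0.0, 1.0), *tri)
```

The test works directly on vertex positions, so triangles that share vertices need no per-triangle precomputed data.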

45
DRPU Architecture
Programmable Shading Unit
Ray Casting Units
Scene Changes
Skinning Processor
– Skeleton Subspace Deformation
– Re-uses Geometry Unit
– 4 additional floating point units
– Pure stream architecture
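
Skeleton subspace deformation is commonly formulated as a weighted blend of bone transforms, v' = Σ w_i (M_i v). A minimal sketch of that formulation follows. The 3x4 matrix layout and the function names are assumptions for illustration, not the Skinning Processor's actual data format.

```python
def transform(m, v):
    """Apply a 3x4 affine matrix (three rows of [a, b, c, translation]) to v."""
    return tuple(r[0] * v[0] + r[1] * v[1] + r[2] * v[2] + r[3] for r in m)

def skin_vertex(v, influences):
    """Blend the bone-transformed positions of v.

    influences: list of (weight, matrix) pairs whose weights sum to 1.
    """
    out = [0.0, 0.0, 0.0]
    for weight, matrix in influences:
        p = transform(matrix, v)
        for k in range(3):
            out[k] += weight * p[k]
    return tuple(out)

IDENTITY = ((1.0, 0.0, 0.0, 0.0),
            (0.0, 1.0, 0.0, 0.0),
            (0.0, 0.0, 1.0, 0.0))
SHIFT_X = ((1.0, 0.0, 0.0, 2.0),   # translate by 2 along x
           (0.0, 1.0, 0.0, 0.0),
           (0.0, 0.0, 1.0, 0.0))

# A vertex influenced equally by a resting bone and a translated bone.
skinned = skin_vertex((1.0, 1.0, 1.0), [(0.5, IDENTITY), (0.5, SHIFT_X)])
```

Each vertex is processed independently and exactly once, which fits a pure stream architecture.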

46
B-KD Trees for Dynamic Scenes
B-KD Tree Approach
– Initially build the B-KD tree: O(n log n)
– Update after each frame: O(n)
Updating Works Well for
– Continuous motion where the structure of motion matches the tree structure
– E.g. skinned meshes, characters, water surfaces, ...
Not Optimal for
– Random motions, turbulence
– However, amortizing the O(n log n) reconstruction over many frames is feasible
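
The O(n) update can be pictured as a bottom-up refit that keeps the tree topology and only recomputes bounds. The sketch below uses hypothetical `Node` and `Leaf` types and a recursive post-order pass. It illustrates the idea only and is not the hardware's update procedure.

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    tri: tuple               # indices of the triangle's three vertices

@dataclass
class Node:
    axis: int                # split axis: 0 = x, 1 = y, 2 = z
    children: list           # [child_0, child_1]
    bounds: list = None      # [(min_0, max_0), (min_1, max_1)], set by refit

def box_of(points):
    """Axis-aligned box (lo, hi) around a list of 3D points."""
    per_axis = list(zip(*points))
    return tuple(min(c) for c in per_axis), tuple(max(c) for c in per_axis)

def merge(a, b):
    """Smallest box containing boxes a and b."""
    return tuple(map(min, a[0], b[0])), tuple(map(max, a[1], b[1]))

def refit(node, vertices):
    """Post-order pass: every node is visited once, so the cost is O(n)."""
    if isinstance(node, Leaf):
        return box_of([vertices[i] for i in node.tri])
    boxes = [refit(child, vertices) for child in node.children]
    # Store only the 1D interval of each child along this node's axis.
    node.bounds = [(lo[node.axis], hi[node.axis]) for lo, hi in boxes]
    return merge(boxes[0], boxes[1])

verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
         (3.0, 0.0, 0.0), (4.0, 0.0, 0.0), (3.0, 1.0, 0.0)]
root = Node(axis=0, children=[Leaf((0, 1, 2)), Leaf((3, 4, 5))])
scene_box = refit(root, verts)
bounds_frame0 = list(root.bounds)

# Next frame: the second triangle moves by +1 along x; only bounds change.
verts[3:] = [(x + 1.0, y, z) for x, y, z in verts[3:]]
refit(root, verts)
bounds_frame1 = list(root.bounds)
```

If the motion does not match the tree structure, the refitted intervals grow and overlap more, which is why random motion eventually calls for a rebuild.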

47
Examples
Bounding Volume Approaches Perform Well for
– Continuous motion
– Structure of motion must match tree structure
– E.g. skinned meshes, characters, water surfaces, ...

53
Examples
Bounding Volume Approaches are Less Efficient for
– Non-continuous motion
– Structure of motion does not match tree structure
– High traversal cost due to large overlapping boxes

57
DRPU Architecture
Programmable Shading Unit
Ray Casting Units
Scene Changes
Skinning Processor
Update Processor
– In-order execution
– 32 bit instructions
– Precomputed instruction stream
  – Load vertex, merge 3 vertices, merge 2 boxes
  – ¼ more memory (#vertices + #nodes) instructions
– One B-KD node update every two clock cycles (peak)

58
FPGA Implementation
Hardware
– HWML Hardware Description
– Xilinx Virtex4 LX[…]
– […] MHz clock frequency
– 1.0 GB/s memory bandwidth
– 7.5 Gflops (113 floating point units)
  – 2.3 Gflops programmable
  – 5.2 Gflops fixed function
Implementation
– Packets of 4 rays
– 32 packets of rays
– 3x 8 KB caches, direct mapped
– 24 bit floating point
[Figure: Virtex4 board]

59
Video

60
ASIC Implementation
Implementation Differences
– Larger caches (3x 16 KB, 4-way associative)
– 32 bit floating point
Synthesis
– Synopsys synthesis
– UMC 130nm CMOS process
Place & Route
– Cadence Encounter
– Manual placements to achieve good results
Only DRPU Core
– No chip interface designed (PCI Express, DRAM, ...)
DRPU-ASIC

61
Hardware
– UMC 130nm CMOS process
– 49 mm² die
– […] MHz clock
– 2.1 GB/s bandwidth
– 30 Gflops
  – 10 Gflops programmable
  – 20 Gflops fixed function
Very Efficient Fixed Function Units
– GP via SP: 5x smaller area, 3x higher performance
– 15 times more efficient (performance per area)
[Figure: die plot, 7 mm edge]

62
DRPU8-ASIC
Hardware
– 90nm CMOS process, extrapolated using constant field scaling
– 186 mm² die
– 400 MHz clock speed
– 25.6 GB/s bandwidth
– 361 Gflops
– 110 Gflops programmable
– 471 Gflops fixed function
– About […] million shaded rays per second
[Figure: die layout, 9.6 mm x 19.3 mm]

63
Results at 1024x768 with shadows

64
Conclusion and Future Work
Efficient Hardware Ray Tracing is Possible
– Performance levels sufficient for computer games could be achieved
– Even support for dynamic scenes
– Ray tracing ready to replace rasterization?
But Still Open Questions
– Anti-aliasing (many rays per pixel)
– Arbitrary dynamics (reconstruction)
– What about advanced global illumination (e.g. photon mapping)?

65
Questions?

66
Instruction Set of Shading Processor
Ray traversal operation
– trace
Conditional instructions (paired)
– if jmp label
– if call
– if return
Dual issue (pairing)
– 3/1 and 2/2 arithmetic splitting
– Arithmetic + load
– Arithmetic + conditional jump, call, return
Short vector instruction set
– mov, add, mul, mad, frac
– dph2, dp3, dph3, dp4
Input modifiers
– Swizzling, negation, masking
– Multiply with power of 2
Special operations (modifiers)
– rcp, rsq, sat
Fast 2D texture lookups
– texload, texload4x
Read from and write to memory
– load, load4x, store

72
Hardware Description
Problem
– Most hardware description languages operate on a low abstraction level (e.g. VHDL, Verilog, …)
– High-level languages are behavioral (Handel-C, Mitrion-C, …): no reliable mapping to hardware
– Need a high-level structural language
HWML
– Structural hardware description
– Implemented as an SML library
– Design add, mul, fadd, fmul, rcp, rsq, …
– Allows a compact description of HW algorithms, e.g.:
  – 8000 LOC for the entire DRPU
  – 160 LOC for a full implementation of the Tomasulo algorithm

73
HWML Features
Functional
– Circuit descriptions are SML functions
– Functions can operate on circuits (e.g. arbitrary reductions)
– Recursive circuit descriptions: important for the implementation of arithmetic units (e.g. adders)
Abstract Data Types
– Polymorphic functions (e.g. a single FIFO operates on different types of data)
– Allows for fully parameterized designs (e.g. change floating point precision)
Data Stream Abstraction
– Only one communication protocol in the complete chip
– Automatic pipelining of circuits (higher order operator)
– Automatically generates highly efficient implementation
Atomic support for multiported (typed) memories
– Allows memories to be mapped efficiently to different platforms (e.g. memory compilers for CMOS processes)
– Generate FPGA and ASIC from one description

74
Brute Force Ray Tracing Demands
Property: Standard Quality, Medium Quality, High Quality
Resolution: 1024x[…] […] […]x1080
FPS: 30, 60 […]
Rays per Pixel: […]
Total Rays/s: 250M, 1.2B, 24.0B
...
