1
**DRPU: A Programmable Hardware Architecture for Real-Time Ray Tracing of Coherent Dynamic Scenes**

Sven Woop, Computer Graphics Lab, Saarland University

2
**Overview**

- Motivation: Why Ray Tracing?
- Previous Work
- DRPU Architecture
- FPGA Prototype
- ASIC Performance Estimates
- Conclusion & Future Work

3
**Why not Rasterization ...**

- Primitive Operation: Rasterize Isolated Triangles
- Perfect for dynamic scenes
- Very simple operation (good for HW)
- Parallel processing of triangles and fragments (good for HW)
- No global access to the scene
- All Interesting Visual Effects Need 2+ Triangles (Shadows, Reflection, Global Illumination, …)
- Approximations via multiple pass approaches have many issues
- Difficult to Use Algorithm
- Very Fast Hardware Implementations

4
**... but Ray Tracing?**

- Primitive Operation: Trace a Ray
- O(log n) traversal operation
- Demand driven
- Global Access to Scene
- Automatic combination of effects (orthogonal shaders)
- Recursive evaluation
- Physical Light Simulation
- Embarrassingly parallel (good for HW)
- Accurate and realistic images
- Easy to use Algorithm
- Low Performance

11
**Previous Work**

Ray Tracers for Static Scenes

- CPU based: [OpenRT], [MLRT SIGGRAPH05]
- GPU based: Purcell (Grids) [SIGGRAPH02], Foley et al. (KD Trees) [GH05], Stefan Popov (Stackless KD Tree traversal) [EG07]
- Custom Hardware: ART-VPS (AR350 Chip for offline rendering), Schmittler (SaarCOR) [GH04], Woop (RPU) [SIGGRAPH05]

Ray Tracers for Dynamic Scenes

- CPU based: Wald (Grids) [SIGGRAPH06], Wald (AABVHs) [TOG / Tech. Rep. 2006], Wächter and Keller (BIH) [EG06], Johannes Günther (Motion Decomposition) [EG06]
- Custom Hardware: Woop (B-KD Trees) [GH06], Woop (DRPU-ASIC) [RT06]

12
**Why isn’t everybody using Ray Tracing …**

Low Performance

- High computational complexity
- 1 million pixels (minimal)
- 30 frames per second (minimal)
- 10 rays per pixel (minimal)
- At least 300 million rays per second
- 24 billion traversal steps per second (80 traversal steps per ray)
- 240 billion instructions per second (10 instructions per traversal step)
- 0.5 trillion (5E11) cycles per second (instruction dependencies)

Limited Support for Dynamic Scenes

- Due to the need for spatial index structures (costly O(n log n) rebuild)
- But most graphics applications are highly dynamic (e.g. computer games)
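The cost estimate on this slide is a chain of multiplications, which can be checked with a few lines of Python. This is only a sketch of the arithmetic: the constant names are illustrative, and the cycles-per-instruction ratio is derived from the slide's 5E11 figure rather than stated in the talk.

```python
# Minimal per-second budget taken from the slide (names are illustrative).
PIXELS = 1_000_000
FRAMES_PER_SECOND = 30
RAYS_PER_PIXEL = 10
TRAVERSAL_STEPS_PER_RAY = 80
INSTRUCTIONS_PER_STEP = 10
CYCLES_PER_SECOND = 5e11  # slide's estimate once instruction dependencies are included

rays_per_second = PIXELS * FRAMES_PER_SECOND * RAYS_PER_PIXEL
steps_per_second = rays_per_second * TRAVERSAL_STEPS_PER_RAY
instructions_per_second = steps_per_second * INSTRUCTIONS_PER_STEP

# Ratio implied by the slide's numbers (an inference, not a quoted figure).
implied_cycles_per_instruction = CYCLES_PER_SECOND / instructions_per_second

print(f"rays/s:         {rays_per_second:.3g}")
print(f"steps/s:        {steps_per_second:.3g}")
print(f"instructions/s: {instructions_per_second:.3g}")
print(f"implied cycles per instruction: {implied_cycles_per_instruction:.2f}")
```

The implied ratio of roughly two cycles per instruction is what turns 240 billion instructions into about 0.5 trillion cycles.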

13
**… and what can be done?**

Hardware Implementation (DRPU)

- High performance through dedicated hardware units
- A high-end ASIC implementation would provide enough performance for computer games using RT (about 200 million rays/s)

Algorithmic Changes

- B-KD Trees as spatial index structure
- Supports most kinds of dynamic scenes

14
**DRPU Architecture**

- Task Parallelism
- Optimized Hardware Units

[Figure: DRPU block diagram; label: "vertices from memory"]

15
**DRPU Architecture**

Rendering Units

- Synchronous execution of packets of 4 rays
  - Memory bandwidth reduction (combining)
  - Sharing of HW (e.g. caches)
- Highly multi-threaded
  - Higher hardware usage
- First level caches
  - Memory bandwidth reduction
  - Memory latency reduction

16
**DRPU Hardware Architecture**

[Figure: DRPU hardware architecture diagram; label: "vertices from memory"]

17
**DRPU Architecture**

Programmable Shading Processor

- Fully programmable
- In-order execution
- 4-component SIMD operations
- Similar instruction set to GPUs, but:
  - Efficient recursion
  - Flexible memory access
- Programming Model
  - Material shading
  - Ray generation tasks
  - Calls Ray Casting Units to cast rays

18
**DRPU Architecture**

- Programmable Shading Unit
- Ray Casting Units
  - Find closest intersection of a ray with the scene
  - High-performance traversal and intersection
  - Implement the atomic “trace” instruction of the Shading Processor (SP)
  - The SP can continue scheduling instructions that do not depend on the intersection result

19
**DRPU Architecture**

- Programmable Shading Unit
- Ray Casting Units
  - Traversal Processor
    - B-KD Tree approach

20
**Definition of B-KD Trees**

B-KD Tree (Bounded KD-Tree)

- Binary Tree
- 1D bounding intervals (or slabs) for each child
- Leaf nodes point to a single primitive
- Bounding Volume Hierarchy (subdivides geometry)

21
**B-KD Tree Semantics**

B-KD Tree (Bounded KD-Tree): each node T can be assigned a box B(T).

22
**B-KD Tree Semantics**

$H_{\mathit{right}}(\mathit{min}_1) = \{ (x,y,z) \mid x \ge \mathit{min}_1 \}$

23
**B-KD Tree Semantics**

$H_{\mathit{right}}(\mathit{min}_1) = \{ (x,y,z) \mid x \ge \mathit{min}_1 \}$

$H_{\mathit{left}}(\mathit{max}_1) = \{ (x,y,z) \mid x \le \mathit{max}_1 \}$

24
**B-KD Tree Semantics**

$B(T) \cap H_{\mathit{right}}(\mathit{min}_1) \cap H_{\mathit{left}}(\mathit{max}_1)$

25
**B-KD Tree Semantics**

- $B(T_{\mathit{root}}) = \mathbb{R}^3$
- $B(T_0) = B(T) \cap H_{\mathit{right}}(\mathit{min}_0) \cap H_{\mathit{left}}(\mathit{max}_0)$
- $B(T_1) = B(T) \cap H_{\mathit{right}}(\mathit{min}_1) \cap H_{\mathit{left}}(\mathit{max}_1)$
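The box semantics can be illustrated with a short Python sketch. It assumes the half-space convention from the preceding slides (the right half-space of a minimum plane is x >= min, the left half-space of a maximum plane is x <= max), so a child's box is its parent's box clipped to the child's 1D interval along the split axis. `ROOT_BOX` and `child_box` are illustrative names, not part of the DRPU design.

```python
import math
from typing import Tuple

Interval = Tuple[float, float]
Box = Tuple[Interval, Interval, Interval]  # (x, y, z) extents

# B(T_root) is all of R^3.
ROOT_BOX: Box = ((-math.inf, math.inf),) * 3


def child_box(parent: Box, axis: int, lo: float, hi: float) -> Box:
    """Clip the parent box to lo <= p[axis] <= hi (intersection of two half-spaces)."""
    extents = list(parent)
    cur_lo, cur_hi = extents[axis]
    extents[axis] = (max(cur_lo, lo), min(cur_hi, hi))
    return (extents[0], extents[1], extents[2])


# A child of the root bounded to [1, 3] along x, then a grandchild bounded to [0, 2].
b0 = child_box(ROOT_BOX, axis=0, lo=1.0, hi=3.0)
b00 = child_box(b0, axis=0, lo=0.0, hi=2.0)
print(b0)
print(b00)
```

Because each node stores only an interval along one axis, a finite box appears only after the path from the root has bounded all three axes.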

26
B-KD Tree Example

30
**B-KD Tree Example**

- Boxes may overlap
- More traversal steps than for a KD Tree
- Support for dynamic scenes

32
**Traversal of B-KD Trees**

- Interval Algorithm

33
**Traversal of B-KD Trees**

- Interval Algorithm
- Early ray termination

34
**Traversal of B-KD Trees**

- Interval Algorithm
- Early ray termination
- Compute Distances

36
**Traversal of B-KD Trees**

- Interval Algorithm
- Early ray termination
- Compute Distances
- Clipping of near/far interval against both bounding intervals
- Simple min/max operations

37
**Traversal of B-KD Trees**

- Interval Algorithm
- Early ray termination
- Compute Distances
- Clipping of near/far interval against both bounding intervals
- Take closer child, push farther child to stack
- Traversal order does not affect correctness
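The interval algorithm can be sketched for a single ray in Python. This is a software illustration under stated assumptions, not the hardware pipeline: the `Leaf`/`Inner` classes, the `clip` and `trace` functions, and the `intersect` callback are all hypothetical names introduced here, and the node layout (one axis plus one interval per child) follows the B-KD definition given earlier.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Leaf:
    primitive: object


@dataclass
class Inner:
    axis: int
    intervals: Tuple[Tuple[float, float], Tuple[float, float]]  # one (min, max) per child
    children: Tuple[object, object]


def clip(origin, direction, axis, lo, hi, t_near, t_far):
    """Clip [t_near, t_far] against the slab lo <= p[axis] <= hi; None if empty."""
    o, d = origin[axis], direction[axis]
    if d == 0.0:  # ray parallel to the bounding planes
        return (t_near, t_far) if lo <= o <= hi else None
    t0, t1 = (lo - o) / d, (hi - o) / d  # distances to both planes
    near, far = max(t_near, min(t0, t1)), min(t_far, max(t0, t1))
    return (near, far) if near <= far else None


def trace(root, origin: Vec3, direction: Vec3,
          intersect: Callable[[object, Vec3, Vec3], Optional[float]]):
    """Return (primitive, t) of the closest hit, or None."""
    best_t, best_prim = float("inf"), None
    stack = [(root, 0.0, float("inf"))]
    while stack:
        node, t_near, t_far = stack.pop()
        if t_near > best_t:
            continue  # early ray termination: a closer hit is already known
        t_far = min(t_far, best_t)
        while isinstance(node, Inner):
            hits = []
            for child, (lo, hi) in zip(node.children, node.intervals):
                seg = clip(origin, direction, node.axis, lo, hi, t_near, t_far)
                if seg is not None:
                    hits.append((seg[0], seg[1], child))
            if not hits:
                node = None
                break
            hits.sort(key=lambda h: h[0])  # closer child first
            if len(hits) == 2:
                stack.append((hits[1][2], hits[1][0], hits[1][1]))  # push farther child
            t_near, t_far, node = hits[0]
        if isinstance(node, Leaf):
            t = intersect(node.primitive, origin, direction)
            if t is not None and 0.0 <= t < best_t:
                best_t, best_prim = t, node.primitive
    return (best_prim, best_t) if best_prim is not None else None


# Toy scene: each "primitive" is a plane x = c, stored as (name, c).
def hit_plane(prim, o, d):
    return (prim[1] - o[0]) / d[0] if d[0] != 0.0 else None


a, b = ("a", 1.5), ("b", 3.0)
root = Inner(axis=0, intervals=((1.0, 2.0), (1.5, 4.0)), children=(Leaf(a), Leaf(b)))
print(trace(root, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), hit_plane))
print(trace(root, (5.0, 0.0, 0.0), (-1.0, 0.0, 0.0), hit_plane))
```

The two child intervals overlap in this toy tree, which is why the farther child must stay on the stack until a closer hit rules it out.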

38
**Traversal Processor**

- 36 FPUs
- Stack control computes next address

39
**Traversal Processor**

- 36 FPUs
- Stack control computes next address
- Next node is fetched from cache

40
**Traversal Processor**

- 36 FPUs
- Stack control computes next address
- Next node is fetched from cache
- 4 traversal slices compute 4x4 distances to bounding planes

41
**Traversal Processor**

- 36 FPUs
- Stack control computes next address
- Next node is fetched from cache
- 4 traversal slices compute 4x4 distances to bounding planes
- 4 Decision Units compute per-ray traversal decision

42
**Traversal Processor**

- Stack control computes next address
- Next node is fetched from cache
- 4 traversal slices compute 4x4 distances to bounding planes
- 4 Decision Units compute per-ray traversal decision
- Packet Decision Unit computes packet traversal decision
  - Packet goes left if there exists a ray that goes left
  - Packet goes right if there exists a ray that goes right
  - Packet goes from left to right if there exists a ray that goes into both children from left to right

43
**Traversal Processor**

- Stack control computes next address
- Next node is fetched from cache
- 4 traversal slices compute 4x4 distances to bounding planes
- 4 Decision Units compute per-ray traversal decision
- Packet Decision Unit computes packet traversal decision
  - Packet goes left if there exists a ray that goes left
  - Packet goes right if there exists a ray that goes right
  - Packet goes from left to right if there exists a ray that goes into both children from left to right
- Incoherent packets possible
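The packet rules above amount to a reduction over the per-ray decisions. The following Python sketch shows one way to express them; `RayDecision` and `packet_decision` are hypothetical names, and the fallback to right-first order when no ray asks for left-to-right is an assumption of this sketch (the slides only note that traversal order does not affect correctness).

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RayDecision:
    left: bool               # ray overlaps the left child
    right: bool              # ray overlaps the right child
    left_first: bool = True  # only meaningful when both flags are set


def packet_decision(rays: Sequence[RayDecision]) -> List[str]:
    """Return the children the whole packet visits, in visiting order."""
    go_left = any(r.left for r in rays)
    go_right = any(r.right for r in rays)
    if go_left and go_right:
        # Left-to-right if some ray enters both children in that order.
        left_to_right = any(r.left and r.right and r.left_first for r in rays)
        return ["left", "right"] if left_to_right else ["right", "left"]
    if go_left:
        return ["left"]
    if go_right:
        return ["right"]
    return []


coherent = [RayDecision(True, False)] * 4
mixed = [RayDecision(True, True, True), RayDecision(False, True),
         RayDecision(False, True), RayDecision(False, True)]
print(packet_decision(coherent))
print(packet_decision(mixed))
```

In the second packet three rays need only the right child, yet the whole packet also visits the left one; this is the cost of an incoherent packet.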

44
**DRPU Architecture**

- Programmable Shading Unit
- Ray Casting Units
  - Traversal Processor
  - Geometry Processor
    - Ray transformations
    - Vertex-based ray/triangle intersection [Möller Trumbore]
    - Solve linear system of equations with 3 unknowns
    - Shared vertices save memory
    - 6x 1 ray/triangle intersection each 2 cycles
    - 38 floating point units
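For reference, the Möller-Trumbore test solves the 3x3 system o + t·d = v0 + u·(v1 − v0) + v·(v2 − v0) for the unknowns (t, u, v) using Cramer's rule. The Python sketch below is the textbook software formulation with illustrative helper names; it does not model how the Geometry Processor schedules these operations.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle(origin: Vec3, direction: Vec3, v0: Vec3, v1: Vec3, v2: Vec3,
                 eps: float = 1e-9) -> Optional[Tuple[float, float, float]]:
    """Return (t, u, v) for a hit with t >= 0, or None for a miss."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:          # ray parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    tvec = sub(origin, v0)
    u = dot(tvec, p) * inv_det  # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(tvec, e1)
    v = dot(direction, q) * inv_det  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv_det    # distance along the ray
    return (t, u, v) if t >= 0.0 else None


tri = ((0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0))
print(ray_triangle((0.25, 0.25, 0.0), (0.0, 0.0, 1.0), *tri))
print(ray_triangle((2.0, 2.0, 0.0), (0.0, 0.0, 1.0), *tri))
```

The test works directly on the three vertices, so triangles that share vertices need no extra per-triangle data, which matches the slide's memory argument.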

45
**DRPU Architecture**

- Programmable Shading Unit
- Ray Casting Units
- Scene Changes
  - Skinning Processor
    - Skeleton Subspace Deformation
    - Re-uses Geometry Unit
    - 4 additional floating point units
    - Pure stream architecture
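Skeleton Subspace Deformation (also known as linear blend skinning) computes each deformed vertex as a weighted sum of the rest-pose vertex transformed by its bone matrices. A minimal Python sketch follows, assuming 3x4 affine bone matrices and weights that sum to one; `transform` and `skin_vertex` are illustrative names, and the data layout is not taken from the DRPU.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3x4 = Tuple[Tuple[float, float, float, float], ...]  # three rows: rotation | translation


def transform(m: Mat3x4, p: Vec3) -> Vec3:
    """Apply an affine 3x4 matrix to a point."""
    x, y, z = p
    rows = [r[0] * x + r[1] * y + r[2] * z + r[3] for r in m]
    return (rows[0], rows[1], rows[2])


def skin_vertex(rest: Vec3, influences: Sequence[Tuple[float, Mat3x4]]) -> Vec3:
    """Blend the rest-pose vertex over all (weight, bone matrix) pairs."""
    out = [0.0, 0.0, 0.0]
    for weight, bone in influences:
        moved = transform(bone, rest)
        for k in range(3):
            out[k] += weight * moved[k]
    return (out[0], out[1], out[2])


IDENTITY: Mat3x4 = ((1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0))
SHIFT_X: Mat3x4 = ((1.0, 0.0, 0.0, 2.0), (0.0, 1.0, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0))

# A vertex influenced equally by a static bone and a bone translated by 2 along x.
skinned = skin_vertex((1.0, 1.0, 1.0), [(0.5, IDENTITY), (0.5, SHIFT_X)])
print(skinned)
```

Each vertex is processed independently and exactly once per frame, which fits the "pure stream architecture" description.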

46
**B-KD Trees for Dynamic Scenes**

B-KD Tree Approach

- Initially build B-KD tree: O(n log n)
- Update after each frame: O(n)

Updating Works Well for

- Continuous motion where structure of motion matches tree structure
- E.g. skinned meshes, characters, water surfaces, ...

Not Optimal for

- Random motions, turbulence
- However, amortizing O(n log n) reconstruction over many frames is feasible
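The O(n) per-frame update can be pictured as a bottom-up refit that keeps the tree topology and recomputes only the bounds. The Python sketch below is a recursive software illustration with hypothetical names (`Leaf`, `Inner`, `refit`); it assumes triangle leaves that index into a shared vertex array and does not model the hardware update path.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Box = Tuple[Vec3, Vec3]  # (min corner, max corner)


@dataclass
class Leaf:
    triangle: Tuple[int, int, int]  # indices into the shared vertex array


@dataclass
class Inner:
    axis: int
    left: object
    right: object
    intervals: Tuple[Tuple[float, float], Tuple[float, float]] = ((0.0, 0.0), (0.0, 0.0))


def merge_vertices(a: Vec3, b: Vec3, c: Vec3) -> Box:
    """Bounding box of one triangle."""
    lo = (min(a[0], b[0], c[0]), min(a[1], b[1], c[1]), min(a[2], b[2], c[2]))
    hi = (max(a[0], b[0], c[0]), max(a[1], b[1], c[1]), max(a[2], b[2], c[2]))
    return (lo, hi)


def merge_boxes(a: Box, b: Box) -> Box:
    """Bounding box of two boxes."""
    lo = (min(a[0][0], b[0][0]), min(a[0][1], b[0][1]), min(a[0][2], b[0][2]))
    hi = (max(a[1][0], b[1][0]), max(a[1][1], b[1][1]), max(a[1][2], b[1][2]))
    return (lo, hi)


def refit(node, vertices: List[Vec3]) -> Box:
    """Recompute all child intervals bottom-up; each node is visited once."""
    if isinstance(node, Leaf):
        i, j, k = node.triangle
        return merge_vertices(vertices[i], vertices[j], vertices[k])
    lb, rb = refit(node.left, vertices), refit(node.right, vertices)
    ax = node.axis
    node.intervals = ((lb[0][ax], lb[1][ax]), (rb[0][ax], rb[1][ax]))
    return merge_boxes(lb, rb)


verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
         (2.0, 0.0, 0.0), (3.0, 0.0, 0.0), (2.0, 1.0, 0.0)]
root = Inner(axis=0, left=Leaf((0, 1, 2)), right=Leaf((3, 4, 5)))
box_before = refit(root, verts)
intervals_before = root.intervals

verts[4] = (5.0, 0.0, 0.0)  # one vertex moves; topology stays unchanged
refit(root, verts)
intervals_after = root.intervals
print(intervals_before, intervals_after)
```

If the motion does not follow the tree structure, the refitted intervals grow and overlap more, which is the traversal cost the slide warns about.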

47
**Examples**

Bounding Approaches Perform Well for

- Continuous motion
- Structure of motion must match tree structure
- E.g. skinned meshes, characters, water surfaces, ...

53
**Examples**

Bounding Volume Approaches are less Efficient for

- Non-continuous motion
- Structure of motion does not match tree structure
- High traversal cost due to large overlapping boxes

57
**DRPU Architecture**

- Programmable Shading Unit
- Ray Casting Units
- Scene Changes
  - Skinning Processor
  - Update Processor
    - In-order execution
    - 32 bit instructions
    - Precomputed instruction stream
    - Load vertex, merge 3 vertices, merge 2 boxes
    - ¼ more memory
    - (#vertices + #nodes) instructions
    - One B-KD node update every two clock cycles (peak)

58
**FPGA Implementation**

Hardware

- HWML Hardware Description
- Xilinx Virtex4 LX160
- 66 MHz clock frequency
- 1.0 GB/s memory bandwidth
- 7.5 Gflops (113 floating point units)
  - 2.3 Gflops programmable
  - 5.2 Gflops fixed function

Implementation

- Packets of 4 rays
- 32 packets of rays
- 3x 8 KB caches, direct mapped
- 24 bit floating point

[Figure: Virtex4 Board]

59
Video

60
**ASIC Implementation**

Implementation Differences

- Larger caches (3x 16 KB, 4-way associative)
- 32 bit floating point

Synthesis

- Synopsys Synthesis
- UMC 130nm CMOS process

Place & Route

- Cadence Encounter
- Manual placements to achieve good results

Only DRPU Core

- No chip interface designed (PCI Express, DRAM, ...)

[Figure: DRPU-ASIC]

61
**DRPU-ASIC**

Hardware

- UMC 130nm CMOS process
- 49 mm² die (7 mm × 7 mm)
- 266 MHz clock
- 2.1 GB/s bandwidth
- 30 Gflops
  - 10 Gflops programmable
  - 20 Gflops fixed function

Very Efficient Fixed Function Units

- GP via SP: 5x smaller area, 3x higher performance
- 15 times more efficient (performance per area)

62
**DRPU8-ASIC**

Hardware

- 90nm CMOS process
- Extrapolated using constant field scaling
- 186 mm² die (19.3 mm × 9.6 mm)
- 400 MHz clock speed
- 25.6 GB/s bandwidth
- 361 Gflops
- 110 Gflops programmable
- 471 Gflops fixed function
- About million shaded rays per second

63
**Results at 1024x768 with shadows**

64
**Conclusion and Future Work**

Efficient Hardware Ray Tracing is Possible

- Performance levels sufficient for computer games could be achieved
- Even support for Dynamic Scenes
- Ray Tracing ready to replace rasterization?

But Still Open Questions

- Anti-aliasing (many rays per pixel)
- Arbitrary dynamics (reconstruction)
- What about advanced global illumination (e.g. photon mapping)?

65
Questions?

66
**Instruction Set of Shading Processor**

- Short vector instruction set
  - mov, add, mul, mad, frac
  - dph2, dp3, dph3, dp4
- Input modifiers
  - Swizzling, negation, masking
  - Multiply with power of 2
- Special operations (modifiers)
  - rcp, rsq, sat
- Fast 2D texture lookups
  - texload, texload4x
- Read from and write to memory
  - load, load4x, store
- Ray traversal operation
  - trace
- Conditional instructions (paired)
  - if <condition> jmp label
  - if <condition> call <fun>
  - if <condition> return
- Dual issue (pairing)
  - 3/1 and 2/2 arithmetic splitting
  - Arithmetic + load
  - Arithmetic + conditional jump, call, return

72
**Hardware Description**

Problem

- Most hardware description languages operate on a low abstraction level (e.g. VHDL, Verilog, …)
- High level languages are behavioral (Handel-C, Mitrion-C, …)
  - No reliable mapping to hardware
- Need high-level structural language

HWML

- Structural hardware description
- Implemented as an SML library
- Design add, mul, fadd, fmul, rcp, rsq, …
- Allows a compact description of HW algorithms, e.g.:
  - 8000 LOC for the entire DRPU
  - 160 LOC for a full implementation of the Tomasulo algorithm

73
**HWML Features**

Functional

- Circuit descriptions are SML functions
- Functions can operate on circuits (e.g. arbitrary reductions)

Recursive circuit descriptions

- Important for the implementation of arithmetic units (e.g. adders)

Abstract Data Types

- Polymorphic functions (e.g. a single FIFO operates on different types of data)
- Allows for fully parameterized designs (e.g. change floating point precision)

Data Stream Abstraction

- Only one communication protocol in the complete chip
- Automatic pipelining of circuits (higher order operator)
- Automatically generates highly efficient implementation

Atomic support for multiported (typed) memories

- Allows memories to be mapped efficiently to different platforms (e.g. memory compilers for CMOS processes)

Generate FPGA and ASIC from one description

74
**Brute Force Ray Tracing Demands**

| Property | Standard Quality | Medium Quality | High Quality |
|---|---|---|---|
| Resolution | 1024x768 | 1920x1080 | 1920x1080 |
| FPS | 30 | 60 | 60 |
| Rays per Pixel | 10 | 10 | 200 |
| Total Rays/s | 250M | 1.2B | 24.0B |

...
