Sven Woop Computer Graphics Lab Saarland University

Slides:



Advertisements
Similar presentations
The OpenRT Application Programming Interface - Towards a Common API for Interactive Ray Tracing – OpenSG 2003 Darmstadt, Germany Andreas Dietrich Ingo.
Advertisements

A Hardware Processing Unit For Point Sets S. Heinzle, G. Guennebaud, M. Botsch, M. Gross Graphics Hardware 2008.
Sven Woop Jörg Schmittler Philipp Slusallek Computer Graphics Lab Saarland University, Germany RPU: A Programmable Ray Processing Unit for Realtime Ray.
† Saarland University, Germany ‡ University of Utah, USA Estimating Performance of a Ray-Tracing ASIC Design Sven Woop † Erik Brunvand ‡ Philipp Slusallek.
Christian Lauterbach COMP 770, 2/16/2009. Overview  Acceleration structures  Spatial hierarchies  Object hierarchies  Interactive Ray Tracing techniques.
COMPUTER GRAPHICS CS 482 – FALL 2014 NOVEMBER 10, 2014 GRAPHICS HARDWARE GRAPHICS PROCESSING UNITS PARALLELISM.
Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.
Graphics Hardware CMSC 435/634. Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline.
Physically Based Real-time Ray Tracing Ryan Overbeck.
Ray Tracing Ray Tracing 1 Basic algorithm Overview of pbrt Ray-surface intersection (triangles, …) Ray Tracing 2 Brute force: Acceleration data structures.
Two Methods for Fast Ray-Cast Ambient Occlusion Samuli Laine and Tero Karras NVIDIA Research.
Latency considerations of depth-first GPU ray tracing
Two-Level Grids for Ray Tracing on GPUs
RT06 conferenceVlastimil Havran On the Fast Construction of Spatial Hierarchies for Ray Tracing Vlastimil Havran 1,2 Robert Herzog 1 Hans-Peter Seidel.
Cost-based Workload Balancing for Ray Tracing on a Heterogeneous Platform Mario Rincón-Nigro PhD Showcase Feb 17 th, 2012.
Experiences with Streaming Construction of SAH KD Trees Stefan Popov, Johannes Günther, Hans-Peter Seidel, Philipp Slusallek.
Paper Presentation - An Efficient GPU-based Approach for Interactive Global Illumination- Rui Wang, Rui Wang, Kun Zhou, Minghao Pan, Hujun Bao Presenter.
Rasterization and Ray Tracing in Real-Time Applications (Games) Andrew Graff.
Control Flow Virtualization for General-Purpose Computation on Graphics Hardware Ghulam Lashari Ondrej Lhotak University of Waterloo.
Programming with CUDA, WS09 Waqar Saleem, Jens Müller Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller.
Computer Graphics (Fall 2005) COMS 4160, Lecture 21: Ray Tracing
1 View Coherence Acceleration for Ray Traced Animation University of Colorado at Colorado Springs Master’s Thesis Defense by Philip Glen Gage April 19,
3D Graphics Processor Architecture Victor Moya. PhD Project Research on architecture improvements for future Graphic Processor Units (GPUs). Research.
IN4151 Introduction 3D graphics 1 Introduction to 3D computer graphics part 2 Viewing pipeline Multi-processor implementation GPU architecture GPU algorithms.
Many-Core Programming with GRAMPS Jeremy Sugerman Kayvon Fatahalian Solomon Boulos Kurt Akeley Pat Hanrahan.
Interactive k-D Tree GPU Raytracing Daniel Reiter Horn, Jeremy Sugerman, Mike Houston and Pat Hanrahan.
Real-Time Ray Tracing 3D Modeling of the Future Marissa Hollingsworth Spring 2009.
RT08, August ‘08 Large Ray Packets for Real-time Whitted Ray Tracing Ryan Overbeck Columbia University Ravi Ramamoorthi Columbia University William R.
GPU Graphics Processing Unit. Graphics Pipeline Scene Transformations Lighting & Shading ViewingTransformations Rasterization GPUs evolved as hardware.
RAY TRACING ON GPU By: Nitish Jain. Introduction Ray Tracing is one of the most researched fields in Computer Graphics A great technique to produce optical.
Computer Graphics 2 Lecture x: Acceleration Techniques for Ray-Tracing Benjamin Mora 1 University of Wales Swansea Dr. Benjamin Mora.
Interactive Ray Tracing: From bad joke to old news David Luebke University of Virginia.
Ray Tracing and Photon Mapping on GPUs Tim PurcellStanford / NVIDIA.
Realtime Caustics using Distributed Photon Mapping Johannes Günther Ingo Wald * Philipp Slusallek Computer Graphics Group Saarland University ( * now at.
GPUs and Accelerators Jonathan Coens Lawrence Tan Yanlin Li.
Chris Kerkhoff Matthew Sullivan 10/16/2009.  Shaders are simple programs that describe the traits of either a vertex or a pixel.  Shaders replace a.
Stefan PopovHigh Performance GPU Ray Tracing Real-time Ray Tracing on GPU with BVH-based Packet Traversal Stefan Popov, Johannes Günther, Hans- Peter Seidel,
Gregory Fotiades.  Global illumination techniques are highly desirable for realistic interaction due to their high level of accuracy and photorealism.
On a Few Ray Tracing like Algorithms and Structures. -Ravi Prakash Kammaje -Swansea University.
Introduction to Realtime Ray Tracing Course 41 Philipp Slusallek Peter Shirley Bill Mark Gordon Stoll Ingo Wald.
Interactive Visualization of Exceptionally Complex Industrial CAD Datasets Andreas Dietrich Ingo Wald Philipp Slusallek Computer Graphics Group Saarland.
Saarland University, Germany B-KD Trees for Hardware Accelerated Ray Tracing of Dynamic Scenes Sven Woop Gerd Marmitt Philipp Slusallek.
Interactive Rendering With Coherent Ray Tracing Eurogaphics 2001 Wald, Slusallek, Benthin, Wagner Comp 238, UNC-CH, September 10, 2001 Joshua Stough.
Fast BVH Construction on GPUs (Eurographics 2009) Park, Soonchan KAIST (Korea Advanced Institute of Science and Technology)
Introduction to Realtime Ray Tracing Course 41 Philipp Slusallek Peter Shirley Bill Mark Gordon Stoll Ingo Wald.
Recursion and Data Structures in Computer Graphics Ray Tracing 1.
Department of Computer Science 1 Beyond CUDA/GPUs and Future Graphics Architectures Karu Sankaralingam University of Wisconsin-Madison Adapted from “Toward.
1 Ray Tracing with Existing Graphics Systems Jeremy Sugerman, FLASHG 31 January 2006.
Hierarchical Penumbra Casting Samuli Laine Timo Aila Helsinki University of Technology Hybrid Graphics, Ltd.
- Laboratoire d'InfoRmatique en Image et Systèmes d'information
Memory Management and Parallelization Paul Arthur Navrátil The University of Texas at Austin.
Ray Tracing Animated Scenes using Motion Decomposition Johannes Günther, Heiko Friedrich, Ingo Wald, Hans-Peter Seidel, and Philipp Slusallek.
Coherent Hierarchical Culling: Hardware Occlusion Queries Made Useful Jiri Bittner 1, Michael Wimmer 1, Harald Piringer 2, Werner Purgathofer 1 1 Vienna.
From Turing Machine to Global Illumination Chun-Fa Chang National Taiwan Normal University.
COMPUTER GRAPHICS CS 482 – FALL 2015 SEPTEMBER 29, 2015 RENDERING RASTERIZATION RAY CASTING PROGRAMMABLE SHADERS.
Ray Tracing by GPU Ming Ouhyoung. Outline Introduction Graphics Hardware Streaming Ray Tracing Discussion.
Path/Ray Tracing Examples. Path/Ray Tracing Rendering algorithms that trace photon rays Trace from eye – Where does this photon come from? Trace from.
1cs426-winter-2008 Notes. 2 Atop operation  Image 1 “atop” image 2  Assume independence of sub-pixel structure So for each final pixel, a fraction alpha.
1 The Method of Precomputing Triangle Clusters for Quick BVH Builder and Accelerated Ray Tracing Kirill Garanzha Department of Software for Computers Bauman.
COMPUTER GRAPHICS CHAPTER 38 CS 482 – Fall 2017 GRAPHICS HARDWARE
Hybrid Ray Tracing and Path Tracing of Bezier Surfaces using a mixed hierarchy Rohit Nigam, P. J. Narayanan CVIT, IIIT Hyderabad, Hyderabad, India.
Graphics Processing Unit
Real-Time Ray Tracing Stefan Popov.
From Turing Machine to Global Illumination
Hybrid Ray Tracing of Massive Models
Ray Tracing Dinesh Manocha COMP 575/770.
Accelerated Single Ray Tracing for Wide Vector Units
Graphics Processing Unit
CSCE 441: Computer Graphics Ray Tracing
Ray Tracing on Programmable Graphics Hardware
Presentation transcript:

Sven Woop Computer Graphics Lab Saarland University DRPU: A Programmable Hardware Architecture for Real-Time Ray Tracing of Coherent Dynamic Scenes Sven Woop Computer Graphics Lab Saarland University

Overview Motivation: Why Ray Tracing? Previous Work DRPU Architecture FPGA Prototype ASIC Performance Estimates Conclusion & Future Work

Why not Rasterization ... Primitive Operation: Rasterize Isolated Triangles Perfect for dynamic scenes Very simple operation (good for HW) Parallel processing of triangles and fragments (good for HW) No global access to the scene All Interesting Visual Effects Need 2+ Triangles (Shadows, Reflection, Global Illumination, …) Approximations via multiple pass approaches have many issues Difficult to Use Algorithm Very Fast Hardware Implementations

... but Ray Tracing? Primive Operation: Trace a Ray O(log n) traversal operation Demand driven Global Access to Scene Automatic combination of effects (orthogonal shaders) Recursive evaluation Physical Light Simulation Embarrassingly parallel (good for HW) Accurate and realistic images Easy to use Algorithm Low Performance

... but Ray Tracing? Primive Operation: Trace a Ray O(log n) traversal operation Demand driven Global Access to Scene Automatic combination of effects (orthogonal shaders) Recursive evaluation Physical Light Simulation Embarrassingly parallel (good for HW) Accurate and realistic images Easy to use Algorithm Low Performance

... but Ray Tracing? Primive Operation: Trace a Ray O(log n) traversal operation Demand driven Global Access to Scene Automatic combination of effects (orthogonal shaders) Recursive evaluation Physical Light Simulation Embarrassingly parallel (good for HW) Accurate and realistic images Easy to use Algorithm Low Performance

... but Ray Tracing? Primive Operation: Trace a Ray O(log n) traversal operation Demand driven Global Access to Scene Automatic combination of effects (orthogonal shaders) Recursive evaluation Physical Light Simulation Embarrassingly parallel (good for HW) Accurate and realistic images Easy to use Algorithm Low Performance

... but Ray Tracing? Primive Operation: Trace a Ray O(log n) traversal operation Demand driven Global Access to Scene Automatic combination of effects (orthogonal shaders) Recursive evaluation Physical Light Simulation Embarrassingly parallel (good for HW) Accurate and realistic images Easy to use Algorithm Low Performance

... but Ray Tracing? Primive Operation: Trace a Ray O(log n) traversal operation Demand driven Global Access to Scene Automatic combination of effects (orthogonal shaders) Recursive evaluation Physical Light Simulation Embarrassingly parallel (good for HW) Accurate and realistic images Easy to use Algorithm Low Performance

... but Ray Tracing? Primive Operation: Trace a Ray O(log n) traversal operation Demand driven Global Access to Scene Automatic combination of effects (orthogonal shaders) Recursive evaluation Physical Light Simulation Embarrassingly parallel (good for HW) Accurate and realistic images Easy to use Algorithm Low Performance

Previous Work Ray Tracers for Static Scenes CPU based: [OpenRT], [MLRT SIGGRAPH05] GPU based: Purcell (Grids) [SIGGRAPH02], Foley et al. (KD Trees) [GH05] Stefan Popov (Stackless KD Tree traversal) [EG07] Custom Hardware: ART-VPS (AR350 Chip for offline rendering) Schmittler (SaarCOR) [GH04] Woop (RPU) [SIGGRAPH05] Ray Tracers for Dynamic Scenes CPU based: Wald (Grids) [SIGGRAPH06] Wald (AABVHs) [TOG / Tech. Rep. 2006] Wächter and Keller (BIH) [EG06] Johannes Günther (Motion Decomposition) [EG06] Custom Hardware: Woop (B-KD Trees) [GH06] Woop (DRPU-ASIC) [RT06]

Why isn’t everybody using Ray Tracing … Low Performance High computational complexity 1 million pixels (minimal) 30 frames per second (minimal) 10 rays per pixel (minimal) At least 300 million rays 24 billion traversal steps (80 trav. steps per ray) 240 billion instructions (10 instructions) 0.5 trillion (5E11) cycles (instruction dependencies) Limited Support for Dynamic Scenes Due to need of spatial index structures (costly rebuild O(n log n)) But most graphics applications are highly dynamic (e.g. computer games)

… and what can be done? Hardware Implementation (DRPU) High performance through dedicated hardware units  A high end ASIC implementation would provide enough performance for computer games using RT (about 200 million rays/s) Algorithmic Changes B-KD Trees as spatial index structure  Supports most kinds of dynamic scenes

DRPU Architecture Task Parallelism Optimized Hardware Units vertices from memory

DRPU Architecture Rendering Units Synchronous execution of packets of 4 rays  Memory bandwidth reduction (combining)  Sharing of HW (e.g. caches) Highly multi-threaded  Higher hardware usage First level caches  Memory bandwidth reduction  Memory latency reduction vertices from memory

DRPU Hardware Architecture vertices from memory

DRPU Architecture Programmable Shading Processor Fully programmable In-order execution 4-component SIMD operations Similar Instruction set to GPUs, but: Efficient recursion Flexible memory access Programming Model Material shading Ray generation tasks Calls Ray Casting Units to cast rays vertices from memory

DRPU Architecture Programmable Shading Unit Ray Casting Units Find closest intersection of a ray with the scene High-performance traversal and intersection Implement the atomic “trace” instruction of Shading Processor SP can continue scheduling instruction not dependent on intersection result vertices from memory

DRPU Architecture Programmable Shading Unit Ray Casting Units Traversal Processor B-KD Tree approach vertices from memory

Definition of B-KD Trees B-KD Tree (Bounded KD-Tree) Binary Tree 1D bounding intervals (or slabs) for each child Leaf nodes point to a single primitive  Bounding Volume Hierarchy (subdivides geometry)

B-KD Tree Semantics B-KD Tree (Bounded KD-Tree) B(T) Each node T can be assigned a box B(T) B(T)

B-KD Tree Semantics B-KD Tree (Bounded KD-Tree) B(T) Hright(min_1) = { (x,y,z) | x >= min_1 } B(T)

B-KD Tree Semantics B-KD Tree (Bounded KD-Tree) Hright(min_1) = { (x,y,z) | x >= min_1 } Hleft(max_1) = { (x,y,z) | x <= max_1 }

B-KD Tree Semantics B-KD Tree (Bounded KD-Tree) B(T) Hleft(min_1)  Hright(max_1) B(T)

B-KD Tree Semantics B-KD Tree (Bounded KD-Tree) B(T) B(T0) B(Troot) = R3 B(T0) = B(T)  Hleft(min_0)  Hright(max_0) B(T1) = B(T)  Hleft(min_1)  Hright(max_1) B(T) B(T0)

B-KD Tree Example

B-KD Tree Example

B-KD Tree Example

B-KD Tree Example

B-KD Tree Example Boxes may Overlap More traversal steps as for KD Tree Support for dynamic scenes

B-KD Tree Example Boxes may Overlap More traversal steps as for KD Tree Support for dynamic scenes

Traversal of B-KD Trees Interval Algorithm B(T)

Traversal of B-KD Trees Interval Algorithm Early ray termination B(T)

Traversal of B-KD Trees Interval Algorithm Early ray termination Compute Distances

Traversal of B-KD Trees Interval Algorithm Early ray termination Compute Distances

Traversal of B-KD Trees Interval Algorithm Early ray termination Compute Distances Clipping of near/far interval against both bounding intervals Simple min/max operations

Traversal of B-KD Trees Interval Algorithm Early ray termination Compute Distances Clipping of near/far interval against both bounding intervals Take closer child, push farther child to stack Traversal order does not affect correctness

Traversal Processor Stack control computes next address 36 FPUs

Traversal Processor 36 FPUs Stack control computes next address Next node is fetched from cache 36 FPUs

Traversal Processor 36 FPUs Stack control computes next address Next node is fetched from cache 4 traversal slices compute 4x4 distances to bounding planes 36 FPUs

Traversal Processor 36 FPUs Stack control computes next address Next node is fetched from cache 4 traversal slices compute 4x4 distances to bounding planes 4 Decision Units compute per ray traversal decision 36 FPUs

Traversal Processor Stack control computes next address Next node is fetched from cache 4 traversal slices compute 4x4 distances to bounding planes 4 Decision Units compute per ray traversal decision Packet Decision Unit computes packet traversal decision Packet goes left if exists a that ray goes left Packet goes right if exists a ray that goes right Packet goes from left to right if exists a ray that goes into both children from left to right

Traversal Processor Stack control computes next address Next node is fetched from cache 4 traversal slices compute 4x4 distances to bounding planes 4 Decision Units compute per ray traversal decision Packet Decision Unit computes packet traversal decision Packet goes left if exists a that ray goes left Packet goes right if exists a ray that goes right Packet goes from left to right if exists a ray that goes into both children from left to right  Incoherent packets possible

DRPU Architecture Programmable Shading Unit Ray Casting Units Traversal Processor Geometry Processor Ray transformations Vertex-based ray/triangle intersection [Möller Trumbore] Solve linear system of equations with 3 unknowns Shared vertices save memory 6x 1 ray/triangle intersection each 2 cycle 38 floating point units vertices from memory

DRPU Architecture Programmable Shading Unit Ray Casting Units Scene Changes Skinning Processor Skeleton Subspace Deformation Re-uses Geometry Unit  4 additional floating point units Pure stream architecture vertices from memory

B-KD Trees for Dynamic Scenes B-KD Tree Approach Initially build B-KD tree O(n log n) Update after each frame O(n) Updating Works well for Continuous motion where structure of motion matches tree structure E.g. skinned meshes, characters, water surfaces, ... Not Optimal for Random motions, turbulence However amortizing O(n log n) reconstruction over many frames is feasible

Examples Bounding Approaches Perform well for Continous motion Structure of motion must match tree structure E.g. skinned meshes, characters, water surfaces, ...

Examples Bounding Approaches Perform well for Continous motion Structure of motion must match tree structure E.g. skinned meshes, characters, water surfaces, ...

Examples Bounding Approaches Perform well for Continous motion Structure of motion must match tree structure E.g. skinned meshes, characters, water surfaces, ...

Examples Bounding Approaches Perform well for Continous motion Structure of motion must match tree structure E.g. skinned meshes, characters, water surfaces, ...

Examples Bounding Approaches Perform well for Continous motion Structure of motion must match tree structure E.g. skinned meshes, characters, water surfaces, ...

Examples Bounding Approaches Perform well for Continous motion Structure of motion must match tree structure E.g. skinned meshes, characters, water surfaces, ...

Examples Bounding Volume Approaches are less Efficient for Non-continous motion Structure of motion does not match tree structure High traversal cost due to large overlapping boxes

Examples Bounding Volume Approaches are less Efficient for Non-continous motion Structure of motion does not match tree structure High traversal cost due to large overlapping boxes

Examples Bounding Volume Approaches are less Efficient for Non-continous motion Structure of motion does not match tree structure High traversal cost due to large overlapping boxes

Examples Bounding Volume Approaches are less Efficient for Non-continous motion Structure of motion does not match tree structure High traversal cost due to large overlapping boxes

DRPU Architecture Programmable Shading Unit Ray Casting Units Scene Changes Skinning Processor Update Processor In-order execution 32 bit instructions Precomputed Instruction Stream Load vertex, merge 3 vertices, merge 2 boxes ¼ more memory (#vertices + #nodes) instructions One B-KD node update each two clock cycle peak vertices from memory

FPGA Implementation Hardware Implementation Virtex4 Board HWML Hardware Description Xilinx Virtex4 LX160 66 MHz clock frequency 1.0 GB/s memory bandwidth 7.5 Gflops (113 floating point units) 2,3 Gflops programmable 5,2 Gflops fixed function Implementation Packets of 4 rays 32 packets of rays 3x 8 KB caches, direct mapped 24 bit floating point Virtex4 Board

Video

ASIC Implementation Implementation Differences Synthesis Place & Route Larger caches (3x 16 KB, 4-way associative) 32 bit floating point Synthesis Synopsys Synthesis UMC 130nm CMOS process Place & Route Cadence Encounter Manual placements to achieve good results Only DRPU Core No chip interface designed (PCI Express, DRAM, ...) DRPU-ASIC

DRPU-ASIC Hardware Very Efficient Fixed Function Units UMC 130nm CMOS process 49 mm2 266 MHz clock 2.1 GB/s bandwidth 30 Gflops 10 Gflops programmable 20 Gflops fixed function Very Efficient Fixed Function Units GP via SP: 5x smaller area, 3x higher performance  15 times more efficient (performance per area) 7mm 7mm

DRPU8-ASIC Hardware 90nm CMOS process extrapolated using constant field scaling 186 mm2 die 400 MHz clock speed 25,6 GB/s bandwidth 361 Gflops 110 Gflops programmable 471 Gflops fixed function  About 150 - 400 million shaded rays per second 19,3 mm 9,6 mm

Results at 1024x768 with shadows

Conclusion and Future Work Efficient Hardware Ray Tracing is Possible Performance levels sufficient for computer games could be achieved Even support for Dynamic Scenes  Ray Tracing ready to replace rasterization? But Still Open Questions Anti-aliasing (many rays per pixel) Arbitrary dynamics (reconstruction) What about advanced global illumination (e.g. photon mapping) ?

Questions?

Instruction Set of Shading Processor Short vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4 Input modifiers Swizzeling, negation, masking Multiply with power of 2 Special operations (modifiers) rcp, rsq, sat Fast 2D texture lookups texload, texload4x Read from and write to memory load, load4x, store Ray traversal operation trace Conditional instructions (paired) if <condition> jmp label if <condition> call <fun> If <condition> return Dual issue (pairing) 3/1 and 2/2 arithmetic splitting Arithmetic + load Arithmetic + conditional jump, call, return

Instruction Set of Shading Processor Short vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4 Input modifiers Swizzeling, negation, masking Multiply with power of 2 Special operations (modifiers) rcp, rsq, sat Fast 2D texture lookups texload, texload4x Read from and write to memory load, load4x, store Ray traversal operation trace Conditional instructions (paired) if <condition> jmp label if <condition> call <fun> If <condition> return Dual issue (pairing) 3/1 and 2/2 arithmetic splitting Arithmetic + load Arithmetic + conditional jump, call, return

Instruction Set of Shading Processor Short vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4 Input modifiers Swizzeling, negation, masking Multiply with power of 2 Special operations (modifiers) rcp, rsq, sat Fast 2D texture lookups texload, texload4x Read from and write to memory load, load4x, store Ray traversal operation trace Conditional instructions (paired) if <condition> jmp label if <condition> call <fun> If <condition> return Dual issue (pairing) 3/1 and 2/2 arithmetic splitting Arithmetic + load Arithmetic + conditional jump, call, return

Instruction Set of Shading Processor Short vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4 Input modifiers Swizzeling, negation, masking Multiply with power of 2 Special operations (modifiers) rcp, rsq, sat Fast 2D texture lookups texload, texload4x Read from and write to memory load, load4x, store Ray traversal operation trace Conditional instructions (paired) if <condition> jmp label if <condition> call <fun> If <condition> return Dual issue (pairing) 3/1 and 2/2 arithmetic splitting Arithmetic + load Arithmetic + conditional jump, call, return

Instruction Set of Shading Processor Short vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4 Input modifiers Swizzeling, negation, masking Multiply with power of 2 Special operations (modifiers) rcp, rsq, sat Fast 2D texture lookups texload, texload4x Read from and write to memory load, load4x, store Ray traversal operation trace Conditional instructions (paired) if <condition> jmp label if <condition> call <fun> If <condition> return Dual issue (pairing) 3/1 and 2/2 arithmetic splitting Arithmetic + load Arithmetic + conditional jump, call, return

Instruction Set of Shading Processor Short vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4 Input modifiers Swizzeling, negation, masking Multiply with power of 2 Special operations (modifiers) rcp, rsq, sat Fast 2D texture lookups texload, texload4x Read from and write to memory load, load4x, store Ray traversal operation trace Conditional instructions (paired) if <condition> jmp label if <condition> call <fun> If <condition> return Dual issue (pairing) 3/1 and 2/2 arithmetic splitting Arithmetic + load Arithmetic + conditional jump, call, return

Hardware Description Problem  HWML Most hardware description languages operate on a low abstraction level (e.g. VHDL, Verilog, …) High level languages are behavioral (Handel-C, Mitrion-C, …)  no reliable mapping to hardware Need high-level structural language  HWML Structural hardware description Implemented as an SML library Design add, mul, fadd, fmul, rcp, rsq, … Allows a compact description of HW algorithms, e.g.: 8000 LOC for the entire DRPU 160 LOC for a full implementation of the Tomasulo algorithm

HWML Features Functional Recursive circuit descriptions Circuit descriptions are SML functions Functions can operate on circuits (e.g. arbitrary reductions) Recursive circuit descriptions Important for the implementation of arithmetic units (e.g. adders) Abstract Data Types Polymorphic functions (e.g. a single FIFO operates on different types of data)  Allows for full parameterized designs (e.g. change floating point precision) Data Stream Abstraction Only one communication protocol in complete chip Automatic pipelining of circuits (higher order operator) Automatically generates highly efficient implementation Atomar support for multiported (typed) memories Allows to map memories efficiently to different platforms (e.g. memory compilers for CMOS processes) Generate FPGA and ASIC from one description

Brute Force Ray Tracing Demands Property Standard Quality Medium Quality High Quality Resolution 1024x768 1920x1080 FPS 30 60 Rays per Pixel 10 200 Total Rays/s 250M 1.2B 24.0B ...