Estimating Performance of a Ray-Tracing ASIC Design
Sven Woop †, Erik Brunvand ‡, Philipp Slusallek †
† Saarland University, Germany; ‡ University of Utah, USA


Ray Tracing in Car Industry

Ray Tracing Games

Previous Work
- Ray tracers for static scenes
  - CPU based: [OpenRT], [MLRT SIGGRAPH05]
  - GPU based: Purcell (grids) [SIGGRAPH02], Foley et al. (KD trees) [GH05]
  - Custom hardware: commercial hardware (ART-VPS), Schmittler (KD trees) [GH04], RPU (KD trees) [SIGGRAPH05]
- Ray tracers for dynamic scenes
  - CPU based: Wald (grids) [SIGGRAPH06], Wald (AABVHs) [TOG / Tech. Rep. 2006]
  - Custom hardware: Woop (B-KD trees) [GH06]

Outline
- Previous Work
- DRPU Architecture
  - B-KD Trees
  - Traversal Processor
- Prototype Implementations
  - DRPU-FPGA
  - DRPU-ASICs
- Conclusion

Definition of B-KD Trees
B-KD tree (bounded KD tree):
- Binary tree
- 1D bounding intervals for each child
- Leaf nodes point to a single primitive

B-KD Tree Subdivision
- Bounding volume hierarchy (partially unbounded)
- Each node can be associated with a full bounding box
- Bounds may overlap
→ Primitives in single leaf nodes
→ More traversal steps than for a KD tree
→ Support for dynamic scenes

Update of B-KD Trees
Update procedure:
- Bounds are updated when the geometry changes
- B-KD tree structure remains constant
→ Linear update complexity
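The update procedure can be illustrated with a minimal Python sketch. The `Node` layout, the `refit` name and the per-primitive bounding boxes are assumptions made for illustration; the slides do not give the node format, and the hardware Update Processor works in a stream-like, partially breadth-first fashion rather than by recursion. The sketch only shows why the cost is linear: the tree structure is untouched and each node is visited once.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Interval = Tuple[float, float]   # (lo, hi) along one axis
Box = List[Interval]             # three intervals: x, y, z


@dataclass
class Node:
    """Hypothetical B-KD node: an inner node stores one 1D interval per child."""
    axis: int = 0                          # axis the child intervals refer to
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    left_bounds: Interval = (0.0, 0.0)
    right_bounds: Interval = (0.0, 0.0)
    prim: Optional[int] = None             # set for leaves: the single primitive


def refit(node: Node, prim_boxes: List[Box]) -> Box:
    """Recompute child intervals bottom-up and return the subtree's full box.

    Only bounds are rewritten; the tree structure stays as it is.
    """
    if node.prim is not None:
        return prim_boxes[node.prim]
    lbox = refit(node.left, prim_boxes)
    rbox = refit(node.right, prim_boxes)
    node.left_bounds = lbox[node.axis]
    node.right_bounds = rbox[node.axis]
    # The union of both child boxes is the full box of this node.
    return [(min(l[0], r[0]), max(l[1], r[1])) for l, r in zip(lbox, rbox)]


# Two primitives; the second one moves between frames.
root = Node(axis=0, left=Node(prim=0), right=Node(prim=1))
boxes = [[(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)],
         [(2.0, 3.0), (0.0, 2.0), (-1.0, 1.0)]]
frame0_box = refit(root, boxes)
boxes[1] = [(0.5, 1.5), (0.0, 2.0), (-1.0, 1.0)]   # geometry changed
frame1_box = refit(root, boxes)                     # intervals now overlap
```

After the second call the two child intervals overlap on the x axis, which a B-KD tree permits.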

DRPU Architecture
[Figure: DRPU architecture diagram; label "vertices from memory"]

DRPU Architecture
Rendering Units:
- Highly multi-threaded → higher hardware usage
- Synchronous execution of packets of 4 rays → memory bandwidth reduction
- First level caches → memory bandwidth reduction

DRPU Architecture
Programmable Shading Processor:
- Design similar to fragment processors on GPUs
- Improved programming model
  - Adds highly efficient recursion
  - Adds flexible memory access
- Programming model
  - Ray generation tasks
  - Material shading
  - Calls Ray Casting Units to cast rays

DRPU Architecture
Programmable Shading Unit
Ray Casting Units:
- High-performance traversal and intersection
- Support for continuous dynamic scenes
- B-KD trees approach

DRPU Architecture
Programmable Shading Unit
Ray Casting Units:
- Traversal Processor
  - Efficient traversal of B-KD trees
- Geometry Unit
  - Ray transformations
  - Vertex-based ray/triangle intersection [Möller Trumbore]
  - Shared vertices save memory (6x)
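As a reference for the vertex-based test named above, here is a minimal Python sketch of the standard Möller-Trumbore ray/triangle intersection over an indexed triangle list. The function and variable names are illustrative assumptions, and the sketch says nothing about how the Geometry Unit pipelines the arithmetic in hardware. The indexed layout shows what shared vertices mean: triangles store vertex indices, so a vertex used by several triangles is stored once.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Moeller-Trumbore test: return (t, u, v) for a hit, or None for a miss."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:            # ray is parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv_det       # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv_det      # distance along the ray
    return (t, u, v) if t > eps else None


# Indexed mesh: two triangles share the edge between vertices 1 and 2.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
triangles = [(0, 1, 2), (1, 3, 2)]

i0, i1, i2 = triangles[0]
hit = ray_triangle((0.25, 0.25, 1.0), (0.0, 0.0, -1.0),
                   vertices[i0], vertices[i1], vertices[i2])
miss = ray_triangle((2.0, 2.0, 1.0), (0.0, 0.0, -1.0),
                    vertices[i0], vertices[i1], vertices[i2])
```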

DRPU Architecture
Programmable Shading Unit
Ray Casting Units
Scene Changes:
- Skinning Processor (see paper)
  - Skeleton subspace deformation
  - Re-uses Geometry Unit
  - Pure stream architecture
- Update Processor
  - Stream-like architecture
  - Partial breadth-first execution
  - Peak of one B-KD node update per clock cycle

Traversal of B-KD Trees
- Early ray termination
- Clipping of near/far interval against both bounding intervals
- Take closer child, push farther child to stack
- Traversal order does not affect correctness
Complexity:
- 4x computational cost of a KD tree traversal step
- 2x stack memory
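The traversal steps listed above can be sketched for a single ray in Python. This is an illustrative software version under stated assumptions: nodes are plain dictionaries with hypothetical keys (`axis`, `left`, `right`, `left_bounds`, `right_bounds`, `prim`), and the primitive test is passed in as a callback. Each inner step computes four plane distances (two per child interval) where a KD tree step computes one, and each stack entry carries its own near/far interval, which is consistent with the 4x and 2x figures on the slide.

```python
import math


def clip(o, d, bounds, t_near, t_far):
    """Clip the ray interval [t_near, t_far] against one 1D bounding interval."""
    lo, hi = bounds
    if d == 0.0:                      # ray parallel to the bounding planes
        return (t_near, t_far) if lo <= o <= hi else None
    t0, t1 = (lo - o) / d, (hi - o) / d
    if t0 > t1:
        t0, t1 = t1, t0
    t_near, t_far = max(t_near, t0), min(t_far, t1)
    return (t_near, t_far) if t_near <= t_far else None


def traverse(root, origin, direction, intersect):
    """Return (primitive, t) of the closest hit, or (None, inf) for a miss."""
    best_t, best_prim = math.inf, None
    stack = [(root, 0.0, math.inf)]
    while stack:
        node, t_near, t_far = stack.pop()
        t_far = min(t_far, best_t)    # early ray termination
        if t_near > t_far:
            continue
        if node["prim"] is not None:  # leaf: test its single primitive
            t = intersect(node["prim"], origin, direction)
            if t is not None and 0.0 <= t < best_t:
                best_t, best_prim = t, node["prim"]
            continue
        a = node["axis"]
        hits = []
        for side in ("left", "right"):
            seg = clip(origin[a], direction[a], node[side + "_bounds"], t_near, t_far)
            if seg is not None:
                hits.append((seg[0], seg[1], node[side]))
        # Push the farther child first so the closer child is popped next.
        hits.sort(key=lambda h: h[0], reverse=True)
        for tn, tf, child in hits:
            stack.append((child, tn, tf))
    return best_prim, best_t


# Toy scene: two planes x = 1 and x = 5 stand in for primitives 0 and 1.
planes = {0: 1.0, 1: 5.0}
tested = []                           # records which primitives were tested


def hit_plane(prim, o, d):
    tested.append(prim)
    if d[0] == 0.0:
        return None
    t = (planes[prim] - o[0]) / d[0]
    return t if t >= 0.0 else None


tree = {"prim": None, "axis": 0,
        "left": {"prim": 0}, "right": {"prim": 1},
        "left_bounds": (0.5, 1.5), "right_bounds": (4.5, 5.5)}

forward = traverse(tree, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), hit_plane)
tested_forward = list(tested)
```

For the forward ray only primitive 0 is tested: once it is hit, the far child lies beyond the hit distance and is discarded when popped.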

Traversal Processing Unit
- Stack control computes next address
- Next node is fetched from cache
- 4 traversal slices compute 4x4 distances to bounding planes
- 4 Decision Units compute per-ray traversal decisions
- Packet Decision Unit computes packet traversal decision
  - Packet goes left if there exists a ray that goes left
  - Packet goes right if there exists a ray that goes right
  - Packet goes from left to right if there exists a ray that goes into both children from left to right
→ Incoherent packets possible
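The packet rules can be written down as a small Python sketch. The `RayDecision` record and the `packet_decision` function are hypothetical names, and per-ray activity masks are not modeled. One point is an assumption rather than something the slides state: when both children are visited but no ray enters both from left to right, the sketch falls back to right-to-left order.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RayDecision:
    """Hypothetical output of one per-ray Decision Unit."""
    left: bool         # ray interval overlaps the left child
    right: bool        # ray interval overlaps the right child
    left_first: bool   # ray enters the left child before the right child


def packet_decision(rays: List[RayDecision]) -> Tuple[str, ...]:
    """Combine per-ray decisions into the visiting order for the whole packet."""
    go_left = any(r.left for r in rays)
    go_right = any(r.right for r in rays)
    left_to_right = any(r.left and r.right and r.left_first for r in rays)
    if go_left and go_right:
        # Assumed fallback: right-to-left unless some ray goes left-to-right.
        return ("left", "right") if left_to_right else ("right", "left")
    if go_left:
        return ("left",)
    if go_right:
        return ("right",)
    return ()


coherent = packet_decision([RayDecision(True, False, True)] * 4)
incoherent = packet_decision([
    RayDecision(True, True, True),     # wants left, then right
    RayDecision(True, True, False),    # wants right, then left
    RayDecision(False, True, False),
    RayDecision(False, False, False),  # misses both children
])
```

In the incoherent case the packet is traversed left to right even though one ray would prefer the opposite order. Because traversal order does not affect correctness, that ray only loses some efficiency.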

FPGA Implementation
Hardware:
- Xilinx Virtex4 LX MHz
- 1.0 GB/s (limited to 0.5 GB/s)
- 7.5 Gflops
- 2.3 Gflops programmable
- 5.2 Gflops fixed function
Implementation:
- Packets of 4 rays
- 32 packets of rays
- 3x 8 KB caches, direct mapped
- 24 bit floating point
[Figure: Virtex4 board]

ASIC Design
Synthesis:
- Synopsys synthesis
- UMC 130nm CMOS process
Place & Route:
- Cadence Encounter
- Some manual placement to achieve good results
Only the DRPU core:
- No chip interface designed (PCI Express, DRAM, ...)
- No power estimation

DRPU-ASIC
Hardware:
- UMC 130nm process
- Die size: 49 mm² (figure label: 7 mm)
- MHz clock
- 2.1 GB/s bandwidth
- 30 Gflops
Implementation differences:
- Larger caches (3x 16 KB, 4-way associative)
- 32 bit floating point

GPU Complexity
ATI R520 (October 2005):
- 90nm process
- 288 mm² die
- 600 MHz clock speed
- 170 GFlops programmable?
- 44.8 GB/s memory bandwidth
Implementation:
- Packets of 4 fragments
- 16 fragment pipelines
- 8 vertex pipelines
- 32 bit floating point
[Figure label: 7 mm]

On-Chip Parallelization
- Thread Scheduler schedules packets
- High bandwidth memory interface to Rendering Units

DRPU4 ASIC
Hardware:
- UMC 130nm process
- 196 mm² die (4 x 49 mm²)
- 266 MHz clock
- 8.5 GB/s
- 120 GFlops
Implementation differences:
- 4x DRPU ASIC
- No high level control
[Figure label: 14 mm]

DRPU8-ASIC
Hardware:
- 90nm process (extrapolated using constant field scaling)
- 186 mm² die
- 400 MHz clock speed
- 25.6 GB/s bandwidth
- 361 Gflops
- 110 Gflops programmable
- 471 Gflops fixed function
Implementation differences:
- 8x DRPU-ASIC
[Figure labels: 9.6 mm, 19.3 mm]
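The extrapolation can be sanity-checked with a few lines of Python. This is a back-of-the-envelope sketch, not the authors' calculation. It assumes ideal constant field scaling (area shrinks with the square of the feature-size ratio, clock rises linearly with it), takes the 266 MHz base clock from the DRPU4 slide, and assumes that throughput and bandwidth scale with unit count and clock.

```python
# Ideal constant field scaling from 130 nm to 90 nm (assumption: no overheads).
s = 130.0 / 90.0                  # linear scaling factor

single_area_mm2 = 49.0            # one DRPU core at 130 nm
single_gflops = 30.0              # one DRPU core at the base clock
base_clock_mhz = 266.0            # clock quoted for DRPU4 at 130 nm
drpu4_bandwidth_gbs = 8.5         # bandwidth quoted for DRPU4 (4 cores)
quoted_clock_mhz = 400.0          # clock quoted for DRPU8 at 90 nm

ideal_area_mm2 = 8 * single_area_mm2 / s ** 2   # area shrinks by 1/s^2
ideal_clock_mhz = base_clock_mhz * s            # clock rises by s

# Throughput and bandwidth scaled by unit count and by the quoted clock.
gflops = 8 * single_gflops * quoted_clock_mhz / base_clock_mhz
bandwidth_gbs = 2 * drpu4_bandwidth_gbs * quoted_clock_mhz / base_clock_mhz

print(f"area      ~{ideal_area_mm2:.0f} mm^2 (slide: 186 mm^2)")
print(f"clock     ~{ideal_clock_mhz:.0f} MHz  (slide: 400 MHz)")
print(f"compute   ~{gflops:.0f} Gflops (slide: 361 Gflops)")
print(f"bandwidth ~{bandwidth_gbs:.1f} GB/s (slide: 25.6 GB/s)")
```

Under these assumptions the ideal figures land within a few percent of the quoted die size and clock, and the compute and bandwidth figures match the slide after rounding.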

Results 1024x768, shadows

Results for DRPU8
→ Performance sufficient for game play
→ Room for improving image quality
Gael: 91.2 fps; DynGael: 96.0 fps

Conclusions and Future Work
Ray tracing hardware design:
- Support for programmable recursive shading
- Coherent scene changes
- Working prototype implementation
Post-layout ASIC results:
- Still no power results
- No direct performance comparison against GPU

Questions? Project Homepage: Computer Graphics Lab at Saarland University: