Real-Time Reyes: Programmable Pipelines and Research Challenges Anjul Patney University of California, Davis.

Real-Time Reyes-Style Adaptive Surface Subdivision Anjul Patney and John D. Owens SIGGRAPH Asia 2008 (to appear)

Image courtesy: Pixar

Cinematic rendering looks good Image courtesy: Pixar

High geometric complexity Image courtesy: Pixar

High shading complexity Image courtesy: Pixar

Motion blur, Depth-of-field Image courtesy: Pixar

Complex lighting Image courtesy: Pixar

All this is slow Image courtesy: Pixar

Our Goal Make it faster – As much as possible in real time Use the GPU – Massive available parallelism – Fixed-function texture/raster units – High Programmability

Enter Reyes Developed in 1980s, offline Pixel-accurate smooth surfaces Eye-space shading Stochastic sampling Order-independent transparency

Reyes: Bound/Split, Dice (tessellate), Displace, Shade, Compose, Sample. Direct3D 10: Vertex Shade, Geometry Shade, Rasterize, Fragment Shade, Blend, Depth-buffer

Reyes: Bound/Split, Dice (tessellate), Displace, Shade, Compose, Sample – High-quality geometry – Convenient surface shading – Cinematic quality – Well-behaved (?)

Reyes: Bound/Split, Dice (tessellate), Displace, Shade, Compose, Sample – Input: Smooth surfaces – Output: 0.5 x 0.5 pixel quads (micropolygons)

Outline Reyes Subdivision – algorithm Subdivision on GPU – implementation Results Limitations

Split-Dice Recursively split a surface till dicing makes sense Uniformly sample it to form a grid of micropolygons

[Flowchart] Bound/Cull, then Diceable? – No: Split and repeat – Yes: Dice
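The split-dice loop can be sketched recursively as below. This is only an illustration under loud assumptions: a "surface" is reduced to a screen-space rectangle, and `max_size`, `is_visible`, and the 16 x 16 grid resolution are hypothetical choices, not values taken from the talk. A real Reyes renderer would bound a projected parametric patch instead.

```python
# Hypothetical simplification: a "surface" is a screen-space rectangle
# (x0, y0, x1, y1). Real Reyes bounds a projected patch instead.

def bound(patch):
    """Return the screen-space width and height of the patch."""
    x0, y0, x1, y1 = patch
    return x1 - x0, y1 - y0

def is_visible(patch, width, height):
    """True if the patch overlaps a width x height screen."""
    x0, y0, x1, y1 = patch
    return x1 > 0 and y1 > 0 and x0 < width and y0 < height

def split(patch):
    """Halve the patch along its longer screen-space axis."""
    x0, y0, x1, y1 = patch
    w, h = bound(patch)
    if w >= h:
        xm = 0.5 * (x0 + x1)
        return [(x0, y0, xm, y1), (xm, y0, x1, y1)]
    ym = 0.5 * (y0 + y1)
    return [(x0, y0, x1, ym), (x0, ym, x1, y1)]

def dice(patch, n=16):
    """Uniformly sample the patch into an n x n grid of points."""
    x0, y0, x1, y1 = patch
    return [(x0 + (x1 - x0) * i / (n - 1), y0 + (y1 - y0) * j / (n - 1))
            for j in range(n) for i in range(n)]

def split_dice(patch, width, height, max_size=8.0, n=16):
    """Bound/cull, then dice if small enough, otherwise split and recurse."""
    if not is_visible(patch, width, height):
        return []                      # cull
    w, h = bound(patch)
    if max(w, h) <= max_size:
        return [dice(patch, n)]        # diceable: emit one grid
    grids = []
    for child in split(patch):         # not diceable: split and repeat
        grids.extend(split_dice(child, width, height, max_size, n))
    return grids

grids = split_dice((0.0, 0.0, 32.0, 16.0), width=64, height=64)
```

With these assumed numbers an 8-pixel patch diced into 16 samples per side gives a spacing of about half a pixel, in line with the micropolygon size quoted earlier. The recursion is also what makes the serial version awkward on a GPU.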

[Figure labels] Post-split surface, Grid, Micropolygon

Split-Dice is hard Split – Recursive, serial – Rapid primitive generation/destruction Dice – Huge memory demand

Can we do this in parallel?

Parallel Split A lot of independent operations Regular computation

Can we do this in parallel? A lot of independent operations – Our simplest model: 5K primitives, 1.2M micropolygons Regular computation – Bound/Cull/Split everything together – Dice everything together

Analogy: A dynamic work queue [Figure: a queue of primitives A to I; in one iteration each primitive gets one of three actions (Split, No Action, Cull), producing the next queue]

How can we do these efficiently? Creating new primitives – How to dynamically allocate space? Culling unneeded primitives – How to avoid fragmentation?

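Our Choice – keep it simple… A child primitive is offset by the queue length [Figure: the queue A to I; survivors A, B, C, E, F, H stay in their slots, culled primitives leave holes, and the children of the split primitives A, C, E are written one queue length further along]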

…and get rid of the holes later Scan-based compact is fast! (Sengupta ’07) [Figure: the queue with holes is compacted into a contiguous queue]
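One breadth-first iteration followed by a scan-based compaction might look like the sketch below. It is a sequential Python stand-in for what would be parallel GPU kernels. The names `step`, `compact`, and the `decide` callback are hypothetical, and labelling children with "0"/"1" suffixes is purely for illustration.

```python
from itertools import accumulate

SPLIT, KEEP, CULL = "split", "keep", "cull"

def step(queue, decide):
    """One breadth-first iteration over the whole queue.

    Slot i holds the first result for primitive i; the second child of a
    split goes to slot i + len(queue). Every write target is known from i
    alone, so no dynamic allocator is needed and writes never conflict.
    """
    n = len(queue)
    out = [None] * (2 * n)               # None marks a hole
    for i, prim in enumerate(queue):     # each i would be one GPU thread
        action = decide(prim)
        if action == SPLIT:
            out[i], out[i + n] = prim + "0", prim + "1"
        elif action == KEEP:
            out[i] = prim
        # CULL: both slots stay empty
    return out

def compact(slots):
    """Remove holes using a prefix sum (scan) over validity flags."""
    flags = [0 if s is None else 1 for s in slots]
    inclusive = list(accumulate(flags))
    total = inclusive[-1] if inclusive else 0
    result = [None] * total
    for i, s in enumerate(slots):        # a parallel scatter on a GPU
        if flags[i]:
            # inclusive sum minus own flag = exclusive sum = output index
            result[inclusive[i] - flags[i]] = s
    return result

actions = {"A": SPLIT, "B": KEEP, "C": CULL, "D": SPLIT}
slots = step(list("ABCD"), actions.get)
queue = compact(slots)
```

Here `slots` has holes where C was culled and where B and C produced no second child; `compact` closes them so the next iteration again sees a contiguous queue.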

Storage Issues for Dice Too many micropolygons – Cannot reject early Screen-space buckets (tiles) – In parallel ? – Ideal bucket size ?

Outline Reyes Subdivision – algorithm Subdivision on GPU – implementation Results Limitations

Platform NVIDIA GeForce 8800 GTX – 16 SIMD Multiprocessors – 16 KB shared memory per multiprocessor – 768 MB memory (no cache) NVIDIA CUDA 1.1 – Grids/Blocks/Threads – OpenGL interface

Implementation Details Input choice: Bicubic Bézier Surfaces – Only affects implementation View Dependent Subdivision every frame – Single CPU-GPU transfer Final micropolygons written to a VBO – Flat-shaded and displayed

Kernels Implemented Dice – Regular, symmetric, parallel – 256 threads per patch – Primitive information in shared memory Bound/Split – Hard to ensure efficiency
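A minimal sketch of the dice step follows, assuming that the 256 threads per patch map to a 16 x 16 grid of samples on a bicubic Bézier patch (the mapping is an assumption; the slide only gives the thread count). `dice_thread` stands for the work of a single thread, and the function names are hypothetical.

```python
def bernstein3(t):
    """Cubic Bernstein basis values at parameter t."""
    s = 1.0 - t
    return (s * s * s, 3 * s * s * t, 3 * s * t * t, t * t * t)

def dice_thread(ctrl, tid, n=16):
    """Work of one thread: evaluate one grid point of the patch.

    ctrl is a 4 x 4 array of (x, y, z) control points; thread tid owns
    sample (i, j) of the n x n grid. Every thread does identical work.
    """
    i, j = tid % n, tid // n
    u, v = i / (n - 1), j / (n - 1)
    bu, bv = bernstein3(u), bernstein3(v)
    point = [0.0, 0.0, 0.0]
    for r in range(4):
        for c in range(4):
            w = bv[r] * bu[c]
            for k in range(3):
                point[k] += w * ctrl[r][c][k]
    return tuple(point)

def dice_patch(ctrl, n=16):
    """Sequential stand-in for launching n * n threads on one patch."""
    return [dice_thread(ctrl, tid, n) for tid in range(n * n)]

# A flat test patch whose surface is simply (u, v, 0).
ctrl = [[(c / 3.0, r / 3.0, 0.0) for c in range(4)] for r in range(4)]
grid = dice_patch(ctrl)
```

Because all threads of a patch read the same 16 control points, keeping those points in shared memory, as the slide notes, avoids repeated off-chip reads.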

Bound/Split: Efficiency Goals Memory Coherence – Off-chip memory accesses must be efficient Computational Efficiency – Hardware SIMD must be maximally utilized

Memory Coherence during Split Compact work-queue after each iteration – Primitives always contiguous in memory Structure-Of-Arrays representation – Primitive attributes adjacent in memory 99.5% of all accesses fully coalesced
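The effect of the structure-of-arrays layout can be illustrated with plain index arithmetic. The sketch assumes 48 floats per primitive (16 control points x 3 coordinates) and looks at 16 consecutive threads, each reading attribute 0 of its own primitive; both numbers and the helper names are assumptions for illustration.

```python
def aos_index(prim, attr, num_attrs):
    """Array-of-structures: all attributes of one primitive are adjacent."""
    return prim * num_attrs + attr

def soa_index(prim, attr, num_prims):
    """Structure-of-arrays: one attribute of all primitives is adjacent."""
    return attr * num_prims + prim

def is_contiguous(indices):
    """True if consecutive threads touch consecutive memory words."""
    return all(b - a == 1 for a, b in zip(indices, indices[1:]))

num_prims, num_attrs = 1024, 48   # assumed sizes
threads = range(16)               # 16 neighbouring threads, one primitive each

aos = [aos_index(t, 0, num_attrs) for t in threads]
soa = [soa_index(t, 0, num_prims) for t in threads]
```

In the array-of-structures case neighbouring threads are 48 words apart, while in the structure-of-arrays case they read consecutive words, which is the access pattern the hardware can coalesce.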

SIMD utilization during split Intra-Primitive parallelism – Independent control points – Negligible divergence Vectorized Split – 16 Threads per patch 90.16% of all branches SIMD coherent [Figure label: 32 threads (1 warp)]
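A sketch of a vectorized split with one thread per control point is given below. It assumes the patch is split at the parametric midpoint along u using the standard de Casteljau weights for a cubic Bézier curve; the function names and the choice of split direction are hypothetical. Each of the 16 threads runs the same branch-free arithmetic, which is why divergence stays low.

```python
# de Casteljau weights (in eighths) for splitting a cubic Bezier at t = 0.5.
LEFT = [(8, 0, 0, 0), (4, 4, 0, 0), (2, 4, 2, 0), (1, 3, 3, 1)]
RIGHT = [(1, 3, 3, 1), (0, 2, 4, 2), (0, 0, 4, 4), (0, 0, 0, 8)]

def split_thread(ctrl, tid):
    """Work of one thread: control point (row, col) of both children."""
    row, col = tid // 4, tid % 4
    src = ctrl[row]                      # the four points of this row
    left = tuple(sum(w * p[k] for w, p in zip(LEFT[col], src)) / 8.0
                 for k in range(3))
    right = tuple(sum(w * p[k] for w, p in zip(RIGHT[col], src)) / 8.0
                  for k in range(3))
    return left, right

def split_patch_u(ctrl):
    """Sequential stand-in for 16 threads splitting one 4 x 4 patch."""
    left = [[None] * 4 for _ in range(4)]
    right = [[None] * 4 for _ in range(4)]
    for tid in range(16):
        row, col = tid // 4, tid % 4
        left[row][col], right[row][col] = split_thread(ctrl, tid)
    return left, right

# A flat test patch whose surface is simply (u, v, 0).
ctrl = [[(c / 3.0, r / 3.0, 0.0) for c in range(4)] for r in range(4)]
left, right = split_patch_u(ctrl)
```

For the flat test patch the left child spans u from 0 to 0.5 and the right child from 0.5 to 1, and the two children share the boundary row of control points exactly.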

Outline Reyes Subdivision – algorithm Subdivision on GPU – implementation Results Limitations

Results - Killeroo grids 5 levels of subdivision Bound/Split: 6.99 ms Dice: 7.21 ms frames/second (19.92 with 16x AA) Killeroo model courtesy: Headus Inc.

Results - Teapot 4823 grids 11 levels of subdivision Bound/Split: 3.46 ms Dice: 2.42 ms frames/second (30.02 with 16x AA)

Results – Random scenes Subdivision time proportional to number of micropolygons [Plot series: Dice, Split, Total]

Render time Breakdown Rendering lots of small triangles Should ideally be zero: CUDA limitation

Results - Screen-space buckets Acceptable performance, Small memory footprint

Outline Reyes Subdivision – algorithm Subdivision on GPU – implementation Results Limitations

Limitations – Can’t split and dice in parallel – Uniform dicing – Cracks

Limitations – Uniform dicing

Limitations - Cracks

Conclusions Breadth-first recursive subdivision – Suits GPUs – Works fast Dicing (programmable tessellation) is fast – 500M micropolygons/sec It is time to experiment with alternate graphics pipelines

Reyes in real-time rendering Visually superior to polygon pipeline Regular, highly parallel workload Extremely well-studied Good candidate for a real-time system?

Future Work Cracks, Displacement mapping Rest of Reyes – Offline quality shading – Interactive lighting – Parallel Stochastic Sampling (Wei 2008) – A-buffer (Myers 2007)

Thanks to Feedback and suggestions – Per Christensen, Charles Loop, Dave Luebke, Matt Pharr and Daniel Wexler Financial support – DOE Early Career PI Award – National Science Foundation – SciDAC Institute for Ultrascale Visualization Equipment support from NVIDIA

Teapot Video Real-time footage

Killeroo Video Real-time footage

BACKUP SLIDES Real-Time Reyes-Style Adaptive Surface Subdivision

CUDA Thread Structure Image courtesy: NVIDIA CUDA Programming Guide, 1.1

CUDA Memory Architecture Image courtesy: NVIDIA CUDA Programming Guide, 1.1

Displacement Mapping Fairly simple if cracks can be avoided – Displace adjacent grids together

Shading & Lighting Interactive preview – Lpics / Lightspeed Shaders – Intelligent textures – File access Shadows

Composite/Filter Stuff Stochastic sampling – 20x speedup on a GPU (Wei 2008) A-buffer

Special Effects Motion Blur and DOF Global Illumination Ambient Occlusion

Reyes: Bound/Split, Dice (tessellate), Displace, Shade, Compose, Sample. Direct3D 10: Vertex Shade, Geometry Shade, Rasterize, Fragment Shade, Blend, Depth-buffer [Figure label: Geometry]