Presentation is loading. Please wait.

Presentation is loading. Please wait.

Real-Time Reyes: Programmable Pipelines and Research Challenges Anjul Patney University of California, Davis.

Similar presentations


Presentation on theme: "Real-Time Reyes: Programmable Pipelines and Research Challenges Anjul Patney University of California, Davis."— Presentation transcript:

1 Real-Time Reyes: Programmable Pipelines and Research Challenges Anjul Patney University of California, Davis

2 Real-Time Reyes-Style Adaptive Surface Subdivision Anjul Patney and John D. Owens SIGGRAPH Asia 2008 (to appear) http://graphics.idav.ucdavis.edu/publications/print_pub?pub_id=952

3 Image courtesy: Pixar

4 Cinematic rendering looks good Image courtesy: Pixar

5 High geometric complexity Image courtesy: Pixar

6 High shading complexity Image courtesy: Pixar

7 Motion blur, Depth-of-field Image courtesy: Pixar

8 Complex lighting Image courtesy: Pixar

9 All this is slow Image courtesy: Pixar

10 Our Goal Make it faster – As much as possible in real time Use the GPU – Massive available parallelism – Fixed-function texture/raster units – High Programmability

11 Enter Reyes Developed in 1980s, offline Pixel-accurate smooth surfaces Eye-space shading Stochastic sampling Order-independent transparency

12 Bound/Split Dice (tessellate) Displace Shade Compose Sample Vertex Shade Geometry Shade Rasterize Fragment Shade Blend Depth-buffer ReyesDirect3D 10

13 Bound/Split Dice (tessellate) Displace Shade Compose Sample High-quality geometry Convenient surface shading Cinematic quality Well-behaved (?) Reyes

14 Bound/Split Dice (tessellate) Displace Shade Compose Sample Reyes Input: Smooth surfaces Output: 0.5 x 0.5 pixel quads (micropolygons)

15 Outline Reyes Subdivision – algorithm Subdivision on GPU – implementation Results Limitations

16 Outline Reyes Subdivision – algorithm Subdivision on GPU – implementation Results Limitations

17 Split-Dice Recursively split a surface till dicing makes sense Uniformly sample it to form a grid of micropolygons

18 Split Bound/Cull Dice Diceable? No: Split Yes: Dice

19 Grid Micropolygon Post-split surface

20 Split-Dice is hard Split – Recursive, serial – Rapid primitive generation/destruction Dice – Huge memory demand

21 Can we do this in parallel?

22 Parallel Split A lot of independent operations Regular computation

23 Can we do this in parallel? A lot of independent operations – Our simplest model 5K primitives 1.2M micropolygons Regular computation – Bound/Cull/Split everything together – Dice everything together

24 Analogy: A Dynamic work queue A A B B C C D D E E F F G G H H I I A A A A B B C C E E E E F F H H Split No Action Cull C C

25 How can we do these efficiently? Creating new primitives – How to dynamically allocate space? Culling unneeded primitives – How to avoid fragmentation?

26 A child primitive is offset by the queue length Our Choice – keep it simple… A A B B C C D D E E F F G G H H I I A A B B C C E E F F H H A A C C E E

27 …and get rid of the holes later A A B B C C E E F F H H A A C C E E A A B B C C E E F F H H A A C C E E Scan-based compact is fast! (Sengupta ‘07) Scan-based compact is fast! (Sengupta ‘07)

28 Storage Issues for Dice Too many micropolygons – Cannot reject early Screen-space buckets (tiles) – In parallel ? – Ideal bucket size ?

29 Outline Reyes Subdivision – algorithm Subdivision on GPU – implementation Results Limitations

30 Platform NVIDIA GeForce 8800 GTX – 16 SIMD Multiprocessors – 16KB shared memory – 768 MB memory (no cache) NVIDIA CUDA 1.1 – Grids/Blocks/Threads – OpenGL interface

31 Implementation Details Input choice: Bicubic Bézier Surfaces – Only affects implementation View Dependent Subdivision every frame – Single CPU-GPU transfer Final micropolygons written to a VBO – Flat-shaded and displayed

32 Kernels Implemented Dice – Regular, symmetric, parallel – 256 threads per patch – Primitive information in shared memory Bound/Split – Hard to ensure efficiency

33 Bound/Split: Efficiency Goals Memory Coherence – Off-chip memory accesses must be efficient Computational Efficiency – Hardware SIMD must be maximally utilized

34 Memory Coherence during Split Compact work-queue after each iteration – Primitives always contiguous in memory Structure-Of-Arrays representation – Primitive attributes adjacent in memory 99.5% of all accesses fully coalesced

35 SIMD utilization during split Intra-Primitive parallelism – Independent control points – Negligible divergence Vectorized Split – 16 Threads per patch 90.16% of all branches SIMD coherent 32 threads (1 warp)

36 Outline Reyes Subdivision – algorithm Subdivision on GPU – implementation Results Limitations

37 Results - Killeroo 14426 grids 5 levels of subdivision Bound/Split: 6.99 ms Dice: 7.21ms 29.69 frames/second (19.92 with 16x AA) Killeroo model courtesy: Headus Inc.

38 Results - Teapot 4823 grids 11 levels of subdivision Bound/Split: 3.46 ms Dice: 2.42 ms 60.07 frames/second (30.02 with 16x AA)

39 Results – Random scenes Subdivision time proportional to number of micropolygons Dice Split Total

40 Render time Breakdown Rendering lots of small triangles Should ideally be zero: CUDA limitation

41 Results - Screen-space buckets Acceptable performance, Small memory footprint Acceptable performance, Small memory footprint

42 Outline Reyes Subdivision – algorithm Subdivision on GPU – implementation Results Limitations

43 Can’t split and dice in parallel Uniform dicing Cracks

44 Limitations – Uniform dicing

45 Limitations - Cracks

46

47 Conclusions Breadth-first recursive subdivision – Suits GPUs – Works fast Dicing Programmable tessellation is fast – 500M micropolygons/sec It is time to experiment with alternate graphics pipelines

48 Reyes in real-time rendering Visually superior to polygon pipeline Regular, highly parallel workload Extremely well-studied Good candidate for a real-time system?

49 Future Work Cracks, Displacement mapping Rest of Reyes – Offline quality shading – Interactive lighting – Parallel Stochastic Sampling (Wei 2008) – A-buffer (Myers 2007)

50 Thanks to Feedback and suggestions – Per Christensen, Charles Loop, Dave Luebke, Matt Pharr and Daniel Wexler Financial support – DOE Early Career PI Award – National Science Foundation – SciDAC Institute for Ultrascale Visualization Equipment support from NVIDIA

51 Teapot Video Real-time footage

52 Killeroo Video Real-time footage

53 BACKUP SLIDES Real-Time Reyes-Style Adaptive Surface Subdivision

54 CUDA Thread Structure Image courtesy: NVIDIA CUDA Programming Guide, 1.1

55 CUDA Memory Architecture Image courtesy: NVIDIA CUDA Programming Guide, 1.1

56 Displacement Mapping Fairly simple if cracks can be avoided – Displace adjacent grids together

57 Shading & Lighting Interactive preview – Lpics / Lightspeed Shaders – Intelligent textures – File access Shadows

58 Composite/Filter Stuff Stochastic sampling – 20x speedup on a GPU (Wei 2008) A-buffer

59 Special Effects Motion Blur and DOF Global Illumination Ambient Occlusion

60 Bound/Split Dice (tessellate) Displace Shade Compose Sample Vertex Shade Geometry Shade Rasterize Fragment Shade Blend Depth-buffer ReyesDirect3D 10 Geometry


Download ppt "Real-Time Reyes: Programmable Pipelines and Research Challenges Anjul Patney University of California, Davis."

Similar presentations


Ads by Google