1
GPU Data Formatting and Addressing
Aaron Lefohn University of California, Davis
2
Overview GPU Memory Model GPU-Based Data Structures
Performance Considerations
3
GPU memory model GPU Data Storage Vertex data Texture data
Frame buffer
[Figure: pipeline diagram with labels Vertex Data, Vertex Processor, Rasterizer, Fragment Processor, Frame Buffer(s), Texture Data, PS3.0 GPUs]
- Copy framebuffer result to “other GPU memory” (i.e., texture)
- Write directly to “other GPU memory” instead of framebuffer
- Does the OS or OpenGL own GPU memory?
- What other memory can we write to?
  - Textures
  - Vertex array buffers?
  - Fbuffer?
- Mechanisms by which GPU can write to its own memory
  - Copy from framebuffer/pbuffer to texture
    - Cross platform
    - 2D output, save in 1D, 2D, 3D texture memory
    - Slow...
  - WGL_ARB_render_texture
    - RTT using pbuffers (only on Windows)
    - Fast RTT, but context switch is slow (time this!)
    - Current state of the art and lots of hacks to speed up
    - See next section for details of hackery
  - GL_EXT_render_target
    - Lightweight extension to enable cross-platform, efficient RTT
    - Spec. not yet approved and no implementation
  - GL_EXT_pixel_buffer_object
    - Copy from frame buffer to vertex buffer
    - Asynchronous CPU readbacks
    - Supported by current NVIDIA drivers
    - TODO: Can I talk about this?
  - Uber buffers
    - General memory model for GPUs
    - Textures, frame buffers, vertex buffers are all just “memory”
    - Render to any GPU memory: N-D texture, vertex arrays, stencil buffer, frame buffer, etc.
    - Cross platform (OpenGL owns the memory, not the OS)
    - Mix-and-match depth buffers/color buffers/etc.
    - Alpha ATI drivers and spec. not approved
- Stream/GPU-Based Data Structures
  1) Multi-dimensional streams
    - Read/Write GPU memory optimized for 2D (images!)
    - But isn't memory all really 1D?
      - Yes, but GPU memory hierarchy is optimized for 2D accesses. Texture caches must capture multidimensional locality for texture filtering and 2D rasterization.
      - Reference the Stanford texture cache paper.
      - Result is that GPGPU programmer should use illusion of 2D physical memory.
    - Large 1D streams
      - Lay out in 2D
    - 3D streams
      - Update slice-by-slice (potentially limits parallelism)
      - Flatten parts or all into large 2D texture(s)
    - Streams of higher dimension (> 3D)
      - Lay out in 2D memory in the same way that N-D arrays use 1D CPU memory.
      - 2D memory is limited in size.
  4) How does the GPU get memory addresses?
    - Per-Vertex
      - Vertex attributes
      - Computed in vertex program
      - Read from vertex texture
    - Per-Fragment
      - Per-vertex addresses interpolated by rasterizer
      - Computed in fragment stage
      - Read from texture memory
  2) Pointers
    - Dependent texture lookups
  3) Sparse Data
    - Two options
      - Store entire dataset on GPU and create substreams out of it (depth culling or geometry-based substreams).
        - Sherbondy et al., IEEE Visualization 2003
        - Purcell et al.
      - Only store sparse data on GPU (in packed format)
        - Sparse matrices: Kruger, Bolz
        - Sparse volume: Lefohn
- Performance
  - Pbuffers
    - Currently the state of the art for RTT
    - Most implementations optimized for RGBA??? (TODO: Is this true?)
    - Avoid context switches (TIME this!!)
      - Pack scalar data into RGBA channels
      - Use multiple surfaces (front/back/aux0/...)
      - Pack 2D domains into larger buffers (dangerous!)
  - Texture Cache Considerations
    - Caches designed to capture 2D locality with respect to rasterization and texture filtering.
  - Dependent Texture Reads
    - NVIDIA: Based on cache locality
    - ATI: ???
  - Compute addresses at lowest possible computational frequency
    - Neighbor offsets in vertex program
    - Avoid fragment-level address computation whenever possible
4
GPU memory model: Read-Only and Read/Write
Read-Only: Traditional use of GPU memory; CPU writes, GPU reads. Read/Write: Save frame buffer(s) for later use as texture or vertex array; save up to 16 32-bit floating-point values per pixel using Multiple Render Targets (MRTs).
5
How to Save Render Result
Copy framebuffer result to “other GPU memory”: Copy-to-texture, Copy-to-vertex-array. Write directly to “other GPU memory”: Render-to-texture, Render-to-vertex-array.
6
OpenGL GPU Memory Writes
Texture Copy frame buffer to texture Render-to-texture WGL_ARB_render_texture GL_EXT_render_target Superbuffers Vertex Array Copy frame buffer to vertex array GL_EXT_pixel_buffer_object Render-to-vertex-array
7
Render-To-Texture: 1 Copy-To-Texture Good
Cross-Platform texture writes Flexible output 2D output Copy to 1D, 2D, or 3D texture Bad Slow Consumes internal GPU memory bandwidth
8
Render-To-Texture: 2 WGL_ARB_render_texture
Render-to-texture (RTT) using pbuffers Good Fast RTT Current state of the art for RTT Bad Only works on Windows Slow OpenGL context switches Many hacks to avoid this bottleneck
9
Render-To-Texture: 3 GL_EXT_render_target
Proposed extension for cross-platform RTT Good Cross-platform, efficient RTT solution Lightweight, simple extension Bad Specification not approved (April 24, 2004) No implementations exist (April 24, 2004)
10
Render-To-Texture: 4 Superbuffers Proposed new memory model for GPUs
Good Unified GPU memory model Render to any GPU memory Cross platform (OpenGL owns memory, not OS) Mix-and-match depth/stencil/color buffers Bad Large, complex extension Specification not approved (April 24, 2004) Only driver support is alpha version (ATI)
11
Render-To-Texture Summary
OpenGL RTT Currently Only Under Windows Pbuffers Complex and awkward RTT mechanism Current state of the art Cross-Platform RTT Coming Soon…
12
Render-To-Vertex-Array: 1
GL_EXT_pixel_buffer_object Copy framebuffer to vertex buffer object Good Only GPU/AGP memory bandwidth Works with current drivers (NVIDIA) Bad No direct render-to-vertex-array (slower than true RTVA) No ATI implementation
13
Render-To-Vertex-Array: 2
Superbuffers Write to “memory object” as render target Read from “memory object” as vertex array Good Direct render-to-vertex-array (fast) Bad Can render results always be interpreted as vertex data? Large, complex, unapproved extension, …
14
Render-To-Vertex-Array Summary
Current OpenGL Support NVIDIA: GL_EXT_pixel_buffer_object ATI: Superbuffers Semantics Still Under Development…
15
Fbuffer: Capturing Fragments
Idea “Rasterization-Order FIFO Buffer” Render results are fragment values instead of pixel values Mark and Proudfoot, Graphics Hardware 2001 Uses Designed for multi-pass rendering with transparent geometry New possibilities for GPGPU? Varying number of results per pixel RTT and RTVA with an fbuffer?
16
Fbuffer: Capturing Fragments
Implementations ATI Radeon 9800 and newer ATI GPUs Not yet exposed to user (ask for it!) Problems Size of fbuffer is not known before rendering GPUs cannot perform dynamic memory allocation How to handle buffer overflow?
17
Overview GPU Memory Model GPU-Based Data Structures
Performance Considerations
18
GPU-Based Data Structures
Building Blocks GPU memory addresses Address Generation Address Use Pointers Multi-dimensional arrays Sparse representations
19
GPU Memory Addresses Where Are Addresses Generated?
CPU: Vertex stream or textures. Vertex processor: Input stream, ALU ops, or textures. Rasterizer: Interpolation. Fragment processor: Input stream, ALU ops, or textures.
20
GPU Memory Addresses Where Are Addresses Used?
Vertex textures (PS3.0 GPUs); Fragment textures. [Figure: pipeline diagram with labels CPU, Vertex Processor, Rasterizer, Fragment Processor, Texture Data]
21
GPU Memory Addresses Pointers Store addresses in texture
Dependent texture read. Example: see Tim Purcell’s ray tracing talk.
    float2 addr = tex2D( addrTex, texCoord ).xy;  // fetch the stored address
    float2 data = tex2D( dataTex, addr ).xy;      // use it to fetch the data
[Figure: an Address Texture whose entries point to locations in a Data Texture]
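The two-fetch pattern can be emulated on the CPU to make the indirection explicit. The following is a minimal sketch, not the GPU code from the talk: the 4x4 texture size, the `Addr` type, and the helper names are assumptions chosen for illustration, and nearest-texel integer addressing stands in for `tex2D`.

```c
#include <assert.h>

#define TEX_W 4
#define TEX_H 4

/* Integer texel address; a stand-in for the float2 stored in the address texture. */
typedef struct { int x; int y; } Addr;

/* Nearest-texel fetch helpers standing in for tex2D on each texture. */
Addr fetch_addr(Addr tex[TEX_H][TEX_W], Addr at) { return tex[at.y][at.x]; }
float fetch_data(float tex[TEX_H][TEX_W], Addr at) { return tex[at.y][at.x]; }

/* Dependent read: the result of the first fetch is the address of the second. */
float dereference(Addr addr_tex[TEX_H][TEX_W],
                  float data_tex[TEX_H][TEX_W], Addr coord)
{
    assert(coord.x >= 0 && coord.x < TEX_W && coord.y >= 0 && coord.y < TEX_H);
    Addr pointer = fetch_addr(addr_tex, coord);
    assert(pointer.x >= 0 && pointer.x < TEX_W);
    assert(pointer.y >= 0 && pointer.y < TEX_H);
    return fetch_data(data_tex, pointer);
}
```

If `addr_tex[1][2]` holds the address (3, 0), calling `dereference` with coordinate (2, 1) returns `data_tex[0][3]`.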
22
GPU-Based Data Structures
Building Blocks GPU memory addresses Address Generation Address Use Pointers Multi-dimensional arrays Sparse representations
23
Multi-Dimensional Arrays
Build Data Structures in 2D Memory: Read/Write GPU memory optimized for 2D images. But Isn’t Physical Memory 1D? GPU memory hierarchy optimized to capture 2D locality: Rasterization, Texture filtering. Igehy, Eldridge, Proudfoot, “Prefetching in a Texture Cache Architecture,” Graphics Hardware, 1998. Conclusion: Use illusion of 2D physical memory.
24
GPU Arrays Large 1D Arrays
Current GPUs limit 1D array sizes to 2048 or 4096 Pack into 2D memory 1D-to-2D address translation
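The 1D-to-2D address translation is simple integer arithmetic. The sketch below shows one common row-major choice on the CPU; the `Addr2D` type and function names are hypothetical, and a fragment program would do the same computation in floating-point texture coordinates.

```c
#include <assert.h>

/* A 2D texel address; the name Addr2D is an assumption for this sketch. */
typedef struct { int x; int y; } Addr2D;

/* Map a 1D array index to a 2D address in a buffer `width` texels wide. */
Addr2D addr_1d_to_2d(int i, int width)
{
    assert(i >= 0 && width > 0);
    Addr2D a = { i % width, i / width };
    return a;
}

/* Inverse mapping: recover the 1D index from a 2D address. */
int addr_2d_to_1d(Addr2D a, int width)
{
    assert(width > 0 && a.x >= 0 && a.x < width && a.y >= 0);
    return a.y * width + a.x;
}
```

With a 2048-wide buffer, element 5000 of the 1D array lands at column 904 of row 2.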
25
GPU Arrays 3D Arrays Problem GPUs do not have 3D frame buffers
No RTT to slice of 3D texture (except Superbuffers) Solutions Stack of 2D slices Multiple slices per 2D buffer
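For the "multiple slices per 2D buffer" option, each slice becomes a tile in one large 2D buffer. The following sketch shows one possible tiling, assuming slices are laid out left to right and then row by row; the `Texel2D` type, the function name, and the `tiles_per_row` parameter are illustrative, not taken from the talk.

```c
#include <assert.h>

typedef struct { int x; int y; } Texel2D;

/* Map voxel (x, y, z) of a volume with slices of size sx-by-sy to a texel
 * in a 2D buffer that holds `tiles_per_row` slices side by side per row. */
Texel2D flat3d_address(int x, int y, int z, int sx, int sy, int tiles_per_row)
{
    assert(tiles_per_row > 0);
    assert(x >= 0 && x < sx && y >= 0 && y < sy && z >= 0);
    Texel2D t;
    t.x = (z % tiles_per_row) * sx + x;  /* tile column, then offset in slice */
    t.y = (z / tiles_per_row) * sy + y;  /* tile row, then offset in slice */
    return t;
}
```

A 64x64x64 volume with 8 tiles per row fills a 512x512 buffer; voxel (3, 5, 10) maps to texel (131, 69).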
26
GPU Arrays Problems With 3D Arrays for GPGPU
Cannot read stack of 2D slices as 3D texture Must know which slices are needed in advance Visualization of 3D data difficult Solutions Need render-to-slice-of-3D-texture (Superbuffers) Volume rendering of slice-based 3D data Course 28, “Real-Time Volume Graphics”, Siggraph 2004
27
GPU Arrays Higher Dimensional Arrays Pack into 2D buffers
N-D to 2D address translation Same problems as 3D arrays if data does not fit in a single 2D texture Conclusions Fundamental GPU memory primitive is a fixed-size 2D array GPGPU needs more general memory model
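One way to realize N-D to 2D address translation is to linearize the N-D index exactly as a CPU array would and then split the linear index into a row and column. This is a sketch under that assumption; the function names are hypothetical, and this layout does not by itself preserve locality in every dimension.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major linearization of an n-dimensional index; dims[0] varies slowest. */
size_t linearize_nd(const size_t *idx, const size_t *dims, size_t n)
{
    size_t linear = 0;
    for (size_t k = 0; k < n; ++k) {
        assert(idx[k] < dims[k]);
        linear = linear * dims[k] + idx[k];
    }
    return linear;
}

/* Split the linear index into a (column, row) pair for a 2D buffer. */
void nd_to_2d(const size_t *idx, const size_t *dims, size_t n,
              size_t width, size_t *col, size_t *row)
{
    assert(width > 0);
    size_t linear = linearize_nd(idx, dims, n);
    *col = linear % width;
    *row = linear / width;
}
```

For a 4x8x16x32 array stored in a 1024-wide buffer, index (1, 2, 3, 4) has linear offset 5220 and lands at column 100 of row 5.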
28
GPU-Based Data Structures
Building Blocks GPU memory addresses Address Generation Address Use Pointers Multi-dimensional arrays Sparse representations
29
Sparse Data Structures
Why Sparse Data Structures? Reduce computational workload; Reduce memory pressure. Examples: Sparse matrices: Krueger et al., Siggraph 2003; Bolz et al., Siggraph 2003. Implicit surface computations (sparse volumes): Sherbondy et al., IEEE Visualization 2003; Lefohn et al., IEEE Visualization 2003; Premoze et al., Eurographics 2003.
30
Sparse Computation Option 1: Store Complete Data Set on GPU
Cull unused data Conditional execution tricks (discussed earlier) Option 2: Store Only Sparse Data on GPU Saves memory Potentially much faster than culling Much more complicated (especially if time-varying)
31
Sparse Data Structures
Basic Idea Pack “active” data elements into GPU memory For more information Linear algebra section in this course : Static structures Level-set case study in this course : Dynamic structures
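Packing the active elements can be sketched as a single compaction pass on the CPU. The function and parameter names below are hypothetical, and the per-cell slot table is an assumed bookkeeping structure that lets later passes translate grid positions into packed positions; the course's own case studies use more elaborate, block-based layouts.

```c
#include <assert.h>
#include <stddef.h>

/* Pack elements whose mask entry is nonzero into `packed`; also record, for
 * each grid cell, its slot in the packed array (or -1 if inactive).
 * Returns the number of active elements. */
size_t pack_active(const float *grid, const unsigned char *active, size_t n,
                   float *packed, int *slot_of_cell)
{
    assert(grid != NULL && active != NULL);
    assert(packed != NULL && slot_of_cell != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (active[i]) {
            slot_of_cell[i] = (int)count;
            packed[count++] = grid[i];
        } else {
            slot_of_cell[i] = -1;
        }
    }
    return count;
}
```

The packed array is what would be uploaded to GPU memory; the slot table stays available for address pre-computation.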
32
Sparse Data Structures
Addressing Sparse Data Neighborhoods no longer implicitly defined on grid Use pointer-based data structures to locate neighbors Pre-compute neighbor addresses if possible Use CPU or vertex processor Removes pointer dereference from fragment program Separate common addressing case from boundary conditions Common case must be cache coherent See Harris and Lefohn case studies for “substream” technique
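Pre-computing neighbor addresses on the CPU might look like the sketch below. It assumes a table `slot_of_cell` that maps each cell of a w-by-h grid to its slot in the packed data (or -1 if inactive), and it resolves missing neighbors to the cell's own slot as a clamp-style boundary; both the table and the boundary rule are assumptions for illustration.

```c
#include <assert.h>

/* For every active cell of a w-by-h grid, store the packed slots of its
 * left, right, lower, and upper neighbors. Missing neighbors (inactive or
 * outside the grid) fall back to the cell's own slot. */
void build_neighbor_table(const int *slot_of_cell, int w, int h,
                          int (*neighbors)[4])
{
    static const int dx[4] = { -1, 1, 0, 0 };
    static const int dy[4] = { 0, 0, -1, 1 };
    assert(w > 0 && h > 0);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            int self = slot_of_cell[y * w + x];
            if (self < 0) continue;            /* inactive: no table entry */
            for (int k = 0; k < 4; ++k) {
                int nx = x + dx[k], ny = y + dy[k];
                int slot = -1;
                if (nx >= 0 && nx < w && ny >= 0 && ny < h)
                    slot = slot_of_cell[ny * w + nx];
                neighbors[self][k] = (slot >= 0) ? slot : self;
            }
        }
    }
}
```

The resulting table can be supplied as vertex or texture data so that the fragment program reads neighbor addresses instead of computing them.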
33
Overview GPU Memory Model GPU-Based Data Structures
Performance Considerations
34
Memory Performance Issues
Pbuffer Survival Guide Dependent Texture Costs Computational Frequency
35
Pbuffer Survival Guide
Pbuffers Give us Render-To-Texture Designed to create an environment map or two Never intended to be used for GPGPU (100s of pbuffers) Problem Each pbuffer has its own OpenGL render context Each pbuffer may have depth and/or stencil buffer Changing OpenGL contexts is slow Solution Many optimizations to avoid this bottleneck…
36
Pbuffer Survival Guide
Pack Scalar Data Into RGBA > 4x memory savings 4x reduction in context switches Be careful of read-modify-write hazard Scalar Data in 4 RGBA Pbuffers 1 RGBA Pbuffer
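The slide does not say how scalars are assigned to channels, so the sketch below assumes the simplest interleaved layout: scalar i lives in texel i / 4, channel i % 4. The `RGBA` type and accessor names are hypothetical.

```c
#include <assert.h>

typedef struct { float c[4]; } RGBA;   /* c[0..3] = R, G, B, A */

/* Read scalar i from its texel and channel. */
float rgba_get(const RGBA *texels, int i)
{
    assert(i >= 0);
    return texels[i / 4].c[i % 4];
}

/* Write scalar i; on a GPU a fragment writes all four channels at once,
 * which is where the read-modify-write hazard comes from. */
void rgba_set(RGBA *texels, int i, float v)
{
    assert(i >= 0);
    texels[i / 4].c[i % 4] = v;
}
```

Under this layout, scalar 6 is stored in the blue channel of texel 1.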
37
Pbuffer Survival Guide
Use Multi-Surface Pbuffers: Each RGBA surface is its own render-texture (Front, Back, AuxN (N = 0,1,2,…)). Greatly reduces context switches. Technically illegal, but “blessed” by ATI. Works on NVIDIA. [Figure: 5 pbuffers with 1 RGBA surface each vs. 1 pbuffer with 5 RGBA surfaces]
38
Pbuffer Survival Guide
Using Multi-Surface Pbuffers
1) Allocate double-buffered pbuffer (and/or with AUX buffers)
2) Set render target to back buffer: glDrawBuffer(GL_BACK)
3) Bind front buffer as texture: wglBindTexImageARB(hpbuffer, WGL_FRONT_LEFT_ARB)
4) Render
5) Switch buffers: wglReleaseTexImageARB(hpbuffer, WGL_FRONT_LEFT_ARB); glDrawBuffer(GL_FRONT); wglBindTexImageARB(hpbuffer, WGL_BACK_LEFT_ARB)
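The read-one-surface, write-the-other pattern can be illustrated without any OpenGL at all. The following CPU sketch is only an analogy: two arrays stand in for the front and back surfaces, and a simple neighbor average stands in for the fragment program. The `PingPong` type and all names are assumptions.

```c
#include <assert.h>

#define N 8

/* Two surfaces of one buffer: one is read as a "texture" while the other is
 * the render target; the roles swap after each pass. */
typedef struct {
    float surface[2][N];
    int read;            /* index of the surface currently bound for reading */
} PingPong;

/* One pass: every output element averages its two neighbors in the read
 * surface (clamped at the ends), then the surfaces swap roles. */
void pingpong_pass(PingPong *p)
{
    assert(p->read == 0 || p->read == 1);
    int write = 1 - p->read;
    const float *src = p->surface[p->read];
    float *dst = p->surface[write];
    for (int i = 0; i < N; ++i) {
        int l = (i > 0) ? i - 1 : i;
        int r = (i < N - 1) ? i + 1 : i;
        dst[i] = 0.5f * (src[l] + src[r]);
    }
    p->read = write;     /* next pass reads what was just written */
}
```

Because a pass never reads the surface it writes, no element sees a partially updated neighbor, which is the property the bind/release sequence preserves on the GPU.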
39
Pbuffer Survival Guide
Pack 2D domains into large buffer “Flat 3D textures” Be careful of read-modify-write hazard 3D Volume Flattened Volume
40
Dependent Texture Costs
Cache Coherency Dependent reads fast if they hit cache Even chained dependencies can be same speed as non-dependent reads Very slow if out of cache Example: 3 levels of dependent cache misses can be >10x slower More detail in “GPU Computation Strategies and Tricks”
41
Computational Frequency
Compute Memory Addresses at Low Frequency Compute memory addresses in vertex program Let rasterizer interpolation create per-fragment addresses Compute neighbor addresses this way Avoid fragment-level address computation whenever possible Consumes fragment instructions Computation often redundant with neighboring fragments May defeat texture pre-fetch
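Why per-vertex address computation suffices for constant neighbor offsets can be checked with a small CPU sketch: adding the offset at the two vertices and interpolating gives the same coordinate as interpolating first and adding the offset per fragment. The function names are hypothetical, and plain linear interpolation over fragment indices is a simplification of what a rasterizer does.

```c
#include <assert.h>

/* Linear interpolation, standing in for the rasterizer's interpolators. */
float interpolate(float a, float b, float t) { return a + (b - a) * t; }

/* Right-neighbor coordinate for fragment k of n across a span [u0, u1],
 * with the offset added once per vertex and then interpolated. */
float neighbor_coord_interpolated(float u0, float u1, float offset, int k, int n)
{
    assert(n > 1 && k >= 0 && k < n);
    float v0 = u0 + offset;          /* computed at vertex 0 */
    float v1 = u1 + offset;          /* computed at vertex 1 */
    return interpolate(v0, v1, (float)k / (float)(n - 1));
}

/* Same coordinate with the offset added per fragment instead. */
float neighbor_coord_per_fragment(float u0, float u1, float offset, int k, int n)
{
    assert(n > 1 && k >= 0 && k < n);
    return interpolate(u0, u1, (float)k / (float)(n - 1)) + offset;
}
```

The first version costs two additions per span; the second costs one addition per fragment and produces the same addresses.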
42
Conclusions GPU Memory Model Evolving GPGPU Data Structures
Writable GPU memory forms loop-back in an otherwise feed-forward streaming pipeline Memory model will continue to evolve as GPUs become more general stream processors GPGPU Data Structures Basic memory primitive is limited-size, 2D texture Use address translation to fit all array dimensions into 2D Maintain 2D cache locality Render-To-Texture Use pbuffers with care and eagerly adopt their successor
43
Selected References
J. Bolz, I. Farmer, E. Grinspun, P. Schröder, “Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid,” SIGGRAPH 2003
N. Goodnight, C. Woolley, G. Lewin, D. Luebke, G. Humphreys, “A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware,” Graphics Hardware 2003
M. Harris, W. Baxter, T. Scheuermann, A. Lastra, “Simulation of Cloud Dynamics on Graphics Hardware,” Graphics Hardware 2003
H. Igehy, M. Eldridge, K. Proudfoot, “Prefetching in a Texture Cache Architecture,” Graphics Hardware 1998
J. Krueger, R. Westermann, “Linear Algebra Operators for GPU Implementation of Numerical Algorithms,” SIGGRAPH 2003
A. Lefohn, J. Kniss, C. Hansen, R. Whitaker, “A Streaming Narrow-Band Algorithm: Interactive Deformation and Visualization of Level Sets,” IEEE Transactions on Visualization and Computer Graphics 2004
44
Selected References
A. Lefohn, J. Kniss, C. Hansen, R. Whitaker, “Interactive Deformation and Visualization of Level Set Surfaces Using Graphics Hardware,” IEEE Visualization 2003
W. Mark, K. Proudfoot, “The F-Buffer: A Rasterization-Order FIFO Buffer for Multi-Pass Rendering,” Graphics Hardware 2001
T. Purcell, C. Donner, M. Cammarano, H. W. Jensen, P. Hanrahan, “Photon Mapping on Programmable Graphics Hardware,” Graphics Hardware 2003
A. Sherbondy, M. Houston, S. Napel, “Fast Volume Segmentation With Simultaneous Visualization Using Programmable Graphics Hardware,” IEEE Visualization 2003
45
OpenGL References GL_EXT_pixel_buffer_object GL_EXT_render_target, OpenGL Extension Registry Superbuffers WGL_ARB_render_texture
46
Questions? Acknowledgements
Cass Everitt, Craig Kolb, Chris Seitz, and Jeff Juliano at NVIDIA Mark Segal, Rob Mace, and Evan Hart at ATI GPGPU Siggraph 2004 course presenters Joe Kniss and Ross Whitaker Brian Budge John Owens National Science Foundation Graduate Fellowship Pixar Animation Studios