Advanced Virtual Texture Topics


Advanced Virtual Texture Topics
Martin Mittring, Lead Graphics Programmer
A virtual texture is like virtual memory, but instead of stalling, a low-resolution mip can be used. The page is a 2D section of a mip with constant memory size. (= Mega Texture? Possibly.)

Presentation Overview Motivation Virtual Texture* Implementation Related Topics Combo Texture* Summary / Future * Demo

Motivation: Texture Streaming Peripherals Hard drive Network DVD/CD/… Computer Main memory GPU GPU texture cache Texture Video memory Amount Speed

Per mip-map texture streaming*
Streaming is needed: large worlds, loading time, memory limits.
HW / API support:
- No asynchronous single mip-map update: the upload is delayed until the texture is used, and all mip-maps are uploaded.
- Create/Release breaks MultiGPU.
- Unpredictable, less stable performance: different mip-map sizes and fragmentation in video memory might trigger the garbage collector.
* Used in Crysis™ to stay within the 32-bit limits with high-resolution textures. On 64-bit with enough memory, or with low-res textures, it is not active.

What is a Virtual Texture*?
- Emulates a mip-mapped texture
- Can be high resolution
- Real-time on consumer graphics hardware
- Partly resides in texture memory
- Implications on engine design, content creation, performance and image quality
For the virtual texture we keep only the relevant parts of the texture in fast memory and asynchronously request missing parts from a slower memory (while using the content of a lower mip-map as fall-back). This implies that we need to keep parts of lower mip-maps in memory and that these parts need to be loaded first.
To allow efficient texture lookups, coherent memory access is achieved by slicing up the mip-mapped texture into reasonably sized pieces. With a fixed size for all pieces (now called texture tiles) the cache can be managed more easily, and all tile operations, e.g. reading or copying, have constant time and memory characteristics.
Mip-maps smaller than the tile size are not handled by our implementation. This is often acceptable for applications like terrain rendering, where the texture is never that far away and a little aliasing is barely visible. If required this can be solved as well, simply by packing multiple mip-maps into one tile. For distant objects it is even possible to fall back to normal mip-mapped texture mapping, but that requires a separate system to manage it.
* “Virtual Texture” is derived from the OS/CPU feature “Virtual Memory”.

Virtual Texture example usage
1. Prepare mip-maps, store in equal-sized tiles on HD
2. Compute required tiles and request from HD
3. Update indirection texture and tile cache
4. Render 3D scene
Both the tile texture cache and the indirection texture are dynamic and adapt to one or multiple views.
A typical virtual texture is created from some source image format in a pre-process, without imposing the graphics hardware limits on texture size. The data is stored on any slower device, e.g. the hard drive. Unused areas of the virtual texture can be dropped, which saves memory as a nice bonus. The texture in video memory (now called tile cache) consists of the tiles required to render the 3D view. To efficiently reconstruct the virtual texture layout, the indirection information is required as well.
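Steps 2 and 3 can be sketched on the CPU side as follows. This is a minimal sketch, not the implementation from the talk: the names `TileId` and `TileCache` are hypothetical, and least-recently-used eviction is an assumption, since the slides do not say which replacement policy was used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <list>
#include <unordered_map>

// Hypothetical tile identifier: mip level plus tile coordinates within that mip.
struct TileId {
    uint32_t mip, x, y;
    bool operator==(const TileId& o) const { return mip == o.mip && x == o.x && y == o.y; }
};

struct TileIdHash {
    std::size_t operator()(const TileId& t) const {
        const uint64_t key = (static_cast<uint64_t>(t.mip) << 40) ^
                             (static_cast<uint64_t>(t.x) << 20) ^ t.y;
        return std::hash<uint64_t>{}(key);
    }
};

// Fixed number of tile slots with least-recently-used eviction (assumed policy).
class TileCache {
public:
    explicit TileCache(uint32_t slotCount) : capacity_(slotCount) { assert(slotCount > 0); }

    // Returns the cache slot for `id`. On a miss with a full cache the least
    // recently used tile is dropped; `evicted`/`evictedId` report it so the
    // caller can patch the indirection texture for the dropped tile.
    uint32_t Request(const TileId& id, bool& evicted, TileId& evictedId) {
        evicted = false;
        auto it = lookup_.find(id);
        if (it != lookup_.end()) {                        // hit: mark as most recently used
            lru_.splice(lru_.begin(), lru_, it->second);
            return it->second->slot;
        }
        uint32_t slot;
        if (lru_.size() < capacity_) {                    // a free slot is still available
            slot = static_cast<uint32_t>(lru_.size());
        } else {                                          // full: reuse the LRU slot
            const Entry& victim = lru_.back();
            slot = victim.slot;
            evicted = true;
            evictedId = victim.id;
            lookup_.erase(victim.id);
            lru_.pop_back();
        }
        lru_.push_front(Entry{id, slot});
        lookup_[id] = lru_.begin();
        return slot;
    }

private:
    struct Entry { TileId id; uint32_t slot; };
    uint32_t capacity_;
    std::list<Entry> lru_;
    std::unordered_map<TileId, std::list<Entry>::iterator, TileIdHash> lookup_;
};
```

The returned slot would be converted to a tile cache offset and written into the indirection texture; the actual tile upload is left out here.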

Virtual Texture Demo Take a pause Show tile cache in action and UV space with borders

Virtual Texture in the Pixel Shader
- Idea: a single unfiltered lookup into the indirection texture (scale & offset), then a single filtered lookup into the tile cache texture -> GDC 2008, Sean Barrett, “Sparse Virtual Textures”
- No mips, memory-coherent access
- Precision problem: 24/32-bit float / integer
- D3D9: half-texel offset to get stable results
- Alternatives: indirection per draw call, …

Virtual Texture Pixel Shader (HLSL)

    float4 g_vIndir;            // w,h,1/w,1/h indirection texture extent
    float4 g_Cache;             // w,h,1/w,1/h tile cache texture extent
    float4 g_CacheMulTilesize;  // w,h,1/w,1/h tile cache texture extent * tilesize

    sampler IndirMap = sampler_state
    {
        Texture = <IndirTexture>;
        MipFilter = POINT;
        MinFilter = POINT;
        MagFilter = POINT;
        // MIPMAPLODBIAS = 7;   // using mip-mapped indirection texture, here 128x128
    };

    // Translates a virtual texture coordinate into a tile cache texture coordinate.
    float2 AdjustTexCoordforAT( float2 vTexIn )
    {
        float fHalf = 0.5f;     // half texel for DX9, 0 for DX10

        // position in indirection texels; the integer part selects the texel
        float2 TileIntFrac = vTexIn*g_vIndir.xy;
        float2 TileFrac = frac(TileIntFrac)*g_vIndir.zw;
        float2 TileInt = vTexIn - TileFrac;

        // unfiltered lookup: rg = tile offset in the cache, b = scale
        float4 vTiledTextureData = tex2D(IndirMap,TileInt+fHalf*g_vIndir.zw);
        float2 vScale = vTiledTextureData.bb;
        float2 vOffset = vTiledTextureData.rg;

        float2 vWithinTile = frac( TileIntFrac * vScale );

        // UV for the filtered lookup into the tile cache texture
        return vOffset + vWithinTile*g_CacheMulTilesize.zw + fHalf*g_Cache.zw;
    }
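To check the arithmetic of the shader away from the GPU, the same translation can be mirrored on the CPU. This is a sketch under assumptions: `IndirTexel`, `Float2` and `TranslateToTileCache` are hypothetical names, the indirection texel is fetched by direct indexing instead of a point-sampled `tex2D`, and the scale is taken to be (1 << lod) / indirection extent, as in the update code of this talk.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Float2 { float x, y; };

// One indirection texel: tile offset in tile cache UV space plus the scale.
struct IndirTexel { float offsetU, offsetV, scale; };

static float Frac(float v) { return v - std::floor(v); }

// CPU mirror of the pixel shader math.
// tileSizeOverCacheSize corresponds to g_CacheMulTilesize.zw,
// halfTexelOfCache to fHalf * g_Cache.zw (0 for DX10-style addressing).
Float2 TranslateToTileCache(Float2 uv, const std::vector<IndirTexel>& indir, int indirSize,
                            float tileSizeOverCacheSize, float halfTexelOfCache) {
    assert(indirSize > 0);
    assert(indir.size() == static_cast<std::size_t>(indirSize) * indirSize);

    // Position in indirection texels; the integer part selects the texel.
    const float tx = uv.x * indirSize;
    const float ty = uv.y * indirSize;
    const int ix = std::min(std::max(static_cast<int>(std::floor(tx)), 0), indirSize - 1);
    const int iy = std::min(std::max(static_cast<int>(std::floor(ty)), 0), indirSize - 1);
    const IndirTexel& t = indir[static_cast<std::size_t>(iy) * indirSize + ix];

    // Fraction inside the referenced tile, then scale and offset into the cache.
    return Float2{ t.offsetU + Frac(tx * t.scale) * tileSizeOverCacheSize + halfTexelOfCache,
                   t.offsetV + Frac(ty * t.scale) * tileSizeOverCacheSize + halfTexelOfCache };
}
```

With a 2x2 indirection texture, a texel with scale 1 addresses a tile covering a quarter of the domain, while a texel with scale 0.5 falls back to a tile covering the whole domain, which is the low-resolution fallback case.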

Bilinear filtering
- Efficient only with a redundant border: 1 texel is the minimum, 4 texels for DXT (e.g. 128+4); centered tiles can improve quality
- Texture updates become unaligned and the texture size becomes non power-of-two
- The power-of-two property can be retained by reducing the usable tile size (e.g. 124+4), but quality suffers
- Non power-of-two is bad
If you have non-power-of-two tiles it is probably better to add some padding to the tile cache so that the texture is created with a power-of-two extent. Otherwise you might be faced with undefined memory and performance characteristics from the graphics card driver.
Power-of-two is best for the hardware implementation, but with borders the quality can suffer a lot. That loss is due to aliasing in the mip-maps, caused by down-sampling the source texture to slightly less than half its size. A good down-sampling algorithm can limit the aliasing in the lower mip-maps, but even the top mip-map is affected by this design decision, and that can be visible especially with regular patterns in the texture.
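The padding advice can be made concrete with a little arithmetic. A minimal sketch, assuming the "+4" in "128+4" is the total border per axis (the slide does not say whether it is per side); `NextPowerOfTwo`, `CacheLayout` and `LayoutTileCache` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Smallest power of two that is >= v.
uint32_t NextPowerOfTwo(uint32_t v) {
    assert(v <= 0x80000000u);          // larger values have no 32-bit power of two
    uint32_t p = 1;
    while (p < v) p <<= 1;
    return p;
}

struct CacheLayout {
    uint32_t usedExtent;               // texels actually covered by tiles
    uint32_t paddedExtent;             // extent the texture would be created with
};

// Extent of a square tile cache holding tilesPerEdge tiles along each edge.
CacheLayout LayoutTileCache(uint32_t tilesPerEdge, uint32_t payloadExtent, uint32_t borderTotal) {
    const uint32_t used = tilesPerEdge * (payloadExtent + borderTotal);
    return CacheLayout{ used, NextPowerOfTwo(used) };
}
```

For example, 15 tiles of 128+4 texels use 1980 texels and would be padded to 2048, while 16 tiles of 124+4 texels fill 2048 exactly, at the cost of the reduced payload discussed above.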

Indirection Texture
- Efficient representation of the dynamic (view-dependent) quad-tree
- Texture format: 32-bit ARGB (precision issues on some HW), or 64-bit FP16 for precision and less PS math
- A free channel can be used to fade tiles: bilinear filtered, to limit the maximum per-pixel LOD
Using a 32-bit texture format with 8-bit channels is possible, but you have to adjust the values in the pixel shader with additional instructions. Beware that scaling might not return the values you would expect. Storing a value from 0 to 255 in the texture gives values from 0.0 to 1.0 in the pixel shader, which is guaranteed; scaling these values by 255.0 should return integer values, but it might not. Floating point math can be an issue, but even worse is that on some hardware the precision seems to be lower, rather comparable to 16-bit floats. In [Barrett08] the problem was solved by rounding[1], but the author admits that this might not be the most efficient approach.
[1] To get the integer value of an 8-bit channel this shader code was used: exp2(floor(channel*255+0.5))
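The rounding trick from footnote [1] can be reproduced on the CPU to see why the +0.5 matters. A minimal sketch with hypothetical helper names; `UnormToFloat` stands in for what the sampler returns for an 8-bit channel, and the reduced hardware precision mentioned above is only imitated by a small manual error.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Value the shader sees for an 8-bit channel that stores `stored`.
float UnormToFloat(uint8_t stored) { return static_cast<float>(stored) / 255.0f; }

// Recover the stored integer; the +0.5 rounds to nearest, so small
// precision errors in `channel` do not flip the result to a neighbour.
int DecodeChannel(float channel) {
    return static_cast<int>(std::floor(channel * 255.0f + 0.5f));
}

// Scale decoded as in the footnote: exp2(floor(channel*255+0.5)).
float DecodeScale(float channel) {
    return std::exp2(static_cast<float>(DecodeChannel(channel)));
}
```

Plain truncation, `floor(channel*255)`, would return the next lower integer whenever the channel value arrives slightly too small, which is the failure the rounding avoids.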

Indirection Texture Update
- Quad-tree modifications only at leaf level
- 2D block updates of the texture (CPU/GPU)
- Mip-mapped texture offers per-pixel LOD
- Many indirection textures at full resolution in memory are wasteful: run-time Create/Release, using max memory, or pools with different sizes
- Multiple indirection texture updates per frame: distribute over multiple frames

Indirection Texture Update (C/C++)

    // float to fp16 (s1e5m10) conversion; does not handle all cases
    // (denormals and values outside the fp16 range are not handled)
    WORD float2fp16( float x )
    {
        if (x == 0.0f)
            return 0;                       // zero has no biased exponent to convert

        uint32 dwFloat = *((uint32 *)&x);
        uint32 dwMantissa = dwFloat & 0x7fffff;
        int iExp = (int)((dwFloat>>23) & 0xff) - (int)0x7f;
        uint32 dwSign = dwFloat>>31;

        return (WORD)( (dwSign<<15) | (((uint32)(iExp+0xf))<<10) | (dwMantissa>>13) );
    }

    WORD texel[4];          // texel output
    RECT recTile;           // in texels in the tile cache texture
    int iLod;               // 0=full domain, 1=2x2, 2=4x4, ...
    int iSquareExtend;      // indirection texture size in texels
    float fInvTileCache;    // tile.Width / texCacheTexture.Width

    texel[0] = float2fp16(recTile.left*fInvTileCache);
    texel[1] = float2fp16(recTile.top*fInvTileCache);
    // divide as float: the integer division would truncate the scale to 0
    texel[2] = float2fp16((float)(1<<iLod) / (float)iSquareExtend);
    texel[3] = 0;           // unused
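The conversion on the slide is marked as not handling all cases. A slightly more defensive variant might look like the following sketch. It is not the code from the talk: `FloatToHalf` is a hypothetical name, the mantissa is truncated rather than rounded, tiny values are flushed to zero, and anything too large (including Inf and NaN) is clamped to the largest finite half, which is a simplification.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// float -> fp16 (1 sign, 5 exponent, 10 mantissa bits), truncating the mantissa.
uint16_t FloatToHalf(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));          // well-defined way to read the bit pattern

    const uint32_t sign     = (bits >> 16) & 0x8000u;
    const int32_t  exponent = static_cast<int32_t>((bits >> 23) & 0xffu) - 127 + 15;
    const uint32_t mantissa = bits & 0x7fffffu;

    if (exponent <= 0)                             // zero, denormals, too small: signed zero
        return static_cast<uint16_t>(sign);
    if (exponent >= 31)                            // too large, Inf, NaN: largest finite half
        return static_cast<uint16_t>(sign | 0x7bffu);

    return static_cast<uint16_t>(sign | (static_cast<uint32_t>(exponent) << 10) | (mantissa >> 13));
}
```

Tile offsets and scales stored in the indirection texture lie in [0, 1], so the zero case is the one that matters most in practice: the first tile in the cache has an offset of exactly 0.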

Tile Texture Cache Update 1/2
Requirements:
- Fast in latency and throughput
- Stall free (no wait), but correctly synchronized to get the right texture state
- Bandwidth efficient (copy only the required part) and little memory overhead
- When the content is updated by the CPU there should be no copy from GPU memory to CPU memory (discard should be used)
- For fast texturing from the tile cache texture it should be in the appropriate memory layout (swizzled[1]) and memory type (video memory). Note that on some hardware compressed textures are stored in linear form (not swizzled).
- O(n) or better for n tiles: multiple tile updates should have linear or better performance
- Predictable performance
[1] A swizzled memory layout is a cache-friendly layout for position-coherent texture lookups.

Tile Texture Cache Update 2/2
- Direct CPU update. D3D9: D3DPOOL_MANAGED, LockRect()
- Small intermediate tile texture. D3D9: LockRect(), StretchRect(); not for compressed textures
- Large intermediate tile cache texture. D3D9: D3DPOOL_SYSTEMMEM, LockRect(), UpdateTexture()
- GPU update (render to texture): fastest, flexibility limited
There are three basic methods to update a part of a mip-map from the CPU, and depending on the graphics API, the driver version and the graphics card it is not defined what performance behaviour you get. This is actually the major problem of the implementation, and on some configurations it might even make the method unusable. We haven’t made enough tests to clarify this. The best test would be a finished game with heavy asset usage and dynamic content, because you need a realistic workload to see problems with the bandwidth and memory limits. Experience shows that such driver issues often get addressed after a major game has shipped using the technology.
Method 1: Direct CPU update. The destination texture needs to be in D3DPOOL_MANAGED, and a section of the texture is updated via the LockRect() function. This method wastes main memory, and most likely the transfer is deferred until a draw call uses the texture. It is simple but likely to be less optimal than the next two methods.
Method 2: With a (small) intermediate tile texture. Here we need one lockable intermediate texture that can hold one tile (including border). This texture is updated as a whole with the LockRect() function and then copied into the destination texture with the StretchRect() function, replacing one tile only. This does not work with compressed textures (DXT) as they cannot be used as a render target format. StretchRect() requires the source and the destination to be in D3DPOOL_DEFAULT, which in turn requires the source to be D3DUSAGE_DYNAMIC to be lockable. To find out whether the driver supports dynamic textures you ought to check the caps bits for D3DCAPS2_DYNAMICTEXTURES, but according to the list in the DirectX SDK even the lowest SM20 cards support this feature.
Method 3: With a (large) intermediate tile cache texture. This method requires a lockable intermediate texture in D3DPOOL_SYSTEMMEM with the full texture cache extent. With LockRect() the intermediate texture is updated only where required, and a following UpdateTexture() call transfers the data to the destination texture. UpdateTexture() requires the destination to be in D3DPOOL_DEFAULT.

Limits
- Maximum virtual texture size = extent of the indirection texture * extent of a tile without border (additionally limited by math precision)
- Maximum tile cache texture size: storing different attributes in the tile caches, splitting the tile cache over multiple textures, different object types or multiple cache levels
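The size limit and the precision remark can be worked through with numbers. A minimal sketch with hypothetical function names; the 128x128 indirection texture is taken from the shader comment in this talk, the tile sizes are the 128 and 124 texel examples from the filtering slide, and the precision estimate assumes a 32-bit float UV with a 24-bit significand.

```cpp
#include <cassert>
#include <cstdint>

// Largest addressable virtual texture edge length in texels.
uint64_t MaxVirtualTextureExtent(uint32_t indirectionExtent, uint32_t tileExtentWithoutBorder) {
    return static_cast<uint64_t>(indirectionExtent) * tileExtentWithoutBorder;
}

// Bits of sub-texel precision left in a 32-bit float UV close to 1.0.
// Floats in [0.5, 1) are spaced 2^-24 apart, so addressing 2^b texels
// leaves 24 - b fractional bits per texel.
int SubTexelBitsInFloatUV(uint64_t virtualExtent) {
    assert(virtualExtent > 0);
    int texelBits = 0;
    while (texelBits < 63 && (uint64_t{1} << texelBits) < virtualExtent) ++texelBits;
    return 24 - texelBits;
}
```

A 128x128 indirection texture with 128 texel tiles addresses 16384 x 16384 texels and leaves about 10 bits of sub-texel precision near the far edge; with 124 texel tiles the addressable extent drops to 15872.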

Possible Tile Sources
- Hard drive: use IO completion ports (not memory-mapped files), disable the OS cache
- CD / DVD / BD / HD DVD: latency accumulates with many small tile requests, alignment
- Network
- Procedural content generation: mathematics or example based
- Blending materials
- Decals / roads / vector graphics / text / shadows
- Compression? DXT on-load? O(1)?
Network: Wouldn’t it be nice to join a multiplayer match without the need to download a map? You can join an ongoing game and even see all the detailed changes, like tire tracks and decals, that happened to the map. The normal game server or some special extra server can bake decals and distribute them to the clients on demand. Compression might be more important here, but it seems that downstream bandwidth will be much less of a problem in the future.
O(1): constant performance with HD streaming of uncompressed content, much worse with on-the-fly baking of decals.

Examples from Crysis™ *
- Terrain material blending (terrain detail objects have been removed)
- Decals (roads, tire tracks, dirt) used on top of terrain material blending
* Not using the virtual texture pixel shader or the combo texture method!

Mip-map generation
- Even kernel size (e.g. 2x2, 4x4): the mip-map is offset to the next higher one; fits normal GPU behavior
- Odd kernel size (e.g. 3x3, 5x5): mip-map texels are aligned; good for rectangular chart images
- The choice affects the implementation in other areas
- Non power-of-two: harder, slower, less quality
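The even-kernel case is the familiar 2x2 box filter: each texel of the lower mip averages a 2x2 block, so its centre sits between the source texels (the offset mentioned above). A minimal single-channel sketch with hypothetical names; real tile data would carry several channels and the border texels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single-channel image stored row by row.
struct Image {
    int width, height;
    std::vector<float> texels;
};

// Next lower mip level using a 2x2 box filter (even kernel).
Image DownsampleBox2x2(const Image& src) {
    assert(src.width % 2 == 0 && src.height % 2 == 0);
    Image dst{ src.width / 2, src.height / 2,
               std::vector<float>(static_cast<std::size_t>(src.width / 2) * (src.height / 2)) };
    for (int y = 0; y < dst.height; ++y) {
        for (int x = 0; x < dst.width; ++x) {
            const std::size_t row0 = static_cast<std::size_t>(2 * y) * src.width + 2 * x;
            const std::size_t row1 = row0 + src.width;
            dst.texels[static_cast<std::size_t>(y) * dst.width + x] =
                0.25f * (src.texels[row0] + src.texels[row0 + 1] +
                         src.texels[row1] + src.texels[row1 + 1]);
        }
    }
    return dst;
}
```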

Out of Core mip-map Generation In memory often not feasible (32Bit) Tile based HD layout, no redundancy, uncompressed data, sRGB? Functions to request any rectangular block Intermediate mip-maps stored in tiles Texture compression in export step Lazy / on-the-fly mip-map generation

Computing the local LOD
- Conservative or precise?
- View-point or view-cone?
- UV mapping dependent?
- min(ddx,ddy), max(ddx,ddy) or Euclidean?
Many methods, e.g.:
- World space: quad-tree of AABB (tessellation dependent)
- View space (occlusion problem, single camera)
- Texture space (GPU, overdraw, resolution?, conservative?)
- D3D9: OcclusionQuery() or CPU readback?
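The min/max question can be illustrated with the usual derivative-based estimate. This is a sketch of one common convention, not necessarily the method used in the talk: the footprint per screen pixel is measured in texels along screen x and y, and either the larger or the smaller of the two picks the mip level. `LodMetric` and `ComputeMipLevel` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

enum class LodMetric { Min, Max };

// Mip level from the UV derivatives along screen x and y.
// Max picks the blurrier level (less aliasing, fewer high-res tiles requested),
// Min the sharper one (more aliasing, more tiles requested).
float ComputeMipLevel(float dudx, float dvdx, float dudy, float dvdy,
                      float textureExtent, LodMetric metric) {
    // Euclidean length of the UV change per screen pixel, in texels.
    const float lenX = std::sqrt(dudx * dudx + dvdx * dvdx) * textureExtent;
    const float lenY = std::sqrt(dudy * dudy + dvdy * dvdy) * textureExtent;
    const float footprint = (metric == LodMetric::Max) ? std::max(lenX, lenY)
                                                       : std::min(lenX, lenY);
    // Level 0 is the full-resolution mip; magnification clamps to it.
    return std::max(0.0f, std::log2(footprint));
}
```

With an anisotropic footprint of 4 x 1 texels the two metrics disagree by two mip levels, which is why the choice directly affects how many tiles are requested.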

Mesh Parameterization
- Unique unwrapping: many applications, workflow issues, stable memory requirements, shading in texture space
- Rectangular charts: subdivision surfaces like Catmull-Clark, tessellation?
- Quad-tree aware unwrapping (less wasted area)
- Sparse textures (e.g. masks)

Attributes stored in the Tiles Similar to the Attributes for Deferred Shading Diffuse/Specular Colour, Normal, SpecularPower Material ID => no blending Material Masks in Sparse Textures => blending, compression Combo Texture => lossy but static compression

Combo Texture Demo Take a pause

Combo Texture A 3d position in the RGB Cube can be used to blend between up to 8 materials. The blend coefficients can be stored efficiently in a virtual texture

Combo Texture RGB Cube Example 8 materials are defined at the cube corners RGB Combo Texture 8 materials are blended by the Combo Texture
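The blend coefficients follow from trilinear interpolation inside the RGB cube. A minimal sketch, assuming the corner order 000, 100, 010, 110, 001, 101, 011, 111 listed in the shader comment of this talk (bit 0 = R, bit 1 = G, bit 2 = B); `ComboWeights` is a hypothetical name.

```cpp
#include <array>
#include <cassert>

// Blend weight of each of the 8 materials placed at the corners of the RGB cube.
// Corner index: bit 0 = R, bit 1 = G, bit 2 = B.
std::array<float, 8> ComboWeights(float r, float g, float b) {
    std::array<float, 8> weights{};
    for (int corner = 0; corner < 8; ++corner) {
        const float wr = (corner & 1) ? r : 1.0f - r;
        const float wg = (corner & 2) ? g : 1.0f - g;
        const float wb = (corner & 4) ? b : 1.0f - b;
        weights[corner] = wr * wg * wb;   // product of the per-axis weights
    }
    return weights;
}
```

The weights always sum to 1. A colour on a cube edge blends exactly two materials, which matches the remark that clean blends are only available along the edges.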

Combo Texture Properties Clean blend only on the cube edges (material placement) Bilinear filtering (Custom filtering possible but expensive), DXT Dynamic or shader driven combo texture colour (e.g. frost) Blending 2/4/8/16 or n materials by using RGB/RGBA texture Best: RGB with 2..8 Materials, alpha left for other use Single pass only possible in simple cases (e.g. Detail Texture)

Combo Texture Multi pass
- Blending 4 materials: one opaque pass, then 3 alpha-lerp passes
- Alpha needs to be adjusted to compensate for the following passes
- Additive blending would be simpler, but precision requires FP16
- Less overdraw: index buffer per material
- Alternative to per-triangle material assignment
- Material LOD (e.g. golden buttons on jacket)

Combo Texture Pixel Shader (HLSL)

    float3 g_ComboMask;                 // RGB material combo colour
                                        // (3 channels for 8 materials)
                                        // 000,100,010,110,001,101,011,111

    float4 g_ComboSum0,g_ComboSum1;     // RGBA sum of the masks blended so far
                                        // including the current
                                        // (8 channels for 8 materials)
                                        // 10000000, 11000000, 11100000, 11110000
                                        // 11111000, 11111100, 11111110, 11111111

    float ComputeComboAlpha( BETWEENVERTEXANDPIXEL_Unified InOut )
    {
        float3 cCombo = tex2D(Sampler_Combo,InOut.vBaseTexPos).rgb;

        // weight of the current material: product of the per-channel weights
        float3 fSrcAlpha3 = g_ComboMask*cCombo + (1-g_ComboMask)*(1-cCombo);
        float fSrcAlpha = fSrcAlpha3.r*fSrcAlpha3.g*fSrcAlpha3.b;

        // summed weight of all materials blended so far, including the current
        float4 vComboRG = float4(1-cCombo.r,cCombo.r,1-cCombo.r,cCombo.r)
                        * float4(1-cCombo.g,1-cCombo.g,cCombo.g,cCombo.g);
        float fSum = dot(vComboRG,g_ComboSum0)*(1-cCombo.b)
                   + dot(vComboRG,g_ComboSum1)*(cCombo.b);

        // + small numbers to avoid DivByZero
        return (fSrcAlpha+0.00000001f)/(0.00000001f+fSum);
    }
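The alpha returned by the shader is the weight of the current material divided by the summed weight of everything blended so far. The following sketch emulates the frame buffer on the CPU to show that a chain of alpha lerps with these alphas reproduces the plain weighted sum. The function names are hypothetical, and a single float stands in for a colour.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Alpha for pass i = w_i / (w_0 + ... + w_i). The first pass gets alpha 1 (opaque).
std::vector<float> PassAlphas(const std::vector<float>& weights) {
    std::vector<float> alphas;
    float sum = 0.0f;
    for (float w : weights) {
        sum += w;
        // Nothing blended yet and zero weight: treat as opaque, like the
        // shader's epsilon terms do; a later pass overwrites the result.
        alphas.push_back(sum > 0.0f ? w / sum : 1.0f);
    }
    return alphas;
}

// Emulates the frame buffer: dst = lerp(dst, src, alpha) once per pass.
float BlendPasses(const std::vector<float>& materialValues, const std::vector<float>& alphas) {
    assert(materialValues.size() == alphas.size());
    float dst = 0.0f;
    for (std::size_t i = 0; i < materialValues.size(); ++i)
        dst = dst * (1.0f - alphas[i]) + materialValues[i] * alphas[i];
    return dst;
}
```

After pass i the frame buffer holds the sum of w_j * v_j over the passes so far, divided by the sum of their weights, so once all weights (which sum to 1) have been blended the result equals the direct weighted sum.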

Summary
- Many uncovered topics and details (e.g. anisotropy, prediction)
- Virtual Texture is an old idea
- Recommended read: Sean Barrett, “Sparse Virtual Textures”, GDC 2008
- Course PDF for details and references
We [game developers] need asynchronous partial texture updates and mip-level adjustment with predictable performance. It seems we will get virtualized memory for graphics card memory, which is nice from an OS perspective, but for games we don’t want stalling virtual memory. The application or the engine can deal with missing data at a higher level and can provide fallbacks until the request is resolved. Whatever method is used (the described virtual texture method is just one), we need to be able to deliver frame-rate quality and see this as a valuable feature (Quality of Service).

Future …
- Better HW / API support
- Render from/to Virtual Texture
- Quality of Service (MultiGPU?)
- Local LOD feedback
- Compression
- Many applications: Adaptive Shadow Maps, TileTrees, PolyCube-Maps, …

Acknowledgments
- Efgeni Malachewitsch for the creature 3D model
- Anton Kaplanyan, Nick Kasyan, Michael Kopietz and all others that helped me with this text
- Special thanks to Natasha Tatarchuk, Kev Gee, Miguel Sainz, Yury Uralsky, Henry Morton, Carsten Dachsbacher and the many others from the industry
Questions?