Caching Strategies for Textures Paul Arthur Navratil
Overview Conceptual summary Design and Analysis of a Cache Architecture for Texture Mapping (Hakura and Gupta 1997) Prefetching in a Texture Cache Architecture (Igehy, Eldridge, and Proudfoot 1998) Discussion!
Mip mapping Achieves acceptable texture-mapping performance Interpolation between fixed levels of detail is a constant computation cost per fragment Reduces aliasing [Williams p.4] Efficient memory use Memory access pattern is well understood
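The constant per-fragment cost can be illustrated with a minimal sketch of level selection for trilinear filtering. The function name `mip_levels` and the input `texels_per_pixel` (texel edge length covered by one pixel edge) are hypothetical names chosen for illustration; they do not come from the slides or from Williams.

```python
import math

TEXELS_PER_TRILINEAR_SAMPLE = 8  # 4 texels on each of 2 adjacent levels


def mip_levels(texels_per_pixel, num_levels):
    """Pick the two mip levels to blend and the blend weight.

    Assumption: level 0 is the full-resolution texture and each level
    halves the resolution, so the level of detail is log2 of the scale.
    """
    lod = math.log2(max(texels_per_pixel, 1e-9))
    lod = min(max(lod, 0.0), num_levels - 1)  # clamp to existing levels
    lower = int(math.floor(lod))
    upper = min(lower + 1, num_levels - 1)
    weight = lod - lower  # fraction of the coarser level in the blend
    return lower, upper, weight


print(mip_levels(4.0, 8))
```

Whatever the minification, the fragment reads the same number of texels, which is why the cost per fragment stays fixed.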
Hakura and Gupta: Problem Motivation: need high bandwidth, low latency memory access for texture mapping Previous work uses brute force –Dedicated DRAM for each fragment generator [Akeley p.3] –SGI RealityEngine can have 320MB texture memory, but only 16MB of unique texture memory!
Hakura and Gupta: Idea Observation: If textures exhibit spatial and temporal localities, design a system to exploit them Use SRAM cache for each fragment generator Have a single, shared DRAM texture memory Advantages –Unique texture memory is larger –Uses cheaper chip (SRAM over DRAM) –SRAM gives higher bandwidth and lower latency
Hakura and Gupta: Locality Mip mapping has inherent spatial locality –Four contiguous texels on each of two levels for trilinear interpolation, with texel area close to pixel area Texture mapping has two temporal localities –Overlapping texel usage along contiguous fragment generation –Repeating texture across image [color images.ps]
Hakura and Gupta: Caching Observation: Increase in DRAM density has decreased DRAM bandwidth! Cache decreases bandwidth requirement by decreasing accesses to texture memory Block transfers from memory to cache maximize DRAM bandwidth utilization Texture memory can be shared (not dedicated) No cache coherence issues Cache characterized by: –Cache size –Cache line size –Associativity Which combination is best?
Hakura and Gupta: Texture Representation in Memory Base case: Linear (Non-Blocked) –Williams's original representation misses spatial locality –Use contiguous RGBA values per texel [Hakura p.5] Observations: –Gradual level-of-detail change uses more of a fetched cache line –Higher line size drops cold miss rate –Principle of Texture Thrift: amount of texture info required to render is proportional to the resolution of the image, and is independent of the number of surfaces and the size of the texture [Peachey 90] –In examples, working set limited to one texture Worst case bound by either texture size or screen size –This representation is sensitive to the texture orientation on screen.
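The orientation sensitivity of the linear layout can be seen by counting how many cache lines a short walk across the texture touches. This is a sketch under assumed parameters (a 256-texel-wide texture, 16 texels per cache line); the helper names are hypothetical and not from the paper.

```python
def linear_address(x, y, width):
    """Row-major (non-blocked) texel index."""
    return y * width + x


def lines_touched(texels, width, line_texels=16):
    """Number of distinct cache lines covered by a list of (x, y) texels."""
    return len({linear_address(x, y, width) // line_texels for x, y in texels})


WIDTH = 256  # assumed texture width in texels
horizontal = [(x, 0) for x in range(16)]  # walk along a texture row
vertical = [(0, y) for y in range(16)]    # walk along a texture column

print(lines_touched(horizontal, WIDTH))  # 1
print(lines_touched(vertical, WIDTH))    # 16
```

The same 16 texels cost one line fetch in one orientation and sixteen in the other.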
Hakura and Gupta: Texture Representation in Memory Blocked case: convert 2-D arrays into 4-D arrays. –Address calculation is a two-step process –Block size remains constant across mipmap levels Observations: –Reduces dependency on texture orientation, and utilizes spatial locality –Lowest miss rates occur when block size matches cache line size [Hakura p.7] –Increasing line size alone creates worse miss rates –Can use 2-way associative cache to avoid conflict with blocks of different mipmap levels (see Igehy)
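A minimal sketch of the two-step address calculation for the blocked layout, assuming square 4x4 blocks, a texture width that is a multiple of the block size, and a cache line that holds exactly one block (16 texels). The function names are hypothetical.

```python
def blocked_address(x, y, width, block=4):
    """Texel index in a 4-D blocked layout.

    Step 1 locates the block; step 2 locates the texel inside it.
    Assumes width is a multiple of block.
    """
    blocks_per_row = width // block
    bx, by = x // block, y // block  # step 1: block coordinates
    sx, sy = x % block, y % block    # step 2: offset within the block
    block_index = by * blocks_per_row + bx
    return block_index * block * block + sy * block + sx


def lines_touched(texels, width, line_texels=16):
    """Distinct cache lines covered when one line holds one 4x4 block."""
    return len({blocked_address(x, y, width) // line_texels for x, y in texels})


WIDTH = 256  # assumed texture width in texels
horizontal = [(x, 0) for x in range(16)]
vertical = [(0, y) for y in range(16)]

print(lines_touched(horizontal, WIDTH))  # 4
print(lines_touched(vertical, WIDTH))    # 4
```

Both walks now touch the same number of lines, which is the reduced orientation dependency noted on the slide.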
Hakura and Gupta: Rasterization Rasterization order affects texture access pattern, and thus cache behavior Use tiling (chunking) to utilize spatial locality –If tiles are too large, the working set will be larger than the cache size, and capacity misses will result [Hakura p.9] –Smaller triangles in image reduce this effect
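The difference between scan-line and tiled traversal can be sketched as two pixel orderings over the same screen region. The tile size and function names here are illustrative assumptions, not values from the paper.

```python
def scanline_order(width, height):
    """Visit pixels row by row across the full width."""
    return [(x, y) for y in range(height) for x in range(width)]


def tiled_order(width, height, tile=8):
    """Visit pixels tile by tile, row-major inside each tile."""
    order = []
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    order.append((x, y))
    return order


print(tiled_order(4, 4, tile=2)[:8])
```

Both orders cover the same pixels; the tiled order keeps consecutive fragments close in both dimensions, so the texels they share are more likely to still be in the cache.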
Hakura and Gupta: Performance Rendering performance and memory bandwidth are good measures of a texture mapping system Fragment generator observations –Machine must access more than one texel per cycle –Must hide memory latency to achieve maximum throughput (address precomputation) SRAM cache observations –Multiple banks with interleaved lines for multiple texel access –Interleave texels within each block –Without multi-texel access, trilinear interpolation can compare texels only once every two cycles!
Hakura and Gupta: Conclusions Caching yields a three-fold to fifteen-fold reduction in memory bandwidth requirements Cache should be at least 16 KB and 2-way associative Long cache lines better utilize bandwidth (with a slight increase in bandwidth requirements) Block size should match cache line size Rasterization pattern should be tiled
Igehy et al: Problem Motivation: Memory bandwidth and latency are (becoming) a bottleneck for texture systems Previous work shows caching benefits [Hakura97; Cox98], but fails to hide memory latency Little literature on prefetching texels: –used in some systems, but the algorithms are not described (proprietary) e.g. [Torborg and Kajiya, 1996]
Igehy et al: Idea Combine prefetching and caching in an architecture with a clear description Advantages: –Simple –Robust to variations in bandwidth requirements and latencies –Achieves within 3% of performance of a zero-latency system
Igehy et al: Traditional Prefetching (no cache) When a fragment is ready for texturing, queue it and request the texels Fragment stays in queue for time equal to memory latency If the queue is sized correctly, latency will be masked Problems: –If covering large request rate and latency, early prefetch can cause cache miss –Tags must be checked at double-rate to maximize throughput (prefetch check and read check) –Prefetch buffer size must increase as request rate and latency increase
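The queue-sizing point can be made concrete with a rough sketch: to mask latency, the queue must hold every fragment issued while one memory request is in flight. The function `fifo_depth` and its parameters are hypothetical, and the numbers in the example are illustrative rather than taken from the paper.

```python
import math


def fifo_depth(latency_cycles, fragments_per_cycle=1, slack_entries=0):
    """Fragment queue entries needed to cover a given memory latency.

    Assumption: fragments enter at a steady rate, so the number in flight
    is latency times rate; slack_entries is extra headroom for bursts.
    """
    return math.ceil(latency_cycles * fragments_per_cycle) + slack_entries


print(fifo_depth(100))                   # 100
print(fifo_depth(100, slack_entries=8))  # 108
```

The depth grows linearly with both latency and request rate, which is the buffer-growth problem listed on the slide.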
Igehy et al: Texture Prefetching Differences from traditional prefetch: –Tag checks occur once per texel, before cache access –Add reorder buffer to handle early return of texel data –New cache blocks only put in cache when associated fragment reaches head of the queue Cache organization: –Four banks each, with adjacent levels of mipmap in alternating banks –Data interleaved so the four accesses for bilinear interpolation can occur in parallel –Can process 8 requests in parallel, which is enough for trilinear interpolation
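One way to obtain conflict-free parallel access for a bilinear footprint is to pick the bank from the low bits of the texel coordinates and the cache from the low bit of the mip level. This mapping is an assumption made for illustration; the slides do not give the exact interleaving used in the paper.

```python
def bank_of(x, y):
    """Assumed mapping: bank chosen by the parity of x and y (4 banks)."""
    return (x & 1) | ((y & 1) << 1)


def cache_of(level):
    """Assumed mapping: adjacent mip levels alternate between two caches."""
    return level & 1


def bilinear_footprint(x0, y0):
    """The 2x2 texel neighborhood read by one bilinear sample."""
    return [(x0, y0), (x0 + 1, y0), (x0, y0 + 1), (x0 + 1, y0 + 1)]


banks = {bank_of(x, y) for x, y in bilinear_footprint(5, 10)}
print(sorted(banks))               # [0, 1, 2, 3]
print(cache_of(3) != cache_of(4))  # True
```

Any 2x2 footprint lands in four different banks, and the two levels of a trilinear sample land in different caches, so all eight texels can be read in one cycle under these assumptions.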
Igehy et al: Texture Properties Texture caching effectiveness is scene dependent Observation: unique-texel-to-fragment ratio is lower bound on number of texels that must be fetched per frame (unless utilizing inter-frame locality) Want a low unique-texel-to-fragment ratio! Ratio affected by: –Magnification (lowers ratio) –Repetition (lowers ratio if cache holds entire texture) –Minification (ratio depends on texel-area-to-pixel-area ratio)
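The ratio can be computed directly from a trace of texel accesses. The sketch below assumes a hypothetical trace format (one collection of texel identifiers per fragment) and uses made-up data to show the magnification case.

```python
def unique_texel_to_fragment_ratio(fragment_texels):
    """Unique texels touched divided by the number of fragments.

    fragment_texels: one iterable of texel ids per fragment (assumed format).
    """
    unique = set()
    for texels in fragment_texels:
        unique.update(texels)
    return len(unique) / len(fragment_texels)


# Magnification: 16 fragments all sample the same 2x2 texel footprint.
footprint = [(0, 0), (1, 0), (0, 1), (1, 1)]
magnified = [footprint for _ in range(16)]
print(unique_texel_to_fragment_ratio(magnified))  # 0.25

# No sharing: every fragment reads 4 texels that no other fragment reads.
disjoint = [[(4 * i + k, 0) for k in range(4)] for i in range(16)]
print(unique_texel_to_fragment_ratio(disjoint))   # 4.0
```

A lower ratio means fewer texels must come from memory per fragment, so an ideal cache has less to fetch.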
Igehy et al: Memory Organization Use 6-D texture representation in Hakura [Igehy p.5] Rasterize in tiled pattern (not scan-line) Cache associativity does not appreciably affect miss rate –Design minimizes conflict misses General formula for determining associativity: –m independent n-way associative caches can handle a rate of m bilinear accesses (four texels) per cycle to m*n textures (or texture levels in mipmap)
Igehy et al: Bandwidth Average texel requests per frame are not enough to determine actual requirements –High-request bursts occur [Igehy p.6] e.g. color map vs. light map When system misses ideal (zero-latency) performance, bandwidth is to blame [Igehy p.8] –e.g. AGP vs. NUMA
Igehy et al: Conclusions System that approximates zero-latency is possible –Achieved 97% utilization of available resources Fragment queue should slightly exceed latency of memory system to account for miss bursts Reserve reorder-buffer slot when memory request is made to avoid deadlock