Caching Strategies for Textures
Paul Arthur Navratil

Overview
Conceptual summary
Design and Analysis of a Cache Architecture for Texture Mapping (Hakura and Gupta 1997)
Prefetching in a Texture Cache Architecture (Igehy, Eldridge, and Proudfoot 1998)
Discussion!

Mip mapping
Achieves acceptable texture-mapping performance
Interpolation between fixed levels of detail is a constant computational cost per fragment (sketched below)
Reduces aliasing [Williams p.4]
Efficient memory use
Memory access pattern is well understood
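
For reference, a minimal C++ sketch of trilinear mipmap filtering, to make the constant per-fragment cost concrete: 8 texel reads and 7 lerps, regardless of scene content. The accessor names and the procedural checkerboard texture are stand-ins of mine, not anything from the slides or from Williams' paper.

    #include <cmath>

    struct RGBA { float r, g, b, a; };

    // Stand-ins so the sketch is self-contained: a procedural checkerboard
    // with an assumed 256x256 base level.
    int levelSize(int level) { return 256 >> level; }
    RGBA fetchTexel(int level, int x, int y) {
        float c = ((x ^ y) & 1) ? 1.0f : 0.25f;
        return { c, c, c, 1.0f };
    }

    RGBA lerp(const RGBA& p, const RGBA& q, float t) {
        return { p.r + t * (q.r - p.r), p.g + t * (q.g - p.g),
                 p.b + t * (q.b - p.b), p.a + t * (q.a - p.a) };
    }

    // Bilinear sample of one level: exactly four contiguous texels.
    RGBA sampleBilinear(int level, float u, float v) {
        float x = u * levelSize(level) - 0.5f;
        float y = v * levelSize(level) - 0.5f;
        int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
        float fx = x - x0, fy = y - y0;
        RGBA top = lerp(fetchTexel(level, x0, y0),     fetchTexel(level, x0 + 1, y0),     fx);
        RGBA bot = lerp(fetchTexel(level, x0, y0 + 1), fetchTexel(level, x0 + 1, y0 + 1), fx);
        return lerp(top, bot, fy);
    }

    // Trilinear: blend two adjacent mipmap levels. The cost is fixed per
    // fragment -- 8 texel reads, 7 lerps -- whatever the scene contains.
    RGBA sampleTrilinear(float u, float v, float lod) {
        int l0 = (int)std::floor(lod);
        return lerp(sampleBilinear(l0, u, v), sampleBilinear(l0 + 1, u, v), lod - (float)l0);
    }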

Hakura and Gupta: Problem
Motivation: need high-bandwidth, low-latency memory access for texture mapping
Previous work uses brute force:
– Dedicated DRAM for each fragment generator [Akeley p.3]
– The SGI RealityEngine can have 320 MB of texture memory, but only 16 MB of unique texture memory!

Hakura and Gupta: Idea
Observation: if textures exhibit spatial and temporal locality, design the system to exploit it
Use an SRAM cache for each fragment generator
Have a single, shared DRAM texture memory
Advantages:
– Unique texture memory is larger
– Cheaper overall: small SRAM caches plus one shared DRAM, rather than dedicated DRAM per generator
– SRAM gives higher bandwidth and lower latency

Hakura and Gupta: Locality
Mip mapping has inherent spatial locality
– Trilinear interpolation reads four contiguous texels on each of two levels, with texel area close to pixel area
Texture mapping has two kinds of temporal locality
– Overlapping texel usage across contiguously generated fragments
– Textures repeated across the image [color images.ps]

Hakura and Gupta: Caching
Observation: increases in DRAM density have decreased DRAM bandwidth!
A cache decreases the bandwidth requirement by decreasing accesses to texture memory
Block transfers from memory to cache maximize DRAM bandwidth utilization
Texture memory can be shared (not dedicated)
No cache-coherence issues, since texture data is read-only
A cache is characterized by:
– Cache size
– Cache line size
– Associativity
Which combination is best? (a model sketch follows)
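
Those three parameters are exactly what a trace-driven cache model sweeps; Hakura and Gupta answer the question by simulation. A minimal sketch of such a model (my own construction, assuming LRU replacement), counting hits and misses over a texel-address trace:

    #include <cstdint>
    #include <vector>

    struct CacheModel {
        int lineBytes, numSets, ways;              // size = lineBytes * numSets * ways
        std::vector<std::vector<uint64_t>> tags;   // per-set tags, kept in LRU order

        CacheModel(int cacheBytes, int lineBytes, int ways)
            : lineBytes(lineBytes),
              numSets(cacheBytes / (lineBytes * ways)),
              ways(ways),
              tags(numSets) {}

        // Returns true on a hit; on a miss, fills the line (evicting LRU).
        bool access(uint64_t addr) {
            uint64_t line = addr / lineBytes;
            int set = (int)(line % numSets);
            uint64_t tag = line / numSets;
            auto& s = tags[set];
            for (size_t i = 0; i < s.size(); ++i)
                if (s[i] == tag) {                 // hit: move to MRU position
                    s.erase(s.begin() + i);
                    s.push_back(tag);
                    return true;
                }
            if ((int)s.size() == ways) s.erase(s.begin());  // evict LRU
            s.push_back(tag);
            return false;
        }
    };

The miss rate is misses over total accesses; replaying a trace while sweeping cacheBytes, lineBytes, and ways reproduces the kind of comparison the paper makes.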

Hakura and Gupta: Texture Representation in Memory
Base case: linear (non-blocked)
– Williams' original representation misses spatial locality
– Use contiguous RGBA values per texel [Hakura p.5]
Observations:
– A gradual level-of-detail change uses more of each fetched cache line
– A larger line size lowers the cold-miss rate
– Principle of Texture Thrift: the amount of texture information required to render an image is proportional to the resolution of the image, and is independent of the number of surfaces and the size of the textures [Peachey 90]
– In the examples, the working set is limited to one texture; the worst case is bounded by either texture size or screen size
– This representation is sensitive to the texture's orientation on screen

Hakura and Gupta: Texture Representation in Memory
Blocked case: convert the 2-D texel arrays into 4-D arrays
– Address calculation is a two-step process (sketched below)
– Block size remains constant across mipmap levels
Observations:
– Reduces dependency on texture orientation, and utilizes spatial locality
– Lowest miss rates occur when the block size matches the cache line size [Hakura p.7]
– Increasing line size alone creates worse miss rates
– Can use a 2-way associative cache to avoid conflicts between blocks of different mipmap levels (see Igehy)
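
To make the two-step address calculation concrete, here is a sketch of linear versus blocked texel-offset computation. The row-major block ordering is my assumption; the point is that when a block's B*B texels match the cache line, one miss pulls in a square neighborhood rather than a thin horizontal strip:

    #include <cstddef>

    // Linear (non-blocked) layout: one step, row-major over the whole level.
    size_t linearOffset(int x, int y, int texelsPerRow) {
        return (size_t)y * texelsPerRow + x;
    }

    // Blocked layout: the 2-D texel array becomes a 4-D array indexed by
    // (blockY, blockX, offsetY, offsetX), for square BxB blocks.
    size_t blockedOffset(int x, int y, int texelsPerRow, int B) {
        int blocksPerRow = texelsPerRow / B;
        size_t blockIndex = (size_t)(y / B) * blocksPerRow + (x / B);  // step 1: which block
        size_t inBlock    = (size_t)(y % B) * B + (x % B);             // step 2: offset inside it
        return blockIndex * (size_t)(B * B) + inBlock;
    }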

Hakura and Gupta: Rasterization
Rasterization order affects the texture access pattern, and thus cache behavior as well
Use tiling (chunking) to exploit spatial locality (sketched below)
– If tiles are too large, the working set will be larger than the cache, and capacity misses will result [Hakura p.9]
– Smaller triangles in the image reduce this effect
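
A sketch of what tiled (chunked) traversal looks like, versus plain scan-line order; the tile edge and the shadeFragment stub are assumptions of mine:

    // Visit the screen tile by tile so each tile's texel working set stays
    // resident in the cache; TILE must be small enough that it fits.
    const int TILE = 32;  // assumed tile edge, in pixels

    void shadeFragment(int x, int y) { /* texture lookups + framebuffer write */ }

    void rasterizeTiled(int width, int height) {
        for (int ty = 0; ty < height; ty += TILE)
            for (int tx = 0; tx < width; tx += TILE)
                for (int y = ty; y < ty + TILE && y < height; ++y)
                    for (int x = tx; x < tx + TILE && x < width; ++x)
                        shadeFragment(x, y);
    }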

Hakura and Gupta: Performance
Rendering performance and memory bandwidth are good measures of a texture-mapping system
Fragment generator observations:
– The machine must access more than one texel per cycle
– Memory latency must be hidden to achieve maximum throughput (address precomputation)
SRAM cache observations:
– Use multiple banks with interleaved lines for multi-texel access (one possible scheme is sketched below)
– Interleave texels within each block
– Without multi-texel access, trilinear interpolation can complete only once every two cycles!
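
One way to realize that multi-texel access (a scheme I am assuming because it has the right property, not one taken from the paper): interleave texels across four SRAM banks on the low-order coordinate bits, so the 2x2 quad of a bilinear fetch always lands in four distinct banks and can be read in a single cycle.

    // Bank index from the low bits of the texel coordinates: any 2x2 quad
    // (x, y), (x+1, y), (x, y+1), (x+1, y+1) touches each of the four
    // banks exactly once, so bilinear's four reads never collide.
    int bankOf(int x, int y) { return (x & 1) | ((y & 1) << 1); }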

Hakura and Gupta: Conclusions
Caching yields a three-fold to fifteen-fold reduction in memory bandwidth requirements
The cache should be at least 16 KB and 2-way associative
Long cache lines better utilize DRAM bandwidth (with a slight increase in bandwidth requirements)
Block size should match the cache line size
The rasterization pattern should be tiled

Igehy et al.: Problem
Motivation: memory bandwidth and latency are (becoming) the bottleneck for texture systems
Previous work shows the benefits of caching [Hakura97; Cox98], but fails to hide memory latency
Little literature on prefetching texels:
– Used in some systems, but the algorithms are not described (proprietary), e.g. [Torborg and Kajiya, 1996]

Igehy et al.: Idea
Combine prefetching and caching in an architecture with a clear description
Advantages:
– Simple
– Robust to variations in bandwidth requirements and latencies
– Achieves within 3% of the performance of a zero-latency system

Igehy et al.: Traditional Prefetching (no cache)
When a fragment is ready for texturing, queue it and request its texels
The fragment stays in the queue for a time equal to the memory latency
If the queue is sized correctly, the latency is masked (see the sizing sketch below)
Problems:
– When covering a large request rate and latency, early prefetches can cause cache misses
– Tags must be checked at double rate to maximize throughput (a prefetch check and a read check)
– The prefetch buffer size must grow as the request rate and latency increase
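
A back-of-the-envelope sizing rule for that queue (my arithmetic, not a formula from the paper): to hide L cycles of memory latency at a steady request rate, at least L x rate fragments must be in flight.

    #include <cmath>

    // Entries needed so a fragment issued now is still queued when its
    // texels arrive; e.g. 100-cycle latency at 1 fragment/cycle -> 100.
    int fragmentQueueDepth(int memLatencyCycles, double fragmentsPerCycle) {
        return (int)std::ceil(memLatencyCycles * fragmentsPerCycle);
    }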

Igehy et al.: Texture Prefetching
Differences from traditional prefetching (datapath sketched below):
– Tag checks occur once per texel, before cache access
– A reorder buffer handles early return of texel data
– New cache blocks are placed in the cache only when the associated fragment reaches the head of the queue
Cache organization:
– Four banks each, with adjacent mipmap levels in alternating banks
– Data is interleaved so the four accesses for bilinear interpolation can occur in parallel
– Can process 8 requests in parallel, which is enough for trilinear interpolation
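
A structural sketch of that datapath, to pin down the moving parts; the type and field names are mine (the paper gives a block diagram, not code):

    #include <cstdint>
    #include <queue>

    // One queued fragment: for each of trilinear's 8 texels, where the
    // data will be when this fragment reaches the head of the FIFO.
    struct FragmentRecord {
        int x, y;             // fragment position
        uint16_t slot[8];     // cache slot (on a hit) or reorder-buffer slot (on a miss)
        bool fromReorder[8];  // true if the texel arrives via the reorder buffer
    };

    struct TexturePrefetchPipeline {
        std::queue<FragmentRecord> fragmentFifo;  // depth sized to cover memory latency

        void issue(const FragmentRecord& f) {
            // Tags are checked exactly once per texel, here, before cache
            // access: a hit records its cache slot; a miss reserves a
            // reorder-buffer slot immediately and sends a block request
            // to texture memory.
            fragmentFifo.push(f);
        }

        void retire() {
            // When the head fragment's data is ready: commit any of its
            // blocks waiting in the reorder buffer into the cache (only
            // now, so an early-returning block cannot evict data a queued
            // fragment still needs), read the 8 texels, and filter.
            if (!fragmentFifo.empty()) fragmentFifo.pop();
        }
    };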

Igehy et al.: Texture Properties
Texture-caching effectiveness is scene dependent
Observation: the unique-texel-to-fragment ratio is a lower bound on the number of texels that must be fetched per frame (unless inter-frame locality is exploited)
We want a low unique-texel-to-fragment ratio! (a measurement sketch follows)
The ratio is affected by:
– Magnification (lowers the ratio)
– Repetition (lowers the ratio if the cache holds the entire texture)
– Minification (the ratio depends on the texel-area-to-pixel-area ratio)
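
A sketch of measuring that ratio from a per-frame access trace (my own helper; texel IDs are assumed globally unique across textures and mipmap levels):

    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    // perFragmentTexels[i] holds the texel IDs fragment i touched.
    // A ratio near 0 means heavy reuse (magnification, repetition) and a
    // cache-friendly scene; a ratio near the texels-per-fragment count
    // means there is little reuse to exploit.
    double uniqueTexelToFragmentRatio(
            const std::vector<std::vector<uint64_t>>& perFragmentTexels) {
        std::unordered_set<uint64_t> unique;
        for (const auto& frag : perFragmentTexels)
            unique.insert(frag.begin(), frag.end());
        return perFragmentTexels.empty()
             ? 0.0 : (double)unique.size() / (double)perFragmentTexels.size();
    }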

Igehy et al.: Memory Organization
Use the 6-D texture representation from Hakura [Igehy p.5]
Rasterize in a tiled pattern (not scan-line order)
Cache associativity does not appreciably affect the miss rate
– The design minimizes conflict misses
General formula for determining associativity:
– m independent n-way associative caches can handle a rate of m bilinear accesses (four texels each) per cycle to m*n textures (or mipmap levels within a texture)
– For example, m = 2 and n = 2 sustains the two bilinear accesses per cycle that trilinear interpolation needs, across up to four distinct mipmap levels

Igehy et al.: Bandwidth
Average texel requests per frame are not enough to determine the actual requirements
– High-request bursts occur [Igehy p.6], e.g. a color map vs. a light map
When the system misses ideal (zero-latency) performance, bandwidth is to blame [Igehy p.8]
– e.g. AGP vs. NUMA

Igehy et al.: Conclusions
A system that approximates zero latency is possible
– Achieved 97% utilization of available resources
The fragment queue should slightly exceed the latency of the memory system, to account for miss bursts
Reserve the reorder-buffer slot when the memory request is made, to avoid deadlock

Discussion!