
Cache Perf. CSE 471 Autumn 02, Slide 1: Cache Performance
CPI contributed by cache: CPI_c = miss rate * number of cycles to handle the miss
Another important metric:
Average memory access time = cache hit time * hit rate + miss penalty * (1 - hit rate)
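
As a quick illustration of these two metrics, here is a minimal Python sketch; all the numbers (1-cycle hit, 40-cycle miss penalty, 5% miss rate) are made up purely for the example:

# Made-up numbers purely for illustration
hit_time = 1          # cycles for a cache hit
miss_penalty = 40     # cycles to handle a miss
miss_rate = 0.05
hit_rate = 1 - miss_rate

cpi_c = miss_rate * miss_penalty   # CPI contributed by the cache, per memory reference

# Average memory access time, as defined on the slide
amat = hit_time * hit_rate + miss_penalty * (1 - hit_rate)

print(f"CPI_c = {cpi_c:.2f} cycles, AMAT = {amat:.2f} cycles")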

Cache Perf. CSE 471 Autumn 02, Slide 2: Improving Cache Performance
To improve cache performance:
–Decrease miss rate without increasing time to handle the miss (more precisely: without increasing average memory access time)
–Decrease time to handle the miss w/o increasing miss rate
A slew of techniques: hardware and/or software
–Increase capacity, associativity, etc.
–Hardware assists (victim caches, write buffers, etc.)
–Tolerating memory latency: prefetching (hardware and software), lock-up free caches
–O.S. interaction: mapping of virtual pages to decrease cache conflicts
–Compiler interactions: code and data placement; tiling

Cache Perf. CSE 471 Autumn 02, Slide 3: Obvious Solutions to Decrease Miss Rate
Increase cache capacity
–Yes, but the larger the cache, the slower the access time
–Limitations for first-level (L1) on-chip caches
–Solution: cache hierarchies (even on-chip)
–Increasing L2 capacity can be detrimental on multiprocessor systems because of the increase in coherence misses
Increase cache associativity
–Yes, but "law of diminishing returns" (after 4-way for small caches; not sure of the limit for large caches)
–More comparisons needed, i.e., more logic and therefore a longer time to check for hit/miss
–Make the cache look more associative than it really is (see later)

Cache Perf. CSE 471 Autumn 02, Slide 4: What about Cache Block Size?
For a given application, cache capacity, and associativity, there is an optimal cache block size
Long cache blocks
–Good for spatial locality (code, vectors)
–Reduce compulsory misses (implicit prefetching)
–But take more time to bring from the next level of the memory hierarchy (can be compensated by "critical word first" and subblocks)
–Increase the possibility of fragmentation (only a fraction of the block is used, or reused)
–Increase the possibility of false sharing in multiprocessor systems

Cache Perf. CSE 471 Autumn 02, Slide 5: What about Cache Block Size? (cont'd)
In general, the larger the cache, the longer the best block size (e.g., 32 or 64 bytes for on-chip caches; 64, 128, or even 256 bytes for large off-chip caches)
Longer block sizes in I-caches
–Sequentiality of code
–Matching with the IF (instruction fetch) unit

Cache Perf. CSE 471 Autumn 02, Slide 6: Example of (Naïve) Analysis
Choice between 32B (miss rate m32) and 64B (miss rate m64) block sizes
Time of a miss:
–send request + access time + transfer time
–send request is independent of block size (e.g., 4 cycles)
–access time can be considered independent of block size (memory interleaving) (e.g., 28 cycles)
–transfer time depends on bus width. For a bus width of, say, 64 bits, transfer time is twice as much for 64B (say 16 cycles) as for 32B (8 cycles)
–In this example, 32B is better if (4 + 28 + 8) * m32 < (4 + 28 + 16) * m64

Cache Perf. CSE 471 Autumn 02, Slide 7: Example of (Naïve) Analysis (cont'd)
Case 1. 16KB cache: m32 = 2.87, m64 = 2.64
–2.87 * 40 < 2.64 * 48
Case 2. 64KB cache: m32 = 1.35, m64 = 1.06
–1.35 * 40 > 1.06 * 48
32B is better for 16KB and 64B is better for 64KB
–(Of course, the example was designed to support the "in general, the larger the cache, the longer the best block size" statement of two slides ago.)
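
This comparison is easy to reproduce. Below is a small Python sketch that plugs the slides' numbers (4-cycle request, 28-cycle access, 8- or 16-cycle transfer, and the miss rates above) into the miss-time formula; the helper function name is just for illustration:

def miss_time(transfer, request=4, access=28):
    # Time of a miss = send request + access time + transfer time (slide 6)
    return request + access + transfer

t32, t64 = miss_time(8), miss_time(16)   # 40 cycles for 32B, 48 cycles for 64B

# Miss rates from slide 7
cases = {"16KB": (2.87, 2.64), "64KB": (1.35, 1.06)}

for size, (m32, m64) in cases.items():
    cost32, cost64 = m32 * t32, m64 * t64
    winner = "32B" if cost32 < cost64 else "64B"
    print(f"{size}: 32B -> {cost32:.2f}, 64B -> {cost64:.2f}; {winner} is better")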

Cache Perf. CSE 471 Autumn 02, Slide 8: Impact of Associativity
"Old" conventional wisdom
–Direct-mapped caches are faster; cache access is the bottleneck for on-chip L1; make L1 caches direct-mapped
–For on-board (L2) caches, direct-mapped caches are 10% faster
"New" conventional wisdom
–Can make 2-way set-associative caches fast enough for L1. Allows larger caches to be addressed only with page offset bits (see later)
–Time-wise it does not seem to make much difference for L2/L3 caches, hence provide more associativity (but if caches are extremely large there might not be much benefit)

Cache Perf. CSE 471 Autumn 02, Slide 9: Reducing Cache Misses with More "Associativity": Victim Caches
First example (in this course) of a "hardware assist"
Victim cache: small fully-associative buffer "behind" the L1 cache and "before" the L2 cache
Of course, one can also exist "behind" L2 and "before" main memory
Main goal: remove some of the conflict misses in L1 direct-mapped caches (or any cache with low associativity)

Cache Perf. CSE 471 Autumn 02, Slide 10: [Figure: victim cache organization. The index + tag probes the L1 cache and the victim cache in parallel. 1. Hit in L1. 2. Miss in L1, hit in the victim cache: send data to the register and swap the two entries. 3. Miss in both: fill from the next level of the memory hierarchy; 3'. the replaced L1 block is evicted into the victim cache.]

Cache Perf. CSE 471 Autumn 02, Slide 11: Operation of a Victim Cache
1. Hit in L1: nothing else needed
2. Miss in L1 for the block at location b, hit in the victim cache at location v: swap the contents of b and v (takes an extra cycle)
3. Miss in L1, miss in the victim cache: load the missing item from the next level and put it in L1; put the entry replaced in L1 into the victim cache; if the victim cache is full, evict one of its entries
A victim buffer of 4 to 8 entries works well for a 32KB direct-mapped cache
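
The three cases above translate directly into code. Here is a minimal Python sketch of a direct-mapped L1 backed by a small fully-associative victim cache; the sizes, the block-address scheme, and the LRU eviction policy are illustrative assumptions, not details from the slide:

from collections import OrderedDict

class DMCacheWithVictim:
    def __init__(self, l1_sets=8, victim_entries=4):
        self.l1 = [None] * l1_sets       # direct-mapped L1: one block address per set
        self.vc = OrderedDict()          # fully-associative victim cache, LRU order
        self.victim_entries = victim_entries

    def access(self, block):
        idx = block % len(self.l1)
        if self.l1[idx] == block:        # case 1: hit in L1, nothing else needed
            return "L1 hit"
        if block in self.vc:             # case 2: hit in victim cache, swap b and v
            self.vc.pop(block)
            if self.l1[idx] is not None:
                self.vc[self.l1[idx]] = True
            self.l1[idx] = block
            return "victim cache hit (swap, extra cycle)"
        if self.l1[idx] is not None:     # case 3: miss in both, fill from next level
            if len(self.vc) == self.victim_entries:
                self.vc.popitem(last=False)   # victim cache full: evict its LRU entry
            self.vc[self.l1[idx]] = True      # replaced L1 block goes to victim cache
        self.l1[idx] = block
        return "miss"

cache = DMCacheWithVictim()
for block in (0, 8, 0, 8):     # blocks 0 and 8 conflict in an 8-set direct-mapped L1
    print(block, cache.access(block))

After the first two compulsory misses, the remaining conflict misses on blocks 0 and 8 become victim-cache hits, which is exactly the effect the slide describes.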

Cache Perf. CSE 471 Autumn 02, Slide 12: Bringing More Associativity: Column-associative Caches
Split (conceptually) a direct-mapped cache into two halves
Probe the first half according to the index. On a hit, proceed normally
On a miss, probe the 2nd half; if it hits, send the data to the register and swap with the entry in the first half (takes an extra cycle)
On a miss in both halves, go to the next level, load into the 2nd half, and swap
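
A minimal Python sketch of that probe-and-swap sequence, assuming a 16-line cache and the usual "flip the top index bit" rehash (both are illustrative choices):

class ColumnAssociativeCache:
    def __init__(self, lines=16):
        self.data = [None] * lines       # one direct-mapped array, conceptually split in two

    def access(self, block):
        i = block % len(self.data)       # first probe: conventional index
        j = i ^ (len(self.data) // 2)    # second probe: flip the top index bit
        if self.data[i] == block:
            return "first-probe hit"
        if self.data[j] == block:
            # hit on the second probe: swap so the block is found
            # on the first probe next time (costs an extra cycle)
            self.data[i], self.data[j] = self.data[j], self.data[i]
            return "second-probe hit (swap, extra cycle)"
        # miss on both probes: load from the next level into the
        # second location, then swap it into the first
        self.data[j] = self.data[i]
        self.data[i] = block
        return "miss"

c = ColumnAssociativeCache()
for block in (3, 19, 3, 19):    # 3 and 19 share index 3 in a plain direct-mapped cache
    print(block, c.access(block))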

Cache Perf. CSE 471 Autumn 02, Slide 13: Skewed-associative Caches
Have different mappings for the two (or more) banks of the set-associative cache
The first mapping is conventional; the second one "disperses" the addresses (XOR a few bits)
The hit ratio of a 2-way skewed cache is as good as that of a 4-way conventional cache
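
A minimal sketch of the two mapping functions for a 2-way skewed-associative cache; the particular XOR of tag bits into the index is an illustrative assumption:

def skewed_indices(block_addr, sets=64):
    # Two mapping functions for a 2-way skewed-associative cache
    mask = sets - 1                  # assumes the number of sets is a power of two
    tag = block_addr // sets
    f0 = block_addr & mask           # bank 0: conventional index
    f1 = (block_addr ^ tag) & mask   # bank 1: XOR a few tag bits to disperse addresses
    return f0, f1

# Two addresses that conflict in bank 0 land in different sets of bank 1
for addr in (0x100, 0x140):
    print(hex(addr), skewed_indices(addr))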

Cache Perf. CSE 471 Autumn 02, Slide 14: Reducing Conflicts: Page Coloring
Interaction of the O.S. with the hardware
In caches where cache size > page size * associativity, bits of the physical address (besides the page offset) are needed for the index
On a page fault, the O.S. selects a mapping that tries to minimize conflicts in the cache

Cache Perf. CSE 471 Autumn 02, Slide 15: Options for Page Coloring
Option 1: assume that the faulting process is using the whole cache
–Attempt to map the page such that the cache will access data as if it were indexed by virtual addresses
Option 2: do the same thing but hash with bits of the PID (process identification number)
–Reduces inter-process conflicts (e.g., prevents pages corresponding to the stacks of various processes from mapping to the same area in the cache)
Implemented by keeping "bins" of free pages
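
A toy Python sketch of both options and of the free-page bins; the cache and page sizes, the hash, and the bin bookkeeping are assumptions for illustration, not details of any particular O.S.:

# 64KB direct-mapped cache with 4KB pages -> 16 page "colors"
CACHE_SIZE, PAGE_SIZE, ASSOC = 64 * 1024, 4 * 1024, 1
NUM_COLORS = CACHE_SIZE // (PAGE_SIZE * ASSOC)

def color_option1(virtual_page_number):
    # Option 1: give the physical page the color of the virtual page,
    # so the cache behaves as if it were indexed by virtual addresses
    return virtual_page_number % NUM_COLORS

def color_option2(virtual_page_number, pid):
    # Option 2: hash in bits of the PID to keep the stacks (etc.) of
    # different processes from mapping to the same area of the cache
    return (virtual_page_number ^ pid) % NUM_COLORS

# "Bins" of free physical pages, one per color; on a page fault the
# O.S. pops a free page from the bin that matches the desired color
free_bins = {c: [] for c in range(NUM_COLORS)}
for ppn in range(256):                       # seed with some free physical pages
    free_bins[ppn % NUM_COLORS].append(ppn)

def allocate(virtual_page_number, pid):
    color = color_option2(virtual_page_number, pid)
    return free_bins[color].pop() if free_bins[color] else None  # None: bin is empty

print(allocate(42, pid=7))   # a physical page whose color matches (42 XOR 7) mod 16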