CS6290 Caches

Locality and Caches
Data Locality
–Temporal: if data item needed now, it is likely to be needed again in near future
–Spatial: if data item needed now, nearby data likely to be needed in near future
Exploiting Locality: Caches
–Keep recently used data in fast memory close to the processor
–Also bring nearby data there

Storage Hierarchy and Locality
(Diagram: register file at the top; instruction and data caches with their ITLB/DTLB; then L2 cache, L3 cache, main memory with its row buffer, and disk. Speed decreases and capacity increases going down the hierarchy.)

Memory Latency is Long
Latencies around 100 ns are not totally uncommon
Quick back-of-the-envelope calculation:
–2 GHz CPU → 0.5 ns / cycle
–100 ns memory → 200-cycle memory latency!
Solution: Caches

Cache Basics
Fast (but small) memory close to processor
When data referenced
–If in cache, use cache instead of memory
–If not in cache, bring into cache (actually, bring entire block of data, too)
–Maybe have to kick something else out to do it!
Important decisions
–Placement: where in the cache can a block go?
–Identification: how do we find a block in cache?
–Replacement: what to kick out to make room in cache?
–Write policy: What do we do about stores?
Key: Optimize the average memory access latency

Cache Basics
Cache consists of block-sized lines
–Line size typically a power of two
–Typically 16 to 128 bytes in size
Example
–Suppose block size is 128 bytes: lowest seven bits determine offset within block
–Read data at address A=0x7fffa3f4
–Address belongs to block with base address 0x7fffa380
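
The offset and base-address arithmetic in the example is plain bit masking. A minimal C sketch of that calculation (the variable names are mine; only the 128-byte block size and the address come from the slide):

    #include <stdio.h>
    #include <stdint.h>

    #define BLOCK_SIZE 128u   /* block size from the example; lowest 7 bits are the offset */

    int main(void) {
        uint32_t addr   = 0x7fffa3f4;                /* address A from the slide      */
        uint32_t offset = addr & (BLOCK_SIZE - 1);   /* byte offset within the block  */
        uint32_t base   = addr & ~(BLOCK_SIZE - 1);  /* base address of the block     */
        printf("offset = 0x%x, base = 0x%x\n", offset, base);  /* 0x74 and 0x7fffa380 */
        return 0;
    }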

Cache Placement
Placement
–Which memory blocks are allowed into which cache lines
Placement Policies
–Direct mapped (block can go to only one line)
–Fully Associative (block can go to any line)
–Set-associative (block can go to one of N lines)
E.g., if N=4, the cache is 4-way set associative
Other two policies are extremes of this (E.g., if N=1 we get a direct-mapped cache)
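
As a rough sketch of how a block's set is chosen under these policies, here is some illustrative C; the geometry (32 KB, 64-byte blocks, 4 ways) is an assumption, not from the slides:

    #include <stdint.h>

    /* Illustrative geometry: 32 KB cache, 64 B blocks, 4-way set associative */
    #define CACHE_BYTES 32768u
    #define BLOCK_BYTES 64u
    #define WAYS        4u
    #define NUM_SETS    (CACHE_BYTES / (BLOCK_BYTES * WAYS))   /* 128 sets here */

    /* A block may go into any of the WAYS lines of exactly one set.
     * Direct mapped is the WAYS == 1 extreme; fully associative is NUM_SETS == 1. */
    static uint32_t set_index(uint32_t addr) {
        uint32_t block_number = addr / BLOCK_BYTES;   /* strip the block offset bits        */
        return block_number % NUM_SETS;               /* low bits of block number pick set  */
    }

    static uint32_t tag_of(uint32_t addr) {
        return (addr / BLOCK_BYTES) / NUM_SETS;       /* remaining high bits identify block */
    }

The tag_of helper anticipates the identification step on the next slide: the bits that are not used for the offset or the set index must be stored as the tag.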

Cache Identification
When address referenced, need to
–Find whether its data is in the cache
–If it is, find where in the cache
–This is called a cache lookup
Each cache line must have
–A valid bit (1 if line has data, 0 if line empty); we also say the cache line is valid or invalid
–A tag to identify which block is in the line (if line is valid)

Cache Replacement
Need a free line to insert new block
–Which block should we kick out?
Several strategies
–Random (randomly selected line)
–FIFO (line that has been in cache the longest)
–LRU (least recently used line)
–LRU approximations: NMRU, LFU

Implementing LRU
Have LRU counter for each line in a set
When line accessed
–Get old value X of its counter
–Set its counter to max value
–For every other line in the set: if counter larger than X, decrement it
When replacement needed
–Select line whose counter is 0
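
A compact C sketch of this counter scheme for one set (the struct and function names are mine, and WAYS is an illustrative associativity):

    #define WAYS 4   /* illustrative associativity */

    /* Counters form a permutation of 0..WAYS-1 per set:
     * WAYS-1 = most recently used, 0 = least recently used (next victim). */
    typedef struct { unsigned lru[WAYS]; } lru_set_t;

    /* Called on every access to line 'way' in this set */
    void lru_touch(lru_set_t *s, int way) {
        unsigned x = s->lru[way];            /* old counter value X                  */
        s->lru[way] = WAYS - 1;              /* accessed line becomes MRU            */
        for (int i = 0; i < WAYS; i++)       /* every other line above X moves down  */
            if (i != way && s->lru[i] > x)
                s->lru[i]--;
    }

    /* Called when a replacement is needed in this set */
    int lru_victim(const lru_set_t *s) {
        for (int i = 0; i < WAYS; i++)
            if (s->lru[i] == 0)
                return i;                    /* counter 0 = least recently used      */
        return 0;
    }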

Approximating LRU
LRU is pretty complicated (esp. for many ways)
–Access and possibly update all counters in a set on every access (not just replacement)
Need something simpler and faster
–But still close to LRU
NMRU – Not Most Recently Used
–The entire set has one MRU pointer
–Points to last-accessed line in the set
–Replacement: Randomly select a non-MRU line
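
For contrast, a minimal NMRU sketch under the same illustrative assumptions (one MRU pointer per set, random non-MRU victim):

    #include <stdlib.h>

    #define WAYS 4   /* illustrative associativity */

    typedef struct { int mru; } nmru_set_t;   /* one MRU pointer per set */

    /* On every access, just remember the most recently used way */
    void nmru_touch(nmru_set_t *s, int way) { s->mru = way; }

    /* On replacement, randomly select any line that is not the MRU line */
    int nmru_victim(const nmru_set_t *s) {
        int way = rand() % (WAYS - 1);        /* pick among the WAYS-1 non-MRU ways */
        return (way >= s->mru) ? way + 1 : way;
    }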

Write Policy
Do we allocate cache lines on a write?
–Write-allocate: a write miss brings block into cache
–No-write-allocate: a write miss leaves cache as it was
Do we update memory on writes?
–Write-through: memory immediately updated on each write
–Write-back: memory updated when line replaced

Write-Back Caches
Need a Dirty bit for each line
–A dirty line has more recent data than memory
Line starts as clean (not dirty)
Line becomes dirty on first write to it
–Memory not updated yet; cache has the only up-to-date copy of data for a dirty line
Replacing a dirty line
–Must write data back to memory (write-back)
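
A small sketch of the per-line state and the dirty-bit handling described above (field and function names are mine; the memory interface is omitted):

    #include <stdbool.h>
    #include <stdint.h>

    /* Per-line state in a write-back cache */
    typedef struct {
        bool     valid;
        bool     dirty;        /* line holds newer data than memory */
        uint32_t tag;
        uint8_t  data[64];     /* illustrative 64-byte block */
    } cache_line_t;

    /* Store hit: update only the cached copy and mark the line dirty */
    void write_hit(cache_line_t *line, uint32_t offset, uint8_t value) {
        line->data[offset] = value;
        line->dirty = true;    /* memory is now stale for this block */
    }

    /* Eviction: only dirty lines are written back to memory */
    void evict(cache_line_t *line, void (*writeback)(const cache_line_t *)) {
        if (line->valid && line->dirty)
            writeback(line);   /* write-back of the dirty block */
        line->valid = false;
        line->dirty = false;
    }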

Example - Cache Lookup

Example - Cache Replacement

Cache Performance
Miss rate
–Fraction of memory accesses that miss in cache
–Hit rate = 1 – miss rate
Average memory access time
–AMAT = hit time + miss rate × miss penalty
Memory stall cycles
–CPUtime = CycleTime × (Cycles_Exec + Cycles_MemoryStall)
–Cycles_MemoryStall = CacheMisses × (MissLatency_Total – MissLatency_Overlapped)
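
A quick worked example of these formulas with made-up numbers (none of them come from the slides):

    #include <stdio.h>

    int main(void) {
        /* Illustrative numbers only */
        double hit_time     = 1.0;    /* cycles */
        double miss_rate    = 0.05;   /* 5% of accesses miss */
        double miss_penalty = 200.0;  /* cycles */

        double amat = hit_time + miss_rate * miss_penalty;   /* 1 + 0.05*200 = 11 cycles */

        /* Only the non-overlapped part of each miss contributes to stall cycles */
        double misses       = 1.0e6;
        double overlapped   = 40.0;   /* cycles of each miss hidden behind other work */
        double stall_cycles = misses * (miss_penalty - overlapped);

        printf("AMAT = %.1f cycles, memory stall cycles = %.0f\n", amat, stall_cycles);
        return 0;
    }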

Improving Cache Performance
AMAT = hit time + miss rate × miss penalty
–Reduce miss penalty
–Reduce miss rate
–Reduce hit time
Cycles_MemoryStall = CacheMisses × (MissLatency_Total – MissLatency_Overlapped)
–Increase overlapped miss latency

Reducing Cache Miss Penalty (1)
Multilevel caches
–Very fast, small Level 1 (L1) cache
–Fast, not so small Level 2 (L2) cache
–May also have slower, large L3 cache, etc.
Why does this help?
–Miss in L1 cache can hit in L2 cache, etc.
AMAT = HitTime_L1 + MissRate_L1 × MissPenalty_L1
MissPenalty_L1 = HitTime_L2 + MissRate_L2 × MissPenalty_L2
MissPenalty_L2 = HitTime_L3 + MissRate_L3 × MissPenalty_L3
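
Plugging illustrative numbers into this recursion (again, not from the slides) shows why a second level helps:

    #include <stdio.h>

    int main(void) {
        /* Illustrative two-level numbers */
        double hit_l1 = 1,  miss_rate_l1 = 0.05;   /* L1: 1-cycle hit, 5% miss   */
        double hit_l2 = 10, miss_rate_l2 = 0.20;   /* L2: 10-cycle hit, 20% miss */
        double mem_latency = 200;                  /* penalty of an L2 miss      */

        double miss_penalty_l2 = mem_latency;
        double miss_penalty_l1 = hit_l2 + miss_rate_l2 * miss_penalty_l2;  /* 10 + 0.2*200 = 50 */
        double amat            = hit_l1 + miss_rate_l1 * miss_penalty_l1;  /* 1 + 0.05*50 = 3.5 */

        printf("AMAT = %.1f cycles\n", amat);
        return 0;
    }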

Multilevel Caches
Global vs. Local Miss Rate
–Global L2 miss rate: # of L2 misses / # of all memory refs
–Local L2 miss rate: # of L2 misses / # of L1 misses (only L1 misses actually get to the L2 cache)
–MPKI often used (normalize against number of instructions): allows comparisons against different types of events
Exclusion Property
–If block is in L1 cache, it is never in L2 cache
–Saves some L2 space
Inclusion Property
–If block A is in L1 cache, it must also be in L2 cache
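
A small worked example of the difference between the two rates and MPKI, with made-up counts:

    #include <stdio.h>

    int main(void) {
        /* Illustrative counts only */
        double instructions = 1.0e9;
        double mem_refs     = 3.0e8;
        double l1_misses    = 1.5e7;
        double l2_misses    = 3.0e6;

        double l2_local  = l2_misses / l1_misses;               /* 20% of L1 misses also miss in L2 */
        double l2_global = l2_misses / mem_refs;                /* 1% of all memory refs miss in L2 */
        double l2_mpki   = l2_misses / (instructions / 1000.0); /* 3 L2 misses per 1000 instructions */

        printf("local %.2f  global %.3f  MPKI %.1f\n", l2_local, l2_global, l2_mpki);
        return 0;
    }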

Reducing Cache Miss Penalty (2)
Early Restart & Critical Word First
–Block transfer takes time (bus too narrow)
–Give data to loads before the entire block arrives
Early restart
–When needed word arrives, let processor use it
–Then continue block transfer to fill cache line
Critical Word First
–Transfer loaded word first, then the rest of block (with wrap-around to get the entire block)
–Use with early restart to let processor go ASAP

Reducing Cache Miss Penalty (3)
Increase Load Miss Priority
–Loads can have dependent instructions
–If a load misses and a store needs to go to memory, let the load miss go first
–Need a write buffer to remember stores
Merging Write Buffer
–If multiple write misses to the same block, combine them in the write buffer
–Use one block write instead of many small writes

Kinds of Cache Misses
The “3 Cs”
–Compulsory: have to have these; miss the first time each block is accessed
–Capacity: due to limited cache capacity; would not have them if cache size was infinite
–Conflict: due to limited associativity; would not have them if cache was fully associative

Reducing Cache Miss Penalty (4)
Victim Caches
–Recently kicked-out blocks kept in small cache
–If we miss on those blocks, can get them fast
–Why does it work: conflict misses – misses that we have in our N-way set-assoc cache, but would not have if the cache was fully associative
–Example: direct-mapped L1 cache and a 16-line fully associative victim cache
–Victim cache prevents thrashing when several “popular” blocks want to go to the same entry

Reducing Cache Miss Rate (1)
Larger blocks
–Helps if there is more spatial locality

Reducing Cache Miss Rate (2)
Larger caches
–Fewer capacity misses, but longer hit latency!
Higher Associativity
–Fewer conflict misses, but longer hit latency
Way Prediction
–Speeds up set-associative caches
–Predict which of N ways has our data; fast access as in a direct-mapped cache
–If mispredicted, access again as set-assoc cache

Reducing Cache Miss Rate (3)
Pseudo-Associative Caches
–Similar to way prediction
–Start with direct mapped cache
–If miss on “primary” entry, try another entry
Compiler optimizations
–Loop interchange
–Blocking

Loop Interchange
For loops over multi-dimensional arrays
–Example: matrices (2-dim arrays)
Change order of iteration to match layout
–Gets better spatial locality
–Layout in C: last index changes first

Before (poor spatial locality):
for(j=0;j<10000;j++)
  for(i=0;i<40000;i++)
    c[i][j]=a[i][j]+b[i][j];

After interchange (good spatial locality):
for(i=0;i<40000;i++)
  for(j=0;j<10000;j++)
    c[i][j]=a[i][j]+b[i][j];

a[i][j] and a[i+1][j] are 10000 elements apart; a[i][j] and a[i][j+1] are next to each other
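
Blocking (tiling), the other compiler optimization listed above, is not shown on the slide; a common illustrative form is a blocked matrix multiply, where a small tile is reused while it still fits in the cache (N and the tile size B below are assumptions):

    #define N 1024
    #define B 32   /* tile size, chosen so a BxB tile fits in cache (illustrative) */

    /* Blocked (tiled) matrix multiply: each BxB tile of a and b is reused many
     * times while it is still resident in the cache, improving temporal locality. */
    void matmul_blocked(double c[N][N], const double a[N][N], const double b[N][N]) {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int kk = 0; kk < N; kk += B)
                    for (int i = ii; i < ii + B; i++)
                        for (int j = jj; j < jj + B; j++) {
                            double sum = c[i][j];
                            for (int k = kk; k < kk + B; k++)
                                sum += a[i][k] * b[k][j];
                            c[i][j] = sum;
                        }
    }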

Reducing Hit Time (1)
Small & simple caches are faster

Reducing Hit Time (2)
Avoid address translation on cache hits
Software uses virtual addresses, memory accessed using physical addresses
–Details of this later (virtual memory)
HW must translate virtual to physical
–Normally the first thing we do
–Caches accessed using physical address
–Wait for translation before cache lookup
Idea: index cache using virtual address

Reducing Hit Time (3)
Pipelined Caches
–Improves bandwidth, but not latency
–Essential for L1 caches at high frequency: even small caches have 2-3 cycle latency at N GHz
–Also used in many L2 caches
Trace Caches
–For instruction caches

Hiding Miss Latency
Idea: overlap miss latency with useful work
–Also called “latency hiding”
Non-blocking caches
–A blocking cache services one access at a time: while miss serviced, other accesses blocked (wait)
–Non-blocking caches remove this limitation: while miss serviced, can process other requests
Prefetching
–Predict what will be needed and get it ahead of time

Non-Blocking Caches
Hit Under Miss
–Allow cache hits while one miss in progress
–But another miss has to wait
Miss Under Miss, Hit Under Multiple Misses
–Allow hits and misses when other misses in progress
–Memory system must allow multiple pending requests

Prefetching
Predict future misses and get data into cache
–If access does happen, we have a hit now (or a partial miss, if data is on the way)
–If access does not happen, cache pollution (replaced other data with junk we don’t need)
To avoid pollution, prefetch buffers
–Pollution a big problem for small caches
–Have a small separate buffer for prefetches: when we do access the data, put it in cache; if we don’t access it, cache not polluted

Software Prefetching
Two flavors: register prefetch and cache prefetch
Each flavor can be faulting or non-faulting
–If address bad, does it create exceptions?
Faulting register prefetch is binding
–It is a normal load: address must be OK, uses a register
Non-faulting cache prefetch is non-binding
–If address bad, becomes a NOP
–Does not affect register state
–Has more overhead (load still there), ISA change (prefetch instruction), complicates cache (prefetches and loads different)
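
As a concrete (compiler-specific, not from the slides) example of a non-faulting, non-binding cache prefetch, GCC and Clang expose the __builtin_prefetch intrinsic:

    /* Sum an array while prefetching a few iterations ahead.
     * __builtin_prefetch(addr, rw, locality) is only a hint: it does not change
     * register state and does not fault on a bad address. */
    double sum_with_prefetch(const double *a, long n) {
        const long DIST = 16;   /* prefetch distance, an illustrative tuning knob */
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            if (i + DIST < n)
                __builtin_prefetch(&a[i + DIST], 0 /* read */, 1 /* low temporal locality */);
            s += a[i];
        }
        return s;
    }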