Caches Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University See P&H 5.1, 5.2 (except writes)

Presentation transcript:

Caches Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University See P&H 5.1, 5.2 (except writes)

2 Administrivia
HW4 due today, March 27th
Project2 due next Monday, April 2nd
Prelim2 Thursday, March 29th at 7:30pm in Phillips 101
Review session today 5:30-7:30pm in Phillips 407

3 Big Picture: Memory
[Pipeline diagram: the five stages (Instruction Fetch, Instruction Decode, Execute, Memory, Write-Back) with PC, instruction memory, register file, sign extend, control, ALU, pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB, +4 / jump-branch target computation, forward unit, and hazard detection; the Memory stage talks to a data memory via addr, d_in, d_out.]
Memory: big & slow vs Caches: small & fast

4 Goals for Today: caches
Caches vs memory vs tertiary storage
Tradeoffs: big & slow vs small & fast – best of both worlds
Working set: 90/10 rule
How to predict the future: temporal & spatial locality
Examples of caches: Direct Mapped, Fully Associative, N-way Set Associative

5 Performance
CPU clock rates ~0.2ns – 2ns (5GHz – 500MHz)

Technology      Capacity   $/GB   Latency
Tape            1 TB       $.17   100s of seconds
Disk            2 TB       $.03   millions of cycles (ms)
SSD (Flash)     128 GB     $2     thousands of cycles (us)
DRAM            8 GB       $10    (10s of ns)
SRAM off-chip   8 MB       $      cycles (few ns)
SRAM on-chip    256 KB     ???    1-3 cycles (ns)

Others: eDRAM aka 1T SRAM, FeRAM, CD, DVD, …
Q: Can we create the illusion of cheap + large + fast?

6 Memory Pyramid
[Pyramid diagram, smallest/fastest at the top:]
RegFile (100s bytes) – < 1 cycle access
L1 Cache (several KB) – 1-3 cycle access
L2 Cache (½-32MB) – 5-15 cycle access
Memory (128MB – few GB)
Disk (Many GB – few TB)
L3 becoming more common (eDRAM?)
These are rough numbers: mileage may vary for latest/greatest
Caches usually made of SRAM (or eDRAM)

7 Memory Hierarchy
Memory closer to processor: small & fast, stores active data
Memory farther from processor: big & slow, stores inactive data

8 Memory Hierarchy: Insight for Caches
If Mem[x] was accessed recently...
… then Mem[x] is likely to be accessed soon
Exploit temporal locality: put recently accessed Mem[x] higher in the memory hierarchy, since it will likely be accessed again soon
… then Mem[x ± ε] is likely to be accessed soon
Exploit spatial locality: put the entire block containing Mem[x] and surrounding addresses higher in the memory hierarchy, since nearby addresses will likely be accessed
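A small illustration (not from the slides) of both kinds of locality in C; the array name and size are made up for the example:

    /* Sequential array sweep.
     * Spatial locality: a[i] and a[i+1] usually sit in the same cache block,
     * so one miss brings in data for several upcoming accesses.
     * Temporal locality: sum and i are reused every iteration, so they stay
     * in the fastest levels of the hierarchy (registers / L1). */
    #include <stdio.h>

    #define N 1024

    int main(void) {
        static int a[N];              /* hypothetical data array */
        long sum = 0;
        for (int i = 0; i < N; i++)
            sum += a[i];              /* consecutive addresses */
        printf("%ld\n", sum);
        return 0;
    }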

9 Memory Hierarchy
Memory closer to processor is fast but small
usually stores a subset of the memory farther away ("strictly inclusive")
Transfer whole blocks (cache lines):
4KB: disk ↔ RAM
256B: RAM ↔ L2
64B: L2 ↔ L1

10 Memory Hierarchy
[Memory trace: a run of consecutive byte addresses 0x7c9a2b18 through 0x7c9a2b23, then 0x7c9a2b28, 0x7c9a2b2c, followed by further accesses that alternate between 0x7c9a2bxx addresses and other regions; the remaining addresses are garbled in the transcript.]

int n = 4;
int k[] = { 3, 14, 0, 10 };

int fib(int i) {
    if (i <= 2) return i;
    else return fib(i-1) + fib(i-2);
}

int main(int ac, char **av) {
    for (int i = 0; i < n; i++) {
        printi(fib(k[i]));
        prints("\n");
    }
}

11 Cache Lookups (Read)
Processor tries to access Mem[x]
Check: is the block containing Mem[x] in the cache?
Yes: cache hit – return requested data from the cache line
No: cache miss – read the block from memory (or a lower-level cache), evict an existing cache line to make room if necessary, place the new block in the cache, and return the requested data – and stall the pipeline while all of this happens
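The same steps as a C sketch for the simplest case (direct-mapped, one-word blocks); the sizes, struct layout, and the stand-in ram[] backing store are assumptions for illustration, not the actual hardware:

    /* Illustrative read-lookup sketch. A real cache does this in hardware. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_LINES 64                        /* assumed cache size */

    struct line { bool valid; uint32_t tag; uint32_t data; };
    static struct line cache[NUM_LINES];
    static uint32_t ram[1 << 16];               /* stand-in for slow memory */

    static uint32_t read_memory(uint32_t word_addr) {
        return ram[word_addr];                  /* the pipeline would stall here */
    }

    uint32_t cache_read(uint32_t word_addr) {
        uint32_t index = word_addr % NUM_LINES; /* which line could hold it */
        uint32_t tag   = word_addr / NUM_LINES; /* which block is actually there */
        struct line *l = &cache[index];
        if (l->valid && l->tag == tag)
            return l->data;                     /* hit: data comes from the cache */
        /* miss: evict whatever occupied this line and refill it from memory */
        l->valid = true;
        l->tag   = tag;
        l->data  = read_memory(word_addr);
        return l->data;
    }

    int main(void) {
        ram[5] = 42;
        unsigned first  = cache_read(5);        /* cold miss, fetched from ram */
        unsigned second = cache_read(5);        /* hit, served from the cache */
        printf("%u %u\n", first, second);
        return 0;
    }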

12 Three common designs
A given data block can be placed…
… in exactly one cache line → Direct Mapped
… in any cache line → Fully Associative
… in a small set of cache lines → Set Associative

13 Direct Mapped Cache
Each block number mapped to a single cache line index
Simplest hardware
[Diagram: a column of memory addresses (0x000000 through 0x000048) mapping onto cache lines 0 and 1; the address column is garbled in the transcript.]

14 Direct Mapped Cache
Each block number mapped to a single cache line index
Simplest hardware
[Diagram: the same column of memory addresses mapping onto cache lines 0 through 3.]
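Restating the mapping as a formula (standard, implied by the diagrams rather than written on them):

    block number = address / block size
    line index   = block number mod (number of cache lines)

so with 4 lines, blocks 0, 4, 8, … all land in line 0.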

15 Direct Mapped Cache

16 Direct Mapped Cache (Reading)
[Datapath diagram: each cache line holds V (valid), Tag, and a data Block. The address is split into Tag | Index | Offset; the index selects one line, the stored tag is compared (=) with the address tag to produce hit?, and the offset drives the word select on the 32-bit data output.]
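A minimal sketch of that tag/index/offset split in C; the bit widths below are assumed values (they depend on cache and block size), and the address is just an example:

    /* Split a 32-bit byte address into tag | index | offset. */
    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 6   /* assumed 64-byte blocks */
    #define INDEX_BITS  7   /* assumed 128 cache lines */

    int main(void) {
        uint32_t addr   = 0x7c9a2b18;
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        /* A hit requires line[index].valid && line[index].tag == tag;
           the offset then selects the requested word within the block. */
        printf("tag=0x%x index=%u offset=%u\n",
               (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }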

17 Example: A Simple Direct Mapped Cache
Using byte addresses in this example! Addr Bus = 5 bits
A small direct-mapped cache with 2-word blocks; each line has a V (valid) bit, a tag, and data
Access sequence: LB $1 ← M[1], LB $2 ← M[5], LB $3 ← M[1], LB $3 ← M[4], LB $2 ← M[0]
[Diagram: processor registers $0-$3, the cache's tag/data columns, and memory; contents garbled in the transcript.]

18 Example: A Simple Direct Mapped Cache
Using byte addresses in this example! Addr Bus = 5 bits
Address fields: 2-bit tag, 2-bit index, 1-bit block offset (2-word blocks, so 4 cache lines)
Access sequence: LB $1 ← M[1], LB $2 ← M[5], LB $3 ← M[1], LB $3 ← M[4], LB $2 ← M[0]
[Diagram: processor, cache (V/tag/data per line), and memory.]
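Working out the field split for these loads (derived here from the 2-bit tag / 2-bit index / 1-bit offset layout above, not shown on the slide):

    M[ 1 ] = 0b00001 → tag 00, index 00 (line 0), offset 1
    M[ 5 ] = 0b00101 → tag 00, index 10 (line 2), offset 1
    M[ 4 ] = 0b00100 → tag 00, index 10 (line 2), offset 0
    M[ 0 ] = 0b00000 → tag 00, index 00 (line 0), offset 0

M[0]/M[1] share one block (line 0) and M[4]/M[5] share another (line 2), so after the first two cold misses the remaining three loads hit.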

19 1st Access
[Worked example slide: the cache (V/tag/data), processor registers, and the running Misses/Hits counts after the first access, LB $1 ← M[1]; table contents garbled in the transcript.]

20 8th Access
Extended access sequence: LB $1 ← M[1], LB $2 ← M[5], LB $3 ← M[1], LB $3 ← M[4], LB $2 ← M[0], LB $2 ← M[10], LB $2 ← M[15], LB $2 ← M[8]
[Worked example slide: cache, registers, and Misses/Hits counts after the 8th access; table contents garbled in the transcript.]

21 Misses
Three types of misses:
Cold (aka Compulsory) – the line is being referenced for the first time
Capacity – the line was evicted because the cache was not large enough
Conflict – the line was evicted because of another access whose index conflicted

22 Misses
Q: How to avoid…
Cold misses: unavoidable? The data was never in the cache… Prefetching!
Capacity misses: buy more SRAM
Conflict misses: use a more flexible cache design

23 Direct Mapped Example: 6th Access
Using byte addresses in this example! Addr Bus = 5 bits
Pathological example – access sequence: LB $1 ← M[1], LB $2 ← M[5], LB $3 ← M[1], LB $3 ← M[4], LB $2 ← M[0], LB $2 ← M[12], LB $2 ← M[8]
[Worked example slide: cache, registers, and Misses/Hits counts after the 6th access; table contents garbled in the transcript.]
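Why this pattern is pathological for the direct-mapped cache (again derived from the 2-bit index, not shown on the slide):

    M[ 0 ]  = 0b00000 → index 00 (line 0), tag 00
    M[ 8 ]  = 0b01000 → index 00 (line 0), tag 01
    M[ 4 ]  = 0b00100 → index 10 (line 2), tag 00
    M[ 12 ] = 0b01100 → index 10 (line 2), tag 01

M[0] and M[8] compete for line 0, and M[4] and M[12] compete for line 2, so alternating between them keeps evicting the block that is about to be needed – conflict misses even though other lines sit unused.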

24 10th and 11th Access
Access sequence continues: …, LB $2 ← M[12], LB $2 ← M[8], LB $2 ← M[4], LB $2 ← M[0], LB $2 ← M[12], LB $2 ← M[8]
[Worked example slide: the direct-mapped cache keeps evicting and re-fetching the conflicting blocks; Misses/Hits table garbled in the transcript.]

25 Cache Organization
How to avoid Conflict Misses – three common designs:
Fully associative: block can be anywhere in the cache
Direct mapped: block can only be in one line in the cache
Set-associative: block can be in a few (2 to 8) places in the cache

26 Example: Simple Fully Associative Cache
Using byte addresses in this example! Addr Bus = 5 bits
Address fields: 4-bit tag, 1-bit block offset (2-word blocks); a block can go in any cache line, each with its own V (valid) bit
Access sequence: LB $1 ← M[1], LB $2 ← M[5], LB $3 ← M[1], LB $3 ← M[4], LB $2 ← M[0]
[Diagram: processor, cache (V/tag/data per line), and memory.]

27 1st Access
[Worked example slide: the fully associative cache, registers, and Misses/Hits counts after the first access, LB $1 ← M[1]; table contents garbled in the transcript.]

28 10th and 11th Access
Access sequence: LB $1 ← M[1], LB $2 ← M[5], LB $3 ← M[1], LB $3 ← M[4], LB $2 ← M[0], LB $2 ← M[12], LB $2 ← M[8], LB $2 ← M[4], LB $2 ← M[0], LB $2 ← M[12], LB $2 ← M[8]
[Worked example slide: with full associativity the blocks no longer conflict on an index; Misses/Hits table garbled in the transcript.]

29 Fully Associative Cache (Reading)
[Datapath diagram: each line holds V, Tag, and a 64-byte Block. The address is split into Tag | Offset only (no index); the address tag is compared (=) against every line's tag in parallel, the matching line drives line select / hit?, and the offset drives the word select on the 32-bit data output.]
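In hardware all those comparisons happen at once; a software sketch of the same search (sizes and names are assumptions) simply loops over every line:

    /* Fully associative lookup: the block may be in ANY line,
     * so every line's tag must be checked. */
    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_LINES   16
    #define OFFSET_BITS 6                       /* assumed 64-byte blocks */

    struct line { bool valid; uint32_t tag; uint8_t block[1 << OFFSET_BITS]; };
    static struct line cache[NUM_LINES];

    /* Returns a pointer to the requested byte on a hit, NULL (0) on a miss. */
    uint8_t *fa_lookup(uint32_t addr) {
        uint32_t tag    = addr >> OFFSET_BITS;
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        for (int i = 0; i < NUM_LINES; i++)     /* hardware: parallel comparators */
            if (cache[i].valid && cache[i].tag == tag)
                return &cache[i].block[offset];
        return 0;                               /* miss */
    }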

30 Eviction
Which cache line should be evicted from the cache to make room for a new line?
Direct-mapped: no choice, must evict the line selected by the index
Associative caches:
– random: select one of the lines at random
– round-robin: similar to random
– FIFO: replace the oldest line
– LRU: replace the line that has not been used for the longest time
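One common way to approximate LRU in a small set is a per-way counter or timestamp that is bumped on every access; this sketch (invented names, illustrative only) picks the victim that way:

    /* LRU victim selection for one set of WAYS lines. */
    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS 4

    struct way { bool valid; uint32_t tag; uint64_t last_used; };

    int choose_victim(struct way set[WAYS]) {
        int victim = 0;
        for (int i = 0; i < WAYS; i++) {
            if (!set[i].valid)
                return i;                        /* empty way: fill it first */
            if (set[i].last_used < set[victim].last_used)
                victim = i;                      /* least recently used so far */
        }
        return victim;
    }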

31 Cache Tradeoffs
                        Direct Mapped    Fully Associative
Tag Size                + Smaller        – Larger
SRAM Overhead           + Less           – More
Controller Logic        + Less           – More
Speed                   + Faster         – Slower
Price                   + Less           – More
Scalability             + Very           – Not Very
# of conflict misses    – Lots           + Zero
Hit rate                – Low            + High
Pathological Cases?     – Common         + ?

32 Compromise
Set-associative cache
Like a direct-mapped cache: index into a location – fast
Like a fully-associative cache: can store multiple entries, which decreases thrashing in the cache; search within each element of the set
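Putting the two ideas together as a sketch – index like a direct-mapped cache, then search the few ways in that set like a tiny fully-associative cache; the widths and names below are assumptions:

    /* 2-way set-associative lookup sketch. */
    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS        2
    #define INDEX_BITS  3                       /* assumed 8 sets */
    #define OFFSET_BITS 6                       /* assumed 64-byte blocks */

    struct way { bool valid; uint32_t tag; };
    static struct way cache[1 << INDEX_BITS][WAYS];

    bool sa_hit(uint32_t addr) {
        uint32_t index = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
        for (int w = 0; w < WAYS; w++)          /* small search within the set */
            if (cache[index][w].valid && cache[index][w].tag == tag)
                return true;
        return false;
    }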

33 Comparison: Direct Mapped
Using byte addresses in this example! Addr Bus = 5 bits
Cache: 2-word blocks; 2-bit tag, 2-bit index, 1-bit block offset (4 cache lines)
Access sequence: LB $1 ← M[1], LB $2 ← M[5], LB $3 ← M[1], LB $3 ← M[4], LB $2 ← M[0], LB $2 ← M[12], LB $2 ← M[5], LB $2 ← M[12], LB $2 ← M[5], LB $2 ← M[12], LB $2 ← M[5]
[Worked comparison slide: Misses/Hits table garbled in the transcript.]

34 Comparison: Fully Associative
Using byte addresses in this example! Addr Bus = 5 bits
Cache: 4 cache lines, 2-word blocks; 4-bit tag, 1-bit block offset
Access sequence: same as the direct-mapped comparison above
[Worked comparison slide: Misses/Hits table garbled in the transcript.]

35 Comparison: 2-Way Set Associative
Using byte addresses in this example! Addr Bus = 5 bits
Cache: 2 sets of 2 ways, 2-word blocks; 3-bit tag, 1-bit set index, 1-bit block offset
Access sequence: same as the direct-mapped comparison above
[Worked comparison slide: Misses/Hits table garbled in the transcript.]

36 3-Way Set Associative Cache (Reading)
[Datapath diagram: the address is split into Tag | Index | Offset; the index selects a set of three lines, three comparators (=) check the stored tags against the address tag to drive line select / hit?, and the offset drives the word select on the 32-bit data output (64-byte blocks).]

37 Remaining Issues To Do: Evicting cache lines Picking cache parameters Writing using the cache

38 Summary
Caching assumptions: small working set (90/10 rule); can predict the future (spatial & temporal locality)
Benefits: big & fast memory built from (big & slow) + (small & fast)
Tradeoffs: associativity, line size, hit cost, miss penalty, hit rate
Fully Associative → higher hit cost, higher hit rate
Larger block size → lower hit cost, higher miss penalty
Next up: other designs; writing to caches
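A standard relation (not spelled out on these slides) ties the summary's tradeoff terms together:

    average memory access time = hit time + miss rate × miss penalty

For example, a 2-cycle hit time with a 5% miss rate and a 100-cycle miss penalty averages 2 + 0.05 × 100 = 7 cycles per access; associativity, line size, and cache size shift all three terms.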