
1 Lecture 10: Cache. Peng Liu (liupeng@zju.edu.cn)

2 Physical Size Affects Latency. (Figure: a CPU next to a small memory and a CPU next to a big memory.) In a bigger memory, signals have further to travel and must fan out to more locations.

3 Memory Hierarchy. (Figure: the CPU talks to a small, fast memory (RF, SRAM), which is backed by a big, slow memory (DRAM).) Capacity: Register << SRAM << DRAM. Latency: Register << SRAM << DRAM. Bandwidth: on-chip >> off-chip. On a data access: if the data is in the fast memory, the access has low latency (SRAM); if it is not, the access has high latency (DRAM). The small, fast memory holds frequently used data.

4 Memory Technology: Speed & Cost. Static RAM (SRAM): 0.3 ns-2.5 ns. Dynamic RAM (DRAM): 50 ns-70 ns. Magnetic disk: 5 ms-20 ms. Ideal memory: the access time of SRAM with the capacity and cost/GB of disk.

5 Random Access Memory (RAM). Dynamic Random Access Memory (DRAM): high density, low power, cheap, but slow; "dynamic" because the data must be "refreshed" regularly ("leaky buckets"); contents are lost when power is lost. Static Random Access Memory (SRAM): lower density (about 1/10 the density of DRAM), higher cost; "static" because data is held without refresh as long as power is on; fast access time, often 2 to 10 times faster than DRAM. Flash memory: holds its contents without power; data is written in blocks, generally slow; very cheap.

6 Programs Have Locality. Principle of locality: programs access a relatively small portion of the address space at any given time, so we can predict which memory locations a program will reference in the future by looking at what it referenced in the recent past. Two types of locality: Temporal locality – if an item has been referenced recently, it will tend to be referenced again soon. Spatial locality – if an item has been referenced recently, nearby items (nearby in terms of memory addresses) will tend to be referenced soon.

7 Locality Examples. Spatial locality – likely to reference data near recent references; example: for (i=0; i<N; i++) a[i] = … Temporal locality – likely to reference the same data that was referenced recently; example: for (i=0; i<N; i++) a[i] = f(a[i-1]);
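
A runnable version of these two loops is sketched below; the array size and the update expression are chosen only for illustration, and the temporal-locality loop starts at i = 1 so that a[i-1] stays in bounds:

    #include <stdio.h>

    #define N 1024

    int main(void) {
        static double a[N];

        /* Spatial locality: consecutive iterations touch adjacent addresses a[0], a[1], ... */
        for (int i = 0; i < N; i++)
            a[i] = i;

        /* Temporal locality: each iteration re-reads a[i-1], which was written just before. */
        for (int i = 1; i < N; i++)
            a[i] = 0.5 * a[i - 1] + 1.0;

        printf("a[N-1] = %f\n", a[N - 1]);
        return 0;
    }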

8 Taking Advantage of Locality. Memory hierarchy: store everything on disk (or Flash in some systems); copy recently accessed (and nearby) data to a smaller DRAM memory, called main memory; copy more recently accessed (and nearby) data to an even smaller SRAM memory, the cache memory, attached or close to the CPU.

9 Typical Memory Hierarchy: Everything is a Cache for Something Else
    Level (inferred)      Access time         Capacity    Managed by
    Registers             1 cycle             ~500 B      software/compiler
    L1 cache (SRAM)       1-3 cycles          ~64 KB      hardware
    L2 cache (SRAM)       5-10 cycles         1-10 MB     hardware
    Main memory (DRAM)    ~100 cycles         ~10 GB      software/OS
    Disk / Flash          10^6-10^7 cycles    ~100 GB     software/OS

10 Caches. A cache is an interim storage component that functions as a buffer for larger, slower storage components. It exploits the principles of locality: provide as much inexpensive storage space as possible, yet offer access speed equivalent to the fastest memory for data that is in the cache; the key is to have the right data cached. Computer systems often use multiple caches. Cache ideas are not limited to hardware designers; for example, web caches are widely used on the Internet.

11 Memory Hierarchy Levels. Block (aka line): the unit of copying; may be multiple words. If the accessed data is present in the upper level – hit: the access is satisfied by the upper level; hit ratio = hits/accesses. If the accessed data is absent – miss: the block is copied from the lower level; the time taken is the miss penalty; miss ratio = misses/accesses = 1 - hit ratio. The data is then supplied to the CPU from the upper level; the CPU pipeline is stalled in the meantime.

12 Average Memory Access Time. We need to define an average access time, since some accesses will be fast and some slow: Access time = hit time + miss rate x miss penalty. The hope is that both the hit time and the miss rate are low, since the miss penalty is so much larger than the hit time. The Average Memory Access Time (AMAT) formula can be applied to any level of the hierarchy (giving the access time for that level) and can be generalized for the entire hierarchy (giving the average access time the processor sees for a reference).
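
Worked example (the numbers are assumed for illustration, not taken from the lecture): with a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty, AMAT = 1 + 0.05 x 100 = 6 cycles, so even a small miss rate dominates the average.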

13 How the Processor Handles a Miss. Assume that a cache access takes 1 cycle: a hit is great, and the basic pipeline is fine; the CPI penalty = miss rate x miss penalty. A miss (whether for an instruction or for data) stalls the pipeline: stall the pipeline (the processor doesn't have the data it needs); send the address that missed to the memory; instruct main memory to perform a read and wait; when the access completes, return the data to the processor; restart the instruction.

14 How to Build a Cache? Big question: locating data in the cache. We need to map a large address space into a small memory; how? A fully associative lookup can be built in hardware, but it is complex; we need a simple but effective solution. Two common techniques: direct mapped and set associative. Further questions: block size (crucial for spatial locality), replacement policy (crucial for temporal locality), and write policy (writes are always more complex).

15 Terminology. Block – the minimum unit of information transfer between levels of the hierarchy; block addressing varies by technology at each level; blocks are moved one level at a time. Hit – the data appears in a block in the lower-numbered level. Hit rate – percent of accesses found. Hit time – time to access the lower-numbered level; hit time = cache access time + time to determine hit/miss. Miss – the data was not in the lower-numbered level and had to be fetched from a higher-numbered level. Miss rate – percent of misses (1 - hit rate). Miss penalty – the overhead in getting data from a higher-numbered level; miss penalty = higher-level access time + time to deliver to the lower level + cache replacement/forward-to-processor time; the miss penalty is usually much larger than the hit time.

16 Direct Mapped Cache. The location in the cache is determined by the (main) memory address. Direct mapped: only one choice – cache index = (block address) modulo (#blocks in the cache).
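
Worked example, using the 8-block cache of the next few slides: block address 22 maps to 22 modulo 8 = 6, i.e. cache index 110.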

17 Tags and Valid Bits. How do we know which particular block is stored in a cache location? Store the block address as well as the data; actually, only the high-order bits are needed, and they are called the tag. What if there is no data in a location? A valid bit: 1 = present, 0 = not present; initially 0.

18 Cache Example: 8 blocks, 1 word/block, direct mapped. Initial state: all eight entries (indices 000-111) have V = N, with no tag or data.

19 Cache Example: access word address 22 (binary 10 110) – miss (compulsory/cold miss), placed in cache block 110. Cache state: index 110 is now V = Y, Tag = 10, Data = Mem[10110]; all other entries are still invalid.

20 Cache Example: access word address 26 (binary 11 010) – miss (compulsory/cold miss), placed in cache block 010. Cache state: index 010 holds Tag = 11, Mem[11010]; index 110 holds Tag = 10, Mem[10110]; all other entries are invalid.

21 Cache Example: access word address 22 (binary 10 110) – hit in block 110; access word address 26 (binary 11 010) – hit in block 010. Cache state unchanged: index 010 holds Tag = 11, Mem[11010]; index 110 holds Tag = 10, Mem[10110].

22 Cache Example: access word address 16 (binary 10 000) – miss, block 000; access word address 3 (binary 00 011) – miss, block 011; access word address 16 (binary 10 000) again – hit. Cache state: index 000 holds Tag = 10, Mem[10000]; 010 holds Tag = 11, Mem[11010]; 011 holds Tag = 00, Mem[00011]; 110 holds Tag = 10, Mem[10110].

23 Cache Example: access word address 18 (binary 10 010) – miss; it maps to block 010, which already holds Mem[11010], so that block is replaced. Cache state: index 010 now holds Tag = 10, Mem[10010]; entries 000, 011, and 110 are unchanged.

24 Cache Organization & Access. Assumptions: 32-bit addresses, 4 KB cache, 1024 blocks of 1 word each. Steps: use the index to read the valid bit and tag from the cache; compare the stored tag with the tag from the address; if they match, return the data and a hit signal; otherwise, return a miss.
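
For concreteness, here is a small C sketch of that lookup under the same assumptions (32-bit addresses, 4 KB direct-mapped cache, 1024 one-word blocks); the struct and function names are illustrative, not taken from the lecture:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_BLOCKS 1024                    /* 4 KB cache / 4-byte (one-word) blocks */

    typedef struct {
        bool     valid;
        uint32_t tag;                          /* upper 20 address bits     */
        uint32_t data;                         /* one 32-bit word per block */
    } cache_line;

    static cache_line cache[NUM_BLOCKS];

    /* Returns true (hit) and the word through *out, or false (miss), in which
       case the block would be fetched from the next level of the hierarchy.  */
    bool lookup(uint32_t addr, uint32_t *out) {
        uint32_t index = (addr >> 2) & (NUM_BLOCKS - 1);  /* skip 2 byte-offset bits, keep 10 index bits */
        uint32_t tag   = addr >> 12;                      /* remaining upper 20 bits                     */

        if (cache[index].valid && cache[index].tag == tag) {
            *out = cache[index].data;          /* hit: return data and hit signal */
            return true;
        }
        return false;                          /* miss */
    }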

25 Larger Block Size. Motivation: exploit spatial locality & amortize overheads. This example: 64 blocks, 16 bytes/block. To what block number does address 1200 map? Block address = floor(1200/16) = 75; block number = 75 modulo 64 = 11. What is the impact of larger blocks on the tag/index size? What is the impact of larger blocks on the cache overhead (tags & valid bits)?

26 Cache Block Example. Assume a 2^n-byte direct-mapped cache with 2^m-byte blocks: byte select – the lower m bits; cache index – the next (n - m) bits of the memory address; cache tag – the upper (32 - n) bits of the memory address.
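
Worked example (sizes chosen only for illustration): a 4 KB direct-mapped cache (n = 12) with 16-byte blocks (m = 4) has a 4-bit byte select, an 8-bit index (n - m = 8), and a 20-bit tag (32 - n = 20).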

27 Block Sizes. Larger block sizes take advantage of spatial locality, but they also incur a larger miss penalty, since it takes longer to transfer the block into the cache; very large blocks can even increase the miss rate (fewer blocks in the cache means more conflicts) and hence the average access time. There is a tradeoff in selecting the block size: Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate.

28 Direct Mapped Problems: Conflict Misses. Two blocks that are used concurrently may map to the same index: only one can fit in the cache, regardless of cache size, and there is no flexibility to place the 2nd block elsewhere. Thrashing: if accesses to the two blocks alternate, each will replace the other before it can be reused, so there is no benefit from caching. Conflicts & thrashing can happen quite often.

29 Fully Associative Cache. The opposite extreme: it has no cache index to hash, so any available entry can be used to store a memory block. There are no conflict misses, only capacity (and compulsory) misses, but the cache tags of all entries must be compared to find the desired one.

30 N-way Set Associative. A compromise between direct-mapped and fully associative: each memory block can go to one of N entries in the cache. Each "set" can store N blocks; a cache contains some number of sets. For fast access, all blocks in a set are searched in parallel. Two ways to think of an N-way set-associative cache with X sets: 1st view – N direct-mapped caches, each with X entries, searched in parallel, with logic to combine the data outputs and hit/miss signals; 2nd view – X fully associative caches, each with N entries, with exactly one of them searched on each access (see the sketch below).
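
A C sketch of the first view: the set index selects a row and every way in that row is checked (hardware does the checks in parallel; the loop below does them sequentially). The geometry, 256 sets of 4 ways with 16-byte blocks (a 16 KB cache), and all names are assumptions for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SETS    256                    /* illustrative: 256 sets        */
    #define NUM_WAYS    4                      /* N = 4 ways per set            */
    #define BLOCK_BYTES 16                     /* 16-byte blocks -> 16 KB total */

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint8_t  data[BLOCK_BYTES];
    } way_entry;

    static way_entry sa_cache[NUM_SETS][NUM_WAYS];

    /* Returns the way that hit (0..NUM_WAYS-1), or -1 on a miss. */
    int sa_lookup(uint32_t addr) {
        uint32_t set = (addr >> 4) & (NUM_SETS - 1);   /* 4 offset bits, then 8 set-index bits */
        uint32_t tag = addr >> 12;                     /* remaining upper 20 bits              */

        for (int way = 0; way < NUM_WAYS; way++)       /* hardware checks all ways at once     */
            if (sa_cache[set][way].valid && sa_cache[set][way].tag == tag)
                return way;
        return -1;                                     /* miss: no way matched */
    }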

31 Associative Cache Example. (Figure.)

32 Associativity Example. Compare 4-block caches: direct mapped, 2-way set associative, and fully associative, on the block access sequence 0, 8, 0, 6, 8. Direct mapped: (0 modulo 4) = 0, (6 modulo 4) = 2, (8 modulo 4) = 0.
    Block addr   Cache index   Hit/miss   Cache content after access
    0            0             miss       Mem[0]
    8            0             miss       Mem[8]
    0            0             miss       Mem[0]
    6            2             miss       Mem[0], Mem[6]
    8            0             miss       Mem[8], Mem[6]

33 Associativity Example (continued). 2-way set associative (set 1 stays empty):
    Block addr   Set   Hit/miss   Set 0 content after access
    0            0     miss       Mem[0]
    8            0     miss       Mem[0], Mem[8]
    0            0     hit        Mem[0], Mem[8]
    6            0     miss       Mem[0], Mem[6]
    8            0     miss       Mem[8], Mem[6]
Fully associative:
    Block addr   Hit/miss   Cache content after access
    0            miss       Mem[0]
    8            miss       Mem[0], Mem[8]
    0            hit        Mem[0], Mem[8]
    6            miss       Mem[0], Mem[8], Mem[6]
    8            hit        Mem[0], Mem[8], Mem[6]

34 Set Associative Cache Organization. (Figure.)

35 Tag & Index with Set-Associative Caches. Assume a 2^n-byte cache with 2^m-byte blocks that is 2^a-way set-associative. Which bits of the address are the tag and which are the index? The m least significant bits are the byte select within the block. Basic idea: the cache contains 2^n / 2^m = 2^(n-m) blocks; each cache way contains 2^(n-m) / 2^a = 2^(n-m-a) blocks; the cache index is the (n-m-a) bits after the byte select, and the same index is used with all cache ways. Observation: for a fixed cache size, the length of the tags increases with the associativity, so associative caches incur more overhead for tags.
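
Worked example (sizes assumed for illustration): a 32 KB cache (n = 15) with 64-byte blocks (m = 6) that is 4-way set-associative (a = 2) has 2^9 = 512 blocks in total, 2^7 = 128 blocks per way, a 7-bit index, and a 32 - 6 - 7 = 19-bit tag; making the same cache 8-way drops the index to 6 bits and lengthens the tag to 20 bits.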

36 Bonus Trick: How to Build a 3 KB Cache? It would be difficult to make a direct-mapped 3 KB cache: 3 KB is not a power of two; assuming 128-byte blocks, there are 24 blocks to select from, and (block address % 24) is very expensive to calculate, unlike (block address % 32), which only requires looking at the 5 least-significant bits. Solution: start with a 4 KB 4-way set-associative cache; every way can hold 1 KB (8 blocks), and the same 3-bit index (the 3 LS bits of the address after eliminating the block offset) is used to access all 4 cache ways. Now drop the 4th way of the cache, as if that 4th way always reports a miss and never receives data.

37 Associative Cache: Pros. Increased associativity decreases the miss rate by eliminating conflicts, but with diminishing returns. Simulation of a system with a 64 KB D-cache and 16-word blocks: 1-way: 10.3%; 2-way: 8.6%; 4-way: 8.3%; 8-way: 8.1%. Caveat: a cache shared by multiple processors may need higher associativity.

38 Associative Caches: Cons. Area overhead: more storage is needed for tags (compared to a same-sized direct-mapped cache), plus N comparators. Latency: the critical path is the way access + comparator + logic to combine the answers (logic to OR the hit signals and multiplex the data outputs); the data cannot be forwarded to the processor immediately, since it must first wait for way selection and multiplexing, whereas a direct-mapped cache assumes a hit and recovers later if it was a miss. Complexity: dealing with replacement.

39 Acknowledgements. These slides contain material from the courses UCB CS152 and Stanford EE108B.

40 Homework. You may work in groups of 2 students, but turn in only one HW per group! 5.3: 5.3.1, 5.3.3 (reference stream b). 5.4: 5.4.3. 5.6: 5.6.3. 5.8: 5.8.3.

41 Break. Homework and quiz review.

42 Placement Policy. (Figure: memory blocks numbered 0-31 mapping into an 8-block cache.) Memory block 12 can be placed: anywhere in a fully associative cache; anywhere in set 0 of a 2-way set-associative cache with 4 sets (12 mod 4 = 0); only into block 4 of a direct-mapped cache (12 mod 8 = 4).

43 Direct-Mapped Cache. (Figure: the address is split into Tag (t bits), Index (k bits), and Block Offset (b bits); the index selects one of 2^k lines, each holding a valid bit, a tag, and a data block; the stored tag is compared with the address tag to generate HIT, and the block offset selects the data word or byte.)

44 2-Way Set-Associative Cache. (Figure: the same Tag/Index/Block Offset split, but the index selects a set of two ways; both stored tags are compared with the address tag in parallel, the comparator outputs are combined into HIT, and a multiplexer selects the data word or byte from the hitting way.)

45 Fully Associative Cache. (Figure: the address has only a Tag (t bits) and a Block Offset (b bits), with no index; the tag is compared against every entry's tag in parallel, any match raises HIT, and the block offset selects the data word or byte from the matching block.)

46 Replacement Methods. Which line do you replace on a miss? Direct mapped: easy, there is only one choice – replace the line at the index you need. N-way set associative: you need to choose which way to replace – Random (choose one at random) or Least Recently Used (LRU, the one used least recently); exact LRU is often difficult to track, so people use approximations that really just pick a block that was not recently used.

47 Replacement Policy. In an associative cache, which block from a set should be evicted when the set becomes full? Random. Least Recently Used (LRU): the LRU state must be updated on every access; a true implementation is only feasible for small sets (2-way); a pseudo-LRU binary tree is often used for 4-8 ways (see the sketch below). First In, First Out (FIFO), a.k.a. Round-Robin: used in highly associative caches. Not Least Recently Used (NLRU): FIFO with an exception for the most recently used block or blocks. Replacement is a second-order effect. Why? Because replacement only happens on misses.
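
As one concrete illustration of the pseudo-LRU binary tree for a 4-way set, the C sketch below keeps 3 bits per set; the exact bit encoding is just one common choice, not necessarily what any particular cache uses:

    #include <stdint.h>

    /* Tree pseudo-LRU state for one 4-way set: 3 bits, each kept pointing
       at the *less* recently used side of its subtree.
       bit 0 (0x1): victim half, 0 = left (ways 0-1), 1 = right (ways 2-3)
       bit 1 (0x2): victim within the left half,  0 = way 0, 1 = way 1
       bit 2 (0x4): victim within the right half, 0 = way 2, 1 = way 3   */
    typedef struct { uint8_t bits; } plru4;

    /* Update the tree after an access (hit or fill) to way w (0..3). */
    void plru4_touch(plru4 *s, int w) {
        if (w < 2) {
            s->bits |= 0x1;                            /* left half used -> victim on the right */
            if (w == 0) s->bits |= 0x2; else s->bits &= ~0x2;
        } else {
            s->bits &= ~0x1;                           /* right half used -> victim on the left */
            if (w == 2) s->bits |= 0x4; else s->bits &= ~0x4;
        }
    }

    /* Choose a victim way by following the bits down the tree. */
    int plru4_victim(const plru4 *s) {
        if (s->bits & 0x1)                             /* victim in the right half */
            return (s->bits & 0x4) ? 3 : 2;
        else                                           /* victim in the left half  */
            return (s->bits & 0x2) ? 1 : 0;
    }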

48 Block Size and Spatial Locality. (Figure: the CPU address is split into a block address (the upper 32-b bits, used as the tag) and a b-bit offset, where 2^b = block size, a.k.a. line size, in bytes; the example shows a 4-word block with 2 offset bits selecting the word.) The block is the unit of transfer between the cache and memory. A larger block size has distinct hardware advantages: less tag overhead, and it exploits fast burst transfers from DRAM and over wide busses. What are the disadvantages of increasing the block size? Fewer blocks means more conflicts, and bandwidth can be wasted.

49 CPU-Cache Interaction (5-stage pipeline). (Figure: the 5-stage pipeline datapath with a primary instruction cache accessed in the fetch stage and a primary data cache accessed in the memory stage; each cache produces a hit? signal that controls the pipeline, and cache refill data comes from the lower levels of the memory hierarchy.) Stall the entire CPU on a data cache miss.

50 Improving Cache Performance. Average memory access time = Hit time + Miss rate x Miss penalty. To improve performance: reduce the hit time, reduce the miss rate, or reduce the miss penalty. What is the simplest design strategy? The biggest cache that doesn't increase the hit time past 1-2 cycles (approx. 8-32 KB in modern technology). [Design issues are more complex with out-of-order superscalar processors.]

51 Serial-versus-Parallel Cache and Memory Access. Let α be the HIT RATIO (fraction of references that hit in the cache) and 1 - α the MISS RATIO (the remaining references). Serial search (access the cache first, go to main memory only on a miss): average access time = t_cache + (1 - α) * t_mem. Parallel search (access cache and main memory at the same time): average access time = α * t_cache + (1 - α) * t_mem. The savings are usually small, since t_mem >> t_cache and the hit ratio α is high; the parallel path requires high bandwidth on the memory path, and the complexity of handling parallel paths can slow t_cache.
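
Worked example (illustrative numbers): with α = 0.95, t_cache = 1 cycle, and t_mem = 100 cycles, the serial organization averages 1 + 0.05 x 100 = 6 cycles and the parallel one 0.95 x 1 + 0.05 x 100 = 5.95 cycles, so the saving is indeed small.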

52 Causes for Cache Misses. Compulsory: first reference to a block (a.k.a. cold-start misses) – misses that would occur even with an infinite cache. Capacity: the cache is too small to hold all the data needed by the program – misses that would occur even under a perfect replacement policy. Conflict: misses that occur because of collisions due to the block-placement strategy – misses that would not occur with full associativity.

53 Miss Rates and the 3Cs. (Figure.)

54 Effect of Cache Parameters on Performance. Larger cache size: + reduces capacity and conflict misses; - hit time will increase. Higher associativity: + reduces conflict misses; - may increase hit time. Larger block size: + reduces compulsory and capacity (reload) misses; - increases conflict misses and the miss penalty.

55 Write Policy Choices. Cache hit: write-through – write both the cache & memory; generally higher traffic, but it simplifies cache coherence; write-back – write the cache only (memory is written only when the entry is evicted); a dirty bit per block can further reduce the traffic. Cache miss: no-write-allocate – only write to main memory; write-allocate (a.k.a. fetch-on-write) – fetch the block into the cache. Common combinations: write-through with no-write-allocate, and write-back with write-allocate (both sketched below).
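
The two common combinations can be sketched in C roughly as follows, reusing the direct-mapped geometry from the earlier lookup sketch; the memory stubs and all names are assumptions for illustration, not the lecture's notation:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_BLOCKS 1024                    /* same geometry as the earlier sketch     */
    #define MEM_WORDS  (1u << 20)              /* tiny word-addressed "main memory" stub  */

    typedef struct {
        bool     valid, dirty;                 /* the dirty bit is only used by write-back */
        uint32_t tag;
        uint32_t data;
    } dline;

    static dline    dcache[NUM_BLOCKS];
    static uint32_t main_memory[MEM_WORDS];

    static void     mem_write_word(uint32_t addr, uint32_t v) { main_memory[(addr >> 2) & (MEM_WORDS - 1)] = v; }
    static uint32_t mem_read_word(uint32_t addr)              { return main_memory[(addr >> 2) & (MEM_WORDS - 1)]; }

    /* Write-through + no-write-allocate: memory is always written;
       the cache is updated only if the word is already resident.   */
    void store_write_through(uint32_t addr, uint32_t value) {
        uint32_t idx = (addr >> 2) & (NUM_BLOCKS - 1), tag = addr >> 12;
        if (dcache[idx].valid && dcache[idx].tag == tag)
            dcache[idx].data = value;          /* write hit: keep the cached copy coherent */
        mem_write_word(addr, value);           /* always write memory                      */
    }

    /* Write-back + write-allocate: only the cache is written; a dirty
       block goes back to memory when it is evicted.                  */
    void store_write_back(uint32_t addr, uint32_t value) {
        uint32_t idx = (addr >> 2) & (NUM_BLOCKS - 1), tag = addr >> 12;
        dline *l = &dcache[idx];
        if (!(l->valid && l->tag == tag)) {                       /* write miss              */
            if (l->valid && l->dirty)                             /* evict dirty block first */
                mem_write_word((l->tag << 12) | (idx << 2), l->data);
            l->data  = mem_read_word(addr);                       /* allocate: fetch block   */
            l->tag   = tag;
            l->valid = true;
        }
        l->data  = value;                      /* update the cached copy only */
        l->dirty = true;                       /* memory is now stale         */
    }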

56 Write Performance. (Figure: the direct-mapped datapath again – Tag (t bits), Index (k bits), Block Offset (b bits), 2^k lines, valid-bit and tag compare producing HIT – now with a write-enable (WE) input on the data array for stores.)

57 Reducing Write Hit Time. Problem: writes take two cycles in the memory stage, one cycle for the tag check plus one cycle for the data write if it hits. Solutions: design a data RAM that can perform a read and a write in one cycle, restoring the old value after a tag miss; fully-associative (CAM-tag) caches, where the word line is only enabled on a hit; pipelined writes, which hold the store's write data in a single buffer ahead of the cache and write the cache data during the next store's tag check.

58 Pipelining Cache Writes. (Figure: the address and store data arrive from the CPU; the tag array is indexed for the tag check while delayed write data and a delayed write address from the previous store are written into the data array.) Data from a store hit is written into the data portion of the cache during the tag access of the subsequent store.

59 Write Buffer to Reduce Read Miss Penalty. (Figure: a write buffer sits between the data cache and the unified L2 cache; it holds evicted dirty lines for a write-back cache, or all writes for a write-through cache.) The processor is not stalled on writes, and read misses can go ahead of writes to main memory. Problem: the write buffer may hold the updated value of a location needed by a read miss. Simple scheme: on a read miss, wait for the write buffer to go empty. Faster scheme: check the write buffer addresses against the read-miss address; if there is no match, allow the read miss to go ahead of the writes; else, return the value in the write buffer (see the sketch below).
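
A C sketch of that faster scheme; the buffer depth, names, and memory stub are assumptions for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 8                       /* illustrative write-buffer depth */

    typedef struct { bool valid; uint32_t addr, data; } wb_entry;
    static wb_entry write_buffer[WB_ENTRIES];

    /* Placeholder for a read that goes on to main memory / L2. */
    static uint32_t mem_read_word(uint32_t addr) { (void)addr; return 0; }

    /* Read-miss path: forward the value from the write buffer on an address
       match; otherwise the read miss may go ahead of the buffered writes.   */
    uint32_t read_miss(uint32_t addr) {
        for (int i = 0; i < WB_ENTRIES; i++)
            if (write_buffer[i].valid && write_buffer[i].addr == addr)
                return write_buffer[i].data;   /* match: return the buffered value  */
        return mem_read_word(addr);            /* no match: read memory immediately */
    }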

60 Block-level Optimizations. Tags are too large, i.e., too much overhead. Simple solution: larger blocks, but the miss penalty could be large. Sub-block placement (a.k.a. sector cache): a valid bit is added to units smaller than the full block, called sub-blocks, and only a sub-block is read on a miss. If a tag matches, is the word in the cache? Only if its sub-block's valid bit is set. (Figure: tags 100, 300, and 204, each with per-sub-block valid bits.)

61 Set-Associative RAM-Tag Cache. (Figure: the Tag/Index/Offset address indexes parallel tag+status and data RAMs, with comparators on the tag outputs.) Not energy-efficient: a tag and a data word are read from every way. Two-phase approach: first read the tags, then read the data only from the selected way; this is more energy-efficient but doubles the latency, which is too slow for L1 but OK for L2 and above. Why? Because L2 is accessed only on L1 misses, so its extra latency matters much less.

