1 Lecture 10: Cache (2)
Peng Liu (liupeng@zju.edu.cn)

2 Review: Memory Technology Speed & Cost
Static RAM (SRAM): 0.3 ns - 2.5 ns
Dynamic RAM (DRAM): 50 ns - 70 ns
Magnetic disk: 5 ms - 20 ms
Ideal memory: the access time of SRAM with the capacity and cost/GB of disk

3 Review: Locality Examples
Spatial locality: likely to reference data near recent references. Example: for (i = 0; i < N; i++) a[i] = …
Temporal locality: likely to reference the same data that was referenced recently. Example: a[i] = f(a[i-1]);
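To make the two kinds of locality concrete, here is a minimal C sketch (not from the slides; the array size N is arbitrary): the first loop exhibits spatial locality, the second temporal locality.

    #include <stdio.h>

    #define N 1024

    int main(void) {
        static int a[N];

        /* Spatial locality: consecutive iterations touch adjacent elements,
           so a cache block fetched for a[i] also serves a[i+1], a[i+2], ... */
        for (int i = 0; i < N; i++)
            a[i] = i;

        /* Temporal locality: a[i-1] was written on the previous iteration and
           is almost certainly still in the cache when it is read here. */
        for (int i = 1; i < N; i++)
            a[i] = a[i - 1] + 1;

        printf("%d\n", a[N - 1]);
        return 0;
    }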

4 Review: Memory Hierarchy Levels
Block (aka line): the unit of copying; may be multiple words
If accessed data is present in the upper level
  Hit: access satisfied by the upper level
  Hit ratio: hits / accesses
If accessed data is absent
  Miss: block copied from the lower level; the time taken is the miss penalty
  Miss ratio: misses / accesses = 1 - hit ratio
  Then the data is supplied to the CPU from the upper level; the CPU pipeline is stalled in the meantime

5 How to Build A Cache? Big question: locating data in the cache
I need to map a large address space into a small memory. How do I do that?
A fully associative lookup can be built in hardware, but it is complex; we need a simple but effective solution.
Two common techniques: direct mapped and set associative
Further questions: block size (crucial for spatial locality), replacement policy (crucial for temporal locality), write policy (writes are always more complex)

6 Direct Mapped Cache
The location in the cache is determined by the (main) memory address.
Direct mapped: only one choice: (block address) modulo (#blocks in cache)
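As a small illustration (not part of the slide), the C sketch below computes the direct-mapped index; the block count and block size are made-up parameters, and the bit-mask form works only because the block count is a power of two.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical parameters for illustration: 8 one-word (4-byte) blocks. */
    #define NUM_BLOCKS 8u
    #define BLOCK_BYTES 4u

    int main(void) {
        uint32_t addr = 0x10B58;                    /* example byte address */
        uint32_t block_addr = addr / BLOCK_BYTES;
        uint32_t index = block_addr % NUM_BLOCKS;   /* (block address) mod (#blocks) */

        /* Because NUM_BLOCKS is a power of two, the modulo is just the
           low-order bits of the block address. */
        uint32_t index_by_mask = block_addr & (NUM_BLOCKS - 1);

        printf("index = %u (mask gives %u)\n", index, index_by_mask);
        return 0;
    }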

7 Tags and Valid Bits
How do we know which particular block is stored in a cache location? Store the block address as well as the data. Actually, only the high-order bits are needed; they are called the tag.
What if there is no data in a location? A valid bit indicates this: 1 = present, 0 = not present. Initially 0.

8 Cache Example
8 blocks, 1 word/block, direct mapped.
Initial state: every entry is invalid (V = N for indices 000-111, no tags or data).

9 Cache Example
Word addr 22, binary 10 110: Miss (compulsory/cold miss), placed in cache block 110.
Cache contents after the access: index 110 → V = Y, Tag = 10, Data = Mem[10110]; all other entries still invalid.

10 Cache Example
Word addr 26, binary 11 010: Miss (compulsory/cold miss), placed in cache block 010.
Cache contents: index 010 → V = Y, Tag = 11, Data = Mem[11010]; index 110 → V = Y, Tag = 10, Data = Mem[10110].

11 Cache Example
Word addr 22, binary 10 110: Hit in cache block 110. Word addr 26, binary 11 010: Hit in cache block 010.
Cache contents unchanged: index 010 → Tag 11, Mem[11010]; index 110 → Tag 10, Mem[10110].

12 Cache Example
Word addr 16, binary 10 000: Miss, placed in block 000. Word addr 3, binary 00 011: Miss, placed in block 011. Word addr 16 again: Hit in block 000.
Cache contents: index 000 → Tag 10, Mem[10000]; 010 → Tag 11, Mem[11010]; 011 → Tag 00, Mem[00011]; 110 → Tag 10, Mem[10110].

13 Cache Example
Word addr 18, binary 10 010: Miss; it maps to block 010, which already holds Mem[11010], so the old block is replaced.
Cache contents: index 000 → Tag 10, Mem[10000]; 010 → Tag 10, Mem[10010]; 011 → Tag 00, Mem[00011]; 110 → Tag 10, Mem[10110].

14 Cache Organization & Access
Assumptions: 32-bit addresses, 4 KB cache, 1024 blocks, 1 word/block
Steps: use the index to read V and the tag from the cache; compare the stored tag with the tag from the address; if they match, return the data and a hit signal; otherwise, signal a miss.
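A possible C sketch of these lookup steps, using the slide's parameters (4 KB, 1024 one-word blocks); the structure and function names are illustrative, and the miss-side refill is omitted.

    #include <stdbool.h>
    #include <stdint.h>

    /* 4 KB direct-mapped cache, 1024 blocks of one 4-byte word, 32-bit byte
       addresses. Field widths: 2-bit byte offset, 10-bit index, 20-bit tag. */
    #define NUM_BLOCKS 1024

    struct line {
        bool     valid;
        uint32_t tag;
        uint32_t data;   /* one word per block */
    };

    static struct line cache[NUM_BLOCKS];

    /* Returns true on a hit and places the word in *word; on a miss the
       controller would fetch the block from memory (not shown here). */
    bool cache_read(uint32_t addr, uint32_t *word)
    {
        uint32_t index = (addr >> 2) & 0x3FF;   /* address bits [11:2]  */
        uint32_t tag   = addr >> 12;            /* address bits [31:12] */

        struct line *l = &cache[index];
        if (l->valid && l->tag == tag) {        /* compare stored tag with address tag */
            *word = l->data;                    /* hit: return the data */
            return true;
        }
        return false;                           /* miss: caller refills from memory */
    }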

15 Cache Block Example
Assume a 2^n-byte direct-mapped cache with 2^m-byte blocks.
Byte select: the lower m bits of the memory address
Cache index: the next (n - m) bits, above the byte select
Cache tag: the upper (32 - n) bits of the memory address

16 Block Sizes
Larger block sizes take advantage of spatial locality, but they also incur a larger miss penalty, since it takes longer to transfer the block into the cache. Very large blocks can also increase the average access time or the miss rate, so there is a trade-off in selecting block size.
Average Access Time = Hit Time × (1 − MR) + Miss Penalty × MR

17 Direct Mapped Problems: Conflict Misses
Two blocks that are used concurrently may map to the same index; only one can fit in the cache, regardless of cache size, and there is no flexibility to place the second block elsewhere.
Thrashing: if accesses to the two blocks alternate, each replaces the other before it can be reused, so caching provides no benefit.
Conflicts and thrashing can happen quite often.

18 Fully Associative Cache
The opposite extreme: there is no cache index to hash, and any available entry can store a memory block.
No conflict misses, only capacity misses, but the cache tags of all entries must be compared to find the desired one.

19 N-way Set Associative
A compromise between direct-mapped and fully associative: each memory block can go to one of N entries in the cache. Each "set" can store N blocks, and a cache contains some number of sets. For fast access, all blocks in a set are searched in parallel (see the sketch below).
Two ways to think of an N-way set-associative cache with X sets:
1st view: N direct-mapped caches, each with X entries, searched in parallel; their outputs must be coordinated to produce the data and the hit/miss signal.
2nd view: X fully associative caches, each with N entries; exactly one of them is searched on each access.
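A rough C sketch of an N-way lookup (sizes and names are illustrative, not from the slide); the loop over the ways stands in for the parallel comparators.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical 2-way set-associative cache with 4-byte blocks. */
    #define NUM_SETS 256
    #define NUM_WAYS 2

    struct line {
        bool     valid;
        uint32_t tag;
        uint32_t data;
    };

    static struct line cache[NUM_SETS][NUM_WAYS];

    bool sa_read(uint32_t addr, uint32_t *word)
    {
        uint32_t set = (addr >> 2) % NUM_SETS;  /* 2-bit offset, 8-bit set index */
        uint32_t tag = addr >> 10;              /* bits above offset + index */

        for (int way = 0; way < NUM_WAYS; way++) {
            struct line *l = &cache[set][way];
            if (l->valid && l->tag == tag) {
                *word = l->data;
                return true;                    /* hit in this way */
            }
        }
        return false;                           /* miss: pick a victim way and refill */
    }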

20 Associative Cache Example

21 Associativity Example
Compare 4-block caches: direct mapped, 2-way set associative, and fully associative. Assume small caches, each consisting of four one-word blocks.
Block access sequence: 0, 8, 0, 6, 8. Direct-mapped index: (0 mod 4) = 0, (6 mod 4) = 2, (8 mod 4) = 0.
Direct mapped:
Block addr | Index | Hit/miss | Cache contents after access
0 | 0 | miss | Mem[0]
8 | 0 | miss | Mem[8]
0 | 0 | miss | Mem[0]
6 | 2 | miss | Mem[0], Mem[6]
8 | 0 | miss | Mem[8], Mem[6]

22 Associativity Example
2-way set associative (block address mod 2 selects the set; all of 0, 6, 8 map to set 0):
Block addr | Set | Hit/miss | Set 0 contents after access
0 | 0 | miss | Mem[0]
8 | 0 | miss | Mem[0], Mem[8]
0 | 0 | hit  | Mem[0], Mem[8]
6 | 0 | miss | Mem[0], Mem[6]
8 | 0 | miss | Mem[8], Mem[6]
Fully associative:
Block addr | Hit/miss | Cache contents after access
0 | miss | Mem[0]
8 | miss | Mem[0], Mem[8]
0 | hit  | Mem[0], Mem[8]
6 | miss | Mem[0], Mem[8], Mem[6]
8 | hit  | Mem[0], Mem[8], Mem[6]

23 Set Associative Cache Organization

24 Tag & Index with Set-Associative Caches
Assume a 2^n-byte cache with 2^m-byte blocks that is 2^a-way set-associative. Which bits of the address are the tag and which are the index?
The m least significant bits are the byte select within the block.
Basic idea: the cache contains 2^n / 2^m = 2^(n-m) blocks, and each cache way contains 2^(n-m) / 2^a = 2^(n-m-a) blocks.
Cache index: the (n - m - a) bits after the byte select; the same index is used with all cache ways.
Observation: for a fixed size, the length of the tags increases with the associativity, so associative caches incur more overhead for tags.
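A small sketch of this bit-count arithmetic, assuming 32-bit addresses as on the earlier slides; the example sizes are arbitrary, but they show the tag growing with associativity.

    #include <stdio.h>

    /* For a 2^n-byte cache with 2^m-byte blocks that is 2^a-way set-associative
       (32-bit addresses): offset = m bits, index = n - m - a bits,
       tag = 32 - m - (n - m - a) = 32 - n + a bits. */
    static void field_widths(int n, int m, int a)
    {
        int offset = m;
        int index  = n - m - a;
        int tag    = 32 - n + a;
        printf("cache 2^%d B, block 2^%d B, 2^%d-way: offset=%d index=%d tag=%d\n",
               n, m, a, offset, index, tag);
    }

    int main(void)
    {
        field_widths(15, 6, 0);   /* 32 KB direct-mapped, 64-byte blocks */
        field_widths(15, 6, 2);   /* same size, 4-way: two more tag bits */
        return 0;
    }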

25 Bonus Trick: How to build a 3KB cache?
It would be difficult to make a direct-mapped 3 KB cache: 3 KB is not a power of two. Assuming 128-byte blocks, we have 24 blocks to select from, and (address % 24) is very expensive to calculate, unlike (address % 32), which requires looking only at a few low-order bits.
Solution: start with a 4 KB 4-way set-associative cache. Every way can hold 1 KB (8 blocks), and the same 3-bit index (the 3 LS bits of the address after eliminating the block offset) is used to access all 4 cache ways.
Now drop the 4th way of the cache, as if that 4th way always reports a miss and never receives data.

26 Associative Cache: Pros
Increased associativity decreases the miss rate by eliminating conflicts, but with diminishing returns.
Simulation of a system with a 64 KB D-cache and 16-word blocks: 1-way: 10.3%, 2-way: 8.6%, 4-way: 8.3%, 8-way: 8.1%
Caveat: a cache shared by multiple processors may need higher associativity.

27 Associative Caches: Cons
Area overhead: more storage needed for tags (compared to a same-sized direct-mapped cache), plus N comparators.
Latency: the critical path is way access + comparator + the logic to combine the answers (logic to OR the hit signals and multiplex the data outputs). The data cannot be forwarded to the processor immediately; it must first wait for selection and multiplexing. A direct-mapped cache assumes a hit and recovers later if it was a miss.
Complexity: dealing with replacement.

28 Placement Policy
Where can memory block 12 be placed in an 8-block cache?
Fully associative: anywhere. (2-way) set associative: anywhere in set (12 mod 4). Direct mapped: only into block (12 mod 8), i.e., block 4.
The simplest scheme is to extract bits from the block number to determine the set. More sophisticated schemes will hash the block number — why could that be good/bad?

29 Direct-Mapped Cache
[Figure: the address is split into a t-bit tag, a k-bit index, and a b-bit block offset; the index selects one of 2^k lines, each holding a valid bit, tag, and data block; the stored tag is compared with the address tag to produce HIT, and the offset selects the data word or byte.]

30 2-Way Set-Associative Cache
[Figure: the t-bit tag, k-bit index, and b-bit block offset address two ways in parallel; each way has its own valid bit, tag, and data block; two tag comparators feed the HIT logic, and a multiplexer selects the data word or byte from the matching way.]
Compare the latency to the direct-mapped case?

31 Fully Associative Cache
[Figure: there is no index; the t-bit tag is compared against the tag of every cache block in parallel, the matching entry drives HIT, and the block offset selects the data word or byte.]

32 Replacement Methods
Which line do you replace on a miss?
Direct mapped: easy, you have only one choice; replace the line at the index you need.
N-way set associative: need to choose which way to replace. Random (choose one at random) or Least Recently Used (LRU, the one used least recently). True LRU is often difficult to compute, so approximations are used; often they really implement "not recently used".

33 Replacement Policy
In an associative cache, which block from a set should be evicted when the set becomes full?
Random
Least Recently Used (LRU): the LRU state must be updated on every access; a true implementation is only feasible for small sets (2-way); a pseudo-LRU binary tree is often used for 4- to 8-way caches (a sketch follows below)
First In, First Out (FIFO), a.k.a. Round-Robin: used in highly associative caches
Not Least Recently Used (NLRU): FIFO with an exception for the most recently used block or blocks; used in the Alpha TLBs
This is a second-order effect. Why? Replacement only happens on misses.
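A sketch of the pseudo-LRU binary tree idea for one 4-way set; the encoding of the tree bits is my own choice for illustration, not a specific hardware design.

    #include <stdint.h>

    /* Three tree bits per 4-way set: b0 picks the half, b1/b2 pick a way
       within each half. 0 means "left side is older", 1 means "right side is older". */
    struct plru_set {
        uint8_t b0, b1, b2;
    };

    /* Update the tree bits so that 'way' (0..3) becomes most recently used:
       every bit on the path is made to point away from the used way. */
    void plru_touch(struct plru_set *s, int way)
    {
        s->b0 = (way < 2) ? 1 : 0;
        if (way < 2) s->b1 = (way == 0) ? 1 : 0;
        else         s->b2 = (way == 2) ? 1 : 0;
    }

    /* Choose a victim by following the tree bits toward the "older" side. */
    int plru_victim(const struct plru_set *s)
    {
        if (s->b0 == 0)                      /* left half is older */
            return (s->b1 == 0) ? 0 : 1;
        else                                 /* right half is older */
            return (s->b2 == 0) ? 2 : 3;
    }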

34 Block Size and Spatial Locality
A block is the unit of transfer between the cache and memory. The CPU address is split into a block address (32 - b bits) and a block offset (b bits), where 2^b is the block size, a.k.a. line size, in bytes. Example: a 4-word block (Tag | Word0 | Word1 | Word2 | Word3).
Larger block size has distinct hardware advantages: less tag overhead, and it exploits fast burst transfers from DRAM and over wide busses. What are the disadvantages of increasing block size?
Larger blocks reduce compulsory misses (the first miss to a block), but may increase conflict misses since the number of blocks is smaller (fewer blocks => more conflicts), and can waste bandwidth.

35 CPU-Cache Interaction (5-stage pipeline)
[Figure: a 5-stage pipeline with a primary instruction cache feeding instruction fetch and a primary data cache in the memory stage; on a data cache miss the entire CPU is stalled while cache refill data arrives from the lower levels of the memory hierarchy.]

36 Improving Cache Performance
Average memory access time = Hit time + Miss rate × Miss penalty
To improve performance: reduce the hit time, reduce the miss rate, or reduce the miss penalty.
What is the simplest design strategy? Design the largest primary cache that does not slow down the clock or add pipeline stages: the biggest cache that doesn't increase hit time past 1-2 cycles (approx. 8-32 KB in modern technology). [Design issues are more complex with out-of-order superscalar processors.]
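A tiny worked example of the formula; the hit time, miss rate, and miss penalty below are made-up numbers for illustration only.

    #include <stdio.h>

    /* Average memory access time = Hit time + Miss rate * Miss penalty. */
    int main(void)
    {
        double hit_time     = 1.0;    /* cycles */
        double miss_rate    = 0.05;
        double miss_penalty = 100.0;  /* cycles */

        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.1f cycles\n", amat);   /* 1 + 0.05 * 100 = 6 cycles */
        return 0;
    }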

37 Serial-versus-Parallel Cache and Memory Access
Let a be the HIT RATIO (the fraction of references satisfied by the cache); 1 - a is the MISS RATIO (the remaining references).
Serial search (access the cache, then memory on a miss): average access time = t_cache + (1 - a) t_mem
Parallel search (access cache and memory at the same time): average access time = a t_cache + (1 - a) t_mem
Savings are usually small, since t_mem >> t_cache and the hit ratio a is high. Parallel access also requires high bandwidth on the memory path, and the complexity of handling parallel paths can slow t_cache.

38 Causes for Cache Misses
Compulsory: first reference to a block (a.k.a. cold-start misses); misses that would occur even with an infinite cache.
Capacity: the cache is too small to hold all the data needed by the program; misses that would occur even under a perfect replacement policy.
Conflict: misses that occur because of collisions due to the block-placement strategy; misses that would not occur with full associativity.

39 Miss Rates and 3Cs

40 Effect of Cache Parameters on Performance
Larger cache size: reduces capacity and conflict misses, but hit time will increase.
Higher associativity: reduces conflict misses, but may increase hit time.
Larger block size: spatial locality reduces compulsory misses and capacity (reload) misses, but fewer blocks may increase the conflict miss rate, and larger blocks may increase the miss penalty.

41 Write Policy Choices
Cache hit:
  write through: write both cache & memory; generally higher traffic, but simplifies cache coherence
  write back: write the cache only (memory is written only when the entry is evicted); a dirty bit per block can further reduce the traffic
Cache miss:
  no write allocate: only write to main memory
  write allocate (aka fetch on write): fetch the block into the cache
Common combinations: write through with no write allocate; write back with write allocate

42 Write Performance
[Figure: a write must first read the tag at the index and check for a hit before the data RAM can be written (write enable gated by HIT), so the tag check and data write are completely serial.]

43 Reducing Write Hit Time
Problem: writes take two cycles in the memory stage, one cycle for the tag check plus one cycle for the data write if it hits.
Solutions:
  Design a data RAM that can perform read and write in one cycle, restoring the old value after a tag miss
  Fully associative (CAM tag) caches: the word line is only enabled on a hit
  Pipelined writes: hold the write data for a store in a single buffer ahead of the cache, and write the cache data during the next store's tag check. Need to bypass from the write buffer if a read address matches the write buffer tag!

44 Pipelining Cache Writes
[Figure: delayed-write address and delayed-write data registers sit between the tag array and the data array.] Data from a store hit is written into the data portion of the cache during the tag access of the subsequent store.

45 Write Buffer to Reduce Read Miss Penalty
A write buffer between the CPU/data cache and the unified L2 cache holds evicted dirty lines for a write-back cache, or all writes for a write-through cache. The processor is not stalled on writes, and read misses can go ahead of writes to main memory.
Problem: the write buffer may hold the updated value of a location needed by a read miss.
Simple scheme: on a read miss, wait for the write buffer to go empty.
Faster scheme: check the write buffer addresses against the read miss address; if there is no match, allow the read miss to go ahead of the writes, else return the value in the write buffer.
Designers of the MIPS M/1000 estimated that waiting for a four-word buffer to empty increased the read miss penalty by a factor of 1.5.
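A sketch of the "faster scheme" in C; the buffer depth and the structure and function names are illustrative, not from any particular design.

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 4

    struct wb_entry {
        bool     valid;
        uint32_t addr;   /* block-aligned address */
        uint32_t data;
    };

    static struct wb_entry write_buffer[WB_ENTRIES];

    /* Before sending a read miss to memory, check the pending write-buffer
       entries. If one matches the miss address, forward its value instead of
       waiting for the buffer to drain. */
    bool read_miss_check_wb(uint32_t block_addr, uint32_t *data)
    {
        for (int i = 0; i < WB_ENTRIES; i++) {
            if (write_buffer[i].valid && write_buffer[i].addr == block_addr) {
                *data = write_buffer[i].data;   /* forward from the write buffer */
                return true;
            }
        }
        return false;   /* no match: the read miss may bypass the buffered writes */
    }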

46 Block-level Optimizations
Tags are too large, i.e., too much overhead. Simple solution: larger blocks, but the miss penalty could be large.
Sub-block placement (aka sector cache): a valid bit is added to units smaller than the full block, called sub-blocks, and only a sub-block is read on a miss. If a tag matches, is the word in the cache? The main reason for sub-block placement is to reduce tag overhead.

47 Set-Associative RAM-Tag Cache
[Figure: tag and data RAMs indexed by the cache index, with a comparator per way.]
Not energy-efficient: a tag and data word is read from every way.
Two-phase approach: first read the tags, then read data only from the selected way. More energy-efficient, but it doubles the latency — a problem in L1, yet OK for L2 and above. Why?

48 Multilevel Caches
A memory cannot be both large and fast, so cache sizes increase at each level: CPU → L1$ → L2$ → DRAM.
Local miss rate = misses in this cache / accesses to this cache
Global miss rate = misses in this cache / CPU memory accesses
Misses per instruction (MPI) = misses in this cache / number of instructions
MPI makes it easier to compute overall performance.
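A small worked example of these definitions; the access, miss, and instruction counts below are made-up numbers for illustration.

    #include <stdio.h>

    int main(void)
    {
        double cpu_accesses = 1000.0;
        double l1_misses    = 40.0;    /* accesses that miss in L1 */
        double l2_misses    = 10.0;    /* of those, accesses that also miss in L2 */
        double instructions = 800.0;

        double l1_local  = l1_misses / cpu_accesses;   /* L1 sees every reference */
        double l2_local  = l2_misses / l1_misses;      /* L2 only sees L1 misses */
        double l2_global = l2_misses / cpu_accesses;   /* relative to all CPU accesses */
        double mpi       = l2_misses / instructions;   /* misses per instruction */

        printf("L1 local %.3f, L2 local %.3f, L2 global %.3f, MPI %.4f\n",
               l1_local, l2_local, l2_global, mpi);
        return 0;
    }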

49 Multilevel Caches
Primary (L1) caches are attached to the CPU: small but fast, focusing on hit time rather than hit rate.
The level-2 cache services misses from the primary cache: larger and slower, but still faster than main memory; usually unified for instructions and data; focusing on hit rate rather than hit time.
Main memory services L2 cache misses. Some high-end systems include an L3 cache.

50 A Typical Memory Hierarchy
Split instruction & data primary caches (on-chip SRAM), backed by a large unified secondary cache (on-chip SRAM), backed by multiple interleaved memory banks (off-chip DRAM). The multiported register file is part of the CPU. The implementation close to the CPU looks like a Harvard machine.

51 Multilevel On-Chip Caches

52 Presence of L2 influences L1 design
Use a smaller L1 if there is also an L2: trade an increased L1 miss rate for reduced L1 hit time and reduced L1 miss penalty; this also reduces average access energy.
Use a simpler write-through L1 with an on-chip L2: the write-back L2 cache absorbs the write traffic, so it doesn't go off-chip; at most one L1 miss request per L1 access (no dirty-victim write back) simplifies pipeline control; simplifies coherence issues; simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when a parity error is detected on an L1 read).

53 Itanium-2 On-Chip Caches (Intel/HP, 2002)
Level 1: 16 KB, 4-way set-associative, 64 B line, quad-port (2 load + 2 store), single-cycle latency
Level 2: 256 KB, 4-way set-associative, 128 B line, quad-port (4 load or 4 store), five-cycle latency
Level 3: 3 MB, 12-way set-associative, 128 B line, single 32 B port, twelve-cycle latency
If two is good, then three must be better.

54 What About Writes? Where do we put the data we want to write?
In the cache? In main memory? In both? Caches have different policies for this question. Most systems store the data in the cache (why?); some also store the data in memory as well (why?).
Interesting observation: the processor does not need to "wait" until the store completes.

55 Cache Write Policies: Major Options
Write-through (write data go to both cache and memory): main memory is updated on each cache write; replacing a cache entry is simple (just overwrite it with the new block); memory writes cause significant delay if the pipeline must stall.
Write-back (write data only go to the cache): only the cache entry is updated on each cache write, so main memory and cache data are inconsistent; a "dirty" bit in the cache entry indicates whether the data in the entry must be committed to memory; replacing a cache entry requires writing the data back to memory before replacing the entry if it is "dirty".
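A minimal C sketch contrasting the two hit policies for a single cache line; memory_write() is a stand-in for the next level of the memory hierarchy, and the names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    struct line {
        bool     valid, dirty;
        uint32_t tag, data;
    };

    /* Stub for writing the next level of memory. */
    static void memory_write(uint32_t addr, uint32_t data) { (void)addr; (void)data; }

    /* Write-through: update the cache and memory on every write hit. */
    void write_through_hit(struct line *l, uint32_t addr, uint32_t data)
    {
        l->data = data;
        memory_write(addr, data);          /* memory is always kept up to date */
    }

    /* Write-back: update only the cache and mark the line dirty; memory is
       written later, when the dirty line is evicted. */
    void write_back_hit(struct line *l, uint32_t data)
    {
        l->data  = data;
        l->dirty = true;
    }

    void evict(struct line *l, uint32_t addr)
    {
        if (l->dirty)
            memory_write(addr, l->data);   /* commit the line before replacing it */
        l->valid = false;
        l->dirty = false;
    }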

56 Write Policy Trade-offs
Write-through: misses are simpler and cheaper (no write-back to memory) and it is easier to implement, but it requires buffering to be practical and uses a lot of bandwidth to the next level of memory.
Write-back: writes are fast on a hit; multiple writes within a block require only one "writeback" later; efficient block transfer on write back to memory at eviction.

57 Write Buffers for Write-Through Caches
A write buffer between the processor/cache and lower-level memory holds data awaiting write-through to the lower level.
Q. Why a write buffer? A. So the CPU doesn't stall.
Q. Why a buffer, why not just one register? A. Bursts of writes are common.
Q. Are Read After Write (RAW) hazards an issue for the write buffer? A. Yes! Drain the buffer before the next read, or check the write buffer addresses.

58 Avoiding the Stalls for Write-Through
Use a write buffer between the cache and memory: the processor writes data into the cache and the write buffer, and the memory controller slowly "drains" the buffer to memory.
The write buffer is a first-in-first-out (FIFO) buffer that typically holds a small number of writes. It can absorb small bursts as long as the long-term rate of writing into the buffer does not exceed the maximum rate of writing to DRAM.

59 Cache Write Policy: Allocation Options
What happens on a cache write that misses? It's actually two subquestions:
Do you allocate space in the cache for the address? Write-allocate vs. no-write-allocate. Actions: select a cache entry, evict the old contents, update the tags.
Do you fetch the rest of the block contents from memory? (Of interest if you do write-allocate; remember a store updates at most one word of a wider block.) Fetch-on-miss vs. no-fetch-on-miss. For no-fetch-on-miss you must remember which words are valid, using fine-grain valid bits in each cache line.

60 Typical Choices
Write-back caches: write-allocate, fetch-on-miss.
Write-through caches: write-allocate with no-fetch-on-miss, or no-write-allocate (write-around).
Modern hardware supports multiple policies, selected by the OS at some coarse granularity. Which program patterns match each policy?

61 Write Miss Actions
The steps taken on a write miss depend on the policy (write-through vs. write-back, write-allocate vs. no-write-allocate, fetch-on-miss vs. no-fetch-on-miss, write-around, write-invalidate):
1. Pick a replacement entry
2. Invalidate the tag
3. Fetch the block
4. Write the cache (or write a partial cache block)
5. Write memory

62 Be Careful, Even with Write Hits
Reading from a cache: read tags and data in parallel; if it hits, return the data, else go to the lower level.
Writing a cache can take more time: first read the tag to determine hit/miss, then overwrite the data on a hit; otherwise you may overwrite dirty data or write the wrong cache way.
Can you ever access the tag and write the data in parallel?

63 Splitting Caches
Most processors have separate caches for instructions & data, often noted $I and $D.
Advantages: an extra access port, can customize to the specific access patterns, low hit time.
Disadvantages: capacity utilization, miss rate.

64 Write Policy: Write-Through vs Write-Back
Write-through: data written to the cache block is also written to lower-level memory. Write-back: write data only to the cache, and update the lower level when the block falls out of the cache.
Debugging: easy for write-through, hard for write-back. Do read misses produce writes? No for write-through, yes for write-back. Do repeated writes make it to the lower level? Yes for write-through, no for write-back.
Additional option: let writes to an un-cached address allocate a new cache line ("write-allocate").

65 Cache Design: Datapath + Control
Most design errors come from incorrect specification of the state machine behavior! Common bugs: stalls, block replacement, write buffer handling.
[Figure: a control state machine sits between the CPU and lower-level memory, with address and data paths through the tag and data (block) arrays.]

66 Cache Controller
Example cache characteristics:
Direct-mapped, write-back, write-allocate
Block size: 4 words (16 bytes)
Cache size: 16 KB (1024 blocks)
32-bit byte addresses
Valid bit and dirty bit per block
Blocking cache: the CPU waits until the access is complete

67 Signals between the Processor and the Cache

68 Finite State Machines
Use an FSM to sequence the control steps: a set of states, with a transition on each clock edge. State values are binary encoded, and the current state is stored in a register.
Next state = fn(current state, current inputs)
Control output signals = fo(current state)

69 Cache Controller FSM
Idle state: waiting for a valid read or write request from the processor.
Compare Tag state: test whether the access is a hit or a miss. If hit, set Cache Ready after the read or write and return to the Idle state. If miss, update the cache tag; if the old block is dirty, go to the Write-Back state, else go to the Allocate state.
Write-Back state: write the 128-bit block to memory; when the ready signal arrives from memory, go to the Allocate state.
Allocate state: fetch the new block from memory.
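A sketch of this controller as a C switch statement; the input signal names are assumptions, and the Allocate → Compare Tag transition follows the usual textbook version of this FSM rather than anything stated on the slide.

    /* Next-state function for the four-state cache controller. */
    enum state { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE };

    enum state next_state(enum state s, int cpu_req, int hit, int dirty, int mem_ready)
    {
        switch (s) {
        case IDLE:
            return cpu_req ? COMPARE_TAG : IDLE;      /* wait for a valid request */
        case COMPARE_TAG:
            if (hit)   return IDLE;                   /* read/write done, assert Cache Ready */
            if (dirty) return WRITE_BACK;             /* old block must go to memory first */
            return ALLOCATE;                          /* clean miss: fetch the new block */
        case WRITE_BACK:
            return mem_ready ? ALLOCATE : WRITE_BACK; /* wait for memory to accept the block */
        case ALLOCATE:
            return mem_ready ? COMPARE_TAG : ALLOCATE;/* after the refill, re-check the tag */
        default:
            return IDLE;
        }
    }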

70 Acknowledgements
These slides contain material from courses: UCB CS152, Stanford EE108B.

