
1 Lecture 10: Cache
Peng Liu (liupeng@zju.edu.cn)

2 Physical Size Affects Latency
In a big memory, signals have further to travel and must fan out to more locations than in a small memory, so a larger memory has a higher access latency.

3 Memory Hierarchy
A small, fast memory (RF, SRAM) close to the CPU holds frequently used data; a big, slow memory (DRAM) holds the rest.
Capacity: Register << SRAM << DRAM (due to cost)
Latency: Register << SRAM << DRAM (due to the size of DRAM)
Bandwidth: on-chip >> off-chip (due to cost and wire delays; on-chip wires cost much less and are faster)
On a data access: if the data is in the fast memory, the access is low latency (SRAM); if not, it is a high-latency access (DRAM).

4 Memory Technology: Speed and Cost
Static RAM (SRAM): 0.3 ns - 2.5 ns access time
Dynamic RAM (DRAM): 50 ns - 70 ns
Magnetic disk: 5 ms - 20 ms
Ideal memory: the access time of SRAM with the capacity and cost/GB of disk

5 Random Access Memory (RAM)
Dynamic Random Access Memory (DRAM): high density, low power, cheap, but slow. "Dynamic" because the data must be refreshed regularly ("leaky buckets"); contents are lost when power is lost.
Static Random Access Memory (SRAM): lower density (about 1/10 the density of DRAM), higher cost. "Static" because data is held without refresh while power is on. Fast access time, often 2 to 10 times faster than DRAM.
Flash memory: holds its contents without power; data is written in blocks and writes are generally slow; very cheap.

6 Programs Have Locality
Principle of locality: programs access a relatively small portion of the address space at any given time, so we can predict which memory locations a program will reference in the near future by looking at what it referenced recently.
Two types of locality:
Temporal locality - if an item has been referenced recently, it will tend to be referenced again soon.
Spatial locality - if an item has been referenced recently, nearby items (nearby in terms of memory addresses) will tend to be referenced soon.

7 Locality Examples
Spatial locality: likely to reference data near recent references. Example: for (i = 0; i < N; i++) a[i] = ...;
Temporal locality: likely to reference the same data that was referenced recently. Example: a[i] = f(a[i-1]); reuses a[i-1], which was just written on the previous iteration.
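As a concrete (hypothetical) illustration of both kinds of locality in a single loop, the minimal sketch below sums an array: consecutive elements give spatial locality, and the running total sum is touched on every iteration, giving temporal locality. The array size and values are arbitrary.

```c
#include <stdio.h>

#define N 1024

int main(void) {
    int a[N];
    for (int i = 0; i < N; i++)
        a[i] = i;                 /* sequential writes: spatial locality */

    long sum = 0;
    for (int i = 0; i < N; i++)
        sum += a[i];              /* a[i]: spatial locality (neighbors of recent refs) */
                                  /* sum:  temporal locality (reused every iteration)  */
    printf("sum = %ld\n", sum);
    return 0;
}
```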

8 Taking Advantage of Locality
Memory hierarchy: store everything on disk (or Flash in some systems); copy recently accessed (and nearby) data to a smaller DRAM memory, called main memory; copy the most recently accessed (and nearby) data to an even smaller SRAM memory, the cache memory, attached or close to the CPU.

9 Typical Memory Hierarchy: Everything is a Cache for Something Else
Level          Access time     Capacity   Managed by
Registers      1 cycle         ~500B      software/compiler
L1 cache       1-3 cycles      ~64KB      hardware
L2 cache       5-10 cycles     1-10MB     hardware
Main memory    ~100 cycles     ~10GB      software/OS
Disk/Flash     (many) cycles   ~100GB     software/OS

10 Caches
A cache is an intermediate storage component that functions as a buffer for larger, slower storage components and exploits the principles of locality. The goal is to provide as much inexpensive storage space as possible while offering access speed equivalent to the fastest memory, at least for data that is in the cache; the key is to have the right data cached. Computer systems often use multiple caches, and cache ideas are not limited to hardware designers: Web caches, for example, are widely used on the Internet.

11 Memory Hierarchy Levels
Block (a.k.a. line): the unit of copying between levels; may be multiple words.
If the accessed data is present in the upper level: a hit, and the access is satisfied by the upper level. Hit ratio = hits / accesses.
If the accessed data is absent: a miss, and the block is copied from the lower level; the time this takes is the miss penalty. Miss ratio = misses / accesses = 1 - hit ratio. The data is then supplied to the CPU from the upper level, and the CPU pipeline is stalled in the meantime.

12 Average Memory Access Times
We need to define an average access time, since some accesses are fast and some slow:
Access time = hit time + miss rate x miss penalty
The hope is that the hit time is low and the miss rate is low, since the miss penalty is so much larger than the hit time.
The Average Memory Access Time (AMAT) formula can be applied to any level of the hierarchy (the access time for that level) and can be generalized to the entire hierarchy (the average access time that the processor sees for a reference).
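A minimal sketch of the AMAT formula with made-up numbers (a 1-cycle hit time, 5% miss rate, and 100-cycle miss penalty; none of these values come from the slide):

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical parameters, for illustration only. */
    double hit_time = 1.0;       /* cycles */
    double miss_rate = 0.05;     /* 5% of accesses miss */
    double miss_penalty = 100.0; /* cycles to fetch the block from the next level */

    /* AMAT = hit time + miss rate x miss penalty */
    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat);  /* 1 + 0.05*100 = 6 cycles */
    return 0;
}
```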

13 How Processor Handles a Miss
Assume that a cache access takes 1 cycle: a hit is great, and the basic pipeline works as-is. The CPI penalty from the cache is miss rate x miss penalty.
A miss (whether an instruction or a data miss) stalls the pipeline:
Stall the pipeline (the processor does not have the data it needs)
Send the address that missed to the memory
Instruct main memory to perform a read and wait
When the access completes, return the data to the processor
Restart the instruction

14 How to Build a Cache?
The big question is locating data in the cache: we need to map a large address space into a small memory. How do we do that? A fully associative lookup can be built in hardware, but it is complex; we need a simple but effective solution. Two common techniques: direct mapped and set associative.
Further questions: block size (crucial for spatial locality), replacement policy (crucial for temporal locality), and write policy (writes are always more complex).

15 Terminology
Block - the minimum unit of information transferred between levels of the hierarchy. Block addressing varies by technology at each level; blocks are moved one level at a time.
Hit - the data appears in a block in the lower-numbered (closer) level.
Hit rate - the percentage of accesses found.
Hit time - the time to access the lower-numbered level = cache access time + time to determine hit/miss.
Miss - the data was not in the lower-numbered level and had to be fetched from a higher-numbered level.
Miss rate - the percentage of misses = 1 - hit rate.
Miss penalty - the overhead of getting data from a higher-numbered level = higher-level access time + time to deliver to the lower level + time to replace the cache block and forward the data to the processor. The miss penalty is usually much larger than the hit time.

16 Direct Mapped Cache
The location in the cache is determined by the (main) memory address. Direct mapped: only one choice, namely (block address) modulo (#blocks in cache).
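A small sketch of the direct-mapped placement rule. The 8-block cache matches the running example on the following slides; the function name and the word addressing (block address = word address, since the example uses 1 word/block) are just for illustration.

```c
#include <stdio.h>

#define NUM_BLOCKS 8   /* 8-block cache, as in the example on the next slides */

/* Direct mapped: (block address) modulo (number of blocks in the cache). */
unsigned cache_index(unsigned block_addr) {
    return block_addr % NUM_BLOCKS;  /* with a power of two, this is just the low bits */
}

int main(void) {
    unsigned addrs[] = {22, 26, 16, 3, 18};
    for (int i = 0; i < 5; i++)
        printf("block %2u -> index %u\n", addrs[i], cache_index(addrs[i]));
    return 0;
}
```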

17 Tags and Valid Bits
How do we know which particular block is stored in a cache location? Store the block address as well as the data; actually, only the high-order bits are needed, and they are called the tag.
What if there is no data in a location? A valid bit: 1 = present, 0 = not present; initially 0.

18 Cache Example
8 blocks, 1 word/block, direct mapped. Initial state (all entries invalid):

Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    N
111    N

19 Cache Example

Word addr  Binary addr  Hit/Miss  Cache block
22         10 110       Miss      110   (compulsory/cold miss)

Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

20 Cache Example

Word addr  Binary addr  Hit/Miss  Cache block
26         11 010       Miss      010   (compulsory/cold miss)

Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

21 Cache Example

Word addr  Binary addr  Hit/Miss  Cache block
22         10 110       Hit       110
26         11 010       Hit       010

Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

22 Cache Example

Word addr  Binary addr  Hit/Miss  Cache block
16         10 000       Miss      000
3          00 011       Miss      011
16         10 000       Hit       000

Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  11   Mem[11010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N

23 Cache Example

Word addr  Binary addr  Hit/Miss  Cache block
18         10 010       Miss      010   (replacement: evicts Mem[11010])

Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  10   Mem[10010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N

24 Cache Organization & Access
Assumptions: 32-bit addresses, 4 KB cache, 1024 blocks, 1 word/block.
Steps: use the index to read the valid bit and tag from the cache; compare the stored tag with the tag bits of the address; if they match, return the data and signal a hit; otherwise signal a miss.
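For the parameters on this slide (32-bit address, 1024 blocks, one 4-byte word per block), the field widths follow directly: 2 offset bits, 10 index bits, 20 tag bits. The sketch below just carves up an arbitrary example address; the code itself is illustrative, not the slide's design.

```c
#include <stdio.h>
#include <stdint.h>

/* 4 KB cache, 1024 blocks, 1 word (4 bytes) per block:
   2 offset bits, 10 index bits, 20 tag bits. */
#define OFFSET_BITS 2
#define INDEX_BITS  10

int main(void) {
    uint32_t addr = 0x12345678;  /* arbitrary example address */

    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    /* Lookup: read V and the stored tag at 'index', compare the stored tag
       with 'tag', signal a hit and return the word on a match, else a miss. */
    printf("tag=0x%05x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}
```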

25 Larger Block Size Motivation: exploit spatial locality & amortize overheads This example: 64 blocks, 16 bytes/block To what block number does address 1200 map? Block address = =75 Block number = 75 modulo 64 = 11 What is the impact of larger on tag/index size? What is the impact of larger blocks on the cache overhead? Overhead = tags & valid bits

26 Cache Block Example Assume a 2n byte direct mapped cache with 2m byte blocks Byte select – The lower m bits Cache index – The lower (n - m) bits of the memory address Cache tag – The upper (32 - n) bits of the memory address

27 Block Sizes
Larger block sizes take advantage of spatial locality, but they also incur a larger miss penalty, since it takes longer to transfer the block into the cache. Larger blocks can also increase the miss rate (fewer blocks fit in the cache) and therefore the average access time. There is a tradeoff in selecting the block size:
Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
(in this form, Miss Penalty stands for the total time taken by an access that misses)

28 Direct Mapped Problems: Conflict Misses
When two blocks that are used concurrently map to the same index, only one can fit in the cache, regardless of the cache size; there is no flexibility to place the second block elsewhere.
Thrashing: if the accesses alternate, each block replaces the other before it can be reused, so there is no benefit from caching. Conflicts and thrashing can happen quite often.

29 Fully Associative Cache
The opposite extreme: there is no cache index to hash. Any available entry can store any memory block, so there are no conflict misses, only capacity misses. The cost is that the cache tags of all entries must be compared to find the desired one.

30 N-way Set Associative
A compromise between direct mapped and fully associative: each memory block can go to one of N entries in the cache. Each "set" can store N blocks, and a cache contains some number of sets. For fast access, all blocks in a set are searched in parallel.
Two ways to think of an N-way set-associative cache with X sets:
1st view: N direct-mapped caches, each with X entries, searched in parallel; their data outputs and hit/miss signals must be combined.
2nd view: X fully associative caches, each with N entries; only one of them is searched on any given access.

31 Associative Cache Example

32 Associativity Example
Compare 4-block caches: direct mapped, 2-way set associative, and fully associative, assuming small caches of four one-word blocks each.
Block access sequence: 0, 8, 0, 6, 8. For the direct-mapped cache, (0 modulo 4) = 0, (6 modulo 4) = 2, (8 modulo 4) = 0.

Direct mapped:
Block addr  Cache index  Hit/miss  Cache content after access (indexes 0-3)
0           0            miss      Mem[0]
8           0            miss      Mem[8]
0           0            miss      Mem[0]
6           2            miss      Mem[0], Mem[6]
8           0            miss      Mem[8], Mem[6]

33 Associativity Example
2-way set associative (2 sets, LRU replacement):
Block addr  Set  Hit/miss  Set 0 content after access   Set 1 content
0           0    miss      Mem[0]
8           0    miss      Mem[0], Mem[8]
0           0    hit       Mem[0], Mem[8]
6           0    miss      Mem[0], Mem[6]
8           0    miss      Mem[8], Mem[6]

Fully associative:
Block addr  Hit/miss  Cache content after access
0           miss      Mem[0]
8           miss      Mem[0], Mem[8]
0           hit       Mem[0], Mem[8]
6           miss      Mem[0], Mem[8], Mem[6]
8           hit       Mem[0], Mem[8], Mem[6]

A small simulation below reproduces the direct-mapped and 2-way miss counts.
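A sketch that replays the block access sequence 0, 8, 0, 6, 8 on a four-block cache, once direct mapped and once 2-way set associative with LRU. It tracks tags only (no data) and is written just for this example, not as a general simulator.

```c
#include <stdio.h>

#define BLOCKS 4

/* Direct mapped: one tag per index. */
static int dm_misses(const int *seq, int n) {
    int tag[BLOCKS], valid[BLOCKS] = {0}, misses = 0;
    for (int i = 0; i < n; i++) {
        int idx = seq[i] % BLOCKS;
        if (!valid[idx] || tag[idx] != seq[i]) {   /* miss: fill the line */
            valid[idx] = 1; tag[idx] = seq[i]; misses++;
        }
    }
    return misses;
}

/* 2-way set associative with LRU: 2 sets of 2 blocks each. */
static int sa2_misses(const int *seq, int n) {
    int tag[2][2], valid[2][2] = {{0}}, lru[2] = {0}, misses = 0; /* lru[s] = way to evict next */
    for (int i = 0; i < n; i++) {
        int set = seq[i] % 2, hit_way = -1;
        for (int w = 0; w < 2; w++)
            if (valid[set][w] && tag[set][w] == seq[i]) hit_way = w;
        if (hit_way < 0) {                   /* miss: evict the LRU way */
            hit_way = lru[set];
            valid[set][hit_way] = 1; tag[set][hit_way] = seq[i]; misses++;
        }
        lru[set] = 1 - hit_way;              /* the other way is now least recently used */
    }
    return misses;
}

int main(void) {
    int seq[] = {0, 8, 0, 6, 8};
    printf("direct mapped: %d misses\n", dm_misses(seq, 5));   /* 5 misses */
    printf("2-way LRU:     %d misses\n", sa2_misses(seq, 5));  /* 4 misses */
    return 0;
}
```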

34 Set Associative Cache Organization

35 Tag & Index with Set-Associative Caches
Assume a 2^n-byte cache with 2^m-byte blocks that is 2^a-way set associative. Which bits of the address are the tag and which are the index?
The m least significant bits are the byte select within the block.
Basic idea: the cache contains 2^n / 2^m = 2^(n-m) blocks, and each cache way contains 2^(n-m) / 2^a = 2^(n-m-a) blocks.
Cache index: the (n - m - a) bits after the byte select; the same index is used with all cache ways.
Observation: for a fixed cache size, the length of the tags increases with the associativity, so associative caches incur more tag overhead.
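A sketch of these field widths with the geometry passed in as exponents; n, m, and a are the slide's symbols, 32-bit addresses are assumed as on the earlier slides, and the example sizes printed in main are arbitrary.

```c
#include <stdio.h>

/* 2^n-byte cache, 2^m-byte blocks, 2^a-way set associative, 32-bit addresses. */
static void field_widths(int n, int m, int a) {
    int offset_bits = m;            /* byte select within the block */
    int index_bits  = n - m - a;    /* 2^(n-m-a) sets, same index for every way */
    int tag_bits    = 32 - index_bits - offset_bits;
    printf("n=%d m=%d a=%d -> offset=%d index=%d tag=%d bits\n",
           n, m, a, offset_bits, index_bits, tag_bits);
}

int main(void) {
    /* Example: 32 KB cache (n=15) with 64-byte blocks (m=6). */
    field_widths(15, 6, 0);   /* direct mapped */
    field_widths(15, 6, 2);   /* 4-way: fewer index bits, longer tags */
    return 0;
}
```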

36 Bonus Trick: How to build a 3KB cache?
It would be difficult to make a direct-mapped 3 KB cache, because 3 KB is not a power of two. Assuming 128-byte blocks, there are 24 blocks to select from, and (address % 24) is very expensive to calculate, unlike (address % 32), which only requires looking at the low-order bits.
Solution: start with a 4 KB 4-way set-associative cache. Every way holds 1 KB (8 blocks), and the same 3-bit index (the 3 LS bits of the address after eliminating the block offset) is used to access all 4 cache ways. Now drop the 4th way of the cache, as if that 4th way always reports a miss and never receives data.

37 Associative Cache: Pros
Increased associativity decreases the miss rate by eliminating conflicts, but with diminishing returns. Simulation of a system with a 64 KB D-cache and 16-word blocks gives miss rates of:
1-way: 10.3%   2-way: 8.6%   4-way: 8.3%   8-way: 8.1%
Caveat: a cache shared by multiple processors may need higher associativity.

38 Associative Caches: Cons
Area overhead: more storage is needed for tags (compared to a same-sized direct-mapped cache), plus N comparators.
Latency: the critical path is way access + comparator + the logic to combine the answers (OR the hit signals and multiplex the data outputs). The data cannot be forwarded to the processor immediately; it must first wait for selection and multiplexing, whereas a direct-mapped cache can assume a hit and recover later on a miss.
Complexity: dealing with replacement.

39 Acknowledgements
These slides contain material from the courses UCB CS152 and Stanford EE108B.

40 Placement Policy
Where in an 8-block cache can memory block 12 be placed?
Fully associative: anywhere
(2-way) set associative: anywhere in set 0 (12 mod 4)
Direct mapped: only into block 4 (12 mod 8)
The simplest scheme is to extract bits from the block number to determine the set. More sophisticated schemes hash the block number; why could that be good or bad?

41 Direct-Mapped Cache
(Figure) The address is split into a tag (t bits), an index (k bits), and a block offset (b bits). The index selects one of 2^k lines, each holding a valid bit, a tag, and a data block; the stored tag is compared against the address tag to produce HIT, and the offset selects the data word or byte.

42 2-Way Set-Associative Cache
(Figure) The same index reads a valid bit, tag, and data block from each of the two ways in parallel; both stored tags are compared against the address tag, the results are combined into HIT, and a multiplexer selects the data word or byte from the matching way. Compare the latency to the direct-mapped case?

43 Fully Associative Cache
(Figure) There is no index: the address tag (t bits) is compared against the tags of all entries in parallel, HIT is raised if any comparison matches, and the block offset (b bits) selects the data word or byte from the matching entry's data block.

44 Replacement Methods
Which line do you replace on a miss?
Direct mapped: easy, there is only one choice; replace the line at the index you need.
N-way set associative: need to choose which way to replace. Options include random (choose one at random) and Least Recently Used (LRU, the one used least recently). True LRU is often difficult to track, so approximations are used; the evicted line is then really just "not recently used".

45 Replacement Policy
In an associative cache, which block from a set should be evicted when the set becomes full? (Replacement only happens on misses.)
Random
Least Recently Used (LRU): the LRU state must be updated on every access; a true implementation is only feasible for small sets (2-way), so a pseudo-LRU binary tree is often used for 4-8 ways (see the sketch after this list).
First In, First Out (FIFO), a.k.a. round-robin: used in highly associative caches.
Not Least Recently Used (NLRU): FIFO with an exception for the most recently used block or blocks; used in the Alpha TLBs.
The choice of replacement policy is a second-order effect. Why?
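A sketch of the pseudo-LRU binary tree mentioned above, for one 4-way set (3 tree bits). The encoding below is one common convention, not necessarily what any particular cache uses, and the access pattern in main is arbitrary.

```c
#include <stdio.h>
#include <stdint.h>

/* Tree bits for one 4-way set: bit0 = root, bit1 = left pair (ways 0/1),
   bit2 = right pair (ways 2/3). Each bit points toward the less recently
   used side; on an access we flip the bits on the path to point away. */

static void plru_touch(uint8_t *bits, int way) {
    if (way < 2) {
        *bits |= 1;                              /* root: right half is now the LRU side */
        if (way == 0) *bits |= 2; else *bits &= ~2;
    } else {
        *bits &= ~1;                             /* root: left half is now the LRU side */
        if (way == 2) *bits |= 4; else *bits &= ~4;
    }
}

static int plru_victim(uint8_t bits) {
    if (bits & 1)                                /* root points right */
        return (bits & 4) ? 3 : 2;
    return (bits & 2) ? 1 : 0;
}

int main(void) {
    uint8_t bits = 0;
    int accesses[] = {0, 1, 2, 3, 0};
    for (int i = 0; i < 5; i++) plru_touch(&bits, accesses[i]);
    printf("victim way = %d\n", plru_victim(bits)); /* an approximation of true LRU */
    return 0;
}
```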

46 Block Size and Spatial Locality
A block is the unit of transfer between the cache and memory (example in the figure: a 4-word block holding Word0-Word3).
The CPU address is split into a block address (32 - b bits) and a block offset (b bits), where 2^b is the block size, a.k.a. line size, in bytes.
Larger block sizes have distinct hardware advantages: less tag overhead, and they exploit fast burst transfers from DRAM and over wide busses.
What are the disadvantages of increasing block size? Larger blocks reduce compulsory misses (the first miss to a block), but they may increase conflict misses since the number of blocks is smaller (fewer blocks => more conflicts), and they can waste bandwidth.

47 CPU-Cache Interaction (5-stage pipeline)
(Figure) The 5-stage pipeline is shown with a primary instruction cache feeding instruction fetch and a primary data cache accessed in the memory stage; each cache produces a hit? signal. The entire CPU is stalled on a data cache miss, and cache refill data arrives from the lower levels of the memory hierarchy through the memory controller.

48 Improving Cache Performance
Average memory access time = hit time + miss rate x miss penalty
To improve performance: reduce the hit time, reduce the miss rate, or reduce the miss penalty.
What is the simplest design strategy? Design the largest primary cache that does not slow down the clock or add pipeline stages, i.e., the biggest cache that keeps the hit time at 1-2 cycles (approximately 8-32 KB in modern technology). [Design issues are more complex with out-of-order superscalar processors.]

49 Serial-versus-Parallel Cache and Memory Access
Let a be the hit ratio (the fraction of references found in the cache) and 1 - a the miss ratio.
Serial search (access the cache first, go to main memory only on a miss): average access time = tcache + (1 - a) tmem
Parallel search (start the cache and memory accesses together): average access time = a tcache + (1 - a) tmem
The savings from the parallel organization are usually small, since tmem >> tcache and the hit ratio a is high. It also requires high bandwidth on the memory path, and the complexity of handling the parallel paths can slow tcache.
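Plugging made-up numbers into the two formulas (tcache = 2 cycles, tmem = 100 cycles, hit ratio a = 0.95; all hypothetical) shows why the savings from the parallel organization are small:

```c
#include <stdio.h>

int main(void) {
    double t_cache = 2.0, t_mem = 100.0, a = 0.95;   /* hypothetical values */

    double serial   = t_cache + (1.0 - a) * t_mem;      /* check cache, then memory on a miss */
    double parallel = a * t_cache + (1.0 - a) * t_mem;  /* start both; drop the memory access on a hit */

    printf("serial:   %.2f cycles\n", serial);    /* 2 + 0.05*100        = 7.00 */
    printf("parallel: %.2f cycles\n", parallel);  /* 0.95*2 + 0.05*100   = 6.90 */
    return 0;
}
```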

50 Causes for Cache Misses
Compulsory: the first reference to a block, a.k.a. cold-start misses; these would occur even with an infinite cache.
Capacity: the cache is too small to hold all the data needed by the program; these misses would occur even under a perfect replacement policy.
Conflict: misses caused by collisions due to the block-placement strategy; these would not occur with full associativity.

51 Miss Rates and 3Cs

52 Effect of Cache Parameters on Performance
Larger cache size: reduces capacity and conflict misses, but the hit time will increase.
Higher associativity: reduces conflict misses, but may increase the hit time.
Larger block size: spatial locality reduces compulsory misses and capacity (reload) misses, but fewer blocks may increase the conflict miss rate, and larger blocks increase the miss penalty (delivering the requested word of the block first helps hide this).

53 Write Policy Choices
Cache hit:
Write through: write both the cache and memory; generally higher traffic, but it simplifies cache coherence.
Write back: write the cache only; memory is written only when the entry is evicted. A dirty bit per block can further reduce the traffic.
Cache miss:
No write allocate: only write to main memory.
Write allocate (a.k.a. fetch on write): fetch the block into the cache, then write.
Common combinations: write through with no write allocate, and write back with write allocate (sketched below).
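A sketch of the two common combinations from this slide, modeled with a single cache line in front of a tiny word-addressed memory. The data structures, the mem[] array, and the access sequence are all invented for illustration.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Toy model: a one-line cache in front of a small word-addressed memory. */
#define MEM_WORDS 16
static uint32_t mem[MEM_WORDS];

struct line { int valid, dirty; uint32_t addr, data; };

/* Write through, no write allocate: always update memory;
   update the cache only if the line is already present. */
static void write_through(struct line *c, uint32_t addr, uint32_t val) {
    mem[addr] = val;
    if (c->valid && c->addr == addr) c->data = val;
}

/* Write back, write allocate: fetch the line on a miss (writing back the
   old line if dirty), then write only the cache and set the dirty bit. */
static void write_back(struct line *c, uint32_t addr, uint32_t val) {
    if (!(c->valid && c->addr == addr)) {
        if (c->valid && c->dirty) mem[c->addr] = c->data;  /* write back the victim */
        c->valid = 1; c->addr = addr; c->data = mem[addr]; /* allocate (fetch) */
    }
    c->data = val; c->dirty = 1;        /* memory is updated only on eviction */
}

int main(void) {
    struct line wt = {0}, wb = {0};
    write_through(&wt, 3, 42);
    write_back(&wb, 3, 42);
    printf("after write-through: mem[3] = %u\n", mem[3]);  /* 42: memory always current */
    memset(mem, 0, sizeof mem);
    write_back(&wb, 7, 99);             /* evicts dirty addr 3, writing it back */
    printf("after eviction:      mem[3] = %u\n", mem[3]);  /* 42 arrives only now */
    return 0;
}
```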

54 Write Performance
(Figure) The structure is the same as the direct-mapped read path, but the data array's write enable (WE) is gated by the tag comparison (HIT), so the tag check and the data write are completely serial.

55 Reducing Write Hit Time
Problem: writes take two cycles in the memory stage, one cycle for the tag check plus one cycle for the data write if it hits.
Solutions:
Design a data RAM that can perform a read and a write in one cycle, restoring the old value after a tag miss.
Fully associative (CAM-tag) caches: the word line is only enabled on a hit.
Pipelined writes: hold the write data for a store in a single buffer ahead of the cache, and write the cache data during the next store's tag check. The data must be bypassed from this write buffer if a read address matches the write buffer tag!

56 Pipelining Cache Writes
(Figure) The address and store data from the CPU are held in delayed-write address and data registers; the data from a store hit is written into the data portion of the cache during the tag access of the subsequent store, and loads compare against both the tags and the delayed write address so the buffered data can be bypassed.

57 Write Buffer to Reduce Read Miss Penalty
(Figure) A write buffer sits between the data cache and the unified L2 cache; it holds evicted dirty lines for a write-back cache, or all writes for a write-through cache.
The processor is not stalled on writes, and read misses can go ahead of the write to main memory.
Problem: the write buffer may hold the updated value of a location needed by a read miss.
Simple scheme: on a read miss, wait for the write buffer to drain.
Faster scheme: check the write buffer addresses against the read miss address; if there is no match, allow the read miss to go ahead of the writes, otherwise return the value in the write buffer.
Designers of the MIPS M/1000 estimated that waiting for a four-word buffer to empty increased the read miss penalty by a factor of 1.5.

58 Block-level Optimizations
Tags are too large, i.e., too much overhead. A simple solution is larger blocks, but then the miss penalty could be large.
Sub-block placement (a.k.a. sector cache): a valid bit is added to units smaller than the full block, called sub-blocks, and only a sub-block is read on a miss. Even if a tag matches, the addressed word may not be in the cache. The main reason for sub-block placement is to reduce tag overhead (a small sketch follows).
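A sketch of sub-block (sector) placement: one tag covers the whole block, but each sub-block has its own valid bit, so only the sub-block that misses has to be fetched. The names, sizes, and one-word sub-blocks below are illustrative, not any particular design.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define SUBBLOCKS 4     /* one tag, four sub-blocks, each with its own valid bit */

struct sector_line {
    bool     valid[SUBBLOCKS];   /* per-sub-block valid bits */
    uint32_t tag;
    uint32_t data[SUBBLOCKS];    /* one word per sub-block, for simplicity */
};

/* Hit only if the tag matches AND the addressed sub-block is valid. */
static bool lookup(const struct sector_line *l, uint32_t tag, int sub) {
    return l->tag == tag && l->valid[sub];
}

/* On a miss, fetch just the missing sub-block; a tag mismatch invalidates
   all sub-blocks first (the whole sector now belongs to the new tag). */
static void fill(struct sector_line *l, uint32_t tag, int sub, uint32_t word) {
    if (l->tag != tag) {
        for (int i = 0; i < SUBBLOCKS; i++) l->valid[i] = false;
        l->tag = tag;
    }
    l->valid[sub] = true;
    l->data[sub]  = word;
}

int main(void) {
    struct sector_line line = {0};
    fill(&line, 0x100, 1, 0xabcd);   /* fetch only sub-block 1 */
    printf("tag 0x100 sub 1: %s\n", lookup(&line, 0x100, 1) ? "hit" : "miss");
    printf("tag 0x100 sub 2: %s\n", lookup(&line, 0x100, 2) ? "hit" : "miss"); /* tag matches, word absent */
    return 0;
}
```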

59 Set-Associative RAM-Tag Cache
(Figure) The index selects a set; the tag, status, and data RAMs for all ways are read and the tags compared in parallel.
This is not energy-efficient: a tag and a data word are read from every way.
Two-phase approach: first read the tags, then read the data only from the selected way. This is more energy-efficient but doubles the latency, which is a problem in L1 but OK for L2 and above. Why?

