Morgan Kaufmann Publishers Memory Hierarchy: Cache Basics


1 Memory Hierarchy: Cache Basics
Lecture 4.2, based on Chapter 5 (Large and Fast: Exploiting Memory Hierarchy)

2 Learning Objectives
Given a 32-bit address, figure out the corresponding block address
Given a block address, find the index of the cache line it maps to
Given a 32-bit address and a multi-word cache line, identify the particular location inside the cache line the word maps to
Given a sequence of memory requests, decide hit/miss for each request
Given the cache configuration, calculate the total number of bits to implement the cache
Describe the behaviors of a write-through cache and a write-back cache on a write hit or write miss

3 Coverage
Textbook Chapter 5.3

4 Cache Memory (§5.3 The Basics of Caches)
Cache memory: the level of the memory hierarchy closest to the CPU. Given accesses X1, …, Xn–1, Xn: how do we know if the data is present? Where do we look?

5 Direct Mapped Cache
Location determined by address. Direct mapped: only one choice, (Block address) modulo (#Blocks in cache). Since #Blocks is a power of 2, the low-order address bits give the index.
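As an illustrative aside (not from the slides), a minimal C sketch of this index computation for an assumed 8-block direct-mapped cache; with a power-of-two block count the modulo is the same as keeping the low-order bits:

    #include <stdio.h>

    #define NUM_BLOCKS 8u   /* assumed example size; must be a power of 2 */

    /* (block address) modulo (#blocks in cache) */
    static unsigned cache_index(unsigned block_address) {
        return block_address % NUM_BLOCKS;   /* equal to block_address & (NUM_BLOCKS - 1) */
    }

    int main(void) {
        printf("block 22 -> index %u\n", cache_index(22));  /* 22 mod 8 = 6 (110) */
        printf("block 26 -> index %u\n", cache_index(26));  /* 26 mod 8 = 2 (010) */
        return 0;
    }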

6 Tags and Valid Bits
How do we know which particular data block is stored in a cache location? Store the block address as well as the data; actually, only the high-order bits are needed, and these are called the tag. What if there is no data in a location? Valid bit: 1 = valid, 0 = not valid; initially 0.
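A hedged C sketch (field names are illustrative, not the textbook's) of one direct-mapped cache entry with its valid bit and tag, and the hit test they enable, assuming single-word blocks:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct cache_line {
        bool     valid;   /* 0 = nothing stored here yet (initial state) */
        uint32_t tag;     /* high-order bits of the block address        */
        uint32_t data;    /* the cached word itself                      */
    };

    /* A reference hits only if the entry is valid AND its stored tag
       matches the tag bits of the requested address. */
    static bool is_hit(const struct cache_line *line, uint32_t addr_tag) {
        return line->valid && line->tag == addr_tag;
    }

    int main(void) {
        struct cache_line line = { .valid = true, .tag = 0x2, .data = 123 };
        printf("tag 0x2: %s\n", is_hit(&line, 0x2) ? "hit" : "miss");
        printf("tag 0x3: %s\n", is_hit(&line, 0x3) ? "hit" : "miss");
        return 0;
    }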

7 Cache Example
8 blocks, 1 word / block, direct mapped. Initial state (all entries invalid):
Index    V  Tag  Data
000 (0)  N
001 (1)  N
010 (2)  N
011 (3)  N
100 (4)  N
101 (5)  N
110 (6)  N
111 (7)  N
Access sequence (word addresses): 22, 26, 22, 26, 16, 3, 16, 18, 16

8 Cache Example
Word addr  Binary addr  Hit/miss  Cache block
22         10 110       Miss      110
Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

9 Cache Example
Word addr  Binary addr  Hit/miss  Cache block
26         11 010       Miss      010
Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

10 Cache Example
Word addr  Binary addr  Hit/miss  Cache block
22         10 110       Hit       110
26         11 010       Hit       010
Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

11 Cache Example
Word addr  Binary addr  Hit/miss  Cache block
16         10 000       Miss      000
3          00 011       Miss      011
Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  11   Mem[11010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N

12 Cache Example
Word addr  Binary addr  Hit/miss  Cache block
16         10 000       Hit       000
18         10 010       Miss      010   (replaces the block with tag 11)
Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  10   Mem[10010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N
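The whole example can be replayed in a few lines of C. This sketch is not the book's code, but it reproduces the hit/miss outcomes of slides 7 through 12 for the same access sequence:

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_BLOCKS 8

    struct line { bool valid; unsigned tag; };

    int main(void) {
        struct line cache[NUM_BLOCKS] = {0};   /* all valid bits start at 0 */
        const unsigned seq[] = {22, 26, 22, 26, 16, 3, 16, 18, 16};
        const int n = sizeof seq / sizeof seq[0];

        for (int i = 0; i < n; i++) {
            unsigned addr  = seq[i];
            unsigned index = addr % NUM_BLOCKS;   /* low-order 3 bits of the word address */
            unsigned tag   = addr / NUM_BLOCKS;   /* remaining high-order bits            */
            bool hit = cache[index].valid && cache[index].tag == tag;
            if (!hit) {                           /* on a miss, fetch and fill the entry  */
                cache[index].valid = true;
                cache[index].tag   = tag;
            }
            printf("addr %2u -> index %u: %s\n", addr, index, hit ? "hit" : "miss");
        }
        return 0;
    }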

13 Address Subdivision

14 Multiword-Block Direct Mapped Cache
Four words / block, cache size = 1K words, direct mapped: 256 blocks, so the 32-bit address splits into a 20-bit tag, an 8-bit index, a 2-bit block (word-in-block) offset, and a 2-bit byte offset. To take advantage of spatial locality we want a cache block that is larger than one word in size. What kind of locality are we taking advantage of?
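A small C sketch (field boundaries assumed to follow the figure: 2-bit byte offset, 2-bit block offset, 8-bit index, 20-bit tag) showing how a 32-bit address splits for this cache:

    #include <stdio.h>

    struct fields { unsigned tag, index, word_in_block, byte_offset; };

    static struct fields split(unsigned addr) {
        struct fields f;
        f.byte_offset   =  addr        & 0x3;   /* bits  1:0  (byte within word)  */
        f.word_in_block = (addr >> 2)  & 0x3;   /* bits  3:2  (word within block) */
        f.index         = (addr >> 4)  & 0xFF;  /* bits 11:4  (256 blocks)        */
        f.tag           =  addr >> 12;          /* bits 31:12 (20-bit tag)        */
        return f;
    }

    int main(void) {
        struct fields f = split(0x12345678u);   /* arbitrary example address */
        printf("tag=0x%05x index=%u word=%u byte=%u\n",
               f.tag, f.index, f.word_in_block, f.byte_offset);
        return 0;
    }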

15 Taking Advantage of Spatial Locality
Let the cache block hold more than one word: here, 2 blocks of 2 words each, with 4-bit word addresses (2 tag bits, 1 set address / index bit, 1 word-in-block select bit). Start with an empty cache; all blocks are initially marked as not valid.
Access sequence (word addresses) and outcomes:
0   miss  (loads Mem(1), Mem(0) into block 0, tag 00)
1   hit
2   miss  (loads Mem(3), Mem(2) into block 1, tag 00)
3   hit
4   miss  (replaces Mem(1), Mem(0) with Mem(5), Mem(4), tag 01)
3   hit
4   hit
15  miss  (replaces Mem(3), Mem(2) with Mem(15), Mem(14), tag 11)
8 requests, 4 misses

16 Larger Block Size
64 blocks, 16 bytes / block. To what block number does address 1200 map? 1200 (decimal) = 0x0000_04B0 = 100 1011 0000 (binary). Address fields: offset = bits 3:0 (4 bits), index = bits 9:4 (6 bits), tag = bits 31:10 (22 bits).
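A short C sketch of the worked example; it computes the block address and the cache block number for byte address 1200 under the stated configuration (the answers appear in the comments):

    #include <stdio.h>

    int main(void) {
        unsigned addr        = 1200;   /* byte address 0x4B0            */
        unsigned block_bytes = 16;     /* 16 bytes (4 words) per block  */
        unsigned num_blocks  = 64;

        unsigned block_address = addr / block_bytes;         /* 1200 / 16 = 75 */
        unsigned index         = block_address % num_blocks; /* 75 mod 64 = 11 */

        printf("block address = %u, maps to cache block %u\n", block_address, index);
        return 0;
    }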

17 Block Size Considerations
Larger blocks should reduce the miss rate, due to spatial locality. But in a fixed-sized cache, larger blocks mean fewer of them, so more competition and an increased miss rate. Larger blocks also mean a larger miss penalty, which can override the benefit of the reduced miss rate; early restart and critical-word-first can help.

18 Miss Rate vs. Block Size

19 Cache Field Sizes
The number of bits in a cache includes the storage for data, the tags, and the flag bits. For a direct-mapped cache with 2^n blocks, n bits are used for the index. For a block size of 2^m words (2^(m+2) bytes), m bits are used to address the word within the block and 2 bits are used to address the byte within the word. What is the size of the tag field? The tag field is 32 - (n + m + 2) bits. The total number of bits in a direct-mapped cache is then 2^n x (block size + tag field size + flag bits size).
How many total bits are required for a direct-mapped cache with 16KB of data and 4-word blocks, assuming a 32-bit address? 16KB is 4K (2^12) words. With a block size of 4 words, there are 1024 (2^10) blocks. Each block has 4 x 32 = 128 bits of data, plus a tag of 32 - 10 - 2 - 2 = 18 bits, plus a valid bit. So the total cache size is 2^10 x (128 + 18 + 1) = 2^10 x 147 = 147 Kbits (or about 1.15 times as many bits as needed just for the storage of the data).
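The same arithmetic as a C sketch, using the values from the example above:

    #include <stdio.h>

    int main(void) {
        int addr_bits  = 32;
        int n          = 10;                     /* 2^10 = 1024 blocks            */
        int m          = 2;                      /* 2^2  = 4 words per block      */
        int data_bits  = (1 << m) * 32;          /* 4 x 32 = 128 data bits/block  */
        int tag_bits   = addr_bits - n - m - 2;  /* 32 - 10 - 2 - 2 = 18          */
        int valid_bits = 1;

        long total = (1L << n) * (data_bits + tag_bits + valid_bits);
        printf("bits per block = %d, total = %ld bits = %ld Kbits\n",
               data_bits + tag_bits + valid_bits, total, total / 1024);
        return 0;
    }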

20 Handling Cache Hits
Read hits (I$ and D$): this is what we want!
Write hits (D$ only), two policies:
Require the cache and memory to be consistent: always write the data into both the cache block and the next level in the memory hierarchy (write-through).
Allow the cache and memory to be inconsistent: write the data only into the cache block, and write the cache block back to the next level in the memory hierarchy when that cache block is evicted (write-back).
In a write-back cache, because we cannot simply overwrite the block (we may not have a backup copy anywhere), stores either require two cycles (one to check for a hit, followed by one to actually do the write) or require a write buffer to hold that data (essentially pipelining the write). By comparison, a write-through cache can always do the write in one cycle, assuming there is room in the write buffer: read the tag and write the data in parallel; if the tag doesn't match, generate a write miss to fetch the rest of that block from the next level in the hierarchy (and update the tag field).

21 Write-Through
Write-through: update the data in both the cache and the main memory. But this makes writes take longer: writes run at the speed of the next level in the memory hierarchy. E.g., if base CPI = 1, 10% of instructions are stores, and a write to memory takes 100 cycles, then Effective CPI = 1 + 0.1 x 100 = 11.
Solution: write buffer. It holds data waiting to be written to memory; the CPU continues immediately and only stalls on a write if the write buffer is full.
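The effective-CPI arithmetic as a tiny C sketch, using the example's numbers:

    #include <stdio.h>

    int main(void) {
        double base_cpi       = 1.0;
        double store_fraction = 0.10;    /* 10% of instructions are stores */
        double write_cycles   = 100.0;   /* cost of each write to memory   */

        /* Without a write buffer, every store stalls for the full memory write. */
        double effective_cpi = base_cpi + store_fraction * write_cycles;
        printf("effective CPI = %.1f\n", effective_cpi);   /* 1 + 0.1 x 100 = 11 */
        return 0;
    }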

22 Write-Back
Write-back: on a data-write hit, just update the block in the cache, and keep track of whether each block is dirty (need a dirty bit). When a dirty block is replaced, write it back to memory; a write buffer can be used.
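A minimal, illustrative C sketch (not the textbook's code) of the write-back bookkeeping: a write hit only sets the dirty bit, and the memory write happens later, when the dirty block is evicted:

    #include <stdbool.h>
    #include <stdio.h>

    struct wb_line { bool valid, dirty; unsigned tag, data; };

    /* Stand-in for writing one block back to the next memory level. */
    static void memory_write_block(unsigned tag, unsigned index, unsigned data) {
        printf("write back block: tag=%u index=%u data=%u\n", tag, index, data);
    }

    /* Write hit: update only the cache copy and mark the block dirty. */
    static void write_hit(struct wb_line *line, unsigned value) {
        line->data  = value;
        line->dirty = true;
    }

    /* Replacement: a dirty block is written back exactly once, on eviction. */
    static void evict(struct wb_line *line, unsigned index) {
        if (line->valid && line->dirty)
            memory_write_block(line->tag, index, line->data);
        line->valid = false;
        line->dirty = false;
    }

    int main(void) {
        struct wb_line l = { .valid = true, .dirty = false, .tag = 3, .data = 7 };
        write_hit(&l, 42);   /* no memory traffic yet       */
        evict(&l, 5);        /* dirty, so written back here */
        return 0;
    }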

23 Handling Cache Misses (single-word cache block)
Read misses (I$ and D$): stall the CPU pipeline, fetch the block from the next level of the hierarchy, copy it into the cache, send the requested word to the processor, and resume the CPU pipeline.

24 Handling Cache Misses (single-word cache block)
Write misses (D$): write allocate (aka fetch on write) means the block containing the missed-write location is loaded into the cache. For write-through there are two alternatives: allocate on miss (fetch the block, i.e., write allocate) or no write allocate (don't fetch the block, since programs often write a whole data structure before reading it, e.g., initialization). For write-back we usually fetch the block, i.e., write allocate.
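An illustrative C sketch contrasting the two write-miss policies for a write-through cache; the helper functions are hypothetical placeholders, not a real API:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical helpers, assumed only for illustration. */
    static void fetch_block_into_cache(unsigned addr) { printf("fetch block containing 0x%x\n", addr); }
    static void write_cache_word(unsigned addr, unsigned v) { printf("cache[0x%x] = %u\n", addr, v); }
    static void write_memory_word(unsigned addr, unsigned v) { printf("memory[0x%x] = %u\n", addr, v); }

    static void handle_write_miss(unsigned addr, unsigned value, bool write_allocate) {
        if (write_allocate) {
            fetch_block_into_cache(addr);    /* allocate on miss: bring the block in */
            write_cache_word(addr, value);   /* then write the word into the cache   */
        }
        write_memory_word(addr, value);      /* write-through always updates memory  */
    }

    int main(void) {
        handle_write_miss(0x1000, 5, true);   /* allocate on miss  */
        handle_write_miss(0x2000, 9, false);  /* no write allocate */
        return 0;
    }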

25 Multiword Block Considerations
Read misses (I$ and D$): processed the same as for single-word blocks; a miss returns the entire block from memory, and the miss penalty grows as the block size grows. Early restart: the processor resumes execution as soon as the requested word of the block is returned. Requested word first: the requested word is transferred from memory to the cache (and processor) first. Nonblocking cache: allows the processor to continue to access the cache while the cache is handling an earlier miss.
Early restart works best for instruction caches (since it works best for sequential accesses): if the memory system can deliver a word every clock cycle, it can return words just in time. But if the processor needs another word from a different block before the previous transfer is complete, it will have to stall until the memory is no longer busy, unless you have a nonblocking cache. Nonblocking caches come in two flavors: hit under miss (allow additional cache hits during a miss, with the goal of hiding some of the miss latency) and miss under miss (allow multiple outstanding cache misses, which needs a high-bandwidth memory system to support it).

26 Multiword Block Considerations
Write misses (D$): if using write allocate, we must first fetch the block from memory and then write the word into the block; otherwise we could end up with a "garbled" block in the cache, e.g., for 4-word blocks, a new tag, one word of data from the new block, and three words of data from the old block. If not using write allocate, forward the write request to main memory.

27 Main Memory Supporting Caches
Use DRAMs for main memory. Example: a cache block read takes 1 bus cycle for the address transfer, 15 bus cycles per DRAM access, and 1 bus cycle per data transfer, with a 4-word cache block. Consider 3 different main memory configurations: 1-word-wide memory, 4-word-wide memory, and 4-bank interleaved memory.

28 1-Word Wide Bus, 1-Word Wide Memory
What if the block size is four words and the bus and the memory are 1-word wide? Work out the cycles to send the address, the cycles to read DRAM, the cycles to return the 4 data words, and the total miss penalty in clock cycles. (Class handout version; filled in on the next slide.)

29 1-Word Wide Bus, 1-Word Wide Memory
What if the block size is four words and the bus and the memory are 1-word wide? 1 cycle to send the address, 4 x 15 = 60 cycles to read DRAM, 4 cycles to return the 4 data words: 65 total clock cycles of miss penalty. Bandwidth = (4 x 4) / 65 ≈ 0.25 B/cycle.

30 4-Word Wide Bus, 4-Word Wide Memory
What if the block size is four words and the bus and the memory are 4-word wide? Work out the cycles to send the address, the cycles to read DRAM, the cycles to return the 4 data words, and the total miss penalty in clock cycles. (Class handout version; filled in on the next slide.)

31 4-Word Wide Bus, 4-Word Wide Memory
What if the block size is four words and the bus and the memory are 4-word wide? 1 cycle to send the address, 15 cycles to read DRAM, 1 cycle to return all 4 data words: 17 total clock cycles of miss penalty. Bandwidth = (4 x 4) / 17 ≈ 0.94 B/cycle.

32 Interleaved Memory, 1-Word Wide Bus
For a block size of four words, work out the cycles to send the address, the cycles to read the DRAM banks, the cycles to return the 4 data words, and the total miss penalty in clock cycles. (Class handout version; filled in on the next slide.)

33 Interleaved Memory, 1-Word Wide Bus
For a block size of four words: 1 cycle to send the address, 15 cycles to read the DRAM banks (in parallel), 4 x 1 = 4 cycles to return the 4 data words: 20 total clock cycles of miss penalty. Bandwidth = (4 x 4) / 20 = 0.8 B/cycle.
The width of the bus and the cache need not be changed, but sending an address to several banks permits them all to read simultaneously, so interleaving retains the advantage of incurring the full memory latency only once. Low-order interleaving (the low-order bits select the bank, while the high-order bits access the DRAM banks in parallel): Bank 0 holds addresses xx…xx00, Bank 1 holds addresses xx…xx01, Bank 2 holds addresses xx…xx10, Bank 3 holds addresses xx…xx11. (One could also combine interleaving with a wider bus, e.g., a 2-word bus between the interleaved memory and the processor.)
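The three miss-penalty and bandwidth figures can be reproduced with a short C sketch, using the timing parameters from slide 27:

    #include <stdio.h>

    int main(void) {
        int words = 4;    /* words per cache block          */
        int addr  = 1;    /* bus cycles to send the address */
        int dram  = 15;   /* bus cycles per DRAM access     */
        int xfer  = 1;    /* bus cycles per data transfer   */

        int narrow      = addr + words * dram + words * xfer;  /* 1 + 60 + 4 = 65 */
        int wide        = addr + dram + xfer;                  /* 1 + 15 + 1 = 17 */
        int interleaved = addr + dram + words * xfer;          /* 1 + 15 + 4 = 20 */

        printf("1-word wide memory:  %2d cycles, %.2f bytes/cycle\n", narrow,      16.0 / narrow);
        printf("4-word wide memory:  %2d cycles, %.2f bytes/cycle\n", wide,        16.0 / wide);
        printf("4-bank interleaved:  %2d cycles, %.2f bytes/cycle\n", interleaved, 16.0 / interleaved);
        return 0;
    }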

