Outline Cache writes DRAM configurations Performance Associative caches Multi-level caches.

Presentation on theme: "Outline Cache writes DRAM configurations Performance Associative caches Multi-level caches."— Presentation transcript:

1 Outline Cache writes DRAM configurations Performance Associative caches Multi-level caches

2 Direct-mapped Cache, block size = 4 words, word size = 4 bytes. Address fields: Tag | Index | Block Offset | Byte Offset. Reference Stream (Hit/Miss): 0b01001000, 0b00010100, 0b00111000, 0b00010000. [diagram: 4-entry cache with Valid, Tag, and Data columns, indices 00-11]

8 Initial cache contents: M[64-79] (tag 01) at index 00, M[208-223] (tag 11) at index 01, M[32-47] (tag 00), and one entry Not Valid. Reference Stream (Hit/Miss): 0b01001000, 0b00010100, 0b00111000, 0b00010000.

16 Final state. Reference Stream results: 0b01001000 H, 0b00010100 M, 0b00111000 M, 0b00010000 H. The cache now holds M[64-79], M[16-31], M[32-47], and M[48-63], all valid.
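The walkthrough above can be replayed with a short simulator sketch. This is illustrative code, not from the slides; the function and variable names are my own, and the initial state is seeded to match the valid entries shown on the slide.

```python
# Hypothetical sketch: replay the reference stream through a direct-mapped
# cache with 4 sets and 16-byte (4-word) blocks, seeded with the initial
# contents shown on the slide.
BLOCK_BYTES = 16          # 4 words x 4 bytes
NUM_SETS = 4

def simulate(refs, cache):
    """cache maps index -> tag for valid entries; returns 'H'/'M' per reference."""
    results = []
    for addr in refs:
        block = addr // BLOCK_BYTES
        index = block % NUM_SETS
        tag = block // NUM_SETS
        if cache.get(index) == tag:
            results.append('H')
        else:
            results.append('M')
            cache[index] = tag    # fill the set on a miss
    return results

# Initial state: M[64-79] (tag 1) at index 0, M[208-223] (tag 3) at index 1;
# the remaining entries are treated as not valid.
initial = {0: 1, 1: 3}
refs = [0b01001000, 0b00010100, 0b00111000, 0b00010000]
print(simulate(refs, initial))   # ['H', 'M', 'M', 'H'], matching the slides
```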

17 Cache Writes There are multiple copies of the data lying around – L1 cache, L2 cache, DRAM. Do we write to all of them? Do we wait for the write to complete before the processor can proceed?

20 Do we write to all of them? Write-through – write to all levels of the hierarchy. Write-back – write to the lower level only when the cache line gets evicted from the cache. –Write-back creates inconsistent data – different values for the same item in the cache and in DRAM. –Inconsistent data in the highest cache level is referred to as dirty. –If all the copies match, they are clean. –The old data in the lower level is stale.

21 Write-Through [diagram: CPU → L1 → L2 Cache → DRAM] sw $3, 0($5) – the store is written to every level.

22 Write-Back [diagram: CPU → L1 → L2 Cache → DRAM] sw $3, 0($5) – the store is written to L1 only; lower levels are updated on eviction.

26 Write-through vs Write-back Which performs the write faster? –Write-back – it only writes the L1 cache. Which has faster evictions from a cache? –Write-through – no write involved, just overwrite the tag. Which causes more bus traffic? –Write-through: DRAM is written on every store. Write-back only writes on eviction.

28 Does processor wait for write? Write buffer - intermediate queue for pending writes –Any loads must check write buffer in parallel with cache access. –Buffer values are more recent than cache values.
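The write-buffer behavior described above can be sketched in a few lines. This is an illustrative model, not from the slides; the class and method names are invented for the example.

```python
# Hypothetical sketch of a write buffer: stores are queued instead of
# stalling the CPU, and loads check the buffer in parallel with the cache,
# because buffered values are more recent than cached/memory values.
from collections import deque

class WriteBuffer:
    def __init__(self):
        self.pending = deque()               # (address, value) writes not yet retired

    def store(self, addr, value):
        self.pending.append((addr, value))   # CPU proceeds without waiting

    def load_check(self, addr):
        """Return the newest buffered value for addr, or None if absent."""
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return None

    def drain_one(self, memory):
        if self.pending:
            a, v = self.pending.popleft()
            memory[a] = v                    # write retires to the next level

memory = {0x100: 7}
wb = WriteBuffer()
wb.store(0x100, 42)
# A load must see 42 (from the buffer) even though memory still holds 7.
value = wb.load_check(0x100)
value = value if value is not None else memory[0x100]
print(value)             # 42
wb.drain_one(memory)
print(memory[0x100])     # 42
```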

29 Outline Cache writes DRAM configurations Performance Associative caches

31 Challenge DRAM is designed for density, not speed DRAM is slower than the bus We are allowed to change the width, the number of DRAMs, and the bus protocol, but the access latency stays slow. Widening anything increases the cost by quite a bit.

33 Narrow Configuration [diagram: CPU – Cache – Bus – DRAM] Given: –1 clock cycle request –15 cycles / word DRAM latency –1 cycle / word bus latency If a cache block is 8 words, what is the miss penalty of an L2 cache miss? 1 cycle + (15 cycles/word × 8 words) + (1 cycle/word × 8 words) = 1 + 120 + 8 = 129 cycles

35 Wide Configuration [diagram: CPU – Cache – Bus – DRAM, two words wide] Given: –1 clock cycle request –15 cycles / 2 words DRAM latency –1 cycle / 2 words bus latency If a cache block is 8 words, what is the miss penalty of an L2 cache miss? 1 cycle + (15 cycles/2 words × 8 words) + (1 cycle/2 words × 8 words) = 1 + 60 + 4 = 65 cycles

37 Interleaved Configuration [diagram: CPU – Cache – Bus – two DRAM banks] Given: –1 clock cycle request –15 cycles / word DRAM latency –1 cycle / word bus latency If a cache block is 8 words, what is the miss penalty of an L2 cache miss? The two banks overlap their accesses: 1 cycle + (15 cycles/2 words × 8 words) + (1 cycle/word × 8 words) = 1 + 60 + 8 = 69 cycles
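The three miss-penalty calculations can be reproduced with one small sketch. The function name and parameters are illustrative (not from the slides); the latencies are the slides' given values.

```python
# Sketch of the miss-penalty model used above: one request cycle, then DRAM
# transfers, then bus transfers. Widths are in words moved per latency unit.
def miss_penalty(block_words, request=1, dram_latency=15, dram_width=1,
                 bus_latency=1, bus_width=1):
    """Cycles to fetch one cache block from DRAM."""
    dram = dram_latency * block_words // dram_width
    bus = bus_latency * block_words // bus_width
    return request + dram + bus

narrow = miss_penalty(8)                                   # 1 + 120 + 8
wide = miss_penalty(8, dram_width=2, bus_width=2)          # 1 + 60 + 4
# Two interleaved banks overlap DRAM accesses (halving effective DRAM
# latency per word) but still share a one-word-wide bus.
interleaved = miss_penalty(8, dram_width=2, bus_width=1)   # 1 + 60 + 8
print(narrow, wide, interleaved)                           # 129 65 69
```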

38 Recent DRAM trends Fewer, bigger DRAMs New bus protocols (RAMBUS) Small DRAM caches (page mode) SDRAM (synchronous DRAM) –one request & burst length nets several consecutive responses.

39 Outline Cache writes DRAM configurations Performance Associative caches

40 Performance Execution Time = (CPU cycles + Memory-stall cycles) × clock cycle time
Memory-stall cycles
= accesses/program × misses/access × cycles/miss
= memory accesses/program × miss rate × miss penalty
= instructions/program × misses/instruction × cycles/miss
= instructions/program × misses/instruction × miss penalty

41 Example 1 instruction cache miss rate: 2% data cache miss rate: 3% miss penalty: 50 cycles ld/st instructions are 25% of instructions CPI with perfect cache is 2.3 How much faster is the computer with a perfect cache?

47 Example 1
misses/instr = I accesses/instr × I miss rate + D accesses/instr × D miss rate = 1 × .02 + .25 × .03 = .02 + .0075 = .0275
Memory cycles = I × .0275 × 50 = 1.375 I
ExecT = (CPU CPI × I + Memory cycles) × Clk = (2.3 I + 1.375 I) × C = 3.675 I C
speedup = 3.675 I C / 2.3 I C = 1.6
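Example 1 reduces to a few lines of arithmetic; this sketch uses the slide's numbers with illustrative variable names.

```python
# Example 1 arithmetic (rates and penalty from the slide).
imr, dmr = 0.02, 0.03          # instruction / data cache miss rates
ldst_frac = 0.25               # data accesses per instruction (25% ld/st)
miss_penalty = 50
base_cpi = 2.3                 # CPI with a perfect cache

misses_per_instr = 1 * imr + ldst_frac * dmr               # 0.0275
stall_cycles_per_instr = misses_per_instr * miss_penalty   # 1.375
real_cpi = base_cpi + stall_cycles_per_instr               # 3.675
speedup = real_cpi / base_cpi                              # perfect-cache speedup
print(round(misses_per_instr, 4), round(real_cpi, 3), round(speedup, 1))
# 0.0275 3.675 1.6
```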

52 Example 2 Double the clock rate from Example 1. What is the ideal speedup when taking into account the memory system? How long is the miss penalty now? 100 cycles – memory is no faster, so the same penalty now spans twice as many (half-length) cycles.
Memory cycles = I × .0275 × 100 = 2.75 I
Exec = (2.3 I + 2.75 I) × clk = 5.05 I (C/2)
speedup = old/new = 3.675 I C / (5.05 I C/2) = 3.675 / 2.525 = 1.5
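The same calculation in code makes the "less than 2×" conclusion easy to check; a sketch with illustrative names, using the slide's numbers:

```python
# Example 2: doubling the clock doubles the miss penalty in cycles
# (memory speed is unchanged), so the net speedup is well under 2x.
base_cpi = 2.3
misses_per_instr = 0.0275
old_cpi = base_cpi + misses_per_instr * 50    # 3.675, at clock period C
new_cpi = base_cpi + misses_per_instr * 100   # 5.05, at clock period C/2
speedup = (old_cpi * 1.0) / (new_cpi * 0.5)   # time_old / time_new
print(round(new_cpi, 2), round(speedup, 2))   # 5.05 1.46
```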

53 Outline Cache writes DRAM configurations Performance Associative caches

54 Direct-mapped Cache, block size = 2 words, word size = 4 bytes. Address fields: Tag | Index | Block Offset | Byte Offset. Reference Stream (Hit/Miss): 0b00111000, 0b00011100, 0b00111000, 0b00011000. Initial contents: M[160-167] (tag 101) at index 00, M[72-79] (tag 010) at index 01, M[16-23] (tag 000) at index 10, and index 11 Not Valid.

61 Final state. Reference Stream results: 0b00111000 M, 0b00011100 M, 0b00111000 M, 0b00011000 M – every reference misses. The cache ends holding M[160-167], M[72-79], M[16-23], and M[56-63] (tag 001) at index 11.
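Replaying this stream in a sketch shows why every reference misses: all four addresses map to the same set. Illustrative code, seeded with the slide's valid entries:

```python
# Hypothetical sketch: direct-mapped cache with 4 sets and 8-byte
# (2-word) blocks. All four references collide in index 3.
BLOCK_BYTES = 8
NUM_SETS = 4

def simulate(refs, cache):
    results = []
    for addr in refs:
        block = addr // BLOCK_BYTES
        index, tag = block % NUM_SETS, block // NUM_SETS
        if cache.get(index) == tag:
            results.append('H')
        else:
            results.append('M')
            cache[index] = tag     # each miss evicts the previous block
    return results

# Initial state from the slide: tags 101, 010, 000 at indices 0-2.
initial = {0: 0b101, 1: 0b010, 2: 0b000}
refs = [0b00111000, 0b00011100, 0b00111000, 0b00011000]
# The stream alternates between tags 1 and 0 at index 3, so each
# reference evicts the other: four conflict misses.
print(simulate(refs, initial))   # ['M', 'M', 'M', 'M']
```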

62 Problem Conflicting addresses cause high miss rates

63 Solution Relax the direct-mapping Allow each address to be mapped into 2 or 4 locations (a set)

66 Cache Configurations [diagram] Direct-Mapped – each address maps to exactly one block (indices 00-11). 2-way Associative – each set has two blocks. Fully Associative – all addresses map to the same set. Each Valid/Tag/Data entry is a Block; each row is a Set.

67 2-way Set Associative Cache, block size = 2 words, word size = 4 bytes. Address fields: Tag | Index | Block Offset | Byte Offset. Reference Stream (Hit/Miss): 0b00111000, 0b00011100, 0b00111000, 0b00011000. Initial contents (all valid): set 0 holds tags 1001 and 0000; set 1 holds tags 0010 and 0001.

75 Final state. Reference Stream results: 0b00111000 M, 0b00011100 H, 0b00111000 H, 0b00011000 H – only the first reference misses. Set 1 now holds tags 0011 and 0001; associativity removed the conflict misses.
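The same stream through a 2-way set-associative sketch confirms the M, H, H, H result. Illustrative code; it assumes the slide's starting contents, with the way holding tag 0010 treated as least recently used (that is the way the slide replaces).

```python
# Hypothetical sketch: 2 sets x 2 ways, 8-byte (2-word) blocks, LRU
# replacement. Each set is a list of tags ordered least- to most-recent.
BLOCK_BYTES = 8
NUM_SETS = 2

def simulate(refs, sets):
    results = []
    for addr in refs:
        block = addr // BLOCK_BYTES
        index, tag = block % NUM_SETS, block // NUM_SETS
        ways = sets[index]
        if tag in ways:
            results.append('H')
            ways.remove(tag)          # re-append below to mark most recent
        else:
            results.append('M')
            if len(ways) == 2:        # set full: evict the LRU way
                ways.pop(0)
        ways.append(tag)
    return results

# Initial state from the slide (assumed LRU order: first element oldest).
initial = {0: [0b1001, 0b0000], 1: [0b0010, 0b0001]}
refs = [0b00111000, 0b00011100, 0b00111000, 0b00011000]
print(simulate(refs, initial))   # ['M', 'H', 'H', 'H']
```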

76 Implementation [diagram] Byte Address 0x100100100 is split into Tag | Index | Block Offset | Byte Offset. The Index selects a set; each way's Tag is compared (=) with the address tag and qualified by its Valid bit; the comparison results produce Hit? and drive a MUX that selects the matching way's Data, while a second MUX uses the Block offset to pick the requested word.

80 Performance Implications Increasing associativity increases hit rate Increasing associativity increases access time Increasing associativity has no effect on miss penalty

81 Example: direct-mapped vs 2-way associative. Reference Stream (Hit/Miss): 0b1001000 M, 0b0011100, 0b1001000, 0b0111000. Miss Rate: ___ [diagram: Direct-Mapped Cache, indices 0-1, Valid/Tag/Data, all entries invalid] Address fields: Tag | Index | Block Offset | Byte Offset.

85 [diagram: cache state partway through the stream, with tags 100 and 001 loaded] Direct-Mapped Cache, Example 2-way associative. Reference Stream: 0b1001000, 0b0011100, 0b1001000, 0b0111000. Tag | Index | Block Offset | Byte Offset.

90 Which block to replace? 0b1001000 – it entered the cache first. –FIFO – First In, First Out 0b0011100 – it has gone longer since it was last used. –LRU – Least Recently Used Random – pick a victim at random

91 Replacement Algorithms LRU & FIFO are conceptually simple, but implementation is difficult at high associativity. LRU & FIFO must be approximated with high associativity. Random is sometimes better than approximated LRU/FIFO. Tradeoff between accuracy and implementation cost.
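The three policies named above can be sketched for a single set; this is illustrative code (class and method names are my own), showing how LRU and FIFO share the same eviction mechanism and differ only in whether a hit refreshes recency.

```python
# Hypothetical sketch of victim selection within one cache set.
import random
from collections import OrderedDict

class CacheSet:
    def __init__(self, ways, policy='lru'):
        self.ways, self.policy = ways, policy
        self.blocks = OrderedDict()           # tag -> data, oldest first

    def access(self, tag):
        if tag in self.blocks:
            if self.policy == 'lru':          # a hit refreshes recency (LRU only)
                self.blocks.move_to_end(tag)
            return 'H'
        if len(self.blocks) == self.ways:     # set full: choose a victim
            if self.policy == 'random':
                victim = random.choice(list(self.blocks))
            else:                             # LRU and FIFO both evict the
                victim = next(iter(self.blocks))  # oldest entry in the order
            del self.blocks[victim]
        self.blocks[tag] = None
        return 'M'

lru = CacheSet(ways=2, policy='lru')
print([lru.access(t) for t in [1, 2, 1, 3, 2]])   # ['M', 'M', 'H', 'M', 'M']
fifo = CacheSet(ways=2, policy='fifo')
print([fifo.access(t) for t in [1, 2, 1, 3, 2]])  # ['M', 'M', 'H', 'M', 'H']
```

On this tiny trace, FIFO happens to beat LRU, which echoes the slide's point that the policies trade off differently and that approximations (or even Random) can be competitive.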

92 [diagram: L1 → L2 Cache → DRAM Memory] From the L1 cache's perspective: L1's miss penalty contains the access of L2, and possibly the access of DRAM!

93 Multi-level Caches Base CPI 1.0, 500 MHz clock. Main memory – 100 cycles; L2 – 10 cycles. L1 miss rate per instruction – 5%; with L2, 2% of instructions go to DRAM. What is the speedup with the L2 cache? There is a typo in the book for this example!

99 Multi-level Caches CPI = 1 + memory stalls/instruction
CPI old = 1 + 5% miss/instr × 100 cycles/miss = 1 + 5 = 6 cycles/instr
CPI new = 1 + L2% × L2 penalty + Mem% × Mem penalty = 1 + 5% × 10 + 2% × 100 = 3.5
(equivalently: 1 + (5-2)% × 10 + 2% × (10+100) = 3.5)
Speedup = 6 / 3.5 = 1.7
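The multi-level example, including the equivalent bookkeeping, checks out in a few lines; a sketch with illustrative variable names, using the slide's rates and latencies:

```python
# Multi-level cache CPI calculation (numbers from the slide).
base_cpi = 1.0
l1_miss_per_instr = 0.05      # 5% of instructions miss in L1
dram_per_instr = 0.02         # 2% of instructions go all the way to DRAM
l2_penalty, mem_penalty = 10, 100

cpi_old = base_cpi + l1_miss_per_instr * mem_penalty      # no L2: 6.0
cpi_new = base_cpi + l1_miss_per_instr * l2_penalty \
          + dram_per_instr * mem_penalty                  # with L2: 3.5
# Equivalent bookkeeping: misses caught by L2 pay 10; the rest pay 10 + 100.
cpi_alt = base_cpi + (l1_miss_per_instr - dram_per_instr) * l2_penalty \
          + dram_per_instr * (l2_penalty + mem_penalty)
speedup = cpi_old / cpi_new
print(cpi_old, cpi_new, round(cpi_alt, 2), round(speedup, 1))
# 6.0 3.5 3.5 1.7
```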

100 DO GROUPWORK NOW

103 Summary Direct-mapped –simple –fast access time –marginal hit rate Variable block size –still simple –fast access time –higher hit rate by exploiting spatial locality

106 Summary Associative caches –increase the access time –increase the hit rate –associativity above 8 has little to no gain Multi-level caches –increase the worst-case miss penalty (because you waste time accessing another cache) –reduce the average miss penalty (because so many misses are caught and handled quickly)

