Cache Writes, DRAM Configurations, Performance, Associative Caches, Multi-level Caches

Outline
– Cache writes
– DRAM configurations
– Performance
– Associative caches
– Multi-level caches

Direct-mapped Cache (block size = 4 words, word size = 4 bytes)
The address is split into Tag | Index | Block Offset | Byte Offset. A four-reference stream is traced through a four-line cache (the binary addresses were lost in transcription):
– Reference 1: Hit – the referenced block is already resident.
– Reference 2: Miss – the indexed line holds a different tag; M[16-31] is fetched and replaces the previous block (whose range was lost in transcription).
– Reference 3: Miss – the indexed line is not valid; M[48-63] is fetched and the valid bit is set.
– Reference 4: Hit – the referenced block is now resident.
Final cache contents: M[64-79], M[16-31], M[32-47], M[48-63].
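To make the address split concrete, here is a minimal direct-mapped lookup in Python. It is a sketch assuming the slide's geometry (four lines, 4-word/16-byte blocks); the reference stream and the allocate-on-miss policy are illustrative, not taken from the slides.

```python
# Direct-mapped lookup: byte address -> tag / index / offsets.
NUM_LINES = 4            # assumed: the slide shows four cache lines
BLOCK_BYTES = 16         # 4 words x 4 bytes

lines = [{"valid": False, "tag": None, "data": None} for _ in range(NUM_LINES)]

def lookup(addr):
    """Return 'hit' or 'miss' for a byte address, filling the line on a miss."""
    block_addr = addr // BLOCK_BYTES      # drops byte + block offsets (4 bits)
    index = block_addr % NUM_LINES        # 2-bit index picks exactly one line
    tag = block_addr // NUM_LINES         # remaining high bits
    line = lines[index]
    if line["valid"] and line["tag"] == tag:
        return "hit"
    # Miss: fetch the block and overwrite whatever lives at this index.
    start = block_addr * BLOCK_BYTES
    line.update(valid=True, tag=tag, data=f"M[{start}-{start + BLOCK_BYTES - 1}]")
    return "miss"

for a in (68, 16, 48, 52):                # hypothetical reference stream
    print(hex(a), lookup(a))
```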

Cache Writes
There are multiple copies of the data lying around: L1 cache, L2 cache, DRAM.
– Do we write to all of them?
– Do we wait for the write to complete before the processor can proceed?

Do we write to all of them?
Write-through – write to all levels of the hierarchy.
Write-back – write to the lower level only when the cache line is evicted.
– Write-back creates inconsistent data: different values for the same item in cache and DRAM.
– An inconsistent block in the highest cache level is called dirty; if all copies match, they are clean; the out-of-date copy below is stale.

Write-Through: on sw $3, 0($5), the store is written through the CPU's L1 and L2 caches all the way to DRAM.

Write-Back: on sw $3, 0($5), the store updates only L1; L2 and DRAM are updated later, when the line is evicted.
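The traffic difference between the two policies shows up even in a toy model. A sketch with assumed names (Cache, store, evict), counting writes to the next level:

```python
class Cache:
    """One cache level; write_through selects the policy."""
    def __init__(self, write_through):
        self.write_through = write_through
        self.lines = {}          # block address -> (data, dirty bit)
        self.lower_writes = 0    # traffic to the next level (L2/DRAM)

    def store(self, block, data):
        # The local copy is always updated; write-through also writes below.
        self.lines[block] = (data, not self.write_through)
        if self.write_through:
            self.lower_writes += 1

    def evict(self, block):
        # Write-back pays its write here, and only if the line is dirty.
        _, dirty = self.lines.pop(block)
        if dirty:
            self.lower_writes += 1

wt, wb = Cache(write_through=True), Cache(write_through=False)
for cache in (wt, wb):
    for value in range(10):      # ten stores to the same block
        cache.store(block=0, data=value)
    cache.evict(block=0)
print(wt.lower_writes, wb.lower_writes)   # 10 vs. 1
```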

Write-through vs Write-back
– Which performs the write faster? Write-back: it writes only the L1 cache.
– Which has faster evictions from a cache? Write-through: no data needs to be written out, just overwrite the tag.
– Which causes more bus traffic? Write-through: DRAM is written on every store, while write-back writes only on eviction.

Does the processor wait for the write?
Write buffer – an intermediate queue of pending writes that lets the processor proceed without waiting.
– Any load must check the write buffer in parallel with the cache access.
– Buffer values are more recent than cache values.
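A sketch of the load path with a write buffer in front of memory; the dict-backed "cache" and the function names are illustrative assumptions:

```python
from collections import deque

cache = {0x40: "old value"}      # stand-in for the L1 data array
write_buffer = deque()           # queue of pending (address, value) writes

def store(addr, value):
    write_buffer.append((addr, value))    # enqueue; the CPU does not wait

def load(addr):
    # Check the write buffer in parallel with the cache access (modeled
    # here as checking it first), newest entry first: buffered values are
    # more recent than cache values.
    for a, v in reversed(write_buffer):
        if a == addr:
            return v
    return cache.get(addr)

store(0x40, "new value")
print(load(0x40))    # "new value": the pending write wins over the cache copy
```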

Outline
– Cache writes
– DRAM configurations
– Performance
– Associative caches

Challenge
– DRAM is designed for density, not speed.
– DRAM is slower than the bus.
– We may change the width, the number of DRAMs, and the bus protocol, but the access latency stays slow.
– Widening anything increases the cost considerably.

Narrow Configuration (CPU – bus – cache – DRAM, one word wide)
Given: 1 clock cycle to send the request; 15 cycles/word DRAM latency; 1 cycle/word bus latency.
If a cache block is 8 words, what is the miss penalty of an L2 cache miss?
1 cycle + 15 cycles/word × 8 words + 1 cycle/word × 8 words = 129 cycles

Wide Configuration (two words wide)
Given: 1 clock cycle to send the request; 15 cycles per 2 words DRAM latency; 1 cycle per 2 words bus latency.
If a cache block is 8 words, what is the miss penalty of an L2 cache miss?
1 cycle + 15 cycles/2 words × 8 words + 1 cycle/2 words × 8 words = 1 + 60 + 4 = 65 cycles

Interleaved Configuration (two DRAM banks, one-word bus)
Given: 1 clock cycle to send the request; 15 cycles/word DRAM latency, but the two banks are accessed in parallel, so the DRAMs deliver 2 words per 15 cycles; 1 cycle/word bus latency.
If a cache block is 8 words, what is the miss penalty of an L2 cache miss?
1 cycle + 15 cycles/2 words × 8 words + 1 cycle/word × 8 words = 1 + 60 + 8 = 69 cycles
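All three penalties come from one formula: request time, plus DRAM time per transfer, plus bus time per transfer. A small sketch; the parameter defaults match the numbers above:

```python
def miss_penalty(block_words, request=1, dram_cycles=15, dram_width=1,
                 bus_cycles=1, bus_width=1):
    """Cycles to fetch one block: request + DRAM accesses + bus transfers."""
    dram = dram_cycles * (block_words // dram_width)
    bus = bus_cycles * (block_words // bus_width)
    return request + dram + bus

print(miss_penalty(8))                                # narrow:      129
print(miss_penalty(8, dram_width=2, bus_width=2))     # wide:         65
print(miss_penalty(8, dram_width=2))                  # interleaved:  69
```

For the interleaved case, dram_width=2 models the two banks overlapping their 15-cycle accesses, while the bus still moves one word per cycle.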

Recent DRAM Trends
– Fewer, bigger DRAMs
– New bus protocols (RAMBUS)
– Small DRAM caches (page mode)
– SDRAM (synchronous DRAM): one request with a length nets several consecutive responses

Outline
– Cache writes
– DRAM configurations
– Performance
– Associative caches

Performance
Execution time = (CPU cycles + memory-stall cycles) × clock cycle time
Memory-stall cycles
  = (accesses / program) × (misses / access) × (cycles / miss)
  = (memory accesses / program) × miss rate × miss penalty
  = (instructions / program) × (misses / instruction) × (cycles / miss)
  = (instructions / program) × (misses / instruction) × miss penalty

Example 1
– Instruction cache miss rate: 2%
– Data cache miss rate: 3%
– Miss penalty: 50 cycles
– Loads/stores are 25% of instructions
– CPI with a perfect cache: 2.3
How much faster is the computer with a perfect cache?

Example 1 (solution)
misses/instr = (I-accesses/instr) × Imr + (D-accesses/instr) × Dmr
             = 1 × 0.02 + 0.25 × 0.03 = 0.02 + 0.0075 = 0.0275
Memory cycles = I × 0.0275 × 50 = 1.375 × I
ExecT = (CPU CPI × I + memory cycles) × Clk = (2.3 × I + 1.375 × I) × C = 3.675 × I × C
Speedup = 3.675 IC / 2.3 IC = 1.6

Example 2
Double the clock rate from Example 1. What is the ideal speedup when taking the memory system into account? How long is the miss penalty now?
Miss penalty = 100 cycles (the DRAM is no faster, but each cycle is now half as long)
Memory cycles = I × 0.0275 × 100 = 2.75 × I
ExecT = (2.3 × I + 2.75 × I) × clk = 5.05 × I × (C/2)
Speedup = old / new = 3.675 IC / (5.05 IC / 2) = 1.5
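Both examples reduce to a few lines of arithmetic. A sketch, with variable names (mpi for misses per instruction) chosen here rather than taken from the slides:

```python
# Example 1: CPI with the real cache vs. a perfect cache.
mpi = 1 * 0.02 + 0.25 * 0.03     # misses per instruction = 0.0275
base_cpi = 2.3
cpi_real = base_cpi + mpi * 50   # 50-cycle penalty -> 3.675
print(round(cpi_real / base_cpi, 1))    # 1.6x faster with a perfect cache

# Example 2: doubling the clock doubles the penalty measured in cycles.
cpi_fast = base_cpi + mpi * 100         # 5.05
old_time = cpi_real * 1.0               # per-instruction time at clock C
new_time = cpi_fast * 0.5               # same work at clock C/2
print(round(old_time / new_time, 1))    # 1.5x, not the ideal 2x
```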

Outline
– Cache writes
– DRAM configurations
– Performance
– Associative caches

Direct-mapped Cache (block size = 2 words, word size = 4 bytes)
A four-reference stream is traced (the binary addresses were lost in transcription); every reference misses:
– Reference 1: Miss – the invalid line is filled with M[56-63].
– Reference 2: Miss – M[24-31] maps to the same line and evicts M[56-63].
– Reference 3: Miss – M[56-63] is fetched again, evicting M[24-31].
– Reference 4: Miss.
Blocks whose addresses share an index keep evicting each other even though the rest of the cache sits idle.

Problem
Conflicting addresses cause high miss rates.

Solution
Relax the direct mapping: allow each address to map into 2 or 4 locations (a set).

Cache Configurations
– Direct-Mapped: each address maps to exactly one block.
– 2-way Set Associative: each set has two blocks; an address may be placed in either block of its set.
– Fully Associative: all addresses map to the same set; a block may be placed anywhere.

2-way Set Associative Cache (block size = 2 words, word size = 4 bytes)
The reference stream is traced again, now with two blocks per set (the binary addresses were lost in transcription):
– Reference 1: Miss – the block is loaded into one way of its set.
– References 2, 3, 4: Hits – the conflicting blocks now coexist in the two ways of the same set.
The conflict misses of the direct-mapped trace are gone.

Implementation
[Diagram: a 2-way set associative lookup. The byte address splits into Tag | Index | Block Offset | Byte Offset; the index selects one set, the tags of both ways are compared in parallel (=), the comparison results are combined with the valid bits to form Hit?, and MUXes select the data word from the matching way and block offset.]
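The same datapath in software: the index selects a set, both ways' tags are compared, and the match selects the data (the MUX). A sketch assuming 2-word blocks and two sets; the reference stream is hypothetical:

```python
NUM_SETS, WAYS, BLOCK_BYTES = 2, 2, 8    # assumed geometry

sets = [[{"valid": False, "tag": None, "data": None} for _ in range(WAYS)]
        for _ in range(NUM_SETS)]

def lookup(addr):
    block_addr = addr // BLOCK_BYTES
    index = block_addr % NUM_SETS
    tag = block_addr // NUM_SETS
    # Both ways' tag comparators, gated by the valid bits.
    match = [way["valid"] and way["tag"] == tag for way in sets[index]]
    if any(match):
        return "hit", sets[index][match.index(True)]["data"]
    # Fill the first invalid way; a real cache needs a replacement policy.
    ways = sets[index]
    victim = ways[0] if not ways[0]["valid"] else ways[1]
    victim.update(valid=True, tag=tag, data=f"block {block_addr}")
    return "miss", victim["data"]

for a in (56, 24, 56, 24):    # two conflicting blocks, same set
    print(lookup(a)[0])       # miss, miss, hit, hit: they coexist in one set
```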

Performance Implications
– Increasing associativity increases hit rate.
– Increasing associativity increases access time.
– Increasing associativity has no effect on miss penalty.

Example: Direct-Mapped vs 2-way Associative
The same reference stream is traced through a direct-mapped cache of equal capacity and its miss rate is compared with the 2-way associative result. [The per-step cache contents of this example were lost in transcription.]

Which block do we replace?
– The one that entered the cache first: FIFO (First In, First Out)
– The one that has gone longest since it was used: LRU (Least Recently Used)
– Random

Replacement Algorithms
– LRU and FIFO are conceptually simple, but implementation is difficult at high associativity.
– LRU and FIFO must be approximated with high associativity.
– Random is sometimes better than approximated LRU/FIFO.
– Tradeoff between accuracy and implementation cost.
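At low associativity, exact LRU is simple to express: keep the ways of each set ordered by recency. A sketch with names chosen here, not from the lecture:

```python
from collections import OrderedDict

class LRUSet:
    """Exact LRU for one cache set; practical only at low associativity."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()     # tag -> data, least recent first

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)       # mark most recently used
            return "hit"
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)    # evict least recently used
        self.blocks[tag] = f"block {tag}"
        return "miss"

s = LRUSet(ways=2)
print([s.access(t) for t in (7, 3, 7, 5, 3)])
# ['miss', 'miss', 'hit', 'miss', 'miss']: 5 evicts 3 (the LRU block),
# so the next access to 3 misses again.
```

Hardware keeps recency bits per set rather than an ordered list, which is why exact LRU is approximated once associativity grows.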

The L1 cache's perspective
[Diagram: CPU → L1 → L2 cache → DRAM.] L1's miss penalty contains the access of L2, and possibly the access of DRAM!

Multi-level Caches
– Base CPI 1.0, 500 MHz clock
– Main memory: 100 cycles; L2: 10 cycles
– L1 miss rate per instruction: 5%; with L2, only 2% of instructions go to DRAM
What is the speedup with the L2 cache? (There is a typo in the book for this example!)

Multi-level Caches (solution)
CPI = 1 + memory stalls / instruction
CPI_old = 1 + 5% misses/instr × 100 cycles/miss = 1 + 5 = 6 cycles/instr
CPI_new = 1 + L1miss% × L2 penalty + Mem% × Mem penalty
        = 1 + 5% × 10 + 2% × 100 = 3.5
        = 1 + (5-2)% × 10 + 2% × (10 + 100) = 3.5   (equivalent accounting)
Speedup = 6 / 3.5 = 1.7
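A quick check that the two accountings in the solution agree; the variable names are ours:

```python
base_cpi, l1_miss, to_dram = 1.0, 0.05, 0.02   # per-instruction rates
l2_pen, mem_pen = 10, 100                      # cycles

cpi_old = base_cpi + l1_miss * mem_pen         # no L2: 6.0
# (a) every L1 miss pays the L2 access; DRAM trips pay memory on top.
cpi_a = base_cpi + l1_miss * l2_pen + to_dram * mem_pen
# (b) L2 hits pay only L2; DRAM trips pay L2 plus memory.
cpi_b = base_cpi + (l1_miss - to_dram) * l2_pen + to_dram * (l2_pen + mem_pen)
assert abs(cpi_a - cpi_b) < 1e-9               # both 3.5
print(round(cpi_old / cpi_a, 1))               # speedup with L2: 1.7
```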

DO GROUPWORK NOW

Summary
Direct-mapped
– simple
– fast access time
– marginal hit rate
Variable block size
– still simple
– fast access time
– higher hit rate by exploiting spatial locality

Summary (continued)
Associative caches
– increase the access time
– increase the hit rate
– associativity above 8 has little to no gain
Multi-level caches
– increase the worst-case miss penalty (because you waste time accessing another cache)
– reduce the average miss penalty (because so many misses are caught and handled quickly)