DECStation 3100 miss rates

Program   Block Size   Instruction   Data        Effective
          (words)      Miss Rate     Miss Rate   Miss Rate
gcc       1            6.1%          2.1%        5.4%
gcc       4            2.0%          1.7%        1.9%
spice     1            1.2%          1.3%        1.2%
spice     4            0.3%          0.6%        0.4%

Write misses are included for the 4-word block, but not for the 1-word block.
Remember: with larger blocks the Miss Penalty goes UP!
Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty

[Figure: two plots against Block Size. Miss Penalty vs. Block Size, made up of Access Time plus Transfer Time. Miss Rate vs. Block Size for a Constant Size Cache, where larger blocks mean Fewer Blocks.]
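The formula above can be tried numerically. The sketch below is illustrative only: the miss rates are the gcc effective miss rates from the DECStation 3100 table, while the 1-cycle hit time and the two miss penalties are assumed values picked for the example, not measurements.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in clock cycles."""
    return hit_time + miss_rate * miss_penalty

# Assumed for illustration: hit time of 1 cycle and the two miss penalties.
# The miss rates are the gcc effective miss rates from the table.
one_word = amat(hit_time=1, miss_rate=0.054, miss_penalty=12)
four_word = amat(hit_time=1, miss_rate=0.019, miss_penalty=15)

print(f"1-word blocks: {one_word:.3f} cycles per access")
print(f"4-word blocks: {four_word:.3f} cycles per access")
```

A larger block only pays off when the drop in miss rate outweighs the growth in miss penalty, which is why both curves matter.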
Reducing the Miss Penalty

Reduce the time to read the multiple words from Main Memory into the cache block.

Don’t wait for the complete block to be transferred: “Early Restart”
- Access and transfer each word sequentially.
- As soon as the requested word is in the cache, restart the processor to access the cache, and finish the block transfer while the cache is available.
- Variation: “Requested Word First”

Disadvantages:
- Complex control
- The processor is likely to access the cache block before the transfer is complete
Assume Memory Access times:
- 1 clock cycle to send the address
- 10 clock cycles to access DRAM
- 1 clock cycle to send a word of data

For sequential transfer of 4 data words:
Miss Penalty = 1 + 4 * (10 + 1) = 45 clock cycles
What if we could read a block of words simultaneously from the Main Memory?

[Figure: a Cache Entry (Valid, Tag, Word3, Word2, Word1, Word0) filled from Main Memory.]

Miss Penalty = 1 + 10 + 1 = 12 clock cycles
Miss Penalty for Sequential = 45 clock cycles
What about 4 banks of Memory? “Interleaved Memory”

[Figure: the Cache connected to Bank 3, Bank 2, Bank 1, and Bank 0, all driven by the same Address.]

- Banks are accessed in parallel
- Words are transferred serially

Miss Penalty = 1 + 10 + 4 * 1 = 15 clock cycles
Miss Penalty for Parallel = 12 clock cycles
Miss Penalty for Sequential = 45 clock cycles
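The three memory organizations can be compared with a few lines of code. This is a minimal sketch under the timing assumptions stated above (1 cycle to send the address, 10 cycles per DRAM access, 1 cycle per word transferred); the function and parameter names are made up for the example.

```python
def sequential_penalty(words, addr=1, dram=10, xfer=1):
    # One address send, then each word is accessed and transferred in turn.
    return addr + words * (dram + xfer)

def parallel_penalty(addr=1, dram=10, xfer=1):
    # The whole block is read in one access and moved in one transfer.
    return addr + dram + xfer

def interleaved_penalty(words, addr=1, dram=10, xfer=1):
    # All banks are accessed at the same time; words return one per cycle.
    return addr + dram + words * xfer

for name, cycles in [
    ("sequential", sequential_penalty(4)),
    ("parallel", parallel_penalty()),
    ("interleaved", interleaved_penalty(4)),
]:
    print(f"{name:12s} {cycles} clock cycles")
```

Calling the same functions with `dram=15` gives 65 cycles for sequential and 20 cycles for 4-bank interleaved memory, the penalties used in the spice example.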
Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty

[Figure: Average Access Time vs. Block Size.]

Ways to improve the Average Access Time:
- Increase Cache size
- Increase Block size
- Main Memory Organization
CPU Performance with Cache Memory

For a program (assuming no penalty for a Hit):

CPU time = CPU execution time + CPU Hold time
CPU Hold time = Memory Stall Clock Cycles * Clock Cycle time
Memory Stall Clock Cycles = Read Stall Cycles + Write Stall Cycles
Read Stall Cycles = (Reads / Program) * Read Miss Rate * Read Miss Penalty
Write Stall Cycles = (Writes / Program) * Write Miss Rate * Write Miss Penalty + Write Buffer Stalls

Write Buffer Stalls should be << Write Miss Stalls, so, approximately:

Write Stall Cycles = (Writes / Program) * Write Miss Rate * Write Miss Penalty
Memory Stall Clock Cycles = Read Stall Cycles + Write Stall Cycles
  = (Reads / Program) * Read Miss Rate * Read Miss Penalty
  + (Writes / Program) * Write Miss Rate * Write Miss Penalty

The Miss Penalties are approximately the same (fetch the block), so the Reads and Writes can be combined using a weighted Miss Rate:

Memory Stall Cycles = (Memory Accesses / Program) * Miss Rate * Miss Penalty
Substituting the Memory Stall Cycles into the CPU time (still assuming no penalty for a Hit):

CPU time = CPU execution time + (Memory Accesses / Program) * Miss Rate * Miss Penalty * Clock Cycle time

Dividing both sides by Instructions / Program and by Clock Cycle time:

Effective CPI = Execution CPI + (Memory Accesses / Instruction) * Miss Rate * Miss Penalty
Consider the DECStation 3100 with 4-word blocks running spice:
- CPI = 1.2 without misses
- Instruction Miss Rate = 0.3%
- Data Miss Rate = 0.6%
- For spice, frequency of loads and stores = 9%
- 1.) Sequential Memory: Miss penalty = 65 clock cycles
- 2.) 4 Bank Interleaved: Miss penalty = 20 clock cycles

Eff CPI = 1.2 + (1 * .003 + .09 * .006) * Miss Penalty = 1.2 + .00354 * Miss Penalty

1.) Eff CPI = 1.2 + .00354 * 65 = 1.2 + .230 = 1.43
2.) Eff CPI = 1.2 + .00354 * 20 = 1.2 + .071 = 1.271
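The spice calculation can be checked with a short script. This is a sketch of the Effective CPI formula as used in this example, with one instruction fetch per instruction plus loads and stores at the stated frequency; the function and argument names are hypothetical.

```python
def effective_cpi(base_cpi, instr_miss_rate, data_miss_rate,
                  data_refs_per_instr, miss_penalty):
    # Each instruction makes 1 instruction fetch and, on average,
    # data_refs_per_instr data accesses (loads and stores).
    misses_per_instr = (1 * instr_miss_rate
                        + data_refs_per_instr * data_miss_rate)
    return base_cpi + misses_per_instr * miss_penalty

# spice on the DECStation 3100 with 4-word blocks (values from the slide)
for label, penalty in [("sequential", 65), ("4-bank interleaved", 20)]:
    cpi = effective_cpi(1.2, 0.003, 0.006, 0.09, penalty)
    print(f"{label:20s} Eff CPI = {cpi:.3f}")
```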
What if we get a new processor and cache that run at twice the clock frequency, but keep the same main memory speed?

Slow clock, 4 Bank Interleaved: Miss penalty = 20 clock cycles
Eff CPI = 1.2 + (1 * .003 + .09 * .006) * 20 = 1.2 + .071 = 1.271

Fast clock, same memory: Miss penalty = 40 clock cycles
Eff CPI = 1.2 + (1 * .003 + .09 * .006) * 40 = 1.2 + .142 = 1.342

Performance Fast clock / Performance Slow clock
  = (1.271 * 2 * clock cycle time) / (1.342 * clock cycle time) = 1.89
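The same comparison can be repeated for other clock ratios. The sketch below assumes, as the slide does, that main memory latency is fixed in wall-clock time, so the miss penalty in cycles scales with the clock ratio, and that the Execution CPI stays at 1.2. The ratios 4 and 8 are extra illustrative cases, not from the slides.

```python
def effective_cpi(base_cpi, misses_per_instr, miss_penalty):
    return base_cpi + misses_per_instr * miss_penalty

def speedup(clock_ratio, base_cpi=1.2, misses_per_instr=0.00354,
            base_penalty=20):
    # misses_per_instr = 1 * .003 + .09 * .006 for spice, 4-word blocks.
    cpi_slow = effective_cpi(base_cpi, misses_per_instr, base_penalty)
    # Assumption: the penalty in cycles grows with the clock ratio.
    cpi_fast = effective_cpi(base_cpi, misses_per_instr,
                             base_penalty * clock_ratio)
    # Execution time = instructions * CPI * cycle time; instructions cancel.
    return (cpi_slow * clock_ratio) / cpi_fast

for ratio in (2, 4, 8):
    print(f"clock x{ratio}: performance x{speedup(ratio):.2f}")
```

In this model the gain always falls short of the clock ratio, because memory stalls take a growing share of the cycles.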
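[Figure: direct-mapped cache with 4-word blocks and 4K entries. The Address is split into Tag (16 bits), Index, Block Offset, and Byte Offset. Each entry holds v (valid), Tag, Word3, Word2, Word1, Word0. The stored Tag is compared (=) with the address Tag to produce Hit, and a Mux selects the Data word.]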
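The address fields in the figure can be extracted with shifts and masks. This sketch assumes a 32-bit byte address with 4-byte words, which is consistent with a 16-bit tag, 4K entries (12 index bits), and 4-word blocks; the function name is hypothetical.

```python
BYTE_OFFSET_BITS = 2    # assumed 4 bytes per word
BLOCK_OFFSET_BITS = 2   # 4 words per block
INDEX_BITS = 12         # 4K entries

def split_address(addr):
    """Split a 32-bit byte address into (tag, index, block_offset, byte_offset)."""
    byte_offset = addr & ((1 << BYTE_OFFSET_BITS) - 1)
    addr >>= BYTE_OFFSET_BITS
    block_offset = addr & ((1 << BLOCK_OFFSET_BITS) - 1)
    addr >>= BLOCK_OFFSET_BITS
    index = addr & ((1 << INDEX_BITS) - 1)
    tag = addr >> INDEX_BITS   # remaining 16 bits
    return tag, index, block_offset, byte_offset

tag, index, block_offset, byte_offset = split_address(0x12345678)
print(hex(tag), hex(index), block_offset, byte_offset)
```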
Block Address X holds Word Addresses 4X, 4X+1, 4X+2, 4X+3.

Block Address = floor(Word Addr / 4)
Cache Address = X modulo 8
Consider a Direct Mapped Cache with 4-word blocks and a size of 8 blocks, or 32 words.

Cache Address = floor(Word Addr / 4) modulo 8

Reference Sequence 1:

Word Address   Block Address   Cache Address   Hit or Miss
6              1               1               Miss
7              1               1               Hit
8              2               2               Miss
9              2               2               Hit
80             20              4               Miss
6              1               1               Hit
7              1               1               Hit
8              2               2               Hit
9              2               2               Hit
81             20              4               Hit
Same Direct Mapped Cache (4-word blocks, 8 blocks or 32 words), second reference sequence.

Cache Address = floor(Word Addr / 4) modulo 8

Reference Sequence 2:

Word Address   Block Address   Cache Address   Hit or Miss
6              1               1               Miss
7              1               1               Hit
8              2               2               Miss
9              2               2               Hit
68             17              1               Miss
6              1               1               Miss
7              1               1               Hit
8              2               2               Hit
9              2               2               Hit
69             17              1               Miss
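Both reference sequences can be replayed with a small simulator. This is a minimal sketch that tracks only which block address occupies each cache line, starting from an empty cache; the function name is made up for the example.

```python
def simulate_direct_mapped(word_addrs, num_blocks=8, words_per_block=4):
    lines = [None] * num_blocks          # block address held by each line
    trace = []
    for word in word_addrs:
        block = word // words_per_block  # Block Address
        line = block % num_blocks        # Cache Address
        hit = lines[line] == block
        if not hit:
            lines[line] = block          # fetch the block, evicting the old one
        trace.append((word, block, line, "Hit" if hit else "Miss"))
    return trace

for sequence in ([6, 7, 8, 9, 80, 6, 7, 8, 9, 81],
                 [6, 7, 8, 9, 68, 6, 7, 8, 9, 69]):
    for row in simulate_direct_mapped(sequence):
        print(*row, sep="\t")
    print()
```

In the second sequence, blocks 1 and 17 both map to cache address 1, so each evicts the other and the sequence takes two extra misses.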
How about putting a block in any unused block of the eight blocks?

[Figure: cache blocks, each holding Tag, Word3, Word2, Word1, Word0.]

How can you find it?
Expand the Tag to the block address and compare.
Fully Associative Memory: addressed by its contents

[Figure: the Address is split into a 28-bit Block Address, a Block Offset, and a Byte Offset. Each cache entry holds a Tag and Data. Every Tag is compared (=) with the Block Address in parallel, the comparator outputs are combined (+) to form Hit, and a Mux selects the Data. Valid bit not shown.]

- For a practical Hit time, the Tag of every block must be compared with the Block Address in parallel.
- The Block Offset selects the Word.
- Only feasible for a small number of blocks; the hardware is not feasible for a large Cache.
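A fully associative lookup can be modeled in software as a search over all entries. In hardware the comparisons run in parallel; the loop below only stands in for that bank of comparators. The sketch assumes a 32-bit byte address with 4-byte words, so the block address is the upper 28 bits; the `Entry` class and `lookup` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    valid: bool = False
    tag: int = 0                  # holds the full 28-bit block address
    words: tuple = (0, 0, 0, 0)   # Word0..Word3

def lookup(entries, byte_addr):
    block_addr = byte_addr >> 4             # drop block offset and byte offset
    block_offset = (byte_addr >> 2) & 0x3   # selects the word within the block
    for entry in entries:                   # one comparator per entry in hardware
        if entry.valid and entry.tag == block_addr:
            return True, entry.words[block_offset]
    return False, None

cache = [Entry() for _ in range(8)]
cache[5] = Entry(valid=True, tag=0x0ABCDEF, words=(10, 11, 12, 13))

print(lookup(cache, (0x0ABCDEF << 4) | (2 << 2)))   # block present, Word2
print(lookup(cache, 0x00000000))                    # block not present
```

The cost of the search grows with the number of entries, which is the reason given above for limiting this design to small caches.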
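Make sets of Blocks Associative: Two-way set associative

[Figure: the Address is split into Tag, Index, Block Offset, and Byte Offset. The Index selects one row; each row holds Tag0, Data0, Tag1, Data1, and the last row is 2^k - 1. Valid bit not shown.]

- Addressed by Index
- Compare the two Tags in parallel for a Hit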
Block replacement strategies

For each Index there are 2, 4, ... n options for replacement.

Strategies
1. LRU (Least Recently Used): replace the block that has been unused for the longest time.
2. Random: select the block to be replaced randomly.
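The two strategies differ only in how the victim is chosen within a set. The sketch below assumes each way records the time of its most recent access; that bookkeeping and the function names are illustrative choices, not something the slides specify.

```python
import random

def choose_victim_lru(last_used):
    """last_used[way] is the time of the most recent access to that way.
    Returns the way that has gone unused for the longest time."""
    return min(range(len(last_used)), key=lambda way: last_used[way])

def choose_victim_random(num_ways, rng=random):
    """Returns a way chosen uniformly at random."""
    return rng.randrange(num_ways)

# 4-way set: way 1 was touched longest ago, so LRU evicts it.
print(choose_victim_lru([5, 2, 9, 7]))
print(choose_victim_random(4))
```

Random needs no per-way state at all, while LRU has to update its bookkeeping on every access.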
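Consider a Two Way Associative Cache with 4-word blocks and a size of 8 blocks, or 32 words (4 sets of 2 entries, Entry 0 and Entry 1).

Cache Address (Set) = floor(Word Addr / 4) modulo 4

Word Address   Block Address   Cache Address (Set)   Hit or Miss
6              1               1                     Miss
7              1               1                     Hit
8              2               2                     Miss
9              2               2                     Hit
68             17              1                     Miss
6              1               1                     Hit
7              1               1                     Hit
8              2               2                     Hit
9              2               2                     Hit
69             17              1                     Hit

Blocks 1 and 17 both map to Set 1, but with two entries per set they can reside there together, so every reference in the second pass hits.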
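The two-way example can be replayed with a small set-associative simulator. This sketch assumes LRU replacement and an initially empty cache; the slide does not name a policy for this example, and none is exercised here because no set ever receives a third block.

```python
def simulate_set_associative(word_addrs, num_sets=4, ways=2,
                             words_per_block=4):
    # Each set is a list of block addresses, least recently used first.
    sets = [[] for _ in range(num_sets)]
    trace = []
    for word in word_addrs:
        block = word // words_per_block
        index = block % num_sets
        entries = sets[index]
        if block in entries:
            entries.remove(block)    # re-appended below as most recent
            outcome = "Hit"
        else:
            if len(entries) == ways:
                entries.pop(0)       # evict the least recently used block
            outcome = "Miss"
        entries.append(block)
        trace.append((word, block, index, outcome))
    return trace

for row in simulate_set_associative([6, 7, 8, 9, 68, 6, 7, 8, 9, 69]):
    print(*row, sep="\t")
```

With `ways=1` and `num_sets=8` the same function behaves as the direct-mapped cache from the earlier example.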