EENG449bG/CPSC 439bG Computer Systems, Spring 2004
Lecture 18: Memory Hierarchy Design, Part II
April 13, 2004 — Prof. Andreas Savvides
Q1: Where Can a Block Be Placed in a Cache?
Set Associativity
- Direct mapped = one-way set associative
- Fully associative = set associative with a single set
- Most popular cache configurations in today's processors: direct mapped, 2-way set associative, 4-way set associative
Examples
32 KB cache for a byte-addressable processor, 32-bit address space. Which bits of the address are used for the tag, index, and byte-within-block for the following configuration?
a) 8-byte block size, direct mapped
- 8-byte block size => 3 bits for byte-within-block
- 32 KB / 8 B = 4 K blocks in the cache => 12 bits needed for the index
- 32 bits - (12 + 3) bits = 17 bits remaining => 17 bits for every tag
Address layout: tag [31:15] | index [14:3] | byte-within-block [2:0]
Examples
32 KB cache for a byte-addressable processor, 32-bit address space. Which bits of the address are used for the tag, index, and byte-within-block for the following configuration?
b) 4-byte block size, 8-way set associative
- 4-byte block size => 2 bits for byte-within-block
- 32 KB / (4 B x 8 ways) = 1 K sets in the cache => 10 bits needed for the index
- 32 bits - (10 + 2) bits = 20 bits remaining => 20 bits for every tag
Address layout: tag [31:12] | index [11:2] | byte-within-block [1:0]
Q2: How Is a Block Found If It Is in the Cache?
The address is split into three fields:
- Block offset: selects the desired data from the block
- Index: selects the set
- Tag: compared against the stored tags for a hit
If the cache size remains the same, increasing associativity increases the number of blocks per set => the index shrinks and the tag grows.
Examples
A processor contains a 16-word, direct-mapped cache with a 4-word block size. Which of the following (word) addresses will hit in the cache?
0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12
- 4-word block size => 2 bits for word-within-block
- 16 words / 4 words = 4 blocks in the cache => 2 bits needed for the index
- 6 bits - (2 + 2) bits = 2 bits remaining => 2 bits for every tag
Examples
A processor contains a 16-word, 4-way set-associative cache with a 1-word block size. Which of the following (word) addresses will hit in the cache? (LRU replacement)
0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12
- 1-word block size => 0 bits for word-within-block
- 16 words / (4 ways x 1 word) = 4 sets in the cache => 2 bits needed for the index
- 6 bits - 2 bits = 4 bits remaining => 4 bits for every tag
Address Breakdown
The physical address is 44 bits wide: a 38-bit block address and a 6-bit block offset (blocks are 64 bytes, so the offset needs 6 bits).
Calculating the cache index size: 2^index = cache size / (block size x set associativity). Here the index is 9 bits (512 sets).
Tag size = 38 - 9 = 29 bits.
How to Improve Cache Performance?
Four main categories of optimizations:
1. Reduce the miss penalty: multilevel caches, critical word first, read miss before write miss, merging write buffers, victim caches (last week)
2. Reduce the miss rate: larger block size, larger cache size, higher associativity, way prediction and pseudoassociativity, compiler optimizations (today)
3. Reduce the miss penalty or miss rate via parallelism: non-blocking caches, hardware prefetching, compiler prefetching
4. Reduce the time to hit in the cache: small and simple caches, avoiding address translation, pipelined cache access
Reducing Miss Rate
Way prediction:
- Perform the tag comparison with a single predicted block in every set
  - Fewer comparisons -> simpler hardware -> faster clock
Pseudoassociative caches:
- An access proceeds as in a direct-mapped cache; on a hit it completes immediately
- On a miss, compare against a second entry for a match, where the second entry can be found quickly
Reducing Miss Rate
Compiler optimization: loop interchange & blocking
- Loop interchange: exchange the nesting of the loops so the code accesses the data in the order it is stored

    for (j = 0; j < 100; j++)
        for (i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];

  becomes

    for (i = 0; i < 5000; i++)
        for (j = 0; j < 100; j++)
            x[i][j] = 2 * x[i][j];

- Blocking: maximize the use of the data in the cache before it is replaced
Reducing Miss Penalty
Methods include:
1) Multi-level caches
L2 equations:
  AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
  Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
  AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
Definitions:
- Local miss rate: misses in this cache divided by the total number of accesses to this cache (Miss Rate_L2)
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
Reducing Miss Penalty
Methods include:
2) Critical word first and early restart
- Don't wait for the full block to be loaded before restarting the CPU
  - Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  - Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling in the rest of the words in the block
- Most useful with large blocks
- Caveat: spatial locality means we often want the next sequential word soon, so early restart is not always a benefit
Reducing Miss Penalty
Methods include:
3) Prioritize read misses over writes
- Write buffers create RAW hazards with main-memory reads on cache misses
- Simply waiting for the write buffer to empty can increase the read miss penalty by about 50%
- Instead, check the write buffer contents before the read: if there is no conflict, let the memory access continue
- Write-back caches? A read miss may require writing back a dirty block
  - Normal: write the dirty block to memory, then do the read
  - Instead: copy the dirty block to the write buffer, do the read, then do the write
  - The CPU stalls less since it can restart as soon as the read completes
Reducing Miss Penalty
4) Merging write buffers
- The CPU stalls if the write buffer is full
- The buffer may already contain an entry covering the address being written
- If so, the new write is merged into that entry instead of occupying a new one
Reducing Miss Penalty
5) Victim caches
- How to get the hit time of a direct-mapped cache yet still avoid conflict misses? Add a small buffer that holds blocks recently discarded from the cache
- Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflict misses for a 4 KB direct-mapped data cache