# EENG 449bG/CPSC 439bG Computer Systems, Lecture 18: Memory Hierarchy Design, Part II


EENG 449bG/CPSC 439bG Computer Systems
Lecture 18: Memory Hierarchy Design, Part II
April 13, 2004. Prof. Andreas Savvides, Spring 2004.
http://www.eng.yale.edu/courses/eeng449bG

## Q1: Where Can a Block Be Placed in a Cache?

## Set Associativity

- Direct mapped = one-way set associative
- Fully associative = set associative with a single set
- The most popular cache configurations in today's processors are direct mapped, 2-way set associative, and 4-way set associative

## Example: Direct Mapped

32 KB cache for a byte-addressable processor with a 32-bit address space. Which bits of the address are used for the tag, index, and byte-within-block for the following configuration: 8-byte block size, direct mapped?

- 8-byte block size => 3 bits for byte-within-block
- 32 KB / 8 B = 4 K blocks in the cache => need 12 bits to index
- 32 bits - (12 + 3) bits = 17 bits remaining => need 17 bits for every tag

| Field             | Address bits |
|-------------------|--------------|
| tag               | 31-15        |
| index             | 14-3         |
| byte-within-block | 2-0          |

## Example: 8-Way Set Associative

32 KB cache for a byte-addressable processor with a 32-bit address space. Which bits of the address are used for the tag, index, and byte-within-block for the following configuration: 4-byte block size, 8-way set associative?

- 4-byte block size => 2 bits for byte-within-block
- 32 KB / (4 B x 8) = 1 K sets in the cache => need 10 bits to index
- 32 bits - (10 + 2) bits = 20 bits remaining => need 20 bits for every tag

| Field             | Address bits |
|-------------------|--------------|
| tag               | 31-12        |
| index             | 11-2         |
| byte-within-block | 1-0          |

## Q2: How Is a Block Found If It Is in the Cache?

The address is split into three fields:

- Block offset: selects the desired data from the block
- Index: selects the set
- Tag: compared against the stored tags for a hit

If the cache size remains the same, increasing associativity increases the number of blocks per set, which decreases the index size and increases the tag size.

## Example: Direct-Mapped Trace

A processor contains a 16-word, direct-mapped cache with a 4-word block size. Which of the following addresses will hit in the cache?

0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12

- 4-word block size => 2 bits for word-within-block
- 16 words / 4 words = 4 blocks in the cache => need 2 bits to index
- 6 bits - (2 + 2) bits = 2 bits remaining => need 2 bits for every tag

Tracing the accesses (block index = (address / 4) mod 4):

| Address | Index | Result               | Cache contents afterwards (words, by block frame) |
|---------|-------|----------------------|---------------------------------------------------|
| 0       | 0     | miss                 | 0-3, -, -, -                                      |
| 1       | 0     | hit                  | 0-3, -, -, -                                      |
| 4       | 1     | miss                 | 0-3, 4-7, -, -                                    |
| 5       | 1     | hit                  | 0-3, 4-7, -, -                                    |
| 6       | 1     | hit                  | 0-3, 4-7, -, -                                    |
| 10      | 2     | miss                 | 0-3, 4-7, 8-11, -                                 |
| 14      | 3     | miss                 | 0-3, 4-7, 8-11, 12-15                             |
| 32      | 0     | miss, replaces 0-3   | 32-35, 4-7, 8-11, 12-15                           |
| 4       | 1     | hit                  | 32-35, 4-7, 8-11, 12-15                           |
| 16      | 0     | miss, replaces 32-35 | 16-19, 4-7, 8-11, 12-15                           |
| 23      | 1     | miss, replaces 4-7   | 16-19, 20-23, 8-11, 12-15                         |
| 24      | 2     | miss, replaces 8-11  | 16-19, 20-23, 24-27, 12-15                        |
| 35      | 0     | miss, replaces 16-19 | 32-35, 20-23, 24-27, 12-15                        |
| 12      | 3     | hit                  | 32-35, 20-23, 24-27, 12-15                        |

Hits: 1, 5, 6, the second access to 4, and 12 (5 hits out of 14 accesses).

## Example: 4-Way Set-Associative Trace

A processor contains a 16-word, 4-way set-associative cache with a 1-word block size and LRU replacement. Which of the following addresses will hit in the cache?

0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12

- 1-word block size => 0 bits for word-within-block
- 16 words / (4 blocks per set x 1 word per block) = 4 sets in the cache => need 2 bits to index
- 6 bits - 2 bits = 4 bits remaining => need 4 bits for every tag

Tracing the accesses (set = address mod 4):

| Address | Set | Result                               |
|---------|-----|--------------------------------------|
| 0       | 0   | miss                                 |
| 1       | 1   | miss                                 |
| 4       | 0   | miss                                 |
| 5       | 1   | miss                                 |
| 6       | 2   | miss                                 |
| 10      | 2   | miss                                 |
| 14      | 2   | miss                                 |
| 32      | 0   | miss                                 |
| 4       | 0   | hit                                  |
| 16      | 0   | miss (set 0 now full: 0, 4, 32, 16)  |
| 23      | 3   | miss                                 |
| 24      | 0   | miss, LRU replacement evicts 0       |
| 35      | 3   | miss                                 |
| 12      | 0   | miss, LRU replacement evicts 32      |

Only the second access to 4 hits (1 hit out of 14 accesses).

## Address Breakdown

- The physical address is 44 bits wide: a 38-bit block address and a 6-bit offset
- Blocks are 64 bytes, so the offset needs 6 bits
- The cache index is 9 bits (2^9 = 512 sets)
- Tag size = 38 - 9 = 29 bits

## How to Improve Cache Performance?

Four main categories of optimizations:

1. Reducing miss penalty: multilevel caches, critical word first, read miss before write miss, merging write buffers, and victim caches
2. Reducing miss rate: larger block size, larger cache size, higher associativity, way prediction and pseudoassociativity, and compiler optimizations
3. Reducing the miss penalty or miss rate via parallelism: non-blocking caches, hardware prefetching, and compiler prefetching
4. Reducing the time to hit in the cache: small and simple caches, avoiding address translation, pipelined cache access

Some of these techniques were covered last week; the rest are covered today.

## Reducing Miss Rate

Way prediction:
- Perform the tag comparison with a single predicted block in every set
- Fewer comparisons -> simpler hardware -> faster clock

Pseudoassociative caches:
- An access proceeds as in a direct-mapped cache; on a hit it completes normally
- On a miss, a second entry is compared for a match, where the second entry can be found quickly

## Reducing Miss Rate

Compiler optimization: loop interchange and blocking. Exchange the nesting of the loops so that the code accesses the data in the order it is stored, maximizing the use of each block before it is replaced:

    for (j = 0; j < 100; j++)
        for (i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];

becomes

    for (i = 0; i < 5000; i++)
        for (j = 0; j < 100; j++)
            x[i][j] = 2 * x[i][j];

## Reducing Miss Penalty

Methods include:

1) Multi-level caches

L2 equations:

    AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
    Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
    AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)

Definitions:
- Local miss rate: misses in this cache divided by the total number of accesses to this cache (Miss Rate_L2)
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)

## Reducing Miss Penalty

Methods include:

2) Critical word first and early restart: don't wait for the full block to be loaded before restarting the CPU

- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in
- Very useful with large blocks
- Spatial locality can limit the benefit: the CPU often wants the next sequential word soon, so early restart does not always help

## Reducing Miss Penalty

Methods include:

3) Prioritize read misses over writes

- Write buffers create RAW hazards with main-memory reads on cache misses
- Simply waiting for the write buffer to empty might increase the read miss penalty by 50%
- Instead, check the write buffer contents before the read: if there is no conflict, let the memory access continue
- Write-back caches: a read miss may require writing back a dirty block
  - Normal approach: write the dirty block to memory, then do the read
  - Instead, copy the dirty block to the write buffer, do the read, and then do the write
  - The CPU stalls less, since it can restart as soon as the read completes

## Reducing Miss Penalty

4) Merging the write buffer

- The CPU stalls if the write buffer is full
- The buffer may already contain an entry matching the address being written
- If so, the new write is merged into the existing entry instead of consuming a new one

## Reducing Miss Penalty

5) Victim caches: how to get the hit time of a direct-mapped cache yet still avoid conflict misses?

- Add a small buffer that holds blocks recently discarded from the cache
- Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflict misses for a 4 KB direct-mapped data cache
