1 Lecture 4.3 Memory Hierarchy: Improving Cache Performance
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy (Morgan Kaufmann Publishers)

2 Learning Objectives
- Calculate the effective CPI and the average memory access time
- Given a 32-bit address, specify the index and tag for a direct mapped cache, an n-way set associative cache, and a fully associative cache, respectively
- Calculate the effective CPI for multiple-level caches

3 Coverage
Textbook: Chapter 5.4

4 Measuring Cache Performance
Components of CPU time:
- Program execution cycles (includes cache hit time)
- Memory stall cycles (mainly from cache misses)
CPU time = IC × CPI × CC
         = IC × (CPI_ideal + average memory stall cycles per instruction) × CC

5 Memory Stall Cycles
With simplifying assumptions (reads and writes share one miss rate and miss penalty):
Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                    = IC × (Misses / Instruction) × Miss penalty
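As a minimal sketch in Python (the function name is mine, not from the slides), the simplifying assumption collapses all misses into one misses-per-instruction figure:

```python
def memory_stall_cycles(instruction_count, misses_per_instruction, miss_penalty):
    """Simplifying assumption: reads and writes share one miss penalty."""
    return instruction_count * misses_per_instruction * miss_penalty
```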

6 Cache Performance Example
Given:
- I-cache miss rate = 2%
- D-cache miss rate = 4%
- Miss penalty = 100 cycles
- Base CPI (ideal cache) = 2
- Loads and stores are 36% of instructions
Memory stall cycles per instruction:
- I-cache: 0.02 × 100 = 2 (every instruction is fetched from the I-cache)
- D-cache: 0.36 × 0.04 × 100 = 1.44 (36% of instructions access the D-cache)
Actual CPI = 2 + 2 + 1.44 = 5.44
The real execution is 5.44/2 = 2.72 times slower than the ideal case.
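The same arithmetic as a Python sketch; the function and parameter names are illustrative, not from the slides:

```python
def effective_cpi(base_cpi, i_miss_rate, d_miss_rate, mem_ref_fraction, miss_penalty):
    """Effective CPI = base CPI + memory stall cycles per instruction."""
    i_stalls = i_miss_rate * miss_penalty                     # every instruction is fetched
    d_stalls = mem_ref_fraction * d_miss_rate * miss_penalty  # only loads/stores use the D-cache
    return base_cpi + i_stalls + d_stalls

cpi = effective_cpi(2, 0.02, 0.04, 0.36, 100)
print(cpi, cpi / 2)   # ≈ 5.44, ≈ 2.72x slower than the ideal CPI of 2
```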

7 Average Memory Access Time
Average memory access time (AMAT):
AMAT = Hit time + Miss rate × Miss penalty
Hit time is also important for performance.
Example: CPU with 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, cache miss rate = 5%
AMAT = 1 + 0.05 × 20 = 2 cycles = 2 ns
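The formula in code (a one-line sketch; the helper name is mine):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time; all arguments in cycles."""
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 20))  # 2.0 cycles, i.e. 2 ns with a 1 ns clock
```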

8 Performance Summary
- When CPU performance increases, the miss penalty becomes more significant: a greater proportion of time is spent on memory stalls
- Decreasing the base CPI or increasing the clock rate makes memory stalls account for more CPU cycles
- Cache behavior can't be neglected when evaluating system performance

9 Mechanisms to Reduce Cache Miss Rates
1. Allow more flexible block placement
2. Use multiple levels of caches

10 Reducing Cache Miss Rates #1
1. Allow more flexible block placement
- In a direct mapped cache, a memory block maps to exactly one cache block
- At the other extreme, a memory block could be allowed to map to any cache block: a fully associative cache
- A compromise is to divide the cache into sets, each of which consists of n "ways" (n-way set associative); a memory block maps to a unique set (specified by the index field) and can be placed in any way of that set, so there are n choices
- All of the tags of all of the ways in the set must be searched for a match

11 Associative Caches
Fully associative (only one set):
- Allows a given block to go into any cache entry
- Requires all entries to be searched at once
- One comparator per entry (expensive)
n-way set associative:
- Each set contains n entries
- The block address determines the set: (block address) modulo (#sets in cache)
- Search all entries in a given set at once
- n comparators (less expensive)

12 Associative Cache Example
In the diagram (not reproduced here), 12 is the block address. For the same address, once the number of index bits changes, the number of tag bits changes accordingly. Therefore, the tag value in the 2-way set associative cache and in the fully associative cache differs from the tag value in the direct mapped cache.
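A small Python sketch of the mapping (the function name is mine; the 8-block cache size is assumed from the textbook's figure):

```python
def place_block(block_addr, n_blocks, n_ways):
    """Return (set index, tag) for a block address in an n_ways-way
    set associative cache with n_blocks blocks in total."""
    n_sets = n_blocks // n_ways
    return block_addr % n_sets, block_addr // n_sets

# Block address 12 in an 8-block cache:
print(place_block(12, 8, 1))  # direct mapped (8 sets): set 4, tag 1
print(place_block(12, 8, 2))  # 2-way (4 sets):         set 0, tag 3
print(place_block(12, 8, 8))  # fully associative:      set 0, tag 12
```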

13 Spectrum of Associativity
For a cache with 8 entries. (Figure not reproduced here: the same 8 blocks organized as 1-way, 2-way, 4-way, and 8-way set associative.)

14 Associativity Example
Compare caches containing 4 cache blocks, each block 1 word wide: direct mapped, 2-way set associative, fully associative.
Block address sequence: 0, 8, 0, 6, 8

Direct mapped:
Block address | Cache index | Hit/miss | Cache content after access (indexes 0-3)
      0       |      0      |   miss   | Mem[0]
      8       |      0      |   miss   | Mem[8]
      0       |      0      |   miss   | Mem[0]
      6       |      2      |   miss   | Mem[0], Mem[6]
      8       |      0      |   miss   | Mem[8], Mem[6]
5 misses

15 Associativity Example (cont.)
2-way set associative (LRU replacement):
Block address | Cache index | Hit/miss | Set 0          | Set 1
      0       |      0      |   miss   | Mem[0]         |
      8       |      0      |   miss   | Mem[0], Mem[8] |
      0       |      0      |   hit    | Mem[0], Mem[8] |
      6       |      0      |   miss   | Mem[0], Mem[6] |
      8       |      0      |   miss   | Mem[8], Mem[6] |
4 misses

Fully associative:
Block address | Hit/miss | Cache content after access
      0       |   miss   | Mem[0]
      8       |   miss   | Mem[0], Mem[8]
      0       |   hit    | Mem[0], Mem[8]
      6       |   miss   | Mem[0], Mem[8], Mem[6]
      8       |   hit    | Mem[0], Mem[8], Mem[6]
3 misses
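A small simulator reproduces these counts; it is a sketch (names are mine) that keeps each set in LRU order:

```python
def simulate(refs, n_blocks, n_ways):
    """Count misses for a block-address reference string on an n_ways-way
    set associative cache of n_blocks blocks, with LRU replacement.
    Each set is a list kept in LRU order (most recently used last)."""
    sets = [[] for _ in range(n_blocks // n_ways)]
    misses = 0
    for block in refs:
        s = sets[block % len(sets)]
        if block in s:
            s.remove(block)           # hit: just refresh its LRU position
        else:
            misses += 1
            if len(s) == n_ways:      # set full: evict the least recently used
                s.pop(0)
        s.append(block)
    return misses

refs = [0, 8, 0, 6, 8]
print(simulate(refs, 4, 1))   # direct mapped:     5 misses
print(simulate(refs, 4, 2))   # 2-way:             4 misses
print(simulate(refs, 4, 4))   # fully associative: 3 misses

# The ping-pong string from the next slides:
print(simulate([0, 4] * 4, 4, 1))  # direct mapped: 8 misses
print(simulate([0, 4] * 4, 4, 2))  # 2-way: 2 misses (ping-pong solved)
```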

16 Another Reference String Mapping
Consider the main memory word reference string 0, 4, 0, 4, 0, 4, 0, 4 on a direct mapped cache with four 1-word blocks. Start with an empty cache: all blocks initially marked as not valid.
4-bit word addresses: 0 = 0000 and 4 = 0100. Both map to cache index 00, with tags 00 and 01 respectively, so each reference evicts the other: miss (load Mem(0)), miss (load Mem(4)), miss, miss, ...
8 requests, 8 misses.
This is the ping-pong effect due to conflict misses: two memory locations that map into the same cache block keep replacing each other.

18 Another Reference String Mapping
Consider the same reference string 0, 4, 0, 4, 0, 4, 0, 4, now on a 2-way set associative cache. Start with an empty cache: all blocks initially marked as not valid.
0 and 4 map to the same set but can occupy its two ways: miss (load Mem(0)), miss (load Mem(4)), then hit, hit, hit, ... from there on.
8 requests, 2 misses.
This solves the ping-pong effect that conflict misses cause in a direct mapped cache, since two memory locations that map into the same cache set can now co-exist!

19 How Much Associativity?
Increased associativity decreases the miss rate, but with diminishing returns.
Simulation of a system with a 64 KB D-cache, 16-word blocks, SPEC2000:
- 1-way: 10.3%
- 2-way: 8.6%
- 4-way: 8.3%
- 8-way: 8.1%

20 Benefits of Set Associative Caches
The largest gains come from going from a direct mapped to a 2-way set associative cache (20+% reduction in miss rate).

21 Set Associative Cache Organization
(Figure not reproduced here.) Question: how wide is each cache block?

23 Range of Set Associative Caches
For a fixed size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets. This decreases the size of the index by 1 bit and increases the size of the tag by 1 bit.
Address layout: Tag | Index | Block offset | Byte offset
- Tag: used for the tag compare
- Index: selects the set
- Block offset: selects the word in the block
Increasing associativity shrinks the index and grows the tag, ending at fully associative (only one set), where the tag has all the bits except the block and byte offsets. Decreasing associativity ends at direct mapped (only one way), with smaller tags and only a single comparator.
In 2008, the greater size and power consumption of CAMs generally led to 2-way and 4-way set associativity being built from standard SRAMs with comparators, with 8-way and above being built using CAMs.

24 Example: Tag Size
Given a cache with 4K blocks, a 4-word block size, and a 32-bit byte address, how many tag bits does the cache require for each of the following organizations?
- Direct mapped
- 2-way set associative
- 4-way set associative
- Fully associative
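One way to work this out (a Python sketch; the helper name is mine):

```python
from math import log2

def tag_bits(addr_bits, n_blocks, words_per_block, n_ways):
    """Per-block tag width and total tag bits for the whole cache."""
    byte_offset  = 2                              # 4 bytes per 32-bit word
    block_offset = int(log2(words_per_block))     # selects the word in the block
    index        = int(log2(n_blocks // n_ways))  # selects the set
    tag          = addr_bits - index - block_offset - byte_offset
    return tag, tag * n_blocks

for ways in (1, 2, 4, 4096):  # 4096 ways in one set = fully associative
    width, total = tag_bits(32, 4096, 4, ways)
    print(f"{ways}-way: {width}-bit tags, {total} total tag bits")
# 1-way: 16-bit tags, 65536 total tag bits
# 2-way: 17-bit tags, 69632 total tag bits
# 4-way: 18-bit tags, 73728 total tag bits
# 4096-way: 28-bit tags, 114688 total tag bits
```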

25 Replacement Policy
Direct mapped: no choice.
Set associative:
- Prefer a non-valid entry, if there is one
- Otherwise, choose among the entries in the set
Least-recently used (LRU):
- Choose the one unused for the longest time
- Simple for 2-way, manageable for 4-way, too hard beyond that
Random:
- Gives approximately the same performance as LRU for high associativity
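For a 2-way set, LRU state is a single bit per set naming the least recently used way, which is why it is "simple for 2-way". A sketch (class and field names are mine):

```python
class TwoWaySet:
    """One 2-way set with single-bit LRU."""
    def __init__(self):
        self.tags = [None, None]   # None marks a non-valid entry
        self.lru = 0               # index of the least recently used way

    def access(self, tag):
        """Return True on a hit; on a miss, fill or evict and return False."""
        if tag in self.tags:                     # hit
            self.lru = 1 - self.tags.index(tag)  # the other way becomes LRU
            return True
        way = self.lru                           # default victim: the LRU way
        if None in self.tags:                    # prefer a non-valid entry
            way = self.tags.index(None)
        self.tags[way] = tag
        self.lru = 1 - way
        return False
```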

26 Reducing Cache Miss Rates #2
2. Use multiple levels of caches
- Primary (L1) cache attached to the CPU: small but fast; separate L1 I$ and L1 D$
- Level-2 cache services misses from the L1 cache: larger and slower, but still faster than main memory; a unified cache for both instructions and data
- Main memory services L2 cache misses
- Some high-end systems include an L3 cache

27 Multilevel Cache Example
Given:
- CPU base CPI = 1, clock rate = 4 GHz (clock cycle = 0.25 ns)
- Miss rate/instruction = 2%
- Main memory access time = 100 ns
With just the primary cache:
- Miss penalty = 100 ns / 0.25 ns = 400 cycles
- Effective CPI = 1 + 0.02 × 400 = 9

28 Multilevel Cache Example (cont.)
Now add an L2 cache:
- Access time = 5 ns
- Global miss rate to main memory = 0.5%
L1 miss with L2 hit: penalty = 5 ns / 0.25 ns = 20 cycles
L1 miss with L2 miss: extra penalty = 400 cycles
CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
Performance ratio = 9 / 3.4 = 2.6
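Both cases as a Python sketch (function and parameter names are mine):

```python
def multilevel_cpi(base_cpi, l1_miss_rate, l2_hit_penalty, global_miss_rate, mem_penalty):
    """Every L1 miss pays the L2 access; the global fraction that also
    misses in L2 pays the main memory penalty on top."""
    return base_cpi + l1_miss_rate * l2_hit_penalty + global_miss_rate * mem_penalty

l1_only = 1 + 0.02 * 400                           # 9.0
with_l2 = multilevel_cpi(1, 0.02, 20, 0.005, 400)  # 3.4
print(l1_only, with_l2, l1_only / with_l2)         # 9.0  3.4  ≈ 2.6
```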

29 Multilevel Cache Considerations
Primary (L1) cache:
- Focus on minimal hit time
- Smaller total size, with a smaller block size
L2 cache:
- Focus on a low miss rate to avoid main memory accesses
- Hit time has less overall impact
- Larger total size, with a larger block size and higher levels of associativity

30 Global vs. Local Miss Rate
Global miss rate (GR):
- The fraction of references that miss in all levels of a multilevel cache
- Dictates how often main memory is accessed
Global miss rate so far (GRS):
- The fraction of references that miss in all levels up to a certain level of a multilevel cache
Local miss rate (LR):
- The fraction of references reaching one level of a cache that miss at that level
The L2$ local miss rate >> the global miss rate.

31 Example
LR1 = 5%, LR2 = 20%, LR3 = 50%
GR = LR1 × LR2 × LR3 = 0.05 × 0.2 × 0.5 = 0.005
GRS1 = LR1 = 0.05
GRS2 = LR1 × LR2 = 0.05 × 0.2 = 0.01
GRS3 = LR1 × LR2 × LR3 = 0.05 × 0.2 × 0.5 = 0.005
Assuming miss penalties of Pty1 = 10, Pty2 = 20, and Pty3 = 100 cycles for the three levels:
CPI = 1 + GRS1 × Pty1 + GRS2 × Pty2 + GRS3 × Pty3
    = 1 + 0.05 × 10 + 0.01 × 20 + 0.005 × 100 = 2.2
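The same calculation as a sketch (the helper name is mine; the 10- and 20-cycle penalties above are assumed values consistent with the 2.2 result):

```python
def cpi_from_local_rates(base_cpi, local_miss_rates, miss_penalties):
    """The global-miss-rate-so-far at level i is the product of the local
    miss rates of levels 1..i; each multiplies that level's miss penalty."""
    cpi, grs = base_cpi, 1.0
    for lr, penalty in zip(local_miss_rates, miss_penalties):
        grs *= lr                  # GRS_i
        cpi += grs * penalty
    return cpi

print(cpi_from_local_rates(1, [0.05, 0.20, 0.50], [10, 20, 100]))  # ≈ 2.2
```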

32 Sources of Cache Misses
Compulsory (cold start, process migration, first reference):
- First access to a block; a "cold" fact of life, not a whole lot you can do about it
- If you are going to run millions of instructions, compulsory misses are insignificant
- Solution: increase the block size (increases miss penalty; very large blocks could increase the miss rate)
Capacity:
- The cache cannot contain all the blocks accessed by the program
- Solution: increase the cache size (may increase access time)
Conflict (collision):
- Multiple memory locations map to the same cache location
- Solution 1: increase the cache size
- Solution 2: increase associativity (may increase access time)
Keep in mind that the miss rate is only one part of the equation; you also have to worry about cache access time and miss penalty, so do not optimize the miss rate alone. Finally, there is another source of cache misses not covered here: invalidation misses, caused when another process (such as I/O) updates main memory, forcing the cache to be flushed to avoid inconsistency between memory and cache.

33 Cache Design Trade-offs
Design change          | Effect on miss rate         | Negative performance effect
Increase cache size    | Decreases capacity misses   | May increase access time
Increase associativity | Decreases conflict misses   | May increase access time
Increase block size    | Decreases compulsory misses | Increases miss penalty; for a very large block size, may increase miss rate due to pollution

