1 CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy.

2 Cache Optimizations
1. Reducing miss penalty: multilevel caches, critical word first, read miss before write miss, merging write buffers, and victim caches
2. Reducing miss rate: larger block size, larger cache size, higher associativity, way prediction, and compiler optimization
3. Reducing miss penalty or miss rate via parallelism: hardware and compiler prefetching
4. Reducing time to hit in cache: small and simple caches, and pipelined cache access

3 Three Categories of Misses (Three C’s)
Three C’s: Compulsory, Capacity, and Conflict
Compulsory ─ The very first access to a block cannot be in the cache; also called cold-start misses or first-reference misses
Capacity ─ If the cache cannot contain all the blocks needed during execution, capacity misses (in addition to compulsory misses) will occur because of blocks being discarded and later retrieved
Conflict ─ If the block placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block may be discarded and later retrieved if too many blocks map to its set; also called collision misses or interference misses

4 Figure 5.15. Total miss rate (top) and distribution of miss rate (bottom) for each size data cache according to the three C’s.

5 Interpretation of Figure 5.15
The figure shows the relative frequencies of cache misses, broken down by the “three C’s”:
Compulsory misses are those that occur even in an infinite cache
Capacity misses are those that occur (beyond the compulsory misses) in a fully associative cache of the given size
Conflict misses are those added in going from fully associative to 8-way associative, 4-way associative, and so on
To show the benefit of associativity, conflict misses are divided by each decrease in associativity:
 8-way ─ Conflict misses from fully associative to 8-way set associative
 4-way ─ Conflict misses from 8-way to 4-way set associative
 2-way ─ Conflict misses from 4-way to 2-way set associative
 1-way ─ Conflict misses from 2-way to 1-way (direct mapped)
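As a rough sketch of the arithmetic behind this breakdown, the snippet below simply subtracts the three measured miss rates; the numeric values are made-up placeholders, not data from Figure 5.15.

#include <stdio.h>

/* Sketch of the 3C breakdown; the three input rates are hypothetical. */
int main(void) {
    double infinite_cache = 0.002;  /* compulsory: misses even with infinite capacity      */
    double fully_assoc    = 0.025;  /* compulsory + capacity: fully associative, same size */
    double actual_cache   = 0.033;  /* total: the real set-associative cache under study   */

    double compulsory = infinite_cache;
    double capacity   = fully_assoc  - infinite_cache;
    double conflict   = actual_cache - fully_assoc;

    printf("compulsory = %.3f, capacity = %.3f, conflict = %.3f\n",
           compulsory, capacity, conflict);
    return 0;
}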

6 Reducing Miss Rate 1. Larger Block Size 2. Larger Caches 3. Higher Associativity 4. Way Prediction 5. Compiler Optimizations

7 1. Larger Block Size
Larger block sizes reduce compulsory misses, due to spatial locality
Larger blocks increase the miss penalty
Larger blocks increase conflict misses, and even capacity misses if the cache is small
Do not increase the block size to a value beyond which either the miss rate or the average memory access time increases

8 Figure 5.16. Miss rate versus block size

9 Figure 5.18. Average memory access time (in clock cycles) versus block size for four caches sized 4KB, 16KB, 64KB, and 256KB
Block sizes of 32B and 64B dominate; the smallest average time for each cache size is marked with *
What is the memory access overhead included in the miss penalty?

Block size | Miss penalty |   4KB   |  16KB  |  64KB  | 256KB
   16B     |      82      |  8.027  | 4.231  | 2.673  | 1.894
   32B     |      84      |  7.082* | 3.411  | 2.134  | 1.588
   64B     |      88      |  7.160  | 3.323* | 1.933* | 1.449*
  128B     |      96      |  8.469  | 3.659  | 1.979  | 1.470
  256B     |     112      | 11.651  | 4.685  | 2.288  | 1.549
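One reading of the miss-penalty column (an assumption, not stated on the slide) is 80 clock cycles of overhead plus 1 clock per 8 bytes transferred, together with a 1-cycle hit time; under those assumptions the sketch below reproduces the 4KB column. The miss rates are back-solved from the table rather than quoted from the text.

#include <stdio.h>

/* Sketch: reproducing the 4KB column of Fig. 5.18 under the assumptions above. */
int main(void) {
    int    block_size[] = {16, 32, 64, 128, 256};
    double miss_rate[]  = {0.0857, 0.0724, 0.0700, 0.0778, 0.0951}; /* back-solved */

    for (int i = 0; i < 5; i++) {
        int miss_penalty = 80 + block_size[i] / 8;       /* 82, 84, 88, 96, 112 cycles */
        double amat = 1.0 + miss_rate[i] * miss_penalty; /* AMAT = hit + MR * MP       */
        printf("%3dB blocks: penalty %3d cycles, AMAT %.3f cycles\n",
               block_size[i], miss_penalty, amat);
    }
    return 0;
}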

10 2. Larger Caches
An obvious way to reduce capacity misses in Fig. 5.15 is to increase the capacity of the cache
The drawback is a longer hit time and a higher dollar cost
This technique is especially popular in off-chip caches: the size of second- or third-level caches in 2001 equals the size of main memory in desktop computers in 1990

11 3. Higher Associativity
Figure 5.15 shows how miss rates improve with higher associativity. There are two general rules of thumb:
1. 8-way set associative is, for practical purposes, as effective in reducing misses as fully associative
2. A direct-mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2
Improving one aspect of the average memory access time comes at the expense of another:
1. Increasing block size reduces miss rate while increasing miss penalty
2. Greater associativity comes at the cost of an increased hit time

12 Fig. 5.19. Average memory access time (in clock cycles) versus associativity
Italicized entries in the original figure show where higher associativity increases the average memory access time
Smaller caches need higher associativity

Cache size | 1-way | 2-way | 4-way | 8-way
   4KB     | 3.44  | 3.25  | 3.22  | 3.28
   8KB     | 2.69  | 2.58  | 2.55  | 2.62
  16KB     | 2.23  | 2.40  | 2.46  | 2.53
  32KB     | 2.06  | 2.30  | 2.37  | 2.45
  64KB     | 1.92  | 2.14  | 2.18  | 2.25
 128KB     | 1.52  | 1.84  | 1.92  | 2.00
 256KB     | 1.32  | 1.66  | 1.74  | 1.82
 512KB     | 1.20  | 1.55  | 1.59  | 1.66

13 4. Way Prediction
This approach reduces conflict misses and yet maintains the hit speed of a direct-mapped cache
Extra bits are kept in the cache to predict the way of the next cache access
The Alpha 21264 uses way prediction in its 2-way set associative instruction cache: a prediction bit added to each block selects which block to try on the next cache access
If the predictor is correct, the instruction cache latency is 1 clock cycle; if not, the cache tries the other block, changes the way predictor, and has a latency of 3 clock cycles
Simulations on SPEC95 suggest a way prediction accuracy of 85%
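A rough sketch of this lookup flow for a generic 2-way set with one prediction bit; the struct layout and the probe() helper are illustrative, not the 21264's actual implementation.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative 2-way set with a single way-prediction field (hypothetical layout). */
typedef struct {
    uint64_t tag[2];
    bool     valid[2];
    int      predicted_way;          /* which way to try first */
} set_t;

static bool probe(const set_t *s, int way, uint64_t tag) {
    return s->valid[way] && s->tag[way] == tag;
}

/* Returns the hit latency in cycles: 1 on a correct prediction, 3 when the
 * other way hits (and the predictor is retrained); -1 stands for a miss. */
static int access_latency(set_t *s, uint64_t tag) {
    int first = s->predicted_way;
    if (probe(s, first, tag))
        return 1;                    /* predicted way hits: fast case */
    int other = 1 - first;
    if (probe(s, other, tag)) {
        s->predicted_way = other;    /* retrain the predictor */
        return 3;                    /* slower hit in the other way */
    }
    return -1;                       /* miss: handled elsewhere */
}

int main(void) {
    set_t s = { .tag = {0x12, 0x34}, .valid = {true, true}, .predicted_way = 0 };
    printf("%d cycles\n", access_latency(&s, 0x34));  /* mispredicted way: 3 */
    printf("%d cycles\n", access_latency(&s, 0x34));  /* now predicted correctly: 1 */
    return 0;
}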

14 5. Compiler Optimizations
Code can be rearranged without affecting correctness:
 Reordering the procedures of a program might reduce instruction miss rates by reducing conflict misses. Use profiling information to determine likely conflicts between groups of instructions
 Aim for better efficiency from long cache blocks: aligning basic blocks so that the entry point is at the beginning of a cache block decreases the chance of a cache miss for sequential code
 Improve the spatial and temporal locality of data

15 Loop Interchange
Some programs have nested loops that access data in memory in nonsequential order. Simply exchanging the nesting of the loops can make the code access the data in the order they are stored. Assuming the arrays do not fit in the cache, this technique reduces misses by improving spatial locality: reordering maximizes use of data in a cache block before the data are discarded.

/* Before */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2*x[i][j];

/* After */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2*x[i][j];

16 Reducing Hit Time: Small and Simple Caches
 A time-consuming part of a cache hit is using the index portion of the address to read tag memory and then compare it to the address
 We already know that smaller hardware is faster
 It is critical to keep the cache small enough to fit on the same chip as the processor, to avoid the time penalty of going off chip
 Keep the cache simple, e.g., use direct mapping; a main advantage is that we can overlap the tag check with transmission of the data (sketched below)
 We use small and simple caches for level-1 caches
 For level-2 caches, some designs strike a compromise by keeping the tags on chip and the data off chip, promising a fast tag check yet providing the greater capacity of separate memory chips
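A minimal sketch of the address split behind a direct-mapped lookup; the 8KB size and 32B blocks are assumptions chosen for illustration, not a particular machine.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical direct-mapped cache: 8KB with 32B blocks -> 256 lines.
 * 5 offset bits, 8 index bits, and the remaining address bits form the tag. */
#define BLOCK_BITS 5
#define INDEX_BITS 8
#define NUM_LINES  (1u << INDEX_BITS)

typedef struct {
    uint64_t tag;
    bool     valid;
    uint8_t  data[1 << BLOCK_BITS];
} line_t;

static line_t cache[NUM_LINES];

/* With only one candidate line per index, the data can be read (and even
 * forwarded) in parallel with the tag compare; the result is simply squashed
 * if the tag check fails. */
static bool hit(uint64_t addr) {
    uint64_t index = (addr >> BLOCK_BITS) & (NUM_LINES - 1);
    uint64_t tag   = addr >> (BLOCK_BITS + INDEX_BITS);
    return cache[index].valid && cache[index].tag == tag;
}

int main(void) {
    uint64_t a = 0x1234;
    cache[(a >> BLOCK_BITS) & (NUM_LINES - 1)] =
        (line_t){ .tag = a >> (BLOCK_BITS + INDEX_BITS), .valid = true };
    printf("hit(0x1234) = %d\n", hit(a));       /* 1: tag matches      */
    printf("hit(0x9999) = %d\n", hit(0x9999));  /* 0: line not filled  */
    return 0;
}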

17 Fig. 5.26. Summary of Cache Optimizations
A + means the technique improves that factor, a − means it hurts it; the last column is a rough measure of hardware complexity (0 being easiest)

Technique                                  | Miss penalty | Miss rate | Hit time | Hardware complexity
Multilevel caches                          |      +       |           |          | 2
Critical word first and early restart      |      +       |           |          | 2
Priority to read misses over write misses  |      +       |           |          | 1
Merging write buffer                       |      +       |           |          | 1
Victim caches                              |      +       |     +     |          | 2
Larger block size                          |      −       |     +     |          | 0
Larger cache size                          |              |     +     |    −     | 1
Higher associativity                       |              |     +     |    −     | 1
Way prediction                             |              |     +     |          | 2
Compiler techniques                        |              |     +     |          | 0
Small and simple caches                    |              |     −     |    +     | 0
Pipelined cache access                     |              |           |    +     | 1

18 Virtual Cache
The guideline of making the common case fast suggests that we use virtual addresses for the cache, since hits are much more common than misses
Such caches are termed virtual caches, with physical cache used to identify the traditional cache that uses physical addresses
It is important to distinguish two tasks: indexing the cache and comparing addresses
The issues are whether a virtual or physical address is used to index the cache and whether a virtual or physical address is used in the tag comparison
Full virtual addressing for both indices and tags eliminates address translation time from a cache hit
Why doesn’t everyone build virtually addressed caches?

19 Reasons against Virtual Caches
The first reason is protection. Page-level protection is checked as part of the virtual-to-physical address translation.
The second reason is that every time a process is switched, the virtual addresses refer to different physical addresses, requiring the cache to be flushed. One solution is to increase the width of the cache address tag with a process-identifier tag (PID).
The third reason is that operating systems and user programs may use two different virtual addresses for the same physical address. These duplicate addresses could result in two copies of the same data in a virtual cache; if one is modified, the other will have the wrong value. With a physical cache this wouldn’t happen, since accesses would first be translated to the same physical cache block.
The fourth reason is I/O. I/O typically uses physical addresses and thus would require mapping to virtual addresses to interact with a virtual cache.

20 One Good Choice
One way to get the best of both virtual and physical caches is to use part of the page offset (the part that is identical in both virtual and physical addresses) to index the cache
At the same time as the cache is being read using the index, the virtual part of the address is translated, and the tag match uses physical addresses
This strategy allows the cache read to begin immediately, and yet the tag comparison is still with physical addresses
The limitation of this virtually indexed, physically tagged alternative is that a direct-mapped cache can be no bigger than the page size

21 Example
In this figure, the index is 9 bits and the cache block offset is 6 bits
To use the trick on the previous slide, what should the virtual page size be?
The virtual page size would have to be at least 2^(9+6) bytes, or 32KB
What is the size of the cache?
64KB (= 2 × 32KB, i.e., 2-way set associative with 32KB per way)
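The arithmetic from this example as a small sketch; the 2-way figure is inferred from the 64KB answer above, not stated separately on the slide.

#include <stdio.h>

int main(void) {
    int index_bits  = 9;   /* from the slide */
    int offset_bits = 6;   /* 64B cache blocks */
    int ways        = 2;   /* inferred from the 64KB = 2 x 32KB answer */

    long min_page = 1L << (index_bits + offset_bits);  /* 2^15 bytes = 32KB */
    long cache_sz = min_page * ways;                   /* 64KB */

    printf("minimum page size: %ldKB\n", min_page / 1024);
    printf("cache size:        %ldKB\n", cache_sz / 1024);
    return 0;
}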

22 How to Build a Large Cache
Associativity can keep the index in the physical part of the address and yet still support a large cache
Doubling associativity and doubling the cache size do not change the size of the index
Pentium III, with 8KB pages, avoids translation with its 16KB cache by using 2-way set associativity
IBM 3033 cache is 16-way set associative, even though studies show that there is little benefit to miss rates above 8-way associativity. This high associativity allows a 64KB cache to be addressed with a physical index, despite the handicap of 4KB pages.
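A small sketch of the constraint implied here, namely that the cache can be indexed with untranslated bits as long as the cache size is at most the page size times the associativity, checked for the two machines using the figures stated on the slide.

#include <stdio.h>

/* For a virtually indexed, physically tagged cache, the index + block-offset
 * bits must fit inside the page offset, i.e. cache size <= page size * ways. */
static void check(const char *name, long cache, long page, int ways) {
    printf("%-12s %3ldKB cache, %ldKB pages, %2d-way: %s\n",
           name, cache / 1024, page / 1024, ways,
           cache <= page * (long)ways ? "index fits in page offset"
                                      : "needs translation before indexing");
}

int main(void) {
    check("Pentium III", 16 * 1024, 8 * 1024, 2);   /* 16KB <= 8KB * 2  */
    check("IBM 3033",    64 * 1024, 4 * 1024, 16);  /* 64KB <= 4KB * 16 */
    return 0;
}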

