CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy.

Cache Optimizations
1. Reducing miss penalty: multilevel caches, critical word first, read miss before write miss, merging write buffers, and victim caches
2. Reducing miss rate: larger block size, larger cache size, higher associativity, way prediction, and compiler optimization
3. Reducing miss penalty or miss rate via parallelism: hardware and compiler prefetching
4. Reducing time to hit in cache: small and simple caches, and pipelined cache access

Three Categories of Misses (Three C's)
Three C's: Compulsory, Capacity, and Conflict
- Compulsory: the very first access to a block cannot be in the cache; also called cold-start misses or first-reference misses
- Capacity: if the cache cannot contain all the blocks needed during execution, capacity misses (in addition to compulsory misses) will occur because blocks are discarded and later retrieved
- Conflict: if the block placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block may be discarded and later retrieved if too many blocks map to its set; also called collision misses or interference misses
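
To make the three categories concrete, here is a minimal C sketch that classifies each miss of a direct-mapped cache: a miss is compulsory if the block has never been referenced before, capacity if it would also miss in a fully associative LRU cache of the same total size, and conflict otherwise. The 8-block cache size, the block-address trace, and all names are illustrative assumptions, not anything from the figures.

    #include <stdio.h>
    #include <string.h>

    #define NBLOCKS   8          /* both caches hold 8 blocks (assumption)      */
    #define TRACE_LEN 12

    static long dm[NBLOCKS];     /* direct-mapped cache: one tag per set        */
    static long fa[NBLOCKS];     /* fully associative cache: tags               */
    static int  fa_age[NBLOCKS]; /* ages for LRU replacement                    */
    static long seen[1024];      /* every block address referenced so far       */
    static int  nseen;

    /* Record the block and report whether it had been referenced before. */
    static int seen_before(long b) {
        for (int i = 0; i < nseen; i++)
            if (seen[i] == b) return 1;
        seen[nseen++] = b;
        return 0;
    }

    /* Access the fully associative LRU cache; return 1 on hit, 0 on miss. */
    static int fa_access(long b) {
        int victim = 0;
        for (int i = 0; i < NBLOCKS; i++) fa_age[i]++;
        for (int i = 0; i < NBLOCKS; i++)
            if (fa[i] == b) { fa_age[i] = 0; return 1; }
        for (int i = 1; i < NBLOCKS; i++)
            if (fa_age[i] > fa_age[victim]) victim = i;
        fa[victim] = b;
        fa_age[victim] = 0;
        return 0;
    }

    int main(void) {
        /* Illustrative trace of block addresses; 0, 8, 16, 24 all map to set 0. */
        long trace[TRACE_LEN] = {0, 8, 0, 8, 1, 2, 3, 16, 24, 16, 1, 0};
        memset(dm, 0xFF, sizeof dm);     /* all bytes 0xFF: tag -1 = invalid */
        memset(fa, 0xFF, sizeof fa);

        for (int t = 0; t < TRACE_LEN; t++) {
            long b     = trace[t];
            int  first = !seen_before(b);
            int  fahit = fa_access(b);   /* the reference cache sees every access */
            int  set   = (int)(b % NBLOCKS);

            if (dm[set] == b) { printf("block %2ld: hit\n", b); continue; }
            dm[set] = b;                 /* allocate the block on a miss */
            if (first)       printf("block %2ld: compulsory miss\n", b);
            else if (!fahit) printf("block %2ld: capacity miss\n", b);
            else             printf("block %2ld: conflict miss\n", b);
        }
        return 0;
    }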

Figure 5.15. Total miss rate (top) and distribution of miss rate (bottom) for each data cache size, broken down according to the three C's.

Interpretation of Figure 5.15
The figure shows the relative frequencies of cache misses, broken down by the three C's:
- Compulsory misses are those that occur even in an infinite cache
- Capacity misses are those that occur in a fully associative cache
- Conflict misses are those added in going from fully associative to 8-way associative, 4-way associative, and so on
To show the benefit of associativity, conflict misses are divided by each decrease in associativity:
- 8-way: conflict misses from fully associative to 8-way associative
- 4-way: conflict misses from 8-way to 4-way associative
- 2-way: conflict misses from 4-way to 2-way associative
- 1-way: conflict misses from 2-way to 1-way (direct mapped)

Reducing Miss Rate
1. Larger Block Size
2. Larger Caches
3. Higher Associativity
4. Way Prediction
5. Compiler Optimizations

1. Larger Block Size
- A larger block size reduces compulsory misses, due to spatial locality
- Larger blocks increase the miss penalty
- Larger blocks increase conflict misses, and even capacity misses if the cache is small
- Do not increase the block size beyond the value at which either the miss rate or the average memory access time starts to increase
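
To see why the miss penalty grows with block size, the small C sketch below assumes a memory system with an 80-cycle access overhead that then delivers 16 bytes every 2 clock cycles. These numbers are illustrative assumptions in the spirit of the textbook example, not values taken from the figures.

    #include <stdio.h>

    /* Miss penalty = fixed access overhead + time to stream the block in. */
    static double miss_penalty(int block_bytes) {
        const double overhead_cycles = 80.0;   /* assumed memory latency     */
        const double cycles_per_16B  = 2.0;    /* assumed transfer bandwidth */
        return overhead_cycles + (block_bytes / 16.0) * cycles_per_16B;
    }

    int main(void) {
        for (int b = 16; b <= 256; b *= 2)
            printf("%3dB block: miss penalty = %.0f cycles\n", b, miss_penalty(b));
        return 0;
    }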

Figure. Miss rate versus block size.

Figure. Average memory access time versus block size for four caches sized 4KB, 16KB, 64KB, and 256KB. Block sizes of 32B and 64B dominate; the smallest average time for each cache size is shown in italic in the original table. What is the memory access overhead included in the miss penalty?
(Table of access times for block sizes 16B through 256B omitted.)

2. Larger Caches
- An obvious way to reduce the capacity misses in Figure 5.15 is to increase the capacity of the cache
- The drawback is a longer hit time and a higher dollar cost
- This technique is especially popular in off-chip caches: the size of second- or third-level caches in 2001 equaled the size of main memory in desktop computers in 1990

3. Higher Associativity
Figure 5.15 shows how miss rates improve with higher associativity. There are two general rules of thumb:
1. An 8-way set-associative cache is, for practical purposes, as effective in reducing misses as a fully associative cache
2. A direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2
Improving one aspect of the average memory access time comes at the expense of another:
1. Increasing block size reduces the miss rate while increasing the miss penalty
2. Greater associativity comes at the cost of an increased hit time
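
The trade-off can be made explicit with the average memory access time formula, AMAT = hit time + miss rate x miss penalty. The C sketch below plugs in purely illustrative numbers (the hit times, miss rates, and penalties are assumptions, not data from the figures) to show how the two effects combine.

    #include <stdio.h>

    /* Average memory access time: hit time + miss rate * miss penalty. */
    static double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* Larger blocks: the miss rate falls, but each miss costs more cycles.   */
        printf("16B blocks: %.2f cycles\n", amat(1.0, 0.050, 82.0));
        printf("64B blocks: %.2f cycles\n", amat(1.0, 0.040, 88.0));

        /* Higher associativity: the miss rate falls, but the hit time grows.     */
        printf("1-way:      %.2f cycles\n", amat(1.00, 0.040, 50.0));
        printf("2-way:      %.2f cycles\n", amat(1.10, 0.032, 50.0));
        return 0;
    }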

Figure. Average memory access time versus associativity (1-way, 2-way, 4-way, and 8-way) for a range of cache sizes starting at 4KB. Italic entries in the original table show where higher associativity increases the average memory access time; smaller caches need higher associativity.
(Table values omitted.)

4. Way Prediction
- This approach reduces conflict misses while maintaining the hit speed of a direct-mapped cache
- Extra bits are kept in the cache to predict the way of the next cache access
- The Alpha 21264 uses way prediction in its 2-way set-associative instruction cache: a prediction bit is added to each block and is used to select which block to try on the next cache access
- If the predictor is correct, the instruction cache latency is 1 clock cycle; if not, the cache tries the other block, changes the way predictor, and has a latency of 3 clock cycles
- SPEC95 suggests a way-prediction accuracy of 85%
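
The following C sketch shows the control flow of such a lookup. The structure layout, set count, and latency bookkeeping are illustrative assumptions, not the 21264's actual implementation.

    #include <stdint.h>
    #include <stdbool.h>

    #define NSETS 512

    struct set {
        uint64_t tag[2];
        bool     valid[2];
        uint8_t  predicted_way;   /* which way to try first on the next access */
    };

    static struct set cache[NSETS];

    /* Returns the access latency in cycles: 1 on a correct prediction,
       3 when the other way must be tried; miss handling (refill) not shown. */
    int lookup(uint64_t tag, unsigned set_index, bool *hit)
    {
        struct set *s = &cache[set_index % NSETS];
        unsigned first = s->predicted_way;      /* try the predicted way first... */
        if (s->valid[first] && s->tag[first] == tag) {
            *hit = true;
            return 1;                           /* ...hit in 1 clock cycle */
        }
        unsigned other = first ^ 1u;            /* ...otherwise try the other way */
        if (s->valid[other] && s->tag[other] == tag) {
            s->predicted_way = (uint8_t)other;  /* retrain the way predictor */
            *hit = true;
            return 3;                           /* slower hit */
        }
        *hit = false;                           /* miss: go to the next level */
        return 3;
    }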

5. Compiler Optimizations
Code can be rearranged without affecting correctness:
- Reordering the procedures of a program may reduce instruction miss rates by reducing conflict misses; use profiling information to determine likely conflicts between groups of instructions
- Aim for better efficiency from long cache blocks: aligning basic blocks so that the entry point is at the beginning of a cache block decreases the chance of a cache miss for sequential code
- Improve the spatial and temporal locality of data

Loop Interchange
Some programs have nested loops that access data in memory in nonsequential order. Simply exchanging the nesting of the loops can make the code access the data in the order they are stored. Assuming the arrays do not fit in the cache, this technique reduces misses by improving spatial locality: the reordering maximizes use of the data in a cache block before the data are discarded.

    /* Before */
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2*x[i][j];

    /* After */
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2*x[i][j];

Reducing Hit Time: Small and Simple Caches
- A time-consuming part of a cache hit is using the index portion of the address to read tag memory and then compare it to the address
- We already know that smaller hardware is faster
- It is critical to keep the cache small enough to fit on the same chip as the processor, to avoid the time penalty of going off chip
- Keep the cache simple, e.g., direct mapped; a main advantage is that the tag check can be overlapped with transmission of the data
- We use small and simple caches for level-1 caches
- For level-2 caches, some designs strike a compromise by keeping the tags on chip and the data off chip, promising a fast tag check yet providing the greater capacity of separate memory chips
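
The hit check described in the first bullet amounts to the address breakdown below. This is a minimal C sketch; the cache geometry (64B blocks, 256 sets, i.e. a 16KB direct-mapped cache) is an assumed example, not a particular machine.

    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_BITS  6                     /* 64-byte blocks            */
    #define INDEX_BITS  8                     /* 256 sets -> 16KB of data  */
    #define NSETS       (1u << INDEX_BITS)

    static uint64_t tag_mem[NSETS];
    static bool     valid[NSETS];

    bool hit(uint64_t paddr)
    {
        uint64_t index = (paddr >> BLOCK_BITS) & (NSETS - 1);
        uint64_t tag   = paddr >> (BLOCK_BITS + INDEX_BITS);
        /* Read tag memory with the index, then compare against the address tag;
           in a direct-mapped cache the data can be read, and even forwarded,
           in parallel with this check. */
        return valid[index] && tag_mem[index] == tag;
    }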

Figure. Summary of cache optimizations. A "+" means the technique improves that factor, a "−" means it hurts it, and a blank means it has no effect; hardware complexity is rated from 0 (easiest) upward.

    Technique                               | Miss penalty | Miss rate | Hit time | HW complexity
    Multilevel caches                       |      +       |           |          |      2
    Critical word first and early restart   |      +       |           |          |      2
    Priority to read misses over write misses|     +       |           |          |      1
    Merging write buffer                    |      +       |           |          |      1
    Victim caches                           |      +       |     +     |          |      2
    Larger block size                       |      −       |     +     |          |      0
    Larger cache size                       |              |     +     |    −     |      1
    Higher associativity                    |              |     +     |    −     |      1
    Way prediction                          |              |     +     |          |      2
    Compiler techniques                     |              |     +     |          |      0
    Small and simple caches                 |              |     −     |    +     |      0
    Pipelined cache access                  |              |           |    +     |      1

Virtual Cache
- The guideline of making the common case fast suggests that we use virtual addresses for the cache, since hits are much more common than misses
- Such caches are termed virtual caches, with physical cache used to identify the traditional cache that uses physical addresses
- It is important to distinguish two tasks: indexing the cache and comparing addresses; the issues are whether a virtual or physical address is used to index the cache and whether a virtual or physical address is used in the tag comparison
- Full virtual addressing, for both index and tags, eliminates address translation time from a cache hit
- Why doesn't everyone build virtually addressed caches?

Reasons against Virtual Caches
1. Protection: page-level protection is checked as part of the virtual-to-physical address translation.
2. Process switches: every time a process is switched, the virtual addresses refer to different physical addresses, requiring the cache to be flushed. One solution is to widen the cache address tag with a process-identifier tag (PID).
3. Synonyms: operating systems and user programs may use two different virtual addresses for the same physical address. These duplicate addresses could result in two copies of the same data in a virtual cache; if one is modified, the other will have the wrong value. With a physical cache this cannot happen, since the accesses would first be translated to the same physical cache block.
4. I/O: I/O typically uses physical addresses and thus would require mapping to virtual addresses to interact with a virtual cache.

One Good Choice
- One way to get the best of both virtual and physical caches is to use part of the page offset (the part that is identical in both the virtual and physical addresses) to index the cache
- While the cache is being read using that index, the virtual part of the address is translated, and the tag match uses physical addresses
- This strategy allows the cache read to begin immediately, yet the tag comparison is still done with physical addresses
- The limitation of this virtually indexed, physically tagged alternative is that a direct-mapped cache can be no bigger than the page size

Example
- In this figure, the index is 9 bits and the cache block offset is 6 bits
- To use the trick on the previous slide, what should the virtual page size be? The virtual page size would have to be at least 2^(9+6) bytes, or 32KB
- What is the size of the cache? 64KB (= 2 x 32KB, since the cache is 2-way set associative)
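
A small C sketch of the same arithmetic, with the 9-bit index, 6-bit block offset, and 2 ways taken from the example above (the variable names are just illustrative):

    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned index_bits  = 9;
        unsigned offset_bits = 6;
        unsigned ways        = 2;

        /* Index + offset must fit inside the page offset, so the minimum
           page size is 2^(9+6) bytes = 32KB... */
        unsigned long min_page   = 1ul << (index_bits + offset_bits);
        /* ...and the cache holds ways * 2^(index+offset) bytes = 64KB. */
        unsigned long cache_size = ways * min_page;

        printf("minimum page size: %lu KB\n", min_page   / 1024);  /* 32 KB */
        printf("cache size:        %lu KB\n", cache_size / 1024);  /* 64 KB */
        assert(cache_size == 2 * min_page);
        return 0;
    }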

How to Build a Large Cache
- Associativity can keep the index within the physical part of the address and yet still support a large cache
- Doubling the associativity and doubling the cache size together do not change the size of the index
- The Pentium III, with 8KB pages, avoids translation with its 16KB cache by using 2-way set associativity
- The IBM 3033 cache is 16-way set associative, even though studies show that there is little benefit to miss rates above 8-way associativity; this high associativity allows a 64KB cache to be addressed with a physical index, despite the handicap of 4KB pages