
1 CSCE 430/830 Computer Architecture Introduction to Memory Hierarchy
Adapted from Professor David Patterson, Electrical Engineering and Computer Sciences, University of California, Berkeley

2 Since 1980, CPU has outpaced DRAM ...
Q. How do architects address this gap? A. Put smaller, faster “cache” memories between the CPU and DRAM, creating a “memory hierarchy”.
[Figure: performance (1/latency) vs. year, 1980-2000, log scale 1-1000. CPU performance grew ~60% per year (2x in 1.5 years); DRAM performance grew ~9% per year (2x in 10 years); the processor-memory gap grew ~50% per year.]

3 1977: DRAM faster than microprocessors
Apple ][ (1977), designed by Steve Wozniak and Steve Jobs. CPU cycle time: 1000 ns; DRAM access time: 400 ns.

4 Levels of the Memory Hierarchy
From the upper level (smaller, faster) to the lower level (larger, cheaper), with the staging/transfer unit and who manages each move:
CPU registers: 100s of bytes, <10s ns. Staging unit: instructions/operands, 1-8 bytes, moved by the program/compiler.
Cache: K bytes (MB today?), 10-100 ns, 1-0.1 cents/bit. Staging unit: blocks, 8-128 bytes, moved by the cache controller.
Main memory: M bytes (GB today?), 200-500 ns, 0.0001-0.00001 cents/bit. Staging unit: pages, 512-4K bytes, moved by the OS.
Disk: G bytes (TB today?), 10 ms (10,000,000 ns), 10^-5 to 10^-6 cents/bit. Staging unit: files, Mbytes, moved by the user/operator.
Tape: infinite capacity, sec-min access, 10^-8 cents/bit.

5 Memory Hierarchy: Apple iMac G5
1.6 GHz PowerPC G5. Registers are managed by the compiler; the caches by hardware; main memory and disk by the OS, hardware, and the application.

         Reg     L1 Inst  L1 Data  L2      DRAM    Disk
Size     1K      64K      32K      512K    256M    80G
Latency  1 cy    3 cy     3 cy     11 cy   88 cy   ~10^7 cy
         0.6 ns  1.9 ns   1.9 ns   6.9 ns  55 ns   12 ms

Goal: the illusion of large, fast, cheap memory. Let programs address a memory space that scales to the size of the disk, at a speed that is usually as fast as register access.

6 iMac’s PowerPC 970: All caches on-chip
[Die photo: registers (1K), L1 instruction cache (64K), L1 data cache (32K), and L2 cache (512K), all on-chip.]

7 The Principle of Locality
Programs access a relatively small portion of the address space at any instant of time. Two different types of locality: Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse). Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access). For the last 15 years, hardware has relied on locality for speed.
The principle of locality states that programs access a relatively small portion of the address space at any instant of time. This is kind of like real life: we all have a lot of friends, but at any given time most of us can only keep in touch with a small group of them. There are two different types of locality: temporal and spatial. Temporal locality is locality in time, which says that if an item is referenced, it will tend to be referenced again soon. This is like saying that if you just talked to one of your friends, it is likely that you will talk to him or her again soon; for example, if you just had lunch with a friend, you may say, "let's go to the ball game this Sunday," so you will talk to him again soon. Spatial locality is locality in space: if an item is referenced, items whose addresses are close by tend to be referenced soon. We can usually divide our friends into groups: friends from high school, friends from work, friends from home. Say you just talked to one of your friends from high school, and she says, "did you hear that so-and-so just won the lottery?" You will probably say, "no, I'd better give him a call and find out more." You just talked to a friend from your high school days, and as a result you end up talking to another high-school friend. That is spatial locality, or at least you hope he still remembers you are his friend. Locality is a property of programs which is exploited in machine design.
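To make the two kinds of locality concrete in code, here is a minimal C sketch (illustrative, not from the original slides):

#include <stdio.h>

#define N 1000

int main(void) {
    int a[N];
    int sum = 0;

    /* Spatial locality: a[0], a[1], ... are touched in address order,
       so neighbors of a referenced item are referenced soon after. */
    for (int i = 0; i < N; i++)
        a[i] = i;

    /* Temporal locality: sum, i, and the loop's own instructions are
       reused on every iteration; a[] is also re-read soon after being
       written above. */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %d\n", sum);
    return 0;
}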

8 Programs with locality cache well ...
[Figure: memory address vs. time, one dot per access, showing bands of spatial locality, runs of temporal locality, and regions of bad locality behavior. Source: Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3), 1971.]

9 Memory Hierarchy: Terminology
Hit: the data appears in some block in the upper level (example: block X). Hit rate: the fraction of memory accesses found in the upper level. Hit time: the time to access the upper level, which consists of the RAM access time + the time to determine hit/miss. Miss: the data must be retrieved from a block in the lower level (block Y). Miss rate = 1 - (hit rate). Miss penalty: the time to replace a block in the upper level + the time to deliver the block to the processor. Hit time << miss penalty (500 instructions on the 21264!).
[Figure: the processor reads/writes Blk X in the upper level; on a miss, Blk Y is brought up from the lower-level memory.]
A HIT is when the data the processor wants to access is found in the upper level (Blk X). The fraction of memory accesses that hit is defined as the hit rate. Hit time is the time to access the upper level where the data is found (X); it consists of (a) the time to access this level and (b) the time to determine whether this is a hit or a miss. If the data the processor wants cannot be found in the upper level, we have a miss, and we need to retrieve the data (Blk Y) from the lower level. By definition, the miss rate is just 1 minus the hit rate. The miss penalty also consists of two parts: (a) the time it takes to replace a block (Blk Y to Blk X) in the upper level, and (b) the time it takes to deliver this new block to the processor. It is very important that the hit time be much, much smaller than the miss penalty; otherwise, there would be no reason to build a memory hierarchy.

10 Cache Measures
Hit rate: the fraction of accesses found in that level. It is usually so high that we talk about the miss rate instead. Miss-rate fallacy: just as MIPS is a misleading proxy for CPU performance, miss rate is a misleading proxy for average memory access time.
Average memory access time = Hit time + Miss rate x Miss penalty (in ns or clocks)
Miss penalty: the time to replace a block from the lower level, including the time to deliver it to the CPU. It has two parts: access time, the time to reach the lower level = f(latency to lower level); and transfer time, the time to transfer the block = f(bandwidth between the upper and lower levels).
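The AMAT formula is easy to sanity-check in code; the numbers below are illustrative, not from the slides:

#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty (times in ns). */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* e.g., 1 ns hit time, 5% miss rate, 100 ns miss penalty */
    printf("AMAT = %.2f ns\n", amat(1.0, 0.05, 100.0));   /* prints 6.00 */
    return 0;
}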

11 4 Questions for Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy)

12 Q1: Where can a block be placed in the upper level?
Block 12 placed in an 8-block cache: fully associative, direct mapped, or 2-way set associative. Set-associative mapping = (block number) modulo (number of sets). Direct mapped: (12 mod 8) = block 4. 2-way set associative: (12 mod 4) = set 0. Fully associative: any block.

13 Direct Mapped Block Placement
Direct mapped: a memory block maps to exactly one cache location: location = (block address) MOD (# blocks in cache). [Figure: memory addresses 00-4C; with a 4-block cache, addresses ending in *0, *4, *8, *C each map to their own fixed cache block.]

14 Fully Associative Block Placement
Fully associative: a memory block may be placed in any cache location: location = any. [Figure: memory addresses 00-4C mapping to arbitrary cache blocks.]

15 Set-Associative Block Placement
Set associative: a memory address maps to a set, and the block may be placed in any (arbitrary) location within that set: set = (block address) MOD (# sets in cache). [Figure: a cache with Set 0-Set 3, two ways per set; memory addresses 00-4C map to sets by their low-order digits *0, *4, *8, *C.]
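The three placement rules can be checked against the slide-12 example (block 12 in an 8-block cache) with a small sketch; only the mod arithmetic comes from the slides:

#include <stdio.h>

int main(void) {
    unsigned block = 12, nblocks = 8;

    /* Direct mapped: exactly one candidate location. */
    printf("direct mapped: block %u -> block %u\n", block, block % nblocks);

    /* 2-way set associative: 8 blocks / 2 ways = 4 sets; either way of the set. */
    unsigned nsets = nblocks / 2;
    printf("2-way: block %u -> set %u (either way)\n", block, block % nsets);

    /* Fully associative: one set of all 8 blocks; any location. */
    printf("fully associative: block %u -> any of %u blocks\n", block, nblocks);
    return 0;
}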

16 Q2: How is a block found if it is in the upper level?
Tag on each block: no need to check the index or block offset. Increasing associativity shrinks the index and expands the tag. Address layout: Tag | Index | Block Offset, where Tag + Index together form the block address. A decomposition sketch follows.
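A sketch of the decomposition; the 6-bit offset and 8-bit index below are illustrative parameters, not values from the slide:

#include <stdio.h>

int main(void) {
    unsigned int addr = 0x12345678;
    unsigned int offset_bits = 6;   /* 64-byte blocks (assumed) */
    unsigned int index_bits = 8;    /* 256 sets (assumed) */

    unsigned int offset = addr & ((1u << offset_bits) - 1);
    unsigned int index  = (addr >> offset_bits) & ((1u << index_bits) - 1);
    unsigned int tag    = addr >> (offset_bits + index_bits);

    printf("tag=0x%x index=0x%x offset=0x%x\n", tag, index, offset);
    return 0;
}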

17 Direct-Mapped Cache Design
[Figure: direct-mapped cache datapath. The address is split into tag, index, and byte offset; the index selects one entry of the cache SRAM, which holds a valid bit, a tag, and data (sample contents such as 0x00001C0 / 0xff083c2d shown); a comparator matches the stored tag against the address tag, and HIT = 1 when they match and the entry is valid.]

18 Set Associative Cache Design
Key idea: divide the cache into sets and allow a block to go anywhere within its set. Advantage: better hit rate. Disadvantages: more tag bits, more hardware, higher access time. [Figure: a four-way set-associative cache.]

19 Fully Associative Cache Design
Key idea: a single set containing all blocks, so a block can go anywhere. One comparator is required for each block, and no address decoding is needed. Practical only for small caches due to the hardware demands. [Figure: the incoming tag is compared in parallel against every stored tag; the matching block drives the data out.]

20 In-Class Exercise
Given the following requirements for a cache design for a 32-bit-address computer: (1) the cache contains 16KB of data; (2) each cache block contains 16 words; (3) the placement policy is 4-way set-associative. What are the lengths (in bits) of the block-offset field and the index field in the address? What are the lengths (in bits) of the index field and the tag field if the placement is 1-way set-associative?
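One way to check the arithmetic (this sketch assumes 4-byte words and byte addressing, which the exercise leaves implicit):

#include <stdio.h>

/* log2 for exact powers of two */
static unsigned log2u(unsigned x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    unsigned addr_bits   = 32;
    unsigned block_bytes = 16 * 4;                    /* 16 words x 4 bytes (assumed) */
    unsigned blocks      = (16 * 1024) / block_bytes; /* 16KB / 64B = 256 blocks */
    unsigned offset_bits = log2u(block_bytes);        /* 6 */

    for (unsigned ways = 4; ways >= 1; ways /= 4) {   /* 4-way, then 1-way */
        unsigned index_bits = log2u(blocks / ways);
        printf("%u-way: offset=%u index=%u tag=%u\n", ways, offset_bits,
               index_bits, addr_bits - offset_bits - index_bits);
    }
    return 0;
}

This prints offset = 6, index = 6, tag = 20 for the 4-way case, and index = 8, tag = 18 for the 1-way (direct-mapped) case.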

21 Q3: Which block should be replaced on a miss?
Easy for direct mapped. For set-associative or fully associative caches: random or LRU (least recently used).
Miss rates:
             2-way            4-way            8-way
Size      LRU    Random    LRU    Random    LRU    Random
16 KB     5.2%   5.7%      4.7%   5.3%      4.4%   5.0%
64 KB     1.9%   2.0%      1.5%   1.7%      1.4%   1.5%
256 KB    1.15%  1.17%     1.13%  1.13%     1.12%  1.12%

22 Miss Rate for 2-way Set Associative Cache
Q3: After a cache read miss, if there are no empty cache blocks, which block should be removed from the cache? The least recently used (LRU) block: appealing, but hard to implement for high associativity. A randomly chosen block: easy to implement, but how well does it work?
Miss rate for a 2-way set-associative cache:
Size      Random   LRU
16 KB     5.7%     5.2%
64 KB     2.0%     1.9%
256 KB    1.17%    1.15%
Also, try other LRU approximations.

23 Q4: What happens on a write?
Policy: write-through writes data to the cache block and also to lower-level memory; write-back writes data only to the cache and updates the lower level when the block falls out of the cache.
                                      Write-through   Write-back
Debugging                             Easy            Hard
Do read misses produce writes?        No              Yes
Do repeated writes reach the
lower level?                          Yes             No
Additional option on a (write) miss: let writes to an un-cached address allocate a new cache line ("write-allocate"). A sketch of the two policies follows.
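A minimal sketch (not from the slides) contrasting the two policies on a write hit; struct line stands for one cache line's bookkeeping and memory_write() stands in for the lower-level memory:

#include <stdbool.h>
#include <stdint.h>

struct line { uint32_t data; bool dirty; };

static void memory_write(uint32_t addr, uint32_t value) {
    (void)addr; (void)value;       /* stand-in for a lower-level write */
}

static void write_through(struct line *l, uint32_t addr, uint32_t value) {
    l->data = value;               /* update the cache block ...          */
    memory_write(addr, value);     /* ... and the lower level, every time */
}

static void write_back(struct line *l, uint32_t value) {
    l->data = value;               /* update only the cache block         */
    l->dirty = true;               /* lower level is updated on eviction  */
}

static void evict(struct line *l, uint32_t addr) {
    if (l->dirty) {                /* why write-back read misses can produce writes */
        memory_write(addr, l->data);
        l->dirty = false;
    }
}

int main(void) {
    struct line l = { 0, false };
    write_through(&l, 0x100, 42);
    write_back(&l, 43);
    evict(&l, 0x100);
    return 0;
}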

24 Write Buffers for Write-Through Caches
Processor -> Cache -> Write Buffer -> Lower-Level Memory. The write buffer holds data awaiting write-through to lower-level memory. Q. Why a write buffer? A. So the CPU doesn't stall. Q. Why a buffer, why not just one register? A. Bursts of writes are common. Q. Are Read-After-Write (RAW) hazards an issue for the write buffer? A. Yes! Either drain the buffer before the next read, or send the read first after checking the write buffer for a matching address.

25 5 Basic Cache Optimizations
Reducing miss rate (the 3 Cs of misses):
1. Larger block size (compulsory misses)
2. Larger cache size (capacity misses)
3. Higher associativity (conflict misses)
Reducing miss penalty:
4. Multilevel caches
Reducing hit time:
5. Giving reads priority over writes (e.g., a read completes before earlier writes in the write buffer)

26 In-Class Exercises
In systems with a write-through L1 cache backed by a write-back L2 cache instead of main memory, a merging write buffer can be simplified. Explain how this can be done. Are there situations where having a full write buffer (instead of the simple version just proposed) could be helpful?
The merging buffer links the CPU to the L2 cache. Two CPU writes cannot merge if they are to different sets in L2. So, for each new entry into the buffer, a quick check on only those address bits that determine the L2 set number need be performed at first. If there is no match, the new entry is not merged; otherwise, all address bits can be checked for a definitive result. As the associativity of L2 increases, the rate of false-positive matches from the simplified check increases, reducing performance.

27 In-Class Exercises
As caches increase in size, blocks often increase in size as well. If a large instruction cache has larger blocks, is there still a need for prefetching? Explain the interaction between prefetching and increased block size in instruction caches.
Program basic blocks are often short (<10 instructions), so execution does not continue through sequential locations for very long. As blocks get larger, a program is more likely to branch out early rather than execute every instruction in the block, making instruction prefetching less attractive.
Is there a need for data prefetch instructions when data blocks get larger?
Data structures often comprise lengthy sequences of memory addresses, and programs often access them in sequential sweeps. Large data blocks work well with such access patterns, and prefetching is likely still of value for highly sequential access patterns.

28 Outline
Review; memory hierarchy; locality; cache design; virtual address spaces; page table layout; TLB design options; conclusion.

29 The Limits of Physical Addressing
“Physical addresses” of memory locations: all programs share one address space, the physical address space. Machine-language programs must be aware of the machine organization, and there is no way to prevent a program from accessing any machine resource. [Figure: the CPU wired directly to memory over address lines A0-A31 and data lines D0-D31.]

30 Solution: Add a Layer of Indirection
[Figure: the CPU issues “virtual addresses”, which address translation maps to “physical addresses” before they reach memory over A0-A31/D0-D31.] User programs run in a standardized virtual address space. Address-translation hardware, managed by the operating system (OS), maps virtual addresses to physical memory. The hardware supports “modern” OS features: protection, translation, and sharing.

31 Three Advantages of Virtual Memory
Translation: a program can be given a consistent view of memory, even though physical memory is scrambled; this makes multithreading reasonable (now used a lot!). Only the most important part of a program (its “working set”) must be in physical memory, and contiguous structures (like stacks) use only as much physical memory as necessary yet can still grow later.
Protection: different threads (or processes) are protected from each other, and different pages can be given special behavior (read-only, invisible to user programs, etc.). Kernel data is protected from user programs: very important for protection against malicious programs.
Sharing: the same physical page can be mapped for multiple users (“shared memory”).

32 Page tables encode virtual address spaces
A virtual address space is divided into blocks of memory called pages, and a page table is indexed by the virtual address. A valid page-table entry encodes the physical-memory “frame” address for the page. The OS manages the page table for each ASID (Address Space ID). A machine usually supports pages of a few sizes (MIPS R4000). [Figure: page-table entries pointing to frames in the physical memory space.]

33 Details of the Page Table
[Figure: a virtual address consists of a virtual page number and a 12-bit offset. The page-table base register plus the page number index into the page table, which itself resides in physical memory; each entry holds a valid bit (V), access rights, and a physical frame number, which is concatenated with the offset to form the physical address.]
The page table maps virtual page numbers to physical frames (“PTE” = page-table entry). Virtual memory => treat main memory as a cache for the disk, with the same 4 fundamental questions: placement, identification, replacement, and write policy.

34 Page tables may not fit in memory!
A table for 4KB pages in a 32-bit address space has 1M entries, and each process needs its own address space!
Two-level page tables: the 32-bit virtual address is split into a 10-bit P1 index (bits 31-22), a 10-bit P2 index (bits 21-12), and a 12-bit page offset (bits 11-0). The top-level table is wired into main memory; only a subset of the 1024 second-level tables is in main memory, and the rest are on disk or unallocated.
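A minimal sketch of the corresponding two-level walk; the field widths come from the slide, while the table layout, names, and fault handling are illustrative assumptions:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define P1(va)     (((va) >> 22) & 0x3FFu)   /* bits 31-22 */
#define P2(va)     (((va) >> 12) & 0x3FFu)   /* bits 21-12 */
#define OFFSET(va) ((va) & 0xFFFu)           /* bits 11-0  */

typedef struct { uint32_t frame; int valid; } pte_t;

/* top[i] points to the i-th second-level table, or is NULL if that table
   is on disk or unallocated. */
static uint32_t translate(pte_t *top[1024], uint32_t va, int *fault) {
    pte_t *second = top[P1(va)];
    if (second == NULL || !second[P2(va)].valid) {
        *fault = 1;                          /* page fault: the OS takes over */
        return 0;
    }
    *fault = 0;
    return (second[P2(va)].frame << 12) | OFFSET(va);
}

int main(void) {
    static pte_t *top[1024];                 /* top-level table, wired in memory */
    pte_t *second = calloc(1024, sizeof(pte_t));
    top[P1(0x00401000u)] = second;           /* install one second-level table */
    second[P2(0x00401000u)] = (pte_t){ .frame = 0x1234, .valid = 1 };

    int fault;
    uint32_t pa = translate(top, 0x00401ABCu, &fault);
    printf(fault ? "page fault\n" : "PA = 0x%08x\n", pa);   /* PA = 0x01234abc */
    return 0;
}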

35 VM and Disk: Page replacement policy
Dirty bit: the page has been written. Used bit: set to 1 on any reference.
[Figure: the clock algorithm sweeps over the set of all pages in memory. The tail pointer clears the used bit in the page table; the head pointer places pages on the free list if the used bit is still clear, and schedules pages with the dirty bit set to be written to disk; freed pages go on the free list.]
The architect's role: support setting the dirty and used bits.

36 TLB Design Concepts

37 MIPS Address Translation: How does it work?
[Figure: the CPU issues “virtual addresses”; the Translation Look-Aside Buffer (TLB) converts them to “physical addresses” on the way to memory.] The TLB is a small fully-associative cache of mappings from virtual to physical addresses. What is the table of mappings that it caches? The page table. The TLB also contains protection bits for the virtual address. Fast common case: the virtual address is in the TLB, and the process has permission to read/write it.

38 Physical and virtual pages must be the same size!
The TLB caches page-table entries, and physical and virtual pages must be the same size! [Figure: a virtual address (page, offset) probes the TLB; on a TLB miss, the page table supplies the physical frame for this ASID, and (frame, offset) forms the physical address.] V=0 pages either reside on disk or have not yet been allocated; the OS handles the V=0 “page fault”. MIPS handles TLB misses in software (with random replacement); other machines use hardware. A lookup sketch follows.
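Here is the lookup as a sketch, assuming an 8-entry TLB and 4KB pages (the sizes, fields, and refill interface are illustrative):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TLB_ENTRIES 8
#define PAGE_BITS   12

struct tlb_entry { uint32_t vpn; uint32_t pfn; int valid; };
static struct tlb_entry tlb[TLB_ENTRIES];

static int tlb_lookup(uint32_t va, uint32_t *pa) {
    uint32_t vpn = va >> PAGE_BITS;
    for (int i = 0; i < TLB_ENTRIES; i++)         /* every entry is compared    */
        if (tlb[i].valid && tlb[i].vpn == vpn) {  /* (in hardware, in parallel) */
            *pa = (tlb[i].pfn << PAGE_BITS) | (va & ((1u << PAGE_BITS) - 1));
            return 1;                             /* hit */
        }
    return 0;   /* miss: walk the page table, then refill an entry */
}

static void tlb_refill(uint32_t vpn, uint32_t pfn) {
    /* random replacement, as in the MIPS software handler */
    tlb[rand() % TLB_ENTRIES] = (struct tlb_entry){ vpn, pfn, 1 };
}

int main(void) {
    uint32_t pa;
    tlb_refill(0x00400, 0x01234);                 /* map one page */
    if (tlb_lookup(0x00400ABCu, &pa))
        printf("hit: PA = 0x%08x\n", pa);         /* 0x01234abc */
    else
        printf("miss\n");
    return 0;
}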

39 Can TLB and caching be overlapped?
[Figure: the virtual page number goes to the TLB while the page offset simultaneously indexes the cache (index + byte select); the physical tag from the TLB is compared with the cache tag to determine a hit, and the cache block provides the data.] This works, but... Q. What is the downside? A. Inflexibility: the size of the cache is limited by the page size.

40 Problems With Overlapped TLB Access
Overlapped access works only as long as the address bits used to index into the cache do not change as a result of virtual-address translation. This usually limits things to small caches, large page sizes, or high n-way set-associative caches if you want a large cache.
Example: suppose everything stays the same (a 20-bit virtual page number and a 12-bit displacement), except that the cache is increased from 4 KB to 8 KB. The extra cache-index bit now falls inside the virtual page number: that bit is changed by VA translation, but it is needed for the cache lookup.
Solutions: go to 8 KB page sizes; go to a 2-way set-associative cache (1K sets x 2 ways); or have software guarantee VA[13] = PA[13].

41 Use virtual addresses for cache?
[Figure: the CPU presents virtual addresses directly to the cache; the TLB sits between the cache and main memory, producing “physical addresses”.] Only use the TLB on a cache miss! Downside: a subtle, fatal problem. What is it? A. The synonym problem: if two address spaces share a physical frame, the data may be in the cache twice. Maintaining consistency is a nightmare.

42 In-Class Exercise
Some memory systems handle TLB misses in software (as an exception), while others use hardware. What are the trade-offs between these two methods?
Hardware is faster but less flexible.
Will software handling of TLB misses always be slower? Explain.
No. Factors other than whether miss handling is done in hardware or software can quickly dominate the handling time. E.g., is the page table itself paged? Can software implement a more efficient page-table search algorithm than hardware? Is there hardware prefetching?
Are there page-table structures that would be difficult to handle in hardware but possible in software?
Page-table structures that change dynamically.

43 Summary #1/3: The Cache Design Space
Several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs. write-back, write allocation. The optimal choice is a compromise: it depends on access characteristics (workload; use as I-cache, D-cache, or TLB) and on technology and cost. Simplicity often wins. [Figure: the cache design space, with good and bad regions as factors such as cache size, associativity, and block size vary from less to more.]
No fancy replacement policy is needed for a direct-mapped cache. As a matter of fact, that is what causes direct-mapped caches trouble to begin with: there is only one place to go in the cache, which causes conflict misses. Besides working at Sun, I also teach people how to fly whenever I have time. Statistics have shown that if a pilot crashes after an engine failure, he or she is more likely to be killed in a multi-engine light airplane than in a single-engine airplane. The joke among us flight instructors is: sure, when the engine quits in a single-engine plane, you have one option: sooner or later, you land. Probably sooner. But in a multi-engine airplane with one engine stopped, you have a lot of options. It is the need to make a decision that kills those people.

44 Summary #2/3: Caches
The principle of locality: programs access a relatively small portion of the address space at any instant of time. Temporal locality: locality in time. Spatial locality: locality in space.
Three major categories of cache misses: Compulsory misses: sad facts of life, e.g., cold-start misses. Capacity misses: increase the cache size. Conflict misses: increase the cache size and/or associativity. Nightmare scenario: the ping-pong effect!
Write policy: write-through vs. write-back.
Today CPU time is a function of (ops, cache misses) rather than just f(ops): this affects compilers, data structures, and algorithms.

45 Summary #3/3: TLB, Virtual Memory
Page tables map virtual addresses to physical addresses. TLBs are important for fast translation, and TLB misses are significant in processor performance: funny times, as most systems can't access all of the 2nd-level cache without TLB misses! Caches, TLBs, and virtual memory can all be understood by examining how they deal with 4 questions: 1) Where can a block be placed? 2) How is a block found? 3) What block is replaced on a miss? 4) How are writes handled? Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than the memory-hierarchy benefits, but computers remain insecure.
Let's do a short review of what you learned last time. Virtual memory was originally invented as another level of the memory hierarchy, so that programmers, faced with main memory much smaller than their programs, did not have to manage the loading and unloading of portions of their program in and out of memory. It was a controversial proposal at the time, because very few programmers believed software could manage the limited memory resource as well as a human. This all changed as DRAM sizes grew exponentially over the last few decades. Nowadays, the main function of virtual memory is to allow multiple processes to share the same main memory, so we don't have to swap all the non-active processes to disk; consequently, the most important function of virtual memory these days is to provide memory protection. The most common technique (but, we emphasize, not the only technique) to translate virtual memory addresses to physical memory addresses is a page table. The TLB, or translation lookaside buffer, is one of the most popular hardware techniques for reducing address-translation time. Since the TLB is so effective at reducing address-translation time, TLB misses have a significant negative impact on processor performance.

46 Classifying Misses: the 3 Cs
Compulsory: the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache.)
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X.)
Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X.)

47 3Cs Absolute Miss Rate (SPEC92)
[Figure: 3Cs absolute miss rate vs. cache size (SPEC92), with the miss rate broken into conflict misses (stacked by associativity), capacity misses, and compulsory misses; the compulsory component is vanishingly small.]

48 2:1 Cache Rule
Miss rate of a 1-way associative cache of size X = miss rate of a 2-way associative cache of size X/2. [Figure: miss rate vs. cache size, with the conflict-miss component illustrating the rule.]

49 3C Relative Miss Rate
[Figure: the same data as the previous slide plotted as relative miss rate, emphasizing the conflict component.] Flaw: it is plotted for a fixed block size. Good: insight => invention.

50 Improve Cache Performance
To improve cache and memory access times, recall: Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty. Can each of these be reduced, even simultaneously? Improve performance by: 1. reducing the miss rate, 2. reducing the miss penalty, or 3. reducing the time to hit in the cache.

51 Reducing Cache Misses: 1. Larger Block Size
Uses the principle of locality: the larger the block, the greater the chance that parts of it will be used again. [Figure: miss rate vs. block size, one curve per cache size.]

52 Increasing Block Size
One way to reduce the miss rate is to increase the block size: this takes advantage of spatial locality and decreases compulsory misses. However, larger blocks have disadvantages: they may increase the miss penalty (more data must be fetched), may increase the hit time (more data to read from the cache and a larger mux), and may increase the miss rate, since fewer blocks in the cache mean more conflict misses. Increasing the block size can help, but don't overdo it.

53 Block Size vs. Cache Measures
Increasing the block size generally increases the miss penalty and, at first, decreases the miss rate: Avg. Memory Access Time = Hit Time + Miss Penalty x Miss Rate. As the block size increases, AMAT starts to decrease but eventually increases. [Figure: vs. block size, the miss penalty rises, the miss rate falls and then rises, and AMAT forms a U-shaped curve.]

54 Reducing Cache Misses: 2. Higher Associativity
Increasing associativity helps reduce conflict misses. 2:1 cache rule: the miss rate of a direct-mapped cache of size N is about equal to the miss rate of a 2-way set-associative cache of size N/2; for example, the miss rate of a 32 KB direct-mapped cache is about equal to that of a 16 KB 2-way set-associative cache. Disadvantages of higher associativity: a large number of comparisons is needed, an n-to-1 multiplexor is needed for an n-way set-associative cache, and the hit time could increase.

55 AMAT vs. Associativity
[Table: AMAT by cache size (KB) for 1-way, 2-way, 4-way, and 8-way associativity; the numeric entries did not survive transcription. Red entries marked cases where AMAT is not improved by more associativity.] This does not take into account the effect of a slower clock on the rest of the program.

56 Reducing Cache Misses: 3. Victim Cache
Data discarded from the cache is placed in an extra small buffer (the victim cache). On a cache miss, the victim cache is checked for the data before going to main memory. Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache. Used in Alpha and HP PA-RISC CPUs. [Figure: the victim cache (with tag comparison) sits between the data cache/write buffer and lower-level memory.]

57 Reducing Cache Misses: 4. Way Prediction and Pseudoassociative Caches
Way prediction helps select one block among those in a set, thus requiring only one tag comparison (on a hit). It preserves the advantages of direct mapping (why?); in case of a miss, the other block(s) are checked.
Pseudoassociative (also called column-associative) caches operate exactly as direct-mapped caches on a hit, again preserving the advantages of direct mapping; in case of a miss, another block is checked (as in a set-associative cache), by simply inverting the most significant bit of the index field to find the other block in the "pseudoset". Note that real hit time < pseudo-hit time, and too many pseudo hits would defeat the purpose. A sketch of the pseudoset probe follows.
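A sketch of the pseudoset probe; the 8-bit index width is an illustrative assumption:

#include <stdio.h>

#define INDEX_BITS 8

/* On a miss in the primary block, probe the block whose index has its
   most significant bit inverted: the other member of the "pseudoset". */
static unsigned partner_index(unsigned index) {
    return index ^ (1u << (INDEX_BITS - 1));
}

int main(void) {
    unsigned index = 0x2A;
    printf("primary 0x%02x -> pseudoset partner 0x%02x\n",
           index, partner_index(index));          /* 0x2a -> 0xaa */
    return 0;
}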

58 Reducing Cache Misses: 5. Compiler Optimizations

59 Reducing Cache Misses: 5. Compiler Optimizations

60 Reducing Cache Misses: 5. Compiler Optimizations

61 Reducing Cache Misses: 5. Compiler Optimizations
Blocking: improve temporal and spatial locality when multiple arrays are accessed in both ways (i.e., row-major and column-major), namely orthogonal accesses that cannot be helped by the earlier methods: concentrate on submatrices, or blocks. In the naive matrix multiply, all N*N elements of Y and Z are accessed N times and each element of X is accessed once; thus there are N^3 operations and 2N^3 + N^2 memory reads! Capacity misses are a function of N and the cache size in this case.

62 Reducing Cache Misses: 5. Compiler Optimizations (cont’d)
Blocking: improve temporal and spatial locality. To ensure that the elements being accessed fit in the cache, the original code is changed to compute on a submatrix of size B*B, where B is called the blocking factor. The total number of memory words accessed becomes 2N^3/B + N^2. Blocking exploits a combination of spatial (Y) and temporal (Z) locality. A reconstructed sketch of the blocked loop nest follows.
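For concreteness, here is one common shape of the blocked loop nest in C; the names X, Y, Z, N, and B follow the slides, but the code itself is a reconstruction (the slides' own listing did not survive), and it assumes N is a multiple of B:

#define N 512
#define B 32   /* blocking factor: pick B so a few B x B submatrices fit in cache */

/* Blocked matrix multiply X += Y * Z: each B x B submatrix of Z is held
   in the cache and reused across a strip of Y (temporal locality in Z,
   spatial locality in Y). */
void matmul_blocked(double X[N][N], double Y[N][N], double Z[N][N]) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = X[i][j];           /* accumulate into X */
                    for (int k = kk; k < kk + B; k++)
                        r += Y[i][k] * Z[k][j];
                    X[i][j] = r;
                }
}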

63 Reducing Cache Miss Penalty: 1. Multi-level Cache
To keep up with the widening gap between the CPU and main memory, try to: make the cache faster, and make the effective cache larger, by adding another, larger but slower cache between the cache and main memory.

64 Adding an L2 Cache
If a direct-mapped cache has a hit rate of 95%, a hit time of 4 ns, and a miss penalty of 100 ns, what is the AMAT? If an L2 cache is added with a hit time of 20 ns and a hit rate of 50%, what is the new AMAT?

65 Adding an L2 Cache
If a direct-mapped cache has a hit rate of 95%, a hit time of 4 ns, and a miss penalty of 100 ns, what is the AMAT? AMAT = Hit time + Miss rate x Miss penalty = 4 + 0.05 x 100 = 9 ns.
If an L2 cache is added with a hit time of 20 ns and a hit rate of 50%, what is the new AMAT? AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2) = 4 + 0.05 x (20 + 0.5 x 100) = 7.5 ns.

66 Reducing Cache Miss Penalty: 2. Handling Misses Judiciously
Critical word first and early restart: the CPU needs just one word of the block at a time. Critical word first: fetch the required word first. Early restart: as soon as the required word arrives, send it to the CPU.
Giving priority to read misses over write misses: serve reads before outstanding writes have completed. While write buffers improve write-through performance, they complicate memory accesses by potentially delaying updates to memory; instead of waiting for the write buffer to become empty before processing a read miss, the write buffer is checked for content that might satisfy the missing read. In a write-back scheme, the dirty copy being replaced is first written to the write buffer instead of memory, thus improving performance.

67 Reducing Cache Miss Penalty: 3. Compiler-Controlled Prefetching
Compiler-inserted prefetch instructions. An example:
for (i = 0; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1)
        a[i][j] = b[j][0] * b[j+1][0];
Assume 16-byte blocks, an 8KB direct-mapped write-back cache, and 8-byte elements. What kind of locality, if any, exists for a and b?
a: 3 rows of 100 elements are visited; thanks to spatial locality, even-indexed elements miss and odd-indexed elements hit, giving 3 x 100/2 = 150 misses.
b: 101 rows and 3 columns are visited; there is no spatial locality (successive b[j][0] fall in different blocks), but there is temporal locality: the element b[j+1][0] used in iteration j is reused as b[j][0] in iteration j+1, and the same elements are accessed in each iteration of the outer i loop. This gives 100 misses for i = 0 plus 1 miss for j = 0, for a total of 101 misses.
Assuming a large miss penalty (50 cycles), at least 7 iterations must be prefetched ahead. Splitting the loop into two, we have:

68 Reducing Cache Miss Penalty: 3. Compiler-Controlled Prefetching
An example (continued):
for (j = 0; j < 100; j = j+1) {
    prefetch(b[j+7][0]);
    prefetch(a[0][j+7]);
    a[0][j] = b[j][0] * b[j+1][0];
}
for (i = 1; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1) {
        prefetch(a[i][j+7]);
        a[i][j] = b[j][0] * b[j+1][0];
    }
Assuming that each iteration of the pre-split loop consumes 7 cycles and that there are no conflict or capacity misses, the pre-split code consumes a total of 7 x 300 + 251 x 50 = 14650 cycles (total iteration cycles plus total cache-miss cycles).

69 Reducing Cache Miss Penalty: 3. Compiler-Controlled Prefetching (cont’d)
An example (continued): the first loop consumes 9 cycles per iteration (due to the two prefetch instructions); the second loop consumes 8 cycles per iteration (due to the single prefetch instruction). During the first 7 iterations of the first loop, array a incurs 4 cache misses and array b incurs 7 cache misses. During the first 7 iterations of the second loop, for i = 1 and i = 2, array a incurs 4 cache misses each; array b incurs no cache misses in the second split! The split loops consume a total of (1+1+7) x 100 + (4+7) x 50 + (1+7) x 200 + (4+4) x 50 = 3450 cycles. Prefetching improves performance: 14650/3450 = 4.25-fold.

70 Reducing Cache Hit Time:
Small and simple caches: smaller is faster (a small index means less address-translation time), and a small cache can fit on the same chip as the CPU. Low associativity: in addition to a simpler, shorter tag check, a 1-way (direct-mapped) cache allows overlapping the tag check with the transmission of data, which is not possible with any higher associativity!
Avoid address translation during indexing: make the common case fast by using the virtual address for the cache, since most memory accesses (more than 90%) take place in the cache, resulting in a virtual cache.

71 Reducing Cache Hit Time:
Make the common case fast (continued): there are at least three important performance aspects that directly relate to virtual-to-physical translation. First, improperly organized or insufficiently sized TLBs may create excess not-in-TLB faults, adding to program execution time. Second, for a physical cache, the TLB access must occur before the cache access, extending the cache access time. Third, two line addresses (e.g., an I-line and a D-line address) may be independent of each other in virtual address space yet collide in the real address space, when they draw pages whose lower page-address bits (and upper cache-address bits) are identical.
Problems with a virtual cache: page-level protection must be enforced no matter what during address translation (solution: copy the protection info from the TLB on a miss and hold it in a field for future virtual indexing/tagging). Also, when a process is switched in or out, the entire cache has to be flushed, because the physical addresses will be different each time, i.e., the problem of context switching (solution: a process-identifier tag, PID).

72 Reducing Cache Hit Time:
Avoid address translation during indexing (continued). Problems with a virtual cache: different virtual addresses may refer to the same physical address, i.e., the problem of synonyms/aliases. HW solution: guarantee every cache block a unique physical address. SW solution: force aliases to share some address bits (e.g., page coloring). Alternative: virtually indexed, physically tagged caches.
Pipelined cache writes: reduce the cache cycle time and increase the number of stages, which increases instruction throughput.
Trace caches: find a dynamic sequence of instructions, including taken branches, to load into a cache block: put traces of the executed instructions into cache blocks as determined by the CPU. Branch prediction is folded into the cache and must be validated along with the addresses for a fetch to be valid. Disadvantage: the same instructions may be stored multiple times.

