
Lecture 4.3 Memory Hierarchy: Improving Cache Performance
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Learning Objectives
- Calculate the effective CPI and the average memory access time
- Given a 32-bit address, specify the index and tag for a direct mapped cache, an n-way set associative cache, and a fully associative cache
- Calculate the effective CPI for multi-level caches

Coverage
Textbook: Chapter 5.4

Measuring Cache Performance
Components of CPU time:
- Program execution cycles (includes cache hit time)
- Memory stall cycles (mainly from cache misses)
CPU time = IC × CPI × CC
         = IC × (CPI_ideal + average memory stall cycles per instruction) × CC

Memory Stall Cycles
With simplifying assumptions:
Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                    = IC × (Misses / Instruction) × Miss penalty

Cache Performance Example
Given:
- I-cache miss rate = 2%
- D-cache miss rate = 4%
- Miss penalty = 100 cycles
- Base CPI (ideal cache) = 2
- Loads & stores are 36% of instructions
Memory stall cycles per instruction:
- I-cache: 0.02 × 100 = 2 (every instruction must be fetched from the I-cache)
- D-cache: 0.36 × 0.04 × 100 = 1.44 (36% of instructions access the D-cache)
Actual CPI = 2 + 2 + 1.44 = 5.44
The real execution is 5.44 / 2 = 2.72 times slower than the ideal case.
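
A quick way to check this arithmetic is to code the formula directly. A minimal Python sketch (function and parameter names are mine, not from the slides):

```python
# Effective CPI = base CPI + memory stall cycles per instruction.
def effective_cpi(base_cpi, icache_miss_rate, dcache_miss_rate,
                  mem_ref_fraction, miss_penalty):
    icache_stalls = icache_miss_rate * miss_penalty                      # every instruction is fetched
    dcache_stalls = mem_ref_fraction * dcache_miss_rate * miss_penalty   # only loads/stores
    return base_cpi + icache_stalls + dcache_stalls

cpi = effective_cpi(base_cpi=2, icache_miss_rate=0.02, dcache_miss_rate=0.04,
                    mem_ref_fraction=0.36, miss_penalty=100)
print(cpi)      # 5.44
print(cpi / 2)  # 2.72 times slower than the ideal cache
```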

Average Memory Access Time
AMAT = Hit time + Miss rate × Miss penalty
Hit time is also important for performance.
Example: CPU with a 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, cache miss rate = 5%
AMAT = 1 + 0.05 × 20 = 2 cycles
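
The AMAT formula in a one-line helper, reproducing the example (a minimal sketch; names are mine):

```python
def amat(hit_time, miss_rate, miss_penalty):
    # AMAT = hit time + miss rate * miss penalty
    return hit_time + miss_rate * miss_penalty

print(amat(hit_time=1, miss_rate=0.05, miss_penalty=20))  # 2.0 cycles
```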

Performance Summary
- When CPU performance increases, the miss penalty becomes more significant: a greater proportion of time is spent on memory stalls
- Decreasing the base CPI or increasing the clock rate makes memory stalls account for more CPU cycles
- Can't neglect cache behavior when evaluating system performance

Mechanisms to Reduce Cache Miss Rates
1. Allow more flexible block placement
2. Use multiple levels of caches

Reducing Cache Miss Rates #1
1. Allow more flexible block placement
- In a direct mapped cache, a memory block maps to exactly one cache block
- At the other extreme, a memory block could be allowed to map to any cache block: a fully associative cache
- A compromise is to divide the cache into sets, each of which consists of n "ways" (n-way set associative)
- A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices)
- All of the tags of all of the elements of the set must be searched for a match

Associative Caches
Fully associative (only one set):
- Allows a given block to go into any cache entry
- Requires all entries to be searched at once
- One comparator per entry (expensive)
n-way set associative:
- Each set contains n entries
- The block address determines the set: (block address) modulo (#sets in cache)
- All entries in a given set are searched at once
- n comparators (less expensive)
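
A minimal sketch of how the set index and tag fall out of a block address, using the modulo rule above (helper names are illustrative; a direct mapped cache is the 1-way case, fully associative the one-set case):

```python
def set_index(block_address, num_sets):
    return block_address % num_sets   # (block address) modulo (#sets in cache)

def tag(block_address, num_sets):
    return block_address // num_sets  # remaining high-order bits

# Block address 12 in an 8-block cache, as in the example that follows:
for label, num_sets in [("direct mapped", 8), ("2-way", 4), ("fully associative", 1)]:
    print(label, "-> set", set_index(12, num_sets), ", tag", tag(12, num_sets))
```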

Associative Cache Example
In the diagram, 12 is the block address. For the same address, once the number of index bits changes, the number of tag bits changes accordingly. The tag value in the 2-way set associative and fully associative caches therefore differs from the tag value in the direct mapped cache.

Spectrum of Associativity
For a cache with 8 entries

Associativity Example
Compare caches containing 4 cache blocks, each block 1 word wide:
direct mapped, 2-way set associative, and fully associative.
Block address sequence: 0, 8, 0, 6, 8

Direct mapped:
  Block address | Cache index | Hit/miss | Cache content after access
        0       |      0      |   miss   | Mem[0]
        8       |      0      |   miss   | Mem[8]
        0       |      0      |   miss   | Mem[0]
        6       |      2      |   miss   | Mem[0], Mem[6]
        8       |      0      |   miss   | Mem[8], Mem[6]
  5 misses

Associativity Example (cont.)
2-way set associative (LRU replacement):
  Block address | Set | Hit/miss | Set 0 content  | Set 1 content
        0       |  0  |   miss   | Mem[0]         | -
        8       |  0  |   miss   | Mem[0], Mem[8] | -
        0       |  0  |   hit    | Mem[0], Mem[8] | -
        6       |  0  |   miss   | Mem[0], Mem[6] | -
        8       |  0  |   miss   | Mem[8], Mem[6] | -
  4 misses

Fully associative:
  Block address | Hit/miss | Cache content after access
        0       |   miss   | Mem[0]
        8       |   miss   | Mem[0], Mem[8]
        0       |   hit    | Mem[0], Mem[8]
        6       |   miss   | Mem[0], Mem[8], Mem[6]
        8       |   hit    | Mem[0], Mem[8], Mem[6]
  3 misses
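
The miss counts in these tables can be reproduced with a small LRU cache simulation. This is a sketch of my own, not code from the slides:

```python
def count_misses(refs, num_blocks, num_ways):
    """Count misses for a cache of num_blocks blocks with num_ways-way sets."""
    num_sets = num_blocks // num_ways
    sets = [[] for _ in range(num_sets)]  # per set: block addresses, oldest first
    misses = 0
    for addr in refs:
        s = sets[addr % num_sets]
        if addr in s:
            s.remove(addr)                # hit: refresh the LRU position
        else:
            misses += 1
            if len(s) == num_ways:
                s.pop(0)                  # evict the least recently used block
        s.append(addr)
    return misses

refs = [0, 8, 0, 6, 8]
for ways in (1, 2, 4):                    # direct mapped, 2-way, fully associative
    print(f"{ways}-way: {count_misses(refs, num_blocks=4, num_ways=ways)} misses")
# -> 5, 4, and 3 misses, matching the tables above
```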

Another Reference String Mapping
Consider the main memory word reference string: 0 4 0 4 0 4 0 4
(4-bit word addresses: 0 = 0000, 4 = 0100; both map to the same direct mapped cache block)
Start with an empty cache, all blocks initially marked as not valid.
Every access misses: 0 loads Mem(0), 4 evicts it and loads Mem(4), the next 0 evicts Mem(4), and so on.
8 requests, 8 misses.
Ping pong effect due to conflict misses: two memory locations that map into the same cache block keep evicting each other.


Another Reference String Mapping (2-way set associative)
Consider the same reference string: 0 4 0 4 0 4 0 4
Start with an empty cache, all blocks initially marked as not valid.
0 misses and loads Mem(0); 4 misses and loads Mem(4) into the other way of the same set; all six remaining accesses hit.
8 requests, 2 misses.
This solves the ping pong effect of the direct mapped cache: two memory locations that map into the same cache set can now co-exist.
Another sample string to try: 0 1 2 3 0 8 11 0 3
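
The same count_misses sketch from the associativity example above reproduces the ping pong behavior, assuming a 4-block cache so that word addresses 0 and 4 conflict when direct mapped:

```python
refs = [0, 4] * 4                                    # the reference string 0 4 0 4 0 4 0 4
print(count_misses(refs, num_blocks=4, num_ways=1))  # 8 misses: ping pong
print(count_misses(refs, num_blocks=4, num_ways=2))  # 2 misses: Mem(0) and Mem(4) co-exist
```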

How Much Associativity
Increased associativity decreases the miss rate, but with diminishing returns.
Simulation of a system with a 64KB D-cache, 16-word blocks, SPEC2000:
- 1-way: 10.3%
- 2-way: 8.6%
- 4-way: 8.3%
- 8-way: 8.1%

Benefits of Set Associative Caches
The largest gains come from going from a direct mapped cache to a 2-way set associative cache (a 20+% reduction in miss rate).

Set Associative Cache Organization
Question: how wide is each cache block?


Range of Set Associative Caches
For a fixed size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets: it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit.
Address fields: Tag (used for tag compare) | Index (selects the set) | Block offset (selects the word in the block) | Byte offset
- Increasing associativity toward fully associative (only one set): the tag holds all the bits except the block and byte offsets
- Decreasing associativity toward direct mapped (only one way): smaller tags and only a single comparator
In 2008, the greater size and power consumption of CAMs generally led to 2-way and 4-way set associativity being built from standard SRAMs with comparators, with 8-way and above being built using CAMs.
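
A small sketch that computes the field widths as associativity doubles. The cache parameters are illustrative (power-of-two sizes and 4-byte words assumed):

```python
import math

def field_widths(addr_bits, cache_bytes, block_bytes, num_ways):
    num_sets = (cache_bytes // block_bytes) // num_ways
    byte_offset = 2                                   # byte within a 4-byte word
    block_offset = int(math.log2(block_bytes // 4))   # word within the block
    index = int(math.log2(num_sets))                  # selects the set
    tag = addr_bits - index - block_offset - byte_offset
    return tag, index, block_offset, byte_offset

# 4KB cache, 16-byte (4-word) blocks: each doubling of associativity moves
# one bit from the index to the tag.
for ways in (1, 2, 4, 256):
    print(ways, "way -> (tag, index, block, byte) =", field_widths(32, 4096, 16, ways))
```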

Example: Tag Size
Given a 32-bit byte address, a cache with 4K blocks, and a 4-word block size, how many tag bits does the cache need in total?
- Direct mapped
- 2-way set associative
- 4-way set associative
- Fully associative
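
One way to work the exercise (a sketch of mine; the constants follow the slide's parameters, and the loop covers all four configurations):

```python
import math

ADDR_BITS = 32
NUM_BLOCKS = 4 * 1024     # 4K blocks
OFFSET_BITS = 2 + 2       # 2 bits word-in-block + 2 bits byte-in-word (4-word blocks)

for ways in (1, 2, 4, NUM_BLOCKS):       # the last case is fully associative
    num_sets = NUM_BLOCKS // ways
    index_bits = int(math.log2(num_sets))
    tag_bits = ADDR_BITS - index_bits - OFFSET_BITS
    total_kbits = tag_bits * NUM_BLOCKS // 1024
    print(f"{ways}-way: {tag_bits} tag bits/block, {total_kbits} Kbits in total")
# -> 16, 17, 18, 28 tag bits; 64, 68, 72, 112 Kbits of tag storage
```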

Replacement Policy
Direct mapped: no choice.
Set associative:
- Prefer a non-valid entry, if there is one
- Otherwise, choose among the entries in the set
Least-recently used (LRU): choose the one unused for the longest time. Simple for 2-way, manageable for 4-way, too hard beyond that.
Random: gives approximately the same performance as LRU for high associativity.
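
A sketch of the victim-selection logic described above (the data layout and names are mine):

```python
import random

def choose_victim(ways, policy="lru"):
    """ways: one cache set, a list of dicts like {'valid': bool, 'last_used': int}."""
    for i, way in enumerate(ways):
        if not way["valid"]:              # prefer a non-valid entry, if there is one
            return i
    if policy == "lru":                   # the entry unused for the longest time
        return min(range(len(ways)), key=lambda i: ways[i]["last_used"])
    return random.randrange(len(ways))    # random: comparable to LRU at high associativity

ways = [{"valid": True, "last_used": 7}, {"valid": True, "last_used": 3},
        {"valid": True, "last_used": 9}, {"valid": True, "last_used": 5}]
print(choose_victim(ways))                # -> 1, the least recently used way
```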

Reducing Cache Miss Rates #2
2. Use multiple levels of caches
- Primary (L1) cache attached to the CPU: small but fast; separate L1 I$ and L1 D$
- Level-2 cache services misses from the L1 cache: larger and slower, but still faster than main memory; a unified cache for both instructions and data
- Main memory services L2 cache misses
- Some high-end systems include an L3 cache

Multilevel Cache Example
Given:
- CPU base CPI = 1, clock rate = 4 GHz
- Miss rate = 2% per instruction
- Main memory access time = 100 ns
With just the primary cache:
- Miss penalty = 100 ns / 0.25 ns = 400 cycles
- Effective CPI = 1 + 0.02 × 400 = 9

Example (cont.)
Now add an L2 cache:
- Access time = 5 ns
- Global miss rate to main memory = 0.5%
L1 miss with L2 hit: penalty = 5 ns / 0.25 ns = 20 cycles
L1 miss with L2 miss: extra penalty = 400 cycles
CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
Performance ratio = 9 / 3.4 = 2.6
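
Both halves of the example in a few lines (variable names are mine):

```python
CLOCK_NS = 0.25                       # 4 GHz clock
l1_miss_rate = 0.02                   # misses per instruction
mem_penalty = int(100 / CLOCK_NS)     # 100 ns main memory -> 400 cycles
l2_penalty = int(5 / CLOCK_NS)        # 5 ns L2 access -> 20 cycles
global_miss_rate = 0.005              # both L1 and L2 miss

cpi_l1_only = 1 + l1_miss_rate * mem_penalty
cpi_with_l2 = 1 + l1_miss_rate * l2_penalty + global_miss_rate * mem_penalty
print(cpi_l1_only)                    # 9.0
print(cpi_with_l2)                    # 3.4
print(cpi_l1_only / cpi_with_l2)      # ~2.6x faster with the L2 cache
```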

Multilevel Cache Considerations
Primary (L1) cache:
- Focus on minimal hit time
- Smaller total size with a smaller block size
L2 cache:
- Focus on a low miss rate to avoid main memory accesses; hit time has less overall impact
- Larger total size with a larger block size and higher levels of associativity

Global vs. Local Miss Rate
Global miss rate (GR): the fraction of references that miss in all levels of a multilevel cache. It dictates how often main memory is accessed.
Global miss rate so far (GRS): the fraction of references that miss up to a certain level in a multilevel cache.
Local miss rate (LR): the fraction of references to one level of a cache that miss.
The L2$ local miss rate >> the global miss rate.

Example
LR1 = 5%, LR2 = 20%, LR3 = 50%
GR = LR1 × LR2 × LR3 = 0.05 × 0.2 × 0.5 = 0.005
GRS1 = LR1 = 0.05
GRS2 = LR1 × LR2 = 0.05 × 0.2 = 0.01
GRS3 = LR1 × LR2 × LR3 = 0.05 × 0.2 × 0.5 = 0.005
CPI = 1 + GRS1 × Pty1 + GRS2 × Pty2 + GRS3 × Pty3
    = 1 + 0.05 × 10 + 0.01 × 20 + 0.005 × 100 = 2.2
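
The same computation as a loop, which also makes the GR/GRS relationship explicit (a sketch; names are mine):

```python
local_rates = [0.05, 0.20, 0.50]   # LR1..LR3
penalties = [10, 20, 100]          # Pty1..Pty3, cycles paid at each level

grs = []                           # global miss rate "so far" after each level
running = 1.0
for lr in local_rates:
    running *= lr
    grs.append(running)

print(grs)                         # [0.05, 0.01, 0.005]; GR = GRS3 = 0.005
cpi = 1 + sum(r * p for r, p in zip(grs, penalties))
print(cpi)                         # 2.2
```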

Sources of Cache Misses
Compulsory (cold start, process migration, first reference):
- First access to a block: a "cold" fact of life, not a whole lot you can do about it
- If you are going to run millions of instructions, compulsory misses are insignificant
- Solution: increase block size (increases miss penalty; very large blocks could increase miss rate)
Capacity:
- Cache cannot contain all blocks accessed by the program
- Solution: increase cache size (may increase access time)
Conflict (collision):
- Multiple memory locations mapped to the same cache location
- Solution 1: increase cache size
- Solution 2: increase associativity (may increase access time)
Keep in mind that the miss rate is only one part of the equation: you also have to worry about cache access time and miss penalty. Do NOT optimize the miss rate alone.
One further source of misses, not covered here: invalidation misses, which occur when another process (such as I/O) updates main memory, so the cache must be flushed to avoid inconsistency between memory and cache.

Cache Design Trade-offs

Design change          | Effect on miss rate         | Negative performance effect
Increase cache size    | Decreases capacity misses   | May increase access time
Increase associativity | Decreases conflict misses   | May increase access time
Increase block size    | Decreases compulsory misses | Increases miss penalty; very large blocks may increase the miss rate due to pollution