Improving Memory Access: The Cache and Virtual Memory


EECS 322 Computer Architecture
Improving Memory Access: The Cache and Virtual Memory
Instructor: Francis G. Wolff (wolff@eecs.cwru.edu), Case Western Reserve University

Cache associativity (Figure 7.15)
- Direct-mapped cache (a.k.a. 1-way set associative cache)
- 2-way set associative cache
- Fully associative cache

Cache architectures (Figure 7.16)
[Figure: four cache organizations, each entry holding a tag and a data field: one-way set associative (direct mapped, blocks 0-7), two-way set associative (sets 0-3), four-way set associative (sets 0-1), and eight-way set associative (fully associative, one set).]

Direct Mapped Cache: Example (Page 571)
Byte address fields: 3-bit tag | 2-bit index | 2-bit byte offset

lw $1,0($0)  → 000 00 00 → Miss (valid)
lw $2,32($0) → 010 00 00 → Miss (tag)
lw $3,0($0)  → 000 00 00 → Miss (tag)
lw $4,24($0) → 001 10 00 → Miss (valid)
lw $5,32($0) → 010 00 00 → Miss (tag)

5/5 misses. Addresses 0 and 32 both map to index 00, so each load evicts the other's block: index 00 holds Memory[000 00 00], then Memory[010 00 00], then Memory[000 00 00], then Memory[010 00 00] again. lw $4 fills index 10 with Memory[001 10 00]; indexes 01 and 11 stay invalid (N).
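
The tag/index/offset arithmetic above can be checked with a short sketch. This is a minimal C example, assuming the same 3-bit tag / 2-bit index / 2-bit offset split as the slide; the field widths are parameters of this sketch, not part of any real cache interface.

```c
#include <stdio.h>

/* Split a byte address into tag/index/offset for the slide's
 * direct-mapped cache: 2-bit byte offset, 2-bit index, 3-bit tag. */
int main(void) {
    const unsigned offset_bits = 2, index_bits = 2;
    const unsigned addrs[] = {0, 32, 0, 24, 32};    /* the 5 loads */
    for (int i = 0; i < 5; i++) {
        unsigned a = addrs[i];
        unsigned offset = a & ((1u << offset_bits) - 1);
        unsigned index  = (a >> offset_bits) & ((1u << index_bits) - 1);
        unsigned tag    = a >> (offset_bits + index_bits);
        printf("addr %2u -> tag %u, index %u, offset %u\n",
               a, tag, index, offset);
    }
    return 0;
}
```

Running it shows addresses 0 and 32 landing on index 0 with different tags, which is exactly why they keep evicting each other.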

2-Way Set Associative Cache: Example (Page 571)
Byte address fields: 4-bit tag | 1-bit index | 2-bit byte offset

lw $1,0($0)  → 0000 0 00 → Miss (valid)
lw $2,32($0) → 0100 0 00 → Miss (tag)
lw $3,0($0)  → 0000 0 00 → Hit!
lw $4,24($0) → 0011 0 00 → Miss (tag)
lw $5,32($0) → 0100 0 00 → Miss (tag)

4/5 misses. Set 0 holds two blocks, so tags 0000 and 0100 coexist and lw $3 hits; lw $4 (tag 0011) then evicts a way, forcing lw $5 to miss again. Final contents of set 0: Memory[0011 0 00] and Memory[0100 0 00]; set 1 stays invalid.

Fully Associative Cache: Example (Page 571)
Byte address fields: 5-bit tag | 2-bit byte offset (no index)

lw $1,0($0)  → 00000 00 → Miss (valid)
lw $2,32($0) → 01000 00 → Miss
lw $3,0($0)  → 00000 00 → Hit!
lw $4,24($0) → 00110 00 → Miss
lw $5,32($0) → 01000 00 → Hit!

3/5 misses. Any block may occupy any entry, so all three distinct blocks fit at once: tags 00000, 01000, and 00110 (Memory[00000 00], Memory[01000 00], Memory[00110 00]).

Cache comparisons (Page 571)
Example: the same 5 loads: lw $1,0($0); lw $2,32($0); lw $3,0($0); lw $4,24($0); lw $5,32($0)

Cache architecture       Misses
Direct mapped            5 out of 5
2-way set associative    4 out of 5
Fully associative        3 out of 5

Decreasing the number of misses speeds up the average memory access time; the price is comparator hardware and chip area.

While the direct mapped cache is on the simple end of the cache design spectrum, the fully associative cache is on the most complex end: it is the N-way set associative cache carried to the extreme, where N is the number of cache entries. In other words, no address bits are used as a cache index at all. All the upper bits of the address (except the byte select) are stored as the cache tag, and there is one comparator for every entry. The address is sent to all entries at once and compared in parallel, and only the entry that matches forwards its data to the output. This is called an associative lookup. Needless to say, it is very hardware intensive, so fully associative caches are usually limited to 64 or fewer entries. Since no cache index mapping is done, an item is never pushed out because multiple memory locations map to the same cache location; by definition, the conflict miss rate of a fully associative cache is zero. That does not mean the overall miss rate is zero, however. With 64 entries, the first 64 items accessed all fit, but bringing in the 65th item requires throwing one out. This brings us to the third type of cache miss: the capacity miss.
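
A small simulator makes the three miss counts concrete. This is a hedged C sketch, not the lecture's code: it models an N-way set associative cache with LRU replacement over the five byte addresses. With 4 blocks of 4 bytes total, ways = 1, 2, and 4 reproduce the direct-mapped, 2-way, and fully associative configurations above.

```c
#include <stdio.h>

enum { BLOCK_BYTES = 4, TOTAL_BLOCKS = 4 };

/* N-way set associative cache with LRU replacement; returns miss count. */
static int simulate(int ways, const unsigned *addrs, int n) {
    int sets = TOTAL_BLOCKS / ways;
    unsigned tag[TOTAL_BLOCKS];
    int valid[TOTAL_BLOCKS] = {0};
    int last_use[TOTAL_BLOCKS] = {0};
    int misses = 0;

    for (int i = 0; i < n; i++) {
        unsigned block = addrs[i] / BLOCK_BYTES;   /* block address   */
        int set = (int)(block % (unsigned)sets);   /* set index       */
        unsigned t = block / (unsigned)sets;       /* tag             */
        int base = set * ways, slot = -1;

        for (int w = 0; w < ways; w++)             /* look for a hit  */
            if (valid[base + w] && tag[base + w] == t) slot = base + w;

        if (slot < 0) {                            /* miss: pick a victim,
                                                      preferring invalid,
                                                      else least recent */
            misses++;
            slot = base;
            for (int w = 1; w < ways; w++)
                if (!valid[base + w] ||
                    (valid[slot] && last_use[base + w] < last_use[slot]))
                    slot = base + w;
            valid[slot] = 1;
            tag[slot] = t;
        }
        last_use[slot] = i + 1;                    /* mark most recent */
    }
    return misses;
}

int main(void) {
    const unsigned trace[] = {0, 32, 0, 24, 32};   /* the slide's 5 loads */
    printf("direct mapped:     %d/5 misses\n", simulate(1, trace, 5));
    printf("2-way set assoc.:  %d/5 misses\n", simulate(2, trace, 5));
    printf("fully associative: %d/5 misses\n", simulate(4, trace, 5));
    return 0;
}
```

The output matches the table: 5/5, 4/5, and 3/5 misses.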

Bits in a 64 KB Data Cache: Summary (Page 570)

Cache type               Index size  Tag size  Set size  Overhead  Miss rate
Temporal direct mapped   14 bits     16 bits    49 bits   1.50×     5.4%
Spatial direct mapped    12 bits     16 bits   145 bits   1.13×     1.9%
2-way set associative    11 bits     17 bits   311 bits   1.21×     1.5%
4-way set associative    10 bits     18 bits   585 bits   1.14×     -
Fully associative         0 bits     28 bits   157 bits   1.23×     -

(Overhead = total cache storage / 64 KB of data; each row is derived on the following slides.)

Decreasing miss penalty with multilevel caches (Page 576)

Effective CPI with only an L1 cache backed by main memory (5% miss rate, 100-cycle miss penalty):
Total CPI = Base CPI + memory stall cycles per instruction
          = 1.0 + (5% × 100 cycles) = 6.0

Effective CPI with L1 and L2 caches (5% of accesses pay the 10-cycle L1-to-L2 penalty; 2% go all the way to memory at 100 cycles):
Total CPI = Base CPI + (L1-to-L2 penalty) + (L2-to-memory penalty)
          = 1.0 + (5% × 10 cycles) + (2% × 100 cycles) = 3.5
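
The arithmetic generalizes to any miss rates and penalties. A minimal C sketch, using the slide's numbers; the function name and parameters are illustrative, not from the lecture.

```c
#include <stdio.h>

/* Effective CPI = base CPI + sum of (miss rate x miss penalty) terms. */
static double effective_cpi(double base, double rate1, double pen1,
                            double rate2, double pen2) {
    return base + rate1 * pen1 + rate2 * pen2;
}

int main(void) {
    /* L1 only: 5% of accesses pay the full 100-cycle memory penalty. */
    printf("L1 only: CPI = %.1f\n",
           effective_cpi(1.0, 0.05, 100.0, 0.0, 0.0));
    /* L1 + L2: 5% pay 10 cycles to L2; 2% (global) pay 100 to memory. */
    printf("L1 + L2: CPI = %.1f\n",
           effective_cpi(1.0, 0.05, 10.0, 0.02, 100.0));
    return 0;
}
```

This prints 6.0 and 3.5, matching the slide.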

Direct Mapping: Cache addresses

N-Way Set-Associative Mapping: Cache addresses
An N-way set associative cache consists of a number of sets, where each set consists of N blocks.

Direct Mapped Cache
Suppose we have the following byte addresses: 0, 32, 0, 24, 32.
Given a direct mapped cache holding 16 data bytes, with 4 data bytes per cache block:
Number of cache blocks = 16 bytes / (4 bytes per block) = 4
The cache block addresses are: 0, 8, 0, 6, 8
The cache index addresses are: 0, 0, 0, 2, 0
For example, for byte address 24:
Block address = floor(byte address / data bytes per block) = floor(24/4) = 6
Cache index = block address % number of blocks = 6 % 4 = 2
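
These two formulas are all the mapping hardware computes. A small C sketch under the slide's assumptions (4-byte blocks), covering both the 4-set direct-mapped cache here and the 2-set 2-way cache on the following slide:

```c
#include <stdio.h>

int main(void) {
    const unsigned addrs[] = {0, 32, 0, 24, 32};
    const unsigned block_bytes = 4;
    for (int sets = 4; sets >= 2; sets /= 2) {    /* 4 sets, then 2 sets */
        printf("%d sets:", sets);
        for (int i = 0; i < 5; i++) {
            unsigned block = addrs[i] / block_bytes;  /* block address  */
            printf("  %u->%u", block, block % sets);  /* cache index    */
        }
        printf("\n");
    }
    return 0;
}
```

With 4 sets the indexes come out 0, 0, 0, 2, 0; with 2 sets they are all 0, matching both slides.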

2-Way Set Associative Cache
Again, suppose we have the following byte addresses: 0, 32, 0, 24, 32.
Given a 2-way set associative cache holding 16 data bytes, with 4 data bytes per cache block:
Number of sets = 16 bytes / (2 × 4 bytes per block) = 2
The 2-way cache block addresses are: 0, 8, 0, 6, 8
The 2-way cache index addresses are: 0, 0, 0, 0, 0
For example, for byte address 24:
Block address = floor(byte address / data bytes per block) = floor(24/4) = 6
Cache index = block address % number of sets = 6 % 2 = 0
Note: the direct mapped cache index addresses were 0, 0, 0, 2, 0.

Fully Associative Cache
Again, suppose we have the following byte addresses: 0, 32, 0, 24, 32.
Given a fully associative cache holding 16 data bytes, with 4 data bytes per cache block (4 entries):
The block addresses are still 0, 8, 0, 6, 8, but there are no cache index addresses at all: no address bits select an entry, so the entire block address serves as the tag and is compared against every entry in parallel.

Bits in a Temporal Direct Mapped Cache (Page 550)
How many total bits are required for a cache with 64 KB (= 2^16 bytes) of data and one-word (4-byte, 32-bit) blocks, assuming a 32-bit byte memory address?

Cache index width = log2(number of words) = log2(2^16 / 4) = log2(2^14) = 14 bits
Block address width = <byte address width> - log2(bytes per word) = 32 - 2 = 30 bits
Tag size = <block address width> - <cache index width> = 30 - 14 = 16 bits
Cache block size = <valid> + <tag> + <block data> = 1 bit + 16 bits + 32 bits = 49 bits
Total size = <number of blocks> × <cache block size> = 2^14 × 49 bits = 784 × 2^10 bits = 784 Kbits = 98 KB
Overhead = 98 KB / 64 KB = 1.5 times

Bits in a Spatial Direct Mapped Cache (Page 570)
How many total bits are required for a cache with 64 KB (= 2^16 bytes) of data and four-word blocks (4 × 4 = 16 bytes = 4 × 32 = 128 bits), assuming a 32-bit byte memory address?

Cache index width = log2(number of blocks) = log2(2^16 / 16) = log2(2^12) = 12 bits
Block address width = <byte address width> - log2(block size) = 32 - 4 = 28 bits
Tag size = <block address width> - <cache index width> = 28 - 12 = 16 bits
Cache block size = <valid> + <tag> + <block data> = 1 bit + 16 bits + 128 bits = 145 bits
Total size = <number of blocks> × <cache block size> = 2^12 × 145 bits = 593,920 bits / (1024 × 8) = 72.5 KB
Overhead = 72.5 KB / 64 KB = 1.13 times

Bits in a 2-Way Set Associative Cache (Page 570)
Same cache: 64 KB (= 2^16 bytes) of data, four-word blocks (16 bytes = 128 bits), 32-bit byte memory address.

Cache index width = log2(number of sets) = log2(2^16 / (2 × 16)) = log2(2^11) = 11 bits
Block address width = 32 - 4 = 28 bits (same as before)
Tag size = <block address width> - <cache index width> = 28 - 11 = 17 bits
Cache set size = <valid> + 2 × (<tag> + <block data>) = 1 bit + 2 × (17 bits + 128 bits) = 311 bits
Total size = <number of sets> × <cache set size> = 2^11 × 311 bits = 636,928 bits / (1024 × 8) = 77.75 KB
Overhead = 77.75 KB / 64 KB = 1.21 times

Bits in a 4-Way Set Associative Cache (Page 570)
Same cache: 64 KB (= 2^16 bytes) of data, four-word blocks (16 bytes = 128 bits), 32-bit byte memory address.

Cache index width = log2(number of sets) = log2(2^16 / (4 × 16)) = log2(2^10) = 10 bits
Block address width = 32 - 4 = 28 bits (same as before)
Tag size = <block address width> - <cache index width> = 28 - 10 = 18 bits
Cache set size = <valid> + 4 × (<tag> + <block data>) = 1 bit + 4 × (18 bits + 128 bits) = 585 bits
Total size = <number of sets> × <cache set size> = 2^10 × 585 bits = 599,040 bits / (1024 × 8) = 73.125 KB
Overhead = 73.125 KB / 64 KB = 1.14 times

Bits in a Fully Associative Cache (Page 570)
Same cache: 64 KB (= 2^16 bytes) of data, four-word blocks (16 bytes = 128 bits), 32-bit byte memory address.

Cache index width = 0 (no index)
Block address width = 32 - 4 = 28 bits (same as before)
Tag size = <block address width> - <cache index width> = 28 - 0 = 28 bits
Cache entry size = <valid> + <tag> + <block data> = 1 bit + 28 bits + 128 bits = 157 bits
Total size = <number of blocks> × <cache entry size> = 2^12 × 157 bits = 643,072 bits / (1024 × 8) = 78.5 KB
Overhead = 78.5 KB / 64 KB = 1.23 times
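
All the rows of the page 570 summary follow from one counting argument. A hedged C sketch, assuming a simple per-block accounting (1 valid bit + tag + data per block); this matches the direct-mapped and fully associative slides exactly, while the 2-way and 4-way slides fold valid bits into the set slightly differently, so those two totals differ by a few percent from the 311/585-bit figures above.

```c
#include <stdio.h>
#include <math.h>

/* Per-block accounting: each block stores 1 valid bit + tag + data. */
static void cache_bits(const char *name, unsigned data_bytes,
                       unsigned block_bytes, unsigned ways,
                       unsigned addr_bits) {
    unsigned blocks = data_bytes / block_bytes;
    unsigned sets = blocks / ways;
    unsigned offset_bits = (unsigned)log2(block_bytes);
    unsigned index_bits = (sets > 1) ? (unsigned)log2(sets) : 0;
    unsigned tag_bits = addr_bits - offset_bits - index_bits;
    unsigned long total = (unsigned long)blocks *
                          (1 + tag_bits + block_bytes * 8);
    printf("%-22s index %2u  tag %2u  total %7lu bits (%.2fx overhead)\n",
           name, index_bits, tag_bits, total,
           (double)total / (data_bytes * 8.0));
}

int main(void) {
    const unsigned KB = 1024, A = 32;    /* 32-bit byte addresses */
    cache_bits("temporal direct mapped", 64*KB,  4, 1,    A);
    cache_bits("spatial direct mapped",  64*KB, 16, 1,    A);
    cache_bits("2-way set associative",  64*KB, 16, 2,    A);
    cache_bits("4-way set associative",  64*KB, 16, 4,    A);
    cache_bits("fully associative",      64*KB, 16, 4096, A);
    return 0;
}
```

The index and tag widths (14/16, 12/16, 11/17, 10/18, 0/28) reproduce the summary table exactly.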

Virtual Memory (Figure 7.20)
Main memory can act as a cache for the secondary storage (disk).
Advantages:
- illusion of having more physical memory
- program relocation
- protection

Pages: virtual memory blocks (Figure 7.21)
Analogy: virtual memory is the automatic transmission of a car; the hardware and software do the shifting.

Page Tables (Figure 7.23)

Virtual Example: 32-Byte Pages
Virtual address fields: virtual page number | 5-bit page offset. The cache splits the same address as before: 5-bit tag | 1-bit index | 2-bit byte offset.

lw $1,32($0) → virtual address 001 00000 → cache fields 00100 0 00 → Cache miss, Page fault
lw $2,40($0) → virtual address 001 01000 → cache fields 00101 0 00 → Cache miss, Page hit!

Both addresses lie on virtual page 001, so the first load faults the page in from disk and the second finds it resident.
Cache afterwards: tags 00100 and 00101 valid (Memory[00100 0 00], Memory[00101 0 00]).
Page table: virtual page 0 → valid N (on disk); virtual page 1 → valid Y, physical page 111.
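
The virtual page number is just the address shifted right by the page offset width. A minimal C sketch, assuming the slide's 32-byte pages (5 offset bits):

```c
#include <stdio.h>

int main(void) {
    const unsigned page_bytes = 32;          /* 32-byte pages        */
    const unsigned offset_bits = 5;          /* log2(32) offset bits */
    const unsigned addrs[] = {32, 40};       /* the slide's two loads */
    for (int i = 0; i < 2; i++) {
        unsigned vpn = addrs[i] >> offset_bits;     /* virtual page # */
        unsigned off = addrs[i] & (page_bytes - 1); /* page offset    */
        printf("vaddr %2u -> virtual page %u, offset %u\n",
               addrs[i], vpn, off);
    }
    return 0;
}
```

Both addresses report virtual page 1 (offsets 0 and 8), which is why the second load is a page hit.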

Page Table Size
Suppose we have:
- a 32-bit virtual address (2^32 bytes)
- 4096 bytes per page (2^12 bytes)
- 4 bytes per page table entry (2^2 bytes)
What is the total page table size?
Number of page table entries = 2^32 / 2^12 = 2^20
Page table size = 2^20 entries × 2^2 bytes = 2^22 bytes = 4 MB
4 megabytes just for page tables!! Too big.
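
The same arithmetic in a few lines of C (the constants mirror the slide; nothing here is specific to any real OS):

```c
#include <stdio.h>

int main(void) {
    const unsigned long long vaddr_space = 1ULL << 32;  /* 2^32 bytes    */
    const unsigned long long page_bytes  = 1ULL << 12;  /* 4 KB pages    */
    const unsigned long long pte_bytes   = 4;           /* bytes per PTE */
    unsigned long long entries = vaddr_space / page_bytes;  /* 2^20 */
    unsigned long long table   = entries * pte_bytes;       /* 2^22 */
    printf("entries: %llu, page table: %llu MB\n",
           entries, table >> 20);
    return 0;
}
```

It prints 1,048,576 entries and a 4 MB table, confirming the slide.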

Page Tables (Figure 7.22)

Making Address Translation Fast
A cache for address translations: the translation lookaside buffer (TLB).
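
A TLB is itself a small cache indexed by virtual page number. A hedged C sketch of a direct-mapped TLB lookup; the structure, sizes, and function names are illustrative assumptions, not taken from the lecture or any real MMU.

```c
#include <stdio.h>
#include <stdbool.h>

#define TLB_ENTRIES 16
#define OFFSET_BITS 12   /* 4 KB pages, as on the page-table-size slide */

struct tlb_entry { bool valid; unsigned vpn; unsigned ppn; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Translate a virtual address; returns true on a TLB hit. On a miss,
 * real hardware (or the OS) would walk the page table and refill. */
static bool translate(unsigned vaddr, unsigned *paddr) {
    unsigned vpn = vaddr >> OFFSET_BITS;
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn) {
        *paddr = (e->ppn << OFFSET_BITS)
               | (vaddr & ((1u << OFFSET_BITS) - 1));
        return true;   /* hit: translation without touching the page table */
    }
    return false;      /* miss: consult the page table, then retry */
}

int main(void) {
    tlb[1] = (struct tlb_entry){ .valid = true, .vpn = 1, .ppn = 7 };
    unsigned paddr;
    if (translate(0x1234, &paddr))           /* vpn 1 -> ppn 7 */
        printf("0x1234 -> 0x%x\n", paddr);   /* prints 0x7234  */
    return 0;
}
```

The key point the slide makes survives in the sketch: on a hit, the page table is never consulted, so translation costs one small lookup instead of an extra memory access.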

TLBs and caches