Appendix C Memory Hierarchy

Why care about memory hierarchies?
- The processor-memory performance gap keeps growing.
- Memory accesses are a major source of stall cycles.

Levels of the Memory Hierarchy
Level         Capacity     Access time              Staging/transfer unit         Managed by
Registers     100s bytes   < 0.5 ns                 Instr. operands (1-8 bytes)   program/compiler
Cache         KBytes       ~1 ns                    Blocks                        cache controller
Main memory   MBytes       ~100 ns                  Pages (512 B - 4 KB)          OS
Disk          GBytes       ~10 ms (10,000,000 ns)   Files (MBytes)                user/operator
Tape          infinite     sec-min                  -                             -
Toward the upper levels, memory gets faster; toward the lower levels, it gets larger and cheaper per bit.

Motivating memory hierarchies
- Two structures hold data:
  - Registers: a small array of storage
  - Memory: a large array of storage
- What characteristics would we like memory to have?
  - High capacity
  - Low latency
  - Low cost
- No single memory technology satisfies all of these requirements.

Memory hierarchy
Solution: use a little bit of everything!
- Small SRAM array (cache): small means fast and cheap.
- Larger DRAM array (main memory): hope you rarely have to use it.
- Extremely large disk: costs are decreasing at a faster rate than we fill them.

Terminology
- Hit: the data you want is found at a given level.
- Miss: the data is not present at that level; in this case, check the next lower level.
- Hit rate: fraction of accesses that hit at a given level; miss rate = 1 - hit rate.
- Another performance measure is average memory access time:
  AMAT = (hit time) + (miss rate) x (miss penalty)
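
As a quick sanity check of the formula, here is a minimal sketch (the helper name is mine, not from the slides) that evaluates AMAT for a single cache level:

    # Minimal sketch: average memory access time for one cache level.
    def amat(hit_time, miss_rate, miss_penalty):
        """All times in cycles; miss_rate is a fraction in [0, 1]."""
        return hit_time + miss_rate * miss_penalty

    # Example: 1-cycle hit, 2% miss rate, 100-cycle miss penalty -> 3.0 cycles
    print(amat(hit_time=1, miss_rate=0.02, miss_penalty=100))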

Memory hierarchy operation
- We'd like most accesses to use the cache, the fastest level of the hierarchy.
- But the cache is much smaller than the address space.
- Most caches still have a hit rate > 80%. How is that possible?
  - The cache holds the data most likely to be accessed.

Principle of locality
Programs don't access data randomly; they display locality in two forms:
- Temporal locality: if you access a memory location (e.g., 1000), you are more likely to re-access that location than some random location.
- Spatial locality: if you access a memory location (e.g., 1000), you are more likely to access a location near it (e.g., 1001) than some random location.

Cache Basics
- A fast (but small) memory close to the processor.
- When data is referenced:
  - If it is in the cache, use the cache instead of memory.
  - If not, bring it into the cache (actually, bring in the entire block containing it).
  - You may have to kick something else out to make room.
- Important decisions:
  - Placement: where in the cache can a block go?
  - Identification: how do we find a block in the cache?
  - Replacement: what do we kick out to make room?
  - Write policy: what do we do about stores?

4 Questions for Memory Hierarchy
- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)

Q1: Cache Placement
- Placement: which memory blocks are allowed into which cache lines.
- Placement policies:
  - Direct mapped: a block can go to only one line.
  - Fully associative: a block can go to any line.
  - Set-associative: a block can go to one of N lines in a set.
    - E.g., if N = 4, the cache is 4-way set associative.
    - The other two policies are the extremes of this (e.g., N = 1 gives a direct-mapped cache).

Q1: Block placement
Example: where can block 12 be placed in an 8-block cache?
- Fully associative: in any of the 8 lines.
- Direct mapped: 12 mod 8 = 4, so it goes in line 4.
- 2-way set associative (4 sets): 12 mod 4 = 0, so it goes in set 0.
Set-associative mapping: set = block number mod number of sets.
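
A minimal sketch (hypothetical helper, not from the slides) of the same set-mapping rule, covering all three policies:

    # Which set does a memory block map to, for a cache with num_lines lines?
    def target_set(block_number, num_lines, assoc):
        """assoc = 1 -> direct mapped; assoc = num_lines -> fully associative."""
        num_sets = num_lines // assoc
        return block_number % num_sets

    # Block 12 in an 8-line cache:
    print(target_set(12, 8, assoc=1))  # direct mapped     -> line/set 4
    print(target_set(12, 8, assoc=2))  # 2-way set assoc.  -> set 0
    print(target_set(12, 8, assoc=8))  # fully associative -> set 0 (the only set)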

Q2: Cache Identification
- When an address is referenced, we need to:
  - Find whether its data is in the cache.
  - If it is, find where in the cache it lives.
  - This is called a cache lookup.
- Each cache line must have:
  - A valid bit (1 if the line holds data, 0 if the line is empty); we also say the cache line is valid or invalid.
  - A tag to identify which block is in the line (if the line is valid).

Q2: Block identification
- A tag is kept for each block; there is no need to check the index or block offset bits.
- Increasing associativity shrinks the index and expands the tag.
- Address fields: the block address is split into Tag | Index, followed by the Block Offset.

Address breakdown
- Block offset: byte address within the block; # block offset bits = log2(block size).
- Index: line (or set) number within the cache; # index bits = log2(# of cache lines).
- Tag: the remaining bits.
- Address layout: | Tag | Index | Block offset |

Address breakdown example
Given a 32-bit address, a 32 KB direct-mapped cache, and 64-byte blocks, what are the sizes of the tag, index, and block offset fields?
- Index = 9 bits, since there are 32 KB / 64 B = 2^9 blocks.
- Block offset = 6 bits, since each block has 64 B = 2^6 bytes.
- Tag = 32 - 9 - 6 = 17 bits.
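
The same arithmetic as a small sketch (helper name is mine) that works for any direct-mapped configuration with power-of-two sizes:

    # Field widths (tag, index, offset) for a direct-mapped cache.
    def field_widths(addr_bits, cache_bytes, block_bytes):
        offset_bits = (block_bytes - 1).bit_length()                 # log2(block size)
        index_bits = (cache_bytes // block_bytes - 1).bit_length()   # log2(# of lines)
        tag_bits = addr_bits - index_bits - offset_bits
        return tag_bits, index_bits, offset_bits

    # 32-bit address, 32 KB cache, 64-byte blocks -> (17, 9, 6)
    print(field_widths(32, 32 * 1024, 64))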

Q3: Block replacement
- When we need to evict a line, which one do we choose?
  - The choice is easy for direct-mapped caches; what about set-associative or fully associative?
- We want to evict the data least likely to be used next.
  - Temporal locality suggests that is the line accessed farthest in the past.
- Common policies (a software sketch of LRU follows this list):
  - Least recently used (LRU): hard to implement exactly in hardware, so it is often approximated.
  - Random: a randomly selected line.
  - FIFO: the line that has been in the cache the longest.
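
To make the LRU idea concrete, here is a tiny software sketch (illustrative only; real hardware approximates this) of LRU bookkeeping for one set, using an ordered dict:

    from collections import OrderedDict

    # Illustrative LRU bookkeeping for a single set with `ways` lines.
    class LRUSet:
        def __init__(self, ways):
            self.ways = ways
            self.lines = OrderedDict()          # tag -> data, ordered oldest -> newest

        def access(self, tag):
            if tag in self.lines:               # hit: mark as most recently used
                self.lines.move_to_end(tag)
                return "hit"
            if len(self.lines) == self.ways:    # miss and set is full: evict the LRU line
                self.lines.popitem(last=False)
            self.lines[tag] = None              # fill with the new block
            return "miss"

    s = LRUSet(ways=2)
    print([s.access(t) for t in [0, 1, 0, 2, 1]])  # ['miss', 'miss', 'hit', 'miss', 'miss']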

Q4: What happens on a write?
- Write-through: data written to the cache block is also written to lower-level memory.
- Write-back: data is written only to the cache; the lower level is updated when the block falls out of the cache.
Comparison:
- Debugging: write-through is easy; write-back is hard.
- Do read misses produce writes? Write-through: no; write-back: yes.
- Do repeated writes make it to the lower level? Write-through: yes; write-back: no.

Write Policy
- Do we allocate cache lines on a write?
  - Write-allocate: a write miss brings the block into the cache.
  - No-write-allocate: a write miss leaves the cache as it was.
- Do we update memory on writes?
  - Write-through: memory is immediately updated on each write.
  - Write-back: memory is updated only when the line is replaced.

Write Buffers for Write-Through Caches
- A write buffer sits between the cache and lower-level memory and holds data awaiting write-through.
- Q: Why a write buffer? A: So the CPU doesn't stall on every write.
- Q: Why a buffer, and not just one register? A: Bursts of writes are common.

Write-Back Caches
- Need a dirty bit for each line: a dirty line has more recent data than memory.
- A line starts clean (not dirty) and becomes dirty on the first write to it.
  - Memory is not updated yet; for a dirty line, the cache has the only up-to-date copy of the data.
- Replacing a dirty line: its data must be written back to memory (write-back).

Basic cache design
- Cache memory can copy data from any part of main memory.
  - Tag: the memory address (or the upper part of it).
  - Block: the actual data.
- On each access, compare the address with the tag.
  - If they match: hit; get the data from the cache block.
  - If they don't: miss; get the data from main memory.

Cache organization
- A cache consists of multiple tag/block pairs, called cache lines (or blocks).
  - Lines can be searched in parallel (within reason).
  - Each line also has a valid bit; write-back caches also have a dirty bit.
- Block sizes can vary.
  - Most systems use between 32 and 128 bytes.
  - Larger blocks exploit spatial locality.
  - A larger block size also means fewer lines, and thus less total tag storage.

Direct-mapped cache example
Assume the following simple setup:
- Only 2 levels in the hierarchy.
- 16-byte memory, so 4-bit addresses.
- Cache organization: direct-mapped, 8 total bytes, 2 bytes per block (4 lines), write-back.
- This leads to the following address breakdown: offset = 1 bit, index = 2 bits, tag = 1 bit.
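
The access trace that follows can also be reproduced in software. Below is a minimal simulator sketch for this exact configuration (4 lines, 2-byte blocks, write-back; write-allocate on store misses is assumed, which matches the trace; all names are illustrative):

    # Illustrative simulator for the 4-line, 2-bytes-per-block, write-back,
    # write-allocate direct-mapped cache used in the example (4-bit addresses).
    class TinyCache:
        def __init__(self, memory):
            self.mem = memory                      # 16-byte backing store (a list)
            self.lines = [dict(v=0, d=0, tag=0, data=[0, 0]) for _ in range(4)]
            self.hits = self.misses = 0

        def _lookup(self, addr):
            offset, index, tag = addr & 1, (addr >> 1) & 3, addr >> 3
            line = self.lines[index]
            if line["v"] and line["tag"] == tag:
                self.hits += 1
            else:
                self.misses += 1
                if line["v"] and line["d"]:        # write back the dirty victim
                    base = (line["tag"] << 3) | (index << 1)
                    self.mem[base:base + 2] = line["data"]
                base = addr & ~1                   # fill the line from memory
                line.update(v=1, d=0, tag=tag, data=self.mem[base:base + 2])
            return line, offset

        def load(self, addr):
            line, offset = self._lookup(addr)
            return line["data"][offset]

        def store(self, addr, value):
            line, offset = self._lookup(addr)
            line["data"][offset] = value
            line["d"] = 1

    mem = list(range(16))                 # placeholder memory contents: mem[i] = i
    c = TinyCache(mem)
    print(c.load(1), c.load(8))           # both compulsory misses
    print("hits:", c.hits, "misses:", c.misses)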

Direct-mapped cache example: access trace
The instruction sequence is:
  lb $t0, 1($zero)
  lb $t1, 8($zero)
  sb $t1, 4($zero)
  sb $t0, 13($zero)
  lb $t1, 9($zero)
Initially all cache lines are invalid and $t0 = ?, $t1 = ?.

Access #1: lb $t0, 1($zero). Address 1 = 0001: tag = 0, index = 00, offset = 1. The line is invalid, so this is a miss; the block holding addresses 0-1 is brought into line 00. Hits: 0, misses: 1. $t0 = 29.

Access #2: lb $t1, 8($zero). Address 8 = 1000: tag = 1, index = 00, offset = 0. Line 00 is valid but holds tag 0, so this is a miss; the block holding addresses 8-9 replaces it (the old block is clean, so nothing is written back). Hits: 0, misses: 2. $t1 = 18.

Access #3: sb $t1, 4($zero). Address 4 = 0100: tag = 0, index = 10, offset = 0. The line is invalid, so this is a miss; the block holding addresses 4-5 is brought into line 10, the store updates it, and its dirty bit is set. Hits: 0, misses: 3.

Access #4: sb $t0, 13($zero). Address 13 = 1101: tag = 1, index = 10, offset = 1. Line 10 holds tag 0, so this is a miss; the victim line is dirty and must be written back to memory before the block holding addresses 12-13 is brought in and updated. Hits: 0, misses: 4.

Access #5: lb $t1, 9($zero). Address 9 = 1001: tag = 1, index = 00, offset = 1. Line 00 is valid and holds tag 1, so this is a hit. Hits: 1, misses: 4. $t1 = 21.

Cache performance
Simplified model:
  CPU time = (CPU clock cycles + memory stall cycles) x cycle time
  Memory stall cycles = # of misses x miss penalty
                      = IC x misses/instruction x miss penalty
                      = IC x memory accesses/instruction x miss rate x miss penalty
  Average CPI = CPI (without stalls) + memory accesses/instruction x miss rate x miss penalty
  AMAT = hit time + miss rate x miss penalty

Example
A computer has CPI = 1 when all accesses hit. Loads and stores are 50% of instructions. If the miss penalty is 25 cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?
- All hits: CPU time = (IC x CPI + 0) x CCT = IC x 1.0 x CCT
- Real cache with stalls:
  - Memory stall cycles = IC x (1 + 0.5) x 0.02 x 25 = IC x 0.75
  - CPU time = (IC x 1.0 + IC x 0.75) x CCT = 1.75 x IC x CCT
- Speedup = (1.75 x IC x CCT) / (IC x CCT) = 1.75
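
The same calculation as a short script (variable names are mine):

    # Speedup of an ideal (all-hit) machine over one with a real cache.
    cpi_base = 1.0
    mem_refs_per_instr = 1 + 0.5        # 1 instruction fetch + 0.5 data accesses
    miss_rate, miss_penalty = 0.02, 25

    stall_cpi = mem_refs_per_instr * miss_rate * miss_penalty   # 0.75
    speedup = (cpi_base + stall_cpi) / cpi_base
    print(speedup)                                               # 1.75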

Average memory access time
- Unified cache:
  AMAT = (hit time) + (miss rate) x (miss penalty)
- Split cache:
  AMAT = %instructions x (hit time + instruction miss rate x miss penalty)
       + %data x (hit time + data miss rate x miss penalty)
- Multi-level cache:
  AMAT = hit time(L1) + miss rate(L1) x miss penalty(L1)
       = hit time(L1) + miss rate(L1) x (hit time(L2) + miss rate(L2) x miss penalty(L2))
  Miss rate(L2) is measured on the leftovers from the L1 cache (the references that miss in L1).

Example (split cache vs. unified cache)
Which has the lower miss rate: a 16 KB instruction cache with a 16 KB data cache, or a 32 KB unified cache? The misses per 1000 instructions for the instruction, data, and unified caches are 3.82, 40.9, and 43.3, respectively. Assume 36% of instructions are data transfer instructions, a hit takes 1 CC, the miss penalty is 200 CC, and a load or store hit takes 1 extra CC on the unified cache. What is the AMAT of each?
- Miss rate = (misses/instruction) / (memory accesses/instruction)
- Miss rate(I) = (3.82/1000) / 1 = 0.004
- Miss rate(D) = (40.9/1000) / 0.36 = 0.114
- Miss rate(U) = (43.3/1000) / (1 + 0.36) = 0.0318
- About 74% of accesses are instruction fetches and 26% are data accesses, so:
  Miss rate(split) = 74% x 0.004 + 26% x 0.114 = 0.0326
- The 32 KB unified cache has a slightly lower miss rate (0.0318 vs. 0.0326).

Example (cont.)
AMAT = %instructions x (hit time + instruction miss rate x miss penalty) + %data x (hit time + data miss rate x miss penalty)
- AMAT(split)   = 74% x (1 + 0.004 x 200) + 26% x (1 + 0.114 x 200) = 7.52
- AMAT(unified) = 74% x (1 + 0.0318 x 200) + 26% x (2 + 0.0318 x 200) = 7.62
Despite its slightly higher miss rate, the split cache has the lower AMAT, because the unified cache adds an extra cycle to load and store hits.
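
A small script (my own sketch of the arithmetic above) reproduces both the miss rates and the AMATs:

    # Split vs. unified cache: miss rates and AMAT (times in clock cycles).
    misses_per_1000 = {"instr": 3.82, "data": 40.9, "unified": 43.3}
    data_refs_per_instr = 0.36
    hit_time, miss_penalty, unified_extra = 1, 200, 1

    mr_i = misses_per_1000["instr"] / 1000 / 1.0
    mr_d = misses_per_1000["data"] / 1000 / data_refs_per_instr
    mr_u = misses_per_1000["unified"] / 1000 / (1 + data_refs_per_instr)

    frac_i = 1 / (1 + data_refs_per_instr)     # ~74% of accesses are instruction fetches
    frac_d = 1 - frac_i                        # ~26% are data accesses

    amat_split = frac_i * (hit_time + mr_i * miss_penalty) + \
                 frac_d * (hit_time + mr_d * miss_penalty)
    amat_unified = frac_i * (hit_time + mr_u * miss_penalty) + \
                   frac_d * (hit_time + unified_extra + mr_u * miss_penalty)
    # ~7.58 and ~7.63; the slides round intermediate values and get 7.52 and 7.62
    print(round(amat_split, 2), round(amat_unified, 2))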

Another Example (multilevel cache)
Suppose that in 1000 memory references there are 40 misses in the L1 cache and 20 misses in the L2 cache. What are the miss rates? Assume the miss penalty from the L2 cache to memory is 200 CC, the hit time of the L2 cache is 10 CC, and the hit time of L1 is 1 CC. What is the AMAT?
- Miss rate(L1) = 40/1000 = 4%
- Miss rate(L2) = 20/40 = 50%
- AMAT = hit time(L1) + miss rate(L1) x (hit time(L2) + miss rate(L2) x miss penalty(L2))
       = 1 + 4% x (10 + 50% x 200) = 5.4 CC
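
As a sketch (helper name is mine), the two-level AMAT formula applied to these numbers:

    # Two-level cache AMAT; the L2 miss rate is local (misses in L2 / accesses reaching L2).
    def amat_two_level(hit_l1, mr_l1, hit_l2, mr_l2_local, penalty_l2):
        return hit_l1 + mr_l1 * (hit_l2 + mr_l2_local * penalty_l2)

    # 40/1000 L1 misses, 20/40 local L2 misses -> 1 + 0.04 * (10 + 0.5 * 200) = 5.4
    print(amat_two_level(1, 0.04, 10, 0.5, 200))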

Reasons for cache misses
AMAT = (hit time) + (miss rate) x (miss penalty), so reducing misses improves performance.
The three C's:
- Compulsory miss: the first reference to an address. Reduce by increasing the block size.
- Capacity miss: the cache is too small to hold the data. Reduce by increasing the cache size.
- Conflict miss: the block was replaced from a busy line or set, and the access would have hit in a fully associative cache. Reduce by increasing associativity.

Six Basic Cache Optimizations
Reducing miss rate:
1. Larger block size (compulsory misses)
2. Larger cache size (capacity misses)
3. Higher associativity (conflict misses)
Reducing miss penalty:
4. Multilevel caches
5. Giving read misses priority over writes (e.g., let a read complete before earlier writes still sitting in the write buffer)
Reducing hit time:
6. Avoiding address translation during cache indexing

Problems with memory
- DRAM is too expensive to buy many gigabytes.
- We need our programs to work even if they require more memory than we have.
  - A program that works on a machine with 512 MB should still work on a machine with 256 MB.
- Most systems run multiple programs.

Solutions
- Leave the problem up to the programmer: assume the programmer knows the exact memory configuration.
- Overlays: the compiler identifies mutually exclusive regions of the program.
- Virtual memory: use hardware and software to automatically translate references from virtual addresses (what the programmer sees) to physical addresses (an index into DRAM or disk).

Benefits of virtual memory
- User programs run in a standardized virtual address space.
- The CPU issues virtual addresses; address translation hardware, managed by the operating system (OS), maps each virtual address to a physical address before memory is accessed.
- This hardware support enables "modern" OS features: protection, translation, and sharing.

Managing virtual memory
- Effectively treat main memory as a cache for disk.
  - Blocks are called pages.
  - Misses are called page faults.
- A virtual address consists of a virtual page number and a page offset:
  | Virtual page number | Page offset |

Page tables encode virtual address spaces
- A virtual address space is divided into blocks of memory called pages; a machine usually supports pages of a few sizes (e.g., the MIPS R4000).
- A valid page table entry holds the physical memory "frame" address for the page.
- A page table is indexed by the virtual address; the original slides depict virtual pages mapping to frames in the physical memory space.

Details of Page Table
- The page table maps virtual page numbers to physical frames ("PTE" = page table entry).
- Virtual memory treats main memory as a cache for disk.
- Translation: the virtual page number indexes into the page table, which is located in physical memory starting at the page table base register; the PTE supplies the valid bit, access rights, and physical frame number, and the 12-bit page offset is carried over unchanged to form the physical address.

Paging the page table (two-level page tables)
- A table for 4 KB pages in a 32-bit address space has 1M entries, and each process needs its own address space.
- Solution: a two-level page table. The 32-bit virtual address is split into a P1 index, a P2 index, and a page offset.
- The top-level table is wired in main memory; only a subset of the 1024 second-level tables is in main memory, while the rest are on disk or unallocated.
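
For concreteness, here is a sketch of the index extraction for the classic 32-bit, 4 KB-page, two-level layout (10 + 10 + 12 bits); the exact bit split is an assumption, since the slide does not spell out the field widths:

    # Split a 32-bit virtual address into two-level page-table indices.
    def split_va(va):
        p1 = (va >> 22) & 0x3FF      # top 10 bits: index into the top-level table
        p2 = (va >> 12) & 0x3FF      # next 10 bits: index into a second-level table
        offset = va & 0xFFF          # low 12 bits: offset within the 4 KB page
        return p1, p2, offset

    print(split_va(0x12345678))      # -> (72, 837, 1656)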

VM and Disk: Page replacement policy
- Each page table entry has a used bit (set to 1 on any reference) and a dirty bit (set when the page is written).
- A clock-style sweep approximates LRU over the set of pages in memory:
  - The tail pointer clears the used bit in the page table as it passes each page.
  - The head pointer places pages on the free list if their used bit is still clear; pages with the dirty bit set are scheduled to be written to disk first.
- The architect's role: support setting the dirty and used bits.
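
To make the idea concrete, here is a sketch of the single-hand ("second chance") variant of this scheme; the slide's version uses two pointers, but the used/dirty-bit logic is the same, and all names are illustrative:

    # Second-chance (clock) replacement: pick a victim frame to free.
    def pick_victim(frames, hand):
        """frames: list of dicts with 'used' and 'dirty' bits; hand: current position."""
        while True:
            f = frames[hand]
            if f["used"]:
                f["used"] = 0                     # give the page a second chance
            else:
                if f["dirty"]:
                    print("schedule write-back to disk")   # dirty victim goes to disk first
                return hand, (hand + 1) % len(frames)      # victim index, new hand position
            hand = (hand + 1) % len(frames)

    frames = [{"used": 1, "dirty": 0}, {"used": 0, "dirty": 1}, {"used": 1, "dirty": 0}]
    print(pick_victim(frames, 0))   # -> (1, 2): frame 1 is the victim (and it is dirty)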

Virtual memory performance
- Address translation requires a physical memory access to read the page table.
- We must then access physical memory again to actually get the data.
  - Each load performs at least 2 memory reads.
  - Each store performs at least 1 memory read followed by a write.

Improving virtual memory performance
- Use a cache for common translations: the translation lookaside buffer (TLB).
- Each TLB entry holds a valid bit, a tag (the virtual page number), and the corresponding physical page number; the page offset bypasses translation.
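
A toy sketch (illustrative only; page size, names, and the page-table representation are assumptions) of a TLB lookup sitting in front of the page table:

    # Toy TLB: map virtual page number -> physical frame number.
    PAGE_BITS = 12                       # assume 4 KB pages
    tlb = {}                             # vpn -> pfn (a real TLB is small and associative)

    def translate(va, page_table):
        vpn, offset = va >> PAGE_BITS, va & ((1 << PAGE_BITS) - 1)
        if vpn not in tlb:               # TLB miss: walk the page table (extra memory access)
            tlb[vpn] = page_table[vpn]
        return (tlb[vpn] << PAGE_BITS) | offset

    page_table = {0x01: 0x80, 0x02: 0x2A}            # hypothetical mappings
    print(hex(translate(0x1ABC, page_table)))        # -> 0x80abc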

Caches and virtual memory
We now have two different addresses, virtual and physical. Which should we use to access the cache?
- Physical address. Pros: simpler to manage. Cons: slower access, since translation must happen first.
- Virtual address. Pros: faster access. Cons: aliasing and difficult management.
- Use both: virtually indexed, physically tagged caches.

Three Advantages of Virtual Memory
- Translation:
  - Programs can be given a consistent view of memory, even though physical memory is scrambled.
  - Makes multithreading reasonable (now used a lot!).
  - Only the most important part of a program (its "working set") must be in physical memory.
  - Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later.
- Protection:
  - Different threads (or processes) are protected from each other.
  - Different pages can be given special behavior (read only, invisible to user programs, etc.).
  - Kernel data is protected from user programs; very important for protection from malicious programs.
- Sharing:
  - The same physical page can be mapped to multiple users ("shared memory").
  - Allows programs to share the same physical memory without knowing what else is there.
  - Makes memory appear larger than it actually is.

Average memory access time
AMAT = (hit time) + (miss rate) x (miss penalty)
Given the following:
- Cache: 1-cycle access time
- Memory: 100-cycle access time
- Disk: 10,000-cycle access time
What is the average memory access time if the cache hit rate is 90% and the memory hit rate is 80%?
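
Applying the formula level by level gives one way to work this out (the numeric answer is my own working under the usual interpretation that the cache's miss penalty is the full AMAT of memory, including disk):

    # Two-level AMAT: cache backed by memory, memory backed by disk (times in cycles).
    cache_time, mem_time, disk_time = 1, 100, 10_000
    cache_hit, mem_hit = 0.90, 0.80

    mem_amat = mem_time + (1 - mem_hit) * disk_time    # 100 + 0.2 * 10000 = 2100
    amat = cache_time + (1 - cache_hit) * mem_amat     # 1 + 0.1 * 2100 = 211
    print(amat)                                        # 211.0 cycles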