Computer Architecture Virtual Memory (VM)


1 Computer Architecture Virtual Memory (VM)
By Yoav Etsion & Dan Tsafrir Presentation based on slides by Lihu Rappoport

2 DRAM (dynamic random-access memory)
Corsair 1333 MHz DDR3 Laptop Memory Price (at amazon.com): $43 for 4 GB $79 for 8 GB “The physical memory”

3 How processes co-exist in memory?
(Figure: processes A, B, C laid out in memory between Address 0 and Address N) What happens if a process tries to access a memory address that belongs to another process? What happens if processes need more memory? What happens if more processes are spawned?

4 VM –motivation (primary)
Provides isolation between processes Processes can concurrently run on a single machine VM prevents them from accessing the memory of one another (But still allows for convenient sharing when required) Provides illusion of large memory for each process VM size can be bigger than physical memory size VM decouples program from real size (can differ across machines) Provides illusion of contiguous memory Programmers need not worry about where data is placed exactly

5 VM – motivation (secondary)
Allows for dynamic memory growth Can add memory to processes at runtime as needed Allows for memory overcommitment Sum of VM spaces (across all processes) can be >= physical memory DRAM is often one of the most costly parts in the system Virtual memory enables flexible, secure memory management

6 VM – terminology Virtual address space Space used by the programmer
“Ideal” = contiguous & as big as we would like Physical address The real, underlying physical memory address Completely abstracted away by OS/HW Fragmentation External: free, unused space between allocation units Internal: free, unused space inside an allocation unit

7 VM – basic idea Divide memory (virtual & physical) into fixed size blocks “page” = chunk of contagious data in virtual space “frame” = physical memory exactly enough to hold one page sizeof(page) = sizeof(frame) page size = power of 2 = 2k (bytes) By default, k=12 almost always => page size is commonly 4KB Larger pages are used in some systems Virtual address space is contiguous …but pages can be mapped into arbitrary frames

8 VM – basic idea Pages can be mapped to memory or disk
Mapping pages to disk enables overcommitment of memory Each process has its own virtual address space Programmers only concerned with virtual address space Hardware translates virtual-to-physical address on-the-fly Use a page table to translate between virtual and physical addresses Processes cannot access each other’s memory Processes can only access virtual addresses Cannot access another process’s frames

9 VM – simplistic illustration
(Figure: address translation maps pages in the virtual space to frames in DRAM or to locations on disk) Memory acts as a cache for the secondary storage (disk) Immediate advantages Illusion of contiguity & of having more physical memory Program’s actual location unimportant Dynamic growth, isolation, & sharing are easy to obtain

10 Translation – use a “page table”
virtual address (64bit): bits 63..12 = virtual page number (52bit), bits 11..0 = page offset (12bit) How to map? physical address (32bit) = physical frame number (20bit) + page offset (12bit) (page size is typically 2^12 bytes = 4KB)

11 Translation – use a “page table”
Page table entry fields: V (valid bit), D (dirty bit), AC (access control), frameNumber A page table base register points to the table in memory (page size is typically 2^12 bytes = 4KB)

12 Translation – use a “page table”
Combined view: the virtual page number (bits 63..12 of the 64bit virtual address) indexes the page table (located via the page table base register); the selected entry’s physical frame number (20bit) is concatenated with the page offset (12bit) to form the 32bit physical address (page size is typically 2^12 bytes = 4KB)

13 Translation – use a “page table”
V D AC frameNumber “PTE” (page table entry)

14 Page tables
Each page table entry points to a memory frame or to a disk address (Figure: virtual page numbers index the page table; entries with Valid=1 point into physical memory, entries with Valid=0 point to disk)

15 Checks If ( valid == 1 ) page is in main memory at the frame address stored in the table  Data is readily available (e.g., can copy it to the cache) else /* page fault */ need to fetch the page from disk  causes a trap, usually accompanied by a context switch: current process suspended while the page is fetched from disk Access Control R=read-only, R/W=read/write, X=execute If ( access type incompatible with specified access rights )  protection violation fault  traps to fault-handler Demand paging Pages fetched from secondary memory only upon the first fault Rather than, e.g., upon file open
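The valid-bit check above can be sketched as code. This is an illustrative model, not a real MMU: the page table is a hypothetical dict mapping VPN to a (valid, frame) pair.

```python
# Sketch of the translation check: a PTE with a valid bit and a frame
# number; an invalid entry triggers a page fault (here, an exception;
# in reality, a trap to the OS, which fetches the page and retries).
PAGE_SHIFT = 12

class PageFault(Exception):
    pass

def translate(page_table, va):
    """Translate a virtual address using a dict {vpn: (valid, frame)}."""
    vpn = va >> PAGE_SHIFT
    offset = va & ((1 << PAGE_SHIFT) - 1)
    valid, frame = page_table.get(vpn, (0, None))
    if not valid:
        raise PageFault(vpn)   # demand paging: OS would fetch from disk
    return (frame << PAGE_SHIFT) | offset

pt = {0x5: (1, 0x9A)}  # virtual page 5 is resident in frame 0x9A
```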

16 Context switch Each process has its own address space
Akin to saying “each process has its own page table” OS allocates frames for a process => updates the process's page table If only one PTE points to a frame throughout the system Only the associated process can access the corresponding frame Shared memory Two PTEs of two processes point to the same frame Upon context switching Save current architectural state to memory: Architectural registers, including Register that holds the page table base address in memory Load the new architectural state from memory Architectural registers

17 Page replacement Page replacement policy
Decides which page to evict to disk LRU (least recently used) Too wasteful (time stamp pages upon each memory reference) FIFO (first in first out) Simplest: no need to update upon references, but ignores usage Clock (second-chance) Maintain circular list of pages resident in memory Set per-page “was it referenced?” bit (usually done in HW) Search clockwise for the first page with bit=0; set bit=0 for pages that have bit=1 Swap out the first page with bit = 0
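The clock search described above can be sketched in a few lines (a simplified model; real kernels also handle dirty pages, pinning, etc.):

```python
# Clock (second-chance) eviction: pages sit in a circular list; the hand
# skips (and clears) referenced pages and evicts the first page whose
# referenced bit is 0.
def clock_evict(pages, refbits, hand):
    """pages: list of page ids; refbits: parallel list of 0/1 bits;
    hand: current clock position. Returns (victim, new hand position)."""
    n = len(pages)
    while True:
        if refbits[hand] == 0:
            victim = pages[hand]
            return victim, (hand + 1) % n
        refbits[hand] = 0          # give the page a second chance
        hand = (hand + 1) % n
```

For example, with pages a, b, c where a and b were referenced, the hand clears both bits and evicts c.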

18 Page replacement – cont.
NRU (not recently used) More sophisticated LRU approximation HW or SW maintains per-page ‘referenced’ & ‘modified’ bits Periodically (clock interrupt), SW turns ‘referenced’ off Replacement algorithm partitions pages to Class 0: not referenced, not modified Class 1: not referenced, modified Class 2: referenced, not modified Class 3: referenced, modified Choose at random a page from the lowest class for removal Underlying principles (order is important): Prefer keeping referenced over unreferenced Prefer keeping modified over unmodified Can a page be modified but not referenced?
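The NRU class partition above reduces to class = 2·referenced + modified; a minimal sketch (the dict-based page table is our own illustration):

```python
# NRU: partition pages into classes 0..3 by (referenced, modified) and
# evict a random page from the lowest non-empty class.
import random

def nru_class(referenced, modified):
    return 2 * int(referenced) + int(modified)

def nru_victim(pages):
    """pages: dict {page: (referenced, modified)}.
    Pick a random page from the lowest class present."""
    lowest = min(nru_class(r, m) for r, m in pages.values())
    candidates = [p for p, (r, m) in pages.items()
                  if nru_class(r, m) == lowest]
    return random.choice(candidates)
```

Note a page can indeed be modified but not referenced (class 1): the clock interrupt periodically clears the referenced bit while the modified bit stays set.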

19 Page replacement – advanced
ARC (adaptive replacement cache) Factors in not only recency (when was the latest access) but also frequency (how many times accessed) User determines which factor has more weight Better (but more wasteful) than LRU Developed at IBM by Nimrod Megiddo & Dharmendra Modha CAR (clock with adaptive replacement) Similar to ARC, and comparable in performance But, unlike ARC, doesn’t require user-specified parameters Likewise developed at IBM, by Sorav Bansal & Dharmendra Modha

20 Page faults Page faults: the data is not in memory  retrieve it from disk CPU detects the situation (valid=0) …but it cannot remedy the situation: it doesn’t know about the disk; that’s the OS’s job Thus, the CPU must interrupt the OS OS loads the page from disk Possibly writing a victim page to disk (if no room & if dirty) Possibly avoids reading from disk due to the OS “buffer cache” OS updates the page table (valid=1) OS resumes the process; now, HW will retry & succeed!

21 Page faults Page fault incurs a significant penalty
For files (binaries; memory mapped pages), distinguish among two types of page faults “Major” page fault = must go get page from disk “Minor” page fault = page already resides in OS buffer cache For stack/heap, must go to backing store Also known as swap area Disk file/partition where OS stores pages when evicted from memory

22 Page size Smaller page size (typically 4KB)
PROS: minimizes internal fragmentation CONS: increases the size of the page table Bigger page size (called “superpages” if > 4K) PROS: Amortizes disk access cost May prefetch useful data May discard useless data early CONS: Increased internal fragmentation Might transfer unnecessary info at the expense of useful info Lots of work to increase page size beyond 4K HW has supported it for years; the OS is the “bottleneck” Attractive because: bigger DRAMs, increasing memory/disk performance gap

23 TLB (translation lookaside buffer)
Page table resides in memory Each translation requires accessing memory Typically more than once Might be required for each load/store! TLB Caches recently used PTEs to speed up translation Typically 64 to 256 entries Usually 2 to 8 way associative TLB access time is faster than L1 cache access time (Flowchart: virtual address → TLB access → on a TLB hit, physical address; on a miss, access the page table)
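A toy model of the TLB as a small cache of recent translations (our own sketch: fully associative with FIFO eviction, whereas real TLBs are set-associative with hardware replacement):

```python
# TLB sketch: a small vpn -> frame cache with FIFO eviction and a
# flush() operation as used on a context switch (absent VPID support).
from collections import OrderedDict

class TLB:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = OrderedDict()   # vpn -> frame, oldest first

    def lookup(self, vpn):
        """Return the frame on a TLB hit, or None on a miss."""
        return self.entries.get(vpn)

    def insert(self, vpn, frame):
        """Fill after a page-table walk; evict the oldest entry if full."""
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        self.entries[vpn] = frame

    def flush(self):
        """Drop all translations, e.g. on a context switch."""
        self.entries.clear()
```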

24 Making Address Translation Fast
TLB is a cache for recent address translations (Figure: the TLB holds valid tag + physical page pairs for recently translated virtual page numbers; the full page table maps every valid page to a physical page or a disk address)

25 TLB Access (Figure: the virtual page number is split into tag and set; the set bits index a 2-way TLB, the stored tags of both ways are compared in parallel, and a way MUX selects the matching PTE on a hit/miss)

26 VM & cache (Flowchart: virtual address → TLB access; on a TLB miss, access the page table in memory; the resulting physical address accesses the L1 cache, then the L2 cache and memory on misses) TLB access is serial with cache access => performance is crucial! Page table entries can be cached in the L2 cache (as data)

27 Context switch Flush TLB upon context switch
Since the same virtual addresses are routinely reused Recent Intel processors add a VPID field to the TLB VPID = Virtual PID Eliminates the need to flush the TLB on every switch Akin to extending the page number with the PID

28 Overlapped TLB & cache access
VM view of a Physical Address: bits 29..12 = Physical Page Number, bits 11..0 = Page offset Cache view of a Physical Address: bits 29..14 = tag, bits 13..6 = set, bits 5..0 = disp The #Set is not contained within the Page Offset The #Set is not known until the physical page number is known The cache can be accessed only after address translation is done

29 Overlapped TLB & cache access
Virtual Memory view of a Physical Address: bits 29..12 = Physical Page Number, bits 11..0 = Page offset Cache view of a Physical Address: bits 29..12 = tag, bits 11..6 = set, bits 5..0 = disp In the above example the #Set is contained within the Page Offset The #Set is known immediately The cache can be accessed in parallel with address translation Once translation is done, match the upper bits with the tags Limitation: Cache size ≤ (page size × associativity)
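The overlap condition above can be checked arithmetically, assuming 4KB pages: all set-index bits must fall within the page offset, which is equivalent to cache_size ≤ page_size × ways.

```python
# Can the cache be indexed in parallel with TLB translation?
# True iff line-offset bits + set-index bits fit in the page offset.
def can_overlap(cache_size, ways, line_size, page_size=4096):
    sets = cache_size // (ways * line_size)
    set_bits = sets.bit_length() - 1          # log2(#sets)
    line_bits = line_size.bit_length() - 1    # log2(line size)
    offset_bits = page_size.bit_length() - 1  # log2(page size) = 12
    return line_bits + set_bits <= offset_bits

# 8KB, 2-way, 64B lines: 64 sets -> 6 + 6 = 12 bits -> overlap OK
# 32KB, 2-way, 64B lines: 256 sets -> 6 + 8 = 14 > 12 -> no overlap
```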

30 Overlapped TLB & cache access
(Figure: the set and disp bits of the page offset index the cache while the TLB translates the virtual page number in parallel; the resulting physical page number is compared against the cache tags of all ways, and a way MUX selects the data on a hit)

31 Virtually-indexed, Physically-tagged
Assume the cache is 32K Byte, 2 way set-associative, 64 bytes/line (2^15 bytes / 2 ways) / (2^6 bytes/line) = 2^8 = 256 sets In order to still allow overlap between set access and TLB access Take the upper two bits of the set number from bits [1:0] of the VPN Physical_addr[13:12] may differ from virtual_addr[13:12] Tag is comprised of bits [31:12] of the physical address The tag may mismatch bits [13:12] of the physical address Cache miss  allocate the missing line according to its virtual set address and physical tag (Figure: physical address = Physical Page Number [29:12] + Page offset [11:0]; cache view = tag [29:14], set [13:6] with the top two set bits taken from VPN[1:0], disp [5:0])
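The numbers in this example can be recomputed directly (a sketch of the arithmetic, not of any hardware):

```python
# Worked numbers for the 32KB, 2-way, 64B/line cache above: 256 sets
# need 8 index bits [13:6], but only bits [11:6] come from the page
# offset; the top 2 index bits [13:12] must come from VPN[1:0] of the
# virtual address and may differ from the physical bits.
cache_size, ways, line = 32 * 1024, 2, 64
sets = (cache_size // ways) // line        # bytes per way / line size
index_bits = sets.bit_length() - 1         # log2(#sets)
offset_index_bits = 12 - 6                 # set bits inside page offset
virtual_index_bits = index_bits - offset_index_bits  # taken from VPN
```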

32 Virtually-indexed, Physically-tagged
Need to allocate the missing line if the tag comparison fails (miss) Problem: another copy of the block may reside in the cache Aliasing: >=2 virtual addresses mapped to the same physical address Sometimes referred to as “synonyms” Issues with aliasing: Reduces cache utilization (multiple copies of a single line) Must update all copies of a line on each write Complicates coherency protocols (find & update multiple copies)

33 Virtually-tagged cache
Cache tags directly derived from virtual addresses TLB not in the path to a cache hit! but… Aliasing problem even more acute Cache must be flushed at task switch Possible solution: include a unique process ID (PID) in the tag (like the VPID we discussed earlier) Rarely used nowadays… (Figure: the CPU sends the VA to the virtually-tagged cache; only on a miss is the address translated to a PA for main memory)

34 32bit x86 Regular paging

35 Hierarchical translation
x86 supports 4KB & 4MB pages Q: why would we want a 4MB page (called a “super-page”)? A: the TLB is small, yet crucial to performance, being accessed on every memory reference Page directory Each process has its own page-directory (conversely, threads share) CR3 points to the p-d of the current process Holds 1024 PDEs (page-directory entries), each is 4 bytes = 32 bits Each PDE contains a PS (“page size”) flag PS=1: PDE points directly to a 4MB (super)page PS=0: PDE points to a “page table” whose entries point to 4KB pages Page table Holds 1024 PTEs (page-table entries), each is 32 bits Each PTE points to a 4KB page in physical memory

36 Mapping only 4KB pages (typical)
2-level hierarchy All pages are 4KB aligned Total of 2^20 (=1M) 4KB pages = 4GB DIR (10 bits) Points to a PDE in the page directory (We assume all PDEs have PS=0) => Each PDE provides the 20-bit, 4KB-aligned base physical address of a 4KB page table TABLE (10 bits) Points to a PTE in the page table The PTE provides the 20-bit, 4KB-aligned base physical address of a 4KB page OFFSET (12 bits) Offset within the selected 4KB page (Figure: 32bit linear address = DIR [31:22] + TABLE [21:12] + OFFSET [11:0]; CR3 (PDBR) points to the 4KB page directory of 1K PDEs, a PDE points to a 4KB page table of 1K PTEs, and a PTE points to the 4KB data page)
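The 10 + 10 + 12 bit decomposition above can be sketched directly (illustrative Python, with a made-up example address):

```python
# Decompose a 32-bit linear address into the DIR / TABLE / OFFSET
# fields of x86 2-level paging (10 + 10 + 12 bits).
def split_linear_32(la):
    dir_idx = (la >> 22) & 0x3FF     # bits 31:22 -> index into page directory
    table_idx = (la >> 12) & 0x3FF   # bits 21:12 -> index into page table
    offset = la & 0xFFF              # bits 11:0  -> byte within the 4KB page
    return dir_idx, table_idx, offset

# e.g. 0xC0123ABC -> DIR = 0x300, TABLE = 0x123, OFFSET = 0xABC
fields = split_linear_32(0xC0123ABC)
```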

37 Mapping only 4MB pages 1-level hierarchy Assume all PDEs have PS=1
Pages are 4MB aligned Total of 2^10 (=1K) 4MB pages = 4GB DIR (10 bits): Points to a PDE in the page directory => Each PDE provides the 10-bit, 4MB-aligned base physical address of a 4MB page TABLE bits: None! (moved to offset) OFFSET (22 bits) Offset within the selected 4MB page Fine print Must set the PSE flag in CR4 for 4MB support to work Otherwise, PS=1 flag settings are ignored (Figure: 32bit linear address = DIR [31:22] + OFFSET [21:0]; CR3 (PDBR) points to the 4KB page directory of 1K PDEs, and a PDE points directly to the 4MB data page)

38 Mixing 4KB & 4MB pages Works “out of the box” When CR4.PSE=1
Alignment constraints: 4MB for superpages, 4KB for regular pages TLB issues? No, as the CPU maintains 4MB and 4KB PTEs in separate TLBs Benefits Superpages often used for frequently-used kernel code Frees up 4KB TLB entries Reduces TLB misses => improves overall system performance

39 PDE & PTE format
(Figure: bits 31:12 = Page Frame Address, a 20-bit, 4K-aligned physical pointer; bits 11:9 = AVAIL, available for OS use; low bits = flags: D (Dirty, PTE), PS (Page Size, 0: 4 Kbyte; PDE), A (Accessed), PCD (Cache Disable), PWT (Write-Through), U (User), W (Writable), P (Present); remaining bits reserved for future use, should be zero) 12 bits of flags, used: For virtual memory: Present, Accessed, Dirty, Page Size For protection: Read/Write, User/privileged For caching policy: WB/WT, Cache Disable/enable 3 bits for OS usage

40 4KB-page PTE format
(Figure: bits 31:12 = Page Base Address; bits 11:9 = AVAIL, available for OS use; flags: G (Global Page), PAT (Page Attribute Table index), D (Dirty), A (Accessed), PCD (Cache Disable), PWT (Write-Through), U/S (User/Supervisor), R/W (Writable), P (Present)) Global pages are not flushed when the TLB is flushed; can be used for kernel code PAT extends the functions of the PCD and PWT bits in page tables to allow all five of the memory types that can be assigned with the MTRRs (plus one additional memory type) to be assigned dynamically to pages of the linear address space: for example WC (write-combine mode), or strengthening memory ordering We do not teach PAT in this course

41 4KB-page PDE format
(Figure: bits 31:12 = Page Table Base Address; bits 11:9 = AVAIL, available for OS use; flags: G (Global Page, ignored), PS (Page Size, 0 indicates 4 Kbytes), A (Accessed), PCD (Cache Disable), PWT (Write-Through), U/S (User/Supervisor), R/W (Writable), P (Present))

42 4MB-page PDE format
(Figure: bits 31:22 = Page Base Address; bits 21:13 reserved; bit 12 = PAT (Page Table Attribute Index); bits 11:9 = AVAIL, available for OS use; flags: G (Global Page, ignored), PS (Page Size, 1 indicates 4 Mbytes), D (Dirty), A (Accessed), PCD (Cache Disable), PWT (Write-Through), U/S (User/Supervisor), R/W (Writable), P (Present))

43 VM attributes: present flag (P)
Set => page is in physical memory Translation is carried out by the MMU (memory management unit) Clear => page is not in physical memory When encountered by the MMU => generates a page-fault exception The faulting address is available to the SW exception handler The MMU does not set/clear this flag (only reads it) It’s up to the OS Upon a page-fault exception => the OS typically does the following: Copy the page from disk to memory (unless already in the buffer cache) Update the PTE/PDE with the page’s RAM address P = 1; dirty = accessed = 0; etc. Invalidate the associated PTE in the TLB Resume the program on the faulting instruction

44 VM attributes: page size flag (PS)
In PDEs only Determines the page size Clear => page size = 4KB (& PDE points to a page table) Set => page size = 4MB (& PDE points to superpage)

45 VM attributes: accessed (A) & dirty (D)
MMU sets A-flag Upon first time a page (or page-table) is accessed (load or store) MMU sets D-flag Upon first time a page (or PT) is accessed (store only) A & D are sticky Once set, MMU (=HW) never clears them Only SW does OS clears them When initially loading PTE Possibly from time to time as part of LRU approximation (used to decide which pages to swap out and which to keep)

46 VM attributes: global flag (G)
Has effect only when PGE=1 in CR4 When set, indicates page is “global” Not flushed from TLB when CR3 loaded Ignored for PDEs with PS=0 (that point to page tables) Used to improve performance Keeps important pages of OS in TLB across context switches Only software can set or clear this flag

47 Cache attributes: PWT PWT Means “page-level write-through”
Controls write-through / write-back caching policy of the page / PT 1: enable write-through caching 0: disable write-through => enable write-back caching Ignored if the CD (“cache disable”) flag is set in CR0, or if the associated PCD is on

48 Cache attributes: PCD PCD Means “page-level cache disable” flag
Controls caching of individual pages / PTs 1: caching associated page/PT is prevented 0: caching allowed Used When caching doesn’t help performance (e.g., streaming) Memory mapped I/O ports to communicate with devices Assumed as set (regardless of actual value) If the CD (“cache disable”) flag in CR0 is set

49 Cache attributes: PAT PAT Means “page attribute table index” flag
If on, used along with PCD & PWT flags to select an entry in the PAT (for that page) Which in turn selects the memory type for the page PAT is a 64bit register (Not going into the details)

50 Protection attributes : R/W & U/S
Read/write (R/W) flag Specifies read-write privileges for a page (if PTE) or group of pages (if PDE) 0 = read only 1 = read & write User/supervisor (U/S) flag Specifies privileges for a page (PTE) or group of pages (PDE) (in case of a PDE that points to a page table) 0 = supervisor privilege level 1 = user privilege level A user accessing a supervisor page will trigger an interrupt Handled by the OS, which might, e.g., terminate the program

51 Misc issues Memory aliasing/sharing
When two (or more) PDEs point to a common PTE When two (or more) PTEs point to a common page But SW must maintain consistency of the accessed & dirty bits in these PDEs & PTEs Base address of page-directory Physical address of the current p-d is stored in CR3 Also called the page-directory-base-register (PDBR) PDBR typically reloaded upon task switches Page directory must remain in-memory as long as the task is active

52 32bit x86 EXTENDED paging

53 PAE – Physical Address Extension
A 32bit address imposes a limit Means we can use physical memory <= 2^32 = 4GB Too small for many systems PAE (physical address extension) support Allows access to 2^36 B of physical RAM (= 64 GB) But not directly: the program address remains 32bit It’s just that the OS can now utilize more physical memory Only applicable when paging is enabled And when PAE is also turned on in CR4 Supports 4KB and 2MB pages (rather than 4MB)

54 PAE – Physical Address Extension
Relies on an additional Page Directory Pointer Table Positioned “above” the page directory (in translation hierarchy) Has 4 entries of 64-bits each to support up to 4 page directories PTEs are increased to 64 bits to accommodate 36-bit base physical addresses Each 4KB page directory and page table can thus have up to 512 entries CR3 contains the page-directory-pointer-table base address

55 4KB Page Mapping with PAE
Linear address divided into Page-directory-pointer-table entry Indexed by bits 31:30 of the linear addr. Provides an offset to one of the 4 entries in the page-directory-pointer table The selected entry provides the base physical address of a page directory Dir (9 bits) – points to a PDE in the Page Directory PS in the PDE = 0  the PDE provides a (36-12=) 24 bit, 4KB aligned base physical address of a page table Table (9 bits) – points to a PTE in the Page Table The PTE provides a 24 bit, 4KB aligned base physical address of a 4KB page Offset (12 bits) – offset within the selected 4KB page (Figure: 32bit linear address = Dir ptr [31:30] + DIR [29:21] + TABLE [20:12] + OFFSET [11:0]; CR3 (PDPTR, 32B aligned) points to the 4-entry Page Directory Pointer Table, whose entry points to a 512-entry Page Directory, whose PDE points to a 512-entry Page Table, whose PTE points to the 4KByte data page)

56 2MB Page Mapping with PAE
Linear address divided into Page-directory-pointer-table entry Indexed by bits 31:30 of the linear addr. Provides an offset to one of the 4 entries in the page-directory-pointer table The selected entry provides the base physical address of a page directory Dir (9 bits) – points to a PDE in the Page Directory PS in the PDE = 1  the PDE provides a (36-21=) 15 bit, 2MB aligned base physical address of a 2MB page Offset (21 bits) – offset within the selected 2MB page (Figure: 32bit linear address = Dir ptr [31:30] + DIR [29:21] + OFFSET [20:0]; CR3 (PDPTR, 32B aligned) points to the 4-entry Page Directory Pointer Table, whose entry points to a 512-entry Page Directory, whose PDE points directly to the 2MByte data page)

57 PTE/PDE/PDP Entry Format with PAE
Relative to before PAE… The major differences in these entries are as follows: A page-directory-pointer-table entry is added The size of the entries is increased from 32 bits to 64 bits The maximum number of entries in a page directory or page table is 512 The base physical address field in each entry is extended to 24 bits

58 Paging in 64 bit Mode Paging structures expanded to
Potentially supports mapping a 64-bit linear address to a 52-bit physical address The current implementation supports mapping a 48-bit linear address into a 40-bit physical address A 4th page mapping table is added: the page map level 4 table (PML4) The base physical address of the PML4 is stored in CR3 A PML4 entry contains the base physical address of a page directory pointer table The page directory pointer table is expanded to 512 eight-byte entries Indexed by 9 bits of the linear address The size of the PDE/PTE tables remains 512 eight-byte entries, each indexed by nine linear-address bits The total of linear-address index bits becomes 48 The PS flag in PDEs selects between 4-KByte and 2-MByte page sizes The CR4.PSE bit is ignored
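The 9+9+9+9+12 split described above can be sketched the same way as the 32-bit case (illustrative Python):

```python
# Split a 48-bit linear address into the four 9-bit table indices plus
# the 12-bit offset used by 4-level (PML4) paging with 4KB pages.
def split_linear_48(la):
    offset = la & 0xFFF            # bits 11:0  -> byte within 4KB page
    table  = (la >> 12) & 0x1FF    # bits 20:12 -> page table (PTE) index
    dir_   = (la >> 21) & 0x1FF    # bits 29:21 -> page directory (PDE) index
    pdp    = (la >> 30) & 0x1FF    # bits 38:30 -> PDP entry index
    pml4   = (la >> 39) & 0x1FF    # bits 47:39 -> PML4 entry index
    return pml4, pdp, dir_, table, offset

# Four 9-bit indices plus the 12-bit offset cover all 48 address bits:
assert 9 + 9 + 9 + 9 + 12 == 48
```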

59 4KB Page Mapping in 64 bit Mode
(Figure: 64bit linear address = sign ext. [63:48] + PML4 [47:39] + PDP [38:30] + DIR [29:21] + TABLE [20:12] + OFFSET [11:0]; CR3 (4KB aligned) points to the 512-entry PML4 Table, whose entry points to a 512-entry Page Directory Pointer Table, then a 512-entry Page Directory, then a 512-entry Page Table, whose PTE points to the 4KByte data page)

60 2MB Page Mapping in 64 bit Mode
(Figure: 64bit linear address = sign ext. [63:48] + PML4 [47:39] + PDP [38:30] + DIR [29:21] + OFFSET [20:0]; CR3 (4KB aligned) points to the 512-entry PML4 Table, whose entry points to a 512-entry Page Directory Pointer Table, then a 512-entry Page Directory, whose PDE points directly to the 2MByte data page)

61 PTE/PDE/PDP/PML4 Entry Format – 4KB Pages

62 TLBs The processor saves most recently used PDEs and PTEs in TLBs
Separate TLBs for data and instruction caches Separate TLBs for 4-KByte and 2/4-MByte page sizes An OS running at privilege level 0 can invalidate TLB entries The INVLPG instruction invalidates a specific PTE in the TLB This instruction ignores the setting of the G flag Whenever a PDE/PTE is changed (including when the present flag is set to zero), the OS must invalidate the corresponding TLB entry All (non-global) TLB entries are automatically invalidated when CR3 is loaded The global (G) flag prevents frequently used pages from being automatically invalidated on a task switch The entry remains in the TLB indefinitely Only INVLPG can invalidate a global page entry

