Making Virtual Memory Real: The Linux x86-64 Way
Arka Basu
Mechanics of Virtual Memory
- Address mapping and translation (mostly H/W)
  - Page tables, TLB, page table walkers
- Virtual address allocation
  - Representing/managing virtual address spaces
  - User interface to OS virtual address allocation
- Physical memory allocation
  - Linux's buddy memory allocation
  - Page fault handling (on-demand paging)
- Updating/invalidating address mapping/permission
  - Interface to request update/invalidation
  - Mechanics of a TLB shootdown
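Typical virtual memory layout (highest address first)
- Kernel virtual address space
- 0x00007fffffffffff: top of user space (48-bit virtual addresses)
- Stack
- mmapped memory (dynamically allocated)
- Heap
- Static data
- Code
- 0x0: bottom of the address space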
Data structures representing VA space
- struct task_struct: represents a process
  - pid
  - status
  - ptr to VA space (a struct mm_struct)
  - list of open files
  - list of signals
- struct mm_struct: represents a virtual address space
  - ptr to PT root
  - start/end stack
  - start/end code
  - start/end mmap
  - vma_area ptr (the list of VMAs)
- VMAs or VM areas: represent chunks of allocated virtual address ranges
  - Starting VA
  - Ending VA
  - Flags/Prot: VM_READ, VM_WRITE, VM_SHARED, ...
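The relationship between these three structures can be sketched in C. This is a minimal illustration under stated assumptions, not the kernel's definitions: the struct and field names (`struct task`, `struct mm`, `struct vma`, `lookup_vma`) are hypothetical simplifications, and the VMAs are kept in a plain linked list, whereas the kernel uses a search tree for faster lookup. The flag values mirror the kernel's VM_* bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits for a VMA (values mirror the kernel's VM_* flags). */
#define VM_READ   0x1u
#define VM_WRITE  0x2u
#define VM_SHARED 0x8u

/* Hypothetical, simplified VMA: one allocated range [start, end). */
struct vma {
    uintptr_t   start;   /* starting VA (inclusive) */
    uintptr_t   end;     /* ending VA (exclusive) */
    unsigned    flags;   /* VM_READ | VM_WRITE | ... */
    struct vma *next;    /* next VMA in the address space */
};

/* Hypothetical, simplified address space descriptor. */
struct mm {
    void       *pt_root;                /* ptr to page table root */
    uintptr_t   start_code, end_code;   /* start/end of code */
    uintptr_t   start_stack;            /* start of stack */
    uintptr_t   mmap_base;              /* start of the mmap area */
    struct vma *vma_list;               /* list of VMAs */
};

/* Hypothetical, simplified process descriptor. */
struct task {
    int        pid;
    int        status;
    struct mm *mm;       /* ptr to VA space */
};

/* Return the VMA that contains addr, or NULL if addr is not allocated. */
const struct vma *lookup_vma(const struct mm *mm, uintptr_t addr)
{
    assert(mm != NULL);
    for (const struct vma *v = mm->vma_list; v != NULL; v = v->next) {
        if (addr >= v->start && addr < v->end)
            return v;
    }
    return NULL;
}
```

A lookup that returns NULL corresponds to an address the process never allocated; a non-NULL result gives the flags against which an access can be checked.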
Allocating memory, a.k.a. virtual addresses
- A user application or library requests VA allocation via system calls.
- void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
- length: mappings cover whole pages, so the length is effectively rounded up to a multiple of the page size (4KB); offset must be a multiple of the page size
- prot: PROT_NONE, PROT_READ, PROT_WRITE, ...
- flags: MAP_ANONYMOUS, MAP_SHARED, MAP_PRIVATE, MAP_FIXED, MAP_HUGE_2MB, MAP_HUGE_1GB
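As a minimal sketch of the call above, the following C helpers request anonymous, private, zero-filled memory from the OS. The helper names `round_up_to_page` and `alloc_anon` are hypothetical and exist only for this illustration; the sketch assumes a Linux system.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round len up to a whole number of pages. */
size_t round_up_to_page(size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return (len + page - 1) / page * page;
}

/* Map len bytes of private, anonymous, zero-filled memory.
 * Returns the starting VA, or NULL on failure. */
void *alloc_anon(size_t len)
{
    assert(len > 0);
    void *p = mmap(NULL,                        /* let the OS pick the VA */
                   len,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS,
                   -1,                          /* no file behind the mapping */
                   0);
    return p == MAP_FAILED ? NULL : p;
}
```

A caller might use `int *a = alloc_anon(round_up_to_page(5000));` and later release the range with `munmap(a, round_up_to_page(5000))`.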
mmap extends an existing VMA or adds a new one
[Figure: the task_struct -> mm_struct -> VMA list diagram from the earlier slide, now also showing a vma_cache entry]
Allocating memory, a.k.a. virtual addresses
- System call to extend the heap: void *sbrk(intptr_t increment);
- Heap: contiguous virtual addresses for dynamically allocated memory
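A short sketch of heap extension, assuming Linux with glibc: `sbrk` returns the previous end of the heap (the "program break"), which is also the start of the newly added region. The wrapper name `grow_heap` is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Grow the heap by n bytes.
 * Returns the start of the new region (the old program break),
 * or NULL if the break could not be moved. */
void *grow_heap(intptr_t n)
{
    assert(n >= 0);                 /* this sketch only grows the heap */
    void *old_break = sbrk(n);
    return old_break == (void *)-1 ? NULL : old_break;
}
```

Because the heap is one contiguous range, growing it only moves the end of the heap's VMA; no new VMA is needed. In practice, programs rarely call `sbrk` directly and rely on `malloc`, which may use it internally.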
sbrk updates the VMA for the heap
[Figure: the task_struct -> mm_struct -> VMA list diagram from the earlier slide]
Demand paging of physical memory
Event 1: int *a = mmap(NULL, 8192, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
- OS: creates/extends VMAs.
- OS: returns the VA to the user (the value of a).
Event 2: load from *a
- H/W: TLB miss.
- H/W: page walk.
- H/W: raises a page fault.
- OS: checks whether the VA of the load is valid by checking the VMAs.
- If not, the OS raises a seg fault to the app.
- If valid, the OS finds physical page frame(s) to map the faulting VA.
Representing physical memory
- Page descriptor (struct page)
  - One for each 4KB of physical memory
  - 32 bytes long (<1% overhead)
  - All descriptors are maintained in an array
- Important information contained in it:
  - Number of virtual pages mapping to it
  - Pointer back to the virtual pages mapping to it (reverse mapping)
  - Flags, e.g., whether the page frame is locked, free, etc.
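The overhead figure follows directly from the two sizes stated above. As a worked example, the 16 GiB machine below is an assumed configuration chosen only for illustration:

```latex
\[
\text{overhead} = \frac{32\ \text{B}}{4096\ \text{B}} = \frac{1}{128} \approx 0.78\%
\]
\[
\frac{16\ \text{GiB}}{4\ \text{KiB}} = 2^{22}\ \text{descriptors},
\qquad
2^{22} \times 32\ \text{B} = 2^{27}\ \text{B} = 128\ \text{MiB}
\]
```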
Managing free physical page frames
- The OS keeps a pool of free pages
  - The minimum number of free pages is heuristic-based but tunable
  - Swapping is triggered when free pages run low
- Free pages are kept in the "buddy allocator"
  - Goal: keep physical page frames contiguous
- Why do contiguous physical frames (addresses) matter?
The buddy allocator
- A list of free lists, each holding blocks of contiguous physical pages of one size (2^order x 4KB)
  - Order=0: 4KB
  - Order=1: 8KB
  - Order=2: 16KB
  - Order=3: 32KB
  - Order=4: 64KB
  - ...
  - Order=10: 4MB
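The size rule above, together with the standard way a buddy allocator locates a block's "buddy" (the neighboring block of the same order with which it can merge), can be sketched as follows. All names here (`FRAME_BYTES`, `TOP_ORDER`, `block_bytes`, `order_for`, `buddy_pfn`) are hypothetical; splitting and merging of free lists are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_BYTES 4096u   /* one physical page frame */
#define TOP_ORDER   10u     /* largest order in the list above */

/* Size in bytes of one block at the given order: 2^order x 4KB. */
size_t block_bytes(unsigned order)
{
    assert(order <= TOP_ORDER);
    return (size_t)FRAME_BYTES << order;
}

/* Smallest order whose block can hold `bytes`. */
unsigned order_for(size_t bytes)
{
    assert(bytes > 0 && bytes <= block_bytes(TOP_ORDER));
    unsigned order = 0;
    while (block_bytes(order) < bytes)
        order++;
    return order;
}

/* Page frame number of the buddy of the block that starts at pfn.
 * Blocks of a given order are aligned to their own size, so the buddy
 * differs from pfn only in bit number `order`. */
uint64_t buddy_pfn(uint64_t pfn, unsigned order)
{
    assert(order <= TOP_ORDER);
    assert((pfn & ((1ull << order) - 1)) == 0);   /* block must be aligned */
    return pfn ^ (1ull << order);
}
```

For example, a 20KB request rounds up to order 3 (32KB), and the order-2 block starting at frame 12 has its buddy at frame 8; if both are free, they can merge into one order-3 block starting at frame 8.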
Demand paging of physical memory
Event 1: int *a = mmap(NULL, 8192, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
- OS: creates/extends VMAs.
- OS: returns the VA to the user (the value of a).
Event 2: load from *a
- H/W: TLB miss.
- H/W: page walk.
- H/W: raises a page fault.
- OS: checks whether the VA of the load is valid by checking the VMAs.
- If not, the OS raises a seg fault to the app.
- If valid, the OS finds free physical page frame(s) by asking the buddy allocator.
- OS: updates the page table entry to map the faulting VA to the free page frame and returns from the fault.
Event 3: retry load from *a
- H/W: TLB miss; the H/W page walker loads the VA->PA translation into the TLB.
- Execution continues.
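The sequence above can be imitated with a toy model in C. This is a sketch under heavy simplifying assumptions, not kernel code: the address space has 16 virtual pages, a single VMA, a flat array standing in for the page table, and a counter standing in for the buddy allocator. Every name (`toy_mm`, `toy_load`, `handle_fault`, and so on) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12   /* 4KB pages */
#define NUM_VPAGES     16   /* toy address space: 16 virtual pages */
#define NUM_FRAMES     4    /* toy physical memory: 4 frames */
#define NOT_MAPPED     (-1)

enum access_result { ACCESS_OK, ACCESS_SEGFAULT, ACCESS_OOM };

struct toy_mm {
    size_t vma_start, vma_end;  /* one VMA: virtual pages [start, end) */
    int    pte[NUM_VPAGES];     /* virtual page -> frame, or NOT_MAPPED */
    int    next_free_frame;     /* stand-in for the free page allocator */
    int    faults;              /* number of page faults taken */
};

void toy_mm_init(struct toy_mm *mm, size_t vma_start, size_t vma_end)
{
    assert(vma_start <= vma_end && vma_end <= NUM_VPAGES);
    mm->vma_start = vma_start;
    mm->vma_end = vma_end;
    for (size_t i = 0; i < NUM_VPAGES; i++)
        mm->pte[i] = NOT_MAPPED;
    mm->next_free_frame = 0;
    mm->faults = 0;
}

/* OS side: validate the faulting page against the VMA, then map a frame. */
static enum access_result handle_fault(struct toy_mm *mm, size_t vpn)
{
    mm->faults++;
    if (vpn < mm->vma_start || vpn >= mm->vma_end)
        return ACCESS_SEGFAULT;            /* VA was never allocated */
    if (mm->next_free_frame >= NUM_FRAMES)
        return ACCESS_OOM;                 /* a real OS would swap here */
    mm->pte[vpn] = mm->next_free_frame++;  /* update the page table entry */
    return ACCESS_OK;
}

/* H/W side: translate va; on a missing mapping, fault and retry. */
enum access_result toy_load(struct toy_mm *mm, uintptr_t va, int *frame)
{
    size_t vpn = (size_t)(va >> TOY_PAGE_SHIFT);
    if (vpn >= NUM_VPAGES)
        return ACCESS_SEGFAULT;
    if (mm->pte[vpn] == NOT_MAPPED) {
        enum access_result r = handle_fault(mm, vpn);
        if (r != ACCESS_OK)
            return r;
    }
    *frame = mm->pte[vpn];                 /* the retried access succeeds */
    return ACCESS_OK;
}
```

Only the first access to each page takes a fault; later accesses to the same page find the mapping already in place, which is the point of allocating physical memory on demand.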
Updating address mapping/permission
- Why update?
  - OS: swapping, copy-on-write, page migration
  - User: change page permissions, unmap
- System call to change page permissions: int mprotect(void *addr, size_t len, int prot);
- System call to free/unmap memory: int munmap(void *addr, size_t len);
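A minimal sketch of the two user-facing calls, assuming Linux: one page is mapped read-write, written once, then downgraded to read-only with `mprotect`, and finally released with `munmap`. The helper names `make_read_only_page` and `release_page` are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one page, store value in it, then remove write permission.
 * Returns the page, or NULL on failure. */
int *make_read_only_page(int value)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    int *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    p[0] = value;                          /* allowed: page is writable */
    if (mprotect(p, page, PROT_READ) != 0) {
        munmap(p, page);
        return NULL;
    }
    /* From here on, a store to p[0] would raise SIGSEGV. */
    return p;
}

/* Unmap a page obtained from make_read_only_page; returns 0 on success. */
int release_page(int *p)
{
    assert(p != NULL);
    return munmap(p, (size_t)sysconf(_SC_PAGESIZE));
}
```

Both calls change existing mappings, so the OS has to update the affected page table entries and make sure no core keeps using a stale cached translation.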
Update VMA flags, delete/split VMAs
[Figure: the task_struct -> mm_struct -> VMA list diagram from the earlier slide, including the vma_cache entry]
Update page table entry, issue TLB shootdown
Steps of TLB shootdown
1. The OS on the initiator core updates the page table entry.
2. The OS on the initiator core finds the set of other cores that may have a stale entry in their TLBs.
3. The OS on the initiator core sends an inter-processor interrupt (IPI) to the cores in that set and waits for acks.
4. While waiting for acks, the OS on the initiator core uses the invlpg instruction or writes to cr3 to invalidate its local TLB entries.
5. The other cores context switch to an OS thread and invalidate entries in their local TLBs via invlpg or a write to cr3.
6. The other cores send acks to the initiator core.
7. The TLB shootdown completes after the initiator receives all acks.
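The steps above can be traced in a toy, single-threaded model. This is only an illustration under strong assumptions: there is one virtual page, each core has a one-entry TLB, and "sending an IPI" is modeled as a plain loop rather than a real interrupt. All names (`toy_system`, `toy_translate`, `toy_shootdown`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CORES 4

/* Toy model: one virtual page, one PTE, a one-entry TLB per core. */
struct toy_system {
    uint64_t pte_pfn;               /* current mapping in the page table */
    bool     tlb_valid[NUM_CORES];  /* does core c cache a translation? */
    uint64_t tlb_pfn[NUM_CORES];    /* frame number cached by core c */
};

/* Translate on `core`: a TLB hit returns the cached frame, a miss
 * "walks" the page table and fills the TLB. */
uint64_t toy_translate(struct toy_system *sys, int core)
{
    assert(core >= 0 && core < NUM_CORES);
    if (!sys->tlb_valid[core]) {
        sys->tlb_pfn[core] = sys->pte_pfn;
        sys->tlb_valid[core] = true;
    }
    return sys->tlb_pfn[core];
}

/* Remap the page to new_pfn and shoot down stale TLB entries.
 * Returns the number of acks received from remote cores. */
int toy_shootdown(struct toy_system *sys, int initiator, uint64_t new_pfn)
{
    assert(initiator >= 0 && initiator < NUM_CORES);

    sys->pte_pfn = new_pfn;                    /* step 1: update the PTE */

    bool target[NUM_CORES];
    int sent = 0;
    for (int c = 0; c < NUM_CORES; c++) {      /* steps 2-3: pick targets, "send IPIs" */
        target[c] = (c != initiator) && sys->tlb_valid[c];
        if (target[c])
            sent++;
    }

    sys->tlb_valid[initiator] = false;         /* step 4: local invalidation */

    int acks = 0;
    for (int c = 0; c < NUM_CORES; c++) {      /* steps 5-6: remote invalidate + ack */
        if (target[c]) {
            sys->tlb_valid[c] = false;
            acks++;
        }
    }

    assert(acks == sent);                      /* step 7: all acks received */
    return acks;
}
```

Without the shootdown, a core that cached the old translation would keep returning the old frame even after the page table entry changed, which is exactly the stale-entry problem the protocol addresses.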
Special topic: Memory mapped files
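Mapping parts of a file to virtual addresses
- A user application or library requests VA allocation via system calls.
- void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
- fd and offset select the file and the starting position of the part to map; offset must be a multiple of the page size
- length: mappings cover whole pages, so the length is effectively rounded up to a multiple of the page size (4KB)
- prot: PROT_NONE, PROT_READ, PROT_WRITE, ...
- flags: MAP_ANONYMOUS, MAP_SHARED, MAP_PRIVATE, MAP_FIXED, MAP_HUGE_2MB, MAP_HUGE_1GB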
Mapping a file to virtual addresses
- Traditional way to access file content:
  - int fd = open(const char *path, int oflag, ...); flags: O_RDONLY, O_CREAT, O_RDWR
  - ssize_t read(int fd, void *buf, size_t count); reads file data into buf
- Mapping file content:
  - int *a = (int *) mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
  - Access file content as if accessing an array starting at address "a"
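As a minimal sketch of the mapped approach, assuming Linux, the helper below maps an entire file read-only so that its bytes can be read like an in-memory array. The name `map_file_readonly` is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the whole file at path read-only.
 * On success returns the starting VA and stores the file size in *len;
 * returns NULL on failure or if the file is empty. */
void *map_file_readonly(const char *path, size_t *len)
{
    assert(path != NULL && len != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;

    *len = (size_t)st.st_size;
    return p;
}
```

With `const int *a = map_file_readonly(path, &len);`, the expression `a[i]` reads the i-th integer stored in the file; file pages are brought into memory on first access through the same page fault path used for anonymous memory. The caller releases the mapping with `munmap((void *)a, len)`.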