Memory Management.


1 Memory Management

2 Outline
Address Translation: virtual address => physical address
Memory Hierarchy
Cache concept

3 Address Translation: Main Points
Address Translation Concept: how do we convert a virtual address to a physical address?
Flexible address translation: base and bounds; segmentation; paging; multilevel translation
Efficient address translation: translation lookaside buffers; virtually and physically addressed caches

4 Address Translation Concept

5 Address Translation Goals
Memory protection
Memory sharing: shared libraries, interprocess communication
Sparse addresses: multiple regions of dynamic allocation (heaps/stacks)
Efficiency: memory placement, runtime lookup, compact translation tables
Portability

6 Bonus Feature What can you do if you can (selectively) gain control whenever a program reads or writes a particular virtual memory location? Examples: Copy on write Zero on reference Fill on demand Demand paging Memory mapped files

7 Address Translation Uses
Process isolation: keep a process from touching anyone else's memory, or the kernel's
Efficient interprocess communication: shared regions of memory between processes
Shared code segments: e.g., common libraries used by many different programs
Program initialization: start running a program before it is entirely in memory
Dynamic memory allocation: allocate and initialize stack/heap pages on demand

8 Address Translation (more)
Program debugging: data breakpoints when an address is accessed
Zero-copy I/O: directly from the I/O device into/out of user memory
Memory mapped files: access file data using load/store instructions
Demand-paged virtual memory: illusion of near-infinite memory, backed by disk or memory on other machines

9 Address Translation (even more)
Checkpoint/restart: transparently save a copy of a process, without stopping the program while the save happens
Persistent data structures: implement data structures that can survive system reboots
Process migration: transparently move processes between machines
Information flow control: track what data is being shared externally
Distributed shared memory: illusion of memory that is shared between machines

10 Virtually Addressed Base and Bounds

11 Question With virtually addressed base and bounds, what is saved/restored on a process context switch? Only the OS gets to change the base and bounds; clearly the user program can't, or protection is lost. On a context switch, the contents of the base and bounds registers must be saved and restored; or maybe even the contents of memory.

12 Virtually Addressed Base and Bounds
Pros? Simple. Fast (2 registers, an adder, a comparator). Safe. Can relocate in physical memory without changing the process.
Cons? Can't keep a program from accidentally overwriting its own code. Can't share code/data with other processes. Can't grow stack/heap as needed.
Base and bounds provides a level of indirection: the OS can move bits around behind the program's back, for instance if the program needs to grow beyond its bounds, or if fragments of memory need to be coalesced. Stop the program, copy the bits, change the base and bounds registers, restart. Because the OS manages the translation, we control the vertical, we control the horizontal.
Hardware cost: 2 registers, an adder, a comparator. It also slows down the hardware, because the add/compare takes time on every memory reference. In the 60's this was thought to be too expensive! But you get better use of memory, and you get protection. Those are really important; CPU speed (especially now) is the easy part.
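To make the mechanism concrete, here is a minimal sketch of base-and-bounds translation in C. The register values and the exit-on-fault shortcut are illustrative assumptions, not any particular machine's behavior.

```c
/* A minimal sketch of base-and-bounds translation.
   Register values are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint32_t base  = 0x100000;  /* physical start of the process */
static uint32_t bound = 0x004000;  /* size of the process's memory */

uint32_t translate(uint32_t vaddr) {
    if (vaddr >= bound) {           /* out of range: trap to the kernel */
        fprintf(stderr, "protection fault at 0x%x\n", vaddr);
        exit(1);
    }
    return base + vaddr;            /* one add + compare per memory reference */
}

int main(void) {
    printf("0x42 -> 0x%x\n", translate(0x42));
    return 0;
}
```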

13 Segmentation Segment is a contiguous region of virtual memory
Each process has a segment table (in hardware); each entry in the table describes one segment. A segment can be located anywhere in physical memory. Each segment has: start, length, access permissions. Processes can share segments: same start and length, with same or different access permissions.
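Here is a minimal sketch of the segment-table lookup in C. The 4-bit segment / 28-bit offset address split and all names are illustrative assumptions; permission checks are omitted for brevity.

```c
/* A minimal sketch of segmentation-based address translation.
   The bit split and names are assumptions, not a real machine's layout. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SEG_BITS  4                  /* top bits select the segment */
#define OFF_BITS  28                 /* remaining bits are the offset */
#define NUM_SEGS  (1u << SEG_BITS)

typedef struct {
    uint32_t start;    /* physical start address of the segment */
    uint32_t length;   /* segment length in bytes */
    int      valid;    /* is this segment mapped at all? */
} seg_entry;

static seg_entry seg_table[NUM_SEGS];

uint32_t translate(uint32_t vaddr) {
    uint32_t seg = vaddr >> OFF_BITS;
    uint32_t off = vaddr & ((1u << OFF_BITS) - 1);
    if (!seg_table[seg].valid || off >= seg_table[seg].length) {
        fprintf(stderr, "segfault at 0x%x\n", vaddr);
        exit(1);                     /* real hardware would trap to the kernel */
    }
    return seg_table[seg].start + off;
}

int main(void) {
    seg_table[0] = (seg_entry){ .start = 0x10000, .length = 0x4000, .valid = 1 };
    printf("0x%x -> 0x%x\n", 0x123u, translate(0x123u));
    return 0;
}
```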

14 Segmentation This should seem a bit strange: the virtual address space has gaps in it! Each segment gets mapped to contiguous locations in physical memory, but there may be gaps between segments. This is a little like walking around in the dark where there are huge pits in the ground: you die if you step in a pit. But of course, a correct program will never step into a pit, so that's OK. A correct program will never address the gaps; if it does, the hardware traps to the kernel, which core dumps the process.
Minor exception: the stack and heap can grow. In UNIX, sbrk() increases the size of the heap segment. For the stack, the program just takes a fault and the system automatically increases the stack's size.
Detail: segment table entries carry a protection mode. For example, the code segment would be read-only (only execution and loads allowed), while the data and stack segments would be read-write (stores allowed).
What must be saved/restored on a context switch? Typically the segment table is stored in the CPU, not in memory, because it's small; its contents must be saved and restored.

15 Question With segmentation, what is saved/restored on a process context switch? Only the OS gets to change the segment table; clearly the user program can't, or protection is lost. Because the segment table is typically stored in the CPU (it's small), its contents must be saved and restored on every context switch.

16 UNIX fork and Copy on Write
fork makes a complete copy of a process. Segments allow a more efficient implementation:
Copy the segment table into the child
Mark parent and child segments read-only
Start the child process; return to the parent
If the child or parent writes to a segment (e.g., stack or heap): trap into the kernel, make a copy of the segment, and resume

17

18 Zero-on-Reference How much physical memory is needed for the stack or heap? Only what is currently in use. When the program uses memory beyond the end of the stack:
Segmentation fault into the OS kernel
Kernel allocates some memory (how much?)
Zeroes the memory (to avoid accidentally leaking information!)
Modifies the segment table
Resumes the process

19 Segmentation
Pros?
Can share code/data segments between processes
Can protect the code segment from being overwritten
Can transparently grow the stack/heap as needed
Can detect when copy-on-write is needed
Cons?
Complex memory management: need to find a chunk of a particular size; may need to rearrange memory from time to time to make room for a new or growing segment
External fragmentation: wasted space between chunks

20 Paged Translation Manage memory in fixed size units, or pages
Finding a free page is easy: bitmap allocation, where each bit represents one physical page frame. Each process has its own page table, stored in physical memory. Hardware registers hold a pointer to the page table's start and the page table's length.
Paging is simpler because it allows use of a bitmap. What's a bitmap? Each bit represents one page of physical memory: 1 means allocated, 0 means unallocated. Lots simpler than base&bounds or segmentation.
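As a sketch of bitmap allocation (the frame count and names are illustrative assumptions):

```c
/* A minimal sketch of bitmap-based page-frame allocation. */
#include <stdint.h>
#include <stdio.h>

#define NUM_FRAMES 4096
static uint8_t bitmap[NUM_FRAMES / 8];          /* 1 bit per physical frame */

/* Return a free frame number, or -1 if memory is full. */
int alloc_frame(void) {
    for (int f = 0; f < NUM_FRAMES; f++) {
        if (!(bitmap[f / 8] & (1u << (f % 8)))) {   /* bit clear => free */
            bitmap[f / 8] |= (1u << (f % 8));       /* mark allocated */
            return f;
        }
    }
    return -1;
}

void free_frame(int f) {
    bitmap[f / 8] &= ~(1u << (f % 8));              /* mark free */
}

int main(void) {
    int a = alloc_frame(), b = alloc_frame();
    printf("allocated frames %d and %d\n", a, b);   /* 0 and 1 */
    free_frame(a);
    printf("next allocation reuses frame %d\n", alloc_frame());
    return 0;
}
```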

21 Paged Translation (Abstract)

22 Paged Translation (Implementation)

23 Paging Questions With paging, what is saved/restored on a process context switch? The pointer to the page table and the size of the page table; the page table itself stays in main memory. What if the page size is very small? Lots of space gets taken up by page table entries. What if the page size is very large? Internal fragmentation: wasted space when we don't need all of the room inside a fixed-size chunk.

24 Paging and Copy on Write
Can we share memory between processes? Set entries in both page tables to point to the same page frames. Need a core map of page frames to track which processes are pointing to which page frames (e.g., a reference count).
UNIX fork with copy on write:
Copy the page table of the parent into the child process
Mark all pages (in both page tables) as read-only
Trap into the kernel on a write (by child or parent)
Copy the page
Mark both copies as writeable
Resume execution
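A minimal, self-contained sketch of this copy-on-write scheme in C. The frame sizes, the bump allocator, and all names are illustrative assumptions, not a real kernel's data structures.

```c
/* A minimal sketch of copy-on-write at fork time. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define NUM_FRAMES 16
#define PAGE_SIZE  64

static uint8_t memory[NUM_FRAMES][PAGE_SIZE];
static int     refcount[NUM_FRAMES];          /* core map: sharers per frame */
static int     next_free = 1;                 /* stand-in for a real allocator */

typedef struct { int frame; int writable; } pte_t;

/* fork: child shares the parent's frame; both become read-only. */
void cow_fork(pte_t *parent, pte_t *child) {
    *child = *parent;
    parent->writable = child->writable = 0;
    refcount[parent->frame]++;
}

/* Write fault: copy the frame only if someone else still shares it. */
void cow_fault(pte_t *pte) {
    if (refcount[pte->frame] > 1) {
        int new = next_free++;
        memcpy(memory[new], memory[pte->frame], PAGE_SIZE);
        refcount[pte->frame]--;
        refcount[new] = 1;
        pte->frame = new;
    }
    pte->writable = 1;                        /* safe to write again */
}

int main(void) {
    pte_t parent = { .frame = 0, .writable = 1 };
    refcount[0] = 1;
    pte_t child;
    cow_fork(&parent, &child);
    cow_fault(&child);                        /* child writes: gets its own copy */
    printf("parent frame %d, child frame %d\n", parent.frame, child.frame);
    return 0;
}
```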

25 Fill On Demand Can I start running a program before its code is in physical memory? Set all page table entries to invalid. When a page is referenced for the first time, trap to the kernel; the kernel brings the page in from disk and resumes execution. The remaining pages can be transferred in the background while the program is running. On Windows, the compiler reorganizes where procedures sit in the executable, putting the initialization code in a few pages right at the beginning.

26 Sparse Address Spaces
Might want many separate dynamic segments: per-processor heaps, per-thread stacks, memory-mapped files, dynamically linked libraries.
What if the virtual address space is large? 32 bits with 4KB pages => roughly 500K page table entries for the user portion of the address space; 64 bits => 4 quadrillion page table entries.
Is there a solution that allows simple memory allocation, makes it easy to share memory, and is efficient for sparse address spaces?

27 Multi-level Translation
Tree of translation tables: paged segmentation; multi-level page tables; multi-level paged segmentation. Fixed-size page as the lowest-level unit of allocation:
Efficient memory allocation (compared to segments)
Efficient for sparse addresses (compared to paging)
Efficient disk transfers (fixed-size units)
Easier to build translation lookaside buffers
Efficient reverse lookup (from physical -> virtual)
Variable granularity for protection/sharing

28 Paged Segmentation
Process memory is segmented. Segment table entry: pointer to page table; page table length (# of pages in the segment); access permissions. Page table entry: page frame. Sharing/protection at either the page or segment level.

29 Paged Segmentation (Implementation)

30 Question With paged segmentation, what must be saved/restored across a process context switch?

31 Multilevel Paging

32 Question Write pseudo-code for translating a virtual address to a physical address for a system using 3-level paging.
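One possible answer, sketched below as runnable C rather than pseudo-code. The 9/9/9/12-bit address split, the tiny tables array standing in for physical page-table pages, and all names are assumptions for illustration.

```c
/* Sketch: 3-level page table walk. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LEVEL_BITS  9
#define OFFSET_BITS 12
#define ENTRIES     (1u << LEVEL_BITS)

typedef struct { int valid; uint64_t next; } pte_t;  /* next: frame or table id */

static pte_t tables[8][ENTRIES];     /* stand-in for page-table pages */

static unsigned idx(uint64_t va, int level) {
    return (va >> (OFFSET_BITS + level * LEVEL_BITS)) & (ENTRIES - 1);
}

uint64_t translate(int root, uint64_t va) {
    int table = root;
    for (int level = 2; ; level--) {             /* level 2 is the root */
        pte_t e = tables[table][idx(va, level)];
        if (!e.valid) {
            fprintf(stderr, "page fault at 0x%llx\n", (unsigned long long)va);
            exit(1);                 /* a real kernel would handle the fault */
        }
        if (level == 0)              /* leaf entry: frame number + page offset */
            return (e.next << OFFSET_BITS) | (va & ((1u << OFFSET_BITS) - 1));
        table = (int)e.next;         /* descend to the next-level table */
    }
}

int main(void) {
    /* Map virtual page 0 -> physical frame 42 through tables 0 -> 1 -> 2. */
    tables[0][0] = (pte_t){ 1, 1 };
    tables[1][0] = (pte_t){ 1, 2 };
    tables[2][0] = (pte_t){ 1, 42 };
    printf("va 0x123 -> pa 0x%llx\n", (unsigned long long)translate(0, 0x123));
    return 0;
}
```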

33 x86 Multilevel Paged Segmentation
Global Descriptor Table (segment table): pointer to a page table for each segment; segment length; segment access permissions. Context switch: change the global descriptor table register (GDTR, the pointer to the global descriptor table).
Multilevel page table: 4KB pages; each level of the page table fits in one page. 32-bit: two-level page table (per segment). 64-bit: four-level page table (per segment). Omit a sub-tree if it contains no valid addresses.

34 Multilevel Translation
Pros: Allocate/fill only page table entries that are in use Simple memory allocation Share at segment or page level Cons: Space overhead: one pointer per virtual page Two (or more) lookups per memory reference

35 Portability Many operating systems keep their own memory translation data structures List of memory objects (segments) Virtual page -> physical page frame Physical page frame -> set of virtual pages One approach: Inverted page table Hash from virtual page -> physical page Space proportional to # of physical pages

36 Efficient Address Translation
Translation lookaside buffer (TLB) Cache of recent virtual page -> physical page translations If cache hit, use translation If cache miss, walk multi-level page table Cost of translation = Cost of TLB lookup + Prob(TLB miss) * cost of page table lookup
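As a worked example with illustrative numbers (not any specific processor): if a TLB lookup costs 1 ns, a full page table walk costs 40 ns, and the TLB hits 99% of the time, the average cost of translation is 1 + 0.01 × 40 = 1.4 ns.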

37 TLB and Page Table Translation

38 TLB Lookup

39 Question What is the cost of a TLB miss on a modern processor?
Cost of a multi-level page table walk. On MIPS, with its software-loaded TLB: plus the cost of trap handler entry/exit.

40 Hardware Design Principle
The bigger the memory, the slower the memory

41 Caching and Demand-Paged Virtual Memory

42 Memory Hierarchy Anecdote: the system I used for building my first OS would fit in today's first-level cache. The implication is that a TLB miss will be like a first-level cache miss, so a second-level TLB can be pretty big. Even then, on a second-level miss, a bunch of the page table entries are likely to be in the third-level cache. The i7 has 8MB of shared third-level cache; the second-level cache is per-core.

43 Definitions
Cache: a copy of data that is faster to access than the original. Hit: the cache has the copy. Miss: the cache does not have the copy.
Cache block: the unit of cache storage (multiple memory locations).
Temporal locality: programs tend to reference the same memory locations multiple times. Example: instructions in a loop.
Spatial locality: programs tend to reference nearby locations. Example: data in a loop.

44 Cache Concept (Read)

45 Cache Concept (Write)
Write through: changes are sent immediately to the next level of storage.
Write back: changes are stored in the cache until the cache block is replaced.

46 Memory Hierarchy Anecdote: the system I used for building my first OS would fit in today's first-level cache. The i7 has 8MB of shared third-level cache; the second-level cache is per-core.

47 Main Points Can we provide the illusion of near infinite memory in limited physical memory? Demand-paged virtual memory Memory-mapped files How do we choose which page to replace? FIFO, MIN, LRU, LFU, Clock What types of workloads does caching work for, and how well? Spatial/temporal locality vs. Zipf workloads

48 Hardware address translation is a power tool
Kernel trap on read/write to selected addresses Copy on write Fill on reference Zero on use Demand paged virtual memory Memory mapped files Modified bit emulation Use bit emulation

49 Demand Paging (Before)

50 Demand Paging (After)

51 Demand Paging on MIPS
TLB miss
Trap to kernel
Page table walk
Find page is invalid
Convert virtual address to file + offset
Allocate page frame (evict a page if needed)
Initiate disk block read into page frame
Disk interrupt when DMA complete
Mark page as valid
Load TLB entry
Resume process at faulting instruction
Execute instruction

52 Demand Paging
TLB miss
Page table walk
Page fault (page invalid in page table)
Trap to kernel
Convert virtual address to file + offset
Allocate page frame (evict a page if needed)
Initiate disk block read into page frame
Disk interrupt when DMA complete
Mark page as valid
Resume process at faulting instruction
TLB miss
Page table walk to fetch translation
Execute instruction

53 Allocating a Page Frame
Select an old page to evict. Find all page table entries that refer to the old page, in case the page frame is shared. Set each page table entry to invalid. Remove any TLB entries (copies of the now-invalid page table entry). Write changes on the page back to disk, if necessary. Why didn't we need to shoot down the TLB before?

54 How do we know if page has been modified?
Every page table entry has some bookkeeping:
Has the page been modified? (dirty bit) Set by hardware on a store instruction, in both the TLB and the page table entry.
Has the page been recently used? (use bit) Set by hardware in the page table entry on every TLB miss.
Bookkeeping bits can be reset by the OS kernel: when changes to the page are flushed to disk, or to track whether the page has been recently used.

55 Keeping Track of Page Modifications (Before)

56 Keeping Track of Page Modifications (After)

57 Virtual or Physical Dirty/Use Bits
Most machines keep dirty/use bits in the page table entry Physical page is Modified if any page table entry that points to it is modified Recently used if any page table entry that points to it is recently used On MIPS, simpler to keep dirty/use bits in the core map Core map: map of physical page frames

58 Emulating Modified/Use Bits w/ MIPS Software Loaded TLB
MIPS TLB entries have an extra bit: modified/unmodified. Trap to the kernel if there is no entry in the TLB, or on a write to an unmodified page.
On a TLB read miss: if the page is clean, load the TLB entry as read-only; if dirty, load it as read-write. Mark the page as recently used.
On a TLB write to an unmodified page: the kernel marks the page as modified in its page table, and resets the TLB entry to be read-write.
On a TLB write miss: load the TLB entry as read-write.
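A minimal sketch of this policy for a software-loaded TLB. The pte_t fields, the tlb_load() stub, and the handler names are illustrative assumptions, not the actual MIPS interface.

```c
/* Sketch: kernel handlers that emulate dirty/use bits in software. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int  frame;
    bool dirty;   /* modified since last flush to disk */
    bool used;    /* recently used, maintained by the kernel */
} pte_t;

/* Stand-in for installing a translation into the hardware TLB. */
static void tlb_load(int vpage, int frame, bool rw) {
    printf("TLB: vpage %d -> frame %d (%s)\n", vpage, frame, rw ? "rw" : "ro");
}

void tlb_read_miss(int vpage, pte_t *pte) {
    pte->used = true;                   /* every miss marks the page used */
    /* Clean pages load read-only so the first write traps to the kernel. */
    tlb_load(vpage, pte->frame, pte->dirty);
}

void tlb_write_to_clean(int vpage, pte_t *pte) {
    pte->dirty = true;                  /* record the modification */
    tlb_load(vpage, pte->frame, true);  /* reset the entry to read-write */
}

void tlb_write_miss(int vpage, pte_t *pte) {
    pte->used = true;
    pte->dirty = true;                  /* a write follows immediately */
    tlb_load(vpage, pte->frame, true);
}

int main(void) {
    pte_t pte = { .frame = 7 };
    tlb_read_miss(3, &pte);             /* loads read-only */
    tlb_write_to_clean(3, &pte);        /* upgrade after the write trap */
    return 0;
}
```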

59 Emulating a Modified Bit (Hardware Loaded TLB)
Some processor architectures do not keep a modified bit per page: extra bookkeeping and complexity. The kernel can emulate a modified bit:
Set all clean pages as read-only
On the first write to a page, trap into the kernel
Kernel sets the modified bit, marks the page as read-write
Resume execution
The kernel needs to keep track of both the current page table permission (e.g., read-only) and the true page table permission (e.g., writeable, clean).

60 Emulating a Recently Used Bit (Hardware Loaded TLB)
Some processor architectures do not keep a recently used bit per page: extra bookkeeping and complexity. The kernel can emulate a recently used bit:
Set all recently unused pages as invalid
On the first read/write, trap into the kernel
Kernel sets the recently used bit, and marks the page as read or read/write
The kernel needs to keep track of both the current page table permission (e.g., invalid) and the true page table permission (e.g., read-only, writeable).

61 Models for Application File I/O
Explicit read/write system calls: data is copied to the user process by the system call; the application operates on the data; data is copied back to the kernel by another system call.
Memory-mapped files: open the file as a memory segment; the program uses load/store instructions on the segment memory, implicitly operating on the file; page fault if a portion of the file is not yet in memory; the kernel brings the missing blocks into memory and restarts the process.
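A short illustration of the two models using the POSIX calls read() and mmap(); the file name data.bin is hypothetical.

```c
/* Sketch: explicit read vs. memory-mapped access to the same file. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) return 1;

    /* Model 1: explicit read -- data is copied into a user buffer. */
    char buf[16];
    ssize_t n = read(fd, buf, sizeof buf);
    if (n > 0) printf("read(): first byte = %d\n", buf[0]);

    /* Model 2: memory-mapped -- loads fault pages in on demand. */
    struct stat st;
    fstat(fd, &st);
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p != MAP_FAILED) {
        printf("mmap(): first byte = %d\n", p[0]);  /* page fault brings it in */
        munmap(p, st.st_size);
    }
    close(fd);
    return 0;
}
```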

62 Advantages to Memory-mapped Files
Programming simplicity, especially for large files: operate directly on the file, instead of copy in/copy out
Zero-copy I/O: data is brought from disk directly into the page frame
Pipelining: the process can start working before all the pages are populated
Interprocess communication: shared memory segment vs. temporary file

63 From Memory-Mapped Files to Demand-Paged Virtual Memory
Every process segment backed by a file on disk Code segment -> code portion of executable Data, heap, stack segments -> temp files Shared libraries -> code file and temp data file Memory-mapped files -> memory-mapped files When process ends, delete temp files Unified memory management across file buffer and process memory

64 Cache Replacement Policy
On a cache miss, how do we choose which entry to replace? Assuming the new entry is more likely to be used in the near future In direct mapped caches, not an issue! Policy goal: reduce cache misses Improve expected case performance Also: reduce likelihood of very poor performance

65 A Simple Policy
Random? Replace a random entry.
FIFO? Replace the entry that has been in the cache the longest time.
What could go wrong?

66 FIFO in Action The worst case for FIFO is when a program strides through a region of memory larger than the cache.

67 MIN, LRU, LFU
MIN: replace the cache entry that will not be used for the longest time into the future. The optimality proof is based on exchange: if you evict an entry that is used sooner, that triggers an earlier cache miss.
Least Recently Used (LRU): replace the cache entry that has not been used for the longest time in the past. An approximation of MIN.
Least Frequently Used (LFU): replace the cache entry used the least often (in the recent past).
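A minimal LRU sketch using logical timestamps; a real implementation would keep a doubly linked list for O(1) updates. Sizes and names are illustrative assumptions.

```c
/* Sketch: LRU replacement over a tiny fully associative cache. */
#include <stdio.h>

#define NENTRIES 4

static int  key[NENTRIES];        /* what is cached */
static long last_used[NENTRIES];  /* logical time of last access */
static long clock_ticks;

/* Access a key; return its slot, evicting the LRU entry on a miss. */
int lru_access(int k) {
    int victim = 0;
    for (int i = 0; i < NENTRIES; i++) {
        if (key[i] == k) {               /* hit: refresh timestamp */
            last_used[i] = ++clock_ticks;
            return i;
        }
        if (last_used[i] < last_used[victim])
            victim = i;                  /* track the oldest entry */
    }
    key[victim] = k;                     /* miss: replace the LRU entry */
    last_used[victim] = ++clock_ticks;
    return victim;
}

int main(void) {
    for (int k = 1; k <= 6; k++) lru_access(k);  /* 5 and 6 evict 1 and 2 */
    printf("slot of key 6: %d\n", lru_access(6));
    return 0;
}
```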

68 LRU/MIN for Sequential Scan

69 For a reference pattern that exhibits locality

70 Belady’s Anomaly

71 Clock Algorithm: Estimating LRU
Periodically, sweep through all pages If page is unused, reclaim If page is used, mark as unused
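A minimal sketch of the sweep in C, where use_bit[] stands in for the hardware-maintained use bits; names and sizes are illustrative assumptions.

```c
/* Sketch: the clock (second-chance) sweep. */
#include <stdio.h>

#define NUM_FRAMES 8

static int use_bit[NUM_FRAMES];   /* set by "hardware" on each access */
static int hand;                  /* current position of the clock hand */

/* Sweep until a frame that has not been used since the last pass. */
int clock_evict(void) {
    for (;;) {
        if (!use_bit[hand]) {         /* unused: reclaim this frame */
            int victim = hand;
            hand = (hand + 1) % NUM_FRAMES;
            return victim;
        }
        use_bit[hand] = 0;            /* used: clear and give a second chance */
        hand = (hand + 1) % NUM_FRAMES;
    }
}

int main(void) {
    use_bit[0] = use_bit[1] = 1;      /* frames 0 and 1 recently used */
    printf("evict frame %d\n", clock_evict());   /* prints 2 */
    return 0;
}
```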

72 Nth Chance: Not Recently Used
Instead of one bit per page, keep an integer per page frame: notInUseSince, the number of sweeps since last use. Periodically sweep through all page frames:
if (page is used) { notInUseSince = 0; }
else if (notInUseSince < N) { notInUseSince++; }
else { reclaim page; }

73 Implementation Note Clock and Nth Chance can run synchronously
Synchronously: in the page fault handler, run the algorithm to find the next page to evict; this might require writing changes back to disk first.
Or asynchronously: create a thread to maintain a pool of recently unused, clean pages. Find recently unused dirty pages and write their modifications back to disk; find recently unused clean pages, mark them as invalid, and move them to the pool. On a page fault, check whether the requested page is still in the pool! If it is, it can be reclaimed without a disk read; if not, take a free frame from the pool.

74 Recap
MIN is optimal: replace the page or cache entry that will be used farthest into the future.
LRU is an approximation of MIN for programs that exhibit spatial and temporal locality.
Clock/Nth Chance is an approximation of LRU: bin pages into sets of "not recently used".

75 Working Set Model
Working set: the set of memory locations that need to be cached for a reasonable cache hit rate.
Thrashing: when the system has too small a cache.

76 Cache Working Set This graph isn't as useful as it should be! Try a log-scale miss rate. If the miss cost is 1000x the cache access time, then even with a 99.9% hit rate you are spending half your time handling misses.

77 Phase Change Behavior

78 Question What happens to system performance as we increase the number of processes? If the sum of the working sets > physical memory?

79 Zipf Distribution Caching behavior of many systems is not well characterized by the working set model. An alternative is the Zipf distribution: popularity ~ 1/k^c for the kth most popular item, with 1 < c < 2.
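A tiny illustration of the curve, computed with an assumed exponent c = 1.2 (link with -lm):

```c
/* Sketch: relative popularity of the k-th most popular item under Zipf. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double c = 1.2;                       /* assumed; typical range 1 < c < 2 */
    for (int k = 1; k <= 5; k++)
        printf("item %d: relative popularity %.3f\n", k, 1.0 / pow(k, c));
    return 0;
}
```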

80 Zipf Distribution Alpha between 1 and 2

81 Zipf Examples Web pages, movies, library books, words in text, salaries, city population. Common thread: popularity is self-reinforcing.

82 Zipf and Caching

83 Cache Lookup: Fully Associative

84 Cache Lookup: Direct Mapped

85 Cache Lookup: Set Associative

86 Page Coloring What happens when cache size >> page size?
Direct mapped or set associative: multiple pages map to the same cache line, so OS page assignment matters! Example: 8MB cache, 4KB pages => 1 of every 2K pages lands in the same place in the cache. What should the OS do?
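A sketch of the arithmetic with the slide's numbers, assuming a direct-mapped cache indexed by physical address; the helper names are illustrative.

```c
/* Sketch: page colors for an 8MB cache with 4KB pages (2K colors). */
#include <stdio.h>

#define CACHE_SIZE (8 * 1024 * 1024)
#define PAGE_SIZE  (4 * 1024)
#define NUM_COLORS (CACHE_SIZE / PAGE_SIZE)   /* = 2048 colors */

/* Pages with the same color compete for the same cache lines. */
unsigned page_color(unsigned long phys_addr) {
    return (phys_addr / PAGE_SIZE) % NUM_COLORS;
}

int main(void) {
    /* Two frames exactly 8MB apart collide in the cache. */
    printf("color of frame at 0x000000: %u\n", page_color(0x000000));
    printf("color of frame at 0x800000: %u\n", page_color(0x800000));
    return 0;
}
```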

87 Page Coloring

