CS 443 Advanced OS Fabián E. Bustamante, Spring 2005 Virtual Memory Primitives for User Programs Andrew Appel and Kai Li Princeton U. Appears in ASPLOS.

1 CS 443 Advanced OS, Fabián E. Bustamante, Spring 2005
Virtual Memory Primitives for User Programs
Andrew Appel and Kai Li, Princeton U.
Appears in ASPLOS 1991
Presented by: Fabián E. Bustamante

2 Motivation
Uses of virtual memory
– Traditional: increase the size of the address space by keeping the frequently accessed subset in physical memory
– IPC through shared pages
– Guarantee reentrancy by making instruction spaces read-only
– Zeroed-on-demand or copy-on-write portions of memory
– …
Modern OSs enable user programs to play such tricks by allowing handlers to be associated with protection violations
This work
– Looks at examples of algorithms that use user-level page-protection techniques
– Benchmarks today’s (1991) OS support for such techniques
– Draws some lessons for OS implementations

3 VM Primitives
OS VM services needed by some of these applications
– TRAP: handle page-fault traps in user mode
– PROT1: decrease the accessibility of 1 page
– PROTN: same for N pages
– UNPROT: increase the accessibility of 1 page
– DIRTY: return the list of pages dirtied since the previous call
– MAP2: map the same physical page at 2 different virtual addresses, at different levels of protection, in the same address space
– PAGESIZE: changes?
PROT1 & PROTN: some OSs may support only one, but most applications decrease accessibility for a batch of pages (batching is not commonly needed for UNPROT)
DIRTY could be done in user mode with PROTN, TRAP and UNPROT, but the OS can do it more efficiently
Some applications want one thread to have access to a particular page while other threads fault on it; this can be done with MAP2
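
The primitives above can be modelled in a few lines. What follows is a minimal sketch in Python, a pure simulation rather than a real OS interface: the class `SimulatedVM`, its method names and the `handler` callback are hypothetical, and a "trap" is simply a function call made when a write hits a protected page.

```python
class SimulatedVM:
    """Toy model of the slide's primitives; not a real OS interface."""

    def __init__(self, num_pages, on_trap):
        self.writable = [True] * num_pages   # per-page protection bit
        self.dirty = set()                   # pages written since last DIRTY call
        self.on_trap = on_trap               # TRAP: user-level fault handler

    def prot1(self, page):                   # PROT1: write-protect one page
        self.writable[page] = False

    def protn(self, pages):                  # PROTN: write-protect a batch
        for p in pages:
            self.writable[p] = False

    def unprot(self, page):                  # UNPROT: make one page writable
        self.writable[page] = True

    def get_dirty(self):                     # DIRTY: pages written since last call
        pages, self.dirty = sorted(self.dirty), set()
        return pages

    def write(self, page):
        if not self.writable[page]:
            self.on_trap(self, page)         # handler is expected to unprotect
        if not self.writable[page]:
            raise PermissionError(page)      # handler left the page protected
        self.dirty.add(page)


faults = []

def handler(vm, page):
    faults.append(page)   # record the fault
    vm.unprot(page)       # UNPROT so the faulting write can proceed

vm = SimulatedVM(num_pages=4, on_trap=handler)
vm.protn(range(4))        # protect a batch of pages
vm.write(2)               # traps once, then succeeds
vm.write(2)               # no trap: the page is already writable
```

In this sketch the dirty set is kept by the simulated "OS" itself; the pattern of one batch protect followed by one unprotect per fault is the one most of the later slides rely on.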

4 VM Applications
A sample of applications that use VM primitives, used to draw general conclusions on what user programs require from the OS & HW
– Concurrent garbage collection
– Shared virtual memory
– Concurrent checkpointing
– Generational garbage collection
– Persistent stores
– Extending addressability
– Data-compression paging
– Heap overflow detection

5 VM Apps: Real-time, concurrent GC
Stop-and-copy GC
– Memory is divided into 2 contiguous regions: from-space & to-space
– At the beginning of a collection, all objects are in from-space
– The collector, starting from registers & other global roots, traces out the graph of reachable objects, copying them to to-space
– When done, what’s left in from-space is garbage
– At that point, the roles of from-space & to-space are reversed (flip)
– The mutator (application program) runs until to-space is full
Forwarding: examining a pointer into from-space, copying the referenced object if necessary & updating the pointer
Obvious problem: long delay while the mutator is suspended
Real-time GC: the mutator is never interrupted for longer than a very small constant time
Baker’s algorithm (real-time, sequential): the mutator sees only to-space pointers
– Checking every pointer fetch requires HW support for efficiency
– Needs to be sequential to avoid conflicting accesses to objects
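
The stop-and-copy steps above can be sketched as follows. This is an illustrative Python model, not the presenter's code: the heap is a dictionary from hypothetical from-space addresses to lists of pointer fields, to-space is a plain list, and the scan loop follows the well-known Cheney style of using to-space itself as the work queue (a choice made here, not named on the slide).

```python
def collect(from_space, roots):
    """Copy objects reachable from roots into a fresh to-space.

    from_space: dict mapping address -> list of fields, where each field is
    either an address in from_space or None.
    Returns (to_space, new_roots); to-space addresses are list indices.
    """
    to_space = []        # objects already copied
    forwarding = {}      # from-space address -> to-space address

    def forward(addr):
        if addr is None:
            return None
        if addr not in forwarding:            # not copied yet: copy it now
            forwarding[addr] = len(to_space)
            to_space.append(list(from_space[addr]))
        return forwarding[addr]               # updated pointer

    new_roots = [forward(r) for r in roots]
    scan = 0
    while scan < len(to_space):               # scan copied objects in order
        obj = to_space[scan]
        for i, field in enumerate(obj):
            obj[i] = forward(field)           # forward each pointer field
        scan += 1
    return to_space, new_roots


heap = {10: [20, None], 20: [10], 30: [10]}   # object 30 is unreachable
to_space, new_roots = collect(heap, roots=[10])
```

Everything not copied (object 30 here) is garbage, and the mutator would be paused for the whole call, which is the delay the real-time variants try to avoid.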

6 Real-time, concurrent GC
Instead of checking every pointer fetched from memory, the collector uses VM page protection
Pages in the unscanned area are set to “no access”
When the mutator tries to access an unscanned object, it gets a page-access trap
The collector fields the trap
– Scans the objects in the page, copying from-space ones & forwarding pointers as necessary
– Unprotects the page & resumes the mutator
The collector also runs concurrently, scanning pages & unprotecting them as it goes (to reduce the mutator’s page-access traps)
Algorithm requires
– TRAP: to detect fetches from the unscanned area
– PROTN: to mark the entire to-space inaccessible during the flip
– UNPROT: needed as each page is scanned
– MAP2: to allow the collector to scan a page while the mutator can’t access it
– You may want to reduce the page size

7 Shared VM
Another use: implement shared virtual memory (SVM) on a network of computers
Basic idea: use the paging mechanism to control & maintain single-writer/multiple-reader coherence at page level
All nodes see a coherent shared memory address space, as big as allowed by the MMU
– Read-only pages can be replicated on 1+ nodes
– A writable page resides on only one node
Each memory mapping manager sees its local memory as a big cache of the SVM address space
A memory reference may cause a page fault when the page is not in the node’s physical memory
Algorithm requires
– TRAP, PROT1, UNPROT, MAP2 and maybe PAGESIZE
[Figure: shared virtual memory spanning three nodes, each with a CPU, Mem and a Mapping Manager]
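
Single-writer/multiple-reader coherence for one page can be sketched as a small state machine. The Python below is a simulation under simplifying assumptions: `SharedPage`, its fields and the node names are hypothetical, no data is actually transferred, and the two fault methods stand in for what a mapping manager would do on a read or write fault.

```python
class SharedPage:
    """Toy coherence state for one shared page (simulation only)."""

    def __init__(self, owner):
        self.readers = {owner}    # nodes holding a readable copy
        self.writer = owner       # node with write access, or None

    def read_fault(self, node):
        # Downgrade any writer to read-only, then replicate to the reader.
        self.writer = None
        self.readers.add(node)

    def write_fault(self, node):
        # Invalidate every other copy; node becomes the single writer.
        invalidated = self.readers - {node}
        self.readers = {node}
        self.writer = node
        return invalidated


page = SharedPage("A")
page.read_fault("B")                  # A is downgraded, B gets a copy
page.read_fault("C")
invalidated = page.write_fault("B")   # A and C lose their copies
```

The invariant kept here is that whenever `writer` is set, `readers` contains only that node, which is the page-level coherence rule stated on the slide.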

8 Concurrent checkpointing
Idea: use VM mechanisms to make checkpointing concurrent & real-time
Instead of saving the writable main memory to disk all at once
– Set the entire address space (AS) to read-only
– Restart the program’s threads
– A copying thread sequentially copies pages to a separate virtual AS as it goes
– When done copying a page, set its access back to read/write
Read references by the program: no problem
Write attempt to a not-yet-copied page
– Page fault; the copying thread immediately copies the page, sets its access to read/write, & restarts the faulting thread
Algorithm requires
– TRAP, PROT1, PROTN, UNPROT, and DIRTY; a medium PAGESIZE may be appropriate
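
The interplay between the copying thread and faulting writes can be sketched as below. This is a single-threaded Python simulation with hypothetical names (`Checkpointer`, `copy_step`, `write`); pages are modelled as immutable values and a "fault" is an ordinary check inside `write`.

```python
class Checkpointer:
    def __init__(self, memory):
        self.memory = memory                      # list of page contents
        self.snapshot = [None] * len(memory)      # the checkpoint being built
        self.read_only = [True] * len(memory)     # PROTN: protect every page
        self.next_page = 0

    def _copy(self, page):
        if self.read_only[page]:
            self.snapshot[page] = self.memory[page]   # save pre-write contents
            self.read_only[page] = False              # UNPROT this page

    def copy_step(self):
        """Copying thread: handle the next page in sequence, if any."""
        if self.next_page < len(self.memory):
            self._copy(self.next_page)
            self.next_page += 1

    def write(self, page, value):
        """Program write; a still-protected page is copied first (TRAP path)."""
        self._copy(page)
        self.memory[page] = value

    def done(self):
        return self.next_page == len(self.memory)


mem = ["a", "b", "c"]
cp = Checkpointer(mem)
cp.copy_step()          # page 0 copied in the background
cp.write(2, "C")        # faults: page 2 is copied before the write lands
cp.write(0, "A")        # page 0 already copied, no fault
while not cp.done():
    cp.copy_step()
```

The snapshot ends up holding the memory image as of the moment protection was applied, even though the program kept writing while the copy was in progress.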

9 Generational GC
An efficient GC algorithm that depends on properties of dynamically allocated records in LISP & similar languages
– Younger records are much more likely to die soon than older ones
– Younger records tend to point to older records
Allocated records are kept in distinct areas G_i of memory (generations)
– Records in G_i are older than records in G_{i+1}
Idea: use VM to detect pointers from older generations to newer ones
– DIRTY if available (the GC examines dirty pages)
– Otherwise, write-protect the older generations
– The trap handler then saves the address of the trapping page on a list for the GC
– At GC time, the GC scans the list for possible pointers into the youngest generation
Algorithm requires
– DIRTY or TRAP, PROTN, and UNPROT
– A smaller PAGESIZE may be good, as the time for GC depends on the page size
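
The write-protection variant can be sketched as follows. This is a Python simulation with hypothetical names (`OldGeneration`, `store`, `slots_to_scan`) and a deliberately tiny, made-up page size; the protection check inside `store` stands in for the hardware trap.

```python
PAGE_SIZE = 4  # slots per page (hypothetical, tiny for illustration)

class OldGeneration:
    def __init__(self, num_slots):
        self.slots = [None] * num_slots
        num_pages = (num_slots + PAGE_SIZE - 1) // PAGE_SIZE
        self.protected = [True] * num_pages   # PROTN: write-protect old generation
        self.trapped_pages = []               # list handed to the collector

    def store(self, index, value):
        page = index // PAGE_SIZE
        if self.protected[page]:              # TRAP on the first write to a page
            self.trapped_pages.append(page)
            self.protected[page] = False      # UNPROT so the store can proceed
        self.slots[index] = value

    def slots_to_scan(self):
        """Slots the collector must examine for old-to-young pointers."""
        for page in self.trapped_pages:
            start = page * PAGE_SIZE
            yield from range(start, min(start + PAGE_SIZE, len(self.slots)))


old = OldGeneration(10)
old.store(5, "young-1")
old.store(6, "young-2")   # same page as slot 5: no second trap
old.store(9, "young-3")
to_scan = list(old.slots_to_scan())
```

Three stores cause six slots to be scanned, which illustrates why the slide suggests that a smaller page size may help: the collector's work grows with the size of each trapped page.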

10 Persistent stores
Persistent store: a dynamic allocation heap that persists between program invocations
Program execution may traverse, modify, commit and/or abort modifications
Traversals should be as fast as for in-core data
Done through VM: the persistent store is approximately a memory-mapped disk file
However, the permanent image of a modified object should not be modified until the commit
GC can be used to improve efficiency, recovering pages, collocating related objects, etc.
Algorithm requires
– TRAP and UNPROT, and file mapping with copy-on-write
– If copy-on-write is not available, simulate it w/ PROTN, UNPROT and MAP2

11 Extending addressability
A persistent store might grow beyond 2^32 objects; a problem for a 32-bit machine
However, in any one run a program would probably access fewer than 2^32 objects
Idea
– Use the disk as a second stage; disk pages use 64-bit addresses
– When a disk page is brought into memory, translate its addresses from 64-bit to 32-bit w/ a translation table
– One translation table per session
Algorithm requires
– TRAP, UNPROT, PROT1 or PROTN
– In a multithreaded environment, MAP2
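
The per-session translation table can be sketched as below. This is a simplified Python illustration: `TranslationTable` and its methods are hypothetical names, short addresses are handed out one per distinct 64-bit address in order of first use, and the term "swizzle" for the translation step is borrowed from common usage rather than from the slide.

```python
class TranslationTable:
    """Per-session map between 64-bit disk addresses and 32-bit in-memory ones."""

    def __init__(self):
        self.to_short = {}     # 64-bit address -> 32-bit address
        self.to_long = []      # 32-bit address (list index) -> 64-bit address

    def shorten(self, long_addr):
        if long_addr not in self.to_short:
            if len(self.to_long) >= 2**32:
                raise OverflowError("session touched more than 2**32 addresses")
            self.to_short[long_addr] = len(self.to_long)
            self.to_long.append(long_addr)
        return self.to_short[long_addr]

    def swizzle_page(self, disk_page):
        """Translate every 64-bit pointer on a page brought in from disk."""
        return [self.shorten(p) for p in disk_page]

    def unswizzle_page(self, mem_page):
        """Translate back to 64-bit pointers before writing the page out."""
        return [self.to_long[p] for p in mem_page]


table = TranslationTable()
disk_page = [2**40, 2**40 + 8, 2**40]
mem_page = table.swizzle_page(disk_page)
```

Only addresses actually encountered during the session consume part of the 32-bit space, which is why a store larger than 2^32 objects can still be traversed.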

12 Data-compression paging
In a typical linked data structure, many words point to nearby objects and others are nil … basically, the average word has low entropy
A GC can reduce it further by placing objects that point to each other close together
Idea: compress a page instead of paging it out (decompressing may be cheaper than paging it in)
Algorithm requires
– TRAP, PROT1 (or PROTN), UNPROT
– OS support (?) to determine when pages are not recently used
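
The low-entropy claim is easy to try out. The Python sketch below builds one synthetic page of 32-bit words (pointers to nearby addresses plus nils) and compresses it with `zlib` from the standard library; the page size, the base address and the choice of zlib are assumptions made for illustration, not something the slides specify.

```python
import struct
import zlib

PAGE_BYTES = 4096  # assumed page size

def make_page(base, count):
    """Pack `count` 32-bit words: every 4th word is nil, the rest point nearby."""
    words = [0 if i % 4 == 0 else base + 8 * i for i in range(count)]
    return struct.pack(f"<{count}I", *words)

page = make_page(0x10000000, PAGE_BYTES // 4)
compressed = zlib.compress(page)      # "page out" by compressing in memory
restored = zlib.decompress(compressed)  # "page in" on the next fault

print(len(page), "bytes ->", len(compressed), "bytes")
```

A real implementation would keep `compressed` in memory, protect the original page, and decompress in the fault handler; whether that beats a disk read depends on the machine.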

13 Heap overflow detection
A process’s or thread’s stack requires protection against overflow
A well-known technique: mark the pages above the top of the stack as invalid → a memory access there will cause a page fault
In most implementations of Unix, stack pages are not allocated until used
Algorithm requires
– TRAP, PROTN, UNPROT
A similar technique can be used to detect heap overflow in a garbage-collected system; here the size of the allocated region is commonly small → performance is an issue
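
The guard-page idea for a garbage-collected heap can be mimicked in Python, with the caveat that this is only an analogy: `GuardedHeap` is a hypothetical class, the `IndexError` raised by writing past the end of the list stands in for the page-fault trap, and the "collection" simply discards everything. The point of the sketch is that `allocate` contains no explicit limit comparison.

```python
class GuardedHeap:
    def __init__(self, size):
        self.region = [None] * size   # accessible allocation region
        self.next = 0                 # bump pointer
        self.collections = 0

    def _collect(self):
        # Placeholder for a real collection: discard everything.
        self.region = [None] * len(self.region)
        self.next = 0
        self.collections += 1

    def allocate(self, record):
        try:
            self.region[self.next] = record   # no explicit limit check here
        except IndexError:                    # stands in for the guard-page TRAP
            self._collect()
            self.region[self.next] = record   # retry after the collection
        self.next += 1
        return self.next - 1


heap = GuardedHeap(2)
addresses = [heap.allocate(x) for x in "abc"]
```

With a small allocation region the overflow path runs often, which is why the slide flags trap performance as an issue for this use.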

14 Usage of VM system services
Methods                    TRAP  PROT1  PROTN  UNPROT  MAP2  DIRTY  PAGESIZE
Concurrent GC              √     -      √      √       √     -      √
SVM                        √     √      -      √       √     -      √
Concurrent checkpoint      √     -      √      √       -     ‡      √
Generational GC            √     -      √      √       -     ‡      √
Persistent store           √     -      √      √       √     -      -
Extending addressability   √     *      *      √       √     -      √
Data-compression paging    √     *      *      √       -     -      √
Heap overflow              √     -      †      √       -     -      -
* Extending addressability and data-compression paging use PROT1 only to remove inactive pages; the batching technique described in Sec. 5 could be used instead
† VM-based heap-overflow detection can be used even w/o explicit memory-protection primitives, as long as there’s a usable boundary between accessible and inaccessible memory
‡ Dirty-page bookkeeping can be simulated by using PROTN, TRAP and UNPROT

15 VM primitive performance
Two classes of algorithms
– Protect pages in large batches; upon each page-fault trap, unprotect 1 page
– Protect a page and unprotect it
Since PROTN or PROT1, TRAP and UNPROT are always used together, they were measured together (if one is slow, everything is)
Two microbenchmarks
– Sum of PROT1, TRAP, UNPROT, x 100
    Access a random protected page
    In the fault handler, protect 1 other page & unprotect the faulting page
– Sum of PROTN, TRAP, UNPROT, x 100
    Protect 100 pages, access each in a random sequence
    In the fault handler, unprotect the faulting page
Other measurements
– Time for a single instruction (ADD): 20-instruction loops w/ 18 ADDs
– Time for a trap handler that does not change memory protection
Three OSs: Ultrix, SunOS, Mach
5 architectures: Sun 3/60, SparcStn 1, DEC 3100, microVax 3 & i386 on iPSC/2

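16 Microbenchmark results (times in microseconds)
Columns: Machine, OS, ADD, TRAP, TRAP+PROT1+UNPROT, TRAP+PROTN+UNPROT, MAP2, PAGESIZE
Sun 3/60 (ADD 0.12, TRAP 760, MAP2 yes, PAGESIZE 8192)
  OS              TRAP+PROT1+UNPROT  TRAP+PROTN+UNPROT
  SunOS 4.0       1238               1016
  SunOS 4.1       2080               1800
  Mach 2.5(xp)    3300               2540
  Mach 2.5(exc)   3380               2880
SparcStn 1 (ADD 0.05, TRAP 230, MAP2 yes, PAGESIZE 4096)
  OS              TRAP+PROT1+UNPROT  TRAP+PROTN+UNPROT
  SunOS 4.0.3c    919                839
  SunOS 4.1       2008               909
  Mach 2.5(xp)    1550               1230
  Mach 2.5(exc)   1770               1470
DEC 3100 (ADD 0.062, TRAP 210, MAP2 no, PAGESIZE 4096)
  OS              TRAP+PROT1+UNPROT  TRAP+PROTN+UNPROT
  Ultrix 4.1      393                344
  Mach 2.5(xp)    937                766
  Mach 2.5(exc)   1203               1063
µVax 3, Ultrix 2.3 (ADD 0.21, MAP2 no, PAGESIZE 1024): timings 314, 486
i386 on iPSC/2, NX/2 (ADD 0.15, MAP2 yes, PAGESIZE 4096): timings 172, 252
Slide annotations: best case, worst case; results are quite different between architectures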
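
One way to compare rows across machines is to express the fault-handling sum in units of the machine's own ADD time. The Python below is a small worked calculation using three rows from the results above (ADD time and the TRAP+PROT1+UNPROT sum); the function name `cost_in_adds` is made up for this sketch.

```python
# (ADD time, TRAP+PROT1+UNPROT time) per row, both in the table's time unit
rows = {
    "Sun 3/60, SunOS 4.0": (0.12, 1238),
    "SparcStn 1, SunOS 4.0.3c": (0.05, 919),
    "DEC 3100, Ultrix 4.1": (0.062, 393),
}

def cost_in_adds(add_time, fault_time):
    """How many ADD instructions fit in one protect/trap/unprotect cycle."""
    return fault_time / add_time

costs = {name: cost_in_adds(*times) for name, times in rows.items()}
for name, cost in costs.items():
    print(f"{name}: about {cost:,.0f} ADDs per protect/trap/unprotect cycle")
```

Normalizing this way separates raw processor speed from OS overhead: a faster machine does not necessarily handle a fault in fewer instruction times.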

17 System design issues
Lessons on hardware and OS design
TLB consistency
– Many of the algorithms make memory less accessible in large batches & more accessible 1 page at a time
– A good thing, especially on a multiprocessor
    When a page is made less accessible, outdated info in TLBs is dangerous: flush the TLBs (shootdown: interrupt the other processors & request flushing)
    SW shootdown is expensive, but it can be batched
Optimal page size
– Page size is traditionally big given disk overhead, …
– For user-handled faults processed entirely in the CPU, smaller is better
– The effect of varying the page size can be achieved on HW w/ a small page size: use the small size for PROT & UNPROT, and multi-page blocks for disk transfers

18 System design issues
Access to protected pages
– Many algorithms need a way for a user-mode service routine to access a page while client threads have no access
– Several ways to do this (illustrated w/ concurrent GC)
    Multiple mappings of the same page at different addresses
    System call to copy memory to/from a protected area (very expensive: memory copies)
    Shared pages between processes, w/ the collector running as a different process (expensive: context switches)
    Collector running inside the kernel: not the best place for a GC
– Best: multiple mappings, at some extra cost, w/ two entries in the page table for each physical page; potential for cache inconsistency (some mappings may be stale)
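
Multiple mappings of the same page with different protections can be tried from Python's standard `mmap` module. The sketch below maps one page of a temporary file twice, once writable and once read-only; the names `collector_view` and `mutator_view` are illustrative, and note that Python rejects the write to the read-only mapping with a `TypeError` raised by the `mmap` module itself rather than delivering a page-fault trap.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE

with tempfile.TemporaryFile() as f:
    f.write(b"\0" * PAGE)
    f.flush()
    # Two mappings of the same file page, with different access rights.
    collector_view = mmap.mmap(f.fileno(), PAGE, access=mmap.ACCESS_WRITE)
    mutator_view = mmap.mmap(f.fileno(), PAGE, access=mmap.ACCESS_READ)

    collector_view[0:5] = b"hello"        # write through the writable mapping
    seen_by_mutator = mutator_view[0:5]   # visible through the other mapping

    try:
        mutator_view[0:5] = b"oops!"      # rejected: this mapping is read-only
        mutator_blocked = False
    except TypeError:
        mutator_blocked = True

    collector_view.close()
    mutator_view.close()
```

In a collector built this way, the service routine would work through the writable view while client threads only ever hold the restricted one.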

19 System design issues
Is this too much to ask?
– Synchronous memory-management algorithms may be problematic on highly pipelined machines (unless you have some hardware support)
    Instructions halfway done, with results already written into registers
    Possible addressing-mode side effects
– However, all but the heap-overflow detection algorithm are sufficiently asynchronous: from the machine’s point of view they behave like a traditional disk pager
Other primitives
– For a persistent store w/ transactions: pin a page
– External-pager interface: the OS can tell the client which pages are least recently used and about to be paged out

20 Conclusions
VM is not just a tool for implementing large address spaces & protecting one user process from another
Several algorithms rely on VM primitives … but these primitives haven’t been paid enough attention
Common traits of the surveyed algorithms
– Memory is made less accessible in large batches and more accessible one page at a time
– Fault handling is done almost entirely by the CPU & takes time proportional to the page size
– A page fault results in the faulting page being made more accessible
– The frequency of faults is inversely related to the locality of reference of the client program, so the algorithms should scale well
– User-mode service routines need to access pages that are protected from user-mode client routines
– But they don’t need to examine the client’s CPU state
