Process Virtual Machines

Process Virtual Machines
This material is based on the book, Virtual Machines: Versatile Platforms for Systems and Processes, Copyright 2005 by Elsevier Inc. All rights reserved. It has been used and/or modified with permission from Elsevier Inc.

Process Virtual Machines
Perform guest/host mapping at ABI level Encapsulate guest process in process-level runtime Issues Compatibility Memory Architecture Exception Architecture OS Call Emulation Overall VM Architecture High Performance Implementations System Environments HOST OS Disk file sharing guest process create host runtime network communication Process VMs

Compatibility A key consideration in VMs Tricky to define Try this…
Two systems are compatible if, in response to the same sequence of input values, they give the same sequence of output values. In VMs, we are usually concerned with functional compatibility rather than performance compatibility. One system may be faster or slower than the other and still be compatible Further refinements based on the set of inputs Process VMs

Intrinsic Compatiblity
Two systems are intrinsically compatible if they are compatible for all possible input sequences Any software written by the most devious programmer will work in a compatible way -- do a check sum of all accessible memory -- jump to random addresses Example: Intel strives for intrinsic compatibility when it produces a new x86 microprocessor Process VMs

Extrinsic Compatibility
Many useful VM applications do not achieve intrinsic compatibility Limited application set: run Microsoft productivity tools Compiled for a certain API Certified by the programmer to be free of certain types of bugs Two systems are extrinsically compatible if they are compatible for a well-defined subset of input sequences Based on VM implementation, architecture/OS specifications, and certain external guarantees/certifications Some burden on the user to make sure that the guarantees are met Example: FX!32 is certified for certain Windows apps. Process VMs

Verifying Compatibility
Too complex to theoretically prove (except in simple systems) In practice Use informal reasoning Use test suites Sufficient conditions Decompose compatibility into parts Allows the reasoning process to be simplified Assume state of guest is mapped to host in 1 to 1 form. But same “type” of state is not necessary E.g., register state can be mapped to memory state Process VMs

Process VM Compatibility Framework
Form correspondence between VM and Native System at points where control is transferred to OS At these points: equivalent vales for both register and memory state Host Operating System OS Call Emulator Exception Emulation Engine Application Memory Image traps OS calls Native Operating System Process VMs

Trap Compatibility (Except Page Faults)
If source traps, then target traps If target traps, then source would have trapped Runtime can filter target traps, to eliminate false ones from being seen Page faults are special case – page fault behavior is non-deterministic with respect to user process. Process VMs

Register State Compatibility
At the time of an exception is the register state exactly as in the real machine? Including dead register values? Process VMs

Memory State Compatibility
Memory state compatibility is maintained if, at the time of a trap or interrupt, the contents of memory are exactly the same in the translated target program as in the original source program. Process VMs

Memory Ordering Compatibility
Maintain equivalent consistency model Important for multiprocessors Process VMs

Undefined Architecture Cases
Some (most?) ISAs have undefined cases Example: self-modifying code with I-caches Unless special actions are performed, result may be undefined Different, undefined behavior is compatible behavior Can be tricky – what if undefined behavior is different from all existing implementations? What if existing implementations do the “logical” thing? e.g., self-modifying code works as “expected” Process VMs

Process VM Architecture
Initialization Code Cache Manager OS Call Emulator Exception Emulation Application Memory Image Host Operating System Initialize signals Side Tables Emulation Engine Interpreter Translator Profile Data Process VMs

Runtime Components Initialization Cache Manager OS Call Emulator
Allocate Memory Initialize runtime data structures Initialize all signals Cache Manager Implement replacement algorithm when cache fills Flush when required (e.g. self-modifying code) OS Call Emulator Translate OS Calls Translate OS Responses Exception Emulator Handle signals If registered by source code pass to emulated source handler If not registered emulate host response Form precise state Process VMs

State Mapping For best performance
Host For best performance Guest register space fits “inside” host register space Guest memory space fits “inside” host memory space Best case does not always happen Host Registers Register Space Guest Registers VM Data VM Code Host ABI Guest Data Address Space Guest Code Process VMs

Memory Architecture Issues
7FFF FFFF 7FFE FFFF Reserved by System Committed Free Reserved Overall address architecture – paged, segmented, etc. We will focus on paged architectures Privilege types R, W, E Protection/allocation granularity – page sizes Most ABIs have a very specific architected memory Example: Win32 Process VMs

Memory Space Mapping Runtime + guest space ≤ host space
May be able to use underlying hardware to efficiently map addresses Place guest address 0 at host address 0 simplest case Guest space > host space Use software translation Software page table Several instructions overhead per load/store Guest space == Host space Important case Occurs often Especially in same-ISA dynamic optimization No room left over for runtime Use software translation? Settle for extrinsic compatibility Process VMs

VM Software Mapping VM Software maintains mapping table
Guest Application Address Space Host VM Software VM Software maintains mapping table Similar to hardware page tables/TLBs Slow, but can always be made to work Process VMs

VM Software Mapping Example
Initially, R1 holds source address R30 holds base address of mapping table srwi r29,r1,16 ;shift r1 right by 16 slwi r29,r29,2 ;convert to a byte address lwzx r29,r29,r30 ;load block location in host memory slwi r28,r1,16 ;shift left/right to zero out srwi r28,r28,16 ; source block number slwi r29,r29,16 ;shift up target block number or r29,r28,r29 ;form address lwz r2,0(r29) ;do load Process VMs

VM Software Mapping: General Case
Guest address space > Host address space + Runtime runtime swap disk file file runtime Host OS managed managed Guest Host Host target Address translation page Real table Virtual Space table Memory Address Space Process VMs

Direct (Hardware) Mapping
VM software mapping is slow Several instructions per load/store Use underlying hardware If guest address space + runtime fit within host space Runtime Software Guest Application Guest Guest Address Guest Application Application Space Application Address Address Address Space Space Space Runtime Software +base addr Process VMs

Guest Memory Protection
VM software mapping can be easily used Place (and check) protection info in mapping table Use underlying hardware Runtime must be able to set privileges Protection faults should be reported to Runtime (So it can respond as guest OS would – later) Requires some support from Host OS Process VMs

Host OS Support A system call where runtime can set protection levels
A signal mechanism where protection faults trap to handler in runtime SimOS Example Map guest space to a file (map, unmap, read-only mapping) Signal (SIGSEGV) delivered to VM software on fault Physical Memory File (VM Memory) references succeed references cause page faults Virtual Machine's Free Pages references succeed Virtual Address Space Read-Only Mappings writes cause protection faults Process VMs

Page Granularity Issue
Problems arise if guest page size < host page size What if there is an odd number of guest pages in a protection region? Host Page Guest Page (Code) (Data) Process VMs

Self-Modifying Code Write-Protect original code to catch code self-modification Exception handler invalidates old translations Be sure to make forward progress Pseudo self-modifying code may require optimizations Data mixed in with code Implement software-supported fine-grain checking (Transmeta) Sometimes can rely on source binary to indicate self modification e.g. SPARC flush translator original code translated code data write protected Figs4/selfmod Process VMs

Pseudo Self-Modifying Code
Guest Memory Runtime Memory Side Table For frequent cases Keep source code in side table and check prior to execution Copy when translation is performed Source Code Source Code Prolog compare upon entry no — jump to runtime Match? Figs4/selfmod yes Translated Target Code Process VMs

True Self-Modifying Code
Use idiom recognition Translator recognizes common constructs and generates special translation Example: modify instruction immediate value Figs4/selfmod Cache common translations If code is modified in a small number of ways Save the translations and to a lookup based on version of source code. Process VMs

Self-referencing code
Original copy is maintained by translator All reads are with respect to original copy  correct data is returned translator original code translated code data self reference Process VMs

Exceptions: Interrupts
Application may register some interrupts Precise state easier than traps (because there is more flexibility wrt location) Problem: Translated blocks may executed for an unbounded time period Solution: Interrupt signal goes to runtime Runtime unchains translation block currently executing (eliminates loops) Runtime returns control to current translation Translation soon reaches end (and precise state is available) If interrupts are common, runtime may inhibit all chaining Process VMs

Exceptions: Traps Can be detected directly via interpretation
or explicit translated code checks Can be detected indirectly via target ISA trap and signal Runtime registers all trap conditions as signals Semantic “matching” If trap is architecturally similar in target and source then trap/signal may be used Otherwise interpretive method must be used Generally, more difficult than interrupts wrt precise state Process VMs

Precise State: Program Counter
Interpretation: Easy – source PC is maintained Binary translation: more difficult – source PC only available at translation block boundaries Trap PC is in terms of target code Target PC must be mapped back to correct source PC Simple solution Use side table and reverse translate Can be combined with PC mapping table Requires search of table to find trapping block Requires full map of target to source PCs Simple, but very expensive – probably larger than all of translated code Process VMs

Reverse Translation Side Table
source code code cache signal returns target PC block A 2 trap occurs 1 find corresponding source PC 4 block B side table 3 target PCs source PCs Start PC End PC Src PC1 Src PC2 Src PCn search side table Start PC End PC Src PC1 Src PC2 Src PCm . . block N . Start PC End PC Src PC1 Src PC2 Src PCz Process VMs

Reducing Side Table Overhead
Keep only enough info to allow reconstruction of translation block E.g. start PCs of basic blocks or Initial PC and branch outcomes Reconstruct block after trap occurs to determine trap PC Reducing search overhead Linearly order target start PCs in table Can be maintained if code cache replacement is FIFO (later) Process VMs

Reducing Side Table Overhead
Process VMs

Recovering Register State
Simple if target code updates register state in same order as source code Register state mapping can be used to generate source register values More difficult if optimizations reorder code To be covered in detail later with binary optimization Process VMs

Recovering Memory State
Simple if target code updates memory state in same order as source code Restricts optimizations (more difficult to back-up than register state) To be discussed later with binary optimization Process VMs

OS Call Emulation “Wrapper” or “Jacket” code converts source call to target OS call(s) Process VMs

OS Call Emulation Same source and target OSes (different ISAs)
Syntactic translation only E.g. pass arguments in stack rather than registers Different source and target OSes Semantic translation/matching required May be difficult (or impossible) OS deals with real world What if source OS supports a type of device that the target does not? Process VMs

Protecting Runtime Software
Change protections on “context switch” from runtime to translated code Translated code can only access guest memory image Translated code cannot jump outside code cache (emulation s/w sets up links) Multiple system calls at “context switch” time High overhead Runtime mode Emulation mode N N Runtime Runtime R/W N Data Data Runtime Runtime Ex N Code Code Code Cache R/W Ex Code Cache N N R/W Guest Data R/W Guest Data Guest Code R/W R Guest Code Process VMs

Register Mapping #guest regs < #host regs
Good case # guest regs > # host regs Bad case How do you spill? VM data area is inaccessible Interpretation may be only good option #guest regs == #host regs Important special case Is it possible to save regs w/o using a register for addressing save area? Reserving one register for pointing to save area can be done (Dynamo) Emulation mode N Runtime N Data Runtime N Code Ex Code Cache N R/W Guest Data R Guest Code Process VMs

High Performance Emulation
Important tradeoff Startup time -- Cost of converting code for emulation Steady state -- Cost of emulating Interpretation: Low startup, high steady state cost Binary translation High startup, low steady state cost 2500 2000 1500 Total Emulation Time 1000 interpretation 500 binary translation 10 20 30 40 50 60 70 80 90 100 N - Number of Times Emulated Process VMs

Staged Emulation Adjust optimization level to execution frequency
Tradeoff Total runtime = program runtime + translation overhead Higher optimization  shorter program runtime Lower optimization  lower overhead Interpreter Figs4/phaseop Emulation Profile Data Binary Memory Image Code Cache Manager Translator/ Optimizer Process VMs

Staged Emulation General Strategy
Begin interpreting For code executed above a threshold Use simple translation/optimization For translated code executed above a threshold Optimize more etc. Specific Strategies may skip some of the steps Shade uses 1 and 2 Wabi uses 2 and 3 FX!32 uses 1 and 3 IA32-EL, UQDBT use 2 and 3 Process VMs

Emulation Engine Execution Flow
return from exception emulator return from OS emulator increment profile data for target startup block no miss check profile interpret until lookup target database branch or jump in map table is block hot? trap condition OS call hit yes non- jump to block in linked Translate Block code cache block call exception trap (via signal) emulator no; call cache manager Space available in Code Cache cache call OS emulator yes OS call Set up links with Create entry in map other blocks; Insert table block in code cache Process VMs

Code Cache Management Code Cache is different from typical hardware cache Variable sized blocks Dependences among blocks due to linking No “backing store”; re-generating is expensive These factors affect replacement algorithm Process VMs

LRU Replacement Good in theory; problematic in practice
Fragmentation problems Variable block sizes Need to unlink when replacing (could be done with backpointers in code cache data structure) Process VMs

Code Cache Back Pointers
block A block B hash table block C Source PC Target PC Back Ptrs Source PC Target PC Back Ptr Source PC Target PC Back Ptr . . block N . Source PC Target PC Back Ptr Source PC Target PC Back Ptr Process VMs

Flush When Full Simple, basic algorithm Shade does this
Removes garbage in Shade translation cache Shade hash table does not chain Gets rid of “stale” links if control flow changes High overhead for re-translating after flush Process VMs

Pre-emptive Flush o i l s a r t w e n time
Flush when program phase change is detected Many new translations will be needed, anyway Detect when there is a burst of new translations Dynamo does this detect working set change and flush n e w t r a s l i o time Process VMs

Fine-Grain FIFO Takes some advantage of temporal locality
Does not have fragmentation problems of LRU Still requires backpointers to unlink replaced blocks Process VMs

Coarse-Grain FIFO Replace many blocks at once Large fixed-size blocks?
. FIFO block A block B block D Code Cache Backpointer Tables Replace many blocks at once Large fixed-size blocks? Only backpointers among replacement blocks need to be maintained OR linking between large blocks can be prohibited. Process VMs

System Environment High level of interoperability
Seamless access to both guest and host processes Works best with same OS HOST OS Disk file sharing guest process create host runtime Process VMs

Encapsulation Guest code is “encapsulated” Creation DLLs
Host Process Guest Process create Host DLL Guest Guest code is “encapsulated” At creation by loader DLLs at load time Creation Host can create guest Guest can create host DLLs Guest can use guest or host Host uses only host Process VMs

Loaders Requires two loaders Approaches One for host processes
One for guest processes Approaches Modify kernel loader Identifies type of binary, calls correct loader Requires modification of kernel loader Add code to guest binary when installed Invokes guest loader Requires local installation of guest binary Modify host process create_process API Invokes guest loader for guest binaries Modifies create_process in host binaries Used in FX!32 Process VMs

Persistence How long do translations last? One ABI instantiation
Re-translate each time an ABI is initiated Multiple ABI instantiations Save translation/profile data on disk Is it faster to optimize or read from disk? A lot of instructions can execute in a few milliseconds Process VMs

Case Study: FX!32 x86/Windows ABIs on Alpha/Windows ABIs intermixed
Virtually indistinguishable from user perspective Very important in PCs Runtime software Follows typical model But, translations/optimizations are done between executions First execution of binary: interpret and profile Translate and optimize “off line” Later execution(s): use translated version, continue profiling Persistence Translations and profile data are saved on disk between runs Very time consuming optimization with x86 source Hybrid static/dynamic binary translation Process VMs

FX!32 Overview Process VMs Binary Memory Image Translation Cache
Profile Data Interpreter/ Emulator Translator Transparency Agent Database <Registry> Server Figs4/fx32tran Process VMs

Transparency Agent How the runtime gets control initially
Shunts x86 binaries into the FX!32 system “Hooked” onto Windows CreateProcess API routine Examines every image about to be executed If x86 image, invokes FX!32 runtime A process containing the transparency agent is enabled Each attempt to launch an Alpha process results in that process being enabled Windows shell, service control manager, remote procedure call server Then, essentially spreads like a virus Process VMs

Runtime Loads x86 image into memory Sets up runtime environment
Replaces Windows NT loader Sets up runtime environment Calls emulator Registers image with NT NT image list is in user space But depends on undocumented features Registers image with FX!32 database Database associates identifier with translations and profiles Identifier is a hash of the image itself Given an image, runtime hashes it, and looks it up If found, it can read in the translation and profile data Process VMs

Windows Interface: Jackets
Enable x86 to call native Win32 API Allow x86 binaries to run native, optimized alpha code Many apps spend a lot of time in native DLLs Jacket Illegal x86 instruction Followed by an Alpha jacket routine Moves arguments from x86 stack to Alpha registers according to Alpha Windows calling standard Followed by Alpha Win32 API call There are jackets for the complete Win32 API Put in figure for jacket. x Process VMs

Interpreter (“Emulator”)
Used for initial execution of image Also later executions for undiscovered code Register mapping: X86 registers permanently mapped to Alpha registers Many more alpha regs than x86 regs X86 condition codes also held in alpha registers Alpha register points to x86 CONTEXT Assembly language coded Software pipelined inner loop Fits in L1 instruction cache Look up max length x86 instruction Process VMs

Profile Generation Part of interpretation process Collects: Hash table
Call target addresses <Source address, Target address> pairs for indirect control transfers Addresses of instructions that perform unaligned memory accesses Hash table Used for collecting profile data When execution ends, runtime processes hash table to produce profile file Hash table also used to detect already translated code <x86 address, translation address pairs> are in table Hash table loaded when translated image is loaded Perhaps add picture of hash table; accessed with hashed x86 instruction address Process VMs

Translator Invoked by the server “offline” as a background task
Uses profile information to translate x86 images to alpha code Growth in amount of profile data causes re-translation As the emulator finds previously unreached portions of x86 code Contains 8 main components Process VMs

Translator: Regionizer
Divides x86 image into routines Constructed by following control flow Each call target is a routine entry point Indirect jumps use profile targets from hash table A routine is a collection of regions Contiguous instructions reachable from routine entry A routine is the smallest collection of reachable regions Many routines have a single region call Routine Region (contiguous code block) Figs4/routine Region Region (contiguous (contiguous code block) code block) Process VMs

Translator: Build Applied to each routine Re-parses x86 instructions
Forms internal representation (IR) Similar to IRs used by compilers Process VMs

Translator: Register Mangler
IR uses single assignment form for registers(?) Ignores x86 register overlap X86 allow partial register reads/writes Register mangler puts in insert/extract operations to deal with partial register accesses x86 code x86 register alpha code alpha register 32-bit result 32-bit result Figs4/extins Illustrate x86 register overlap – requires access to x86 manual? Probably redraw – or give actual code example 16-bit result 16-bit result extract/insert Process VMs

Translator: Condition Code Mangler
Condition code results are implicit in IR (as in x86) Mangler puts in instructions to explicitly generate CC values Keeps track of live CCs Only generates code for those that are eventually used Process VMs

Translator: Improve Basic block level optimizations
Combine stack operations Replace individual pointer decrements/increments for each individual stack operation Combine all decrements/increments for a basic block Perform single stack pointer subtract at basic block beginning And single stack pointer add at end Eliminate stack loads and stores Process VMs

Translator: Code Selector, Scheduler, Assembler
Transforms IR to alpha code Each x86 instruction is mapped to one or more alpha insts Instruction patterns – idioms – are not considered Instructions are scheduled for Alpha pipeline Final translated image is built Process VMs

ISA Issues: x86 CC handling
X86 uses Condition Codes; Alpha does not Use alpha registers to hold x86 CCs Condition Codes are implicit in x86 Frequently set Infrequently used Interpreter uses lazy evaluation Save enough state to allow CC evaluation if needed later Process VMs

ISA Issues: Floating Point Instructions
x86 uses 80 bit floating point for intermediate values Alpha (and most other machines) use 64 Emulator uses only 64 Good performance Very few applications rely on full 80 bits (BUT, how do you know which?) X86 uses stack; sometimes in strange ways Interpreter optimized for strict stack behavior (and uses bail-out code for exceptions?) Process VMs

ISA Translation Issues: Alignment
X86 allows unaligned memory accesses Alpha accesses must be word aligned Normal load/store with unaligned addresses cause trap to fixup code Faster alternative: use multiple instruction sequences But slower than access to known-aligned address Code selector uses profile information to find load/stores that have accessed unaligned data Inserts proper multi-instruction sequence for these Reduces number of traps to fixup code Perhaps add code example for unaligned accesses like wei – but use proper instructions Process VMs

Call/Return conventions
Translated call: Saves x86 return address on x86 stack Calls translated routine This follows alpha convention Alpha return address in register at end of call BUT, x86 return address may have changed Simple solution: Always look up x86 return address in hash table But this is slow Usually, the x86 return address does not change So optimize for common case Process VMs

Shadow Stack Integrated into native alpha stack, maintained by translated code Calling routine Allocates shadow stack frame Saves x86 return address and x86 stack pointer Called routine Saves native return address in caller’s shadow stack frame On return Called routine checks x86 return address and stack pointer against saved version in shadow frame If no change, use native return address Else return to emulator and use x86 return address Process VMs

Performance (comparing 200 MHz Pentium Pro and 500 MHz 21164)
Goal: same as high-end x86 Goal achieved for many integer benchmark Flt point – 30%? slower than Pentium Pro Achieves 70% of native alpha performance 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 PageMaker Corel Draw Lotus Freelance Lotus Word Pro MS PowerPoint MS Word Paradox Perhaps show graph of performance Process VMs

SUN Wabi: Windows ABI Runs common Windows applications on Unix systems
Either intel or non-intel platforms Translates Windows calls to Unix calls Goals/Objectives Speed over space (within reason) Small footprint x86 little endian regardless of host Emulate 486 ring 3 protected mode No page management No real or v86 mode Use only subset of features that common apps use 64-bit FP Process VMs

Wabi: Design Issues Uses Dynamic Translation vs. FX!32
No persistence between executions No extra disk space (3x in FX!32) Limited time to generate code Translate on first code reference Does not switch between interpretation and translation Testing and bug isolation Must allow single instruction stepping Written in C (gcc) Can be ported to multiple platforms Process VMs

Translator Always translate; “fast and dumb”, per interval
Interval: augmented basic block May include simple loops Multiple terminating conditional branches (superblocks) Interval Statistics Instructions per interval x86: 5 ; SPARC: 40 Instruction length (bytes) x86: 2.8; SPARC: 4 Register mapping All x86 registers mapped to SPARC registers x86 CCs mapped to SPARC CCs CC Optimization Dead CCs detected Lazy evaluation of CCs Process VMs

Translator, contd. Code Cache Chaining Dead x86 high halves detected
1-64Mb, default = physical memory/4 +8MB FIFO management Chaining Most intervals are chained Indirect transfers use hash table Dynamic hash table with linear collision resolution Tables start small and grow based on loading factor Dead x86 high halves detected andn src,<0xFFFF0000>,tmp and dst,<0xFFFF0000>,dst or dst,tmp,dst mov src,dst Process VMs

Translator, contd. X86 peephole optimization, idiom recognition
Inline small functions Reallocate x86 memory data to SPARC registers/memory Use adaptive optimization for high frequency intervals xor ax,ax  clr ax or ax,ax  tst ax jcc lab1 jmp lab2  revjcc lab2 Lab1: Process VMs

Benchmark Performance
Pentium 90 vs 167MHz Ultra-1 Ultra about 2X slower Process VMs

Performance Problems Detecting and reporting x86 exceptions is expensive Limits optimization and instruction scheduling Requires memory to recover state info fpu implementation requires checks in generated code Self-modifying code forces retranslation Detection via page protection and checksum Shortage of host registers for both x86 and emulator state Shortage of host machine cycles (?!) Process VMs

WABI Hardware Wish List
Access control over 4Kb pages Load/Store swapped on big-endian processors Unaligned Load/Stores ALU/Register resources like Intel 16-bit immediates Sub-register moves Integer divide overflow trap 80 bit FP Process VMs

Process Virtual Machines

Similar presentations

Presentation on theme: "Process Virtual Machines"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Process Virtual Machines

Similar presentations

Presentation on theme: "Process Virtual Machines"— Presentation transcript:

Similar presentations

About project

Feedback