
1 VM && QEMU Date:2010/04/09, rednoah

2 Outline  Introduction to Virtual Machine  VM Overview  Interpretation  Binary Translation  Process VM  Introduction to QEMU  QEMU Overview  QEMU JIT  QEMU other topics

3 Reference  James Smith and Ravi Nair, “Virtual Machines: Versatile Platforms for Systems and Processors”  QEMU internals 

4 VM Overview  Why do we prefer something virtual to something real?  why Virtual memory?  why Java virtual machine?  why Virtual I/O?  why Virtual Private Network (VPN)?

5 VM Overview  Why do we prefer something virtual to something real?  why Virtual memory?  sharing, protection, large address space, …  why Java virtual machine?  interoperability, application sharing, protection  why Virtual I/O?  flexibility, low cost sharing, better management  why Virtual Private Network (VPN)?  secure communication over an insecure network

6 VM Overview  Common VMs:  IBM VM/CMS  VMware GSX  Xen  Virtual PC  JVM, MS CLI (Common Language Infrastructure)  Dalvik virtual machine  IA-32 EL, Apple Rosetta, HP PA-Aries  Transmeta Crusoe  QEMU

7 VM Overview  Virtualization  OS: A machine is defined by ISA  Compiler: A machine is defined by ABI (User ISA + OS calls)  Application: A machine is defined by API (User ISA + Library calls)

8 VM Overview  Virtual Machines  Add virtualization software to a host platform and support guest process or system on a VM.

9 VM Overview  Process Virtual Machine  Guest processes may intermingle with host processes  Executes applications with an ISA different from the HW platform  Couples at the ABI level via a runtime system  As a practical matter, guest OS and host OS are often the same. Example: FX!32 (allows x86 Win32 programs to execute on Alpha-based systems running Windows NT). Different OS example: Wine (enables running Win32 programs on Linux)

10 VM Overview  System Virtual Machine  Provide a system environment  Constructed at ISA-level  Example: IBM VM/360, VMware  Purpose: server consolidation, secure partitioning, fault isolation, support for software development and deployment, the cloud computing bandwagon

11 VM Overview  High-Level Language Virtual Machine  Java and MS CLI (Common Language Infrastructure) are current examples.  Binary class files are distributed  “ISA” (IR) is part of binary class format  OS interaction via API (part of VM platform)

12 VM Overview  High-Level Language Virtual Machine  Dalvik bytecode format

13 VM Overview  QEMU user mode  QEMU system mode

14 Interpretation  Emulation ? Simulation ?  Emulation: to be you. (A method for enabling a (sub)system to present the same interface and characteristics as another)  Simulation: to be like you.  Guest and Host  Refer to platforms  Source and Target  Source ISA: Original instruction set or binary  Target ISA: Instruction set being executed by processor performing emulation.  Refer to ISAs

15 Interpretation  Ways of implementing emulation  Interpretation: instruction at-a-time  Binary Translation: block-at-a-time optimized for repeated instruction executions  Why binaries/executables?  Dynamic or Static compilation ?

16 Interpretation  Interpretation State  Hold complete source architecture state in the interpreter’s data memory (see target-arm/cpu.h)

17 Interpretation Decode-Dispatch Interpretation
    while (!halt && !interrupt) {
        inst = code(PC);
        opcode = extract(inst, 31, 6);
        /* each case ends in break so that only one handler runs per instruction */
        switch (opcode) {
        case LdWord: LdWord(inst); break;
        case ALU:    ALU(inst);    break;
        case Branch: Branch(inst); break;
        ...
        }
    }

18 Interpretation Decode-Dispatch Interpretation
    LdWord(inst) {
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        displacement = extract(inst, 15, 16);
        source = regs[RA];
        address = source + displacement;
        regs[RT] = data[address];
        PC = PC + 4;
    }
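The decode-dispatch loop on slides 17 and 18 leaves `extract`, `regs`, and `data` undefined. The following is a minimal self-contained sketch of the same structure, assuming a made-up 32-bit instruction format (6-bit opcode, two 5-bit register fields, 16-bit immediate) and hypothetical opcodes `OP_HALT`, `OP_LDWORD`, and `OP_ADDI`; none of these names come from the deck or from QEMU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { OP_HALT = 0, OP_LDWORD = 1, OP_ADDI = 2 };   /* hypothetical opcodes */
enum { MEM_WORDS = 64 };

typedef struct {
    uint32_t regs[32];
    uint32_t pc;               /* byte address of the next instruction */
    uint32_t data[MEM_WORDS];  /* guest data memory, one word per slot */
    int halt;
} Cpu;

/* Return `len` bits of `inst` whose most significant bit is bit `hi`. */
static uint32_t extract(uint32_t inst, unsigned hi, unsigned len)
{
    return (inst >> (hi - len + 1)) & ((1u << len) - 1u);
}

/* Build an instruction word: opcode | RT | RA | 16-bit immediate. */
static uint32_t encode(uint32_t op, uint32_t rt, uint32_t ra, uint32_t imm)
{
    return (op << 26) | (rt << 21) | (ra << 16) | (imm & 0xffffu);
}

static void ld_word(Cpu *cpu, uint32_t inst)
{
    uint32_t rt = extract(inst, 25, 5);
    uint32_t ra = extract(inst, 20, 5);
    uint32_t address = cpu->regs[ra] + extract(inst, 15, 16);
    assert(address / 4 < MEM_WORDS);          /* stay inside guest memory */
    cpu->regs[rt] = cpu->data[address / 4];
    cpu->pc += 4;
}

static void addi(Cpu *cpu, uint32_t inst)
{
    cpu->regs[extract(inst, 25, 5)] =
        cpu->regs[extract(inst, 20, 5)] + extract(inst, 15, 16);
    cpu->pc += 4;
}

/* Fetch, decode, dispatch: one trip around the loop per source instruction. */
void run(Cpu *cpu, const uint32_t *code, size_t ninsts)
{
    while (!cpu->halt && cpu->pc / 4 < ninsts) {
        uint32_t inst = code[cpu->pc / 4];
        switch (extract(inst, 31, 6)) {
        case OP_LDWORD: ld_word(cpu, inst); break;
        case OP_ADDI:   addi(cpu, inst);    break;
        default:        cpu->halt = 1;      break;  /* OP_HALT or unknown */
        }
    }
}
```

Every guest instruction pays for the fetch, the field extraction, and the switch, which is the overhead that binary translation later removes.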

19 Interpretation Decode-Dispatch Interpretation

20 Interpretation Decode-Dispatch (Efficiency)

21 Interpretation  Compiled Emulation  Replace each source instruction by a sequence of emulation functions in a high-level language.  The in-lined program can then be compiled and optimized by the compiler.  After redundancy-elimination optimization, the generated code may be similar to the result of static translation, e.g. add r1,r2,r3 -> add(r1,r2,r3); … sub r4,r5,r6 -> sub(r4,r5,r6); …  QEMU uses a similar approach.
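As a rough illustration of compiled emulation, the two source instructions from the slide can be written as calls to small inline emulation functions that the host C compiler then optimizes. The register file `r`, its size `NREGS`, and the helper `reg` are assumptions made for this sketch only.

```c
#include <assert.h>
#include <stdint.h>

enum { NREGS = 8 };
static uint32_t r[NREGS];   /* emulated source registers (hypothetical layout) */

/* Bounds-checked access to one emulated register. */
static inline uint32_t *reg(int n)
{
    assert(n >= 0 && n < NREGS);
    return &r[n];
}

/* One emulation function per source instruction type. */
static inline void add(int d, int a, int b) { *reg(d) = *reg(a) + *reg(b); }
static inline void sub(int d, int a, int b) { *reg(d) = *reg(a) - *reg(b); }

/* Emitted ahead of time for the source sequence
 *     add r1,r2,r3
 *     sub r4,r5,r6
 * With inlining, the compiler sees plain loads, arithmetic, and stores. */
void emulated_block(void)
{
    add(1, 2, 3);
    sub(4, 5, 6);
}
```

There is no decode step at run time: the instruction stream has been turned into straight-line C, so the per-instruction dispatch cost disappears.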

22 Binary Translation  Generate custom code for every source instruction. For example, a load instruction in source code could be translated into a respective load instruction in native code.  Get rid of repeated parsing, decoding, and jumping overhead.  Register mapping is needed to reduce load/stores significantly.

23 Binary Translation  Example: Binary translation from IA-32 binary to PowerPC binary

24 Binary Translation

25  Register mapping to reduce load/store

26 Binary Translation

27  Register mapping  Easier if the number of target registers > the number of source registers (e.g. translating an x86 binary to RISC).  If the number of target registers is not enough, mapping may be done on a per-block, per-trace, or per-loop basis.  Infrequently used source registers may not be mapped.

28 Binary Translation  Source PC vs. Target PC (program counter)  TPC (Target PC) is different from SPC (Source PC)  For indirect branches, the registers hold source PCs. So we must provide a way to map SPCs to TPCs. Incorrect translation ! /* jump indirect through ctr, but ctr contains SPC */

29 Binary Translation  Dynamic Translation  First interpret, and perform code discovery as a byproduct.  Translate code incrementally, as it is discovered: place translated blocks into the code cache and save the source-to-target PC mapping in an address lookup table.  Emulation process: execute the translated block to its end, then look up the next source PC in the table. If it is translated, jump to the target PC; else interpret and translate.
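The emulation process described on slide 29 can be sketched as a loop around an SPC-to-TPC lookup table. In this sketch a "translated block" is just a C function pointer and `translate` picks from two hand-written blocks, standing in for real code generation; `EmulationManager`, `MapEntry`, and the block addresses 0x100 and 0x104 are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t counter; } GuestState;
typedef uint32_t (*TranslatedBlock)(GuestState *);  /* returns the next SPC */

enum { MAP_SLOTS = 16, SPC_HALT = 0 };

typedef struct { uint32_t spc; TranslatedBlock tpc; } MapEntry;

typedef struct {
    MapEntry map[MAP_SLOTS];  /* address lookup table: SPC -> TPC */
    unsigned translations;    /* how many blocks had to be translated */
    unsigned executions;      /* how many blocks were executed */
} EmulationManager;

/* Two guest basic blocks, hand-written in place of generated code. */
static uint32_t block_init(GuestState *g) { g->counter = 3; return 0x104; }
static uint32_t block_loop(GuestState *g)
{
    return --g->counter ? 0x104u : (uint32_t)SPC_HALT;
}

/* Stand-in for "interpret and translate" a newly discovered block. */
static TranslatedBlock translate(uint32_t spc)
{
    switch (spc) {
    case 0x100: return block_init;
    case 0x104: return block_loop;
    default:    return NULL;
    }
}

/* Look up the SPC; on a miss, translate and record the mapping. */
static TranslatedBlock lookup_or_translate(EmulationManager *em, uint32_t spc)
{
    MapEntry *e = &em->map[(spc >> 2) % MAP_SLOTS];  /* direct-mapped table */
    if (e->tpc == NULL || e->spc != spc) {
        e->spc = spc;
        e->tpc = translate(spc);
        em->translations++;
    }
    return e->tpc;
}

/* Execute a block to its end, then map the next SPC, until the guest halts. */
void run(EmulationManager *em, GuestState *g, uint32_t spc)
{
    while (spc != SPC_HALT) {
        TranslatedBlock tb = lookup_or_translate(em, spc);
        assert(tb != NULL);
        spc = tb(g);
        em->executions++;
    }
}
```

The loop block runs three times but is translated once, which is the point of keeping translations in a cache keyed by source PC.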

30

31 Binary Translation  The translation system needs to track the SPC at all times.  Control is shifted as needed between the interpreter, the emulation manager (EM), and translated blocks in the code cache. Each component must have a way to track the SPC: the interpreter uses the SPC directly; the interpreter passes the next SPC to the EM; a translated block passes the next SPC to the EM using JAL (jump-and-link instruction) or by mapping the SPC to a register.

32 Binary Translation  Control flows between translated block and emulation manager Emulation Manager Translation Block Context switch

33 Binary Translation  Control flow optimization: chaining  (figure labels: a.out, code cache, interpreter, emulator, SPC-TPC lookup table)

34 Binary Translation  Condition Codes (CC)  IA32, PowerPC, Sparc, VAX, and ARM all have CC  MIPS, Alpha, and Itanium do not  Case 1: Both source and target machines have CC  the source machine’s CC must be saved/restored  Case 2: Only source machine has CC  target machine must simulate the CC  some source machines set many CC in one instruction  Case 3: Only target machine has CC  compare & branch is emulated by two instructions  Case 4: Neither target nor source have CC  no issues

35 Binary Translation  CC is set more often than referenced.  If a CC is set again before it is used, the earlier value does not need to be saved. However, most CCs are saved at the end of each translated block.  If it can be determined that the following block always sets the CC before use, the current block does not need to save the CC it sets. The payoff is very high (e.g. all x86 ALU operations update CC).  Some rarely used CCs (V, C) or flags (e.g. parity) can be simulated using lazy evaluation. Instead of saving/restoring the CC or the flag, save the instruction and its operands and re-compute the CC/flag when it is needed.
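Lazy evaluation of condition codes can be sketched as follows: each emulated ALU operation records only its kind, operands, and result, and a flag is derived from that record the first time something asks for it. The type `LazyCC` and the function names are hypothetical, and the carry and overflow rules follow the usual 32-bit x86 conventions.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { CC_OP_NONE, CC_OP_ADD, CC_OP_SUB } CcOp;

/* What the last flag-setting instruction was, not the flags themselves. */
typedef struct {
    CcOp op;
    uint32_t src1, src2, result;
} LazyCC;

/* Emulated ALU operations: do the arithmetic, remember the inputs. */
uint32_t emu_add(LazyCC *cc, uint32_t a, uint32_t b)
{
    cc->op = CC_OP_ADD; cc->src1 = a; cc->src2 = b; cc->result = a + b;
    return cc->result;
}

uint32_t emu_sub(LazyCC *cc, uint32_t a, uint32_t b)
{
    cc->op = CC_OP_SUB; cc->src1 = a; cc->src2 = b; cc->result = a - b;
    return cc->result;
}

/* Flags are recomputed only when a later instruction reads them. */
int cc_zero(const LazyCC *cc) { return cc->result == 0; }

int cc_carry(const LazyCC *cc)
{
    switch (cc->op) {
    case CC_OP_ADD: return cc->result < cc->src1;  /* unsigned wrap-around */
    case CC_OP_SUB: return cc->src1 < cc->src2;    /* borrow */
    default: assert(!"no flag-setting operation recorded"); return 0;
    }
}

int cc_overflow(const LazyCC *cc)
{
    uint32_t a = cc->src1, b = cc->src2, r = cc->result;
    switch (cc->op) {
    case CC_OP_ADD: return (int)(((a ^ r) & (b ^ r)) >> 31);  /* same-sign inputs, different-sign result */
    case CC_OP_SUB: return (int)(((a ^ b) & (a ^ r)) >> 31);  /* different-sign inputs, result sign differs from a */
    default: assert(!"no flag-setting operation recorded"); return 0;
    }
}
```

If a second ALU operation overwrites the record before any flag is read, the first operation's flags are never computed at all, which is where the saving comes from.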

36 Binary Translation CC example (IA32 to PowerPC)

37 Binary Translation  CC Optimizations  Combine Compare with Branch: ARM may use two instructions for a branch: a compare (or a TST or TEQ) instruction followed by a branch. For some simple cases, MIPS can simply use a compare-and-branch instruction. There are cases, although very rare, in which the translated code (in terms of number of instructions) could be even smaller than the original ARM code.  Mapping each flag to a dedicated register. Example: N: R17, Z: R18, C: R19, V: R20. This can reduce the instruction overhead of extracting/depositing target flags from/to the CPSR (Current Program Status Register). If the target architecture has a sufficient number of registers, this optimization should be considered. Otherwise, it may take away three more registers and cause register spilling.

38 Binary Translation  Other issues of translation  Data Formats and Arithmetic  Memory Data Alignment  Byte Order
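Byte order and alignment can both be handled by assembling guest values one byte at a time, which gives the same result on little-endian and big-endian hosts and never performs an unaligned host access. The helper names below are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian guest word from any (possibly unaligned) address. */
uint32_t load_le32(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Read a 32-bit big-endian guest word. */
uint32_t load_be32(const uint8_t *p)
{
    assert(p != NULL);
    return ((uint32_t)p[0] << 24)
         | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)
         |  (uint32_t)p[3];
}

/* Write a 32-bit value in little-endian guest order. */
void store_le32(uint8_t *p, uint32_t v)
{
    assert(p != NULL);
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}
```

A translator would normally use a plain host load when guest and host byte order match and a byte-swap instruction when they differ; the byte-wise form is simply the portable baseline.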

39 Process VM  Perform guest/host mapping at the ABI (ISA + system calls) level  Encapsulate guest process in process-level runtime  Example: QEMU Linux user mode  Issues  Memory architecture  Exception architecture  OS call emulation  Overall VM architecture  High performance implementation  System environments

40 Process VM – Implementation

41 Process VM  Loader  A special loader writes guest code and data into a region holding the guest’s memory image, and loads the runtime code into memory.  Initialization  Allocate memory for the code cache and other tables  Initialize runtime data structures and invoke the OS to establish signal handlers.  Emulation engine  Emulate guest instructions with an interpreter or binary translation  Code Cache Manager  Which translations to flush?  OS Call Emulator  Translate OS calls and OS responses  Exception Emulator  Handle signals: if registered by the source, pass to the source handler; if not, emulate the host response  Form precise state

42 Process VM  Compatibility  A strict definition of compatibility (e.g. bug-to-bug compatible) would exclude many useful process VMs.  Intrinsic compatibility  Any software written by the most devious programmer will work in a compatible way  Example: Intel strives for intrinsic compatibility when it produces a new x86 microprocessor  Extrinsic compatibility  Many useful VM applications do not achieve intrinsic compatibility  Limited application set: run Microsoft productivity tools (Office)

43 Process VM  Compatibility issues  State mapping: if the guest process uses all of the virtual address space, intrinsic compatibility cannot be achieved  Mapping of control transfers: some potentially trapping instructions may be removed  User-level instructions: the FP format may be different  OS operations: the host OS does not support exactly the same functions as the guest’s native OS

44 Process VM  Software Memory Mapping  Runtime software maintains the mapping table  Similar to hardware page table/TLB  Slow, but always works
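A software-maintained mapping table of the kind slide 44 describes can be sketched as a single-level page table from guest page number to host pointer. The names `GuestMap`, `map_page`, and `guest_to_host`, the 4 KiB page size, and the 16-page guest space are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { GPAGE_BITS = 12, GPAGE_SIZE = 1 << GPAGE_BITS, GPAGE_COUNT = 16 };

/* Runtime-maintained table: guest page number -> host page (NULL = unmapped). */
typedef struct { uint8_t *host_page[GPAGE_COUNT]; } GuestMap;

/* Back one guest page with a page of host memory. */
void map_page(GuestMap *m, uint32_t guest_addr, uint8_t *host_page)
{
    uint32_t gpn = guest_addr >> GPAGE_BITS;
    assert(gpn < GPAGE_COUNT);
    m->host_page[gpn] = host_page;
}

/* Translate a guest address; every emulated load and store goes through this. */
uint8_t *guest_to_host(const GuestMap *m, uint32_t guest_addr)
{
    uint32_t gpn = guest_addr >> GPAGE_BITS;
    if (gpn >= GPAGE_COUNT || m->host_page[gpn] == NULL)
        return NULL;                              /* would raise a guest fault */
    return m->host_page[gpn] + (guest_addr & (GPAGE_SIZE - 1));
}
```

The extra table lookup on every memory access is the cost the slide calls "slow"; the benefit is that it works even when the guest address space does not fit inside the host's.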

45 Process VM  Guest address space > Host address space + Runtime

46 Process VM  Direct Translation Methods  VM software mapping is slow  Use the underlying hardware if guest address space + runtime fit within the host address space

47 Process VM - Protecting Runtime Memory

48 QEMU Overview  Created by Fabrice Bellard in 2003  Function-level emulation  Faster than “cycle-accurate” simulators.  Good enough to use applications written for another CPU.  Just-in-time (JIT) compilation support to achieve high performance (400 ~ 500 MIPS)  Support for many peripherals (VGA, serial, Ethernet, etc.)  Support for many hosts and targets (full system emulation)  x86, arm, mips, sh4, cris, sparc, powerpc, nds32, …  qemu/hw/* contains all of the supported boards.  User mode emulation: can run applications compiled for another CPU.

49 QEMU overview  Update status  (Jan 6, 2008) Stable, then stalled for a long time  0.10 (Mar 5, 2009) TCG support (a new general JIT framework)  0.11 (Sep 24, 2009) KVM support  0.12 More KVM support. Code refactoring. New peripheral framework to support dynamic board configuration

50 QEMU Screenshot – Emulate ARM11MPCore

51 QEMU Screenshot – Android 2.1

52 QEMU JIT  TCG (Tiny Code Generator)  a generic backend for a C compiler. It was simplified to be used in QEMU.  Translation Block (TB)  A TCG "basic block" corresponds to a list of instructions terminated by a branch instruction.  16 MB code cache size

53 QEMU JIT  Prologue, Epilogue When the target-machine is ARM

54 QEMU JIT  cpu_exec() called each time around main loop.  Program executes until an unchained block is encountered.  Returns to cpu_exec() through epilogue.  Enter the code cache: Linux: Set buffer executable and jump to Buffer & Execute

55 QEMU JIT – code gen flow  Front-end: qemu/tcg/tcg.c  gen_intermediate_code  disas_XXX_insn  Interpret source instructions and translate them to micro-ops.  Translation stops when a conditional branch is encountered.

56 QEMU JIT – code gen flow  tcg_liveness_analysis  Remove dead code.  Ex. and_i32 t0, t0, $0xffffffff  Ex. add_i32 t0, t1, t2; add_i32 t0, t0, $1; mov_i32 t0, $1 (the two adds are dead because t0 is overwritten by the mov)
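The second example on slide 56 can be reproduced with a small backward liveness pass. This is a simplified sketch, not QEMU's `tcg_liveness_analysis`: the `Op` layout and opcode names are hypothetical, every op is assumed to be free of side effects, and liveness is tracked in a 32-bit mask of temporaries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { OP_MOVI, OP_ADD, OP_ADDI } Opc;

/* dst is always a temp index.
 * OP_MOVI: a is an immediate.  OP_ADD: a and b are temps.
 * OP_ADDI: a is a temp, b is an immediate. */
typedef struct { Opc opc; int dst; int a; int b; int dead; } Op;

/* Walk the ops backwards; an op whose result is not live afterwards is dead.
 * live_out has bit n set if temp n is needed after the last op. */
void liveness_pass(Op *ops, size_t n, uint32_t live_out)
{
    uint32_t live = live_out;
    for (size_t i = n; i-- > 0; ) {
        Op *op = &ops[i];
        assert(op->dst >= 0 && op->dst < 32);
        uint32_t dbit = 1u << op->dst;
        if (!(live & dbit)) {          /* nobody reads this result */
            op->dead = 1;
            continue;
        }
        op->dead = 0;
        live &= ~dbit;                 /* value is defined here ... */
        if (op->opc == OP_ADD)         /* ... and its temp inputs become live */
            live |= (1u << op->a) | (1u << op->b);
        else if (op->opc == OP_ADDI)
            live |= 1u << op->a;
    }
}
```

For the slide's sequence with only t0 live at the end, the final `mov_i32` is kept and both preceding adds are marked dead, since their results are overwritten before anything reads them.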

57 QEMU JIT – code gen flow  Register mapping  register struct CPUNDS32State *env asm(r14);  register target_ulong T0 asm(r15);  register target_ulong T1 asm(r12);  register target_ulong T2 asm(r13);

58 QEMU JIT – Block chaining  Avoid context-switch overhead  Every time a block returns, try to chain it.  tb_add_jump(): back-patch the native jump address
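Block chaining can be sketched with a `chained` pointer that plays the role of the back-patched native jump. This is only a model of the idea: the `TB` and `Cache` structures, `find_tb`, and `exec_chain` are hypothetical, each block has a single fixed successor, and no real machine code is patched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct TB {
    uint32_t spc;        /* guest address this block was translated from */
    uint32_t next_spc;   /* direct-branch target, 0 = guest halts */
    struct TB *chained;  /* patched "jump" to the successor, NULL until chained */
    unsigned runs;       /* how often the block body executed */
} TB;

enum { NUM_TBS = 3 };

typedef struct {
    TB tbs[NUM_TBS];
    unsigned dispatches;  /* times control came back to the dispatcher */
} Cache;

static TB *find_tb(Cache *c, uint32_t spc)
{
    for (size_t i = 0; i < NUM_TBS; i++)
        if (c->tbs[i].spc == spc)
            return &c->tbs[i];
    return NULL;
}

/* Stay inside the code cache as long as blocks are chained together. */
static TB *exec_chain(TB *tb)
{
    for (;;) {
        tb->runs++;
        if (tb->chained == NULL)
            return tb;            /* unchained exit: back to the dispatcher */
        tb = tb->chained;
    }
}

void run(Cache *c, uint32_t spc)
{
    TB *prev = NULL;
    while (spc != 0) {
        TB *tb = find_tb(c, spc);
        assert(tb != NULL);
        c->dispatches++;
        if (prev != NULL)
            prev->chained = tb;   /* analogous to back-patching the jump */
        prev = exec_chain(tb);
        spc = prev->next_spc;
    }
}
```

Running the same three-block program twice shows the effect: the first run returns to the dispatcher after every block, the second run enters the cache once and follows the chain.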

59 QEMU JIT – Memory load emulation  Based on qemu, emulating MIPS (little endian)  decode_opc  translate mips-asm to micro-op  Translation stops when a conditional branch is encountered.  gen_store_gpr will store this value to the emulated cpu’s general register.

60 QEMU JIT – Memory load emulation target-mips/translate.c

61 QEMU JIT – Memory load emulation  Generate binary code: cpu_gen_code -> gen_intermediate_code -> tcg_gen_code (files: /qemu/Translate-all.c, /qemu/Tcg.c); tcg_gen_code_common -> tcg_reg_alloc_op -> tcg_out_op (file: /qemu/Tcg/i386/tcg-target.c)  opc: tcg outputs 0xe8, which is the opcode of a call instruction on x86. It calls one of the functions in the array qemu_ld_helpers. The arguments to these functions are passed in registers EAX, EDX, and ECX.

62 QEMU JIT – Memory load emulation #define REGPARM __attribute((regparm(3))) 0xe8 pc (s->code_ptr) __ldb_mmu pc+4 (s->code_ptr += 4) … qemu_ld_helpers[s_bits] Offset (4 byte)
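The byte layout on slide 62 is the x86 near call: opcode 0xe8 followed by a 4-byte little-endian offset that is relative to the address of the instruction after the call. The sketch below encodes such a call into a buffer; `emit_call_rel32` and its parameters are hypothetical names, addresses are modeled as 32-bit integers (an i386 host), and this is not QEMU's `tcg_out_op` code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emit "call target" at buf[pos], assuming buf[0] will execute at address
 * `base`. Returns the position just past the emitted instruction. */
size_t emit_call_rel32(uint8_t *buf, size_t pos, uint32_t base, uint32_t target)
{
    assert(buf != NULL);
    /* The call is 5 bytes long: 1 opcode byte + 4 offset bytes. */
    uint32_t next_insn = base + (uint32_t)pos + 5u;
    uint32_t rel = target - next_insn;     /* wraps modulo 2^32 for backward calls */

    buf[pos]     = 0xe8;                   /* CALL rel32 */
    buf[pos + 1] = (uint8_t)(rel & 0xff);  /* offset, least significant byte first */
    buf[pos + 2] = (uint8_t)((rel >> 8) & 0xff);
    buf[pos + 3] = (uint8_t)((rel >> 16) & 0xff);
    buf[pos + 4] = (uint8_t)((rel >> 24) & 0xff);
    return pos + 5;
}
```

For example, a call emitted at address 0x1000 to a helper at 0x2000 gets the offset 0x2000 - 0x1005 = 0xFFB, stored as the bytes fb 0f 00 00 after the opcode.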

63 QEMU JIT – Memory load emulation SoftMMU  Translate guest virtual address to host virtual address.  Translate the guest physical address to host physical address. qemu needs to find the PhysPageDesc entry in table **l1_phys_map and get the phys_offset. guest_phy_addr[31:22]: first-level entry; guest_phy_addr[21:12]: second-level entry.  If the page is not found, cpu_register_physical_memory: qemu creates a new entry (by mmap), updates its value, and inserts this entry into the l1_phys_map table.
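The two-level lookup on slide 63 splits a 32-bit guest physical address into a first-level index (bits 31:22), a second-level index (bits 21:12), and a page offset. A minimal sketch follows, in which `PageDesc`, `PhysMap`, and `phys_page_find` are simplified stand-ins rather than QEMU's actual definitions, and second-level tables are allocated with `calloc` instead of `mmap`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum {
    L1_BITS = 10, L2_BITS = 10, PG_BITS = 12,
    L1_SIZE = 1 << L1_BITS, L2_SIZE = 1 << L2_BITS
};

/* Simplified stand-in for a physical page descriptor. */
typedef struct { uint32_t phys_offset; int valid; } PageDesc;

/* First level: 1024 pointers to second-level tables of 1024 descriptors. */
typedef struct { PageDesc *l1[L1_SIZE]; } PhysMap;

/* Find the descriptor for a guest physical address. With alloc != 0 a missing
 * second-level table is created (zeroed); otherwise NULL is returned. */
PageDesc *phys_page_find(PhysMap *m, uint32_t paddr, int alloc)
{
    assert(m != NULL);
    uint32_t i1 = paddr >> (L2_BITS + PG_BITS);        /* bits 31:22 */
    uint32_t i2 = (paddr >> PG_BITS) & (L2_SIZE - 1);  /* bits 21:12 */

    if (m->l1[i1] == NULL) {
        if (!alloc)
            return NULL;
        m->l1[i1] = calloc(L2_SIZE, sizeof(PageDesc));
        if (m->l1[i1] == NULL)
            return NULL;                               /* out of memory */
    }
    return &m->l1[i1][i2];
}
```

Because second-level tables are created on demand, a sparse guest physical address space costs only the first-level array plus one table per 4 MiB region actually used.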

64 QEMU JIT – Memory load emulation SoftMMU  Translate the guest physical address to host virtual address.  phys_offset == IO_MEM_RAM  guest RAM space phys_offset[31:12]: the offset of this page in emulated physical memory. phys_offset + phys_ram_base = host virtual address  phys_offset > IO_MEM_ROM  MMIO space phys_offset[11:3]: the index in io_mem_write/io_mem_read array. register the I/O emulation functions:

65 QEMU JIT – Memory load emulation SoftMMU  Original way  1. Translate the guest virtual address to guest physical address  2. Then qemu needs to find the PhysPageDesc entry in table l1_phys_map and get the phys_offset  3. phys_offset + phys_ram_base = host virtual address  Software TLB table  1. Search TLB first.  2. Hit: guest_virtual_address + addend = host_virtual_address.  3. Miss: Search the l1_phys_map table and then fill the corresponding entry to the TLB table
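The software TLB steps on slide 65 can be sketched as a direct-mapped table whose entries store an `addend`, so that a hit costs one compare and one add. In this sketch the slow path is a placeholder that identity-maps guest addresses onto a flat RAM buffer; `Mmu`, `TlbEntry`, `slow_translate_page`, and the table size are assumptions, not QEMU's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { VPAGE_BITS = 12, TLB_BITS = 8, TLB_SIZE = 1 << TLB_BITS };

typedef struct {
    uint32_t tag;      /* guest virtual page address cached in this slot */
    uintptr_t addend;  /* host page address minus guest page address */
    int valid;
} TlbEntry;

typedef struct {
    uint8_t *ram;          /* host buffer backing guest RAM */
    uint32_t ram_size;
    TlbEntry tlb[TLB_SIZE];
    unsigned hits, misses;
} Mmu;

/* Placeholder for the slow path (page-table walk plus l1_phys_map lookup):
 * here guest addresses simply index into the RAM buffer. */
static uint8_t *slow_translate_page(Mmu *m, uint32_t page_vaddr)
{
    return page_vaddr < m->ram_size ? m->ram + page_vaddr : NULL;
}

/* Guest virtual address -> host virtual address, or NULL on a guest fault. */
uint8_t *tlb_translate(Mmu *m, uint32_t vaddr)
{
    assert(m != NULL && m->ram != NULL);
    uint32_t page = vaddr & ~((1u << VPAGE_BITS) - 1u);
    TlbEntry *e = &m->tlb[(vaddr >> VPAGE_BITS) & (TLB_SIZE - 1)];

    if (e->valid && e->tag == page) {
        m->hits++;                              /* 2. hit */
    } else {
        uint8_t *host_page = slow_translate_page(m, page);
        if (host_page == NULL)
            return NULL;
        m->misses++;                            /* 3. miss: refill the entry */
        e->tag = page;
        e->addend = (uintptr_t)host_page - page;
        e->valid = 1;
    }
    return (uint8_t *)(e->addend + vaddr);      /* guest vaddr + addend */
}
```

Storing the difference rather than the host page address means the hit path does not need to mask out the page offset before adding.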

66 QEMU JIT – Memory load emulation SoftMMU (__ldX_mmu) (continued on next page)

67 QEMU JIT – Memory load emulation SoftMMU (__ldX_mmu) cpu_exec  tb_find_fast  tb_find_slow  get_phys_addr_code  (if tlb not match)ldub_code(softmmu_header.h)  __ldl_mmu(softmmu_template.h)  tlb_fill  cpu_XXX_handle_mmu_fault  tlb_set_page  tlb_set_page_exec

68 QEMU JIT – Summary  Look up TB  Translate one TB  Chain it to existing TBs  Execute code cache  Exception occurrence and handling  Cached? No / Yes

69 QEMU other topics  Fixed register allocation: register struct CPUARMState *env asm(r14); register target_ulong T0 asm(r15); register target_ulong T1 asm(r12); register target_ulong T2 asm(r13);  Condition code (CC)  Lazy CC evaluation  Recovery when needed. Example: for R = A + B, save CC_SRC=A, CC_DST=R, CC_OP=CC_OP_ADDL

70 Source code organization  qemu/  qemu-* : OS dependent API wrapper example: memory allocation or socket  target-*/ : target porting  tcg/ : new and unified JIT framework  *-user/ : user-mode emulation on different OS  softmmu-* : target MMU acceleration framework  hw/ : peripheral model  fpu : softfloat FPU emulation library  gdb : GDB stub implementation

71 Source code organization  TranslationBlock structure in translate-all.h  Translation cache is code_gen_buffer in exec.c  cpu_exec() in cpu-exec.c orchestrates translation and block chaining.  vl.c: Main loop for system emulation.

72 Sample Demo  Using gdb to debug QEMU  Using QEMU to debug guest OS  QEMU Linux-user mode emulation  QEMU system mode emulation

73 Funny issues of QEMU  Generate execution traces to drive timing models  Try to integrate timing models  Improve optimization, say, by retaining chaining across interrupts  TCG optimization  Code cache management  Optimization passes over micro-ops  Using a multi-core host to emulate a multi-core guest

