
CS530 Operating System Nesting Paging in VM Replay for MPs Jaehyuk Huh Computer Science, KAIST.


1 CS530 Operating System Nesting Paging in VM Replay for MPs Jaehyuk Huh Computer Science, KAIST

2 Address Translation in VM Need to translate guest VA (gVA) to machine address – gVA (guest VA) → gPA (guest PA) → sPA (system PA) Paravirtualization – Guest page table (managed by guest OS) directly maps gVA to sPA – Hypervisor validates guest page table Full virtualization – SW technique: shadow paging – HW-assisted technique: nested paging
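The two-step translation on this slide can be sketched as follows. This is a minimal illustration, assuming flat single-level tables stored as Python dicts; the names translate, guest_pt, and host_pt and all page numbers are hypothetical and are not taken from the slides.

```python
PAGE_SIZE = 4096  # 4 KB pages


def translate(addr, table):
    """Map an address through a flat {page number: frame number} table."""
    page, offset = divmod(addr, PAGE_SIZE)
    return table[page] * PAGE_SIZE + offset


# Hypothetical tables, for illustration only.
guest_pt = {0x10: 0x2}   # gVA page 0x10 -> gPA frame 0x2 (managed by guest OS)
host_pt = {0x2: 0x7F}    # gPA frame 0x2 -> sPA frame 0x7F (managed by hypervisor)

gva = 0x10 * PAGE_SIZE + 0x123
gpa = translate(gva, guest_pt)   # first step: gVA -> gPA
spa = translate(gpa, host_pt)    # second step: gPA -> sPA
print(hex(gpa), hex(spa))        # 0x2123 0x7f123
```

Shadow paging precomputes the composition of the two tables in software, whereas nested paging lets the hardware walker perform both steps.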

3 X86 4KB page tables in long mode

4 Shadow Page Table Shadow page table (sPT) – translates from gVA to sPA – maintained by VMM (hypervisor) VMM intercepts the updates of page table base address – CR3 updates in x86 – Set CR3 with sPT base address instead of gPT base address sPT must be consistent with guest page table (gPT) → gPT updates must be reflected in sPT Any page fault must be intercepted by VMM – VMM must tell guest-induced page faults from VMM-induced ones – Vectors guest-induced page faults to guest OS – High overheads for page fault handling

5 How to make gPT and sPT consistent? Write-protecting gPT – Any modification of gPT (adding or removing a translation) causes a fault – VMM updates sPT accordingly Exploiting page-fault behavior and TLB consistency rules – Adding a page translation Guest OS can add a new translation to gPT without interception by VMM Later accesses by the guest VM cause a page fault on the new translation VMM updates sPT on the page fault: must inspect gPT to find the new page – Deleting a page translation Guest OS executes INVLPG to invalidate the TLB entry VMM intercepts the execution and removes the entry from sPT
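The lazy synchronization scheme above (fill sPT on a page fault, drop the entry on INVLPG) can be sketched as below. This is a simplified model, assuming dict-based page tables; the class ShadowPager and its method names are hypothetical, and a real VMM walks multi-level tables and checks permissions as well.

```python
class ShadowPager:
    """Toy VMM state: keeps sPT in sync with gPT lazily (illustrative only)."""

    def __init__(self, gpt, p2m):
        self.gpt = gpt   # guest page table: gVA page -> gPA page (guest-owned)
        self.p2m = p2m   # hypervisor map: gPA page -> sPA page
        self.spt = {}    # shadow page table: gVA page -> sPA page

    def on_page_fault(self, vpage):
        """Called on every intercepted page fault."""
        if vpage in self.gpt:
            # VMM-induced fault: gPT has the mapping, sPT was just stale.
            self.spt[vpage] = self.p2m[self.gpt[vpage]]
            return "fixed"
        # Guest-induced fault: gPT has no mapping, vector it to the guest OS.
        return "guest"

    def on_invlpg(self, vpage):
        """Called when the guest's INVLPG is intercepted."""
        self.spt.pop(vpage, None)


gpt = {}
pager = ShadowPager(gpt, p2m={5: 90})
gpt[1] = 5                        # guest adds a translation without trapping
first = pager.on_page_fault(1)    # first access faults; VMM fills sPT
del gpt[1]                        # guest removes the translation ...
pager.on_invlpg(1)                # ... and executes INVLPG
second = pager.on_page_fault(1)   # now a genuine guest page fault
```

Here first is "fixed" and second is "guest", which is the distinction the VMM must draw between VMM-induced and guest-induced faults.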

6 Overheads of Shadow Paging Any page fault requires expensive VMM intervention – Guest-induced page faults – Hypervisor-induced page faults Accessed and dirty bit updates – HW page walker sets bits in sPT (not gPT) – Guest OS needs the information to make paging decisions – Dirty bit example: set pages pointed to by sPT read-only Problems in MPs – What if a VM uses multiple processors? – Replicating sPT for each processor? → memory overheads – Sharing sPT? → synchronizing sPT for any change

7 Shadow Paging Overheads

8 Nested Page Table A source of address translation overheads in traditional x86 VMM – a fixed hardware page walker to handle a TLB miss – Can walk only one page table (pointed to by CR3) Nested paging – Separate HW states affecting paging (two copies of CR3 etc.) for guest OS and VMM – HW page walker can walk both gPT and the nested page table (gPA → sPA) – TLB can hold a translation from gVA to sPA directly Benefit: No more traps on guest page table accesses Drawback: Extra page table steps add latency to a TLB miss May add extra caching for page translation – Nested TLB – 2D page walk cache
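The added TLB-miss latency can be made concrete by counting memory references in a two-dimensional walk. The sketch below assumes no hits in the TLB, nested TLB, or page walk cache, and assumes every guest page-table pointer is a gPA that must itself be translated; the function name nested_walk_refs is hypothetical.

```python
def nested_walk_refs(guest_levels, nested_levels):
    """Count memory references for one 2D page walk with no caching."""
    refs = 0
    for _ in range(guest_levels):
        refs += nested_levels  # translate the gPA of this guest table to an sPA
        refs += 1              # read the guest page-table entry itself
    refs += nested_levels      # translate the final gPA of the data page
    return refs


native = 4                       # a native 4-level walk reads 4 entries
nested = nested_walk_refs(4, 4)  # 4-level guest table over 4-level nested table
print(native, nested)            # 4 24
```

The jump from 4 to 24 references in the uncached case is why the slide lists a nested TLB and a 2D page walk cache as mitigations.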

9 Nested Paging


11 Address Space IDs Old x86 did not support address space IDs (ASID) in TLBs – must flush TLBs for VM switch – Assign ASID for each VM – Still need to flush TLBs for context switch within a VM
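A tagged TLB can be modeled roughly as follows. This is a toy sketch, assuming one ASID per VM as on the slide; the class TaggedTLB and its methods are hypothetical and do not mirror any specific hardware interface.

```python
class TaggedTLB:
    """Toy TLB whose entries are tagged with an address space ID (ASID)."""

    def __init__(self):
        self.entries = {}  # (asid, virtual page) -> physical page

    def insert(self, asid, vpage, ppage):
        self.entries[(asid, vpage)] = ppage

    def lookup(self, asid, vpage):
        return self.entries.get((asid, vpage))  # None models a TLB miss

    def flush_asid(self, asid):
        """Drop only the entries of one address space."""
        self.entries = {k: v for k, v in self.entries.items() if k[0] != asid}


tlb = TaggedTLB()
tlb.insert(asid=1, vpage=0x10, ppage=0xA0)  # VM 1
tlb.insert(asid=2, vpage=0x10, ppage=0xB0)  # VM 2, same virtual page
# Switching between VMs needs no flush: the tags keep the entries apart.
vm1_hit = tlb.lookup(1, 0x10)
vm2_hit = tlb.lookup(2, 0x10)
# A context switch inside VM 1 still flushes VM 1's entries.
tlb.flush_asid(1)
```

After the flush, VM 2's entry is still present while VM 1 misses on the same virtual page.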

12 Replay Papers VM-based replay – Execution Replay for Multiprocessor Virtual Machines – Dunlap et al HW-based replay – Rerun: Exploiting Episodes for Lightweight Memory Race Recording – Hower and Hill ODR: Output-Deterministic Replay for Multicore Debugging – Altekar and Stoica Slides adapted from the presentation slides by the paper authors

13 Big ideas Detection and replay of memory races is possible on commodity hardware Overhead high for some workloads …but surprisingly low for other workloads

14 Execution Replay CPU Memory Disk Network Keyboard, mouse Interrupts

15 Deterministic Replay – Faithfully replay an execution such that all instructions appear to complete in the same order and produce the same result Valuable – Debugging [LeBlanc, et al. - COMP ’87] e.g., time travel debugging, rare bug replication – Fault tolerance [Bressoud, et al. - SIGOPS ’95] e.g., hot backup virtual machines – Security [Dunlap et al. – OSDI ’02] e.g., attack analysis – Tracing [Xu et al. – WDDD ’07] e.g., unobtrusive replay tracing

16 Single-processor Replay Basic principles well understood – Log all non-deterministic inputs – Log timing of asynchronous events Minimal overhead (Dunlap02) – 13% worst case – Log for months or years Available commercially – VMware: Record/Replay

17 The Multiprocessor Challenge Interleaved reads and writes – Fine-grained non-determinism – Much more difficult Existing solutions – Hardware modification – Software instrumentation SMP-ReVirt – Hardware MMU to detect sharing

18 Multiprocessor Replay (Figure: processors P1 and P2 sharing memory; labels n=3, n=5, and if (n<4).)

19 Ordering Memory Accesses Preserving order will reproduce execution – a→b: “a happens-before b” – Ordering is transitive: a→b, b→c means a→c Two instructions must be ordered if: – they both access the same memory, and – one of them is a write
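The ordering rule on this slide is a simple predicate over pairs of accesses. The sketch below assumes each access is described by an address and a read/write flag; the Access tuple and the function must_order are hypothetical names used only for illustration.

```python
from collections import namedtuple

# Hypothetical record of one memory access.
Access = namedtuple("Access", ["cpu", "addr", "is_write"])


def must_order(x, y):
    """True if x and y touch the same memory and at least one is a write."""
    return x.addr == y.addr and (x.is_write or y.is_write)


w = Access("P1", 0x100, True)     # P1 writes 0x100
r1 = Access("P2", 0x100, False)   # P2 reads 0x100
r2 = Access("P1", 0x100, False)   # P1 reads 0x100
r3 = Access("P2", 0x200, False)   # P2 reads a different address

print(must_order(w, r1), must_order(r1, r2), must_order(w, r3))
# True False False
```

Only the write/read pair on the same address needs a logged happens-before edge; two reads, or accesses to different addresses, can be replayed in any order.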

20 Constraints: Enforcing order To guarantee a → d, any one of these suffices: – a → d – b → d – a → c – b → c Suppose we need b → c – b → c is necessary – a → d is redundant (Figure: P1 executes a then b; P2 executes c then d; label: overconstrained.)
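Why a → d becomes redundant once b → c is logged follows from transitivity, which can be checked with a reachability search. The sketch assumes program order a → b on P1 and c → d on P2; the function reachable is a hypothetical helper, not part of any replay system.

```python
def reachable(edges, src, dst):
    """Depth-first search: is there a happens-before path from src to dst?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(v for u, v in edges if u == node)
    return False


program_order = [("a", "b"), ("c", "d")]  # P1: a then b; P2: c then d
logged = [("b", "c")]                     # the one constraint we must log

print(reachable(program_order, "a", "d"))           # False
print(reachable(program_order + logged, "a", "d"))  # True
```

With b → c in the log, the chain a → b → c → d already implies a → d, so logging a → d separately would only add log volume.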

21 CREW Protocol Each shared object in one of two states: – Concurrent-Read: all processors can read, none can write – Exclusive-Write: one processor (the owner) can read and write; others have no access Enforced with hardware MMU – Read/write – Read-only – None Change CREW states on demand – Fault, fixup, re-execute CREW event – Increasing or reducing permission due to CREW state changes

22 CREW Property If two instructions on different processors: – access the same page, – and one of them is a write, – there will be a CREW event on each processor between them.

23 Generating Constraints State: Concurrent Read – All processors read-only d*: CREW fault New state: P2 Exclusive r: privilege reduction – Read to None i: privilege increase – Read to Read/write Log timing of r and i Constraint: – r → i (Figure: timeline of events a, d, r, i, and d* on P1 and P2.)
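The CREW state changes and the resulting r → i constraints can be sketched with a small per-page model. This is an illustration under simplifying assumptions (one page, permissions held in a dict, faults handled synchronously); the class CrewPage and its event names are hypothetical and are not the SMP-ReVirt implementation.

```python
class CrewPage:
    """One shared page; each CPU holds permission 'none', 'r', or 'rw'."""

    def __init__(self, cpus):
        self.perm = {cpu: "r" for cpu in cpus}  # start in concurrent-read
        self.log = []                           # ordered CREW events

    def _set(self, cpu, new_perm, kind):
        self.perm[cpu] = new_perm
        self.log.append((cpu, kind))

    def access(self, cpu, is_write):
        """Return True if the access caused a CREW fault."""
        have = self.perm[cpu]
        if have == "rw" or (have == "r" and not is_write):
            return False                        # permitted, no event
        if is_write:                            # move to exclusive-write
            for other in self.perm:
                if other != cpu and self.perm[other] != "none":
                    self._set(other, "none", "reduce")
            self._set(cpu, "rw", "increase")
        else:                                   # move back to concurrent-read
            for other in self.perm:
                if self.perm[other] == "rw":
                    self._set(other, "r", "reduce")
            for other in self.perm:
                if self.perm[other] == "none":
                    self._set(other, "r", "increase")
        return True


page = CrewPage(["P1", "P2"])
page.access("P1", is_write=False)   # allowed in concurrent-read
page.access("P2", is_write=True)    # fault: P1 reduced, then P2 increased
page.access("P1", is_write=False)   # fault: P2 reduced, then P1 increased
print(page.log)
```

In this model every reduction is logged before the increase it enables, so each adjacent reduce/increase pair in the log corresponds to one r → i constraint.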

24 Predicting results Key changes in sharing attributes – 4096-byte sharing granularity – “Miss” is very expensive SPLASH2 – Good: high spatial locality / low false sharing – Bad: random access patterns / high false sharing The Linux kernel – Tuned to 16-byte cacheline – Involving the kernel may be expensive

25 Single-processor Xen guests

26 2-processor Xen guests

27 2-processor, con’t

28 4-processor Xen guests

29 HW Memory Race Recording SW-only approach – Too slow to be turned on always – SW alters execution path Want – Small log – record longer for same state – Small hardware – reduce cost, especially when not used – Unobtrusive – should not alter execution Rerun: Exploiting Episodes for Lightweight Memory Race Recording

30 Episodic Recording Most code executes without races – Use race-free regions as unit of ordering Episodes: independent execution regions – Defined per thread – Identified passively → does not affect execution – Encompass every instruction (Figure: interleaved LD/ST instruction streams for threads T0, T1, and T2.)

31 Capturing Causality Via scalar Lamport Clocks [Lamport ’78] – Assigns timestamps to events – Timestamp order implies causality Replay in timestamp order – Episodes with same timestamp can be replayed in parallel (Figure: episode timestamps for threads T0, T1, and T2.)
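Replay in timestamp order, with equal timestamps running in parallel, amounts to grouping the logged episodes by timestamp. The sketch below assumes the log is a list of (timestamp, thread) pairs; the function replay_schedule and the sample values are hypothetical.

```python
from itertools import groupby


def replay_schedule(episodes):
    """Group (timestamp, thread) episodes into ordered parallel batches."""
    ordered = sorted(episodes)
    return [
        (ts, [thread for _, thread in group])
        for ts, group in groupby(ordered, key=lambda ep: ep[0])
    ]


# Hypothetical episode log, for illustration only.
log = [(23, "T1"), (22, "T0"), (23, "T2"), (44, "T0")]
print(replay_schedule(log))
# [(22, ['T0']), (23, ['T1', 'T2']), (44, ['T0'])]
```

Each batch can be dispatched concurrently, and a batch starts only after the previous one has finished, which preserves every logged happens-before relation.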

32 Episode Benefits Multiple races can be captured by a single episode – Reduces amount of information to be logged Episodes are created passively – No speculation, no rollback Episodes can end early – Eases implementation Episode information is thread-local – Promotes scalability, avoids synchronization overheads

33 Hardware Rerun requirements: – Detect races → track r/w sets – Mark episode boundaries – Maintain logical time Rerun core state: – Write Filter (WF): 32 bytes – Read Filter (RF): 128 bytes – Timestamp (TS): 2 bytes – References (REFS): 4 bytes Rerun L2/memory state: – Memory Timestamp (MTS) Total state: 166 bytes/core (Figure: base system with Core 0 to Core 15, each with pipeline, L1 I and L1 D; L2 banks 0 to 15 with data, tags, directory, and coherence controller; interconnect; DRAM.)

34 HW Replay Summary Requires some modification to existing HW – Will CPU manufacturers add the support any time soon? → not likely Other low-overhead approaches with SW-based replay – ODR: Output-Deterministic Replay for Multicore Debugging, Altekar and Stoica, SOSP ’09
