
Remus: High Availability via Asynchronous Virtual Machine Replication.


1 Remus: High Availability via Asynchronous Virtual Machine Replication

2 Introduction
Provides OS- and application-agnostic high availability on commodity hardware
Based on the ability of virtualization to migrate running VMs between physical hosts
Replicates state at very high frequency (up to 40 times/sec)
Replicates whole system state – CPU state, memory, hard disks

3 Goals
Generality
  – Low level service
  – Regardless of the application or the hardware
Transparency
  – No modifications made to OS or application code
Seamless failure recovery
  – No externally visible state lost
  – Failure recovery should be very rapid

4 Approach
Virtualized infrastructure allows whole-system replication
Speculative execution increases system performance
Buffering allows synchronization with the replicated server to be asynchronous

5 Design and implementation
Pipelined checkpoints
Each epoch is divided into four stages:
1. Stop execution and copy any changed state (CPU, memory, disks) to a buffer.
2. Transmit the buffered state to the backup. All network output remains buffered.
3. The backup acknowledges the checkpoint.
4. Buffered network output is released.
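The four stages above can be sketched as a toy model. This is only an illustration, not Remus code: the `Primary` and `Backup` classes, their fields, and the use of a dict as "system state" are all assumptions made for the example, and the stages run one after another here rather than pipelined.

```python
from dataclasses import dataclass, field


@dataclass
class Backup:
    """Hypothetical backup host: holds the last acknowledged state."""
    state: dict = field(default_factory=dict)
    reachable: bool = True

    def receive(self, changes: dict) -> bool:
        if not self.reachable:
            return False            # no acknowledgement is sent
        self.state.update(changes)
        return True                 # stage 3: acknowledge the checkpoint


@dataclass
class Primary:
    """Hypothetical protected VM with buffered network output."""
    state: dict = field(default_factory=dict)
    dirty: dict = field(default_factory=dict)
    held_output: list = field(default_factory=list)
    released_output: list = field(default_factory=list)

    def run(self, writes: dict, packets: list) -> None:
        # Speculative execution: state changes at once, output is held back.
        self.state.update(writes)
        self.dirty.update(writes)
        self.held_output.extend(packets)

    def checkpoint(self, backup: Backup) -> bool:
        # Stage 1: stop and copy the changed state into a buffer.
        staged = dict(self.dirty)
        self.dirty.clear()
        covered = len(self.held_output)
        # Stages 2 and 3: transmit the buffer and wait for the acknowledgement.
        if not backup.receive(staged):
            self.dirty = staged     # nothing was acknowledged; keep the changes
            return False
        # Stage 4: release only the output covered by this checkpoint.
        self.released_output.extend(self.held_output[:covered])
        del self.held_output[:covered]
        return True
```

In this sketch a reply produced during an epoch becomes externally visible only after the backup holds the state that produced it, which is the property the buffering is meant to provide.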

6 Design and implementation

7 Memory and CPU
The guest OS is suspended and dirtied pages are copied to a buffer
  – Because checkpoints are so frequent, most memory is unchanged between them
  – The guest’s entire physical memory is mapped once at the beginning instead of being mapped and unmapped at every checkpoint
The guest resumes execution on the current host
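A minimal sketch of the dirty-page idea, assuming a software-tracked set of page numbers. A real hypervisor tracks dirtied pages with hardware and hypervisor support; the `GuestMemory` class and its methods are hypothetical names used only for illustration.

```python
PAGE_SIZE = 4096


class GuestMemory:
    """Hypothetical guest memory that records which pages were written."""

    def __init__(self, num_pages: int) -> None:
        self.pages = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        self.dirty = set()

    def write(self, page_no: int, offset: int, data: bytes) -> None:
        if offset < 0 or offset + len(data) > PAGE_SIZE:
            raise ValueError("write crosses the page boundary")
        self.pages[page_no][offset:offset + len(data)] = data
        self.dirty.add(page_no)

    def capture_dirty(self) -> dict:
        """Copy only the dirtied pages to a buffer and reset the tracking.

        Meant to be called while the guest is suspended, so that the copy
        is consistent; the guest can resume as soon as this returns.
        """
        snapshot = {n: bytes(self.pages[n]) for n in sorted(self.dirty)}
        self.dirty.clear()
        return snapshot
```

For example, after writes to pages 2 and 5 of an 8-page guest, `capture_dirty()` returns copies of just those two pages, and a second call with no intervening writes returns an empty dict.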

8 Network buffering
All outbound traffic goes to a buffer implemented as a queue
  – Before the primary resumes execution, a barrier is inserted into the outbound queue
  – No packet after the barrier is released
  – When the checkpoint is acknowledged, all packets up to the barrier are released
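The barrier mechanism can be sketched with an ordinary queue and a sentinel object. The `OutputBuffer` class and its method names are assumptions for illustration and do not correspond to the actual Remus network buffer.

```python
from collections import deque

BARRIER = object()  # sentinel marking the end of one checkpoint's output


class OutputBuffer:
    """Hypothetical outbound queue that releases packets one barrier at a time."""

    def __init__(self) -> None:
        self.queue = deque()
        self.sent = []  # stands in for packets put on the wire

    def enqueue(self, packet: bytes) -> None:
        self.queue.append(packet)

    def insert_barrier(self) -> None:
        # Called just before the primary resumes execution.
        self.queue.append(BARRIER)

    def release_to_barrier(self) -> None:
        # Called when the backup acknowledges a checkpoint.
        if not any(item is BARRIER for item in self.queue):
            return  # nothing has been checkpointed yet
        while self.queue:
            item = self.queue.popleft()
            if item is BARRIER:
                break
            self.sent.append(item)
```

Packets enqueued after the barrier stay in the queue until the next barrier is inserted and its checkpoint is acknowledged.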

9 Disk buffering
Before the protection system starts, the current state of the disk on the primary is copied to the backup host.
Writes are committed immediately on the primary and buffered on the backup host.
Once the backup has received and acknowledged the full checkpoint, it commits the buffered writes to its hard disk.
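A rough sketch of the backup side of disk buffering, assuming a disk modelled as a mapping from sector number to data. `BackupDisk` and its methods are hypothetical names. The `discard_pending` step is an inference from "resumes execution from the most recent checkpoint" rather than something the slides state.

```python
class BackupDisk:
    """Hypothetical backup-side disk with a write buffer per checkpoint."""

    def __init__(self, initial_image: dict) -> None:
        # Copy of the primary's disk taken before protection starts.
        self.disk = dict(initial_image)
        self.pending = []  # writes received for the checkpoint in progress

    def buffer_write(self, sector: int, data: bytes) -> None:
        # Writes are held in memory, not applied to the disk yet.
        self.pending.append((sector, data))

    def commit_checkpoint(self) -> None:
        # Apply buffered writes in arrival order once the checkpoint is complete.
        for sector, data in self.pending:
            self.disk[sector] = data
        self.pending.clear()

    def discard_pending(self) -> None:
        # Assumed failover behaviour: drop writes from an incomplete checkpoint.
        self.pending.clear()
```

Until `commit_checkpoint()` runs, the backup's disk still matches the last complete checkpoint, so it stays consistent with the memory image the backup would resume from.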

10 Disk buffering

11 Detecting failure
If checkpoint acknowledgement times out
  – Primary assumes backup has crashed and disables protection
If checkpoint transmission times out
  – Backup assumes primary has crashed and resumes execution from the most recent checkpoint
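The two timeout rules can be written as a pair of small decision functions. The function names, the returned strings, and the idea of passing elapsed seconds explicitly are assumptions for illustration; the slides do not give concrete timeout values.

```python
def primary_decision(seconds_since_ack: float, ack_timeout: float) -> str:
    """Primary side: react to a missing checkpoint acknowledgement."""
    if seconds_since_ack > ack_timeout:
        # The backup is assumed to have crashed; run unprotected.
        return "disable-protection"
    return "keep-replicating"


def backup_decision(seconds_since_data: float, transmit_timeout: float) -> str:
    """Backup side: react to a stalled checkpoint transmission."""
    if seconds_since_data > transmit_timeout:
        # The primary is assumed to have crashed; take over.
        return "resume-from-last-checkpoint"
    return "keep-waiting"
```

Both rules are purely timeout-based, so each host acts on an assumption about its peer rather than on confirmed knowledge of a crash.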

12 Evaluation
Correctness verification
  – Kernel compilation (CPU, memory, disks)
  – X11 client (network)
  – Network failures introduced at every stage
  – Backup took over successfully
  – Forced file system check reported no inconsistencies

13 Evaluation


16 Optimizations
Deadline scheduling
  – The checkpoint rate could be adjusted between checkpoints, depending on the number of dirtied pages
Page compression
  – Check each page against a cache of previously transmitted pages, and transmit only its delta
Copy-on-write checkpoints
  – Mark pages as copy-on-write and copy them from the COW buffer
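The page compression idea can be sketched as follows. The slides only say that a delta against a cached copy is transmitted; the choice of an XOR delta compressed with `zlib`, and the `DeltaSender` and `DeltaReceiver` classes, are assumptions made for this example.

```python
import zlib


class DeltaSender:
    """Hypothetical sender that caches the last transmitted copy of each page."""

    def __init__(self) -> None:
        self.cache = {}

    def encode(self, page_no: int, page: bytes) -> tuple:
        previous = self.cache.get(page_no)
        self.cache[page_no] = page
        if previous is None or len(previous) != len(page):
            return ("full", page)  # nothing usable to diff against
        # XOR against the cached copy: unchanged bytes become zero,
        # so the result compresses well when few bytes changed.
        delta = bytes(a ^ b for a, b in zip(previous, page))
        return ("delta", zlib.compress(delta))


class DeltaReceiver:
    """Hypothetical receiver that rebuilds pages from full copies or deltas."""

    def __init__(self) -> None:
        self.pages = {}

    def decode(self, page_no: int, message: tuple) -> bytes:
        kind, payload = message
        if kind == "full":
            page = payload
        else:
            delta = zlib.decompress(payload)
            page = bytes(a ^ b for a, b in zip(self.pages[page_no], delta))
        self.pages[page_no] = page
        return page
```

A page that was sent before and then changes in a single byte is transmitted as a compressed delta far smaller than the page itself, while a page seen for the first time is still sent in full.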

17 Thank you

