Remus: High Availability via Asynchronous Virtual Machine Replication.

Introduction
Provides OS- and application-agnostic high availability on commodity hardware
Based on the ability of virtualization to migrate running VMs between physical hosts
Checkpoints are taken at very high frequency (up to 40 times/sec)
Replicates whole-system state: CPU state, memory, hard disks

Goals
Generality
  – Low-level service
  – Independent of the application and the hardware
Transparency
  – No modifications to OS or application code
Seamless failure recovery
  – No externally visible state is lost
  – Failure recovery should be very rapid

Approach
Virtualized infrastructure allows whole-system replication
Speculative execution (the primary runs ahead of the last acknowledged checkpoint) increases system performance
Buffering allows synchronization with the replicated server to be asynchronous

Design and implementation
Pipelined checkpoints
Each epoch is divided into four stages:
1. Stop execution and copy any changed state (CPU, memory, disks) to a buffer.
2. Transmit the buffered state to the backup. All network output is buffered in the meantime.
3. The backup acknowledges the checkpoint.
4. Buffered network output is released.
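The four stages can be sketched as a small in-process simulation. This is an illustrative model only: the names `Primary`, `Backup`, and `run_epoch` are hypothetical and not taken from the Remus code, and transmission is modelled as a plain function call rather than a network transfer.

```python
class Backup:
    """Hypothetical backup host: holds the last acknowledged checkpoint."""

    def __init__(self):
        self.state = {}

    def receive(self, checkpoint):
        self.state.update(checkpoint)
        return True  # stage 3: acknowledge the checkpoint


class Primary:
    """Hypothetical primary host whose output is held until acknowledged."""

    def __init__(self, backup):
        self.backup = backup
        self.state = {}
        self.dirty = set()
        self.pending_output = []  # network output held back
        self.released = []        # output visible to the outside world

    def write(self, key, value):
        self.state[key] = value
        self.dirty.add(key)

    def send(self, packet):
        self.pending_output.append(packet)

    def run_epoch(self):
        # Stage 1: pause and copy changed state into a buffer.
        buffered = {k: self.state[k] for k in self.dirty}
        self.dirty.clear()
        held, self.pending_output = self.pending_output, []
        # Stage 2: transmit the buffered state (synchronous in this sketch).
        acked = self.backup.receive(buffered)
        # Stages 3 and 4: release held output only after the acknowledgement.
        if acked:
            self.released.extend(held)
        return acked
```

A packet sent during an epoch stays invisible to the outside world until `run_epoch()` returns with an acknowledgement, which is what keeps externally visible state consistent with the backup.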

Memory and CPU
The guest OS is suspended and dirtied pages are copied to a buffer
  – Because checkpoints are so frequent, most memory is unchanged between them
  – The guest's entire physical memory is mapped once at the beginning, instead of being mapped and unmapped at every checkpoint
The guest resumes execution on the current host
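As a rough illustration of why copying only dirtied pages is cheap, the sketch below tracks writes per page and copies just those pages at checkpoint time. `GuestMemory`, `write_page`, and the 4096-byte page size are assumptions made for the example; real dirty tracking is done by the hypervisor, not by guest-visible code.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class GuestMemory:
    """Hypothetical guest memory with per-page dirty tracking."""

    def __init__(self, num_pages):
        self.pages = [bytes(PAGE_SIZE) for _ in range(num_pages)]
        self.dirty = set()

    def write_page(self, index, data):
        if len(data) != PAGE_SIZE:
            raise ValueError("data must be exactly one page")
        self.pages[index] = data
        self.dirty.add(index)

    def checkpoint(self):
        """Copy only the dirtied pages to a buffer and reset tracking."""
        buffer = {i: self.pages[i] for i in sorted(self.dirty)}
        self.dirty.clear()
        return buffer
```

With eight pages and a single write, the checkpoint buffer holds one page rather than eight, and a second checkpoint taken immediately afterwards is empty.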

Network buffering
All outbound traffic goes to a buffer implemented as a queue
  – Before the primary resumes execution, a barrier is inserted into the outbound queue
  – No packet behind the barrier is released
  – When the checkpoint is acknowledged, all packets up to the barrier are released
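The barrier mechanism can be sketched with an ordinary queue and a sentinel value. The class `OutputBuffer` and its method names are hypothetical and only illustrate the idea; the actual buffer sits in the host's network path.

```python
from collections import deque

BARRIER = object()  # sentinel marking a checkpoint boundary


class OutputBuffer:
    """Hypothetical outbound packet queue with checkpoint barriers."""

    def __init__(self):
        self.queue = deque()

    def enqueue(self, packet):
        self.queue.append(packet)

    def insert_barrier(self):
        # Called just before the primary resumes execution.
        self.queue.append(BARRIER)

    def release_to_barrier(self):
        """On acknowledgement, release packets up to the oldest barrier."""
        if BARRIER not in self.queue:
            return []  # no checkpoint boundary queued, release nothing
        released = []
        while self.queue:
            item = self.queue.popleft()
            if item is BARRIER:
                break
            released.append(item)
        return released
```

Packets queued after the barrier belong to the next epoch and stay in the queue until that epoch's checkpoint is acknowledged.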

Disk buffering
Before protection starts, the current state of the primary's disk is copied to the backup host.
Writes are committed immediately on the primary and buffered on the backup host.
Once the backup has received and acknowledged the full checkpoint, it commits the buffered writes to disk.
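A minimal sketch of the backup side, assuming the disk can be modelled as a mapping from block number to data. `BackupDisk` and its methods are hypothetical names; `discard_pending` reflects the assumption that writes from an incomplete checkpoint are dropped if the primary fails, so the disk matches the most recent acknowledged checkpoint.

```python
class BackupDisk:
    """Hypothetical backup-side disk that buffers writes per checkpoint."""

    def __init__(self, initial_blocks):
        # Copy of the primary's disk taken before protection starts.
        self.blocks = dict(initial_blocks)
        self.pending = []  # writes belonging to the checkpoint in progress

    def buffer_write(self, block_no, data):
        self.pending.append((block_no, data))

    def commit_checkpoint(self):
        # Apply buffered writes in arrival order once the checkpoint is complete.
        for block_no, data in self.pending:
            self.blocks[block_no] = data
        self.pending.clear()

    def discard_pending(self):
        # Assumed behaviour on primary failure: drop uncommitted writes.
        self.pending.clear()
```

Keeping the writes in memory until the checkpoint completes means the on-disk image at the backup never reflects a half-finished epoch.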

Detecting failure
If the checkpoint acknowledgement times out
  – The primary assumes the backup has crashed and disables protection
If checkpoint transmission times out
  – The backup assumes the primary has crashed and resumes execution from the most recent checkpoint
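The two timeout rules can be sketched with a small monitor that is fed timestamps explicitly. `TimeoutMonitor`, `primary_step`, `backup_step`, and the timeout values used are all hypothetical; the slides do not state how long either side waits.

```python
class TimeoutMonitor:
    """Hypothetical timer tracking when the peer was last heard from."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.last_seen = now

    def record(self, now):
        self.last_seen = now

    def expired(self, now):
        return now - self.last_seen > self.timeout


def primary_step(ack_monitor, now):
    # The primary watches for checkpoint acknowledgements from the backup.
    return "disable_protection" if ack_monitor.expired(now) else "wait"


def backup_step(checkpoint_monitor, now):
    # The backup watches for checkpoint transmissions from the primary.
    if checkpoint_monitor.expired(now):
        return "resume_from_last_checkpoint"
    return "wait"
```

Passing `now` in as an argument rather than reading a clock keeps the sketch deterministic and easy to test.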

Evaluation
Correctness verification
  – Kernel compilation (CPU, memory, disks)
  – X11 client (network)
  – Network failures introduced at every stage
  – Backup took over successfully
  – Forced file system check reported no inconsistencies

Optimizations
Deadline scheduling
  – The checkpoint rate could be adjusted between checkpoints, depending on the number of dirtied pages
Page compression
  – Check each page against a cache of previously transmitted pages, and transmit only its delta
Copy-on-write checkpoints
  – Mark pages as copy-on-write and copy them from the COW buffer
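The page compression idea can be sketched as follows, assuming an XOR delta against the previously sent copy of the same page. The slide does not specify the delta encoding, so XOR is an assumption here, and `PageCache`, `encode`, and `decode` are hypothetical names.

```python
def xor_delta(old, new):
    """Byte-wise XOR of two equal-length pages; unchanged bytes become 0."""
    return bytes(a ^ b for a, b in zip(old, new))


class PageCache:
    """Hypothetical sender-side cache of previously transmitted pages."""

    def __init__(self):
        self.cache = {}

    def encode(self, page_no, page):
        prev = self.cache.get(page_no)
        self.cache[page_no] = page
        if prev is None:
            return ("full", page)  # first transmission: send the whole page
        return ("delta", xor_delta(prev, page))


def decode(prev, message):
    """Receiver side: rebuild the page from the previous copy and the message."""
    kind, payload = message
    return payload if kind == "full" else xor_delta(prev, payload)
```

Because most bytes of a re-dirtied page are typically unchanged, the delta is mostly zeros and should compress well with any general-purpose compressor applied afterwards.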

Thank you
