Remus: High Availability via Asynchronous Virtual Machine Replication
Introduction
Provides OS- and application-agnostic high availability on commodity hardware
Based on the ability of virtualization to migrate running VMs between physical hosts
Checkpoints are taken at very high frequency (up to 40 times/sec)
Replicates whole-system state: CPU state, memory, hard disks
Goals
Generality
– Low-level service
– Regardless of the application or the hardware
Transparency
– No modifications made to OS or application code
Seamless failure recovery
– No externally visible state lost
– Failure recovery should be very rapid
Approach
Virtualized infrastructure allows whole-system replication
Speculative execution increases system performance
Buffering allows synchronization with the replicated server to be asynchronous
Design and implementation
Pipelined checkpoints
Each epoch is divided into four stages:
1. Stop execution and copy any changed state (CPU, memory, disks) to a buffer.
2. Transmit the buffered state to the backup. All network output is buffered in the meantime.
3. The backup acknowledges the checkpoint.
4. The buffered network output is released.
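The four stages above can be sketched as a toy, single-process model. This is only an illustration under stated assumptions: the `Primary` and `Backup` classes, the dictionary standing in for guest state, and all method names are hypothetical and are not Remus or Xen APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Backup:
    """Hypothetical backup host holding the last acknowledged checkpoint."""
    state: dict = field(default_factory=dict)

    def receive(self, delta: dict) -> bool:
        self.state.update(delta)
        return True  # stage 3: acknowledge the checkpoint


@dataclass
class Primary:
    """Hypothetical primary host running the protected guest."""
    backup: Backup
    state: dict = field(default_factory=dict)         # live guest state
    dirty: set = field(default_factory=set)           # keys changed this epoch
    held_output: list = field(default_factory=list)   # buffered network output
    released: list = field(default_factory=list)      # output visible to clients

    def guest_write(self, key, value):
        self.state[key] = value
        self.dirty.add(key)

    def guest_send(self, packet):
        self.held_output.append(packet)  # never sent before the checkpoint is acked

    def checkpoint(self):
        # Stage 1: stop the guest and copy changed state to a buffer.
        buffered = {key: self.state[key] for key in self.dirty}
        self.dirty.clear()
        pending, self.held_output = self.held_output, []
        # Stage 2: transmit the buffer. Stage 3: wait for the acknowledgement.
        if self.backup.receive(buffered):
            # Stage 4: release the output produced before this checkpoint.
            self.released.extend(pending)
```

In this model a packet sent by the guest stays invisible to clients until `checkpoint()` has completed, which is the property that keeps externally visible state from being lost on failover.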
Memory and CPU
The guest OS is suspended and dirtied pages are copied to a buffer
– Because checkpoints are so frequent, most memory is unchanged between them
– The guest's entire physical memory is mapped once at the beginning, instead of mapping and unmapping pages at every checkpoint
The guest then resumes execution on the current host
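A minimal sketch of the dirty-page copy, assuming guest memory is modelled as one `bytearray` mapped up front and dirtiness is tracked in a simple list of booleans. `PAGE_SIZE`, `copy_dirty_pages`, and the bitmap layout are illustrative choices, not the hypervisor's actual interface.

```python
PAGE_SIZE = 4096  # assumed page size for this illustration


def copy_dirty_pages(memory: bytearray, dirty_bitmap: list) -> dict:
    """Copy every dirtied page into a staging buffer and clear its dirty bit.

    The guest is assumed to be paused while this runs, so the copy is consistent.
    """
    staged = {}
    for page_no, is_dirty in enumerate(dirty_bitmap):
        if is_dirty:
            start = page_no * PAGE_SIZE
            staged[page_no] = bytes(memory[start:start + PAGE_SIZE])
            dirty_bitmap[page_no] = False
    return staged
```

Because only pages flagged in the bitmap are copied, the pause is proportional to the amount of memory changed since the previous checkpoint rather than to total guest memory.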
Network buffering
All outbound traffic goes to a buffer implemented as a queue
– Before the primary resumes execution, a barrier is inserted into the outbound queue
– No packet after the barrier is released
– When the checkpoint is acknowledged, all packets up to the barrier are released
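The barrier mechanism can be sketched with a plain queue and a sentinel object. The `OutputBuffer` class and its method names are hypothetical and exist only for this illustration.

```python
from collections import deque

BARRIER = object()  # sentinel marking a checkpoint boundary in the queue


class OutputBuffer:
    """Hypothetical outbound queue on the primary."""

    def __init__(self):
        self.queue = deque()
        self.sent = []  # packets actually released to the network

    def enqueue(self, packet):
        self.queue.append(packet)

    def insert_barrier(self):
        """Called just before the primary resumes after a checkpoint."""
        self.queue.append(BARRIER)

    def on_checkpoint_ack(self):
        """Release packets up to the oldest barrier, and nothing beyond it."""
        if BARRIER not in self.queue:
            return  # no completed checkpoint to release output for
        while self.queue:
            item = self.queue.popleft()
            if item is BARRIER:
                break
            self.sent.append(item)
```

Packets queued after the barrier belong to the next epoch and stay buffered until that epoch's checkpoint is acknowledged.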
Disk buffering
Before the protection system starts, the current state of the disk on the primary is copied to the backup host.
Writes are committed immediately on the primary and buffered on the backup host.
Once the backup has received the full checkpoint and acknowledged it, it commits the buffered writes to the hard disk.
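A rough sketch of the backup side of disk buffering, modelling the disk as a dictionary from sector number to data. The `BackupDisk` class is hypothetical, and `discard_pending` reflects an assumption about failover (that writes from an incomplete checkpoint are dropped) rather than something stated on the slide.

```python
class BackupDisk:
    """Hypothetical backup-side disk with a buffer for uncommitted writes."""

    def __init__(self, initial_image: dict):
        self.disk = dict(initial_image)  # copy taken before protection starts
        self.pending = []                # writes belonging to the in-progress checkpoint

    def mirror_write(self, sector: int, data: bytes):
        """Buffer a write that the primary has already applied locally."""
        self.pending.append((sector, data))

    def commit_checkpoint(self):
        """Apply buffered writes once the full checkpoint is received and acknowledged."""
        for sector, data in self.pending:
            self.disk[sector] = data
        self.pending.clear()

    def discard_pending(self):
        """Assumed failover behaviour: drop writes from the incomplete checkpoint."""
        self.pending.clear()
```

Keeping uncommitted writes out of `disk` means the backup's disk always matches the memory and CPU state of the last complete checkpoint.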
Detecting failure
If a checkpoint acknowledgement times out
– The primary assumes the backup has crashed and disables protection
If a checkpoint transmission times out
– The backup assumes the primary has crashed and resumes execution from the most recent checkpoint
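The two timeout rules can be written as a pair of small decision functions. Timestamps are passed in explicitly so the logic is easy to test; the `Action` enum, the function names, and the `timeout` parameter are assumptions made for this sketch.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    DISABLE_PROTECTION = "disable protection"                 # primary's reaction
    RESUME_FROM_CHECKPOINT = "resume from last checkpoint"    # backup's reaction


def primary_check(now: float, last_ack: float, timeout: float) -> Action:
    """Primary side: has the backup stopped acknowledging checkpoints?"""
    if now - last_ack > timeout:
        return Action.DISABLE_PROTECTION
    return Action.CONTINUE


def backup_check(now: float, last_checkpoint: float, timeout: float) -> Action:
    """Backup side: has the primary stopped sending checkpoints?"""
    if now - last_checkpoint > timeout:
        return Action.RESUME_FROM_CHECKPOINT
    return Action.CONTINUE
```

For example, with a one-second timeout, `backup_check(now=12.0, last_checkpoint=10.5, timeout=1.0)` tells the backup to take over.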
Evaluation
Correctness verification
– Kernel compilation (CPU, memory, disks)
– X11 client (network)
– Network failures introduced at every stage
– Backup took over successfully
– Forced file system check reported no inconsistencies
Optimizations
Deadline scheduling
– The checkpoint rate could be adjusted between checkpoints, depending on the number of dirtied pages
Page compression
– Check each page against a cache of previously transmitted pages, and transmit only its delta
Copy-on-write checkpoints
– Mark pages as copy-on-write and copy them from the COW buffer
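As an illustration of the page compression idea, the sketch below keeps a cache of the last transmitted version of each page, XORs a new version against it, and compresses the mostly-zero result with `zlib`. The XOR-plus-zlib encoding, the function names, and the message format are assumptions chosen for this example, not necessarily the encoding Remus uses.

```python
import zlib


def encode_page(cache: dict, page_no: int, page: bytes) -> tuple:
    """Sender side: return a full page on a cache miss, otherwise a compressed delta."""
    previous = cache.get(page_no)
    cache[page_no] = page
    if previous is None:
        return ("full", page)
    # XOR against the previous version; unchanged bytes become zero and compress well.
    delta = bytes(a ^ b for a, b in zip(previous, page))
    return ("delta", zlib.compress(delta))


def decode_page(cache: dict, page_no: int, message: tuple) -> bytes:
    """Receiver side: rebuild the page from a full copy or from a delta."""
    kind, payload = message
    if kind == "full":
        page = payload
    else:
        delta = zlib.decompress(payload)
        page = bytes(a ^ b for a, b in zip(cache[page_no], delta))
    cache[page_no] = page
    return page
```

Both sides must keep their caches in step, and pages are assumed to have a fixed size; when only a few bytes of a page change between checkpoints, the delta message is far smaller than the page itself.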
Thank you