Reading Report: Cost of VM Live Migration, by William Voorsluys, James Broberg, Srikumar Venugopal, and Rajkumar Buyya. CLOUD 2009.

2 The paper's goal
- Migration overhead is acceptable but cannot be disregarded, especially when the SLA is strict.
- The paper gives a performance evaluation of live migration.

3 Related Works(1)
- Multicore [8, 9], paravirtualization [1], hardware-assisted virtualization [10], live migration [3].
- Individual measurements of the VM runtime overhead imposed by hypervisors on a variety of workloads [1, 11, 12].
- The impact of consolidating several applications on a single server running Xen [13].

4 Related Works(2)
- [15] studies the performance degradation when migrating CPU- and memory-intensive workloads, and when migrating multiple VMs at the same time, in a stop-and-copy way.
- [3] quantifies the effects of live migration on a set of four applications common to hosting environments, focusing primarily on downtime and total migration time and demonstrating the viability of live migration.
- These works did not evaluate the effect of migration on the performance of modern Internet workloads, such as multi-tier and social-network-oriented applications.

5 Related Works(3)
- [16] evaluates the efficacy of migrating VMs across long distances, such as over the Internet.
- The vConsolidate benchmark [14]: a Web server, a database server, a Java server, a mail server, and an idle server.
- The Cloudstone benchmark [17] aims at computing the monetary cost, in dollars/user/month, of hosting Web 2.0 applications on cloud computing platforms.

6 Background: advantage of live migration
- Live (or hot) migration: the hypervisor migrates an OS while it continues to run.
- Stop-and-copy (or cold) migration: the VM is halted, all its memory pages are copied to the destination, and the new VM is then restarted.
- The advantage of live migration: an OS can be migrated with near-zero downtime.
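
The difference between the two approaches can be illustrated with a rough model. The sketch below assumes an iterative pre-copy scheme of the kind described in the live-migration literature cited as [3]: memory is sent while the VM keeps running, and each later round resends only the pages dirtied during the previous round. The function names, the stop threshold, and every number in the usage note are hypothetical, not taken from the paper.

```python
def stop_and_copy_downtime(mem_mb, bw_mb_s):
    """Cold migration: the VM is halted for the whole memory transfer."""
    return mem_mb / bw_mb_s


def precopy_migration(mem_mb, bw_mb_s, dirty_mb_s,
                      stop_threshold_mb=50.0, max_rounds=30):
    """Simplified pre-copy model (an assumption, not the paper's method).

    Returns (total_time_s, downtime_s, rounds).
    """
    to_send = mem_mb
    total = 0.0
    rounds = 0
    while to_send > stop_threshold_mb and rounds < max_rounds:
        round_time = to_send / bw_mb_s
        total += round_time
        # Pages dirtied while this round was in flight must be resent,
        # but never more than the whole memory.
        to_send = min(mem_mb, dirty_mb_s * round_time)
        rounds += 1
    # Final step: pause the VM and send whatever is still dirty.
    downtime = to_send / bw_mb_s
    return total + downtime, downtime, rounds
```

For instance, `precopy_migration(2048, 100, 20)` models a 2 GB VM on a link moving about 100 MB/s with 20 MB/s of page dirtying (all illustrative values). In this model the VM is paused only for the last small transfer, whereas `stop_and_copy_downtime(2048, 100)` pauses it for the entire copy. A higher dirty rate leaves more memory for the final pause, which is one way a heavy workload can lengthen downtime.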

7 Background: characteristics of modern Internet applications
- Highly dynamic and interactive features have driven the rapid growth of Web 2.0 applications.
- Social networking features make each user's actions affect many other users, which makes static load partitioning unsuitable as a scaling strategy.
- By means of blogs, photostreams, and tagging, users now publish content to one another rather than just consuming static content.

8 Testbed specifications
- A cluster of 6 servers: 1 head-node and 5 VM hosts.
- Each server has an Intel Xeon E5410 (2.33 GHz quad-core with 2x6 MB L2 cache and Intel VT technology), 4 GB memory, and a 7200 rpm hard drive, and is connected through a Gigabit Ethernet switch.
- Head-node: Ubuntu Server 7.10 with no hypervisor.
- Other nodes: Citrix XenServer Enterprise Edition.
- All VMs run 64-bit Ubuntu Linux 8.04 Server Edition, paravirtualized kernel version [...].
- The installed web server is Apache running in prefork mode. PHP version is [...]. MySQL, with the InnoDB engine, is version [...].

9 Workload
- Olio [18] as a Web 2.0 application, combined with the Faban load generator [19].
- Olio's PHP implementation is used, employing the popular LAMP stack (Linux, Apache, MySQL, PHP).
- The Olio/Faban combination was originally proposed as part of the Cloudstone benchmark [17].
- The main metric: the Service Level Agreement defined in Cloudstone.

10 Cloudstone's SLA
Table 1: The 90th/99th percentile of response times measured in any 5-minute window during steady state should not exceed the following values (in seconds)
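
A percentile SLA of this form can be checked mechanically. The following is a minimal sketch, assuming response-time samples are available as `(timestamp_s, response_time_s)` pairs. It uses fixed, non-overlapping 5-minute windows and a nearest-rank percentile, both simplifications chosen here and not necessarily what the Cloudstone tooling does. The function names are hypothetical.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sla_violations(samples, limit_s, pct=99, window_s=300):
    """Return [(window_start_s, percentile_value)] for every window whose
    pct-th percentile response time exceeds limit_s."""
    windows = defaultdict(list)
    for timestamp_s, response_s in samples:
        window_start = int(timestamp_s // window_s) * window_s
        windows[window_start].append(response_s)

    violations = []
    for window_start in sorted(windows):
        value = percentile(windows[window_start], pct)
        if value > limit_s:
            violations.append((window_start, value))
    return violations
```

With the 1-second homepage limit quoted on slide 19, a call would look like `sla_violations(samples, limit_s=1.0, pct=99)`. Because only the slowest 1% of requests in a window are ignored, a few seconds of migration-induced delay under heavy load can be enough to push the 99th percentile over the limit, while the 90th percentile is more forgiving.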

11 Benchmarking architecture(1)

12 Benchmarking architecture(2)
- MySQL tends to be CPU-bound when serving the Olio database.
- Apache/PHP tends to be memory-bound [17].
- All nodes share an NFS (Network File System) mounted storage device, which resides in the head-node and stores VM images and virtual disks.
- A local virtual disk is hosted in the server that hosts MySQL.
- The load is driven from the head-node, where the multi-threaded workload drivers run, along with Faban's master component.

13 Experimental objective
- To quantify the slowdown and downtime experienced by the application when VM migrations are performed in the middle of a run.
- Across a series of runs, a VM was not simply migrated back and forth between the same two machines.

14 Preliminary experiments
- Purpose: to define exact VM sizes, without any migration.
- Load is driven against Olio while the number of concurrent users is gradually increased.
- By analyzing the SLA, the authors found that 600 is the maximum number of concurrent users.
- From memory and CPU usage, they found the minimum VM sizes able to serve 600 users:
  - VM hosting Apache/PHP: 1 vCPU, 2 GB memory
  - VM hosting MySQL: 2 vCPU, 1 GB memory
- Hosting MySQL on NFS supports only 400 users.
- MySQL's virtual disk is therefore kept local, and the experiments do not include database server migration.
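
The ramp-up procedure described above amounts to a simple search. The sketch below assumes a hypothetical callable `run_benchmark(users)` that performs one run at the given load and returns the percentile response time in seconds. In the actual study this role is played by Olio driven by Faban, which is not modeled here.

```python
def max_supported_users(run_benchmark, levels, limit_s):
    """Increase the load step by step and return the highest level whose
    measured percentile response time stays within limit_s.

    Returns None if even the first level violates the limit.
    """
    best = None
    for users in levels:
        if run_benchmark(users) > limit_s:
            break  # first violating level: stop ramping up
        best = users
    return best


def stub_benchmark(users):
    """Stand-in for a real measurement (made-up values, illustration only)."""
    return 0.3 if users <= 600 else 1.5
```

A call such as `max_supported_users(stub_benchmark, range(100, 1001, 100), limit_s=1.0)` walks the load levels in order and stops at the first violation. The same loop, run with migrations enabled and lower levels, matches the later search for a "safe" load level.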

15 Migration Experiment
- First set of experiments with Olio: 10-minute and 20-minute benchmark runs with 600 concurrent users.
- Goal: to evaluate how the SLA is violated when the system is nearly oversubscribed but not overloaded, and to quantify the downtime when live migrations happen.
- Then the benchmark is run with smaller numbers of concurrent users, namely 100, 200, ..., 500, searching for a "safe" level (lower risk of SLA violation).

16 Result and Discussion(1)
- Results show that the overhead due to live migration is acceptable but cannot be disregarded, especially in SLA-oriented environments requiring more demanding service levels.

17 Result and Discussion(2)
Fig. 2. Effects of a live migration on Olio's homepage loading activity

18 Result and Discussion(3)
- Figure 2 shows the effect of a single migration performed after five minutes in steady state of one run.
- A downtime of 3 seconds is experienced near the end of a 44-second migration.
- The highest peak in response times occurs immediately after the VM resumes on the destination node.
- 5 seconds elapse until the system can fully serve all requests that were initiated during the downtime.
- In spite of that, no requests were dropped or timed out due to application downtime.
- The downtime experienced by Olio when serving 600 concurrent users is well above the millisecond level previously reported in the literature for a range of workloads [3]. This interesting result suggests that the workload complexity imposes an unusual memory
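
To put these figures in proportion, the short calculation below relates the reported 3-second downtime to the 44-second migration and to a 10-minute benchmark run. It assumes the 5-second recovery period directly follows the downtime. The variable names are illustrative.

```python
# Figures reported for the single-migration run (Figure 2).
MIGRATION_S = 44   # total migration time
DOWNTIME_S = 3     # service unavailable near the end of the migration
RECOVERY_S = 5     # time to serve requests issued during the downtime

# Assumption: a 10-minute run, as in the migration experiments.
RUN_S = 10 * 60

downtime_share_of_migration = DOWNTIME_S / MIGRATION_S
downtime_share_of_run = DOWNTIME_S / RUN_S
# Assumption: recovery starts as soon as the VM resumes.
disturbed_s = DOWNTIME_S + RECOVERY_S

print(f"downtime / migration time: {downtime_share_of_migration:.1%}")
print(f"downtime / 10-minute run:  {downtime_share_of_run:.1%}")
print(f"downtime + recovery:       {disturbed_s} s")
```

The downtime is a small fraction of the run, yet under a 99th-percentile SLA only 1% of requests in a window may exceed the limit, so a disturbance of this length can still matter at high load.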

19 Result and Discussion(4)
Fig. 3. 90th and 99th percentile SLA computed for the homepage loading response time with 600 concurrent users. The maximum allowed response time is 1 second.

20 Result and Discussion(5)
- Figure 3 presents the effect of multiple migrations on the homepage loading response times. These results correspond to the average of 5 runs.
- It is paramount that this information be used by SLA-oriented VM-allocation mechanisms to reduce the risk of SLA non-compliance in situations where VM migrations are inevitable.

21 Result and Discussion(6)

22 Result and Discussion(7)
- Table 2 presents more detailed results, listing the maximum response times for all user actions as computed by the 99th percentile SLA formula when one migration was performed in the middle of a 10-minute run.
- These results indicate that a workload of 500 users is the load level at which a live migration of the Web server should be carried out (e.g. to a least-loaded server) in order to decrease the risk of SLA violation.
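
One way an SLA-oriented allocator might act on this finding is sketched below. It is a minimal illustration, not a mechanism from the paper: the 500-user threshold comes from the slide above, while the function names, the load metric, and the host names in the usage note are hypothetical.

```python
SAFE_USERS = 500  # load level reported as safe for migrating the Web server


def can_migrate(current_users, safe_users=SAFE_USERS):
    """Allow a live migration only at or below the safe load level."""
    return current_users <= safe_users


def pick_target(host_loads):
    """Return the name of the least-loaded host.

    host_loads maps host name -> load (any comparable metric).
    """
    return min(host_loads, key=host_loads.get)


def plan_migration(current_users, host_loads):
    """Return the target host, or None if migrating now is too risky
    or no candidate host is available."""
    if not host_loads or not can_migrate(current_users):
        return None
    return pick_target(host_loads)
```

For example, `plan_migration(450, {"host-a": 0.7, "host-b": 0.2})` selects `"host-b"`, while the same call with 600 users returns `None`, deferring the migration until the load drops.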

