Presentation transcript: “mClock: Handling Throughput Variability for Hypervisor IO Scheduling” (USENIX OSDI 2010)

1 mClock: Handling Throughput Variability for Hypervisor IO Scheduling. USENIX Symposium on Operating Systems Design and Implementation (OSDI) 2010. Ajay Gulati, VMware Inc.; Arif Merchant, HP Labs; Peter Varman, Rice University

2 Outline: Introduction, Scheduling goals of mClock, mClock Algorithm, Distributed mClock, Performance Evaluation, Conclusion

3 Introduction
Hypervisors are responsible for multiplexing the underlying hardware resources among VMs – CPU, memory, network and storage IO.
[Figure: two hosts, each with CPU, RAM and a storage IO scheduler, sharing a storage array.]
The amounts of CPU and memory on a host are fixed and time-invariant, but the storage throughput available to a host is not under its own control.

4 Introduction (cont’d)
Existing methods provide many knobs for allocating CPU and memory to VMs. The current state of the art in IO resource allocation is much more rudimentary:
– Limited to providing proportional shares to different VMs.
Lack of QoS support for IO resources can have widespread effects, rendering existing CPU and memory controls ineffective when applications block on IO requests.

5 Introduction (cont’d)
The IO throughput available to any particular host can fluctuate widely based on the behavior of other hosts accessing the shared device.
[Figure: throughput seen by one host over time, shifting as VM1, VMs 2–3, VM4 and VM5 start and stop.]

6 Introduction (cont’d)
Three main controls in resource allocation:
– Shares (a.k.a. weights): proportional resource allocation.
– Reservations: a minimum resource allocation, providing a latency guarantee.
– Limits: a maximum allowed resource allocation, preventing competing IO-intensive applications from consuming all the spare bandwidth in the system.

7 Scheduling goals of mClock
– When reservations cannot be met: allocate throughput in proportion to the reservations.
– When reservations can be met: satisfy reservations first, then allocate the remainder in proportion to the weights.
– Limit the maximum throughput of the VM labeled DM in the example figure.

8 Scheduling goals of mClock (cont’d)
Each VM v_i has three parameters:
– Reservation (r_i), Limit (l_i), Weight (w_i).
VMs are partitioned into three sets: reservation-clamped (R), limit-clamped (L) or proportional (P), based on whether their current allocation is clamped at the lower bound, clamped at the upper bound, or in between. Define the allocation of v_i as r_i if v_i ∈ R, as l_i if v_i ∈ L, and as w_i × (C − Σ_{j∈R} r_j − Σ_{j∈L} l_j) / Σ_{j∈P} w_j if v_i ∈ P, where C is the total throughput.
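
A minimal sketch of this partition and the resulting allocation, assuming a fixed total capacity C in IOPS. The names and the simple clamp-only fixed-point loop are illustrative, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VM:
    name: str
    r: float  # reservation: minimum IOPS
    l: float  # limit: maximum IOPS
    w: float  # weight: proportional share

def allocations(vms, capacity):
    """Partition VMs into R (reservation-clamped), L (limit-clamped) and
    P (proportional), then split capacity accordingly.

    Fixed point: start with everyone proportional, clamp any VM whose
    weighted share falls below its reservation or above its limit, and
    repeat. Assumes capacity is at least the sum of reservations.
    """
    R, L = set(), set()
    while True:
        P = [v for v in vms if v not in R and v not in L]
        spare = capacity - sum(v.r for v in R) - sum(v.l for v in L)
        wsum = sum(v.w for v in P) or 1.0
        alloc = {v: v.r if v in R else v.l if v in L else spare * v.w / wsum
                 for v in vms}
        moved = False
        for v in P:
            if alloc[v] < v.r:
                R.add(v); moved = True
            elif alloc[v] > v.l:
                L.add(v); moved = True
        if not moved:
            return alloc
```

For instance, with C = 1000 IOPS, a VM with (r, l, w) = (250, inf, 1) and another with (0, 300, 2), the second clamps to its 300 IOPS limit and the first receives the remaining 700 IOPS.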

9 mClock Algorithm
mClock uses two main ideas:
– multiple real-time clocks: reservation-based, limit-based and weight-based clocks;
– dynamic clock selection: dynamically select one of the clocks for scheduling each request.
The tag assignment method is similar to Virtual Clock scheduling.
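
A sketch of per-request tag assignment under these assumptions (the fields R, L and P hold the VM's last assigned tags; the idle-VM case for the P tag is simplified here, since the paper handles it with the tag adjustment described on the next slide):

```python
def assign_tags(vm, now):
    """Assign reservation (R), limit (L) and share (P) tags to a new request.

    Each clock advances by the inverse of its rate, Virtual Clock style;
    max(..., now) keeps a previously idle VM from hoarding past credit.
    """
    vm.R = max(vm.R + 1.0 / vm.r, now)  # spaced 1/r_i apart
    vm.L = max(vm.L + 1.0 / vm.l, now)  # spaced 1/l_i apart
    vm.P = max(vm.P + 1.0 / vm.w, now)  # weight-based; see tag adjustment
```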

10 mClock Algorithm (cont’d)
Tag Adjustment: calibrate the proportional-share tags against real time, to prevent starvation. In virtual-time-based scheduling this synchronization is done using global virtual time (S_{i,k} = max{F_{i,k−1}, V(a_{i,k})}). In mClock the reservation and limit tags must be based on real time, so instead the origin of the existing P tags is adjusted to the current real time.
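
One plausible reading of this step as code, assuming each VM object carries its latest P tag:

```python
def adjust_p_tags(active_vms, now):
    """Re-origin the proportional-share (P) tags at the current real time.

    P tags only encode relative order, so when an idle VM becomes active
    the existing tags are shifted so that the smallest P tag equals `now`.
    R and L tags stay untouched: they encode absolute IOPS rates and must
    remain in real time.
    """
    if not active_vms:
        return
    offset = min(vm.P for vm in active_vms) - now
    for vm in active_vms:
        vm.P -= offset
```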

11 mClock Algorithm (cont’d)
[Figure: pseudocode of the scheduler, with callouts.] Reservations are served first; otherwise a request is selected from the VMs not blocked by their limits. Active_IOs counts the queue length, and tag adjustment is applied when an idle VM becomes active.

12 mClock Algorithm (cont’d)
When a request is served in the weight-based phase, the R tags of that VM's outstanding requests are decreased by 1/r_i. This maintains the condition that R tags are always spaced apart by 1/r_i, so that reserved service is not affected by the service provided in the weight-based phase.
[Figure: timeline of tags R_k^1 … R_k^5 spaced 1/r_k apart at current time t; after r_k^3 is served in the weight-based phase, the waiting time of r_k^4 could otherwise exceed 1/r_k.]
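
Putting the two phases together, a hypothetical sketch of the dispatch path (per-VM request queues are assumed; this is not the paper's actual pseudocode):

```python
def dispatch(vms, now):
    """Constraint-satisfying phase first, then the weight-based phase.

    If any reservation tag is due, serve the smallest R tag. Otherwise
    serve the smallest P tag among VMs not blocked by their limit tag,
    and pull that VM's R tag back by 1/r so R tags stay 1/r apart and
    reserved service is unaffected by weight-based service (slide 12).
    """
    due = [v for v in vms if v.queue and v.R <= now]
    if due:
        return min(due, key=lambda v: v.R).queue.pop(0)
    eligible = [v for v in vms if v.queue and v.L <= now]
    if not eligible:
        return None  # every backlogged VM is at its limit
    vm = min(eligible, key=lambda v: v.P)
    vm.R -= 1.0 / vm.r  # maintain 1/r spacing of reservation tags
    return vm.queue.pop(0)
```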

13 Storage-specific Issues
Burst Handling
– Storage workloads are known to be bursty.
– Requests from the same VM often have a high spatial locality.
– Bursty workloads that were idle gain a limited preference in scheduling when the system next has spare capacity.
– To accomplish this, VMs are allowed to gain idle credits, as sketched below.
[Figure: P tags P_k^1, P_k^2 on a timeline; when r_k^3 arrives at time t after an idle period, P_k^3 = max(P_k^2 + 1/w_i, t − σ_i/w_i).]
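
The rule from the diagram as a one-line sketch (σ_i is the per-VM idle-credit parameter; field names are assumed, as above):

```python
def assign_p_tag_with_idle_credit(vm, now, sigma):
    """P tag after an idle period, with idle credits.

    Backdating the tag to at most sigma/w_i behind real time lets up to
    sigma requests of a fresh burst jump ahead when spare capacity
    exists, without changing the long-run weighted allocation.
    """
    vm.P = max(vm.P + 1.0 / vm.w, now - sigma / vm.w)
```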

14 Storage-specific Issues (cont’d)
IO size
– Since larger IO sizes take longer to complete, differently-sized IOs should not be treated equally by the IO scheduler.
– The IO latency with n random outstanding IOs of size S each can be written as n × (T_m + S/B_peak), where T_m is the mechanical delay due to seek and disk rotation, and B_peak is the peak transfer bandwidth of the disk.
– Converting the latency observed for IOs of size S_1 to that of a reference size S_2 amounts to scaling by (T_m + S_1/B_peak) / (T_m + S_2/B_peak); for a small reference size the S_2/B_peak term is negligible.
– A single request of IO size S is therefore treated as equivalent to (1 + S/(T_m × B_peak)) IO requests of the reference size.
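
A sketch of this size normalization; the default T_m and B_peak values below are illustrative placeholders, not figures from the paper:

```python
def io_cost(size_bytes, t_m=5e-3, b_peak=60e6):
    """Scheduler cost of one IO relative to a small reference IO.

    From latency = n * (T_m + S / B_peak): the mechanical delay T_m is a
    fixed charge and the transfer time S / B_peak a size-proportional
    one, giving cost = 1 + S / (T_m * B_peak).
    """
    return 1.0 + size_bytes / (t_m * b_peak)

print(io_cost(4 * 1024))     # ~1.01: a 4 KB IO is charged as ~1 unit
print(io_cost(1024 * 1024))  # ~4.5: a 1 MB IO counts as several units
```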

15 Storage-specific Issues (cont’d)
Request Location
– mClock improves the overall efficiency of the system by scheduling IOs with high locality as a batch: a VM is allowed to issue IO requests in a batch as long as the requests are close in logical block number space.
Reservation Setting
– IOPS = outstanding IOs / latency.
– An application that keeps 8 IOs outstanding and requires 25 ms latency needs a reservation of 8 / 0.025 = 320 IOPS, as checked below.
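
A quick check of the slide's arithmetic:

```python
def reservation_iops(outstanding_ios, latency_s):
    """Little's law: required IOPS = outstanding IOs / target latency."""
    return outstanding_ios / latency_s

print(reservation_iops(8, 0.025))  # 320.0, matching the slide's example
```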

16 Distributed mClock
For cluster-based storage systems, dmClock runs a modified version of mClock at each storage server, piggybacking two integers ρ_i and δ_i on each request from VM v_i to a storage server s_j:
– δ_i: the number of IO requests from v_i that have completed service at all the servers between v_i's previous request to s_j and the current request;
– ρ_i: the number of those requests served as part of the constraint-satisfying (reservation) phase.
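
A sketch of the per-server tag assignment with the piggybacked counters, using the same tag fields as the single-server sketches above:

```python
def dmclock_assign_tags(vm, now, rho, delta):
    """dmClock tag assignment at one server, using the slide's counters.

    delta requests from this VM completed at *all* servers since its
    previous request here, rho of them in the constraint-satisfying
    phase. Crediting them stretches this server's clocks so the VM's
    system-wide reservation and share are honored without any
    server-to-server communication.
    """
    vm.R = max(vm.R + rho / vm.r, now)    # reservation clock credits rho
    vm.L = max(vm.L + delta / vm.l, now)  # limit clock credits delta
    vm.P = max(vm.P + delta / vm.w, now)  # share clock credits delta
```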

17 Performance Evaluation
Implemented in the VMware ESX hypervisor, by modifying the SCSI scheduling layer in its IO stack.
The host is a Dell PowerEdge 2950 server with two QLogic HBAs connected to an EMC CLARiiON CX3-40 storage array over a FC SAN. Two storage volumes were used:
– a 10-disk RAID 0 disk group;
– a 10-disk RAID 5 disk group.

18 Performance Evaluation
Two kinds of VMs:
– Linux VMs with a 10 GB virtual disk, one VCPU and 512 MB of memory;
– Windows Server 2003 VMs with a 16 GB virtual disk, one VCPU and 1 GB of memory.
Workload generators:
– Iometer (http://www.iometer.org/) in the Windows Server VMs;
– a self-designed workload generator in the Linux VMs.

19 Performance Evaluation (cont’d)
Limit Enforcement. Workload parameters (RD / OLTP / DM):
– Workload: 32 random IOs (75% read) every 250 ms / always backlogged (75% read) / always backlogged (all sequential reads).
– IO size: 4 KB / 8 KB / 32 KB.
– Latency bound: 30 ms for RD; none (X) for DM.
– Weight: 2 / 2 / 1.
At t = 140 the limit for DM is set to 300 IOPS.

20 Performance Evaluation (cont’d)
Reservations Enforcement
– Five VMs with weights in the ratio 1:1:2:2:2.
– VMs are started at 60-second intervals.
[Figure: SFQ only does proportional allocation, while mClock enforces the 250 and 300 IOPS reservations.]

21 Performance Evaluation (cont’d)
Bursty VM Workloads
– VM1: 128 IOs every 400 ms, all 4 KB reads, 80% random.
– VM2: 16 KB reads, 20% of them random and the rest sequential, with 32 outstanding IOs.
– Idle credits do not impact the overall bandwidth allocation over time.
– The latency seen by the bursty VM1 decreases as the idle credits are increased.

22 Performance Evaluation (cont’d)
Filebench Workloads: emulate the workload of OLTP VMs.
[25] R. McDougall. Filebench: Application-level file system benchmark. http://www.solarisinternals.com/si/tools/filebench/index.php

23 Performance Evaluation (cont’d)
dmClock Evaluation
– Implemented in a distributed storage system that consists of multiple storage servers (nodes).
– Each node is implemented as a virtual machine running RHEL Linux with a 10 GB OS disk and a 10 GB experimental disk.

24 Conclusion
mClock provides per-VM quality of service. The QoS requirements are expressed as:
– a minimum reservation;
– a maximum limit;
– proportional shares (weights).
The controls provided by mClock allow stronger isolation among VMs. The techniques are quite generic and can be applied to array-level scheduling and to other resources, such as network bandwidth allocation, as well.

25 Comments
Existing VM services only provision resources in terms of CPU, memory and storage capacity, but IO throughput may be the largest factor in QoS, in terms of response time or delay. Combining reservations, limits and proportional shares in one scheduling algorithm is a good idea: WF²Q-M considered limits but not reservations. How should reservations, limits and proportional shares be coordinated between VMs on different hosts?

26 Comments (cont’d)
The experiments only validate the correctness of mClock: what about short-term fairness, the latency distribution and the computational overhead? They also use a single host machine, which cannot reflect the throughput variability that arises with multiple hosts.

