CPU Ready Time in VMware ESX Server. Bill Shelden, bill.shelden@PERFMAN.com
Abstract: CPU Ready Time in VMware ESX Server. A performance metric produced by VMware ESX Server is called CPU ready time. It measures the time a virtual CPU in a virtual machine running under ESX Server is ready to be dispatched but is not dispatched. CPU ready times from a real ESX Server system and from a number of published benchmarks are examined and found to be too high to be explained solely by the contention of virtual CPUs for the physical CPUs in the server running ESX Server. Some reasons why virtual CPUs accumulate CPU ready time when physical CPUs are available are examined. One such reason, which has been extensively discussed, is called co-scheduling and applies to SMP virtual machines. In multiprocessor servers an additional factor affects CPU ready time: virtual CPUs that have been scheduled on a particular physical CPU are given a preference to run on the same physical CPU again. In this case the ESX Server scheduler may choose to let a few cycles on a physical CPU stay idle rather than move a ready virtual CPU to another physical CPU. A model of these latter phenomena is discussed, and the model’s predicted CPU ready times are compared to the real data and to the benchmark data.
Topics: Investigate % CPU Ready Time on LPPerfTest, an internal PERFMAN server running as a virtual machine on host devclusterhost2. Is the measured CPU Ready Time reasonable? Compare it to a simple model. Discuss % CPU Ready Time and its causes. Discuss a more robust model of an ESX Server system; apply it to uniprocessor benchmarks, to mixed UNI and SMP benchmarks, and to devclusterhost2. Conclusions.
Spike in % CPU Ready Time at 6 AM on devclusterhost2: about 1800 seconds of CPU Ready Time.
LPPerfTest at 6 AM is an anomaly. The PERFMAN rule of thumb (ROT) is % CPU Ready Time < 5% for a VM.
How busy is the server devclusterhost2? Utilization of its 8 cores is about 30% at 6 AM.
% CPU Ready Time in VMware ESX Server. References: "VMware ESX Server 3 Ready Time Observations"; "Co-scheduling SMP VMs in VMware ESX Server"; "VMware vSphere 4: The CPU Scheduler in VMware ESX 4". CPU ready time is the time a virtual machine must wait in a ready-to-run state before it can be scheduled on a CPU. It is expressed as a percentage of the measurement interval. E.g., a VM with % CPU Ready Time of 5% in a 3600-second interval waits in a ready-to-run state for 0.05 x 3600 = 180 seconds. It makes sense to view ready time in the context of the VM’s CPU service time: CPU Ready Time / CPU Busy Time. Call this CPU Ready Time per CPU Busy Time.
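The two quantities defined on this slide can be sketched in a few lines of Python. The function names are illustrative only, and the 900 seconds of CPU busy time in the usage lines is a made-up value, not a measurement from the talk.

```python
def ready_seconds(pct_ready: float, interval_s: float) -> float:
    """Convert % CPU Ready Time into seconds spent ready-to-run in the interval."""
    return pct_ready / 100.0 * interval_s


def ready_per_busy(ready_s: float, busy_s: float) -> float:
    """CPU Ready Time per CPU Busy Time (a dimensionless ratio)."""
    if busy_s <= 0:
        raise ValueError("busy_s must be positive")
    return ready_s / busy_s


# The slide's example: 5% ready in a 3600-second interval.
waited = ready_seconds(5.0, 3600)
# Hypothetical busy time of 900 seconds, chosen only for illustration.
ratio = ready_per_busy(waited, 900.0)
print(f"{waited:.0f} s ready, ready/busy = {ratio:.2f}")
```

Expressing ready time per busy second keeps a lightly used VM with a few seconds of ready time from looking as healthy as a busy VM with the same percentage.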
CPU Ready Time per CPU Busy Time for devclusterhost2’s Virtual Machines
Causes of CPU Ready Time in VMware ESX Server: physical CPUs are unavailable; co-scheduling of SMP virtual machines; CPU preference in multiprocessor servers. Other reasons: overall server utilization; load correlation; number of virtual machines; number of virtual CPUs in the VMs.
Co-scheduling. Proportional-share based algorithm: VM priority is based on used CPU as a fraction of entitled CPU; smaller means higher priority. Strict co-scheduling in ESX Server 2.x (2003): a cumulative skew value is kept for each vCPU; progress means running or idling; skew increases if the vCPU is not making progress; an idle vCPU does not accumulate skew; once skew exceeds a threshold, all sibling vCPUs must be co-started. Relaxed co-scheduling in ESX Server 3.x (2006): only those sibling vCPUs that are skewed must be co-started. Further relaxed co-scheduling in ESX Server 4. As a result, physical CPUs may be available while VMs are in a ready-to-run state.
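The skew rules listed on this slide can be illustrated with a toy sketch. This is not the ESX scheduler: the tick-based counter, the state names, and the threshold value are all assumptions made for illustration, and details such as how skew shrinks again are left out.

```python
from dataclasses import dataclass


@dataclass
class VCpu:
    name: str
    skew: int = 0  # cumulative skew, in hypothetical scheduler ticks


def update_skew(vcpus, states):
    """Advance one tick; states maps vCPU name to 'run', 'idle', or 'ready'.

    Running or idling counts as progress. Only a vCPU that is ready but
    not running fails to make progress, so only it accumulates skew.
    """
    for v in vcpus:
        if states[v.name] == "ready":
            v.skew += 1


def must_co_start(vcpus, threshold, strict):
    """Names of the sibling vCPUs that have to be started together."""
    skewed = [v.name for v in vcpus if v.skew > threshold]
    if not skewed:
        return []
    # Strict (2.x style): every sibling. Relaxed (3.x style): only the skewed ones.
    return [v.name for v in vcpus] if strict else skewed


siblings = [VCpu("vcpu0"), VCpu("vcpu1")]
for _ in range(4):
    update_skew(siblings, {"vcpu0": "run", "vcpu1": "ready"})

print(must_co_start(siblings, threshold=3, strict=True))
print(must_co_start(siblings, threshold=3, strict=False))
```

Under the strict rule the VM needs as many free physical CPUs as it has vCPUs before it can restart, which is how ready time can build up while some physical CPUs sit idle.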
CPU Preference (multiprocessor systems): a vCPU that has been scheduled on a particular CPU will be given preference to run on the same CPU again, for the performance advantage of finding its data in the CPU cache.
Summary observations about devclusterhost2. In a 1-hour interval, VMs in a server running ESX Server 3.5 are experiencing 1800 seconds of CPU ready time (50% of 3600 secs) and 8640 seconds of CPU service time (30% of 8 pCPUs). LPPerfTest, one of the virtual machines, is experiencing 1069 seconds of CPU ready time and 4800 seconds of CPU service time. % CPU Ready Time for LPPerfTest = 29.7% > 5% ROT. It looks like LPPerfTest is the cause/victim of the problem because of the spike in its utilization at 6 AM. The server has 8 physical CPUs and is running at about 30% busy. Is this reasonable? I did not think so. Let’s investigate by modeling.
First Model: build a simulation model of a system with 19 customers (19 vCPUs) contending for N physical CPUs, where N = 8, 7, 6, ..., and providing about 8640 seconds of CPU service in a 3600-second interval. Examine the CPU queue times predicted by the model and compare them to devclusterhost2 at 6 AM. The model used is the Machine Repair Model, which has a well-known analytic solution; for this study a simulation model was used.
Machine Repair Model (diagram). The shop floor: a delay center with W_S = mean service time = mean time to failure; population = number of machines = 12. The repair center: number of servers = number of repairmen = 2; W_S = mean service time = mean time to repair.
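The analytic solution of the machine repair model can be sketched as a birth-death chain. The talk itself used a simulation, so this closed form is a stand-in rather than the author's method. It assumes exponentially distributed failure and repair times, and the 9.0 and 1.0 time values in the usage line are invented for illustration; only the population of 12 and the 2 repairmen come from the slide.

```python
def machine_repair(population, servers, mean_time_to_failure, mean_time_to_repair):
    """Steady-state metrics of the machine repair model (exponential times assumed)."""
    # weights[n] is the unnormalized probability that n machines are at the
    # repair center (being repaired or waiting for a repairman).
    weights = [1.0]
    for n in range(1, population + 1):
        # Transition n-1 -> n: any of the machines still running may fail.
        failure_rate = (population - (n - 1)) / mean_time_to_failure
        # Transition n -> n-1: at most `servers` repairs proceed at once.
        repair_rate = min(n, servers) / mean_time_to_repair
        weights.append(weights[-1] * failure_rate / repair_rate)
    total = sum(weights)
    probs = [w / total for w in weights]

    busy_servers = sum(min(n, servers) * p for n, p in enumerate(probs))
    at_center = sum(n * p for n, p in enumerate(probs))
    throughput = busy_servers / mean_time_to_repair
    response_time = at_center / throughput  # Little's law at the repair center
    return {
        "utilization": busy_servers / servers,
        "throughput": throughput,
        "queue_time": response_time - mean_time_to_repair,
    }


# 12 machines and 2 repairmen as on the slide; the two times are hypothetical.
print(machine_repair(12, 2, mean_time_to_failure=9.0, mean_time_to_repair=1.0))
```

In the ESX reading of the model, machines correspond to vCPUs, repairmen to physical CPUs, and the queue time at the repair center to CPU ready time caused purely by contention.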
Devclusterhost2 is behaving more like a 3 or 4 pCPU server running at 60-80% Busy
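The 60-80% figure is consistent with simple arithmetic on the measured load: 8640 seconds of CPU service in a 3600-second interval keeps 2.4 physical CPUs busy on average. The sketch below only reproduces that utilization arithmetic for different CPU counts; it does not reproduce the queue-time comparison from the simulation, and the names are illustrative.

```python
SERVICE_S = 8640.0   # CPU service time delivered in the interval (from the talk)
INTERVAL_S = 3600.0  # measurement interval in seconds


def utilization(n_pcpus: int) -> float:
    """Fraction of n_pcpus physical CPUs needed to deliver SERVICE_S."""
    return SERVICE_S / (n_pcpus * INTERVAL_S)


# Utilization in percent for N = 8 down to 3 physical CPUs.
table = {n: round(100 * utilization(n), 1) for n in range(8, 2, -1)}
for n, pct in table.items():
    print(f"{n} pCPUs: {pct}% busy")
```

The same load that looks like 30% on 8 pCPUs is 60% on 4 and 80% on 3.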
Use a more realistic model for a VMware ESX host. Model characteristics: number of server physical CPUs; number of virtual machines; virtual CPUs in each VM; server utilization of each VM; population in each VM (number of processes); VM dispatching, including co-scheduling for SMP VMs and CPU preference on servers with multiple physical CPUs. Apply to two benchmarks described in the paper: uniprocessor benchmarks, and benchmarks on a 4-CPU server with a mix of uniprocessor and SMP virtual machines. Apply to devclusterhost2.
VMware ESX Host Model (diagram): a CPU server with 4 pCPUs; 2 job classes, one for each VM, each with Pop = WinMPL and a target Util/Tput; a delay center for each VM (VM1, VM2); an Allocate vCPU / Release vCPU node pair for each VM, with number of tokens = number of vCPUs.
Summary of Benchmarks
Uniprocessor Benchmarks (ESX Server 3.0): run on a server with a single physical CPU; no co-scheduling; no CPU preference. The CPU burner program was set to consume 15% of a single physical CPU. 6 virtual machines, each with one virtual CPU. The test started with a single CPU burner in one virtual machine while the other five VMs were idle. Every 10 minutes, another CPU burner program was started in another virtual machine. In the last 10 minutes, 6 VMs were each running one copy of the CPU burner program.
Uni Benchmark and Model Results
Benchmarks on 4-CPU Server (ESX Server 3.0): server has 4 physical CPUs; 10-minute (600-second) runs. Run 6 instances of the CPU burner program, with each instance set to consume 50% of a single CPU: 6 x 50% of 1 CPU = 300% of 1 CPU; 300% of 1 CPU = 3 x 600 secs = 1800 secs; utilization of 4 CPUs = 75%. Eight runs with combinations of VMs under ESX Server 3.0: 6 UP; 5 UP, 1 SMP; 4 UP, 2 SMP; 3 UP, 3 SMP; 2 UP, 4 SMP; 1 UP, 4 SMP, 1 2-Burner; 4 SMP, 2 2-Burners; 3 SMP, 3 2-Burners.
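The load arithmetic on this slide can be checked with a short sketch. The constant names are illustrative; the values are the ones stated for the benchmark.

```python
BURNERS = 6          # CPU burner instances per run
BURNER_SHARE = 0.50  # each burner targets 50% of one CPU
RUN_S = 600          # run length in seconds
PCPUS = 4            # physical CPUs in the server

cpu_equivalents = BURNERS * BURNER_SHARE  # load expressed in whole CPUs
busy_seconds = cpu_equivalents * RUN_S    # CPU seconds consumed per run
server_utilization = cpu_equivalents / PCPUS

print(f"{cpu_equivalents:.0f} CPUs busy, {busy_seconds:.0f} CPU seconds, "
      f"{server_utilization:.0%} of {PCPUS} pCPUs")
```

Every run carries the same total load, so differences in ready time between runs come from how the burners are packaged into UP and SMP virtual machines.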
4-CPU Server benchmark results from paper
4-CPU Server Benchmarks: Model only contention for Physical CPUs
Model Contention for pCPUs plus Co-scheduling of SMP VMs
Model pCPUs + Co-Scheduling + CPU Preference
Summary of 4-CPU Server Models
devclusterhost2 Model (ESX Server 3.5)
Comparing Devclusterhost2 June 10, 2010 and September 2, 2010
Conclusions: % CPU Ready Time can be problematic in SMP VMs. It can be caused by co-scheduling and CPU preference. To limit CPU ready time, consider reducing the number of VMs, reducing the load on the server, and reducing the number of virtual CPUs in VMs. Consider showing ready time as a fraction of CPU Busy Time, with the rule of thumb (ROT) CPU Ready / CPU Busy < 0.2 for each VM.
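Both rules of thumb from the talk (% CPU Ready Time < 5% and CPU Ready / CPU Busy < 0.2) can be applied to a VM with a small helper. This is a minimal sketch: the function name and the shape of the returned dictionary are hypothetical, while the LPPerfTest figures are the ones quoted earlier in the talk.

```python
def check_vm(ready_s, busy_s, interval_s, max_pct_ready=5.0, max_ready_per_busy=0.2):
    """Evaluate one VM against the two rules of thumb from the talk."""
    pct_ready = 100.0 * ready_s / interval_s
    ratio = ready_s / busy_s
    return {
        "pct_ready": pct_ready,
        "ready_per_busy": ratio,
        "exceeds_pct_rot": pct_ready >= max_pct_ready,
        "exceeds_ratio_rot": ratio >= max_ready_per_busy,
    }


# LPPerfTest at 6 AM: 1069 s ready and 4800 s busy in a 3600 s interval.
report = check_vm(ready_s=1069, busy_s=4800, interval_s=3600)
print(f"% ready = {report['pct_ready']:.1f}, "
      f"ready/busy = {report['ready_per_busy']:.2f}")
```

For LPPerfTest both rules are exceeded, but the ratio rule is exceeded by a much smaller margin because the VM was also doing a large amount of work in that hour.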