1 Understanding Performance Monitoring in VMware VI3 (based on VMworld 2007 – TA64)

2 Sources of Performance Problems
Bottom line: virtual machines share physical resources
Over-commitment of resources can lead to poor performance
Imbalances in the system can also lead to slower-than-expected performance
Configuration issues can also contribute to poor performance
Our goal is to help you get the most out of your ESX server. The bottom line in virtual machine performance is that virtual machines share underlying physical resources. When we overcommit these resources, we can see performance bottlenecks. For example, if too many virtual machines are CPU-intensive, we might see slow performance because all of them need to share the underlying physical CPU. The same goes for memory, disk, and network. In addition, performance can suffer due to inherent imbalances in the system. For example, disk is much slower than CPU, so if we have a disk-intensive application, we might see slower-than-expected performance. The important point here is that this is true both natively and in a virtual machine, so we have to set our expectations appropriately. Finally, configuration issues or inadvertent user error might lead to poor performance. For example, we might configure LUNs improperly so that two virtual machines are accessing different LUNs that span the same set of underlying drives. In this case, we might see worse disk performance than expected because of contention for the disks or because of pathological access patterns. Or, a user might use an SMP virtual machine when a UP virtual machine will work fine; a single-threaded application may perform worse in an SMP virtual machine, for example. We might also see a situation where a user sets shares but then forgets about resetting them, resulting in poor performance because of the changing characteristics of other virtual machines in the system.

3 Tools for Performance Monitoring/Analysis
VirtualCenter client (VI client): per-host and per-cluster statistics
Esxtop: per-host statistics
SDK: allows users to collect only the statistics they want
All tools use the same mechanism to retrieve data (special vmkernel calls)
Just for my information, how many of you have used the VI Client to examine performance statistics? How about esxtop? How about the SDK? For many of you, the VI client is your window into the performance of your virtual environment. It provides fairly coarse-grained per-host, per-VM, and per-cluster statistics. In contrast, esxtop is a command-line utility that shows more detailed statistics for a given host and its virtual machines. You must log into the ESX host to use esxtop. Both the VI client and esxtop acquire their statistics from the same source, namely, special vmkernel calls. A third option is to use our VIM API and SDK to grab just the statistics that you find helpful for understanding host or virtual machine performance.

4 VI Client screenshot
(Screenshot callouts: Chart Type, Real-time vs. Historical, Object, Counter type, Rollup, Stats type.)
In this slide I show a screenshot of the VI client performance tab for a given host. It has a lot of configuration information and can be a little confusing, so let me go over it carefully. Real-time vs. Historical: the VI client can show both "real-time" information (i.e., information for the past hour at 20s granularity) and "historical" information (i.e., statistics for the past day, week, month or year, at varying granularities). Past-day stats show 1 data point per 5 minutes, for a total of 288 samples. Past-week stats show 1 data point per 15 minutes, for a total of 672 samples. Past-month stats show 1 data point per hour, or 720 samples. Past-year stats show 1 data point per day, or 365 samples. As you can see, when looking at different historical intervals, we see data at different granularities. For example, past-hour stats are shown at 20s granularity, while past-day stats are shown at 5-minute granularity. The "averaging" that we do to convert from one time interval to another is called "rollup." For example, if we wish to view past-hour CPU %used, we see the %used value at 20s intervals. If we view past-day, then every 5 minutes' worth of 20s CPU %used samples (3 * 5 = 15 samples) are averaged together to present a single number for that 5-minute interval. Different stats have different "rollup" types. For example, if we look at CPU used seconds, the rollup type is summation. This means that for a given 5-minute interval, we present the sum of all of the 20s samples in that interval. Other rollup types are minimum, maximum, and latest. Stats type: this refers to the nature of the statistical value collected, and can be a rate, an absolute value, or a delta. For example, CPU %used is a rate, while CPU used seconds is a delta, and memory granted is an absolute value. Object: the "instance" for which a statistic is collected. For example, you might collect statistics for an individual CPU or network device. Counter type: this represents the actual statistic you are collecting, for example, CPU used or network packets/s for a given device. Chart type: VirtualCenter presents line charts and stacked charts. A line chart simply shows each statistic separately on the same x and y axes. A stacked chart puts the metrics on top of each other on the same chart, which helps show the individual contributions to the whole. However, it is important to remember that not all metrics are normalized to 100%, so stacked charts may exceed the y-axis of the chart.

5 Real-time vs. Historical stats
VirtualCenter stores statistics at different granularities:

Time Interval           Data frequency   Number of samples
Past Hour (real-time)   20 seconds       180
Past Day                5 minutes        288
Past Week               15 minutes       672
Past Month              1 hour           720
Past Year               1 day            365

Samples are "rolled up" (averaged) for the next time interval
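To make the rollup arithmetic concrete, here is a minimal Python sketch (not VMware code; the sample values are invented) that aggregates 20-second real-time samples into 5-minute past-day samples, using averaging for rate/percent counters and summation for delta counters such as CPU used milliseconds.

```python
# Minimal sketch of VirtualCenter-style "rollup": 20s real-time samples are
# aggregated into 5-minute buckets (15 samples per bucket). The sample data
# below is invented for illustration.

def rollup(samples_20s, bucket_size=15, how="average"):
    """Aggregate 20-second samples into coarser buckets (e.g. 5 minutes)."""
    buckets = []
    for i in range(0, len(samples_20s), bucket_size):
        chunk = samples_20s[i:i + bucket_size]
        if how == "average":        # rate/percent counters (e.g. CPU %used)
            buckets.append(sum(chunk) / len(chunk))
        elif how == "summation":    # delta counters (e.g. CPU used ms)
            buckets.append(sum(chunk))
        elif how == "maximum":
            buckets.append(max(chunk))
        elif how == "minimum":
            buckets.append(min(chunk))
        else:                       # "latest"
            buckets.append(chunk[-1])
    return buckets

# One hour of fake CPU %used samples at 20s granularity (180 samples).
cpu_used_pct = [30.0 + (i % 15) for i in range(180)]
print(rollup(cpu_used_pct, how="average"))   # 12 five-minute averages
```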

6 Stacked vs. Line charts
Line: each instance shown separately
Stacked: graphs are stacked on top of each other; only applies to certain kinds of charts, e.g. breakdown of host CPU MHz by virtual machine, or breakdown of virtual machine CPU by vCPU

7 Esxtop
Per-host statistics: shows CPU/memory/disk/network on separate screens
Host information and per-VM/world information
Sampling frequency: refresh every 5s by default
Batch mode: can look at data offline with perfmon
If you type esxtop on the Service Console for a host, you'll see something like this. There are 4 different screens that you can view: one for CPU statistics, one for memory statistics, one for network statistics, and one for disk statistics. You can also run esxtop in batch mode, store the output to a file, and read out the data using Windows Perfmon. We have tried to move a number of statistics from the "proc nodes" in previous versions of ESX Server to esxtop. For example, you can examine the queue depth of your storage adapter from esxtop; previously, this could only be viewed from the "proc node." You can also view some latency values, like the time spent in a storage device (DAVG/cmd).
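As a rough illustration of the batch-mode workflow, the sketch below assumes esxtop has been run in batch mode on the Service Console and then inspects the resulting CSV offline with Python's csv module. The file name is a placeholder and the exact column names vary by ESX version, so the substring filter is only an assumption.

```python
# Sketch: inspect esxtop batch output offline.
# Typically collected on the ESX Service Console with something like:
#   esxtop -b -d 5 -n 60 > esxtop_batch.csv
# (-b = batch mode, -d = delay between samples, -n = number of iterations)
# The CSV can also be loaded into Windows Perfmon; here we just peek at it.

import csv

def columns_matching(path, substring):
    """Return the header columns whose name contains the given substring."""
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    return [c for c in header if substring in c]

# Hypothetical file name; column naming (e.g. "% Ready") varies by version.
print(columns_matching("esxtop_batch.csv", "% Ready"))
```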

8 SDK Use the VIM API to access statistics relevant to a particular user
Can only access statistics that are exported by the VIM API (and thus are accessible via esxtop/VI client)
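As an illustration of collecting only the counters you care about through the VIM API, here is a sketch using the later pyVmomi Python bindings as a stand-in for the SDK bindings of this era; the host name, credentials, VM name, and counter choice are placeholders, so treat this as the general shape of a performance query rather than the exact API of the time.

```python
# Sketch: query one performance counter for one VM via the VIM API,
# using pyVmomi as an illustrative stand-in. Host, credentials, VM name,
# and the chosen counter are placeholders.

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Recent pyVmomi versions may also need SSL options (environment-specific).
si = SmartConnect(host="vc.example.com", user="admin", pwd="secret")
content = si.RetrieveContent()
perf = content.perfManager

# Map counter names like "cpu.ready.summation" to numeric counter IDs.
counter_ids = {
    "%s.%s.%s" % (c.groupInfo.key, c.nameInfo.key, c.rollupType): c.key
    for c in perf.perfCounter
}

# Look up the VM by DNS name (returns None if not found).
vm = content.searchIndex.FindByDnsName(dnsName="myvm.example.com", vmSearch=True)

spec = vim.PerformanceManager.QuerySpec(
    entity=vm,
    metricId=[vim.PerformanceManager.MetricId(
        counterId=counter_ids["cpu.ready.summation"], instance="")],
    intervalId=20,   # 20s "real-time" samples
    maxSample=15,    # the last five minutes
)

for entity_metric in perf.QueryPerf(querySpec=[spec]):
    for series in entity_metric.value:
        print(series.value)   # ready time in milliseconds per 20s sample

Disconnect(si)
```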

9 Shares example
Change shares for a VM: dynamic reallocation
Add a VM, overcommit: graceful degradation
Remove a VM: exploit extra resources
Unlike other operating systems, ESX uses a proportional-share scheduler to help distribute resources among the various worlds. If there are more CPU cycles than are requested by all of the worlds, then each world gets more CPU than it asked for. If there is less CPU than requested, then ESX assigns priorities according to "share" values to provide graceful degradation. Here is a simple example. 1. In the first row of pie charts, we start with 3 worlds. We change the shares for a given virtual machine and ESX adjusts the amount of CPU given to all three worlds. 2. In the second row of pie charts, we start with 3 worlds and add a 4th. This presents a possible CPU overcommitment scenario. ESX gracefully degrades the amount of CPU given to each world according to its share value. The green world has more CPU shares than the others, so it gets more CPU. 3. In the third row of pie charts, we start with 4 worlds and remove the green world. Now that there is excess CPU, the other worlds each get more CPU.
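A minimal sketch of the proportional-share arithmetic behind these pie charts follows; the share and MHz values are invented, and reservations and limits are ignored. Each world's slice is its shares divided by the total shares of all runnable worlds, so adding or removing a world redistributes the pie automatically.

```python
# Sketch of proportional-share allocation: each world gets a slice of the
# physical CPU in proportion to its shares. Share values are invented, and
# reservations/limits are ignored for simplicity.

def allocate(shares_by_world, total_mhz):
    total_shares = sum(shares_by_world.values())
    return {w: total_mhz * s / total_shares for w, s in shares_by_world.items()}

worlds = {"vm1": 1000, "vm2": 1000, "vm3": 2000}   # "green" VM has more shares
print(allocate(worlds, total_mhz=6000))            # 1500 / 1500 / 3000

worlds["vm4"] = 1000                               # add a VM: graceful degradation
print(allocate(worlds, total_mhz=6000))            # every slice shrinks

del worlds["vm3"]                                  # remove the green VM
print(allocate(worlds, total_mhz=6000))            # remaining worlds get more
```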

10 ESX CPU Scheduling World states (simplified view):
ready = ready-to-run but no physical CPU free
run = currently active and running
wait = blocked on I/O
Multi-CPU virtual machines => gang scheduling
Co-run (latency to get vCPUs running)
Co-stop (time in "stopped" state)
As I mentioned, the ESX CPU scheduler is a proportional-share scheduler that tries to balance CPU usage across VMs and maintain user-specified quality of service. We could give an entire talk on the scheduler, but that is not the intent of this talk, and anyway, other talks at VMworld cover this very topic. Now that we have some idea how the scheduler works, let's look at some metrics and characteristics related to CPU scheduling. World states: if a world is ready to run but there are no physical resources to run it, it is in the "ready" state. We'll call the percent of time spent in the ready state "%ready" (in esxtop, this is %rdy). If a world is running, then it is in the "running" state; %used represents the % of time spent in this state. If a world is blocked on I/O, it is in the "blocked" state; this is represented by %twait. If a world is idle, it is in the "idle" state; this is represented by %idle. The % of time that is given to a world over and above its share allocation is called %extra time. For multi-CPU virtual machines, ESX uses a form of gang scheduling: it tries to schedule all of the vCPUs for a given virtual machine at the same time. Clearly, this makes it tougher to schedule a multi-CPU virtual machine than a uniprocessor virtual machine. As mentioned earlier, each world consumes CPU. Virtual machines are worlds, as is the Service Console. Various agents like hostd and vpxa (for communication with VirtualCenter) run inside the Service Console and therefore can take up CPU. The basic point here is that agents that run in the Service Console can take up CPU cycles, so it is important to run only necessary agents in the Service Console. Finally, ESX attempts to balance interrupts across the physical CPUs for better CPU load balancing.

11 ESX, VirtualCenter, and Resource pools
Resource pools extend proportional-share scheduling to groups of hosts (a cluster)
VirtualCenter can VMotion VMs to provide resource balance (DRS)
Going beyond a single host, VirtualCenter provides the capability to aggregate resources across hosts into resource pools. A resource pool is an abstraction that extends the notion of proportional-share scheduling beyond a single host to a group of hosts. Virtual machines are placed in resource pools, and resource pools are given shares relative to one another. This allows VirtualCenter to balance load across hosts using VMotion in conjunction with the Distributed Resource Scheduling (DRS) add-on module. Hosts are organized into clusters, where a cluster is a group of hosts sharing resources (like a SAN). Within the cluster, resource pools are created, and virtual machines are placed within those resource pools. DRS then uses heuristics to balance load according to the shares assigned to the resource pools.

12 Tools for monitoring CPU performance: VI Client
Basic stuff: CPU usage (percent); CPU ready time (but ready time by itself can be misleading)
Advanced stuff: CPU wait time (time spent blocked on I/O); CPU extra time (time given to a virtual machine over its reservation); CPU guaranteed (min CPU for a virtual machine)
Cluster-level statistics: percent of entitled resources delivered; utilization percent; effective CPU resources (MHz for the cluster)
Now that we understand the various states that a virtual machine can be in, we can try to understand what to look for when a virtual machine isn't performing according to our expectations. Host level: CPU usage is the percent of host CPU used by this virtual machine. This is normalized to 100%, and is calculated from the sum of CPU used over all vCPUs for a given virtual machine. CPU ready time: a vCPU is ready to run, but no PCPU is available for it to run on. CPU wait time: time spent blocked on I/O or busy-waiting; this also includes idle time. CPU extra time: time given to a vCPU over and above its shares. If a system is undercommitted, shares have less value since each virtual machine will get everything it wants. If a system is overcommitted and reservations are near the physical limits of the box, then shares also have less meaning. However, if reservations are much less than the physical limits of the box, then shares have meaning. Looking at this time will help you understand whether you have free capacity on a server. CPU guaranteed time: the effective min for a vCPU; this is basically its reservation. Since many metrics are in milliseconds instead of percent, you'll probably need to select CPU used plus the other metrics and just look at their relative contributions. CPU system time (not shown in the chart): time spent in the vmkernel on behalf of the resource pool/world to process interrupts and perform other system activity; this is part of used time as well. Cluster level (available in the performance tab or the chart shown on the cluster summary page): Percent of entitled resources delivered indicates whether hosts are delivering the resources their virtual machines are entitled to; roughly, it is the resources actually delivered on a host relative to the sum of the entitlements of the virtual machines on that host. Alternatively, think of it as follows: entitlement means that if you had one big machine the same size as the compute resources of the cluster and could put all virtual machines on it, the entitlement is what a virtual machine would get based on its shares, min, and max. In practice, of course, we need to split virtual machines across physical hosts, and so a virtual machine will not always get all that it is entitled to. If a cluster is undercommitted, we are more likely to be able to satisfy all entitlements. Here is an example where we would not: we might have sufficient capacity in a cluster, but if the resources are split among hosts, then not all virtual machines will be able to get their entitlements. For example, suppose we have 2 hosts that each have 1GHz to spare, and we have a single 2GHz 2-CPU virtual machine that we need to run. Though we technically have enough capacity, we cannot give this virtual machine its full entitlement. If the migration threshold is set too conservatively, then DRS will only re-balance in the face of massive imbalance, so that's another reason we might see imbalance. Utilization percent: for a given host or set of hosts, how much of their CPU or memory are they using? Effective CPU resources: we add up all of the MHz available in a cluster to show how much CPU is available for the cluster.
Refer people to Aravind and Chirag's material; refer people to the CPU scheduling talk.
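The "split spare capacity" point above reduces to simple arithmetic; here is a small sketch using the numbers from the example (two hosts with 1 GHz spare each, one 2 GHz virtual machine): the cluster as a whole has enough spare MHz, but no single host can deliver the virtual machine's full entitlement without a migration.

```python
# Sketch of the split-capacity example: the cluster has 2 GHz spare in total,
# but no single host can place a 2 GHz virtual machine.

spare_mhz_per_host = {"hostA": 1000, "hostB": 1000}   # 1 GHz spare on each host
vm_entitlement_mhz = 2000                             # 2 GHz, 2-vCPU virtual machine

cluster_spare = sum(spare_mhz_per_host.values())
fits_in_cluster = cluster_spare >= vm_entitlement_mhz
fits_on_some_host = any(m >= vm_entitlement_mhz for m in spare_mhz_per_host.values())

print("cluster has capacity:", fits_in_cluster)        # True
print("some host can deliver it:", fits_on_some_host)  # False -> entitlement not delivered
```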

13 VI Client CPU screenshot
Here I show a screenshot of the VI client with some of the CPU counters enabled. One set of counters uses the right-hand y-axis, while the other metrics use the left-hand y-axis. Note that CPU milliseconds and percent are on the same chart but use different axes.

14 Cluster-level information in the VI Client
Utilization % describes available capacity on hosts (here: CPU usage low, memory usage medium)
% Entitled resources delivered: best if all
Here is an example of some of the cluster-level information in the VI client. The top chart shows utilization percent, and is a histogram showing how much of the available CPU and memory is utilized on a per-host basis. Memory and CPU are shown separately, so this chart describes when hosts have memory capacity free and when they have CPU capacity free. When most hosts are at low levels of utilization, there is spare capacity in the cluster. The bottom chart shows the entitled resource distribution. If some hosts have a lower percent of entitled resources delivered, this might mean that the DRS threshold is too conservative, that the "aspect ratio" of virtual machines to the available compute power is imbalanced, or that there are affinity/anti-affinity rules in place. For example, as mentioned in an earlier slide, if spare resources are available but split across hosts, then a virtual machine may not get all it is entitled to. One might imagine trying to use smaller (fewer CPUs, less memory) virtual machines to solve this issue.

15 CPU performance analysis: esxtop
PCPU(%): CPU utilization
Per-group stats breakdown: %USED (utilization), %RDY (ready time), %TWAIT (wait and idling time)
Co-scheduling stats (multi-CPU virtual machines): %CRUN (co-run state), %CSTOP (co-stop state)
NMEM: each member can consume 100% (expand to see the breakdown)
Affinity, HTSharing
Esxtop gives a lot of the same information as the VI client, but supplements it with some more detailed information. PCPU(%): percentage CPU utilization per physical CPU and the total average physical CPU utilization. %USED, %RDY, and %TWAIT (the same as CPU wait time in the VI client) are given in percent instead of absolute time. Esxtop also gives information for the Service Console. Idle time is counted in %TWAIT. Esxtop also gives co-scheduling stats for SMP virtual machines. %CRUN represents the amount of time both vCPUs of an SMP virtual machine were running; the latency between when a PCPU is told to run a certain vCPU of an SMP VM and when it is actually able to run it should be almost 0. %CSTOP: we use relaxed co-scheduling. If a vCPU gets ahead of another vCPU of the same SMP VM, then we ask the faster vCPU to stop until the other one can catch up. The time spent in this stopped state is %CSTOP, so it indicates the % of time that the vCPUs could not run because they could not be scheduled together. NMEM shows the number of worlds associated with a given group. For example, a virtual machine has several vCPU worlds as well as some mks worlds, etc. In addition, each world can consume 100% CPU, which is why you might see some unexpanded groups consume more than 100% CPU; if you expand that group, you'll see each member taking at most 100%. Affinity shows whether a given world has been manually locked to a given PCPU (or logical CPU for hyperthreaded CPUs). HTSharing indicates whether there are any restrictions on how worlds are scheduled on the hypertwins of a hyperthreaded CPU. See the man page for more details.

16 esxtop CPU screenshot 2-CPU box, but 3 active VMs (high %used)
High %rdy + high %used can imply CPU overcommitment

17 VI Client and Ready Time
Used time ~ ready time: may signal contention. However, the host might not be overcommitted, due to workload variability. In this example, we have periods of activity and idle periods, so the CPU isn't overcommitted all the time. (Chart callouts: "Used time", "Ready time ~ used time", "Ready time < used time".)
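Since the real-time CPU used and ready counters are reported in milliseconds per 20-second sample, one way to apply the "ready time ~ used time" heuristic above is to normalize both to a percent of the interval; the sample numbers and the thresholds in the sketch below are illustrative only, not VMware guidance, and assume a single-vCPU virtual machine.

```python
# Sketch: convert real-time CPU used/ready (ms per 20s sample) to percent and
# flag samples where ready time is high and comparable to used time.
# Sample values and thresholds are illustrative only.

INTERVAL_MS = 20_000            # one real-time sample covers 20 seconds

used_ms  = [4000, 12000, 15000, 2000]
ready_ms = [ 200,  9000, 14000,  100]

for used, ready in zip(used_ms, ready_ms):
    used_pct  = 100.0 * used  / INTERVAL_MS
    ready_pct = 100.0 * ready / INTERVAL_MS
    contended = ready_pct > 10 and ready >= 0.5 * used
    print(f"used {used_pct:5.1f}%  ready {ready_pct:5.1f}%  "
          f"possible contention: {contended}")
```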

18 Memory ESX must balance memory usage for all worlds
Virtual machines, the Service Console, and the vmkernel consume memory
Page sharing to reduce the memory footprint of virtual machines
Ballooning to relieve memory pressure in a graceful way
Host swapping to relieve memory pressure when ballooning is insufficient
ESX allows overcommitment of memory: the sum of the configured memory sizes of the virtual machines can be greater than physical memory if the working sets fit
VC adds the ability to create resource pools based on memory usage
Memory overcommitment: the sum of the memory sizes of the VMs can be >> physical memory as long as the active working sets fit into physical memory. The ESX memory scheduler attempts to balance the memory needs of the virtual machines and the Service Console. Just like with CPU, we can assign memory reservations and memory shares to different virtual machines to indicate priorities. By flexibly allocating memory between these components, the ESX memory scheduler allows "memory overcommitment" to occur. Basically, a virtual machine isn't given all of its configured memory at once. Instead, it is given memory as it needs it, and per-virtual-machine swap files are used to make sure that if we have powered on a virtual machine, it will always be able to run, even if it must run slowly and use swap. Memory overcommitment works like this: suppose we have 8 virtual machines, each statically configured to use 512MB of memory. This means that they will take up at least 4GB of memory. However, in practice, not all memory is used at once. If each virtual machine is only actively using about 100MB, then we could satisfy all of the memory needs with only 800MB of memory. This means we could run all of these virtual machines on a 2GB host. In the degenerate case in which all virtual machines end up using all of their memory, we have to resort to satisfying some of their requests through their swap files, and we will see host-level swapping. Just as a physical machine can swap, a virtual machine can swap as well. In this case, it means that the amount of physical memory that we've configured for a virtual machine is too small; the guest OS of the virtual machine will start swapping to its own swap space inside the guest disk file (the vmdk file). Note the difference between host and guest swapping. Host swapping means that the ESX host doesn't have enough physical memory to satisfy all requests, so the virtual machines use their .vswp files to satisfy memory requests. Guest swapping means that swapping is occurring within the guest, and the swap file within the guest disk (.vmdk) file is used to satisfy memory requests. When an ESX host determines that it is running out of physical memory to satisfy virtual machine requests, it can ask the guest within the virtual machine to start swapping pages to its swap file. This is done through a process known as ballooning. Since a guest knows best which pages it is not using, this is a good way to reclaim pages for ESX. If ESX can't reclaim pages fast enough using this approach, then ESX must forcibly evict pages from virtual machines and write them to the virtual machine's .vswp file. This has a larger performance impact than ballooning since ESX doesn't have information about which pages are actively used within the guest, so it might evict pages that the guest is using extensively. In practice, the memory needed to run a virtual machine is more than simply the configured memory size. There is extra memory that is needed for storing page mapping information for virtual machines. This is called overhead memory.
The larger the memory configured for a virtual machine, the larger its overhead memory can become. The memory scheduler must take this overhead memory into account when determining whether a virtual machine can be powered on (admission control). ESX reduces the effective memory footprint of all the virtual machines in the system through content-based page sharing. Basically, if multiple virtual machines are using a page with the same content (for example, 2 Windows XP virtual machines might have some OS code pages in common), then the pages are collapsed and backed by one machine page. This page is marked copy-on-write in case it must be modified in the future. Machine pages vs. physical pages: 1. A machine page is an actual hardware page on the host. 2. Think of a physical page this way: I configure a virtual machine to have, say, 256MB of memory. These are physical pages. In practice, this may be backed by only 240MB of machine pages because of page sharing. Moreover, the vmkernel may not even have given the virtual machine all of its memory yet, so it wouldn't even "consume" its full allotment of memory.
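The 8-VM overcommitment example in the notes reduces to simple arithmetic; here is a sketch using the numbers from the example (512MB configured, roughly 100MB active, a 2GB host) that shows when overcommitment works and when host-level reclamation would begin. Overhead memory and Service Console memory are ignored for simplicity.

```python
# Sketch of the memory overcommitment arithmetic from the example:
# 8 VMs, each configured with 512 MB but actively using only ~100 MB,
# on a 2 GB host. Overhead and Service Console memory are ignored here.

configured_mb = [512] * 8
active_mb     = [100] * 8
host_mb       = 2048

print("sum configured:", sum(configured_mb), "MB")   # 4096 MB > host
print("sum active:    ", sum(active_mb), "MB")       # 800 MB  < host

overcommitted      = sum(configured_mb) > host_mb    # allowed by ESX
host_reclaim_likely = sum(active_mb) > host_mb       # ballooning/host swapping would kick in
print("memory overcommitted:", overcommitted)
print("host-level reclamation likely:", host_reclaim_likely)
```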

19 VI Client Main page shows “consumed” memory (formerly “active” memory)
Performance charts show important statistics for virtual machines: consumed memory, granted memory, ballooned memory, shared memory, swapped memory, swap in, swap out
Now let's discuss the statistics available through the VI client. In VC1.x, the summary page showed active memory. I believe in VC2.0, it now shows consumed memory. For a virtual machine, active memory = touched pages. We collect this information by sampling physical pages in a virtual machine and seeing which ones have been touched recently. In the future this may change to consumed memory. Memory usage = active memory / configured memory (this is a percent). Consumed memory: mapped - sharedSaved. This is the amount of _machine_ memory that has been mapped by a virtual machine, not including shared pages. This does not include overhead memory. Granted memory: out of the configured memory size of my virtual machine, how much _physical_ memory the vmkernel has actually given to me. For example, I may be a 256MB virtual machine, but maybe the vmkernel has only given me 70MB. Moreover, since the machine pages backing these physical pages may be shared with other virtual machines, my virtual machine may only be consuming 40MB of machine pages on the host. Ballooned memory: this metric describes how many physical pages have been ballooned from a given virtual machine. In other words, how many pages has the _guest_ swapped to _its_ swap file at our request? (Note that this is swapped within the guest, so the pages end up somewhere in the VMDK file for the virtual machine.) Shared memory: this metric describes how many physical pages a given virtual machine shares with other virtual machines. Swapped memory: this metric describes how many physical pages the vmkernel has swapped out from a virtual machine to the virtual machine's swap file (.vswp). Other metrics: overhead memory describes how many machine pages are used for virtualization overhead for a given virtual machine; swap in is the cumulative amount of memory swapped in over the lifetime of the virtual machine; swap out is the cumulative amount of memory swapped out over the lifetime of the virtual machine.

20 Virtual Machine Memory Metrics
Metric             Description
Active Memory      Physical pages touched recently by a virtual machine
Memory Usage       Active memory / configured memory
Consumed Memory    Machine memory mapped to a virtual machine, not including shared & overhead memory
Granted Memory     Physical pages allocated to a virtual machine; may be less than configured memory
Shared Memory      Physical pages shared with other virtual machines
Ballooned Memory   Physical memory ballooned from a virtual machine
Swapped Memory     Physical pages swapped from a virtual machine by the vmkernel (swap in and swap out are cumulative)
Overhead Memory    Machine pages used for virtualization
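Given per-VM samples of the metrics in this table, a small sketch like the one below can derive memory usage percent and flag ballooning or vmkernel swapping as signs of memory pressure; the field names and sample values are invented for illustration.

```python
# Sketch: derive "memory usage" and flag memory pressure from the per-VM
# metrics listed above. The dictionaries below are invented sample data.

vms = [
    {"name": "web01", "configured_mb": 2048, "active_mb": 300,
     "ballooned_mb": 0,   "swapped_mb": 0},
    {"name": "db01",  "configured_mb": 4096, "active_mb": 3900,
     "ballooned_mb": 512, "swapped_mb": 128},
]

for vm in vms:
    usage_pct = 100.0 * vm["active_mb"] / vm["configured_mb"]   # memory usage
    pressured = vm["ballooned_mb"] > 0 or vm["swapped_mb"] > 0
    print(f'{vm["name"]}: usage {usage_pct:.0f}%  '
          f'ballooned {vm["ballooned_mb"]} MB  swapped {vm["swapped_mb"]} MB  '
          f'memory pressure: {pressured}')
```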

21 VI Client: VM list summary
Host CPU: average CPU utilization for the virtual machine
Host Memory: consumed memory for the virtual machine
Guest Memory: active memory for the guest
For SDK users, these stats are retrieved from vm.Summary.QuickStats.

22 Esxtop for memory: Host information
PMEM: total physical memory breakdown
VMKMEM: memory managed by the vmkernel
COSMEM: Service Console memory breakdown
SWAP: swap breakdown
MEMCTL: balloon information
PMEM shows the total memory in the server, broken down by subsystem (cos, vmkernel, free, other). VMKMEM is the memory managed by the vmkernel: the memory taken by the Service Console is removed and the remainder is left for the vmkernel. Memory state: high = machine memory is not under memory pressure; low = machine memory is under pressure. Minfree = the minimum amount of machine memory that the vmkernel would like to keep free. COSMEM is the breakdown of usage by the Service Console: swap_t = total swap configured, swap_f = amount of swap free, and r/s and w/s = rate of swap from/to disk. SWAP: curr = current swap usage. MEMCTL: balloon statistics; max = the maximum amount of physical memory that the vmkernel can reclaim using ballooning.

23 VI Client: Memory example for Virtual Machine
(Chart callouts: increase in swap activity vs. no swap activity; series shown: swap in, swap out, balloon & target, consumed & granted, active memory, swap usage.)

24 Disk Disk performance is dependent on many factors:
Filesystem performance
Disk subsystem configuration (SAN, NAS, iSCSI, local disk)
Disk caching
Disk formats (thick, sparse, thin)
ESX is tuned for virtual machine I/O
VMFS clustered filesystem => keeping consistency imposes some overheads
Disk performance is dependent on many factors; I've listed a few here. ESX is tuned for virtual machine I/O. VMFS maintains metadata consistency with distributed locking, which can be expensive, so it is not meant for things like running random cron jobs once a minute that do metadata-intensive operations (like creating or removing files, dynamically growing a file, or changing permissions). Disk formats: thick is the default. Sparse allocates on demand; it is essentially redo logs, maintained above VMFS by internal disk libraries, so VMFS has no idea this is happening. Thin is another sparse format that is maintained by VMFS itself. Ideally, thin is better than sparse, but sparse is more portable.

25 ESX Storage Stack Different latencies for local disk vs. SAN (caching, switches, etc.) Queuing within kernel and in hardware The ESX storage stack adds a few layers of code between a virtual machine and bare hardware, as shown in this slide. The real thing to remember is that there are multiple levels of queues, from within the guest to within the vmkernel to within the SAN or disk array. Moreover, the latency seen by the guest is comprised of latency seen by the vmkernel, latency at the disk array, and the transit time from the vmkernel to the guest.

26 VI Client disk statistics
Mostly coarse-grained statistics
Disk bandwidth: disk read rate, disk write rate (KB/s); disk usage: sum of read BW and write BW (KB/s)
Disk operations during the sampling interval (20s for real-time): disk read requests, disk write requests, commands issued, bus resets, command aborts
Per-LUN and aggregated statistics
The VI client mostly gives coarse-grained, rate-based statistics like bandwidth used or operations performed per unit time. Increasing the statistics level can increase the number of archived stats that can be viewed. For example, past-day per-LUN bandwidth statistics are not available at the default VirtualCenter statistics level, and can only be seen when the statistics level is 3 or higher.

27 Esxtop Disk Statistics
K: kernel, D: device, G: guest
Aggregated statistics like the VI Client: READS/s, WRITES/s, MBREAD/s, MBWRITE/s
Latency statistics: KAVG/cmd, DAVG/cmd, GAVG/cmd
Queuing information: adapter (AQLEN), LUN (LQLEN), vmkernel (QUED)
Esxtop gives aggregated statistics plus some more detailed information. Latency statistics: KAVG/cmd is the latency seen by the vmkernel, DAVG/cmd is the latency seen by the device, and GAVG/cmd is the latency seen by the guest. Queuing information: AQLEN is the maximum storage adapter queue length, LQLEN is the maximum LUN queue depth, and QUED is the number of commands queued in the vmkernel. Other items in case someone asks: 1. Split commands (SPLTCMD/s): commands can be split when they reach the vmkernel, which might impact the latency perceived by the guest. The guest may be using large chunks which have to be broken down by the vmkernel. For ESX 3.0.x, guest requests greater than 128KB are split into 128KB chunks. Very few applications do larger than 128KB operations, so this is likely not to be an issue. Unaligned accesses are also split (due to the number of scatter/gather elements); for every unaligned element you'll take an additional scatter/gather element, and scatter/gather elements are aligned on page boundaries. 2. PAECP/s: may point to a hardware misconfiguration. Here is what it means: the guest allocates a buffer, and based on this, the vmkernel allocates memory, which might come from a "highmem" region. If you have a driver that is not aware of PAE, accesses to this memory region result in copies by the vmkernel into a lower memory location before the request is issued to the adapter. If you do not populate the DIMMs with low memory first, you may artificially cause "highmem" memory accesses.
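The relationship among these latency counters can be summarized roughly as GAVG ~ KAVG + DAVG (guest-observed latency is approximately vmkernel time plus device time); the sketch below applies that rule of thumb with invented numbers and illustrative thresholds, not official guidance.

```python
# Sketch: interpret esxtop disk latencies. Roughly, GAVG ~ KAVG + DAVG
# (guest latency ~ vmkernel latency + device latency). Sample numbers and
# the thresholds are illustrative only.

samples = [
    {"lun": "vmhba1:0:1", "kavg_ms": 0.1, "davg_ms": 4.0},
    {"lun": "vmhba1:0:2", "kavg_ms": 3.5, "davg_ms": 25.0},
]

for s in samples:
    gavg = s["kavg_ms"] + s["davg_ms"]   # approximate guest-observed latency
    slow_device  = s["davg_ms"] > 20     # storage array or path looks slow
    kernel_queue = s["kavg_ms"] > 2      # commands queuing in the vmkernel
    print(f'{s["lun"]}: GAVG ~{gavg:.1f} ms  '
          f'device slow: {slow_device}  vmkernel queuing: {kernel_queue}')
```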

28 Disk performance example: VI Client
Throughput with Cache (good) Throughput w/o Cache (bad)

29 Disk performance example: esxtop
Latency seems high After enabling cache, latency much better

30 Network
(Diagram: path from the guest OS through the virtual device, virtual driver, and virtual I/O stack in the vmkernel to the physical driver; costs along the way include address space switches, virtual interrupts, packet copies, and packet routing, plus the guest's own TCP/IP stack.)
The point of this slide is simply that from the guest OS to the NIC, there are a few layers of software. We have several optimizations (like interrupt coalescing, mentioned on the previous slide) to mitigate the cost of these layers.

31 VI Client Networking Statistics
Mostly high-level statistics
Bandwidth: KBps transmitted and received; network usage (KBps): sum of TX and RX over all NICs
Operations/s: network packets received during the sampling interval (real-time: 20s); network packets transmitted during the sampling interval
Per-adapter and aggregated statistics; per-VM stacked graphs
The VI client mostly shows aggregated statistics like bandwidth and operations per second. It can also show per-adapter bandwidth statistics.

32 Esxtop Networking Statistics
Bandwidth: receive (MbRX/s), transmit (MbTX/s)
Operations/s: receive (PKTRX/s), transmit (PKTTX/s)
Configuration info: duplex (FDUPLX), speed (SPEED)
Errors: packets dropped during transmit (%DRPTX) and receive (%DRPRX)
Esxtop expands upon the statistics offered by the VI client by showing statistics like the number of dropped packets. It also has some configuration information. Much of this is also in the VI client, though the information may be distributed in various parts of the client.
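As a quick way to act on the %DRPTX/%DRPRX counters above, the sketch below flags ports whose drop percentage exceeds an arbitrary threshold; the rows and the 1% threshold are invented for illustration.

```python
# Sketch: flag NICs/ports with dropped packets, based on the esxtop
# %DRPTX / %DRPRX counters. Sample rows and the 1% threshold are invented.

ports = [
    {"name": "vmnic0", "pkttx_s": 9000, "pktrx_s": 8500,
     "drptx_pct": 0.0, "drprx_pct": 0.0},
    {"name": "vmnic1", "pkttx_s": 1200, "pktrx_s": 40000,
     "drptx_pct": 0.1, "drprx_pct": 2.4},
]

THRESHOLD_PCT = 1.0

for p in ports:
    dropping = p["drptx_pct"] > THRESHOLD_PCT or p["drprx_pct"] > THRESHOLD_PCT
    print(f'{p["name"]}: drops tx {p["drptx_pct"]}%  rx {p["drprx_pct"]}%  '
          f'investigate: {dropping}')
```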

33 Esxtop network output
(Screenshots: physical configuration and performance for Setup A (10/100) and Setup B (GigE).)

34 Approaching Performance Issues
Make sure it is an apples-to-apples comparison
Check guest tools & guest processes
Check host configurations & host processes
Check the VirtualCenter client for resource issues
Check esxtop for obvious resource issues
Examine log files for errors
If no suspects, run microbenchmarks (e.g., Iometer, netperf) to narrow the scope
Once you have suspects, check relevant configurations
If all else fails… contact VMware


