© 2014 VMware Inc. All rights reserved. Performance Management Iwan ‘e1’ Rahabok Staff SE (Strategic Accounts) & CTO Ambassador


1 © 2014 VMware Inc. All rights reserved.
Performance Management
Iwan ‘e1’ Rahabok, Staff SE (Strategic Accounts) & CTO Ambassador
e1@vmware.com | 9119-9226 | Linkedin.com/in/e1ang | Twitter: e1_ang
https://www.facebook.com/groups/vmware.users/
http://virtual-red-dot.info
VCAP-DCD, TOGAF Certified, vExpert

2 Warm-up exercise
You got an email from the app team saying the main Intranet application was slow. The email arrived 1 hour ago. It stated that the application was slow for 1 hour and was fine after that. So it was slow between 1 and 2 hours ago, but is fine now. You did a check: everything has indeed been fine for the past hour.
– The application spans 10 VMs in 2 different clusters, 4 datastores and 1 RDM.
– You are not familiar with the application. You do not know which apps run on each VM, as you have no access to the Guest OS.
– Your environment: 1 vCenter, 4 clusters, 30 hosts, 500 VMs, 40 datastores, 1 midrange array, 10 GE, iSCSI storage.
Test your vSphere knowledge! How do you solve or approach this with just vSphere?
What do you do?
– A: Smile, as this will be a nice challenge for your TAM/BCS/MCS/RE.
– B: No sweat, you’re VCDX + CCIE + ITIL Master. You were born for this.
– C: SMS your wife, “Honey, I’m staying overnight at the datacenter.”
– D: Take blood pressure medicine so it won’t shoot up.
– E: Buy the app team a very nice dinner, and tell them to keep quiet.

3 Performance: How do you know it’s optimised?
What do you measure?
– Utilisation? Utilisation of 100% means it’s performing…? Utilisation of 5% means it’s performing…? Utilisation of 50% means it’s performing…? Really?
– Something else? What is that something else?
To understand this “something else”, we need to go back to fundamentals.
CONFIDENTIAL

4 What do we care about at each layer?
VM
1. We care if it is being served well by the platform. Other VMs are irrelevant from the VM owner’s point of view. Make sure it is not contending for resources.
2. We check if it is sized properly. If too small, increase its configuration. If too big, right-size it for better performance.
SDDC
1. We care if it is serving everyone well. Make sure there is no contention for resources among all the VMs in the platform.
2. We check overall utilisation. Too low, and we are not investing wisely in hardware. Too high, and we need to buy more hardware.

5 Takeaway: Contention and Utilisation
Unlike a physical DC, in virtual infrastructure…
– we use Contention, not Utilisation, for Performance Management
– we use Utilisation (short range) for Performance Management
– we use Utilisation (long range) for Capacity Management
Contention is how you measure that the platform is performing well.
Sounds good! But how do you measure “Contention”?

6 Performance: The counters
What counters prove that it is optimised?
– You need a technical fact to assure yourself. Either that, or take a sleeping pill at night.
– You need a technical fact to show to your customers. Your SLA must be based on something concrete, not subject to interpretation or “feeling of the day”.
– If you can’t prove it, how does anyone know it is optimised? ;-)

7 Optimized Infrastructure Performance*: CPU, RAM, Storage, Network
* While keeping Cost in mind

8 How a VM gets its resources
Diagram labels: Provisioned, Limit, Reservation, Entitlement, Contention, Usage, Demand, 0.

9 VM CPU: The 4 States

10 VM CPU: What do you monitor?
Contention
– Ready (ms)?
– Co-Stop (ms)?
– Latency (%)?
– Max Limited (ms)?
– Overlap (ms)?
– Swap Wait (ms)?
Utilisation
– Used (ms)?
– Usage (%)?
– Demand (MHz)?
Quiz Time! What’s the difference between Average, Summation and Latest? How does the timeline impact the value?
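The quiz question about Summation and timeline can be made concrete with a small sketch. Summation counters such as Ready (ms) accumulate over each sampling interval, so the same millisecond value means very different things on a real-time chart (20-second samples) and on a past-day chart (300-second samples). The function name and the per-vCPU normalisation below are illustrative assumptions, not something stated in the slides.

```python
def summation_ms_to_percent(value_ms, interval_seconds, vcpus=1):
    """Convert a summation counter (ms) into a percentage of the interval.

    interval_seconds is the chart's sampling interval, for example 20 for
    real-time charts or 300 for past-day charts.
    vcpus is an assumed normalisation: a VM-level value is divided by the
    number of vCPUs to get an average per vCPU.
    """
    interval_ms = interval_seconds * 1000 * vcpus
    return value_ms * 100 / interval_ms


# The same 1000 ms of Ready reads very differently depending on the timeline.
realtime = summation_ms_to_percent(1000, 20)    # 20-second sample
past_day = summation_ms_to_percent(1000, 300)   # 300-second sample
print(f"real-time: {realtime:.2f}%  past day: {past_day:.2f}%")
```

A percentage is easier to compare across chart timelines than the raw millisecond value, which is one reason to prefer percentage-based contention counters.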

11 VM CPU: What you should monitor
vCenter Operations
– Contention: Contention (%)
– Utilisation: Workload (%)
vCenter
– Contention: Latency (%), Max Limited (if applicable)
– Utilisation: Usage (%), Demand (MHz)
Discussion Time! What should the value be for an optimized environment?

12 One more thing…
The hypervisor does not have visibility inside the Guest OS.
There is 1 particular CPU counter that you should get. It tells you that there is not enough CPU to meet demand.
vRealize Operations (via Hyperic) does not collect this counter.
Which counter is that?

13 Enough about CPU. Let’s move to RAM!

14 Quiz Time! Which of the following sentences are True:
– Ballooning is bad. If you see that a VM has balloon, that VM has a memory performance problem.
– Ballooning happens before Compression, which happens before Swapping. If you see that a VM has Compressed memory but not Ballooned memory, then vCenter is buggy, or your eyes are just tired.
– If all the VMs on an ESXi host have a low Usage counter, then the ESXi Usage counter must also be low.
– Turn on Large Page, and there goes all your TPS.
– To check if a VM has memory contention, check its CPU Swap Wait counter.
– Why are all the questions difficult?!
Answer
– Ballooning indicates that the ESXi host has memory pressure. It does not mean the VM has a memory performance issue.
– Pages remain compressed or swapped if they are not accessed.
– The Usage counter is different in VM and ESXi! In VM, it is Active. In ESXi, it is Consumed. This is due to the 2-level memory concept.
– Yes, unless your ESXi is under heavy memory constraint.

15 2 levels of Memory Hierarchy (OS and Hypervisor)
New hierarchy in VMware’s memory overcommit technology:
– Transparent Page Sharing
– Ballooning
– Memory Compression
– Swap to Host Cache (SSD)
– Disk swapping
Decompression is sub-ms compared to swap (15-20 ms)!

16 vSphere Memory Management
2 types of Memory Management
– Guest OS level: Balloon
– Hypervisor level: TPS, Compression, Swap to disk, Swap to cache (SSD)
Volunteer Time! Explain Balloon, TPS, Compression.

17 VM RAM: What do you monitor?
Contention
– Swapped?
– Balloon?
– Compressed?
– Latency?
– CPU Swap Wait?
Utilisation
– Active?
– Usage?
– Consumed?

18 VM RAM: What you should monitor
vCenter Operations
– Contention: RAM Contention (%)
– Utilisation: Workload (%), Consumed (KB)
vCenter
– Contention: Latency (%), CPU Swap Wait (ms)
– Utilisation: Usage (%), Consumed (KB)
Discussion Time! What should the value be for an optimized environment?

19 One more thing…
The hypervisor does not have visibility inside the Guest OS.
There is 1 particular RAM counter that you should get. It tells you that there is not enough RAM to meet demand. Which counter is that?
You can monitor Guest OS paging activity by separating the page file into its own vmdk.
– You can then use vC Ops to analyse the pattern.

20 Enough about RAM. Let’s move to Storage!

21 Quiz Time! Which of the following sentences are True:
– The latency counter is (Write Latency + Read Latency) / 2.
– If you have RDM, vCenter does not track the latency.
– If the VM virtual disk counter shows 1000 IOPS, but the VM datastore counter shows 2x the IOPS, something is seriously wrong. Time to call your TAM!
– If all your VMs are experiencing high latency, the first thing you do is check the VMkernel queue.
Answer
– It is not. It takes into account the number of commands issued. It’s a weighted average.
– It only tracks the latency at the latest data. It’s not including other data during the collection period.
– Check for snapshots. Snapshot IOPS is transparent to the virtual disk.
– The first thing you do is check the physical device queue and your storage array. VMkernel queue latency rarely exceeds 1 ms.
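The first answer says the latency counter is a weighted average rather than a simple mean of read and write latency. The sketch below shows how far apart the two can be when the read/write mix is uneven. The slides do not give the exact formula vCenter uses, so weighting by command count here is an assumption based on the phrase "takes into account the number of commands issued", and the function names are hypothetical.

```python
def simple_average_ms(read_latency_ms, write_latency_ms):
    """The naive (Read + Write) / 2 that the quiz says is wrong."""
    return (read_latency_ms + write_latency_ms) / 2


def weighted_latency_ms(read_latency_ms, reads, write_latency_ms, writes):
    """Average latency weighted by how many commands of each kind were issued."""
    total = reads + writes
    if total == 0:
        return 0.0  # no commands issued, so there is no latency to report
    return (read_latency_ms * reads + write_latency_ms * writes) / total


# A read-heavy VM: 900 fast reads and 100 slow writes in the interval.
naive = simple_average_ms(2, 20)
weighted = weighted_latency_ms(2, 900, 20, 100)
print(f"simple average: {naive} ms, weighted average: {weighted} ms")
```

With this mix the simple average overstates what most commands experienced, because the few slow writes count as much as the many fast reads.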

22 VM Storage: Where and what do you monitor?
Diagram labels: Virtual Disk, Disk, Datastore.

23 VM Storage: where to monitor
– For vmdk, use Datastore metric groups.
– For RDM, use Disk metric groups.
– The Disk metric group is naturally not relevant for NFS (files).
Diagram labels: VM; vDisk scsi0:0, scsi0:1, scsi0:2; Datastore; VMFS, NFS, RDM; Disk 1, Disk 2, Disk 3.

24 VM Storage: What do you monitor?
vCenter Operations
– Contention: Latency (ms)
– Utilisation: Commands per second, Usage (KBps), Workload (%)
vCenter
– Contention: Latency (ms)
– Utilisation: Commands Issued, Usage (KBps)

25 VM Network
Contention
– Dropped packets
– Packet retransmits
Utilisation
– Network throughput
Limitations
– We cannot monitor latency (e.g. between source and destination)

26 Different Tiers, Different Optimization
Business Logic:
– Tier 1 is optimised for Performance and Availability
– Tier 3 is optimised for Cost
Do you allow a Tier 1 VM on Tier 3 Storage?
– Or do you map the Compute Tier to the Storage Tier?
What distinguishes Tier 1 from Tier 3?
– Availability
– Performance
– Monitoring
– Cost!

27 Tiering: Considerations
Compute
– No of spare hosts
– No of hosts
– Consolidation Ratio (VM:Host)
– vCPU:pCPU Oversubscribed
– vRAM:pRAM Oversubscribed
– Clustering (e.g. VCS)
Storage
– IOPS per VM
– Latency
Monitoring
– Application availability monitoring (e.g. AppHA)
– Application performance monitoring (e.g. vC Ops Enterprise)
Availability
– Automated DR (SRM)
– RPO
– RTO

28 3-Tiers Offering: Example

                                 Tier 1      Tier 2      Tier 3
No of spare hosts                2           1           1
No of hosts                      6           8           10
Consolidation Ratio (VM:Host)    10:1        20:1        40:1
vCPU:pCPU Oversubscribed         n/a         2.0x        4.0x
vRAM:pRAM Oversubscribed         n/a         1.5x        2.0x
IOPS per VM                      400         200         100
Latency                          <10 ms      15-20 ms    20-25 ms
RPO                              5 minutes   1-2 hours   2-8 hours
RTO                              1 hour      <2 hours    <4 hours

Rows transcribed with fewer than three values:
– Clustering (e.g. VCS): Yes, No
– Application monitoring (e.g. AppHA): Yes, No
– Apps: Yes, No
– Automated DR (SRM): Yes

29 Demystifying “Peak”
There are 2 types of “Peak”
– Peak across time
– Peak across objects
Impacts
– Peak across time can be too high if the burst is high. A VM is low for 24 hours, bursts to 100% for 5 minutes, and you get 100% reported.
– Peak across time can be lower if the number of member objects is high. The peak of a cluster in the past 1 day is 70%. That means at least 1 host was at 70% or higher.
– Peak across objects can be too high if the load is unbalanced. This happens when cluster utilisation is not high enough to trigger DRS or Storage DRS.
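The two kinds of peak can be illustrated with a few lines of arithmetic. The sketch below uses made-up CPU utilisation samples for three hosts in one cluster. The host names, the numbers and the function names are all hypothetical, and the cluster value is assumed to be a plain average of its hosts.

```python
from statistics import mean

# Hypothetical CPU utilisation (%) per host at three points in time.
host_cpu = {
    "host-a": [40, 95, 50],
    "host-b": [30, 45, 40],
    "host-c": [20, 40, 30],
}


def cluster_average_series(samples):
    """Average across hosts at each point in time (assumed cluster roll-up)."""
    return [mean(values) for values in zip(*samples.values())]


def peak_across_time(series):
    """Highest value one object reached during the period."""
    return max(series)


def peak_across_objects(samples):
    """Highest value among member objects at each point in time."""
    return [max(values) for values in zip(*samples.values())]


cluster_series = cluster_average_series(host_cpu)
cluster_peak = peak_across_time(cluster_series)
busiest_host_series = peak_across_objects(host_cpu)
print("cluster average over time:", cluster_series)
print("cluster peak across time:", cluster_peak)
print("busiest host at each time:", busiest_host_series)
```

In this made-up data the cluster peak looks moderate while one host is close to saturation at the same moment, which is the unbalanced case the slide describes.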

30 Sample SLA and Internal Threshold

SLA
                   Tier 1    Tier 2    Tier 3
CPU Contention     1%        3%        13%
RAM Contention     0%        5%        10%
Disk Latency       10 ms     20 ms     30 ms

Internal Threshold
                   Tier 1    Tier 2    Tier 3
CPU Contention     0.5%      2%        10%
RAM Contention     0%        2%        8%
Disk Latency       10 ms     15 ms     20 ms

The SLA only applies to the VM. The VM owner does not care about the underlying platform.
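A per-VM check against these thresholds might look like the sketch below. It assumes the first set of numbers on the slide is the SLA and the second, tighter set is the internal threshold. The metric key names, the function names and the three status strings are hypothetical and only serve the illustration.

```python
# Thresholds taken from the slide; which table is "SLA" and which is
# "internal" is an assumption based on the second set being stricter.
SLA = {
    "Tier 1": {"cpu_contention_pct": 1.0, "ram_contention_pct": 0.0, "disk_latency_ms": 10.0},
    "Tier 2": {"cpu_contention_pct": 3.0, "ram_contention_pct": 5.0, "disk_latency_ms": 20.0},
    "Tier 3": {"cpu_contention_pct": 13.0, "ram_contention_pct": 10.0, "disk_latency_ms": 30.0},
}
INTERNAL = {
    "Tier 1": {"cpu_contention_pct": 0.5, "ram_contention_pct": 0.0, "disk_latency_ms": 10.0},
    "Tier 2": {"cpu_contention_pct": 2.0, "ram_contention_pct": 2.0, "disk_latency_ms": 15.0},
    "Tier 3": {"cpu_contention_pct": 10.0, "ram_contention_pct": 8.0, "disk_latency_ms": 20.0},
}


def breaches(observed, thresholds):
    """Return the metric names whose observed value exceeds its threshold."""
    return sorted(name for name, limit in thresholds.items() if observed[name] > limit)


def classify(observed, tier):
    """Check the SLA first, then the stricter internal threshold."""
    if breaches(observed, SLA[tier]):
        return "sla_breach"
    if breaches(observed, INTERNAL[tier]):
        return "internal_warning"
    return "ok"


vm = {"cpu_contention_pct": 2.5, "ram_contention_pct": 1.0, "disk_latency_ms": 12.0}
print(classify(vm, "Tier 2"), breaches(vm, INTERNAL["Tier 2"]))
```

Keeping the internal threshold tighter than the SLA gives the platform team a warning before the VM owner sees a breach.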

31 Where to monitor at the Platform level?
Compute
– Host?
– Cluster?
– Datacenter?
– vCenter?
Storage
– Host?
– Cluster?
– Datastore?
– Datastore Cluster?
– Datacenter?
– vCenter?
Network
– Standard Switch and port group? :-)
– Host?
– Distributed Switch?
– Distributed Port Group?

32 Where to monitor
Not here
– Compute: Host, Datacenter
– Storage: Host, Cluster
– Network: Host
Monitor these
– Compute: Cluster
– Storage: Datastore, Datastore Cluster
– Network: Distributed Switch, Distributed Port Group
DRS (and Storage DRS) will balance the cluster.

33 QoS in a shared environment
QoS is mandatory in a shared environment.
Areas to control
– Compute
– Network
– Storage
CPU and RAM
– Shares
– Reservation
– Limit?
– Resource Pool?
Storage I/O Control
Network I/O Control

34 QoS: Compute
When not to use Resource Pool? When to use Resource Pool?
What’s the impact of Reservation?
– HA Slot Size. Unless you use %
– Boot time
– Oversubscribe ability. You cannot go beyond 100% reservation.

35 QoS: Storage
A single VM can hog storage throughput
– Just need to run IOmeter
– Unfairly penalizes VMs on hosts with high consolidation ratios
Existing resource management only works for VMs on the same host.
SIOC calculates datastore latency to identify contention
– Latency is a normalized average across VMs
– IO size and IOPS included
Diagram (Without SIOC – Latency is Unbounded): ESX Servers with device queue depth 24 feeding the Storage Array Queue, with per-VM percentages of 100%, 75%, 25%, 38%, 50% and 12%. Without Storage IO Control, the actual disk resources utilized by each VM are not in the correct ratio. Storage congested.

36 QoS: Storage
SIOC enforces fairness when datastore latency crosses a threshold
– Dynamic threshold setting
– Fairness enforced by limiting VMs’ access to queue slots
What’s the limitation?
– No inter-datastore awareness
– Does not work on RDM
– Non-VM workload not included. Work with your Storage team.
– Auto-tiering array is supported
Diagram (With SIOC – Latency is Controlled): VM A with 1500 Shares, VM B with 500 Shares and VM C with 500 Shares on ESX Servers whose device queue depth is throttled from 24, feeding the Storage Array Queue, with percentages of 100%, 75%, 60%, 25% and 20%. With Storage IO Control, the actual disk resources utilized by each VM are in the correct ratio even across ESX hosts. Storage queue throttled.
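The share values in the diagram can be turned into the ratios it shows with simple proportional arithmetic. The sketch below only illustrates that arithmetic. It is not how SIOC is implemented, and the function name is hypothetical.

```python
def fraction_by_shares(shares):
    """Split a contended resource in proportion to each VM's shares."""
    total = sum(shares.values())
    if total == 0:
        raise ValueError("at least one VM must have non-zero shares")
    return {vm: value / total for vm, value in shares.items()}


# Share values from the slide's diagram.
shares = {"VM A": 1500, "VM B": 500, "VM C": 500}
fractions = fraction_by_shares(shares)
for vm, fraction in fractions.items():
    print(f"{vm}: {fraction:.0%} of the contended datastore throughput")
```

These proportions match the 60% and 20% figures in the diagram, and they only matter while the datastore is under contention.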

37 Key Takeaways
– Optimization in SDDC has a lot more components than we normally think.
– Contention is 1st. Utilisation is 2nd.
– SLA is at VM level, not Infrastructure level.
– Peak can be too low or too high.
– Anything else?
