Presentation on theme: "TR Performance Analysis for Dummies"— Presentation transcript:
1 TR 1-470 Performance Analysis for Dummies Lars Ejskjaer, Greg Ferguson
2 Agenda. Understanding Performance Analysis Can Help You Sell: gain a better understanding of your customer's environment; grow awareness of possible system bottleneck areas; understand changes that can enhance your customer's overall satisfaction. Getting Started… Simple Analysis (1, 2, 3). Valuable Tools and Resources.
3 The Customer Problem: "My storage is slow…"
4 Your Goal. Guide the customer to best practices. Recommend solutions (administrative, product). Provide advice in the context of their environment.
5 What You Will Need To Know. An understanding of how the customer's environment was set up. The ability to identify missed best practices (administrative, performance). The ability to identify common performance issues.
7 Easily Collect Performance Data - Perfstat https://communities. netapp Pro Tip: Perform multiple smaller iterations rather than one larger iteration for better visibility.
8 Loading Data into LatX for Analysis https://latx.netapp.com/
9 Find and View the Perfstat NetApp Confidential – Limited Use
10 Finding Specific Data. Pro Tip: Use PRESTATS iteration 1 for configuration information. Use POSTSTATS for performance measurements.
13 Disk Types on the Controller. Prestats - sysconfig -r. Pro Tip: Common disk types today: SAS, BSAS, MSATA. The impact of having both SAS and SATA drives is that when data is flushed from NVRAM to the SATA aggregates, the operation has to finish before data can be flushed to the SAS aggregates, so on systems with high load the SATA aggregates slow down general performance. Our recommendation is to have only SATA drives on one controller rather than mixing drive types. The easiest way to check is to look at the AutoSupport visualization of disks and see whether there are highlighted disks with both 10000/15000 RPM and 7200 RPM. The first picture shows SAS drives. The second picture shows SATA drives. The third picture shows where perfstat_aggregate disktype is located: a value of 6 means ... and a value of 12 means ...
14 RAID Groups. Poststats - statit. Pro Tip: Avoid unbalanced RAID groups in aggregates! Check that the RAID groups in each aggregate are the same size using statit (perfstat). As seen above, aggregate 'aggr0' is built from 2 RAID groups of 16 (14+2) and 6 (4+2) disks. The impact of unbalanced RAID groups is that write operations to the small RAID group have to wait for writes to the large RAID group to finish before NVRAM can be flushed. Best practice is to keep RAID groups equally balanced; in this case 11 (9+2) and 11 (9+2) would have been a much better design.
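The balance check described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the tolerance of one disk, and the assumption of two parity disks per group are choices made for the example, and the group sizes are typed in by hand rather than parsed from statit output.

```python
def is_balanced(group_sizes, tolerance=1):
    """True when the largest and smallest RAID group differ by at most
    `tolerance` disks (the tolerance is an assumption for this sketch)."""
    return max(group_sizes) - min(group_sizes) <= tolerance


def balanced_layout(total_disks, group_count, parity_per_group=2):
    """Spread total_disks over group_count groups as evenly as possible.
    Returns one (data_disks, parity_disks) tuple per group."""
    base, extra = divmod(total_disks, group_count)
    sizes = [base + 1 if i < extra else base for i in range(group_count)]
    return [(size - parity_per_group, parity_per_group) for size in sizes]


# The example from the slide: aggr0 with groups of 16 and 6 disks.
sizes = [16, 6]
print(is_balanced(sizes))                       # False
print(balanced_layout(sum(sizes), len(sizes)))  # [(9, 2), (9, 2)]
```

For the 22 disks in the example, the sketch arrives at the same 11 (9+2) and 11 (9+2) layout that the notes recommend.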
15 32-bit vs 64-bit Aggregates. Prestats - aggr status -v. Pro Tip: New features work best with (or require) 64-bit aggregates. Using one aggregate type makes the customer's operations easier. In some cases customers have upgraded from old controllers to newer controllers while keeping the disk shelves, so there will often be both 32-bit and 64-bit aggregates. This can be checked in the raw AutoSupport; the section is aggr-status-v. New features like deduplication and compression are optimized for 64-bit aggregates, but be aware that on older or smaller controllers there is a potential memory impact if 32-bit aggregates are converted to 64-bit. From a management perspective it is an advantage for the customer that all aggregates have the same format, as they otherwise have to monitor 2 different sets of sizing.
16 Aggregate Utilization. Poststats - df -A -h. Pro Tip: An aggregate that is 80% full should be monitored. Aggregates over 90% full could impact performance. If an aggregate gets more than 80% full, we often see performance degradation; the reason is that ONTAP spends too much time finding free blocks. This can be checked using perfstat df -A -h.
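As a rough illustration of the two thresholds in the Pro Tip, the following Python sketch classifies an aggregate by its used percentage. The function name and the sample aggregate names are hypothetical, and the percentages are assumed to have been read off the df -A -h output by hand.

```python
def classify_aggregate(used_pct):
    """Apply the slide's rule of thumb: over 90% full could impact
    performance, 80% or more should be monitored."""
    if used_pct > 90:
        return "could impact performance"
    if used_pct >= 80:
        return "monitor"
    return "ok"


# Hypothetical aggregates and fill levels, for illustration only.
aggregates = {"aggr_example_1": 95, "aggr_example_2": 84, "aggr_example_3": 61}
for name, used_pct in aggregates.items():
    print(name, classify_aggregate(used_pct))
```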
17 Aggregate Snapshot Copies. Prestats - df -A -h, snap status -A & aggr status -v. Pro Tip: Only SyncMirror and MetroCluster use aggregate snapshots. If not in use: remove the schedule and release the space reservation. SNAP-SCHED-A is not in perfstat, but snap status -A will show whether there are active aggregate snapshots; remember to mention the date field. Aggregate snapshots are used by SyncMirror and in a MetroCluster. This is an example where the customer is running MetroCluster, and we can see that the aggregate snapshots are in use. If the customer is using neither MetroCluster nor SyncMirror, there should be no entries here; if there are, the solution is to remove the schedule and release the space reservation for aggregate snapshots ...
19 Volume Space Utilization. Poststats - df & lun stat -v all (or vol status -v). Pro Tip: Databases like Oracle initialize their data files, or the data could be static, so it can be fine that the volume is almost full. Sometimes volumes also run almost full; if that happens, performance degradation is expected. This can be checked through the disk visualization from AutoSupport. Here we have 2 volumes that are almost full. That can be perfectly fine provided the application (typically a database) initializes all space when the storage is assigned; otherwise there is a challenge. So check with the customer what type of data is contained in suspicious volumes. Note that the big challenge here is that the LUN is thin provisioned (FLEX), so any writes (for initialized databases, rewrites) will happen at the aggregate level, and snapshots will also use aggregate space …
20 Deduplication. Poststats - stats perfstat_sis. Pro Tip: Savings of less than 6-8% just use resources without a real effect; unless the data is static, turn it off and save resources! When looking at deduplication status, the most important thing is the percent saved: savings of less than 10% are not worth the effort unless we are looking at static data. So in the first example we will have to consider the type of data, whereas the second example shows a good, solid saving.
21 Deduplication Runtimes. Poststats - sis status -l. Pro Tip: Deduplication is a low-priority process, but it still occupies system resources. Work to understand run times to smooth system scheduling for better performance. Two things come into play: overrun into production time and load on the controller, particularly in environments with SnapMirror, as both processes run in Kahuna (as far as I know). Look at the runtime compared to the savings ... This was the case before, when we had a saving of 4%: having a process run for close to 4 hours with such a poor result probably makes no sense unless the system is virtually idle.
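To make the "runtime compared to savings" check concrete, here is a minimal Python sketch. It assumes the percent saved and the runtime have already been read from the perfstat_sis and sis status -l sections by hand; the record type, the function name, and the volume names are hypothetical. The 10% default comes from the speaker notes on the deduplication slide (the Pro Tip there says 6-8%), so it is left as a parameter.

```python
from dataclasses import dataclass


@dataclass
class DedupeRun:
    volume: str
    saved_pct: float       # percent saved by deduplication
    runtime_hours: float   # duration of the last run
    static_data: bool = False


def review_candidates(runs, min_saved_pct=10.0):
    """Return runs with low savings on non-static data, longest first.
    These are candidates for turning deduplication off."""
    flagged = [r for r in runs
               if r.saved_pct < min_saved_pct and not r.static_data]
    return sorted(flagged, key=lambda r: r.runtime_hours, reverse=True)


# Hypothetical volumes; the first mirrors the 4% / roughly 4 hour case.
runs = [
    DedupeRun("vol_a", saved_pct=4.0, runtime_hours=3.9),
    DedupeRun("vol_b", saved_pct=35.0, runtime_hours=1.5),
    DedupeRun("vol_c", saved_pct=5.0, runtime_hours=0.5, static_data=True),
]
print([r.volume for r in review_candidates(runs)])  # ['vol_a']
```

Whether a flagged volume should really have deduplication disabled still depends on how idle the controller is during the run, as the notes point out.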
22 Misalignment. Poststats - nfsstat -d & lun stat -v all. Pro Tip: Understand what is in these files/LUNs. Log files are often "misaligned". Verify that virtual machines are aligned. A lot of systems experience misaligned I/O, and we often hear that customers need to align I/O, but there are cases where misalignment is a function of the application; a simple example could be the log files from a database. But before going into that discussion, let's have a look at some data. First we categorise whether the data is laid out using a file protocol (NFS) or block. What we can see here is that all data should reside in BIN-0, so there is misalignment in this environment. Once we have discovered that, we have to find out which volumes are causing this I/O profile. The easiest way is to use AutoSupport LUN_CONFIGURATION and search for misaligned ...
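Since the notes say that all data should reside in BIN-0, one simple way to quantify misalignment is the share of requests that land in any other bin. The Python sketch below assumes the per-bin counts have been copied by hand into a list with BIN-0 first; the function name and the sample counts are made up for illustration.

```python
def misaligned_pct(bin_counts):
    """Percentage of requests that fall outside BIN-0.
    bin_counts[0] is BIN-0; the remaining entries are the other bins."""
    total = sum(bin_counts)
    if total == 0:
        return 0.0
    return 100.0 * sum(bin_counts[1:]) / total


# Hypothetical counts for BIN-0 .. BIN-7.
print(misaligned_pct([700, 50, 50, 50, 50, 50, 25, 25]))  # 30.0
```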
24 What Are Domains In Data ONTAP? ONTAP breaks work into groups of processes called domains. ONTAP schedules work across CPU cores as it sees best. This can be seen in sysstat -M. Detailed analysis of this is an advanced topic.
25 CPU Utilization. Poststats - sysstatM.out. Pro Tip: Average CPU utilization >70% indicates a very busy controller. CPU utilization is generally a poor indicator of performance. Look at AVG CPU; single CPU (thread) utilization is very informative. As in the server world, CPU utilization higher than 70% indicates that the system is (over-)utilized. It can be checked using sysstatM. On this system we can see that it is currently very busy, as AVG is greater than 70%. We can also see that the controller spends its time in the part of the ONTAP kernel called Kahuna. Anything handled by this domain is single threaded, so things like space reclamation, deduplication, SnapMirror … As this is happening outside business hours it is not critical, but had it been during business hours we would have to find out what the system was doing (maybe an overrunning deduplication, too high an application load, …). Note that this system has 1 CPU with 2 cores (FAS3140). In the second example we can see an earlier version, ONTAP 8.0. Don't be confused by the 100% in the field ANY1+; it only has to do with the way ONTAP used to allocate resources (changed in ONTAP 8.1 and onwards).
26 Writing Data to Disk - CP Type & Time. Poststats - sysstat_1sec.out. Pro Tip: Deferred back-to-back CPs are performance killers (type #). Data can't get to disk fast enough. Investigate (ignoring the CPU utilization): misalignment, disk overcommitment. Solutions: move load to another controller/aggregate, add disks/Flash Cache. In some cases a system is very busy and the controller encounters back-to-back CPs. The performance impact is very high. Back to back means that NVRAM is flushing to disk while at the same time some data is waiting to be placed in NVRAM, so the controller is virtually stopped until the data is flushed. It is normal to see a few back-to-back CPs on target controllers for backup, but on systems running applications it is a very bad sign! There are 2 places to check CPs: statit and sysstat_1sec. In this example we can see that the statit output shows both deferred back-to-back CPs and back-to-back CPs (even though the figures are small, they really count!). This system cannot flush data fast enough, so adding extra disks and aggregates, and eventually Flash Cache, is the best solution. In sysstat_1sec.out we can see that we have back-to-back CPs: b/B in the CP ty column shows that we are in trouble (for this customer we actually added some disk shelves and helped him organise the data more efficiently; one of the causes was in fact a mixture of SAS and SATA on the same controller). By the way, if you see H in the column, it means that data in NVRAM was flushed to disk without having a full RAID stripe; at the end there is a link where you can get information on the different values.
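Counting the b and B entries in the CP ty column can be sketched as below. This is an illustration under stated assumptions: the column values are assumed to have been pulled out of sysstat_1sec.out already (no parsing of the real file layout is attempted), only the first character of each value is treated as the CP type, and the function name and sample values are hypothetical.

```python
from collections import Counter


def back_to_back_summary(cp_types):
    """Count deferred back-to-back ('b') and back-to-back ('B') CPs.
    Only the first character of each CP ty value is inspected."""
    counts = Counter(value[0] for value in cp_types if value)
    return {
        "deferred_back_to_back": counts.get("b", 0),
        "back_to_back": counts.get("B", 0),
    }


# Hypothetical CP ty values, one per one-second sample.
samples = ["H", "B", "b", "B", "H"]
print(back_to_back_summary(samples))
# {'deferred_back_to_back': 1, 'back_to_back': 2}
```

As the notes stress, even small counts matter on a controller that serves applications, so any non-zero result is worth a closer look.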
27 Aggregate Disk Utilization. Poststats - statit. Pro Tip: Analyze performance impact based on drive utilization (by drive type). SATA drives > 50% = busy. SAS drives > 60% = busy. Statit will give clues about where to move load to. Check the disk utilization using statit (ut% and xfers). There should be a balance between all drives in an aggregate; if not, action has to be taken, as we have a hot spot. If SATA drives are more than 50% utilized or SAS drives more than 60% utilized, there will be a performance impact. A solution could be to take some of the load off the aggregate or, alternatively, to add more disks. Statit output: as seen, there is very high read utilization and very low write utilization. The good thing is that utilization is equally split between the drives. The read/write ratio is determined by the ut% of the parity disks (4a.10.3, 4a.10.4, 4a.10.1 & 4a.10.2) compared to the data drives.
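The two checks on this slide, per-type busy thresholds and balance across the drives of an aggregate, can be sketched as follows. The ut% values are assumed to have been copied from statit by hand; the function names, disk names, and sample figures are hypothetical, and the sketch does not say how large a spread counts as a hot spot, because the slide gives no number for that.

```python
# Busy thresholds from the Pro Tip, in percent utilization.
BUSY_THRESHOLD_PCT = {"SATA": 50.0, "SAS": 60.0}


def busy_disks(disks):
    """disks: iterable of (name, drive_type, ut_pct) tuples.
    Returns the names of disks above the threshold for their type."""
    return [name for name, drive_type, ut_pct in disks
            if ut_pct > BUSY_THRESHOLD_PCT[drive_type]]


def utilization_spread(disks):
    """Difference between the busiest and the least busy disk, in
    percentage points; a large value hints at a hot spot."""
    values = [ut_pct for _, _, ut_pct in disks]
    return max(values) - min(values)


# Hypothetical disks, for illustration only.
disks = [
    ("disk_1", "SATA", 55.0),
    ("disk_2", "SATA", 48.0),
    ("disk_3", "SAS", 61.0),
    ("disk_4", "SAS", 40.0),
]
print(busy_disks(disks))          # ['disk_1', 'disk_3']
print(utilization_spread(disks))  # 21.0
```

In practice the spread would be computed per aggregate, since the slide asks for balance between the drives within one aggregate.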
28 Active Volumes. Perfsys Report. Pro Tip: Map volumes to aggregates (aggr status -v). Identify workloads to move. Here is another way of getting the information. This is an Oracle database where the data files have been initialized, so it will not grow ...
29 FlashCache. Perfsys Report. Pro Tip: Verify that the system is benefiting from the use of FlashCache. Use PCS and perfstat to verify a customer's gains from FlashCache. The goal, both when running PCS and when checking systems with FlashCache, is to get a picture of how well it actually helps performance. The key to gathering this information is to look at the perfstat ext_cache_obj output and the counters from statit. Comparing disk_reads_replaced with the metadata hits and misses shows that the cache has an impact. Comparing this with the evicts (i.e. data that is no longer being used) and invalidates (data that has been changed by the application) shows a big impact. Now the only thing left is to see how much I/O there actually is on the system, so we go back to statit and look at the disk utilization (everything here is in 4k blocks). Here is one of the aggregates. As you can see, just for this little piece of the system there are approximately 3300 disk reads; in total the system had fewer than 5000 disk reads, which means that the performance impact is really huge ... You can also conclude that the application massages the same data over and over again (the sample is from month-end reporting for a mid-sized insurance company).
30 Latency in the Environment. Poststats - stats perfstat_cifs/_nfs/_fcp. Pro Tip: High latency = performance impact. Latency requirements vary by application. Analyze the workload and: add disk, add FlashCache, upgrade the controllers. Finally we will touch a bit on latency ... stats perfstat_cifs, perfstat_nfs & perfstat_fcp all provide histogram-like information on latency. This capture shows the average latency for a block environment (it is actually from the insurance company system during month end, before adding cache). As you can see, the average read latency is 5.47ms, which is good, but we can also see that the system is overloaded, as there are many operations over 10ms. Going a bit further down in this section of perfstat, you can see how the latency is spread between read and write ... Here we can see that this is really read intensive, as the average is 7.6ms, while the write latency is very good (the application reads, and NVRAM is working efficiently). The solution to issues like this is to add more disks, spread the load over even more disks, and add flash in some fashion (FlashCache).
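The point that a good average can hide many slow operations can be illustrated with a small Python sketch over a latency histogram. The bucket boundaries and counts below are invented for the example and do not reproduce the perfstat layout; the function name is hypothetical, and each bucket is represented by its lower bound in milliseconds.

```python
def pct_ops_at_or_above(histogram, limit_ms):
    """histogram: list of (bucket_lower_bound_ms, op_count) pairs.
    Returns the percentage of operations in buckets that start at or
    above limit_ms."""
    total = sum(count for _, count in histogram)
    if total == 0:
        return 0.0
    slow = sum(count for lower, count in histogram if lower >= limit_ms)
    return 100.0 * slow / total


# Hypothetical read-latency histogram: most operations are fast,
# but a sizeable tail sits at 10 ms and above.
reads = [(0, 500), (2, 300), (10, 150), (20, 50)]
print(pct_ops_at_or_above(reads, limit_ms=10))  # 20.0
```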
31 Review of Findings. We reviewed: high disk I/Os, busy volumes, disk configuration issues, average CPU utilization, potential Flash Cache benefits, misalignments.
32 Recommendations. Resolve misalignments. Consider moving busy volumes. Add drives and reallocate to even out RAID groups. Add Flash Cache. Upgrade controllers. Add more controllers within Clustered Data ONTAP.
33 Now YOU Can Better Understand System Performance. Don't be afraid: performance is no longer such a mystery! Most importantly, monitor disk utilization! Have FUN! To finish off, use these resources; there is a lot of information to be found, particularly on the communities site.
34 Important Resources. The Community site. ONTAP documentation (particularly the ONTAP Command ref). LatX, your analysis tool.
35 Complementary Sessions. TR Sizing, Designing, and Presenting a NetApp Solution; TR Using NetApp Tech Tools to Create Winning Proposals and Tech Refreshes; TR A Field Guide to Sizing - Part 1; TR A Field Guide to Sizing - Part 2.