TR Performance Analysis for Dummies


1 TR 1-470 Performance Analysis for Dummies
Lars Ejskjaer Greg Ferguson

2 Agenda
Understanding Performance Analysis Can Help You Sell
  Gain a better understanding of your customer's environment
  Grow awareness of possible system bottleneck areas
  Understand changes that can enhance your customer's overall satisfaction
Getting Started…
Simple Analysis (1, 2, 3)
Valuable Tools and Resources

3 The Customer Problem ”My storage is slow…”

4 Your Goal
Guide the customer to best practices
Recommend solutions
  Administrative
  Product
Provide advice in the context of their environment

5 What You Will Need To Know
Understanding of how the customer's environment was set up
Ability to identify missed best practices
  Administrative
  Performance
Ability to identify common performance issues

6 Getting Started

7 Easily Collect Performance Data - Perfstat https://communities. netapp
Pro Tip: Perform multiple smaller iterations rather than one larger iteration for better visibility

8 Loading Data into LatX for Analysis https://latx.netapp.com/

9 Find and View the Perfstat
NetApp Confidential – Limited Use

10 Finding Specific Data
Pro Tip:
  Use PRESTATS iteration 1 for configuration information
  Use POSTSTATS for performance measurements

11 Analysis Strategy
Understand Configuration
  Aggregates
  Volumes
  Look for configuration errors
Analyze Performance
  System
  Disk
  Flash
Make Recommendations

12 The Simple Analysis Part 1
Disk Configuration

13 Disk Types on the Controller
Prestats - sysconfig -r
Pro Tip: Common disk types today: SAS, BSAS, MSATA
When a controller has both SAS and SATA drives, data being flushed from NVRAM to the SATA aggregates has to finish before data can be flushed to the SAS aggregates, so on systems with high load the SATA aggregates slow down overall performance. Our recommendation is to keep SATA drives on one controller only. The easiest way to check is to look at the AutoSupport visualization of disks and see whether the highlighted disks include both 10000/15000 RPM and 7200 RPM drives.
The first picture shows SAS drives. The second picture shows SATA drives. The third picture shows where the perfstat_aggregate disktype is located: a value of 6 means ... and a value of 12 means ...
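The mixed-speed check described in the notes can be sketched in a few lines of Python. This is only an illustration: the helper name and the idea of passing a plain list of RPM values are assumptions, and extracting those values from the sysconfig -r or AutoSupport output is not shown.

```python
# Hypothetical helper: the RPM values would come from the disk listing
# (for example sysconfig -r); parsing that output is not shown here.
FAST_RPMS = {10000, 15000}  # SAS speeds named in the notes
SLOW_RPM = 7200             # SATA speed named in the notes


def has_mixed_drive_speeds(disk_rpms):
    """Return True when both 7200 RPM and 10000/15000 RPM disks are present."""
    speeds = set(disk_rpms)
    return SLOW_RPM in speeds and bool(speeds & FAST_RPMS)


print(has_mixed_drive_speeds([15000, 15000, 7200]))  # True
print(has_mixed_drive_speeds([10000, 15000]))        # False
```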

14 RAID Groups
Poststats - statit
Pro Tip: Avoid unbalanced RAID groups in aggregates!
Check that the RAID groups in each aggregate are the same size using statit (perfstat). As seen above, aggregate 'aggr0' is built from 2 RAID groups of 16 (14+2) and 6 (4+2) disks. The impact of unbalanced RAID groups is that write operations to the small RAID group have to wait for writes to the large RAID group to finish before the NVRAM can be flushed. Best practice is to keep RAID groups equally balanced; in this case 11 (9+2) and 11 (9+2) would have been a much better design.
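The arithmetic behind the 11 (9+2) suggestion can be written out as a small Python sketch. The function name is hypothetical, and the two parity disks per group are taken from the 14+2 and 4+2 layout in the example; real sizing also has to respect platform and RAID-type limits, which are not modelled here.

```python
def balanced_raid_groups(total_disks, group_count, parity_per_group=2):
    """Split total_disks as evenly as possible across group_count RAID groups.

    Returns one (group_size, data_disks, parity_disks) tuple per group.
    """
    base, extra = divmod(total_disks, group_count)
    sizes = [base + 1 if i < extra else base for i in range(group_count)]
    return [(size, size - parity_per_group, parity_per_group) for size in sizes]


# The aggregate from the slide: 16 + 6 = 22 disks in 2 RAID groups.
print(balanced_raid_groups(22, 2))  # [(11, 9, 2), (11, 9, 2)]
```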

15 32-bit vs 64-bit Aggregates
Prestats - aggr status -v
Pro Tip:
  New features work best with (or require) 64-bit aggregates
  Using one aggregate type makes the customer's operations easier
In some cases customers have upgraded from old controllers to newer controllers while keeping the disk shelves, so there will often be both 32-bit and 64-bit aggregates. This can be checked through the raw AutoSupport; the section is aggr-status-v. New features like deduplication and compression are optimized for 64-bit aggregates, but be aware that on older or smaller controllers there is a potential memory impact if 32-bit aggregates are converted to 64-bit. From a management perspective it is an advantage for the customer if all aggregates use the same format, as they otherwise have to monitor 2 different sets of sizing.

16 Aggregate Utilization
Poststats - df -A -h
Pro Tip:
  An aggregate that is 80% full should be monitored
  Aggregates over 90% full could impact performance
If an aggregate gets more than 80% full we often see a performance degradation; the reason is that ONTAP spends too much time finding free blocks. This can be checked using perfstat df -A -h.
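As a rough illustration, the two thresholds from the Pro Tip could be applied like this in Python. The function name and the result labels are made up for the example, and the percentage is assumed to have been read from the df -A output already.

```python
def classify_aggregate(percent_used, monitor_at=80.0, impact_above=90.0):
    """Map an aggregate's fill level to the guidance from the Pro Tip."""
    if percent_used > impact_above:
        return "possible performance impact"
    if percent_used >= monitor_at:
        return "monitor"
    return "ok"


for used in (50, 85, 92):
    print(used, classify_aggregate(used))
```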

17 Aggregate Snapshot Copies
Prestats - df -A -h, snap status -A & aggr status -v
Pro Tip:
  Only SyncMirror and MetroCluster use aggregate snapshots
  If not in use:
    Remove schedule
    Release space reservation
Aggregate snapshots are used by SyncMirror and in a MetroCluster. Using snap status -A will show whether there are active aggregate snapshots (pay attention to the date field). This is an example where the customer is running MetroCluster, and we can see that the aggregate snapshots are in use. If the customer is using neither MetroCluster nor SyncMirror, there should be no entries here; if there are, the solution is to remove the schedule and release the space reservation for aggregate snapshots ...

18 The Simple Analysis Part 2
Volumes

19 Volume Space Utilization
Poststats - df & lun stat -v all (or vol status -v)
Pro Tip: Databases like Oracle initialize their data files, or the data could be static, so it can be fine that a volume is almost full
Sometimes volumes run almost full; if that happens, a performance degradation is expected. This can be checked through the disk visualization from AutoSupport. Here we have 2 volumes that are almost full. That can be perfectly fine provided the application (typically a database) initializes all space when the storage is assigned; otherwise there is a challenge. So check with the customer what type of data the suspicious volumes contain. Note that the big challenge here is that the LUN is thin provisioned (FLEX), so any writes (rewrites, for initialized databases) will happen at the aggregate level, and snapshots will also be using aggregate space …

20 Deduplication
Poststats - stats perfstat_sis
Pro Tip: Savings of less than 6-8% just use resources without a real effect. Unless the data is static, turn it off and save resources!
When we look at deduplication status, the most important thing is the percent saved: savings of less than 10% are not worth the effort unless we are looking at static data. So in the first example we will have to consider the type of data, whereas the second example shows a good, solid saving.

21 Deduplication Runtimes
Poststats - sis status -l
Pro Tip:
  Deduplication is a low-priority process, but it still occupies system resources
  Work to understand run times to smooth system scheduling for better performance
Two things come into play: overrun into production time and load on the controller, particularly in environments with SnapMirror, as both processes run in Kahuna (as far as we know). Look at the runtime compared to the savings ... This was the case earlier, where we had a saving of 4%: a process running close to 4 hours with such a poor result probably makes no sense unless the system is virtually idle.
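The savings rule of thumb from the deduplication slides can be expressed as a tiny Python check. The function name is hypothetical, and the 10% default is just one of the figures quoted in the notes (the Pro Tip mentions 6-8%), so treat the threshold as a parameter to agree with the customer rather than a fixed rule.

```python
def dedup_worth_running(percent_saved, data_is_static, min_savings=10.0):
    """Apply the rule of thumb: keep dedup for static data or solid savings."""
    return data_is_static or percent_saved >= min_savings


# The example from the notes: 4% saved after a run of close to 4 hours.
print(dedup_worth_running(4.0, data_is_static=False))  # False
print(dedup_worth_running(4.0, data_is_static=True))   # True
```

Runtime and controller load would still have to be weighed separately; this sketch only covers the savings side.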

22 Misalignment
Poststats - nfsstat -d & lun stat -v all
Pro Tip:
  Understand what is in these files/LUNs
  Log files are often "misaligned"
  Verify that virtual machines are aligned
A lot of systems experience misaligned I/O, and we often hear that customers need to align I/O, but there are cases where misalignment is a function of the application; a simple example is the log files from a database. Before going into that discussion, let's have a look at some data. First we categorise whether the data is laid out using a file protocol (NFS) or block. Since all data should reside in BIN-0, what we see here tells us there is misalignment in this environment. Once we have discovered that, we have to find out which volumes are causing this I/O profile. The easiest way is to use AutoSupport LUN_CONFIGURATION and search for misaligned ...
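To put a number on "all data should reside in BIN-0", a short Python sketch can compute the share of I/O that lands outside bin 0. The function name and the input shape (a list of per-bin counts with bin 0 first) are assumptions; pulling the histogram out of the nfsstat -d or lun stat output is not shown.

```python
def misaligned_share(bin_counts):
    """Return the fraction of operations that fall outside bin 0.

    bin_counts: per-bin operation counts, with bin 0 (aligned) first.
    """
    total = sum(bin_counts)
    if total == 0:
        return 0.0  # no I/O recorded, nothing to report
    return (total - bin_counts[0]) / total


# Made-up counts for illustration: 80 aligned operations, 20 spread elsewhere.
print(misaligned_share([80, 5, 5, 10, 0, 0, 0, 0]))  # 0.2
```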

23 The Simple Analysis Part 3
System Performance

24 What Are Domains In Data ONTAP?
ONTAP breaks work into groups of processes called domains
ONTAP schedules work across CPU cores as it sees best
This can be seen in sysstat -M
Detailed analysis of this is an advanced topic

25 CPU Utilization
Poststats - sysstatM.out
Pro Tip:
  Average CPU utilization >70% indicates a very busy controller
  CPU utilization is generally a poor indicator of performance
  Look at AVG CPU; single CPU (thread) utilization is very informative
As in the server world, a CPU utilization higher than 70% indicates that the system is (over-)utilized. It can be checked using sysstatM. On this system we can see that it is currently very busy, as AVG is greater than 70%. We can also see that the controller spends its time in the part of the ONTAP kernel called Kahuna. Anything handled by this domain is single threaded, so things like space reclamation, deduplication, SnapMirror … As this is happening outside business hours it is not critical, but had it been during business hours we would have to find out what the system was doing (maybe an overrunning deduplication, too high an application load, …). Note that this system has 1 CPU with 2 cores (FAS3140). In the second example we can see an earlier version, ONTAP 8.0. Don't be confused by the 100% in the field ANY1+; it only has to do with the way ONTAP used to allocate resources (changed in ONTAP 8.1 and onwards).
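The 70% guideline can be checked over a series of samples with a few lines of Python. This is a sketch under assumptions: the function name is made up, and the AVG values are assumed to have been extracted from sysstatM.out already.

```python
def is_busy_controller(avg_cpu_samples, threshold=70.0):
    """Return True when the mean of the AVG CPU samples exceeds the threshold."""
    if not avg_cpu_samples:
        return False  # no samples, no verdict
    mean = sum(avg_cpu_samples) / len(avg_cpu_samples)
    return mean > threshold


print(is_busy_controller([65.0, 80.0, 90.0]))  # True
print(is_busy_controller([50.0, 60.0]))        # False
```

As the Pro Tip says, CPU utilization alone is a poor indicator, so a result like this is a prompt to look at domains and CP behaviour rather than a conclusion.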

26 Writing Data to Disk - CP Type & Time
Poststats - sysstat_1sec.out
Pro Tip:
  Deferred back-to-back CPs are performance killers (type #)
  Data can't get to disk fast enough
  Investigate (ignoring the CPU utilization):
    Misalignment
    Disks overcommitted
  Solutions:
    Move load to another controller/aggregate
    Add disks/Flash Cache
In some cases a system is very busy and the controller encounters back-to-back CPs (consistency points). The performance impact is very high. Back-to-back means that the NVRAM is flushing to disk while at the same time some data is waiting to be placed in NVRAM, so the controller is virtually stopped until the data is flushed. It is normal to see a few back-to-back CPs on target controllers for backup, but on systems running applications it is a very bad sign!
There are 2 places to check CPs: statit and sysstat_1sec. In this example the statit output shows both deferred back-to-back CPs and back-to-back CPs (even though the figures are small, they really count!). This system cannot flush data fast enough, so adding extra disks and aggregates and eventually Flash Cache is the best solution. In sysstat_1sec.out we can see that we have back-to-back CPs: b/B in the CP ty column shows that we are in trouble. (For this customer we actually added some disk shelves and helped him organise the data more efficiently; one of the causes was a mixture of SAS and SATA on the same controller.) By the way, if you see H in the column it means that data in NVRAM was flushed to disk without having a full RAID stripe; at the end there is a link where you can get information on the different values.
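Counting the b/B entries mentioned in the notes is easy to script. The Python sketch below assumes the CP ty values have already been pulled out of sysstat_1sec.out into a list of strings, one per sample; the function name and that input shape are illustrative only.

```python
def count_back_to_back(cp_types):
    """Count samples whose CP type starts with 'B' or 'b' (back-to-back CPs)."""
    return sum(1 for cp in cp_types if cp[:1] in ("B", "b"))


# Made-up column values for illustration.
samples = ["T", "B", "b", "H", "Bf"]
print(count_back_to_back(samples))  # 3
```

On a system running applications, any non-zero count is worth investigating, in line with the notes above.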

27 Aggregate Disk Utilization
Poststats - statit
Pro Tip:
  Analyze performance impact based on drive utilization (by drive type)
    SATA drives > 50% = busy
    SAS drives > 60% = busy
  Statit will give clues about where to move load to
Check the disk utilization using statit (ut% and xfers). There should be a balance between all drives in an aggregate; if not, action has to be taken, as we have a hot spot. If SATA drives are more than 50% utilized or SAS drives more than 60% utilized, there will be a performance impact. A solution could be to take some of the load off the aggregate or, alternatively, to add more disks.
In the statit output shown, there is a very high read utilization and very low write utilization. The good thing is that utilization is equally split between the drives. The read/write ratio is determined by the ut% of the parity disks (4a.10.3, 4a.10.4, 4a.10.1 & 4a.10.2) compared to the data drives.
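The per-drive-type thresholds lend themselves to a simple filter. In this Python sketch the function name, the tuple layout, and the disk names are hypothetical; the ut% values are assumed to have been read from the statit output beforehand.

```python
# Busy thresholds from the Pro Tip, in percent utilization (ut%).
BUSY_THRESHOLDS = {"SATA": 50.0, "SAS": 60.0}


def busy_disks(disks):
    """Return names of disks whose ut% exceeds the threshold for their type.

    disks: iterable of (name, drive_type, ut_percent) tuples.
    """
    return [
        name
        for name, drive_type, ut_percent in disks
        if ut_percent > BUSY_THRESHOLDS[drive_type]
    ]


# Made-up sample data for illustration.
sample = [
    ("4a.10.5", "SAS", 72.0),
    ("4a.10.6", "SAS", 55.0),
    ("0b.20.1", "SATA", 55.0),
]
print(busy_disks(sample))  # ['4a.10.5', '0b.20.1']
```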

28 Active Volumes
Perfsys Report
Pro Tip:
  Map volumes to aggregates (aggr status -v)
  Identify workloads to move
Here is another way of getting the information. This is an Oracle database where the data files have been initialized, so it will not grow ...

29 FlashCache
Perfsys Report
Pro Tip:
  Verify that the system is benefiting from the use of FlashCache
  Use PCS and perfstat to verify a customer's gains from the use of FlashCache
The goal, both when running PCS and when checking systems with FlashCache, is to get a picture of how well it actually helps performance. The key to gathering this information is to look at the perfstat ext_cache_obj output and the counters from statit. Comparing disk_reads_replaced with the metadata hits and misses shows that the cache has an impact; comparing this with the evicts (i.e. data that is no longer being used) and invalidates (data that has been changed by the application) shows a big impact.
Now the only thing left is to see how much I/O there actually is on the system, so we go back to statit and look at the disk utilization (everything here is in 4k blocks). Here is one of the aggregates: as you can see, just this little piece of the system accounts for approximately 3300 disk reads. In total the system had fewer than 5000 disk reads, which means that the performance impact is really huge ... You can also conclude that the application works on the same data over and over again (the sample is from month-end reporting for a mid-sized insurance company).

30 Latency in the Environment
Poststats - stats perfstat_cifs/_nfs/_fcp
Pro Tip:
  High latency = performance impact
  Latency requirements vary by application
  Analyze the workload and:
    Add disks
    Add FlashCache
    Upgrade the controllers
Finally we will touch a bit on latency ... stats perfstat_cifs, perfstat_nfs & perfstat_fcp all provide histogram-like information on the latency. This capture shows the average latency for a block environment (it is actually from the insurance company system during month end, before adding cache). As you can see, the average read latency is 5.47 ms, which is good, but we can also see that the system is overloaded, as there are many operations over 10 ms. Going a bit further down in this section of perfstat, you can see how the latency is spread between read and write ... Here we can see that this is really read intense, as the average is 7.6 ms, and likewise the write latency is very good (the application reads and NVRAM is working efficiently). The solution to issues like this is to add more disks, spread the load over even more disks, and add flash in some form (FlashCache).
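The point that a good average can hide many slow operations is easy to show with a histogram. The Python sketch below assumes the latency histogram has been reduced to (lower bound in ms, count) pairs; that shape, the function name, and the sample numbers are illustrative and do not reflect the exact perfstat layout.

```python
def share_over(histogram, limit_ms=10):
    """Return the fraction of operations in buckets starting at or above limit_ms.

    histogram: list of (bucket_lower_bound_ms, operation_count) pairs.
    """
    total = sum(count for _, count in histogram)
    if total == 0:
        return 0.0  # no operations recorded
    slow = sum(count for lower, count in histogram if lower >= limit_ms)
    return slow / total


# Made-up buckets for illustration: most operations are fast, a fifth are not.
sample = [(0, 50), (1, 30), (10, 15), (20, 5)]
print(share_over(sample))  # 0.2
```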

31 Review of Findings
We reviewed:
  High disk I/Os
  Busy volumes
  Disk configuration issues
  Average CPU utilization
  Potential Flash Cache benefits
  Misalignments

32 Recommendations
Resolve misalignments
Consider moving busy volumes
Add drives and reallocate to even out RAID groups
Add Flash Cache
Upgrade controllers
Add more controllers within Clustered Data ONTAP

33 Now YOU Can Better Understand System Performance
Don't be afraid: performance is no longer such a mystery!
Most importantly, monitor disk utilization!
Have FUN!

34 Important Resources
The Community site
ONTAP documentation (particularly the ONTAP command reference)
LatX, your analysis tool
To finish off, use these resources. There is a lot of information available, particularly on the communities site. HAVE FUN!

35 Complementary Sessions
TR Sizing, Designing, and Presenting a NetApp Solution
TR Using NetApp Tech Tools to Create Winning Proposals and Tech Refreshes
TR A Field Guide to Sizing - Part 1
TR A Field Guide to Sizing - Part 2


